Microsoft Word - homework4.docx
Data Focused Python XXXXXXXXXXHomework 4
Due: 11:59 PM Sunday, November 27, 2022 This is an individual assignment.
Problem 1
a. Read the file 'expenses.txt' into a DataFrame named expenses – the data is separated by a colon ( : ). The first
line contains column headers. Display expenses
Read the file 'people.txt’, also colon-separated with column headers in the first line, into DataFrame people.
Display people.
Read the file ‘departments.txt’ into DataFrame departments – same formatting. Display departments.
. Clean people by removing any row with null data – display the rows removed (hint: use a variant of isna() ).
Then remove anyone whose department is not in departments (recall the isin( ) method from Lab 7) – again
display the rows removed. Display the cleaned version of people.
c. Clean expenses by removing any row with null data – display rows removed. Then remove rows whose id is
not in people – display rows removed. Then remove rows with a malformed date field (only check for the
co
ect number of digits, 8 – use a lambda and apply() ) – display rows removed. Display the cleaned version of
expenses.
d. Merge expenses and people on the ‘ID’ column using an inner join – name the result alldata. Sort alldata on
‘ID’. Create sums by grouping the rows (use groupby( ) ) on ‘ID’ and then use agg(‘sum’) to get the total of
each person’s expenses. Display sums.
Problem 2
Use the expenses and people DataFrames from Problem 1 to display the following graphs. Make sure you label
the graphs and axes co
ectly. Use the pandas functions groupby( ) with the column you need, together with
either count( ) or sum( ) – as in, groupby(
).sum(). Also, use the groupby parameter as_index=False to
store the grouping column as an actual column instead of as the index. The following graph shows the idea,
although the values may not be co
ect (it's sample data, and I was lazy about labels):
a. Bar chart containing the departments and the number of people employed in each.
. Bar chart containing the id's and the total expense amount for each id.
c. Bar chart containing the expense categories and the total expense amount for each category.
d. Pie chart containing the same information as #c.
Read the file “countryData.csv” into a DataFrame named countryData; this is a version of the GDP – Olympic
medals data from Lab 7. Use countryData to create regression plots for the following using Seaborn’s regplot.
Compute the co
elation coefficient and use it as an annotation on the graph.
e. Population against GDP
f. GDP against Weighted
g. Formula against Total
h. Formula against Weighted
JTW
Problem 3
Write a script by copying the relevant parts of the Lab 8 code and modify it to do these things:
a. Create a list with these topic strings: Python; Data Science; Data Analysis; Machine Learning; and Deep
Learning. Use these topics, one at a time, to query the Google Books API.
For each returned JSON string:
. Convert the JSON string to a dict using loads( ) (as in the lab), then use this to convert it to a DataFrame:
pd.io.json.json_normalize ( thedict['items'] )
c. Extract just the 'volumeInfo.title' and 'volumeInfo.authors' columns.
d. Relabel those two columns as 'Title' and 'Authors'.
After creating the five DataFrame objects, use concat( ) to create one table called bigTable (use
ignore_index=True). The function takes a list of the DataFrames to concatenate (i.e., in [ ]'s).
e. Display bigTable.
f. Re-display bigTable in the following way, using regular Python to display data extracted from bigTable.
Create the table headers (left justified), then use a for loop over bigTable.index, which will count on the index
number starting at 0. Display at most 25 characters of the title (just use [:25], even if the title has fewer
characters and only the first author. It should look something like this (your data may vary):
ID:Amount:Category:Date:Description
5:5.25:supply: XXXXXXXXXX:box of staples
7:79.81:meal: XXXXXXXXXX:lunch with ABC Corp. clients Al, Bob, and Cy
4:43.00:travel: XXXXXXXXXX:cab back to office
4:383.75:travel: XXXXXXXXXX:flight to Boston, to visit ABC Corp.
22:55.00:travel: XXXXXXXXXX:cab to ABC Corp. in Cam
idge, MA
17:23.25:meal: XXXXXXXXXX:dinner at Logan Airport
5:318.47:supply: XXXXXXXXXX:paper, toner, pens, paperclips, tape
22:142.12:meal: XXXXXXXXXX:host dinner with ABC clients, Al, Bob, Cy, Dave, Ellie
20:20.20::
20:20.20:::
49:303.94:util: XXXXXXXXXX:Peoples Gas
49:121.07:util: XXXXXXXXXX:Verizon Wireless
8:7.59:supply: XXXXXXXXXX:Python book (used)
8:79.99:supply: XXXXXXXXXX:spare 20" monito
13:49.86:supply: XXXXXXXXXX:Stoch Cal for Finance II
7:6.53:meal: XXXXXXXXXX:Dunkin Donuts, drive to Big Inc. near DC
7:127.23:meal: XXXXXXXXXX:dinner, Tavern64
22:33.07:meal: XXXXXXXXXX:dinner, Uncle Julio's
7:86.00:travel: XXXXXXXXXX:mileage, drive to/from Big Inc., Reston, VA
7::travel: XXXXXXXXXX:mileage, drive to/from Big Inc., Reston, VA
50:22.00:travel: XXXXXXXXXX:tolls
7:378.81:travel: XXXXXXXXXX:Hyatt Hotel, Reston VA, for Big Inc. meeting
8:1247.49:supply: XXXXXXXXXX:Dell 7000 laptop/workstation
40:6.99:supply: XXXXXXXXXX:HDMI cable
49:212.06:util: XXXXXXXXXX:Duquesne Light
8:23.86:supply: XXXXXXXXXX:Practical Guide to Quant Finance Interviews
5:195.89:supply: XXXXXXXXXX:black toner, HP 304A, 2-pack
5:195.89:supply: XXXXXXXXXX:
22:86.00:travel: XXXXXXXXXX:mileage, drive to/from Big Inc., Reston, VA
18:32.27:meal: XXXXXXXXXX:lunch at Clyde's with Fred and Gina, Big Inc.
7:22.00:travel: XXXXXXXXXX:tolls
5:119.56:util: XXXXXXXXXX:Verizon Wireless
5:284.23:util: XXXXXXXXXX:Peoples Gas
5:8.98:supply: XXXXXXXXXX:Flair pens
5:8.98:supply:202325:Flair pens
5:22.95:supply: XXXXXXXXXX:Bic pens
4:149.95:travel: XXXXXXXXXX:Car rental
28:2245.25:supply:2022512:party
28::supply:2022512:party
18:77.75:meal: XXXXXXXXXX:Lunch with investors
7:950.15:travel: XXXXXXXXXX:flight to Chicago
5:22.95:supply: XXXXXXXXXX:Bic pens
4:149.95:travel: XXXXXXXXXX:Car rental
18:77.75:meal: XXXXXXXXXX:Lunch with investors
7:950.15:travel: XXXXXXXXXX:flight to Chicago
7:950.15:travel:220418:flight to Chicago
5:22.95:supply: XXXXXXXXXX:Bic pens
4:5.19:meal: XXXXXXXXXX:McDonalds
4:149.95:travel: XXXXXXXXXX:Car rental
18:77.75:meal: XXXXXXXXXX:Lunch with investors
18:77.75:: XXXXXXXXXX:Lunch with investors
7:950.15:travel: XXXXXXXXXX:flight to Chicago
Department:Location
Sales:New York
Office:New York
Research:San Francisco
Manufacturing:Pittsburgh
Marketing:Chicago