Extended - Basic EDA Python Fellow
1.1 Introduction
Business Context. The city of New York has seen a rise in the number of accidents on its roads.
They would like to know whether the number of accidents has increased in the last few
weeks. For all the reported accidents, they have collected details for each accident and have been
maintaining records for the past year and a half (from January 2018 to August 2019).
The city has contracted you to build visualizations that would help them identify patterns in
accidents, which would help them take preventive actions to reduce the number of accidents in
the future. They would like specific information on certain parameters like borough, time of day,
reason for accident, etc.
Business Problem. Your task is to format the given data and provide visualizations that would
answer the specific questions the client has, which are mentioned below.
Analytical Context. You are given a CSV file (stored in the already created data folder) containing
details about each accident like date, time, location of the accident, reason for the accident,
types of vehicles involved, injury and death count, etc. The delimiter in the given CSV file is ;
instead of the default ,. You will be performing the following tasks on the data:
1. Extract additional borough data stored in a JSON file
2. Read, transform, and prepare data for visualization
3. Construct and analyze visualizations of the data to identify patterns in the dataset
The client has a specific set of questions they would like to get answers to. You will need to provide
visualizations to accompany these:
1. How has the number of accidents fluctuated over the past year and a half? Has it
increased over that time?
2. For any particular day, during which hours are accidents most likely to occur?
3. Are there more accidents on weekdays than weekends?
4. What is the accident count-to-area ratio per borough? Which boroughs have disproportionately
large numbers of accidents for their size?
5. For each borough, during which hours are accidents most likely to occur?
6. What are the top 5 causes of accidents in the city?
7. What types of vehicles are most involved in accidents per borough?
8. What types of vehicles are most involved in deaths?
Note: To solve this extended case, please read the function docstrings very carefully. They
contain information that you will need! Also, please don’t include print() statements inside your
functions (they will most likely produce an error in the test cells). Finally, for the purposes of this
case, do not worry about standardizing text variables - for example, treat taxi and Taxi as though
they were different values.
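Before exploring the accidents data, we first load the borough details (name, population, and area in
square miles) from the JSON file in the data folder. A minimal sketch using Python's built-in json
module is shown below; the file name data/boroughs.json is an assumption, so adjust it to match the
file in your data folder. It produces the dictionary shown next:

import json

with open('data/boroughs.json') as f:    # hypothetical file name
    borough_data = json.load(f)          # dictionary with population and area per borough
borough_data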
[19]: {'the bronx': {'name': 'the bronx', 'population': 1471160.0, 'area': 42.1},
'brooklyn': {'name': 'brooklyn', 'population': 2648771.0, 'area': 70.82},
'manhattan': {'name': 'manhattan', 'population': 1664727.0, 'area': 22.83},
'queens': {'name': 'queens', 'population': 2358582.0, 'area': 108.53},
'staten island': {'name': 'staten island', 'population': 479458.0, 'area': 58.37}}
Similarly, let’s use the pandas function read_csv() to load the file accidents.csv as a DataFrame.
We will name this DataFrame df.
[20]: with open('data/accidents.csv') as f:
          df = pd.read_csv(f, delimiter=';')
         ON STREET NAME  NUMBER OF PEDESTRIANS INJURED
0                   NaN                              0
1      FLATLANDS AVENUE                              1
2                   NaN                              0
3           MAIN STREET                              0
4                   NaN                              0

[5 rows x 24 columns]
[22]: df.columns
- B. The analysis of the provided data will be useful for identifying the high-risk locations, so
that road safety audits can be implemented at those locations.
- F. To launch the Speed Reducer Program, an analysis of the data is a must, so that the program
can be based on facts about the data rather than just personal opinions.
1.4.1 Exercise 2
2.1 (2 points) Group the available accident data by month.
Hint: You may find the pandas functions pd.to_datetime() and dt.to_period() useful.
[23]: def ex_2(df):
          """
          Group accidents by month
          Arguments:
          `df`: A pandas DataFrame
          Outputs:
          `monthly_accidents`: The grouped Series
          """
          # YOUR CODE HERE
          monthly_accidents = df.copy()  # make a copy of the data so we don't modify it
          monthly_accidents.DATE = pd.to_datetime(monthly_accidents.DATE)  # change the variable to datetime type
          monthly_accidents.DATE = monthly_accidents.DATE.dt.to_period("M")  # keep only the month of the DATE variable
          monthly_accidents = monthly_accidents.groupby(['DATE'])['COLLISION_ID'].size()  # group by DATE and count the size of each group
          return monthly_accidents
2.2
lineplot = ex_2(df)  # accidents per month (Series indexed by month)
lineplot.plot(figsize=(10, 4))  # line plot of the monthly accident counts
[24]: <AxesSubplot:xlabel='DATE'>
2.2.2 (1 point) Has the number of accidents increased over the past year and a half? Justify
your answer with an interpretation of a plot.
No. By and large, the number of accidents over the past year and a half has actually decreased.
The pattern is not very clear or stable, but there is a steady decrease in the number of accidents
over the past year and a half, which is slightly less time than what is shown in the plot (20 months).
1.4.3 Exercise 4
4.1 (2 points) Create a new column HOUR based on the data from the TIME column.
Hint: You may find the dt.hour accessor useful.
[25]: def ex_4(df):
          """
          Group accidents by hour of day
          Arguments:
          `df`: A pandas DataFrame
          Outputs:
          `hourly_accidents`: The grouped Series
          """
          # YOUR CODE HERE
          hourly_accidents = df.copy()  # make a copy of the data so we don't modify it
          hourly_accidents['HOUR'] = pd.to_datetime(hourly_accidents['TIME'])  # create the HOUR variable from TIME
          hourly_accidents['HOUR'] = hourly_accidents['HOUR'].dt.hour  # extract only the hour of the HOUR variable
          hourly_accidents = hourly_accidents.groupby(['HOUR'])['COLLISION_ID'].size()  # count accidents per hour
          return hourly_accidents
4.2
4.2.1 (1 point) Plot a bar graph of the distribution per hour throughout the day.
4.2.2 (1 point) How does the number of accidents vary throughout a single day?
The number of accidents varies considerably throughout the day. The maximum number of accidents
occurs in the afternoon, around 2 pm to 5 pm, and the lowest number of accidents occurs at around
2 am to 5 am, which is what one would expect given that there are fewer vehicles on the streets at
that time. Overall, an almost regular pattern can be seen, similar to a sin(x) function, with a peak
followed by a valley in a steady cycle.
1.4.5 Exercise 6
6.1 (2 points) Calculate the number of accidents by day of the week.
Hint: You may find the dt.weekday accessor useful.
[27]: def ex_6(df):
          """
          Group accidents by day of the week
          Arguments:
          `df`: A pandas DataFrame
          Outputs:
          `weekday_accidents`: The grouped Series
          """
          # YOUR CODE HERE
          weekday_accidents = df.copy()  # make a copy of the data so we don't modify it
          weekday_accidents['DATE'] = pd.to_datetime(weekday_accidents['DATE'])  # change DATE to a datetime object
          weekday_accidents['DAY_NAME'] = weekday_accidents['DATE'].dt.day_name()  # extract the weekday name
          weekday_accidents = weekday_accidents.groupby(['DAY_NAME'])['COLLISION_ID'].count()  # group by day of the week and count
          return weekday_accidents
6.2
6.2.1 (1 point) Plot a bar graph based on the accidents count by day of the week.
weekday_accidents = ex_6(df).to_frame()  # accident counts per day of the week
weekday_accidents = weekday_accidents.reset_index(drop=False)  # turn DAY_NAME back into a column
weekday_accidents = weekday_accidents.rename(columns={'COLLISION_ID': 'accidents'})  # rename the count column
# plot
plt.figure(figsize=(10, 4))
sns.barplot(y='accidents', x='DAY_NAME', data=weekday_accidents)
plt.show()
6.2.2 (1 point) How does the number of accidents vary throughout a single week?
The plot shows that there are fewer accidents on weekends than on weekdays. For example,
Sundays have the lowest number of accidents. This is to be expected, given that Sunday is usually
the day with the fewest cars on the street, and people generally don't work or go out that day as
much as they do during the weekdays.
On the other hand, the number of accidents during the week is similar from day to day, although
Friday has the most accidents of all.
1.4.6 Exercise 7
7.1 (2 points) Calculate the total number of accidents for each borough.
def ex_7_1(df):
    """
    Calculate the total number of accidents for each borough
    Arguments:
    `df`: A pandas DataFrame
    Outputs:
    `boroughs`: The grouped Series
    """
    # YOUR CODE HERE
    boroughs = df.groupby(['BOROUGH'])['COLLISION_ID'].size()  # count accidents per borough
    return boroughs
7.2
boroughs_accidents = ex_7_1(df).to_frame()  # accident counts per borough
boroughs_accidents = boroughs_accidents.reset_index()  # turn BOROUGH back into a column
boroughs_accidents = boroughs_accidents.rename(columns={'COLLISION_ID': 'accidents'})  # rename the count column
# plot
plt.figure(figsize=(10, 4))
sns.barplot(y='accidents', x='BOROUGH', data=boroughs_accidents)
plt.show()
7.3 (hard | 3 points) Calculate the number of accidents per square mile for each borough.
Hint: You will have to update the keys in the borough dictionary to match the names in the
DataFrame.
[31]: def ex_7_3(df, borough_data):
          """
          Calculate accidents per sq mile for each borough
          Arguments:
          `df`: A pandas DataFrame
          `borough_data`: A python dictionary with population and area data for each borough
          Outputs:
          `borough_frame`: A DataFrame with the accident count of each borough and an
          additional `accidents_per_sq_mi` column
          """
          # change the names in the JSON dictionary to match the BOROUGH values in the DataFrame
          borough_data['the bronx']['name'] = 'BRONX'
          borough_data['brooklyn']['name'] = 'BROOKLYN'
          borough_data['manhattan']['name'] = 'MANHATTAN'
          borough_data['queens']['name'] = 'QUEENS'
          borough_data['staten island']['name'] = 'STATEN ISLAND'
          boroughs = ex_7_1(df)  # accident counts per borough (indexed by BOROUGH)
          borough_frame = pd.DataFrame(boroughs)
          # map each borough to its area in square miles
          area_by_name = {value['name']: value['area'] for value in borough_data.values()}
          borough_frame['area'] = borough_frame.index.map(area_by_name)
          borough_frame['accidents_per_sq_mi'] = borough_frame['COLLISION_ID'] / borough_frame['area']  # create the accidents_per_sq_mi variable
          return borough_frame
7.4
7.4.1 (1 point) Plot a bar graph of the accidents per square mile per borough with the data you
just calculated.
[34]: # YOUR CODE HERE
      acc_per_area = ex_7_3(df, borough_data).reset_index()  # turn BOROUGH back into a column
      plt.figure(figsize=(10, 4))
      sns.barplot(y='accidents_per_sq_mi', x='BOROUGH', data=acc_per_area)
      plt.show()
1.4.7 Exercise 8
8.1 (2 points) Create a Series of the number of accidents per hour and borough.
def ex_8_1(df):
    """
    Create a Series of the number of accidents per hour and borough
    Arguments:
    `df`: A pandas DataFrame
    Outputs:
    `bor_hour`: A Series. This should be the result of doing groupby by borough
    and hour.
    """
    # YOUR CODE HERE
    bor_hour = df.copy()  # make a copy of the data so we don't modify it
    bor_hour['HOUR'] = pd.to_datetime(bor_hour['TIME'])  # create the HOUR variable from TIME
    bor_hour['HOUR'] = bor_hour['HOUR'].dt.hour  # extract the hour from the HOUR variable
    bor_hour = bor_hour.groupby(['BOROUGH', 'HOUR'])['COLLISION_ID'].count()  # group by BOROUGH and HOUR
    return bor_hour
8.2
8.2.1 (2 points) Plot a bar graph for each borough showing the number of accidents for each
hour of the day.
Hint: You can use sns.FacetGrid to create a grid of plots with the hourly data of each borough.
[45]: # YOUR CODE HERE
      # plot
      borough_hourly = ex_8_1(df).to_frame().reset_index()
      g = sns.FacetGrid(borough_hourly, col='BOROUGH')
      g = g.map(sns.barplot, 'HOUR', 'COLLISION_ID', order=sorted(borough_hourly['HOUR'].unique()))
8.2.2 (1 point) Which hours have the most accidents for each borough?
By and large, the highest number of accidents occurs between 2 pm and 5 pm, especially around 4 pm.
The lowest numbers of accidents occur very early in the morning, between 1 am and 5 am, and this
pattern is consistent across the boroughs.
Another very interesting pattern, visible in almost every borough, is an increase in accidents around
8 am followed by a slight decrease over the next few hours. This is likely because that is the time
when more people are on their way to work, and the more people on the streets, the more likely an
accident becomes.
def ex_9(contrib_df):
    """
    Calculate the 6 contributing factors with the most accidents
    Arguments:
    `contrib_df`: A pandas DataFrame.
    Outputs:
    `factors_most_acc`: A pandas DataFrame. It has only 6 elements, which are,
    sorted in descending order, the contributing factors with the most accidents.
    """
    # contributing-factor column names are assumed to follow the NYC Open Data crash schema
    factor_cols = ['CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2',
                   'CONTRIBUTING FACTOR VEHICLE 3', 'CONTRIBUTING FACTOR VEHICLE 4',
                   'CONTRIBUTING FACTOR VEHICLE 5']
    factors_most_acc = pd.melt(contrib_df, id_vars='COLLISION_ID', value_vars=factor_cols,
                               var_name='vehicle', value_name='factor')  # one row per accident-factor pair
    factors_most_acc = factors_most_acc.dropna()  # remove NAs
    factors_most_acc = factors_most_acc.drop(columns="vehicle")  # remove vehicle column
    factors_most_acc = factors_most_acc.drop_duplicates()  # get rid of duplicate accident-factor pairs
    factors_most_acc = factors_most_acc.groupby('factor').count()  # count accidents per factor
    factors_most_acc = factors_most_acc.sort_values('COLLISION_ID', ascending=False).head(6)  # keep the top 6
    return factors_most_acc
def ex_10(df):
    """
    Calculate the 10 borough-vehicle pairs with the most accidents
    Arguments:
    `df`: A pandas DataFrame.
    Outputs:
    `vehi_most_acc`: A pandas DataFrame. It has only 10 elements, which are,
    sorted in descending order, the borough-vehicle pairs with the most accidents.
    """
    vehi_cols = ['VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3',
                 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5']
    vehi_most_acc = pd.melt(df, id_vars=['COLLISION_ID', 'BOROUGH'], value_vars=vehi_cols,
                            var_name='vehicle', value_name='vehicle_type')  # one row per accident-vehicle pair
    vehi_most_acc = vehi_most_acc.dropna()  # remove NAs
    vehi_most_acc = vehi_most_acc.drop(columns="vehicle")  # drop vehicle column
    vehi_most_acc = vehi_most_acc.drop_duplicates()  # get rid of duplicated values
    vehi_most_acc = vehi_most_acc.groupby(['BOROUGH', 'vehicle_type']).count()  # group by borough and vehicle type
    vehi_most_acc = vehi_most_acc.rename(columns={'COLLISION_ID': 'index'})  # change variable name
    vehi_most_acc = vehi_most_acc.sort_values('index', ascending=False)  # sort in descending order
    vehi_most_acc = vehi_most_acc.head(10)
    return vehi_most_acc
1.4.11 Exercise 12
12.1 (hard | 3 points) Calculate the number of deaths caused by each type of vehicle.
Hint 1: As an example of how to compute vehicle involvement in deaths, suppose two people
died in an accident where 5 vehicles were involved, 4 of which are PASSENGER VEHICLE and 1 is a
SPORT UTILITY/STATION WAGON. Then we would add two deaths to both the PASSENGER
VEHICLE and SPORT UTILITY/STATION WAGON types.
Hint 2: You will need to use pd.melt() and proceed as in the previous exercises to avoid double-
counting the types of vehicles (i.e. you should remove duplicate “accident ID - vehicle type” pairs).
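To make Hint 1 concrete, here is a minimal sketch on a tiny, made-up DataFrame. The vehicle type
columns mirror the ones used above, the death-count column name NUMBER OF PERSONS KILLED is an
assumption, and the values are invented. After melting the vehicle type columns and dropping
duplicate accident-vehicle type pairs, each distinct vehicle type in an accident receives that
accident's full death count exactly once:

import pandas as pd

# Hypothetical single accident: 2 deaths, with 4 PASSENGER VEHICLE and 1 SPORT UTILITY/STATION WAGON involved
toy = pd.DataFrame({
    'COLLISION_ID': [1],
    'NUMBER OF PERSONS KILLED': [2],   # column name assumed for illustration
    'VEHICLE TYPE CODE 1': ['PASSENGER VEHICLE'],
    'VEHICLE TYPE CODE 2': ['PASSENGER VEHICLE'],
    'VEHICLE TYPE CODE 3': ['PASSENGER VEHICLE'],
    'VEHICLE TYPE CODE 4': ['PASSENGER VEHICLE'],
    'VEHICLE TYPE CODE 5': ['SPORT UTILITY/STATION WAGON'],
})

vehi_cols = [c for c in toy.columns if c.startswith('VEHICLE TYPE CODE')]
melted = pd.melt(toy, id_vars=['COLLISION_ID', 'NUMBER OF PERSONS KILLED'],
                 value_vars=vehi_cols, value_name='vehicle_type').dropna()
deduped = melted.drop_duplicates(subset=['COLLISION_ID', 'vehicle_type'])  # keep each accident-vehicle type pair once
print(deduped.groupby('vehicle_type')['NUMBER OF PERSONS KILLED'].sum())
# PASSENGER VEHICLE              2
# SPORT UTILITY/STATION WAGON    2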
[48]: def ex_12_1(df):
          """
          Calculate total killed per vehicle type and plot the result
          as a bar graph
          Arguments:
          `df`: A pandas DataFrame.
          Outputs:
          `result`: A pandas DataFrame. Its index should be the vehicle type. Its only
          column should be `TOTAL KILLED`
          """
          # the death-count column name NUMBER OF PERSONS KILLED is an assumption
          vehi_cols = ['VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3',
                       'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5']
          result = pd.melt(df, id_vars=['COLLISION_ID', 'NUMBER OF PERSONS KILLED'],
                           value_vars=vehi_cols, var_name='vehicle', value_name='vehicle_type')  # one row per accident-vehicle pair
          result = result.dropna()  # remove NAs
          result = result.drop(columns='vehicle')  # drop vehicle column
          result = result.drop_duplicates()  # remove duplicated accident-vehicle type pairs
          result = result.groupby('vehicle_type')[['NUMBER OF PERSONS KILLED']].sum()  # add up deaths per vehicle type
          result = result.rename(columns={'NUMBER OF PERSONS KILLED': 'TOTAL KILLED'})
          return result
12.2
[55]: top_5_veh
[55]:                    Contributing Factor  TOTAL KILLED
      0  Station Wagon/Sport Utility Vehicle           100
      1                                Sedan            79
      2                    PASSENGER VEHICLE            33
      3        SPORT UTILITY / STATION WAGON            26
      4                           Motorcycle            22
12.2.2 (2 points) Which vehicles are most often involved in deaths, and by how much more than
the others?
The Station Wagon/Sport Utility Vehicle type was most often involved in deaths, with about
100 people killed in total in the period analyzed. These vehicles were involved in around 20 more deaths
than the Sedan ones, around 60 more than the PASSENGER VEHICLE, 65 more than SPORT
UTILITY / STATION WAGON, and around 80 more than the Motorcycle vehicles.
However, those are approximations read off the bar plot. Should one need the exact values, a
simple look at the DataFrame the plot was built from gives the exact numbers.
assert ex_2(df).loc["2018-10"] == 13336, "Ex. 2.1 - Wrong output! Try using the .size() aggregation function with your .groupby()."
[253]: # Ex 4.1
assert type(ex_4(df)) == type(pd.Series([9,1,2])), "Ex. 4.1 - Your output isn't a pandas Series."
assert ex_4(df).loc[13] == 14224, "Ex. 4.1 - Wrong output! Try using the .size() aggregation function with your .groupby()."
assert max(ex_6(df)) == 37886, "Ex. 6.1 - Your results don't match ours! Remember that you can use the .size() aggregation function."
assert max(ex_7_1(df)) == 76253, "Ex. 7.1 - Your results don't match ours! Remember that you can use the .size() aggregation function."
assert round(min(e73["accidents_per_sq_mi"])) == 149, "Ex. 7.3 - Your output doesn't match ours! Remember that you need to divide the number of accidents in each of the five boroughs by the respective areas in square miles."
assert ex_8_1(df).max() == 5701, "Ex. 8.1 - Your numbers don't match ours. If you haven't already, you can try using .size() as your aggregation function."
[219]: # Ex. 9
assert type(ex_9(df)) == type(pd.Series([9,1,2]).to_frame()), "Ex. 9 - Your output isn't a pandas DataFrame."
assert len(ex_9(df)) == 6, "Ex. 9 - Your output doesn't have six elements. Did you forget to use .head(6)?"
[234]: # Ex. 10
assert type(ex_10(df)) == type(pd.Series([9,1,2]).to_frame()), "Ex. 10 - Your output isn't a pandas DataFrame."
assert type(e12) == type(pd.Series([9,1,2]).to_frame()), "Ex. 12.1 - Your output isn't a pandas DataFrame."
assert int(e12.loc["Bike"]) == 19, "Ex. 12.1 - Your output doesn't match ours! Remember that you need to remove the duplicate pairs and use .sum()."
1.6 Attribution
“Vehicle Collisions in NYC 2015-Present”, New York Police Department, NYC Open Data terms
of use, https://www.kaggle.com/nypd/vehicle-collisions
“Boroughs of New York City”, Creative Commons Attribution-ShareAlike License,
https://en.wikipedia.org/wiki/Boroughs_of_New_York_City