
extended.basic_eda_python_fellow

September 15, 2022

1 How can we control the increasing number of accidents in New York?
Total points: 46
[18]: import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import base64

1.1 Introduction
Business Context. The city of New York has seen a rise in the number of accidents on the roads
in the city. They would like to know if the number of accidents has increased in the last few
weeks. For all the reported accidents, they have collected details for each accident and have been
maintaining records for the past year and a half (from January 2018 to August 2019).
The city has contracted you to build visualizations that would help them identify patterns in
accidents, which would help them take preventive actions to reduce the number of accidents in
the future. They would like specific information on certain parameters like borough, time of day,
reason for accident, etc.
Business Problem. Your task is to format the given data and provide visualizations that would
answer the specific questions the client has, which are mentioned below.
Analytical Context. You are given a CSV file (stored in the already created data folder) containing details about each accident, like date, time, location of the accident, reason for the accident,
types of vehicles involved, injury and death count, etc. The delimiter in the given CSV file is ;
instead of the default ,. You will be performing the following tasks on the data:
1. Extract additional borough data stored in a JSON file
2. Read, transform, and prepare data for visualization
3. Construct and analyze visualizations of the data to identify patterns in the dataset
The client has a specific set of questions they would like to get answers to. You will need to provide
visualizations to accompany these:
1. How has the number of accidents fluctuated over the past year and a half? Has it increased over that time?

2. For any particular day, during which hours are accidents most likely to occur?
3. Are there more accidents on weekdays than weekends?
4. What is the accident count-to-area ratio per borough? Which boroughs have disproportionately large numbers of accidents for their size?
5. For each borough, during which hours are accidents most likely to occur?
6. What are the top 5 causes of accidents in the city?
7. What types of vehicles are most involved in accidents per borough?
8. What types of vehicles are most involved in deaths?
Note: To solve this extended case, please read the function docstrings very carefully. They
contain information that you will need! Also, please don’t include print() statements inside your
functions (they will most likely produce an error in the test cells). Finally, for the purposes of this
case, do not worry about standardizing text variables - for example, treat taxi and Taxi as though
they were different values.

1.2 Fetching the relevant data


The client has requested analysis of the accidents-to-area ratio for boroughs. Borough data is stored
in a JSON file in the data folder (this file was created using data from Wikipedia).
Let’s use the function json.load() to load the file borough_data.json as a dictionary:

[19]: with open('data/borough_data.json') as f:
          borough_data = json.load(f)
      borough_data

[19]: {'the bronx': {'name': 'the bronx', 'population': 1471160.0, 'area': 42.1},
'brooklyn': {'name': 'brooklyn', 'population': 2648771.0, 'area': 70.82},
'manhattan': {'name': 'manhattan', 'population': 1664727.0, 'area': 22.83},
'queens': {'name': 'queens', 'population': 2358582.0, 'area': 108.53},
'staten island': {'name': 'staten island',
'population': 479458.0,
'area': 58.37}}

Similarly, let’s use the pandas function read_csv() to load the file accidents.csv as a DataFrame.
We will name this DataFrame df.
[20]: with open('data/accidents.csv') as f:
          df = pd.read_csv(f, delimiter=';')

[21]: df.head()  # take a look at our dataset

[21]:          DATE   TIME   BOROUGH  ZIP CODE   LATITUDE  LONGITUDE  \
      0  09/26/2018  12:12     BRONX   10454.0  40.808987 -73.911316
      1  09/25/2018  16:30  BROOKLYN   11236.0  40.636005 -73.912510
      2  08/22/2019  19:30    QUEENS   11101.0  40.755490 -73.939530
      3  09/23/2018  13:10    QUEENS   11367.0        NaN        NaN
      4  08/20/2019  22:40     BRONX   10468.0  40.868336 -73.901270

           ON STREET NAME  NUMBER OF PEDESTRIANS INJURED  \
      0               NaN                              0
      1  FLATLANDS AVENUE                              1
      2               NaN                              0
      3       MAIN STREET                              0
      4               NaN                              0

         NUMBER OF PEDESTRIANS KILLED  NUMBER OF CYCLIST INJURED  …  \
      0                             0                          0  …
      1                             0                          0  …
      2                             0                          0  …
      3                             0                          1  …
      4                             0                          0  …

        CONTRIBUTING FACTOR VEHICLE 2 CONTRIBUTING FACTOR VEHICLE 3  \
      0                           NaN                           NaN
      1                           NaN                           NaN
      2                           NaN                           NaN
      3                   Unspecified                           NaN
      4                   Unspecified                           NaN

        CONTRIBUTING FACTOR VEHICLE 4 CONTRIBUTING FACTOR VEHICLE 5  COLLISION_ID  \
      0                           NaN                           NaN       3988123
      1                           NaN                           NaN       3987962
      2                           NaN                           NaN       4193132
      3                           NaN                           NaN       3985962
      4                           NaN                           NaN       4192111

        VEHICLE TYPE CODE 1                  VEHICLE TYPE CODE 2  \
      0               Sedan                                  NaN
      1               Sedan                                  NaN
      2               Sedan                                  NaN
      3                Bike  Station Wagon/Sport Utility Vehicle
      4               Sedan                                Sedan

        VEHICLE TYPE CODE 3 VEHICLE TYPE CODE 4 VEHICLE TYPE CODE 5
      0                 NaN                 NaN                 NaN
      1                 NaN                 NaN                 NaN
      2                 NaN                 NaN                 NaN
      3                 NaN                 NaN                 NaN
      4                 NaN                 NaN                 NaN

      [5 rows x 24 columns]

1.3 Overview of the data


Let’s go through the columns present in the DataFrame:

[22]: df.columns

[22]: Index(['DATE', 'TIME', 'BOROUGH', 'ZIP CODE', 'LATITUDE', 'LONGITUDE',
       'ON STREET NAME', 'NUMBER OF PEDESTRIANS INJURED',
'NUMBER OF PEDESTRIANS KILLED', 'NUMBER OF CYCLIST INJURED',
'NUMBER OF CYCLIST KILLED', 'NUMBER OF MOTORIST INJURED',
'NUMBER OF MOTORIST KILLED', 'CONTRIBUTING FACTOR VEHICLE 1',
'CONTRIBUTING FACTOR VEHICLE 2', 'CONTRIBUTING FACTOR VEHICLE 3',
'CONTRIBUTING FACTOR VEHICLE 4', 'CONTRIBUTING FACTOR VEHICLE 5',
'COLLISION_ID', 'VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2',
'VEHICLE TYPE CODE 3', 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5'],
dtype='object')

We have the following columns:

1. BOROUGH: The borough in which the accident occurred
2. COLLISION_ID: A unique identifier for this collision
3. CONTRIBUTING FACTOR VEHICLE (1, 2, 3, 4, 5): Reasons for the accident
4. DATE: Date of the accident
5. TIME: Time of the accident
6. LATITUDE: Latitude of the accident
7. LONGITUDE: Longitude of the accident
8. NUMBER OF (CYCLIST, MOTORIST, PEDESTRIANS) INJURED: Injuries by category
9. NUMBER OF (CYCLIST, MOTORIST, PEDESTRIANS) KILLED: Deaths by category
10. ON STREET NAME: Street where the accident occurred
11. VEHICLE TYPE CODE (1, 2, 3, 4, 5): Types of vehicles involved in the accident
12. ZIP CODE: Zip code of the accident location

1.3.1 Exercise 1 (2 points)


Since 2014, New York City has been implementing a road safety plan named Vision Zero. It aims to reduce the number of traffic deaths to zero by the end of 2024. The plan involves creating new safety measures and enhancing current ones, including:

A. Automated pedestrian detection
B. Road safety audits at high-risk locations
C. Expansion of the cycle track network
D. Targeted education and awareness initiatives
E. Creation of pedestrian refuge islands
F. Launch of an Integrated Data-Driven Speed Reducer Program (speed humps & speed cushions)
Which of these initiatives could directly benefit from an analysis of the data provided? Select all
that apply.
Note: In this notebook, whenever you are asked to write text, use the cell below the question cell
to write your answer there. If you write in the same cell as the question, your answer will not be
recorded.
Answer: B and F.

- B. The analysis of the data provided will be useful for identifying the high-risk locations, so that road safety audits can be implemented where they matter most.
- F. To launch the Speed Reducer Program, analysis of the data is a must, so that the program can be based on facts drawn from the data rather than on personal opinions.

1.4 Answering the client’s questions


Let’s go ahead and answer each of the client’s questions.

1.4.1 Exercise 2
2.1 (2 points) Group the available accident data by month.
Hint: You may find the pandas functions pd.to_datetime() and dt.to_period() useful.
[23]: def ex_2(df):
          """
          Group accidents by month

          Arguments:
          `df`: A pandas DataFrame

          Outputs:
          `monthly_accidents`: The grouped Series
          """
          # YOUR CODE HERE
          monthly_accidents = df.copy()                                      # work on a copy so the original is not modified
          monthly_accidents.DATE = pd.to_datetime(monthly_accidents.DATE)    # convert DATE to datetime
          monthly_accidents.DATE = monthly_accidents.DATE.dt.to_period("M")  # keep only the month
          monthly_accidents = monthly_accidents.groupby(['DATE'])['COLLISION_ID'].size()  # accidents per month

          return monthly_accidents

2.2

2.2.1 (1 point) Generate a line plot of accidents over time.

[24]: # YOUR CODE HERE
      lineplot = df.copy()                            # work on a copy so the original is not modified
      lineplot.DATE = pd.to_datetime(lineplot.DATE)   # convert the DATE variable to datetime
      lineplot = lineplot.drop_duplicates()           # remove duplicated rows
      lineplot = lineplot.groupby(['DATE'])['COLLISION_ID'].count()  # accidents per day
      fig, ax = plt.subplots(figsize=(20, 4))         # create the canvas for the plot
      lineplot.plot()                                 # draw the line plot

[24]: <AxesSubplot:xlabel='DATE'>

2.2.2 (1 point) Has the number of accidents increased over the past year and a half? Justify
your answer with an interpretation of a plot.
No. By and large, the number of accidents over the past year and a half has actually decreased. The pattern is neither very clear nor stable; still, there is a steady decline over the past year and a half, which is a bit less than the 20 months shown in the plot.
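
To make the trend easier to judge, one option (a minimal sketch, not part of the graded solution; it reuses the df loaded above) is to smooth the daily counts with a 30-day rolling mean:

[ ]: # Sketch: 30-day rolling mean of daily accident counts
     daily = df.copy()
     daily['DATE'] = pd.to_datetime(daily['DATE'])
     daily = daily.groupby('DATE')['COLLISION_ID'].size()   # accidents per day
     fig, ax = plt.subplots(figsize=(20, 4))
     daily.rolling(window=30).mean().plot(ax=ax)            # smoothed series
     ax.set_ylabel('accidents per day (30-day mean)')
     plt.show()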

1.4.2 Exercise 3 (2 points)


From the plot above, which month(s) seem to have the least number of accidents? What do you
think are the reasons behind this?
January, February, and April. Although it is not very clear in the graph, we can see that the lowest numbers of accidents occurred in those months. A sorted table of the average number of accidents per month would be a very straightforward way to answer this question; a sketch follows.
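
A minimal sketch of such a table, assuming we average the monthly counts across the years present in the data (not part of the graded solution):

[ ]: # Sketch: average accidents per calendar month, quietest months first
     by_month = df.copy()
     by_month['DATE'] = pd.to_datetime(by_month['DATE'])
     counts = by_month.groupby([by_month['DATE'].dt.year,
                                by_month['DATE'].dt.month_name()])['COLLISION_ID'].size()
     counts.groupby(level=1).mean().sort_values()           # quietest months first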

1.4.3 Exercise 4
4.1 (2 points) Create a new column HOUR based on the data from the TIME column.
Hint: You may find the dt.hour accessor useful.
[25]: def ex_4(df):
          """
          Group accidents by hour of day

          Arguments:
          `df`: A pandas DataFrame

          Outputs:
          `hourly_accidents`: The grouped Series
          """
          # YOUR CODE HERE
          DF4 = df.copy()                             # work on a copy so the original is not modified
          DF4['HOUR'] = pd.to_datetime(DF4['TIME'])   # create an HOUR column from TIME
          DF4['HOUR'] = DF4['HOUR'].dt.hour           # keep only the hour
          hourly_accidents = DF4.groupby(['HOUR'])['COLLISION_ID'].count()  # accidents per hour

          return hourly_accidents

4.2

4.2.1 (1 point) Plot a bar graph of the distribution per hour throughout the day.

[26]: # YOUR CODE HERE
      hourly_accidents = df.copy()                                         # work on a copy so the original is not modified
      hourly_accidents['HOUR'] = pd.to_datetime(hourly_accidents['TIME'])  # create a new column called HOUR
      hourly_accidents['HOUR'] = hourly_accidents['HOUR'].dt.hour          # keep only the hour
      hourly_accidents = hourly_accidents.groupby(['HOUR'])['COLLISION_ID'].count().to_frame()  # group
      hourly_accidents = hourly_accidents.reset_index()                    # reset the index
      hourly_accidents = hourly_accidents.rename(columns={'COLLISION_ID': 'accidents'})  # rename the count column
      # plot
      plt.figure(figsize=(10, 4))
      sns.barplot(y='accidents', x='HOUR', data=hourly_accidents)
      plt.show()
4.2.2 (1 point) How does the number of accidents vary throughout a single day?
The number of accidents varies considerably throughout the day. The maximum number of accidents occurs in the afternoon, around 2pm-5pm, and the lowest numbers are from around 2am until 5am, which is what one would expect given that there are fewer vehicles on the streets at that time. Overall, the pattern is almost periodic, similar to a sine function: a peak followed by a valley in a steady rhythm.

1.4.4 Exercise 5 (2 points)


In the above question we have aggregated the number of accidents per hour disregarding the date
and place of occurrence. What criticism would you give to this approach?
It is very likely that the different boroughs behave differently, and the day of the week also affects the number of accidents throughout the day. The information shown in the plot above therefore cannot be used to draw strong conclusions about hourly accident rates in NYC, because it does not consider all the variables that could affect the question. A sketch of a disaggregated view follows.
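
A minimal sketch of such a disaggregated view (not part of the graded solution; the borough and weekday in the example lookup are purely illustrative):

[ ]: # Sketch: hourly accident counts broken down by borough and weekday
     detail = df.copy()
     detail['HOUR'] = pd.to_datetime(detail['TIME']).dt.hour
     detail['WEEKDAY'] = pd.to_datetime(detail['DATE']).dt.day_name()
     profile = detail.groupby(['BOROUGH', 'WEEKDAY', 'HOUR'])['COLLISION_ID'].size()
     profile.loc['MANHATTAN', 'Sunday']    # e.g. Manhattan's hourly profile on Sundays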

1.4.5 Exercise 6
6.1 (2 points) Calculate the number of accidents by day of the week.
Hint: You may find the dt.weekday accessor useful.
[27]: def ex_6(df):
          """
          Group accidents by day of the week

          Arguments:
          `df`: A pandas DataFrame

          Outputs:
          `weekday_accidents`: The grouped Series
          """
          # YOUR CODE HERE
          weekday_accidents = df.copy()                                             # work on a copy so the original is not modified
          weekday_accidents['DATE'] = pd.to_datetime(weekday_accidents['DATE'])     # convert DATE to datetime
          weekday_accidents['DAY_NAME'] = weekday_accidents['DATE'].dt.day_name()   # extract the weekday name
          weekday_accidents = weekday_accidents.groupby(['DAY_NAME'])['COLLISION_ID'].count()  # count per weekday

          return weekday_accidents

6.2

6.2.1 (1 point) Plot a bar graph based on the accidents count by day of the week.

[28]: # YOUR CODE HERE
      weekday_accidents = df.copy()                                            # work on a copy so the original is not modified
      weekday_accidents['DATE'] = pd.to_datetime(weekday_accidents['DATE'])    # convert DATE to datetime
      weekday_accidents['DAY_NAME'] = weekday_accidents['DATE'].dt.day_name()  # extract the weekday name
      weekday_accidents = weekday_accidents.groupby(['DAY_NAME'])['COLLISION_ID'].count().to_frame()  # group
      weekday_accidents = weekday_accidents.reset_index(drop=False)            # reset the index
      weekday_accidents = weekday_accidents.rename(columns={'COLLISION_ID': 'accidents'})  # rename the count column
      # plot
      plt.figure(figsize=(10, 4))
      sns.barplot(y='accidents', x='DAY_NAME', data=weekday_accidents)
      plt.show()

6.2.2 (1 point) How does the number of accidents vary throughout a single week?
It can be seen in the plot that there are fewer accidents on weekends than during the week. Sundays have the lowest number of accidents, which is to be expected, since that is usually the day with the fewest cars on the street, and people typically don't work or go out as much as they do on weekdays.
On the other hand, the number of accidents is similar across the weekdays, with Friday having the most accidents of any day. A quick numeric check is sketched below.
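
A minimal sketch of that check (not part of the graded solution), comparing the mean number of accidents per day between weekend days and weekdays:

[ ]: # Sketch: mean accidents per day, weekdays vs. weekends
     wk = df.copy()
     wk['DATE'] = pd.to_datetime(wk['DATE'])
     wk['IS_WEEKEND'] = wk['DATE'].dt.weekday >= 5      # Saturday=5, Sunday=6
     daily_counts = wk.groupby(['DATE', 'IS_WEEKEND'])['COLLISION_ID'].size()
     daily_counts.groupby('IS_WEEKEND').mean()          # False = weekday, True = weekend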

1.4.6 Exercise 7
7.1 (2 points) Calculate the total number of accidents for each borough.

[29]: def ex_7_1(df):
          """
          Group accidents by borough

          Arguments:
          `df`: A pandas DataFrame

          Outputs:
          `boroughs`: The grouped Series
          """
          # YOUR CODE HERE
          boroughs = df.copy()                                            # work on a copy so the original is not modified
          boroughs = boroughs.groupby('BOROUGH')['COLLISION_ID'].count()  # count accidents per borough

          return boroughs

7.2

7.2.1 (1 point) Plot a bar graph of the previous data.

[30]: # YOUR CODE HERE
      boroughs_accidents = df.copy()                    # work on a copy so the original is not modified
      boroughs_accidents = boroughs_accidents.groupby('BOROUGH')['COLLISION_ID'].count().to_frame()  # group by borough
      boroughs_accidents = boroughs_accidents.reset_index()  # reset the index
      boroughs_accidents = boroughs_accidents.rename(columns={'COLLISION_ID': 'accidents'})  # rename the count column
      # plot
      plt.figure(figsize=(10, 4))
      sns.barplot(y='accidents', x='BOROUGH', data=boroughs_accidents)
      plt.show()

7.2.2 (1 point) What do you notice in the plot?


At first glance, it can be seen that there were more accidents in Brooklyn than in the other four boroughs, and it is also clear that Staten Island has the fewest accidents among them.
However, that information is not very useful on its own, because the boroughs differ in size, population, and traffic, so we should not venture to draw major conclusions from this plot.

7.3 (hard | 3 points) Calculate the number of accidents per square mile for each borough.
Hint: You will have to update the keys in the borough dictionary to match the names in the
DataFrame.
[31]: def ex_7_3(df, borough_data):
          """
          Calculate accidents per sq mile for each borough

          Arguments:
          `df`: A pandas DataFrame
          `borough_data`: A Python dictionary with population and area data for each borough

          Outputs:
          `borough_frame`: A DataFrame with the count of accidents per borough and an
          additional column called `accidents_per_sq_mi` that results from dividing the
          number of accidents in each borough by its area. Please call this new column
          exactly `accidents_per_sq_mi` - otherwise the test cells will throw an error.
          """
          # change the names in the JSON data to match the names in the DataFrame
          borough_data['the bronx']['name'] = 'BRONX'
          borough_data['brooklyn']['name'] = 'BROOKLYN'
          borough_data['manhattan']['name'] = 'MANHATTAN'
          borough_data['queens']['name'] = 'QUEENS'
          borough_data['staten island']['name'] = 'STATEN ISLAND'

          boroughs = ex_7_1(df)
          borough_frame = pd.DataFrame(boroughs)

          # YOUR CODE HERE
          borough_frame = borough_frame.reset_index()    # reset the index
          borough_frame = borough_frame.merge(pd.json_normalize(borough_data.values()),  # left join on the borough name
                                              how='left', left_on='BOROUGH', right_on='name')
          borough_frame['accidents_per_sq_mi'] = borough_frame['COLLISION_ID'] / borough_frame['area']  # accidents divided by area

          return borough_frame  # This must be a DataFrame, NOT a Series

7.4

7.4.1 (1 point) Plot a bar graph of the accidents per square mile per borough with the data you
just calculated.
[34]: # YOUR CODE HERE
      acc_per_area = ex_7_3(df, borough_data)
      plt.figure(figsize=(10, 4))
      sns.barplot(y='accidents_per_sq_mi', x='BOROUGH', data=acc_per_area)
      plt.show()

7.4.2 (1 point) What can you conclude?


It can be seen here that Manhattan has the maximum number of accidents per unit area, followed by Brooklyn, the Bronx, Queens, and Staten Island in last place.
This information is very important because we are considering not only the absolute number of accidents in each borough but also its size, so the measure is relative to area and comparable across the boroughs.
In my opinion, considering information such as population or traffic per borough would also be an interesting approach for comparing the boroughs; a per-capita sketch follows.
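
A minimal sketch of the per-capita idea, reusing the population figures already present in borough_data (the column name accidents_per_100k is just an illustration, not required by any test cell):

[ ]: # Sketch: accidents per 100,000 residents, by borough
     per_capita = ex_7_3(df, borough_data)
     per_capita['accidents_per_100k'] = (per_capita['COLLISION_ID']
                                         / per_capita['population'] * 100_000)
     plt.figure(figsize=(10, 4))
     sns.barplot(y='accidents_per_100k', x='BOROUGH', data=per_capita)
     plt.show()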

1.4.7 Exercise 8
8.1 (2 points) Create a Series of the number of accidents per hour and borough.

[35]: def ex_8_1(df):
          """
          Calculate accidents per hour for each borough

          Arguments:
          `df`: A pandas DataFrame

          Outputs:
          `bor_hour`: A Series. This should be the result of doing groupby by borough
          and hour.
          """
          # YOUR CODE HERE
          bor_hour = df.copy()                                  # work on a copy so the original is not modified
          bor_hour['HOUR'] = pd.to_datetime(bor_hour['TIME'])   # create the HOUR variable
          bor_hour['HOUR'] = bor_hour['HOUR'].dt.hour           # keep only the hour
          bor_hour = bor_hour.groupby(['BOROUGH', 'HOUR'])['COLLISION_ID'].count()  # group by borough and hour

          return bor_hour

8.2

8.2.1 (2 points) Plot a bar graph for each borough showing the number of accidents for each
hour of the day.
Hint: You can use sns.FacetGrid to create a grid of plots with the hourly data of each borough.
[45]: # YOUR CODE HERE
      # plot
      borough_hourly = ex_8_1(df).to_frame().reset_index()
      g = sns.FacetGrid(borough_hourly, col='BOROUGH')
      # pass `order` explicitly; otherwise seaborn warns that the plot may be incorrect
      g = g.map(sns.barplot, 'HOUR', 'COLLISION_ID',
                order=sorted(borough_hourly['HOUR'].unique()))

8.2.2 (1 point) Which hours have the most accidents for each borough?
By and large, the highest number of accidents occurs between 2 and 5 pm, especially around 4 pm. The lowest numbers occur very early in the morning, between 1 and 5 am, and this pattern is consistent across the boroughs.
Another very interesting pattern, visible in almost every borough, is an increase in accidents around 8 am followed by a slight decrease over the next few hours. This is probably because that is when most people are on their way to work: the more people on the streets, the more likely accidents become.

1.4.8 Exercise 9 (hard | 3 points)


Using the CONTRIBUTING FACTOR VEHICLE columns of df, find which 6 factors cause the most accidents. It is important that you avoid double counting the contributing factors of a single accident.
Hint: You can use the pd.melt() function to take a subset of df and convert it from wide format
to narrow format.
[46]: def ex_9(df):
          """
          Finds which 6 factors cause the most accidents, without
          double counting the contributing factors of a single accident.

          Arguments:
          `df`: A pandas DataFrame.

          Outputs:
          `factors_most_acc`: A pandas DataFrame. It has only 6 elements, which are,
          sorted in descending order, the contributing factors with the most accidents.
          The column with the actual numbers is named `index`.
          """
          # YOUR CODE HERE
          factors_most_acc = df.copy()                   # work on a copy so the original is not modified
          factors_most_acc = pd.melt(factors_most_acc, id_vars=['COLLISION_ID'],
                                     value_vars=['CONTRIBUTING FACTOR VEHICLE 1',
                                                 'CONTRIBUTING FACTOR VEHICLE 2',
                                                 'CONTRIBUTING FACTOR VEHICLE 3',
                                                 'CONTRIBUTING FACTOR VEHICLE 4',
                                                 'CONTRIBUTING FACTOR VEHICLE 5'],
                                     var_name='vehicle',
                                     value_name='Contributing Factor')   # melt to long format
          factors_most_acc = factors_most_acc.dropna()                   # remove NAs
          factors_most_acc = factors_most_acc.drop(columns='vehicle')    # the vehicle slot is no longer needed
          factors_most_acc = factors_most_acc.drop_duplicates()          # avoid double counting factors within one accident
          factors_most_acc = (factors_most_acc.groupby(['Contributing Factor'])
                              .count()
                              .sort_values(by='COLLISION_ID', ascending=False)
                              .reset_index())                            # accidents per factor, descending
          factors_most_acc = factors_most_acc.rename(columns={'COLLISION_ID': 'index'})  # rename the count column
          factors_most_acc = factors_most_acc.head(6)                    # keep only the top 6 factors

          return factors_most_acc

1.4.9 Exercise 10 (hard | 3 points)


Which 10 vehicle type-borough pairs are most involved in accidents? Avoid double counting the
types of vehicles involved in a single accident. You can apply a similar approach to the one used
in the previous exercise using pd.melt().
Hint: You may want to include BOROUGH as one of your id_vars (the other being COLLISION_ID) in pd.melt(). Including BOROUGH in your final .groupby() is also a good idea.

[47]: def ex_10(df):
          """
          Finds the 10 borough:vehicle type pairs with the most accidents, without
          double counting the vehicle types of a single accident.

          Arguments:
          `df`: A pandas DataFrame.

          Outputs:
          `vehi_most_acc`: A pandas DataFrame. It has only 10 elements, which are,
          sorted in descending order, the borough-vehicle pairs with the most accidents.
          The column with the actual numbers is named `index`.
          """
          vehi_cols = ['VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3',
                       'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5']

          # YOUR CODE HERE
          vehi_most_acc = df.copy()                      # work on a copy so the original is not modified
          vehi_most_acc = pd.melt(vehi_most_acc, id_vars=['COLLISION_ID', 'BOROUGH'],
                                  value_vars=vehi_cols,
                                  var_name='vehicle',
                                  value_name='Vehicle Type')         # melt to long format
          vehi_most_acc = vehi_most_acc.dropna()                     # remove NAs
          vehi_most_acc = vehi_most_acc.drop(columns='vehicle')      # the vehicle slot is no longer needed
          vehi_most_acc = vehi_most_acc.drop_duplicates()            # avoid double counting vehicle types within one accident
          vehi_most_acc = (vehi_most_acc.groupby(['BOROUGH', 'Vehicle Type'])
                           .count()
                           .sort_values(by='COLLISION_ID', ascending=False)
                           .reset_index())                           # accidents per borough-vehicle pair, descending
          vehi_most_acc = vehi_most_acc.rename(columns={'COLLISION_ID': 'index'})  # rename the count column
          vehi_most_acc = vehi_most_acc.head(10)                     # keep only the top 10 pairs

          return vehi_most_acc

1.4.10 Exercise 11 (2 points)


In a 2018 interview with The New York Times, New York’s mayor de Blasio stated that “Vision
Zero is clearly working”. That year, the number of deaths in traffic accidents in NYC dropped to a
historically low 202. Yet, as reported by am New York Metro, the number of fatalities has increased
by 30% in the first quarter of 2019 compared to the previous year and the number of pedestrians
and cyclists injured has not seen any improvement.
Which of the following BEST describes how you would use the provided data to understand what
went wrong in the first quarter of 2019? Please explain the reasons for your choice.
A. Consider the accidents of the first quarter of 2019. Then, check for the most common causes of accidents where pedestrians and cyclists were involved. Give a recommendation based solely on this information.
B. Create a pair of heat maps of the accidents involving injured/killed pedestrians and cyclists in the first quarter of 2018 and 2019. Compare these two to see if there is any change in the concentration of accidents. In critical areas, study the type of factors involved in the accidents. Give a recommendation to visit these areas to study the problem further.
C. The provided data is insufficient to improve our understanding of the situation.
D. None of the above. (If you choose this, please elaborate on what you would do instead.)
B. To understand the 30% increase in the number of deaths in accidents in NYC, it is very important to compare what happened in that period with the same period of the previous year. A pair of heat maps makes that comparison easier, though it is not the only plot that could be used. It is also very important to consider the concentration of accidents, the critical areas, and the factors involved, so that we can understand what went wrong in the first quarter of 2019. Looking at the most common causes of accidents in the first quarter of 2019, as option A suggests, matters too, but without comparing against the previous year we cannot tell whether those causes are the same as last year's or where the flaws of the first quarter lie. Finally, with the complete picture we can give accurate recommendations and perhaps visit the critical areas to study the problem further. A sketch of the heat-map comparison follows.
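
A minimal sketch of that comparison (not part of the graded solution), using plain 2D histograms of the accident coordinates as stand-ins for proper map-based heat maps:

[ ]: # Sketch: Q1 2018 vs. Q1 2019 density of accidents involving injured or
     # killed pedestrians/cyclists, binned on longitude/latitude
     vuln = df.copy()
     vuln['DATE'] = pd.to_datetime(vuln['DATE'])
     vuln = vuln.dropna(subset=['LATITUDE', 'LONGITUDE'])
     vuln = vuln[(vuln['NUMBER OF PEDESTRIANS INJURED'] + vuln['NUMBER OF PEDESTRIANS KILLED']
                  + vuln['NUMBER OF CYCLIST INJURED'] + vuln['NUMBER OF CYCLIST KILLED']) > 0]
     fig, axes = plt.subplots(1, 2, figsize=(14, 5), sharex=True, sharey=True)
     for ax, year in zip(axes, [2018, 2019]):
         q1 = vuln[(vuln['DATE'].dt.year == year) & (vuln['DATE'].dt.quarter == 1)]
         ax.hist2d(q1['LONGITUDE'], q1['LATITUDE'], bins=60)   # accident density
         ax.set_title(f'Q1 {year}')
     plt.show()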

1.4.11 Exercise 12
12.1 (hard | 3 points) Calculate the number of deaths caused by each type of vehicle.
Hint 1: As an example of how to compute vehicle involvement in deaths, suppose two people died in an accident where 5 vehicles were involved, of which 4 are PASSENGER VEHICLE and 1 is a SPORT UTILITY/STATION WAGON. Then we would add two deaths to both the PASSENGER VEHICLE and SPORT UTILITY/STATION WAGON types.
Hint 2: You will need to use pd.melt() and proceed as in the previous exercises to avoid double-
counting the types of vehicles (i.e. you should remove duplicate “accident ID - vehicle type” pairs).

[48]: def ex_12_1(df):
          """
          Calculate total killed per vehicle type and plot the result
          as a bar graph

          Arguments:
          `df`: A pandas DataFrame.

          Outputs:
          `result`: A pandas DataFrame. Its index should be the vehicle type. Its only
          column should be `TOTAL KILLED`
          """
          # YOUR CODE HERE
          vehi_cols = ['VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3',
                       'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5']

          result = df.copy()                         # work on a copy so the original is not modified
          result = pd.melt(result, id_vars=['COLLISION_ID',
                                            'NUMBER OF PEDESTRIANS KILLED',
                                            'NUMBER OF CYCLIST KILLED',
                                            'NUMBER OF MOTORIST KILLED'],
                           value_vars=vehi_cols,
                           var_name='vehicle',
                           value_name='Vehicle Type')  # melt to long format
          result = result.dropna()                     # remove missing values
          result = result.drop(columns='vehicle')      # drop the slot column before deduplicating, so that
          result = result.drop_duplicates()            # duplicate "accident ID - vehicle type" pairs are removed
          result['TOTAL KILLED'] = (result['NUMBER OF PEDESTRIANS KILLED']
                                    + result['NUMBER OF CYCLIST KILLED']
                                    + result['NUMBER OF MOTORIST KILLED'])   # total deaths per row
          result = result.drop(columns=['NUMBER OF PEDESTRIANS KILLED',
                                        'NUMBER OF CYCLIST KILLED',
                                        'NUMBER OF MOTORIST KILLED'])
          result = (result.groupby(['Vehicle Type'])
                    .agg({'TOTAL KILLED': 'sum'})
                    .sort_values(by='TOTAL KILLED', ascending=False))        # deaths per vehicle type

          return result

12.2

12.2.1 (1 point) Plot a bar chart for the top 5 vehicles.

[55]: top_5_veh

[55]:                           Vehicle Type  TOTAL KILLED
      0  Station Wagon/Sport Utility Vehicle           100
      1                                Sedan            79
      2                    PASSENGER VEHICLE            33
      3        SPORT UTILITY / STATION WAGON            26
      4                           Motorcycle            22

[54]: # YOUR CODE HERE
      top_5_veh = ex_12_1(df).reset_index().head(5)   # keep the top 5 vehicle types
      # plot
      plt.figure(figsize=(10, 4))
      sns.barplot(y='TOTAL KILLED', x='Vehicle Type', data=top_5_veh)
      plt.show()

12.2.2 (2 points) Which vehicles are most often involved in deaths, and by how much more than
the others?
The Station Wagon/Sport Utility Vehicle type was most often involved in deaths, with about 100 people killed in total over the period analyzed. That is around 20 more deaths than Sedan, around 60 more than PASSENGER VEHICLE, about 65 more than SPORT UTILITY / STATION WAGON, and around 80 more than Motorcycle.
However, these are approximations read off the bar plot. Should one need the exact values, a quick look at the DataFrame the plot was built from (shown above) gives the precise numbers.

1.5 Testing cells


[154]: # Ex. 2.1
       assert type(ex_2(df)) == type(pd.Series([9,1,2])), "Ex. 2.1 - Your output isn't a pandas Series. If you use .groupby() and an aggregation function, the output is a Series by default."
       assert ex_2(df).loc["2018-10"] == 13336, "Ex. 2.1 - Wrong output! Try using the .size() aggregation function with your .groupby()."
       print("Exercise 2.1 looks correct!")

Exercise 2.1 looks correct!

[253]: # Ex 4.1
       assert type(ex_4(df)) == type(pd.Series([9,1,2])), "Ex. 4.1 - Your output isn't a pandas Series. If you use .groupby() and an aggregation function, the output is a Series by default."
       assert ex_4(df).loc[13] == 14224, "Ex. 4.1 - Wrong output! Try using the .size() aggregation function with your .groupby()."
       print("Exercise 4.1 looks correct!")

Exercise 4.1 looks correct!

[315]: # Ex. 6.1
       assert type(ex_6(df)) == type(pd.Series([9,1,2])), "Ex. 6.1 - Your output isn't a pandas Series. If you use .groupby() and an aggregation function, the output is a Series by default."
       assert max(ex_6(df)) == 37886, "Ex. 6.1 - Your results don't match ours! Remember that you can use the .size() aggregation function to count the number of elements in a groupby group."
       print("Exercise 6.1 looks correct!")

Exercise 6.1 looks correct!

[325]: # Ex. 7.1
       assert type(ex_7_1(df)) == type(pd.Series([9,1,2])), "Ex. 7.1 - Your output isn't a pandas Series. If you use .groupby() and an aggregation function, the output is a Series by default."
       assert max(ex_7_1(df)) == 76253, "Ex. 7.1 - Your results don't match ours! Remember that you can use the .size() aggregation function to count the number of elements in a groupby group."
       print("Exercise 7.1 looks correct!")

Exercise 7.1 looks correct!

[83]: # Ex. 7.3
      with open('data/borough_data.json') as f:
          borough_data = json.load(f)
      e73 = ex_7_3(df, borough_data)
      assert "accidents_per_sq_mi" in e73.columns, "Ex. 7.3 - You didn't create an 'accidents_per_sq_mi' in your DataFrame!"
      assert round(min(e73["accidents_per_sq_mi"])) == 149, "Ex. 7.3 - Your output doesn't match ours! Remember that you need to divide the number of accidents in each of the five boroughs by the respective areas in square miles."
      print("Exercise 7.3 looks correct!")

Exercise 7.3 looks correct!

[103]: # Ex. 8.1
       assert type(ex_8_1(df)) == type(pd.Series([9,1,2])), "Ex. 8.1 - Your output isn't a pandas Series. If you use .groupby() and an aggregation function, the output is a Series by default."
       assert ex_8_1(df).max() == 5701, "Ex. 8.1 - Your numbers don't match ours. If you haven't already, you can try using .size() as your aggregation function."
       print("Exercise 8.1 looks correct!")

Exercise 8.1 looks correct!

[219]: # Ex. 9
       assert type(ex_9(df)) == type(pd.Series([9,1,2]).to_frame()), "Ex. 9 - Your output isn't a pandas DataFrame. If you use .groupby() and an aggregation function, the output is a Series by default."
       assert len(ex_9(df)) == 6, "Ex. 9 - Your output doesn't have six elements. Did you forget to use .head(6)?"
       assert int(ex_9(df)["index"].sum()) == 316248, "Ex. 9 - Your numbers don't match ours. Are you sure you sorted your Series in descending order? If you haven't already, you can try using .count() as your aggregation function."
       print("Exercise 9 looks correct!")

Exercise 9 looks correct!

[234]: # Ex. 10
       assert type(ex_10(df)) == type(pd.Series([9,1,2]).to_frame()), "Ex. 10 - Your output isn't a pandas DataFrame. If you use .groupby() and an aggregation function, the output is a Series by default."
       assert len(ex_10(df)["index"]) == 10, "Ex. 10 - Your output doesn't have 10 elements. Did you forget to use .head(10)?"
       assert ex_10(df)["index"].sum() == 229882, "Ex. 10 - Your numbers don't match ours. Are you sure you sorted your Series in descending order? If you haven't already, you can try using .count() as your aggregation function."
       print("Exercise 10 looks correct!")

Exercise 10 looks correct!

[25]: # Ex. 12.1
      e12 = ex_12_1(df)
      assert type(e12) == type(pd.Series([9,1,2]).to_frame()), "Ex. 12.1 - Your output isn't a pandas DataFrame. If you use .groupby() and an aggregation function, the output is a Series by default."
      assert int(e12.loc["Bike"]) == 19, "Ex. 12.1 - Your output doesn't match ours! Remember that you need to remove the duplicate pairs and use the .sum() aggregation function in your groupby."
      print("Exercise 12.1 looks correct!")

Exercise 12.1 looks correct!

1.6 Attribution
“Vehicle Collisions in NYC 2015-Present”, New York Police Department, NYC Open Data terms of use, https://www.kaggle.com/nypd/vehicle-collisions

“Boroughs of New York City”, Creative Commons Attribution-ShareAlike License, https://en.wikipedia.org/wiki/Boroughs_of_New_York_City

