Extended - Basic EDA Python Fellow
1.1 Introduction
Business Context. The city of New York has seen a rise in the number of accidents on its roads.
They would like to know whether the number of accidents has increased in the last few
weeks. For all the reported accidents, they have collected details for each accident and have been
maintaining records for the past year and a half (from January 2018 to August 2019).
The city has contracted you to build visualizations that would help them identify patterns in
accidents, which would help them take preventive actions to reduce the number of accidents in
the future. They would like specific information on certain parameters like borough, time of day,
reason for accident, etc.
Business Problem. Your task is to format the given data and provide visualizations that would
answer the specific questions the client has, which are mentioned below.
Analytical Context. You are given a CSV file (stored in the already created data folder) containing
details about each accident like date, time, location of the accident, reason for the accident,
types of vehicles involved, injury and death count, etc. The delimiter in the given CSV file is ;
instead of the default ,. You will be performing the following tasks on the data:
1. Extract additional borough data stored in a JSON file
2. Read, transform, and prepare data for visualization
3. Construct and analyze visualizations of the data to identify patterns in the dataset
The client has a specific set of questions they would like to get answers to. You will need to provide
visualizations to accompany these:
1. How has the number of accidents fluctuated over the past year and a half? Has it
increased over that time?
2. For any particular day, during which hours are accidents most likely to occur?
3. Are there more accidents on weekdays than weekends?
4. What is the accident count-to-area ratio per borough? Which boroughs have disproportionately
large numbers of accidents for their size?
5. For each borough, during which hours are accidents most likely to occur?
6. What are the top 5 causes of accidents in the city?
7. What types of vehicles are most involved in accidents per borough?
8. What types of vehicles are most involved in deaths?
Note: To solve this extended case, please read the function docstrings very carefully. They
contain information that you will need! Also, please don’t include print() statements inside your
functions (they will most likely produce an error in the test cells). Finally, for the purposes of this
case, do not worry about standardizing text variables - for example, treat taxi and Taxi as though
they were different values.
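Before exploring the accidents data, we first load the borough details (name, population, and area in
square miles) from the JSON file in the data folder. A minimal sketch using Python's built-in json
module is shown below; the file name data/boroughs.json is an assumption, so adjust it to match the
file in your data folder. It produces the dictionary shown next:

import json

with open('data/boroughs.json') as f:    # hypothetical file name
    borough_data = json.load(f)          # dictionary with population and area per borough
borough_data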
[19]: {'the bronx': {'name': 'the bronx', 'population': 1471160.0, 'area': 42.1},
'brooklyn': {'name': 'brooklyn', 'population': 2648771.0, 'area': 70.82},
'manhattan': {'name': 'manhattan', 'population': 1664727.0, 'area': 22.83},
'queens': {'name': 'queens', 'population': 2358582.0, 'area': 108.53},
'staten island': {'name': 'staten island', 'population': 479458.0, 'area': 58.37}}
Similarly, let’s use the pandas function read_csv() to load the file accidents.csv as a DataFrame.
We will name this DataFrame df.
[20]: with open('data/accidents.csv') as f:
          df = pd.read_csv(f, delimiter=';')
         ON STREET NAME  NUMBER OF PEDESTRIANS INJURED
0                   NaN                              0
1      FLATLANDS AVENUE                              1
2                   NaN                              0
3           MAIN STREET                              0
4                   NaN                              0

[5 rows x 24 columns]
[22]: df.columns
- B. The analysis of the provided data will be useful for identifying the high-risk locations, so
that road safety audits can be implemented at those locations.
- F. To launch the Speed Reducer Program, an analysis of the data is a must, so that the program
can be based on facts about the data rather than just personal opinions.
1.4.1 Exercise 2
2.1 (2 points) Group the available accident data by month.
Hint: You may find the pandas functions pd.to_datetime() and dt.to_period() useful.
[23]: def ex_2(df):
          """
          Group accidents by month
          Arguments:
          `df`: A pandas DataFrame
          Outputs:
          `monthly_accidents`: The grouped Series
          """
          # YOUR CODE HERE
          monthly_accidents = df.copy()  # make a copy of the data so we don't modify it
          monthly_accidents.DATE = pd.to_datetime(monthly_accidents.DATE)  # change the variable to datetime type
          monthly_accidents.DATE = monthly_accidents.DATE.dt.to_period("M")  # keep only the month of the DATE variable
          monthly_accidents = monthly_accidents.groupby(['DATE'])['COLLISION_ID'].size()  # group by DATE and count the size of each group
          return monthly_accidents
2.2
lineplot = ex_2(df)  # accidents per month (Series indexed by month)
lineplot.plot(figsize=(10, 4))  # line plot of the monthly accident counts
[24]: <AxesSubplot:xlabel='DATE'>
2.2.2 (1 point) Has the number of accidents increased over the past year and a half? Justify
your answer with an interpretation of a plot.
No. By and large, the number of accidents over the past year and a half has actually decreased.
The pattern is not very clear or stable, but there is a steady decrease in the number of accidents
over the past year and a half, which is slightly less time than what is shown in the plot (20 months).
1.4.3 Exercise 4
4.1 (2 points) Create a new column HOUR based on the data from the TIME column.
Hint: You may find the dt.hour accessor useful.
[25]: def ex_4(df):
          """
          Group accidents by hour of day
          Arguments:
          `df`: A pandas DataFrame
          Outputs:
          `hourly_accidents`: The grouped Series
          """
          # YOUR CODE HERE
          hourly_accidents = df.copy()  # make a copy of the data so we don't modify it
          hourly_accidents['HOUR'] = pd.to_datetime(hourly_accidents['TIME'])  # create the HOUR variable from TIME
          hourly_accidents['HOUR'] = hourly_accidents['HOUR'].dt.hour  # extract only the hour of the HOUR variable
          hourly_accidents = hourly_accidents.groupby(['HOUR'])['COLLISION_ID'].size()  # count accidents per hour
          return hourly_accidents
4.2
4.2.1 (1 point) Plot a bar graph of the distribution per hour throughout the day.
4.2.2 (1 point) How does the number of accidents vary throughout a single day?
The number of accidents varies considerably throughout the day. The maximum number of accidents
occurs in the afternoon, around 2 pm to 5 pm, and the lowest number of accidents occurs at around
2 am to 5 am, which is what one would expect given that there are fewer vehicles on the streets at
that time. Overall, an almost regular pattern can be seen, similar to a sin(x) function, with a peak
followed by a valley in a steady cycle.
1.4.5 Exercise 6
6.1 (2 points) Calculate the number of accidents by day of the week.
Hint: You may find the dt.weekday accessor useful.
[27]: def ex_6(df):
          """
          Group accidents by day of the week
          Arguments:
          `df`: A pandas DataFrame
          Outputs:
          `weekday_accidents`: The grouped Series
          """
          # YOUR CODE HERE
          weekday_accidents = df.copy()  # make a copy of the data so we don't modify it
          weekday_accidents['DATE'] = pd.to_datetime(weekday_accidents['DATE'])  # change DATE to a datetime object
          weekday_accidents['DAY_NAME'] = weekday_accidents['DATE'].dt.day_name()  # extract the weekday name
          weekday_accidents = weekday_accidents.groupby(['DAY_NAME'])['COLLISION_ID'].count()  # group by day of the week and count
          return weekday_accidents
6.2
6.2.1 (1 point) Plot a bar graph based on the accidents count by day of the week.
weekday_accidents = ex_6(df).to_frame()  # accident counts per day of the week
weekday_accidents = weekday_accidents.reset_index(drop=False)  # turn DAY_NAME back into a column
weekday_accidents = weekday_accidents.rename(columns={'COLLISION_ID': 'accidents'})  # rename the count column
# plot
plt.figure(figsize=(10, 4))
sns.barplot(y='accidents', x='DAY_NAME', data=weekday_accidents)
plt.show()
6.2.2 (1 point) How does the number of accidents vary throughout a single week?
The plot shows that there are fewer accidents on weekends than on weekdays. For example,
Sundays have the lowest number of accidents. This is to be expected, given that Sunday is usually
the day with the fewest cars on the street, and people generally don't work or go out that day as
much as they do during the weekdays.
On the other hand, the number of accidents during the week is similar from day to day, although
Friday has the most accidents of all.
1.4.6 Exercise 7
7.1 (2 points) Calculate the total number of accidents for each borough.
def ex_7_1(df):
    """
    Calculate the total number of accidents for each borough
    Arguments:
    `df`: A pandas DataFrame
    Outputs:
    `boroughs`: The grouped Series
    """
    # YOUR CODE HERE
    boroughs = df.groupby(['BOROUGH'])['COLLISION_ID'].size()  # count accidents per borough
    return boroughs
7.2
boroughs_accidents = ex_7_1(df).to_frame()  # accident counts per borough
boroughs_accidents = boroughs_accidents.reset_index()  # turn BOROUGH back into a column
boroughs_accidents = boroughs_accidents.rename(columns={'COLLISION_ID': 'accidents'})  # rename the count column
# plot
plt.figure(figsize=(10, 4))
sns.barplot(y='accidents', x='BOROUGH', data=boroughs_accidents)
plt.show()
7.3 (hard | 3 points) Calculate the number of accidents per square mile for each borough.
Hint: You will have to update the keys in the borough dictionary to match the names in the
DataFrame.
[31]: def ex_7_3(df, borough_data):
          """
          Calculate accidents per sq mile for each borough
          Arguments:
          `df`: A pandas DataFrame
          `borough_data`: A python dictionary with population and area data for each borough
          Outputs:
          `borough_frame`: A DataFrame with the accident count of each borough and an
          additional `accidents_per_sq_mi` column
          """
          # change the names in the JSON dictionary to match the BOROUGH values in the DataFrame
          borough_data['the bronx']['name'] = 'BRONX'
          borough_data['brooklyn']['name'] = 'BROOKLYN'
          borough_data['manhattan']['name'] = 'MANHATTAN'
          borough_data['queens']['name'] = 'QUEENS'
          borough_data['staten island']['name'] = 'STATEN ISLAND'
          boroughs = ex_7_1(df)  # accident counts per borough (indexed by BOROUGH)
          borough_frame = pd.DataFrame(boroughs)
          # map each borough to its area in square miles
          area_by_name = {value['name']: value['area'] for value in borough_data.values()}
          borough_frame['area'] = borough_frame.index.map(area_by_name)
          borough_frame['accidents_per_sq_mi'] = borough_frame['COLLISION_ID'] / borough_frame['area']  # create the accidents_per_sq_mi variable
          return borough_frame
7.4
7.4.1 (1 point) Plot a bar graph of the accidents per square mile per borough with the data you
just calculated.
[34]: # YOUR CODE HERE
      acc_per_area = ex_7_3(df, borough_data).reset_index()  # turn BOROUGH back into a column
      plt.figure(figsize=(10, 4))
      sns.barplot(y='accidents_per_sq_mi', x='BOROUGH', data=acc_per_area)
      plt.show()
1.4.7 Exercise 8
8.1 (2 points) Create a Series of the number of accidents per hour and borough.
def ex_8_1(df):
    """
    Create a Series of the number of accidents per hour and borough
    Arguments:
    `df`: A pandas DataFrame
    Outputs:
    `bor_hour`: A Series. This should be the result of doing groupby by borough
    and hour.
    """
    # YOUR CODE HERE
    bor_hour = df.copy()  # make a copy of the data so we don't modify it
    bor_hour['HOUR'] = pd.to_datetime(bor_hour['TIME'])  # create the HOUR variable from TIME
    bor_hour['HOUR'] = bor_hour['HOUR'].dt.hour  # extract the hour from the HOUR variable
    bor_hour = bor_hour.groupby(['BOROUGH', 'HOUR'])['COLLISION_ID'].count()  # group by BOROUGH and HOUR
    return bor_hour
8.2
8.2.1 (2 points) Plot a bar graph for each borough showing the number of accidents for each
hour of the day.
Hint: You can use sns.FacetGrid to create a grid of plots with the hourly data of each borough.
[45]: # YOUR CODE HERE
      # plot
      borough_hourly = ex_8_1(df).to_frame().reset_index()
      g = sns.FacetGrid(borough_hourly, col='BOROUGH')
      g = g.map(sns.barplot, 'HOUR', 'COLLISION_ID', order=sorted(borough_hourly['HOUR'].unique()))
8.2.2 (1 point) Which hours have the most accidents for each borough?
By and large, the highest number of accidents occurs between 2 pm and 5 pm, especially around 4 pm.
The lowest numbers of accidents occur very early in the morning, between 1 am and 5 am, and this
pattern is consistent across the boroughs.
Another very interesting pattern, visible in almost every borough, is an increase in accidents around
8 am followed by a slight decrease over the next few hours. This is likely because that is the time
when more people are on their way to work, and the more people on the streets, the more likely an
accident becomes.
def ex_9(contrib_df):
    """
    Calculate the 6 contributing factors with the most accidents
    Arguments:
    `contrib_df`: A pandas DataFrame.
    Outputs:
    `factors_most_acc`: A pandas DataFrame. It has only 6 elements, which are,
    sorted in descending order, the contributing factors with the most accidents.
    """
    # contributing-factor column names are assumed to follow the NYC Open Data crash schema
    factor_cols = ['CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2',
                   'CONTRIBUTING FACTOR VEHICLE 3', 'CONTRIBUTING FACTOR VEHICLE 4',
                   'CONTRIBUTING FACTOR VEHICLE 5']
    factors_most_acc = pd.melt(contrib_df, id_vars='COLLISION_ID', value_vars=factor_cols,
                               var_name='vehicle', value_name='factor')  # one row per accident-factor pair
    factors_most_acc = factors_most_acc.dropna()  # remove NAs
    factors_most_acc = factors_most_acc.drop(columns="vehicle")  # remove vehicle column
    factors_most_acc = factors_most_acc.drop_duplicates()  # get rid of duplicate accident-factor pairs
    factors_most_acc = factors_most_acc.groupby('factor').count()  # count accidents per factor
    factors_most_acc = factors_most_acc.sort_values('COLLISION_ID', ascending=False).head(6)  # keep the top 6
    return factors_most_acc
def ex_10(df):
    """
    Calculate the 10 borough-vehicle pairs with the most accidents
    Arguments:
    `df`: A pandas DataFrame.
    Outputs:
    `vehi_most_acc`: A pandas DataFrame. It has only 10 elements, which are,
    sorted in descending order, the borough-vehicle pairs with the most accidents.
    """
    vehi_cols = ['VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3',
                 'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5']
    vehi_most_acc = pd.melt(df, id_vars=['COLLISION_ID', 'BOROUGH'], value_vars=vehi_cols,
                            var_name='vehicle', value_name='vehicle_type')  # one row per accident-vehicle pair
    vehi_most_acc = vehi_most_acc.dropna()  # remove NAs
    vehi_most_acc = vehi_most_acc.drop(columns="vehicle")  # drop vehicle column
    vehi_most_acc = vehi_most_acc.drop_duplicates()  # get rid of duplicated values
    vehi_most_acc = vehi_most_acc.groupby(['BOROUGH', 'vehicle_type']).count()  # group by borough and vehicle type
    vehi_most_acc = vehi_most_acc.rename(columns={'COLLISION_ID': 'index'})  # change variable name
    vehi_most_acc = vehi_most_acc.sort_values('index', ascending=False)  # sort in descending order
    vehi_most_acc = vehi_most_acc.head(10)
    return vehi_most_acc
1.4.11 Exercise 12
12.1 (hard | 3 points) Calculate the number of deaths caused by each type of vehicle.
Hint 1: As an example of how to compute vehicle involvement in deaths, suppose two people
died in an accident where 5 vehicles were involved, 4 of which are PASSENGER VEHICLE and 1 is a
SPORT UTILITY/STATION WAGON. Then we would add two deaths to both the PASSENGER
VEHICLE and SPORT UTILITY/STATION WAGON types.
Hint 2: You will need to use pd.melt() and proceed as in the previous exercises to avoid double-
counting the types of vehicles (i.e. you should remove duplicate “accident ID - vehicle type” pairs).
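To make Hint 1 concrete, here is a minimal sketch on a tiny, made-up DataFrame. The vehicle type
columns mirror the ones used above, the death-count column name NUMBER OF PERSONS KILLED is an
assumption, and the values are invented. After melting the vehicle type columns and dropping
duplicate accident-vehicle type pairs, each distinct vehicle type in an accident receives that
accident's full death count exactly once:

import pandas as pd

# Hypothetical single accident: 2 deaths, with 4 PASSENGER VEHICLE and 1 SPORT UTILITY/STATION WAGON involved
toy = pd.DataFrame({
    'COLLISION_ID': [1],
    'NUMBER OF PERSONS KILLED': [2],   # column name assumed for illustration
    'VEHICLE TYPE CODE 1': ['PASSENGER VEHICLE'],
    'VEHICLE TYPE CODE 2': ['PASSENGER VEHICLE'],
    'VEHICLE TYPE CODE 3': ['PASSENGER VEHICLE'],
    'VEHICLE TYPE CODE 4': ['PASSENGER VEHICLE'],
    'VEHICLE TYPE CODE 5': ['SPORT UTILITY/STATION WAGON'],
})

vehi_cols = [c for c in toy.columns if c.startswith('VEHICLE TYPE CODE')]
melted = pd.melt(toy, id_vars=['COLLISION_ID', 'NUMBER OF PERSONS KILLED'],
                 value_vars=vehi_cols, value_name='vehicle_type').dropna()
deduped = melted.drop_duplicates(subset=['COLLISION_ID', 'vehicle_type'])  # keep each accident-vehicle type pair once
print(deduped.groupby('vehicle_type')['NUMBER OF PERSONS KILLED'].sum())
# PASSENGER VEHICLE              2
# SPORT UTILITY/STATION WAGON    2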
[48]: def ex_12_1(df):
          """
          Calculate total killed per vehicle type and plot the result
          as a bar graph
          Arguments:
          `df`: A pandas DataFrame.
          Outputs:
          `result`: A pandas DataFrame. Its index should be the vehicle type. Its only
          column should be `TOTAL KILLED`
          """
          # the death-count column name NUMBER OF PERSONS KILLED is an assumption
          vehi_cols = ['VEHICLE TYPE CODE 1', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3',
                       'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5']
          result = pd.melt(df, id_vars=['COLLISION_ID', 'NUMBER OF PERSONS KILLED'],
                           value_vars=vehi_cols, var_name='vehicle', value_name='vehicle_type')  # one row per accident-vehicle pair
          result = result.dropna()  # remove NAs
          result = result.drop(columns='vehicle')  # drop vehicle column
          result = result.drop_duplicates()  # remove duplicated accident-vehicle type pairs
          result = result.groupby('vehicle_type')[['NUMBER OF PERSONS KILLED']].sum()  # add up deaths per vehicle type
          result = result.rename(columns={'NUMBER OF PERSONS KILLED': 'TOTAL KILLED'})
          return result
12.2
[55]: top_5_veh
[55]:                    Contributing Factor  TOTAL KILLED
      0  Station Wagon/Sport Utility Vehicle           100
      1                                Sedan            79
      2                    PASSENGER VEHICLE            33
      3        SPORT UTILITY / STATION WAGON            26
      4                           Motorcycle            22
12.2.2 (2 points) Which vehicles are most often involved in deaths, and by how much more than
the others?
The Station Wagon/Sport Utility Vehicle type was most often involved in deaths, with about
100 people killed in total in the period analyzed. These vehicles were involved in around 20 more deaths
than the Sedan ones, around 60 more than the PASSENGER VEHICLE, 65 more than SPORT
UTILITY / STATION WAGON, and around 80 more than the Motorcycle vehicles.
However, those are approximations read off the bar plot. Should one need the exact values, a
simple look at the DataFrame the plot was built from gives the exact numbers.
assert ex_2(df).loc["2018-10"] == 13336, "Ex. 2.1 - Wrong output! Try using the .size() aggregation function with your .groupby()."
[253]: # Ex 4.1
assert type(ex_4(df)) == type(pd.Series([9,1,2])), "Ex. 4.1 - Your output isn't a pandas Series."
assert ex_4(df).loc[13] == 14224, "Ex. 4.1 - Wrong output! Try using the .size() aggregation function with your .groupby()."
assert max(ex_6(df)) == 37886, "Ex. 6.1 - Your results don't match ours! Remember that you can use the .size() aggregation function."
assert max(ex_7_1(df)) == 76253, "Ex. 7.1 - Your results don't match ours! Remember that you can use the .size() aggregation function."
assert round(min(e73["accidents_per_sq_mi"])) == 149, "Ex. 7.3 - Your output doesn't match ours! Remember that you need to divide the number of accidents in each of the five boroughs by the respective areas in square miles."
assert ex_8_1(df).max() == 5701, "Ex. 8.1 - Your numbers don't match ours. If you haven't already, you can try using .size() as your aggregation function."
[219]: # Ex. 9
assert type(ex_9(df)) == type(pd.Series([9,1,2]).to_frame()), "Ex. 9 - Your output isn't a pandas DataFrame."
assert len(ex_9(df)) == 6, "Ex. 9 - Your output doesn't have six elements. Did you forget to use .head(6)?"
[234]: # Ex. 10
assert type(ex_10(df)) == type(pd.Series([9,1,2]).to_frame()), "Ex. 10 - Your output isn't a pandas DataFrame."
assert type(e12) == type(pd.Series([9,1,2]).to_frame()), "Ex. 12.1 - Your output isn't a pandas DataFrame."
assert int(e12.loc["Bike"]) == 19, "Ex. 12.1 - Your output doesn't match ours! Remember that you need to remove the duplicate pairs and use .sum()."
1.6 Attribution
“Vehicle Collisions in NYC 2015-Present”, New York Police Department, NYC Open Data terms
of use, https://www.kaggle.com/nypd/vehicle-collisions
“Boroughs of New York City”, Creative Commons Attribution-ShareAlike License,
https://en.wikipedia.org/wiki/Boroughs_of_New_York_City