Solution - Data Analysis With Python-Project-2 - v1.0
Certification Project
Bike-Sharing Demand Analysis
Objective: Use data to understand which factors affect the number of bike trips. Build a
predictive model to estimate the number of trips in a particular hour slot, based on the
environmental conditions.
Problem Statement:
Lyft, Inc. is a transportation network company based in San Francisco, California, operating
in 640 cities in the United States and 9 cities in Canada. It develops, markets, and operates the
Lyft mobile app, offering car rides, scooters, and a bicycle-sharing system. It is the second-largest
rideshare company in the world, behind only Uber.
Lyft’s bike-sharing service is also among the largest in the USA. Being able to anticipate demand
is extremely important for planning the number of bicycles, the stations, and the personnel required to
maintain them. Demand is sensitive to many factors, such as season, humidity, rain,
day of the week, holidays, and more. To enable this planning, Lyft needs to predict demand
accurately as a function of these factors.
Domain: General
Steps to perform:
As the first step, look at the null values in the file. A sanity check to ensure that you have clean
records, and that the data is good to work with, is very important. Then, you’ll do univariate and
bivariate analyses to identify patterns in the data and the nature of the individual features.
This is a very important step: it not only helps identify features that could be
interesting for the predictive model later, but also helps you understand what’s going on in the
data. The EDA will also reveal whether any transformations need to be applied to the features before
building the model. Finally, you will build a predictive model using linear regression.
Solution
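All the snippets below use pandas, NumPy, Matplotlib, and seaborn without showing the imports; a minimal setup (a sketch of the assumed imports) is:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns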
1. Load the data file
inp0 = pd.read_csv("hour.csv")
inp0.head()
2. Check for null values in the data, drop records with NAs
inp0.isna().sum(axis=0)
There are no records with null values, so the data looks good so far.
3. Sanity checks:
a. Check whether registered + casual = cnt for all the records. The two must add up to cnt;
if they don’t, the row is junk and should be dropped.
np.sum((inp0.casual + inp0.registered) != inp0.cnt)
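If the count above is non-zero, the mismatched rows are junk and can be dropped; a sketch of that clean-up (keeping only the rows where the two columns add up to cnt) would be:
# keep only the rows where casual + registered equals cnt
inp0 = inp0[(inp0.casual + inp0.registered) == inp0.cnt].copy()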
5. Univariate analysis –
- Describe the numerical fields in the dataset using pandas describe method
inp1.describe()
- Make a density plot for temp. This gives a sense of the centrality and the
spread of the distribution.
inp1.temp.plot.density()
sns.boxplot(inp1.atemp)
There don’t seem to be any outliers for atemp.
sns.boxplot(inp1.cnt)
Unlike atemp, the boxplot for cnt shows some unusually high values.
6. Outlier treatment –
1. Cnt – looks like some hours have rather high values of cnt. We’ll need to treat these
outliers so that they don’t skew our analysis and our model.
a. Find out the following percentiles - 10, 25, 50, 75, 90, 95, 99
b. Decide the cutoff percentile and drop records with values higher than the
cutoff. Name the new dataframe ‘inp2’.
inp1.cnt.quantile([0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
563 is the 95th percentile – only 5% of the records have a value higher than this. We take this as the
cutoff.
inp2 = inp1[inp1.cnt < 563].copy()
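As a quick check of the effect of the cutoff (a sketch), you can look at the fraction of rows retained:
# roughly 95% of the records should remain after dropping values above the 95th percentile
len(inp2) / len(inp1)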
7. Bi-variate analysis
1. Make box plot for cnt vs hr
a. What kind of pattern do you see?
plt.figure(figsize=[12,6])
sns.boxplot("hr", "cnt", data=inp2)
It’s evident that the peak hours are 5 PM – 7 PM; the 7–8 AM slot also has a high upper quartile.
A hypothesis could be that a lot of people use the bikes to commute to the workplace and back
(see the sketch below, which compares working and non-working days). The end-of-winter / early-spring
months appear to have the fewest bike rides.
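To probe the commute hypothesis, the hourly pattern can be split by working and non-working days (a sketch; it assumes the workingday flag that ships with the standard hour.csv file):
# commute peaks around 7-8 AM and 5-7 PM should show up mainly on working days
plt.figure(figsize=[12,6])
sns.boxplot(x="hr", y="cnt", hue="workingday", data=inp2)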
4. Make boxplot for cnt vs season
a. Which season has the highest rides in general? Expected?
plt.figure(figsize=[10,6])
sns.boxplot("season", "cnt", data=inp2)
5. Make a bar plot with the median value of cnt for each hr
a. Does this paint a different picture than the box plot?
res = inp2.groupby("hr")["cnt"].median()
plt.figure(figsize=[8,5])
plt.bar(res.index, res.values)
It paints a similar picture to the boxplot, although the view is much cleaner and the pattern
comes out more easily.
6. Make a correlation matrix for variables – atemp, temp, hum, windspeed
a. Which variables have the highest correlation?
num_vars = ['temp', 'atemp', 'hum', 'windspeed']
corrs = inp2[num_vars].corr()
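Inspecting corrs answers the question: temp and atemp are expected to be almost perfectly correlated, since atemp is the “feels like” temperature derived from temp. A heatmap (a sketch) makes the matrix easier to read:
# annotated heatmap of the correlation matrix
sns.heatmap(corrs, annot=True, cmap="coolwarm")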
10. Separate X and y for df_train and df_test. For example, you should have X_train and y_train
from df_train: y_train should be the cnt column (carried over from inp3), and X_train should be all
the other columns.
y_train = df_train.pop("cnt")
X_train = df_train
y_test = df_test.pop("cnt")
X_test = df_test
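Note that df_train and df_test come from an earlier step that is not shown in this excerpt; a typical way to create them from the prepared dataframe inp3 would be a random split (a sketch – the 70/30 ratio and random_state are assumptions, not the original solution’s values):
from sklearn.model_selection import train_test_split
# hypothetical 70/30 split of the prepared data
df_train, df_test = train_test_split(inp3, test_size=0.3, random_state=42)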
11. Model building
- Use Linear regression as the technique
- Report the R2 on the train set
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
# R2 on the training set
lr.score(X_train, y_train)
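The step only asks for the train-set R2, but it is natural to also check the held-out test set (a sketch):
# R2 on the test set – a much lower value than on the train set would indicate overfitting
lr.score(X_test, y_test)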