Solution - Data Analysis With Python-Project-2 - v1.0
Certification Project
Bike-Sharing Demand Analysis
Objective: Use data to understand which factors affect the number of bike trips. Build a
predictive model to estimate the number of trips in a particular hour slot, based on the
environmental conditions.
Problem Statement:
Lyft, Inc. is a transportation network company based in San Francisco, California, operating
in 640 cities in the United States and 9 cities in Canada. It develops, markets, and operates the
Lyft mobile app, offering car rides, scooters, and a bicycle-sharing system. It is the second-largest
rideshare company in the world, behind only Uber.
Lyft’s bike-sharing service is also among the largest in the USA. Being able to anticipate demand
is extremely important for planning the number of bicycles, the stations, and the personnel required to
maintain them. Demand is sensitive to many factors, such as season, humidity, rain,
day of the week, holidays, and more. To enable this planning, Lyft needs to predict demand
accurately as a function of these factors.
Domain: General
Steps to perform:
As the first step, look at the null values in the file. A sanity check to ensure that you have clean
records, and that the data is good to work with, is very important. Then, you’ll do univariate and
bivariate analyses to identify patterns in the data and the nature of the individual features.
This is a very important step: it not only helps identify features that could be
interesting for the predictive model later, but also helps you understand what’s going on in the
data. The EDA will also reveal whether any transformations need to be applied to the features before
building the model. Finally, you will build a predictive model using linear regression.
Solution
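All the snippets below use pandas, NumPy, Matplotlib, and seaborn without showing the imports; a minimal setup (a sketch of the assumed imports) is:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns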
1. Load the data file
inp0 = pd.read_csv("hour.csv")
inp0.head()
2. Check for null values in the data, drop records with NAs
inp0.isna().sum(axis=0)
There are no records with null values, so the data looks good so far.
3. Sanity checks:
a. Check whether registered + casual = cnt for all the records. The two must add up to cnt;
if they don’t, the row is junk and should be dropped.
np.sum((inp0.casual + inp0.registered) != inp0.cnt)
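If the count above is non-zero, the mismatched rows are junk and can be dropped; a sketch of that clean-up (keeping only the rows where the two columns add up to cnt) would be:
# keep only the rows where casual + registered equals cnt
inp0 = inp0[(inp0.casual + inp0.registered) == inp0.cnt].copy()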
5. Univariate analysis –
- Describe the numerical fields in the dataset using pandas describe method
inp1.describe()
- Make a density plot for temp. This gives a sense of the centrality and the
spread of the distribution.
inp1.temp.plot.density()
sns.boxplot(inp1.atemp)
There don’t seem to be any outliers for atemp.
sns.boxplot(inp1.cnt)
Unlike atemp, the boxplot for cnt shows some unusually high values.
6. Outlier treatment –
1. Cnt – looks like some hours have rather high values of cnt. We’ll need to treat these
outliers so that they don’t skew our analysis and our model.
a. Find out the following percentiles - 10, 25, 50, 75, 90, 95, 99
b. Decide the cutoff percentile and drop records with values higher than the
cutoff. Name the new dataframe ‘inp2’.
inp1.cnt.quantile([0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99])
563 is the 95th percentile – only 5% of the records have a value higher than this. We take this as the
cutoff.
inp2 = inp1[inp1.cnt < 563].copy()
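As a quick check of the effect of the cutoff (a sketch), you can look at the fraction of rows retained:
# roughly 95% of the records should remain after dropping values above the 95th percentile
len(inp2) / len(inp1)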
7. Bi-variate analysis
1. Make box plot for cnt vs hr
a. What kind of pattern do you see?
plt.figure(figsize=[12,6])
sns.boxplot("hr", "cnt", data=inp2)
It’s evident that the peak hours are 5 PM – 7 PM; the 7–8 AM slot also has a high upper quartile.
A hypothesis could be that a lot of people use the bikes to commute to the workplace and back
(see the sketch below, which compares working and non-working days). The end-of-winter / early-spring
months appear to have the fewest bike rides.
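To probe the commute hypothesis, the hourly pattern can be split by working and non-working days (a sketch; it assumes the workingday flag that ships with the standard hour.csv file):
# commute peaks around 7-8 AM and 5-7 PM should show up mainly on working days
plt.figure(figsize=[12,6])
sns.boxplot(x="hr", y="cnt", hue="workingday", data=inp2)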
4. Make boxplot for cnt vs season
a. Which season has the highest rides in general? Expected?
plt.figure(figsize=[10,6])
sns.boxplot("season", "cnt", data=inp2)
5. Make a bar plot with the median value of cnt for each hr
a. Does this paint a different picture than the box plot?
res = inp2.groupby("hr")["cnt"].median()
plt.figure(figsize=[8,5])
plt.bar(res.index, res.values)
It paints a similar picture to the boxplot, although the view is much cleaner and the pattern
comes out more easily.
6. Make a correlation matrix for variables – atemp, temp, hum, windspeed
a. Which variables have the highest correlation?
num_vars = ['temp', 'atemp', 'hum', 'windspeed']
corrs = inp2[num_vars].corr()
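Inspecting corrs answers the question: temp and atemp are expected to be almost perfectly correlated, since atemp is the “feels like” temperature derived from temp. A heatmap (a sketch) makes the matrix easier to read:
# annotated heatmap of the correlation matrix
sns.heatmap(corrs, annot=True, cmap="coolwarm")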
10. Separate X and y for df_train and df_test. For example, you should have X_train and y_train
from df_train: y_train should be the cnt column (carried over from inp3), and X_train should be all
the other columns.
y_train = df_train.pop("cnt")
X_train = df_train
y_test = df_test.pop("cnt")
X_test = df_test
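Note that df_train and df_test come from an earlier step that is not shown in this excerpt; a typical way to create them from the prepared dataframe inp3 would be a random split (a sketch – the 70/30 ratio and random_state are assumptions, not the original solution’s values):
from sklearn.model_selection import train_test_split
# hypothetical 70/30 split of the prepared data
df_train, df_test = train_test_split(inp3, test_size=0.3, random_state=42)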
11. Model building
- Use Linear regression as the technique
- Report the R2 on the train set
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
# R2 on the training set
lr.score(X_train, y_train)
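The step only asks for the train-set R2, but it is natural to also check the held-out test set (a sketch):
# R2 on the test set – a much lower value than on the train set would indicate overfitting
lr.score(X_test, y_test)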