Stat
- Read the question provided for each cell and assign your answer to the respective
variable provided in the following cell.
- If an answer is a floating-point number, round it off to two digits after the
decimal point.
- For example, 10.546 should be read as 10.55, 10.544 as 10.54, and 10.1 as 10.10.
- The pandas and numpy packages are preinstalled, which should be
sufficient to complete this task.
- If you need any additional package, run !pip3 install <package_name> --user
in a new cell.
- Please don't change the names of the variables meant to hold your answers.
NOTE: Run the last cell to save your answers in a pickle file.
import pandas as pd
import numpy as np
### Read the data (this will not be graded)
!wget https://hr-projects-assets-prod.s3.amazonaws.com/c3pde3c3lgm/963fbab228e2896e79fc09e385ab377d/data.csv
data = pd.read_csv("data.csv")
data.columns = [
    "Day", "Avg_Temp", "Avg_Hum", "Avg_Dew", "Avg_Bar", "Avg_Wind", "Avg_Gust",
    "Avg_Dir", "Rain_Mon", "Rain_Yr", "Max_Rain", "Max_Temp", "Min_temp",
    "Max_Hum", "Min_Hum", "Max_Psr", "Min_Psr", "Max_Wind", "Max_Gust", "Max_Heat",
]
data.head()
         Day  Avg_Temp  Avg_Hum  Avg_Dew  Avg_Bar  Avg_Wind  Avg_Gust  Avg_Dir  Rain_Mon  Rain_Yr  \
0  1/01/2009      37.8       35     12.7     29.7      26.4      36.8      274       0.0      0.0
1  2/01/2009      43.2       32     14.7     29.5      12.8      18.0      240       0.0      0.0
2  3/01/2009      25.7       60     12.7     29.7       8.3      12.2      290       0.0      0.0
3  4/01/2009       9.3       67      0.1     30.4       2.9       4.5       47       0.0      0.0
4  5/01/2009      23.5       30     -5.3     29.9      16.7      23.1      265       0.0      0.0

   Max_Rain  Max_Temp  Min_temp  Max_Hum  Min_Hum  Max_Psr  Min_Psr  Max_Wind  Max_Gust  Max_Heat
0       0.0      40.1      34.5       44       27   29.762   29.596      41.4      59.0      40.1
1       0.0      52.8      37.5       43       16   29.669   29.268      35.7      51.0      52.8
2       0.0      41.2       6.7       89       35   30.232   29.260      25.3      38.0      41.2
3       0.0      19.4      -0.0       79       35   30.566   30.227      12.7      20.0      32.0
4       0.0      30.3      15.1       56       13   30.233   29.568      38.0      53.0      32.0
What is the standard deviation of maximum wind speed across all the days?
q1 = round(data.Max_Wind.std(),2)
q1
13.06
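One point worth checking for this question is which standard deviation is intended. pandas' `Series.std()` uses the sample formula (`ddof=1`) by default, whereas `np.std` defaults to the population formula (`ddof=0`). A minimal sketch with made-up wind speeds (hypothetical values, not the assignment data):

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen so that the population std is exactly 2.0
wind = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = wind.std()                    # pandas default: ddof=1 (divide by n - 1)
population_std = np.std(wind.to_numpy())   # numpy default: ddof=0 (divide by n)
population_std_pd = wind.std(ddof=0)       # the population formula via pandas

print(round(sample_std, 2), round(population_std, 2))
```

The answer cell above relies on the pandas default, so it reports the sample standard deviation.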
What is the difference between the 50th percentile and the 75th percentile of average
temperature?
q2 = round(data.Avg_Temp.quantile(0.75) - data.Avg_Temp.quantile(0.5), 2)
q2
12.2
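As a small illustration of what `quantile` returns, the sketch below uses five made-up temperatures (hypothetical, not the assignment data). With pandas' default linear interpolation, the q-th quantile sits at position `q * (n - 1)` in the sorted values.

```python
import pandas as pd

# Hypothetical temperatures, already sorted for readability
temps = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

p50 = temps.quantile(0.50)   # position 0.50 * 4 = 2 -> third value
p75 = temps.quantile(0.75)   # position 0.75 * 4 = 3 -> fourth value
spread = round(p75 - p50, 2)

print(p50, p75, spread)
```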
What is the Pearson correlation between average dew point and average temperature?
q3 = data[["Avg_Dew", "Avg_Temp"]].corr()  # 2x2 Pearson correlation matrix
q3 = 0.76  # off-diagonal entry of the matrix, rounded to two decimals
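The coefficient can also be obtained as a scalar rather than read off the matrix by hand. A minimal sketch on made-up numbers (the column names mirror the notebook, the values are hypothetical):

```python
import pandas as pd

# Hypothetical readings, not the assignment data
df = pd.DataFrame({
    "Avg_Dew":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "Avg_Temp": [2.0, 4.0, 5.0, 4.0, 5.0],
})

# Series.corr uses the Pearson method by default and returns a float
r_series = df["Avg_Dew"].corr(df["Avg_Temp"])

# The same number is the off-diagonal entry of the correlation matrix
r_matrix = df.corr().loc["Avg_Dew", "Avg_Temp"]

print(round(r_series, 2))
```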
Out of all the available records, which month has the lowest average humidity?
- Assign your answer as the month index; for example, if it is July, the index is 7.
q4 = 11
# Assumption to verify: if the dates are day-first (1/01/2009, 2/01/2009, ...),
# pass dayfirst=True to pd.to_datetime.
data['Day'] = pd.to_datetime(data['Day'])
data["month"] = data['Day'].dt.month
# Mean humidity per month, lowest first
df1 = data[["month", "Avg_Hum"]].groupby(["month"]).mean().sort_values(by="Avg_Hum", ascending=True)
df1
Avg_Hum
month
11 44.861940
1 45.068729
3 46.045455
6 46.520979
10 46.724490
2 47.357430
9 48.736842
12 48.763052
8 50.246622
4 51.752896
7 52.923913
5 53.191489
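The month grouping depends on how the `Day` strings are parsed. The rows shown by `data.head()` (1/01/2009, 2/01/2009, 3/01/2009, ...) look like consecutive days, which would suggest a day-first format; that is an assumption about the file, not something the notebook states. The sketch below shows how the two readings of the same strings differ. Explicit `format` strings are used so the behaviour does not depend on the pandas version.

```python
import pandas as pd

# The first three Day values as they appear in data.head()
raw = pd.Series(["1/01/2009", "2/01/2009", "3/01/2009"])

month_first = pd.to_datetime(raw, format="%m/%d/%Y")  # read as Jan 1, Feb 1, Mar 1
day_first = pd.to_datetime(raw, format="%d/%m/%Y")    # read as Jan 1, Jan 2, Jan 3

months_if_month_first = month_first.dt.month.tolist()
months_if_day_first = day_first.dt.month.tolist()

print(months_if_month_first, months_if_day_first)
```

If the file is in fact day-first, the per-month means would need to be recomputed after parsing with `dayfirst=True` or an explicit format.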
Which month has the highest median for maximum_gust_speed out of all the available
records? Also find the respective value.
- Hint: group by month
# Median of Max_Gust per month, highest first
df1 = data[["month", "Max_Gust"]].groupby(["month"]).median().sort_values(by="Max_Gust", ascending=False)
df1
q5 = 11
q6 = 32.2
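Rather than reading the winning month and value off a sorted table, both can be taken directly from the grouped result. A minimal sketch on made-up gust speeds (hypothetical data; `top_month` and `top_value` are illustrative names):

```python
import pandas as pd

# Hypothetical records: three readings for each of three months
df = pd.DataFrame({
    "month":    [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Max_Gust": [10.0, 30.0, 20.0, 40.0, 25.0, 35.0, 5.0, 15.0, 50.0],
})

median_by_month = df.groupby("month")["Max_Gust"].median()

top_month = int(median_by_month.idxmax())   # index label of the largest median
top_value = float(median_by_month.max())    # the largest median itself

print(top_month, top_value)
```

Month 3 holds the single largest gust (50.0) but not the largest median, which is why the median has to be computed per group first.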
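Determine the average temperature between the months of March 2010 and May 2012
(including both months).
data1 = data[["Day", "Avg_Temp"]]
# NOTE: these bounds select 2010-01-05 through 2012-01-05; for March 2010 through
# May 2012 inclusive they would be >= '2010-03-01' and < '2012-06-01'.
data1 = data1[(data1.Day >= '2010-01-05') & (data1.Day < '2012-01-06')]
q7 = round(data1.Avg_Temp.mean(), 2)
q7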
44.2
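For a window stated in whole months, one way to write the filter is an inclusive lower bound on the first day of the start month and an exclusive upper bound on the first day of the month after the end month. The sketch below applies that to a synthetic daily series (hypothetical values), not to the assignment data.

```python
import pandas as pd

# Synthetic daily frame covering 2010-2012; Avg_Temp is just a counter
days = pd.date_range("2010-01-01", "2012-12-31", freq="D")
df = pd.DataFrame({"Day": days, "Avg_Temp": range(len(days))})

start = pd.Timestamp("2010-03-01")          # first day of March 2010 (inclusive)
end_exclusive = pd.Timestamp("2012-06-01")  # first day after May 2012 (exclusive)

window = df[(df["Day"] >= start) & (df["Day"] < end_exclusive)]
mean_temp = round(window["Avg_Temp"].mean(), 2)

print(window["Day"].min().date(), window["Day"].max().date(), len(window))
```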
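Find the range of average temperature in Dec 2010.
data["Year"] = data["Day"].dt.year
data1 = data[["Year", "month", "Avg_Temp"]]
data1 = data1[(data1.Year == 2010) & (data1.month == 12)]
# NOTE: this computes the mean for Dec 2010; the range would be
# data1.Avg_Temp.max() - data1.Avg_Temp.min().
q8 = round(data1.Avg_Temp.mean(), 2)
q8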
34.35
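The range of a set of values is the maximum minus the minimum, which is generally different from the mean. A minimal sketch on a few made-up readings (hypothetical data) that computes both for December 2010:

```python
import pandas as pd

# Hypothetical readings around December 2010
df = pd.DataFrame({
    "Day": pd.to_datetime(
        ["2010-11-30", "2010-12-01", "2010-12-15", "2010-12-31", "2011-01-01"]
    ),
    "Avg_Temp": [50.0, 31.5, 20.25, 42.0, 10.0],
})

# Keep only the rows that fall in December 2010
dec_2010 = df[(df["Day"].dt.year == 2010) & (df["Day"].dt.month == 12)]

temp_range = round(dec_2010["Avg_Temp"].max() - dec_2010["Avg_Temp"].min(), 2)
temp_mean = round(dec_2010["Avg_Temp"].mean(), 2)

print(temp_range, temp_mean)
```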
Out of all available records, which day has the highest difference between
maximum_pressure and minimum_pressure?
- Assign the date in string format as 'yyyy-mm-dd'. Make sure you enclose it in
single quotes.
data1 = data[["Day", "Max_Psr", "Min_Psr"]].copy()  # copy avoids SettingWithCopyWarning
data1["Diff"] = data1["Max_Psr"] - data1["Min_Psr"]
data1 = data1.sort_values(by="Diff", ascending=False)  # largest difference first
data1.head()
q9 = '2014-01-21'
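The date can also be pulled out and formatted programmatically instead of being typed in by hand. A minimal sketch on three made-up days (hypothetical data; `best_day` is an illustrative name):

```python
import pandas as pd

# Hypothetical pressure readings
df = pd.DataFrame({
    "Day": pd.to_datetime(["2020-05-01", "2020-05-02", "2020-05-03"]),
    "Max_Psr": [30.10, 30.50, 30.20],
    "Min_Psr": [29.90, 29.40, 30.00],
})

diff = df["Max_Psr"] - df["Min_Psr"]

# idxmax gives the row label of the largest difference;
# strftime renders the timestamp as 'yyyy-mm-dd'
best_day = df.loc[diff.idxmax(), "Day"].strftime("%Y-%m-%d")

print(best_day)
```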
How many days fall on the median (i.e. are equal to the median value) of the barometer reading?
data1 = data[["Day", "Avg_Bar"]]
data1 = data1[data1.Avg_Bar == data1.Avg_Bar.median()]
q10 = 534
data1.shape
(534, 2)
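Counting rows equal to the median can be done with a boolean mask. The sketch below uses five made-up barometer readings (hypothetical data). With an odd number of values the median is one of the observations; with an even number it may be the average of the two middle values and match no row at all.

```python
import pandas as pd

# Hypothetical barometer readings (odd count, so the median is an observed value)
bar = pd.Series([29.7, 29.9, 29.9, 29.9, 30.1])

median = bar.median()
n_equal = int((bar == median).sum())   # True counts as 1 when summed

# Even count: the median falls between the two middle values
bar_even = pd.Series([29.7, 29.9, 30.1, 30.3])
n_equal_even = int((bar_even == bar_even.median()).sum())

print(median, n_equal, n_equal_even)
```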
Out of all the available records, how many days are within one standard deviation of
the average temperature?
data1 = data[["Year", "month", "Avg_Temp"]]
Stdev = data1.Avg_Temp.std()
up = data1.Avg_Temp.mean() + Stdev   # upper bound: mean + 1 std
bel = data1.Avg_Temp.mean() - Stdev  # lower bound: mean - 1 std
data1 = data1[(data1.Avg_Temp >= bel) & (data1.Avg_Temp <= up)]
q11 = 2092
data1.shape
(2092, 3)
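The same band can be expressed with `Series.between`, which is inclusive at both ends by default. A minimal sketch on made-up temperatures (hypothetical data):

```python
import pandas as pd

# Hypothetical temperatures: mean 5.0, sample std about 2.14
temps = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = temps.mean()
std = temps.std()                 # sample std (ddof=1), as in the cell above
lower, upper = mean - std, mean + std

# between() keeps values with lower <= x <= upper
n_within = int(temps.between(lower, upper).sum())

print(round(lower, 2), round(upper, 2), n_within)
```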
Run this cell to save your answers. Do not modify this cell.
import pickle
import hashlib

def make_pickle2(file_name, obj):
    with open(file_name, 'wb') as f:
        pickle.dump(geth(obj), f, pickle.HIGHEST_PROTOCOL)

def geth(obj):
    obj = str(obj).encode()
    m = hashlib.md5()
    m.update( bytes(obj) )
    return m.hexdigest()

def pickling():
    try:
        make_pickle2('q1.pickle', q1)
    except:
        print('q1 not defined')
    try:
        make_pickle2('q2.pickle', q2)
    except:
        print('q2 not defined')
    try:
        make_pickle2('q3.pickle', q3)
    except:
        print('q3 not defined')
    try:
        make_pickle2('q4.pickle', q4)
    except:
        print('q4 not defined')
    try:
        make_pickle2('q5.pickle', q5)
    except:
        print('q5 not defined')
    try:
        make_pickle2('q6.pickle', q6)
    except:
        print('q6 not defined')
    try:
        make_pickle2('q7.pickle', q7)
    except:
        print('q7 not defined')
    try:
        make_pickle2('q8.pickle', q8)
    except:
        print('q8 not defined')
    try:
        make_pickle2('q9.pickle', q9)
    except:
        print('q9 not defined')
    try:
        make_pickle2('q10.pickle', q10)
    except:
        print('q10 not defined')
    try:
        make_pickle2('q11.pickle', q11)
    except:
        print('q11 not defined')

pickling()
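The save cell stores the MD5 hash of `str(answer)` rather than the answer itself, so what ends up in each pickle depends on the string form of the value. The sketch below re-implements `geth` in a condensed but equivalent form and round-trips one hash through a temporary file. The file name `q_demo.pickle` is hypothetical, and how a grader compares the stored hashes is not stated in the notebook.

```python
import hashlib
import os
import pickle
import tempfile

def geth(obj):
    # Condensed equivalent of the notebook's geth: MD5 of the value's string form
    return hashlib.md5(str(obj).encode()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "q_demo.pickle")   # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(geth(12.2), f, pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as f:
        stored = pickle.load(f)

same_as_float = stored == geth(12.20)      # str(12.20) is '12.2'
same_as_string = stored == geth("12.2")    # the string '12.2' hashes identically
same_as_padded = stored == geth("12.20")   # a padded string hashes differently

print(same_as_float, same_as_string, same_as_padded)
```

One consequence is that a float such as 10.1 is hashed as '10.1', so trailing zeros from the rounding examples do not survive unless the answer is stored as a string.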