Stat

The data required for this task is provided in the file 'data.csv'.
- Read the question in each cell and assign your answer to the variable provided in the following cell.
- If an answer is a floating-point number, round it to two decimal places.
- For example, 10.546 should be read as 10.55, 10.544 as 10.54, and 10.1 as 10.10.
- The pandas and numpy packages are preinstalled, which should be sufficient to complete this task.
- If you need any other package, run !pip3 install <package_name> --user in a new cell.
- Please don't change the names of the variables meant to hold your answers.
NOTE: Run the last cell to save your answers in a pickle file.
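The rounding rule in the instructions can be checked with a few lines of plain Python. This is a minimal sketch using only the three sample values quoted above; the variable names are illustrative. Note that a float cannot carry a trailing zero, so 10.10 only appears when the number is formatted as a string.

```python
# Sample values quoted in the instructions (names here are illustrative).
values = [10.546, 10.544, 10.1]

# round() returns floats rounded to two decimal places.
rounded = [round(v, 2) for v in values]

# String formatting keeps the trailing zero, which a float cannot store.
display = [f"{v:.2f}" for v in values]

print(rounded)
print(display)
```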
import pandas as pd
import numpy as np
### Read the data (this will not be graded)
!wget https://hr-projects-assets-prod.s3.amazonaws.com/c3pde3c3lgm/963fbab228e2896e79fc09e385ab377d/data.csv
--2021-08-20 06:27:43--  https://hr-projects-assets-prod.s3.amazonaws.com/c3pde3c3lgm/963fbab228e2896e79fc09e385ab377d/data.csv
Resolving hr-projects-assets-prod.s3.amazonaws.com (hr-projects-assets-prod.s3.amazonaws.com)... 52.217.198.121
Connecting to hr-projects-assets-prod.s3.amazonaws.com (hr-projects-assets-prod.s3.amazonaws.com)|52.217.198.121|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 332846 (325K) [binary/octet-stream]
Saving to: ‘data.csv.3’
data.csv.3 100%[===================>] 325.04K 388KB/s in 0.8s
2021-08-20 06:27:45 (388 KB/s) - ‘data.csv.3’ saved [332846/332846]
data = pd.read_csv("data.csv")
data.columns = [
    "Day", "Avg_Temp", "Avg_Hum", "Avg_Dew", "Avg_Bar", "Avg_Wind", "Avg_Gust",
    "Avg_Dir", "Rain_Mon", "Rain_Yr", "Max_Rain", "Max_Temp", "Min_temp",
    "Max_Hum", "Min_Hum", "Max_Psr", "Min_Psr", "Max_Wind", "Max_Gust",
    "Max_Heat",
]
data.head()
         Day  Avg_Temp  Avg_Hum  Avg_Dew  Avg_Bar  Avg_Wind  Avg_Gust  Avg_Dir  Rain_Mon  Rain_Yr  \
0  1/01/2009      37.8       35     12.7     29.7      26.4      36.8      274       0.0      0.0
1  2/01/2009      43.2       32     14.7     29.5      12.8      18.0      240       0.0      0.0
2  3/01/2009      25.7       60     12.7     29.7       8.3      12.2      290       0.0      0.0
3  4/01/2009       9.3       67      0.1     30.4       2.9       4.5       47       0.0      0.0
4  5/01/2009      23.5       30     -5.3     29.9      16.7      23.1      265       0.0      0.0

   Max_Rain  Max_Temp  Min_temp  Max_Hum  Min_Hum  Max_Psr  Min_Psr  Max_Wind  Max_Gust  Max_Heat
0       0.0      40.1      34.5       44       27   29.762   29.596      41.4      59.0      40.1
1       0.0      52.8      37.5       43       16   29.669   29.268      35.7      51.0      52.8
2       0.0      41.2       6.7       89       35   30.232   29.260      25.3      38.0      41.2
3       0.0      19.4      -0.0       79       35   30.566   30.227      12.7      20.0      32.0
4       0.0      30.3      15.1       56       13   30.233   29.568      38.0      53.0      32.0
What is the standard deviation of maximum wind speed across all the days?
q1 = round(data.Max_Wind.std(), 2)
q1
13.06
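One detail worth knowing about the call above: pandas computes the sample standard deviation (ddof=1) by default, while numpy's np.std defaults to the population version (ddof=0). The sketch below illustrates the difference on a small made-up series; the numbers are not taken from data.csv.

```python
import numpy as np
import pandas as pd

# Made-up wind speeds, for illustration only.
wind = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = wind.std()                 # pandas default: ddof=1
population_std = wind.std(ddof=0)       # divide by n instead of n - 1
numpy_std = np.std(wind.to_numpy())     # numpy default: ddof=0

print(round(sample_std, 2), population_std, numpy_std)
```

Which definition a grader expects is not stated in the task, so the pandas default used in the notebook is an assumption about the intended answer.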
What is the difference between the 50th percentile and the 75th percentile of average temperature?
q2 = round(data.Avg_Temp.quantile(0.75) - data.Avg_Temp.quantile(0.5), 2)
q2
12.2
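Series.quantile interpolates linearly between neighbouring values by default, which matters when the requested percentile falls between two observations. A minimal sketch on made-up temperatures (not from data.csv):

```python
import pandas as pd

# Four made-up temperatures; percentiles fall between observations.
temps = pd.Series([10.0, 20.0, 30.0, 40.0])

p50 = temps.quantile(0.5)    # halfway between 20 and 30
p75 = temps.quantile(0.75)   # a quarter of the way from 30 to 40
difference = round(p75 - p50, 2)

print(p50, p75, difference)
```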
What is the Pearson correlation between average dew point and average temperature?
# Series.corr uses the Pearson method by default; the original cell recorded 0.76.
q3 = round(data["Avg_Dew"].corr(data["Avg_Temp"]), 2)
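For reference, the Pearson coefficient is the covariance of the two series divided by the product of their standard deviations. The sketch below computes it by hand on made-up values and compares it with Series.corr; the series names are illustrative.

```python
import numpy as np
import pandas as pd

# Made-up dew point and temperature values.
dew = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
temp = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Manual Pearson r: sum of co-deviations over the root of the
# product of the summed squared deviations.
dx = dew - dew.mean()
dy = temp - temp.mean()
manual_r = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

builtin_r = dew.corr(temp)  # method="pearson" is the default

print(round(manual_r, 2), round(builtin_r, 2))
```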
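Out of all the available records, which month has the lowest average humidity?
- Assign your answer as the month index; for example, if it is July, the index is 7.
# NOTE: the dates in the file look like d/mm/yyyy. Passing dayfirst=True to
# pd.to_datetime would parse them that way; the outputs recorded in this
# notebook were produced with the default parsing used here.
data["Day"] = pd.to_datetime(data["Day"])
data["month"] = data["Day"].dt.month
df1 = data[["month", "Avg_Hum"]].groupby("month").mean().sort_values(by="Avg_Hum", ascending=True)
q4 = int(df1.index[0])  # first row of the sorted table, 11 in the output below
df1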
Avg_Hum
month
11 44.861940
1 45.068729
3 46.045455
6 46.520979
10 46.724490
2 47.357430
9 48.736842
12 48.763052
8 50.246622
4 51.752896
7 52.923913
5 53.191489
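Because the month index drives this answer, it is worth seeing how date parsing affects it. The sketch below uses a few made-up rows in the same d/mm/yyyy style as the file and compares the default (month-first) parsing with dayfirst=True, then picks the month with the lowest mean humidity via idxmin. The frame and its values are hypothetical.

```python
import pandas as pd

# Made-up rows in the same d/mm/yyyy style as the file's Day column.
raw = pd.DataFrame({
    "Day": ["1/01/2009", "2/01/2009", "3/02/2009", "4/02/2009"],
    "Avg_Hum": [35, 45, 60, 70],
})

# Default parsing reads the first number as the month.
month_first = pd.to_datetime(raw["Day"]).dt.month

# dayfirst=True reads the first number as the day.
day_first = pd.to_datetime(raw["Day"], dayfirst=True).dt.month

# Mean humidity per month, then the month index with the lowest mean.
monthly_mean = raw.assign(month=day_first).groupby("month")["Avg_Hum"].mean()
lowest_month = int(monthly_mean.idxmin())

print(list(month_first), list(day_first), lowest_month)
```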
Which month has the highest median maximum gust speed out of all the available records? Also find the respective value.
- Hint: group by month.
df1 = data[["month", "Max_Gust"]].groupby("month").median().sort_values(by="Max_Gust", ascending=False)
df1
q5 = int(df1.index[0])                         # month index; 11 in the original cell
q6 = round(float(df1["Max_Gust"].iloc[0]), 2)  # median value; 32.2 in the original cell
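The group-by-month hint can be sketched on a small made-up frame. The example also shows why the median can rank months differently from the mean when one month contains an outlier; all values are hypothetical.

```python
import pandas as pd

# Made-up gust speeds; month 1 has one extreme value.
gusts = pd.DataFrame({
    "month": [1, 1, 1, 2, 2, 2],
    "Max_Gust": [10.0, 20.0, 90.0, 30.0, 40.0, 50.0],
})

medians = gusts.groupby("month")["Max_Gust"].median()
means = gusts.groupby("month")["Max_Gust"].mean()

top_month = int(medians.idxmax())            # month with the highest median
top_value = round(float(medians.max()), 2)   # that median

print(medians.to_dict(), means.to_dict(), top_month, top_value)
```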
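Determine the average temperature from March 2010 to May 2012 (both months included).
data1 = data[["Day", "Avg_Temp"]]
# Half-open interval: from 1 March 2010 up to, but not including, 1 June 2012.
# The original cell filtered 2010-01-05 <= Day < 2012-01-06 and printed 44.2,
# which covers a different period than the question asks for.
data1 = data1[(data1.Day >= "2010-03-01") & (data1.Day < "2012-06-01")]
q7 = round(data1.Avg_Temp.mean(), 2)
q7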
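A month-inclusive date filter is easiest to get right as a half-open interval: start at the first day of the first month and stop before the first day of the month after the last one. The sketch below checks this on a synthetic daily calendar; the frame and the temperature column are made up.

```python
import pandas as pd

# Synthetic daily calendar with a placeholder temperature column.
days = pd.date_range("2010-01-01", "2012-12-31", freq="D")
frame = pd.DataFrame({"Day": days, "Avg_Temp": days.month.astype(float)})

start = pd.Timestamp("2010-03-01")          # first day of March 2010
end_exclusive = pd.Timestamp("2012-06-01")  # day after 31 May 2012

selected = frame[(frame["Day"] >= start) & (frame["Day"] < end_exclusive)]
first_day = selected["Day"].min()
last_day = selected["Day"].max()

print(first_day.date(), last_day.date(), len(selected))
```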
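Find the range of average temperature in December 2010.
data["Year"] = data["Day"].dt.year
data1 = data[["Year", "month", "Avg_Temp"]]
data1 = data1[(data1.Year == 2010) & (data1.month == 12)]
# Range = maximum minus minimum. The original cell took the mean here and
# printed 34.35, which is not the range.
q8 = round(data1.Avg_Temp.max() - data1.Avg_Temp.min(), 2)
q8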
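The range of a series is its maximum minus its minimum, which is generally a different number from its mean. A minimal sketch on made-up December temperatures:

```python
import pandas as pd

# Made-up daily average temperatures for one month.
december = pd.Series([20.5, 34.0, 41.25, 28.0])

temp_range = round(december.max() - december.min(), 2)  # spread of the values
temp_mean = round(december.mean(), 2)                   # central value, for contrast

print(temp_range, temp_mean)
```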
Out of all available records, which day has the largest difference between maximum pressure and minimum pressure?
- Assign the date as a string in 'yyyy-mm-dd' format, enclosed in single quotes.
data1 = data[["Day", "Max_Psr", "Min_Psr"]].copy()  # .copy() avoids SettingWithCopyWarning
data1["Diff"] = data1["Max_Psr"] - data1["Min_Psr"]
data1 = data1.sort_values(by="Diff", ascending=False)  # keep the sorted result
data1
q9 = data1["Day"].iloc[0].strftime("%Y-%m-%d")  # '2014-01-21' in the original cell
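The same lookup can be done without sorting by using idxmax on the difference column. The sketch below works on a made-up three-row frame and takes an explicit copy first, which is one common way to avoid the SettingWithCopyWarning when adding a column to a column subset.

```python
import pandas as pd

# Made-up pressure readings for three days.
pressure = pd.DataFrame({
    "Day": pd.to_datetime(["2014-01-20", "2014-01-21", "2014-01-22"]),
    "Max_Psr": [30.2, 30.6, 30.1],
    "Min_Psr": [30.0, 29.7, 29.9],
})

subset = pressure[["Day", "Max_Psr", "Min_Psr"]].copy()  # independent copy
subset["Diff"] = subset["Max_Psr"] - subset["Min_Psr"]

# idxmax returns the index label of the largest difference.
widest_day = subset.loc[subset["Diff"].idxmax(), "Day"].strftime("%Y-%m-%d")

print(widest_day)
```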
How many days fall exactly on the median (i.e., are equal to the median value) of the barometer reading?
data1 = data[["Day", "Avg_Bar"]]
data1 = data1[data1.Avg_Bar == data1.Avg_Bar.median()]
q10 = data1.shape[0]  # number of matching rows
data1.shape
(534, 2)
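Counting rows equal to the median works cleanly when the median is one of the observed values. With an even number of rows the median is the average of the two middle values and may match nothing. A small sketch with made-up barometer readings shows both cases:

```python
import pandas as pd

# Odd length: the median is an actual observation.
odd = pd.Series([29.7, 29.9, 29.9, 29.9, 30.1])
odd_matches = int((odd == odd.median()).sum())

# Even length: the median is the mean of the two middle values
# and here coincides with no observation.
even = pd.Series([29.7, 29.9, 30.1, 30.3])
even_matches = int((even == even.median()).sum())

print(odd_matches, even_matches)
```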
Out of all the available records, how many days are within one standard deviation of the average temperature?
data1 = data[["Year", "month", "Avg_Temp"]]
Stdev = data1.Avg_Temp.std()
up = data1.Avg_Temp.mean() + Stdev    # upper bound: mean + 1 std
bel = data1.Avg_Temp.mean() - Stdev   # lower bound: mean - 1 std
data1 = data1[(data1.Avg_Temp >= bel) & (data1.Avg_Temp <= up)]
q11 = data1.shape[0]  # number of rows inside the band
data1.shape
(2092, 3)
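The band check above can also be written with Series.between, which is inclusive on both ends by default. A minimal sketch on made-up temperatures:

```python
import pandas as pd

# Made-up average temperatures.
temps = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

mean = temps.mean()
std = temps.std()  # sample standard deviation (ddof=1)

# between() includes both bounds, matching the >= and <= comparison.
within_one_std = int(temps.between(mean - std, mean + std).sum())

print(round(std, 2), within_one_std)
```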
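Run this cell to save your answers. Do not modify this cell.
import pickle
import hashlib

def make_pickle2(file_name, obj):
    with open(file_name, 'wb') as f:
        pickle.dump(geth(obj), f, pickle.HIGHEST_PROTOCOL)

def geth(obj):
    obj = str(obj).encode()
    m = hashlib.md5()
    m.update(bytes(obj))
    return m.hexdigest()

def pickling():
    try:
        make_pickle2('q1.pickle', q1)
    except:
        print('q1 not defined')
    # The bodies of the remaining ten try/except blocks are missing from the
    # source. They are ASSUMED to repeat the q1 pattern for q2 through q11;
    # the file names below are that assumption, not recovered text.
    try:
        make_pickle2('q2.pickle', q2)
    except:
        print('q2 not defined')
    try:
        make_pickle2('q3.pickle', q3)
    except:
        print('q3 not defined')
    try:
        make_pickle2('q4.pickle', q4)
    except:
        print('q4 not defined')
    try:
        make_pickle2('q5.pickle', q5)
    except:
        print('q5 not defined')
    try:
        make_pickle2('q6.pickle', q6)
    except:
        print('q6 not defined')
    try:
        make_pickle2('q7.pickle', q7)
    except:
        print('q7 not defined')
    try:
        make_pickle2('q8.pickle', q8)
    except:
        print('q8 not defined')
    try:
        make_pickle2('q9.pickle', q9)
    except:
        print('q9 not defined')
    try:
        make_pickle2('q10.pickle', q10)
    except:
        print('q10 not defined')
    try:
        make_pickle2('q11.pickle', q11)
    except:
        print('q11 not defined')

pickling()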
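The save cell stores an MD5 digest of str(answer) rather than the answer itself, so the exact string form of each variable matters. The sketch below mirrors that idea with a hypothetical helper (md5_of_answer) and a temporary file; it is an illustration of the mechanism, not the grader's code.

```python
import hashlib
import os
import pickle
import tempfile

def md5_of_answer(obj):
    """Hash the string form of an answer (hypothetical helper for this sketch)."""
    return hashlib.md5(str(obj).encode()).hexdigest()

# Write the digest of a sample answer to a temporary pickle file.
path = os.path.join(tempfile.mkdtemp(), "q_demo.pickle")
with open(path, "wb") as f:
    pickle.dump(md5_of_answer(12.2), f, pickle.HIGHEST_PROTOCOL)

# Read it back and compare against other representations of the answer.
with open(path, "rb") as f:
    stored = pickle.load(f)

same_float = stored == md5_of_answer(12.20)        # str(12.20) is "12.2"
string_differs = stored != md5_of_answer("12.20")  # the string keeps the zero

print(stored, same_float, string_differs)
```

Under this scheme a float answer and its string form can hash differently, which is one reason to keep each answer in the type the question asks for.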
