Advanced Python
INDIA-1
ENSAM RABAT
[ ]: # time taken to add 5 to every element of a Python list
# (the timed list operation and the sizes below are assumed; they mirror the numpy cell that follows)
from time import process_time
import numpy as np

python_list = [i for i in range(10000)]
start_time = process_time()
python_list = [i + 5 for i in python_list]
end_time = process_time()
print(end_time - start_time)
0.002204186999999802
[ ]: # the same operation on a numpy array
np_array = np.arange(10000)
start_time = process_time()
np_array += 5
end_time = process_time()
print(end_time - start_time)
0.00042786099999991833
Numpy Arrays
[ ]: # list
list1 = [1,2,3,4,5]
print(list1)
type(list1)
[1, 2, 3, 4, 5]
[ ]: list
[ ]: np_array = np.array([1,2,3,4,5])
print(np_array)
type(np_array)
[1 2 3 4 5]
[ ]: numpy.ndarray
[ ]: # creating a 1-dim array
a = np.array([1,2,3,4])
print(a)
[1 2 3 4]
[ ]: a.shape
[ ]: (4,)
[ ]: # creating a 2-dim array
b = np.array([[1,2,3,4],[5,6,7,8]])
print(b)
[[1 2 3 4]
 [5 6 7 8]]
[ ]: b.shape
[ ]: (2, 4)
[ ]: # specifying the data type of the elements
c = np.array([[1,2,3,4],[5,6,7,8]], dtype=float)
print(c)
[[1. 2. 3. 4.]
 [5. 6. 7. 8.]]
Initial Placeholders in numpy arrays
Initial placeholders are the initial values that a newly created numpy array is filled with.
[ ]: # create a numpy array of Zeros
x = np.zeros((4,5))
print(x)
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
[ ]: # create a numpy array of Ones
y = np.ones((3,3))
print(y)
[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
[ ]: # create an array filled with a particular value
z = np.full((5,4), 5)
print(z)
[[5 5 5 5]
 [5 5 5 5]
 [5 5 5 5]
 [5 5 5 5]
 [5 5 5 5]]
[ ]: # create an identity matrix
i = np.eye(5)
print(i)
[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
[ ]: # create an array of random integer values within a specific range
# (values are random and will differ on each run; the bounds here are illustrative)
r = np.random.randint(10, 100, (3,5))
print(r)
[[94 97 81 94 41]
 [86 14 96 50 77]
 [61 87 91 55 93]]
[ ]: # array of evenly spaced values --> specifying the number of values required
d = np.linspace(10, 30, 5)
print(d)
[10. 15. 20. 25. 30.]
[ ]: # array of evenly spaced values --> specifying the step
e = np.arange(10, 30, 5)
print(e)
[10 15 20 25]
[ ]: # converting a list to a numpy array
list2 = [10, 20, 20, 20, 50]
np_array = np.asarray(list2)
print(np_array)
type(np_array)
[10 20 20 20 50]
[ ]: numpy.ndarray
[ ]: # 5x5 array of random integers (bounds are illustrative; values will differ on each run)
c = np.random.randint(10, 90, (5,5))
print(c)
[[23 10 82 52 67]
 [28 24 63 47 58]
 [85 12 25 33 52]
 [57 86 84 71 16]
 [34 41 14 78 66]]
[ ]: # shape of the array
print(c.shape)
(5, 5)
[ ]: # number of dimensions
print(c.ndim)
2
[ ]: # number of elements in an array
print(c.size)
25
[ ]: # data type of the values in the array
print(c.dtype)
int64
Mathematical operations on a numpy array
[ ]: list1 = [1,2,3,4,5]
list2 = [6,7,8,9,10]
print(list1 + list2)   # "+" on Python lists concatenates them, it does not add element-wise
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[ ]: a = np.random.randint(0,10,(3,3))
b = np.random.randint(10,20,(3,3))
[ ]: print(a)
print(b)
[[2 3 4]
[1 0 1]
[9 7 3]]
[[10 13 10]
[16 19 10]
[10 19 17]]
[ ]: print(a+b)
print(a-b)
print(a*b)
print(a/b)
[[12 16 14]
[17 19 11]
[19 26 20]]
[[ -8 -10 -6]
[-15 -19 -9]
[ -1 -12 -14]]
[[ 20 39 40]
[ 16 0 10]
[ 90 133 51]]
[[0.2 0.23076923 0.4 ]
[0.0625 0. 0.1 ]
[0.9 0.36842105 0.17647059]]
[ ]: a = np.random.randint(0,10,(3,3))
b = np.random.randint(10,20,(3,3))
[ ]: print(a)
print(b)
[[1 1 3]
[6 4 3]
[1 5 7]]
[[13 10 18]
[14 11 13]
[11 16 16]]
[ ]: print(np.add(a,b))
print(np.subtract(a,b))
print(np.multiply(a,b))
print(np.divide(a,b))
[[14 11 21]
[20 15 16]
[12 21 23]]
[[-12 -9 -15]
[ -8 -7 -10]
[-10 -11 -9]]
[[ 13 10 54]
[ 84 44 39]
[ 11 80 112]]
[[0.07692308 0.1 0.16666667]
[0.42857143 0.36363636 0.23076923]
[0.09090909 0.3125 0.4375 ]]
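Note that * and np.multiply are element-wise; the matrix product is a different operation. A brief sketch (not part of the original notebook) reusing the same a and b arrays from the cell above:
[ ]: # matrix product (rows of a combined with columns of b), as opposed to element-wise a*b
print(np.matmul(a, b))
print(a @ b)   # the @ operator is equivalent to np.matmul for 2-D arrays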
Array Manipulation
[ ]: array = np.random.randint(0,10,(2,3))
print(array)
print(array.shape)
[[9 2 4]
[0 2 0]]
(2, 3)
[ ]: # transpose
trans = np.transpose(array)
print(trans)
print(trans.shape)
[[9 0]
 [2 2]
 [4 0]]
(3, 2)
[ ]: array = np.random.randint(0,10,(2,3))
print(array)
print(array.shape)
[[6 5 9]
[8 2 3]]
(2, 3)
[ ]: # another way to transpose, using the .T attribute (np.transpose works equally well)
trans2 = array.T
print(trans2)
print(trans2.shape)
[[6 8]
 [5 2]
 [9 3]]
(3, 2)
[ ]: # reshaping an array
a = np.random.randint(0,10,(2,3))
print(a)
print(a.shape)
[[8 2 2]
[4 6 4]]
(2, 3)
[ ]: b = a.reshape(3,2)
print(b)
print(b.shape)
[[8 2]
[2 4]
[6 4]]
(3, 2)
[ ]: c = a.reshape(6)
print(c)
print(c.shape)
[8 2 2 4 6 4]
(6,)
[ ]: d = a.reshape((1,2,3))
print(d)
print(d.shape)
[[[8 2 2]
[4 6 4]]]
(1, 2, 3)
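reshape can also infer one dimension automatically when it is given as -1; a short sketch (not from the original notebook) using the same 2x3 array a:
[ ]: # -1 lets numpy work out that dimension from the total number of elements
e = a.reshape(3, -1)
print(e.shape)   # (3, 2) for a 2x3 array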
pandas-tutorial
Pandas Library:
Useful for Data Processing & Analysis
Pandas Data Frame:
A Pandas DataFrame is a two-dimensional tabular data structure with labeled axes (rows and columns).
[ ]: # loading the California Housing dataset from scikit-learn
import pandas as pd
from sklearn.datasets import fetch_california_housing
california_dataset = fetch_california_housing()
[ ]: type(california_dataset)
# Bunch is like a dictionary object, it contains a lot of data
[ ]: sklearn.utils._bunch.Bunch
[ ]: print(california_dataset)
39.37 , -121.24 ]]), 'target': array([4.526, 3.585, 3.521,
…, 0.923, 0.847, 0.894]), 'frame': None, 'target_names': ['MedHouseVal'],
'feature_names': ['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
'AveOccup', 'Latitude', 'Longitude'], 'DESCR': '..
_california_housing_dataset:\n\nCalifornia Housing
dataset\n--------------------------\n\n**Data Set Characteristics:**\n\n
:Number of Instances: 20640\n\n :Number of Attributes: 8 numeric, predictive
attributes and the target\n\n :Attribute Information:\n - MedInc
median income in block group\n - HouseAge median house age in block
group\n - AveRooms average number of rooms per household\n -
AveBedrms average number of bedrooms per household\n - Population
block group population\n - AveOccup average number of household
members\n - Latitude block group latitude\n - Longitude
block group longitude\n\n :Missing Attribute Values: None\n\nThis dataset was
obtained from the StatLib
repository.\nhttps://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html\n\nThe
target variable is the median house value for California districts,\nexpressed
in hundreds of thousands of dollars ($100,000).\n\nThis dataset was derived from
the 1990 U.S. census, using one row per census\nblock group. A block group is
the smallest geographical unit for which the U.S.\nCensus Bureau publishes
sample data (a block group typically has a population\nof 600 to 3,000
people).\n\nA household is a group of people residing within a home. Since the
average\nnumber of rooms and bedrooms in this dataset are provided per
household, these\ncolumns may take surprisingly large values for block groups
with few households\nand many empty houses, such as vacation resorts.\n\nIt can
be downloaded/loaded using
the\n:func:`sklearn.datasets.fetch_california_housing` function.\n\n.. topic::
References\n\n - Pace, R. Kelley and Ronald Barry, Sparse Spatial
Autoregressions,\n Statistics and Probability Letters, 33 (1997)
291-297\n'}
This raw format is not convenient for analysis, and this is where pandas comes into play: it lets us load the data into a more structured, tabular form.
[ ]: # pandas DataFrame
# we are creating a pandas DataFrame and we need to pass it the data we want
california_df = pd.DataFrame(california_dataset.data, columns=california_dataset.feature_names)
[ ]: california_df.head()   # head() prints the first five rows of the data frame
     MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
0    8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23
1    8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22
2    7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24
3    5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
4    3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25
[ ]: california_df.shape
[ ]: (20640, 8)
[ ]: type(california_df)
[ ]: pandas.core.frame.DataFrame
[ ]: # loading a csv file into a pandas DataFrame (the file name here is illustrative)
diabetes_df = pd.read_csv('diabetes.csv')
[ ]: type(diabetes_df)
[ ]: pandas.core.frame.DataFrame
[ ]: diabetes_df.head()
[ ]: diabetes_df.shape
[ ]: (768, 9)
[ ]: # loading an Excel file into a pandas DataFrame (the file name here is illustrative)
Financial_sample_df = pd.read_excel('Financial Sample.xlsx')
[ ]: Financial_sample_df.head()
[ ]: Financial_sample_df.shape
[ ]: (700, 16)
[ ]: california_df.to_excel('california.xlsx')
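Alongside the Excel export above, the same DataFrame can be written to and read back from CSV; a minimal sketch (the file name is an assumption):
[ ]: # export to CSV, then read it back into a new DataFrame
california_df.to_csv('california.csv', index=False)
reloaded_df = pd.read_csv('california.csv')
reloaded_df.shape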
[ ]: # DataFrame filled with random values (values will differ on each run)
import numpy as np
random_df = pd.DataFrame(np.random.rand(20, 10))
[ ]: random_df.head()
[ ]: 0 1 2 3 4 5 6 \
0 0.334997 0.461948 0.798143 0.160828 0.469857 0.132035 0.973342
1 0.817427 0.134303 0.191498 0.020126 0.157262 0.308749 0.746255
2 0.786123 0.290734 0.773516 0.260323 0.970542 0.940605 0.751676
3 0.801180 0.993138 0.562503 0.524121 0.192244 0.506380 0.472183
4 0.859077 0.762377 0.853730 0.414529 0.000119 0.329558 0.166290
7 8 9
0 0.219995 0.408478 0.070123
1 0.649148 0.900201 0.726858
2 0.981982 0.536330 0.388127
3 0.234543 0.348499 0.024407
4 0.397130 0.356937 0.405396
[ ]: random_df.shape
[ ]: (20, 10)
Inspecting a DataFrame
[ ]: #finding the number of rows & columns
california_df.shape
[ ]: (20640, 8)
[ ]: # head() returns 5 rows by default, but you can pass the number of rows you want as a parameter
california_df.head(30)
[ ]: MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85
5 4.0368 52.0 4.761658 1.103627 413.0 2.139896 37.85
6 3.6591 52.0 4.931907 0.951362 1094.0 2.128405 37.84
7 3.1200 52.0 4.797527 1.061824 1157.0 1.788253 37.84
8 2.0804 42.0 4.294118 1.117647 1206.0 2.026891 37.84
9 3.6912 52.0 4.970588 0.990196 1551.0 2.172269 37.84
10 3.2031 52.0 5.477612 1.079602 910.0 2.263682 37.85
11 3.2705 52.0 4.772480 1.024523 1504.0 2.049046 37.85
12 3.0750 52.0 5.322650 1.012821 1098.0 2.346154 37.85
13 2.6736 52.0 4.000000 1.097701 345.0 1.982759 37.84
14 1.9167 52.0 4.262903 1.009677 1212.0 1.954839 37.85
15 2.1250 50.0 4.242424 1.071970 697.0 2.640152 37.85
16 2.7750 52.0 5.939577 1.048338 793.0 2.395770 37.85
17 2.1202 52.0 4.052805 0.966997 648.0 2.138614 37.85
18 1.9911 50.0 5.343675 1.085919 990.0 2.362768 37.84
19 2.6033 52.0 5.465455 1.083636 690.0 2.509091 37.84
20 1.3578 40.0 4.524096 1.108434 409.0 2.463855 37.85
21 1.7135 42.0 4.478142 1.002732 929.0 2.538251 37.85
22 1.7250 52.0 5.096234 1.131799 1015.0 2.123431 37.84
23 2.1806 52.0 5.193846 1.036923 853.0 2.624615 37.84
24 2.6000 52.0 5.270142 1.035545 1006.0 2.383886 37.84
25 2.4038 41.0 4.495798 1.033613 317.0 2.663866 37.85
26 2.4597 49.0 4.728033 1.020921 607.0 2.539749 37.85
27 1.8080 52.0 4.780856 1.060453 1102.0 2.775819 37.85
28 1.6424 50.0 4.401691 1.040169 1131.0 2.391121 37.84
29 1.6875 52.0 4.703226 1.032258 395.0 2.548387 37.84
Longitude
0 -122.23
1 -122.22
2 -122.24
3 -122.25
4 -122.25
5 -122.25
6 -122.25
7 -122.25
8 -122.26
9 -122.25
10 -122.26
11 -122.26
12 -122.26
13 -122.26
14 -122.26
15 -122.26
16 -122.27
17 -122.27
18 -122.26
19 -122.27
20 -122.27
21 -122.27
22 -122.27
23 -122.27
24 -122.27
25 -122.28
26 -122.28
27 -122.28
28 -122.28
29 -122.28
[ ]: # tail() returns the last 5 rows by default, but you can pass the number of rows you want as a parameter
california_df.tail(10)
20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43
20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37
Longitude
20630 -121.32
20631 -121.40
20632 -121.45
20633 -121.53
20634 -121.56
20635 -121.09
20636 -121.21
20637 -121.22
20638 -121.32
20639 -121.24
[ ]: # getting some information about the DataFrame
california_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MedInc 20640 non-null float64
1 HouseAge 20640 non-null float64
2 AveRooms 20640 non-null float64
3 AveBedrms 20640 non-null float64
4 Population 20640 non-null float64
5 AveOccup 20640 non-null float64
6 Latitude 20640 non-null float64
7 Longitude 20640 non-null float64
dtypes: float64(8)
memory usage: 1.3 MB
[ ]: # checking for missing values
# isnull() returns a boolean mask where True indicates missing values and False indicates non-missing values
california_df.isnull()
      MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude
… … … … … … … …
20635 False False False False False False False
20636 False False False False False False False
20637 False False False False False False False
20638 False False False False False False False
20639 False False False False False False False
Longitude
0 False
1 False
2 False
3 False
4 False
… …
20635 False
20636 False
20637 False
20638 False
20639 False
[ ]: # counting the number of missing values in each column
california_df.isnull().sum()
[ ]: MedInc 0
HouseAge 0
AveRooms 0
AveBedrms 0
Population 0
AveOccup 0
Latitude 0
Longitude 0
dtype: int64
[ ]: # diabetes dataframe
diabetes_df.head()
0 0.627 50 1
1 0.351 31 0
2 0.672 32 1
3 0.167 21 0
4 2.288 33 1
[ ]: # in the Outcome column, "1" represents a diabetic person and "0" represents a non-diabetic person
[ ]: # distribution of values in the Outcome column
diabetes_df['Outcome'].value_counts()
[ ]: Outcome
0 500
1 268
dtype: int64
Statistical Measures
[ ]: # count or number of values
california_df.count()   # count() gives us the number of values in each column
[ ]: MedInc 20640
HouseAge 20640
AveRooms 20640
AveBedrms 20640
Population 20640
AveOccup 20640
Latitude 20640
Longitude 20640
dtype: int64
[ ]: # mean value of each column
california_df.mean()
[ ]: MedInc 3.870671
HouseAge 28.639486
AveRooms 5.429000
AveBedrms 1.096675
Population 1425.476744
AveOccup 3.070655
Latitude 35.631861
Longitude -119.569704
dtype: float64
[ ]: # standard deviation of each column
california_df.std()
[ ]: MedInc 1.899822
HouseAge 12.585558
AveRooms 2.474173
AveBedrms 0.473911
Population 1132.462122
AveOccup 10.386050
Latitude 2.135952
Longitude 2.003532
dtype: float64
[ ]: # minimum value
california_df.min()
[ ]: MedInc 0.499900
HouseAge 1.000000
AveRooms 0.846154
AveBedrms 0.333333
Population 3.000000
AveOccup 0.692308
Latitude 32.540000
Longitude -124.350000
dtype: float64
[ ]: # maximum value
california_df.max()
[ ]: MedInc 15.000100
HouseAge 52.000000
AveRooms 141.909091
AveBedrms 34.066667
Population 35682.000000
AveOccup 1243.333333
Latitude 41.950000
Longitude -114.310000
dtype: float64
The 25% row means that 25% of the values are below that value; the same logic applies to the 50% and 75% rows (these come from the describe() summary sketched below).
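The describe() call that produces those percentile rows is not shown above; a minimal sketch:
[ ]: # summary statistics: count, mean, std, min, 25%, 50%, 75% and max for each column
california_df.describe()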
Manipulating a DataFrame
[ ]: # adding a column to a dataframe
california_df['Price'] = california_dataset.target
[ ]: california_df.head()
Longitude Price
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
[ ]: # removing a row
california_df.drop(index=0, axis=0)   # removes the row with index 0; the change is temporary unless reassigned (or inplace=True)
Longitude Price
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
5 -122.25 2.697
… … …
20635 -121.09 0.781
20636 -121.21 0.771
20637 -121.22 0.923
20638 -121.32 0.847
20639 -121.24 0.894
[ ]: # drop a column
california_df.drop(columns='MedInc', axis=1)
20638 18.0 5.329513 1.171920 741.0 2.123209 39.43
20639 16.0 5.254717 1.162264 1387.0 2.616981 39.37
Longitude Price
0 -122.23 4.526
1 -122.22 3.585
2 -122.24 3.521
3 -122.25 3.413
4 -122.25 3.422
… … …
20635 -121.09 0.781
20636 -121.21 0.771
20637 -121.22 0.923
20638 -121.32 0.847
20639 -121.24 0.894
[ ]: # locating a particular row by its index (iloc-based selection assumed)
california_df.iloc[2]
[ ]: MedInc 7.257400
HouseAge 52.000000
AveRooms 8.288136
AveBedrms 1.073446
Population 496.000000
AveOccup 2.802260
Latitude 37.850000
Longitude -122.240000
Price 3.521000
Name: 2, dtype: float64
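The next few outputs are single columns selected from the DataFrame. The exact accessor used in the original cells is not shown; a minimal sketch of the two usual ways to select a column:
[ ]: # selecting a single column, by label or by position
california_df['MedInc']     # by column name
california_df.iloc[:, 0]    # by position (here the first column, MedInc)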
0 8.3252
1 8.3014
2 7.2574
3 5.6431
4 3.8462
…
20635 1.5603
20636 2.5568
20637 1.7000
20638 1.8672
20639 2.3886
Name: MedInc, Length: 20640, dtype: float64
0 41.0
1 21.0
2 52.0
3 52.0
4 52.0
…
20635 25.0
20636 18.0
20637 17.0
20638 18.0
20639 16.0
Name: HouseAge, Length: 20640, dtype: float64
0 37.88
1 37.86
2 37.85
3 37.85
4 37.85
…
20635 39.48
20636 39.49
20637 39.43
20638 39.43
20639 39.37
Name: Latitude, Length: 20640, dtype: float64
0 4.526
1 3.585
2 3.521
3 3.413
4 3.422
…
20635 0.781
20636 0.771
20637 0.923
20638 0.847
20639 0.894
Name: Price, Length: 20640, dtype: float64
Correlation is a statistical measure of the strength and direction of the relationship between two variables:
1. Positive correlation: as one variable increases, the other variable also tends to increase.
2. Negative correlation: as one variable increases, the other variable tends to decrease, and vice versa.
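Before computing the full matrix in the next cell, a single pair of columns can be correlated directly; a quick sketch using the MedInc and Price columns added earlier:
[ ]: # correlation between median income and price alone
california_df['MedInc'].corr(california_df['Price'])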
[ ]: california_df.corr()
# - every column is compared with every other column
# - a negative value means the pair is negatively correlated
# - a positive value means the pair is positively correlated
matplotlib-tutorial
Matplotlib:
• Useful for making Plots
[ ]: # importing the libraries
import numpy as np
import matplotlib.pyplot as plt
[ ]: x = np.linspace(0,10,100)
y = np.sin(x)
z = np.cos(x)
[ ]: # cos wave
plt.plot(x,z)
plt.show()
[ ]: # parabola
x = np.linspace(-10,10,20)
y = x**2
plt.plot(x,y)
plt.show()
[ ]: plt.plot(x, y, 'r+')
plt.show()
[ ]: plt.plot(x, y, 'g.')
plt.show()
[ ]: plt.plot(x, y, 'rx')
plt.show()
[ ]: x = np.linspace(-5,5,50)
plt.plot(x, np.sin(x), 'g-')
plt.plot(x, np.cos(x), 'r--')
plt.show()
Bar Plot: provides a clear and concise way to compare categorical data. Its simplicity allows for easy interpretation, making it accessible to a wide audience, even those without extensive statistical knowledge.
[ ]: fig = plt.figure()   # initializes a new figure object; a figure is the entire window or page the plot is drawn on
ax = fig.add_axes([0,0,1,1])   # the axes span the entire figure, from left (0) to right (1) and from bottom (0) to top (1)
languages = ['English','French','Spanish','Latin','German']
people = [100, 50, 150, 40, 70]
ax.bar(languages, people)   # creates a bar plot on the axes ax, with languages on the x-axis and people as the bar heights
plt.show()
Pie Chart: very useful for seeing how the data is distributed across an entire dataset.
[ ]: fig1 = plt.figure()
ax = fig1.add_axes([0,0,1,1])
languages = ['English','French','Spanish','Latin','German']
people = [100, 50, 150, 40, 70]
ax.pie(people, labels=languages, autopct='%1.1f%%')   # creates a pie chart; autopct='%1.1f%%' formats the percentage shown on each slice
plt.show()
Scatter Plot
[ ]: x = np.linspace(0,10,30)
y = np.sin(x)
z = np.cos(x)
fig2 = plt.figure()
ax = fig2.add_axes([0,0,1,1])
ax.scatter(x,y,color='g')
ax.scatter(x,z,color='b')
plt.show()
In a scatter plot the data points are plotted individually, with no line joining them; this is very useful in clustering applications.
3D Scatter Plot
[ ]: fig3 = plt.figure()
ax = plt.axes(projection='3d')   # creates a 3D axes object; projection='3d' indicates that the plot will be in 3D space
z = 20 * np.random.random(100)
x = np.sin(z)
y = np.cos(z)
ax.scatter(x, y, z, c=z, cmap='Blues')
# creates a scatter plot in 3D space from the x, y and z coordinates of each point;
# c=z colors each point according to its z-coordinate, and cmap='Blues' sets the colormap used to map those values to colors
plt.show()
seaborn-tutorial
[ ]: # importing seaborn and loading the built-in tips dataset
import seaborn as sns
tips = sns.load_dataset('tips')
[ ]: tips.head()
'''
This code uses Seaborn's relplot() function to create a relational plot based on the provided data:
3. col='time': the plots are organized into columns based on the values in the time column of the dataset.
4. hue='smoker': colors are assigned to the data points based on the values in the smoker column.
5. style='smoker': the marker style of the points is determined by the smoker column.
6. size='size': the marker size is based on the values in the size column of the dataset.
(a sketch of the call itself follows this comment)
'''
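The relplot call these comments describe is not shown above; a minimal sketch, assuming the usual total_bill and tip columns of the tips dataset for the x and y axes:
[ ]: sns.relplot(data=tips, x='total_bill', y='tip',
            col='time', hue='smoker', style='smoker', size='size')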
And this is the advantage of seaborn over matplotlib: with matplotlib you have to specify all of these steps manually, whereas seaborn detects the groupings automatically and plots them for you. A small sketch of the contrast follows.
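A minimal sketch of the contrast (the DataFrame df and its columns are illustrative, not from the notebook): the same grouped scatter done by hand in matplotlib versus a single seaborn call.
[ ]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                   'y': [2, 1, 4, 3, 6, 5],
                   'group': ['a', 'a', 'b', 'b', 'c', 'c']})

# matplotlib: one plot call (and legend entry) per group, managed by hand
for name, sub in df.groupby('group'):
    plt.scatter(sub['x'], sub['y'], label=name)
plt.legend()
plt.show()

# seaborn: grouping, colors and legend handled automatically via hue
sns.scatterplot(x='x', y='y', hue='group', data=df)
plt.show()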
[ ]: # load the iris dataset
iris = sns.load_dataset('iris')
[ ]: iris.head()
'''
There are three species in the iris dataset: iris setosa, iris virginica and iris versicolor.
The idea is to predict which species a particular iris flower belongs to,
based on its sepal_length, sepal_width, petal_length and petal_width.
This is the problem statement for this dataset.
'''
Scatter Plot
[ ]: sns.scatterplot(x='sepal_length',y='petal_length',hue='species',data=iris)
[ ]: sns.scatterplot(x='sepal_length',y='petal_width',hue='species',data=iris)
[ ]: # loading the titanic dataset
titanic = sns.load_dataset('titanic')
'''
The idea behind this dataset is to predict whether a person survived the Titanic
based on the other columns in the table (sex, class, and so on).
These models are trained on historical data with known outcomes and then used to make predictions on new, unseen data.
'''
[ ]: titanic.head()
who adult_male deck embark_town alive alone
0 man True NaN Southampton no False
1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
4 man True NaN Southampton no True
[ ]: titanic.shape
[ ]: (891, 15)
Count Plot
[ ]: sns.countplot(x='class',data=titanic)
[ ]: sns.countplot(x='survived',data=titanic)
Bar Chart
[ ]: sns.barplot(x='sex',y='survived',hue='class',data=titanic)
[ ]: # house price dataset
import pandas as pd
from sklearn.datasets import fetch_california_housing
house_california = fetch_california_housing()
house = pd.DataFrame(house_california.data, columns=house_california.feature_names)
house['PRICE'] = house_california.target
[ ]: print(house_california)
(output: the same Bunch contents already shown in the pandas tutorial above: the data array, target, feature_names and the DESCR text)
[ ]: house.head()
   Longitude  PRICE
0    -122.23  4.526
1    -122.22  3.585
2    -122.24  3.521
3    -122.25  3.413
4    -122.25  3.422
Distribution Plot
[ ]: sns.displot(house['PRICE'])
[ ]: <seaborn.axisgrid.FacetGrid at 0x79892e092410>
<ipython-input-20-2d26162c18b9>:1: UserWarning:
Please adapt your code to use either `displot` (a figure-level function with
similar flexibility) or `histplot` (an axes-level function for histograms).
For a guide to updating your code to use the new functions, please see
https://gist.github.com/mwaskom/de44147ed2974457ad6372750bbe5751
[ ]: correlation = house.corr()
'''
This code uses Seaborn's heatmap() function to create a heatmap visualization of a correlation matrix:
1. correlation: the correlation matrix that will be visualized as a heatmap.
2. cbar=True: whether to include a color bar (color legend) alongside the heatmap.
3. square=True: sets the aspect ratio so that each cell of the heatmap is square.
4. fmt='.1f': the format of the values displayed on the heatmap.
5. annot=True: when set to True, each cell displays the corresponding value from the correlation matrix.
7. cmap='Blues': the colormap used to color the heatmap; 'Blues' ranges from light to dark blue,
   with darker shades indicating higher values and lighter shades indicating lower values.
(a sketch of the call itself follows this comment)
'''
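A sketch of the heatmap call these comments describe; annot=True is inferred from the note about cells displaying values, and any other settings (such as figure size) are left out:
[ ]: # heatmap of the correlation matrix computed above
sns.heatmap(correlation, cbar=True, square=True, fmt='.1f', annot=True, cmap='Blues')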
[ ]: <Axes: >
The correlation matrix is very important because it tells us which columns matter for our prediction and which do not.