
Cia Code


cia-code

April 23, 2024

Importing data
[32]: import pandas as pd
df1 = pd.read_csv('C:\\Users\\lariy\\Downloads\\price prediction.csv')
df = pd.DataFrame(df1)
df

[32]:
                        address  bathrooms  bedrooms  finishedsqft  \
0              2243 Franklin St        2.0         2        1463.0
1        2002 Pacific Ave APT 4        3.5         3        3291.0
2    1945 Washington St APT 411        1.0         1         653.0
3      1896 Pacific Ave APT 802        2.5         2        2272.0
4    1840 Washington St APT 603        1.0         1         837.0
..                          ...        ...       ...           ...
434     2170 Vallejo St APT 101        2.0         3        2145.0
435             2380 Vallejo St        3.5         4        3042.0
436             2430 Vallejo St        7.5         6        4721.0
437               1859 Green St        1.0         2        1306.0
438       2131 Vallejo St APT 3        1.0         1        1100.0

     lastsolddate  lastsoldprice   latitude    longitude     neighborhood  \
0      02-05-2016        1950000  37.795139  -122.425309  Pacific Heights
1       1/22/2016        4200000  37.794429  -122.428513  Pacific Heights
2      12/16/2015         665000  37.792472  -122.425281  Pacific Heights
3      12/17/2014        2735000  37.794706  -122.426347  Pacific Heights
4      12-02-2015        1050000  37.793212  -122.423744  Pacific Heights
..            ...            ...        ...          ...              ...
434    11/14/2012        1650000  37.795777  -122.433024  Pacific Heights
435    10-01-2012        3195000  37.795330  -122.436540  Pacific Heights
436     9/24/2012        7350000  37.795246  -122.437490  Pacific Heights
437    10/18/2011        1349000  37.796588  -122.429641  Pacific Heights
438    03-04-2016        1250000  37.795255  -122.432880  Pacific Heights

     totalrooms       usecode  yearbuilt  zipcode
0             7   Condominium       1900    94109
1             7   Condominium       1961    94109
2             3   Condominium       1987    94109
3             6   Condominium       1924    94109
4             3   Condominium       2012    94109
..          ...           ...        ...      ...
434           8   Condominium       1914    94123
435          10  SingleFamily       1908    94123
436          13  SingleFamily       1905    94123
437           5   Condominium       1900    94123
438           5   Condominium       1900    94123

[439 rows x 13 columns]

0.1 Data Preprocessing


1. Dropping irrelevant variables/columns from the dataset that add no intrinsic value to it.

[33]: columns_to_drop = ['address', 'lastsolddate', 'neighborhood', 'usecode',
                   'yearbuilt', 'zipcode']
df.drop(columns=columns_to_drop, inplace=True)

[34]: df.columns

[34]: Index(['bathrooms', 'bedrooms', 'finishedsqft', 'lastsoldprice', 'latitude',
       'longitude', 'totalrooms'],
      dtype='object')

[35]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 bathrooms 439 non-null float64
1 bedrooms 439 non-null int64
2 finishedsqft 438 non-null float64
3 lastsoldprice 439 non-null int64
4 latitude 439 non-null float64
5 longitude 437 non-null float64
6 totalrooms 439 non-null int64
dtypes: float64(4), int64(3)
memory usage: 24.1 KB

2. Detecting and plotting missing values
[36]: missing_values = df.isnull().sum()
print("Missing values in the data:",missing_values)

Missing values in the data: bathrooms        0
bedrooms         0
finishedsqft     1
lastsoldprice    0
latitude         0
longitude        2
totalrooms       0
dtype: int64

[37]: import matplotlib.pyplot as plt

# Calculate the number of missing values in each column


missing_values = df.isnull().sum()

# Plot the missing values


plt.figure(figsize=(10, 6))
missing_values.plot(kind='bar')
plt.title('Missing Values in Each Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)
plt.show()

3. Handling missing values

[38]: missing_values = df.isnull().sum()
print("Missing values before handling:")
print(missing_values)

Missing values before handling:


bathrooms 0
bedrooms 0
finishedsqft 1
lastsoldprice 0
latitude 0
longitude 2
totalrooms 0
dtype: int64

[39]: # Impute missing values with the median of each column


from sklearn.impute import SimpleImputer

numerical_features = df.select_dtypes(include=['float64', 'int64']).columns


imputer = SimpleImputer(strategy='median')
df[numerical_features] = imputer.fit_transform(df[numerical_features])

[40]: missing_values_after = df.isnull().sum()


print("\nMissing values after handling:")
print(missing_values_after)

Missing values after handling:


bathrooms 0
bedrooms 0
finishedsqft 0
lastsoldprice 0
latitude 0
longitude 0
totalrooms 0
dtype: int64

[22]: #no missing values
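
The imputation cell above uses `strategy='median'`. As a minimal sketch of what that fill does, the same result can be reproduced with pandas alone. The small `demo` frame is made up for illustration and only borrows two column names from the dataset; its values are not taken from the CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not from the real data.
demo = pd.DataFrame({
    "finishedsqft": [1463.0, 3291.0, np.nan, 2272.0, 837.0],
    "longitude": [-122.425, np.nan, -122.425, -122.426, np.nan],
})

# Column medians skip NaN by default and are then used to fill the gaps.
medians = demo.median()
filled = demo.fillna(medians)

print(medians)
print(filled.isnull().sum())
```

The median is less sensitive than the mean to a few very large houses, which may be why it was chosen here; for `finishedsqft` in this toy frame the two differ (1867.5 versus 1965.75).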


4. Detecting Outliers
[41]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore

# Load the dataset (reloading restores the dropped columns and the missing values)
df = pd.read_csv('C:\\Users\\lariy\\Downloads\\price prediction.csv')

# Select numerical columns for outlier detection
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

# Create box plots for numerical columns (the 2 x 3 grid only has room for 6 of them)
plt.figure(figsize=(12, 6))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(data=df[column])
    plt.title(column)
plt.tight_layout()
plt.show()

# Identify outliers using z-score
z_scores = zscore(df[numerical_columns])
outliers = (z_scores > 3) | (z_scores < -3)

# Plot outliers using scatter plot
plt.figure(figsize=(12, 6))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(2, 3, i)
    plt.scatter(df.index, df[column], c=outliers[:, i-1], cmap='coolwarm', alpha=0.5)
    plt.title(column)
    plt.xlabel('Index')
    plt.ylabel(column)
plt.tight_layout()
plt.show()

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[41], line 15
13 plt.figure(figsize=(12, 6))
14 for i, column in enumerate(numerical_columns, 1):
---> 15 plt.subplot(2, 3, i)
16 sns.boxplot(data=df[column])
17 plt.title(column)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\matplotlib\pyplot.py:1425, in subplot(*args, **kwargs)
1422 fig = gcf()
1424 # First, search for an existing subplot with a matching spec.
-> 1425 key = SubplotSpec._from_subplot_args(fig, args)
1427 for ax in fig.axes:
1428     # If we found an Axes at the position, we can re-use it if the user passed no
1429     # kwargs or if the axes class and kwargs are identical.
1430     if (ax.get_subplotspec() == key
1431         and (kwargs == {}
1432              or (ax._projection_init
1433                  == fig._process_projection_requirements(**kwargs)))):

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\matplotlib\gridspec.py:599, in SubplotSpec._from_subplot_args(figure, args)

597 else:
598 if not isinstance(num, Integral) or num < 1 or num > rows*cols:
--> 599 raise ValueError(
600 f"num must be an integer with 1 <= num <= {rows*cols}, "
601 f"not {num!r}"
602 )
603 i = j = num
604 return gs[i-1:j]

ValueError: num must be an integer with 1 <= num <= 6, not 7
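
The traceback above arises because the frame was reloaded from the CSV, so it has more numeric columns than the six slots of a 2 x 3 grid. One way around this, sketched below, is to derive the grid size from the number of columns. The helper `grid_shape` and the random `demo` frame are hypothetical; the nine column names assume that `yearbuilt` and `zipcode` are read as integers, which the 13-column display suggests but does not prove.

```python
import math

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def grid_shape(n_plots, ncols=3):
    """Return (rows, cols) with enough slots for n_plots panels."""
    return math.ceil(n_plots / ncols), ncols


# Hypothetical stand-in for the reloaded frame: nine numeric columns of noise.
rng = np.random.default_rng(0)
names = ["bathrooms", "bedrooms", "finishedsqft", "lastsoldprice", "latitude",
         "longitude", "totalrooms", "yearbuilt", "zipcode"]
demo = pd.DataFrame(rng.normal(size=(50, len(names))), columns=names)

rows, cols = grid_shape(len(demo.columns))
fig, axes = plt.subplots(rows, cols, figsize=(12, 3 * rows))
for ax, column in zip(axes.ravel(), demo.columns):
    ax.boxplot(demo[column].dropna().to_numpy())
    ax.set_title(column)

# Hide any slots left over in the last row.
for ax in axes.ravel()[len(demo.columns):]:
    ax.set_visible(False)
fig.tight_layout()
```

Alternatively, dropping the non-feature columns before plotting would keep the original 2 x 3 layout with one slot to spare.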

[42]: df_no_outliers = df[~outliers.any(axis=1)]
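
Because the frame was reloaded, `finishedsqft` and `longitude` again contain missing values, and `scipy.stats.zscore` propagates NaN by default, so every z-score in those columns becomes NaN and no outlier can be flagged there. The sketch below shows a NaN-aware variant in plain NumPy; the function name `zscore_outlier_mask` and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def zscore_outlier_mask(frame, threshold=3.0):
    """Boolean array marking cells whose |z-score| exceeds the threshold.

    Means and standard deviations ignore NaN; NaN cells are never flagged.
    """
    values = frame.to_numpy(dtype=float)
    mean = np.nanmean(values, axis=0)
    std = np.nanstd(values, axis=0)
    z = (values - mean) / std
    return np.abs(z) > threshold  # comparisons with NaN evaluate to False


# Hypothetical toy data: one extreme price and one missing square footage.
demo = pd.DataFrame({
    "price": [1.0] * 20 + [100.0],
    "sqft": [10.0] * 10 + [np.nan] + [12.0] * 10,
})

mask = zscore_outlier_mask(demo)
kept = demo[~mask.any(axis=1)]
print(len(demo), "rows before,", len(kept), "rows after")
```

Imputing first and then computing z-scores would be another reasonable order of operations.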

[43]: plt.figure(figsize=(12, 6))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(2, 3, i)
    sns.boxplot(data=df_no_outliers[column])
    plt.title(column)
plt.tight_layout()
plt.show()

---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[43], line 3
1 plt.figure(figsize=(12, 6))
2 for i, column in enumerate(numerical_columns, 1):
----> 3 plt.subplot(2, 3, i)
4 sns.boxplot(data=df_no_outliers[column])
5 plt.title(column)

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\matplotlib\pyplot.py:1425, in subplot(*args, **kwargs)
1422 fig = gcf()
1424 # First, search for an existing subplot with a matching spec.
-> 1425 key = SubplotSpec._from_subplot_args(fig, args)
1427 for ax in fig.axes:
1428     # If we found an Axes at the position, we can re-use it if the user passed no
1429     # kwargs or if the axes class and kwargs are identical.
1430     if (ax.get_subplotspec() == key
1431         and (kwargs == {}
1432              or (ax._projection_init
1433                  == fig._process_projection_requirements(**kwargs)))):

File ~\AppData\Local\Programs\Python\Python310\lib\site-packages\matplotlib\gridspec.py:599, in SubplotSpec._from_subplot_args(figure, args)

597 else:
598 if not isinstance(num, Integral) or num < 1 or num > rows*cols:
--> 599 raise ValueError(
600 f"num must be an integer with 1 <= num <= {rows*cols}, "
601 f"not {num!r}"
602 )
603 i = j = num
604 return gs[i-1:j]

ValueError: num must be an integer with 1 <= num <= 6, not 7

[29]: # the outliers have been reduced as far as possible

[44]: df.columns

[44]: Index(['address', 'bathrooms', 'bedrooms', 'finishedsqft', 'lastsolddate',
       'lastsoldprice', 'latitude', 'longitude', 'neighborhood', 'totalrooms',
       'usecode', 'yearbuilt', 'zipcode'],
      dtype='object')

[45]: columns_to_drop = ['address', 'lastsolddate', 'neighborhood', 'usecode',
                   'yearbuilt', 'zipcode']
df.drop(columns=columns_to_drop, inplace=True)
df

[45]:
     bathrooms  bedrooms  finishedsqft  lastsoldprice   latitude    longitude  \
0          2.0         2        1463.0        1950000  37.795139  -122.425309
1          3.5         3        3291.0        4200000  37.794429  -122.428513
2          1.0         1         653.0         665000  37.792472  -122.425281
3          2.5         2        2272.0        2735000  37.794706  -122.426347
4          1.0         1         837.0        1050000  37.793212  -122.423744
..         ...       ...           ...            ...        ...          ...
434        2.0         3        2145.0        1650000  37.795777  -122.433024
435        3.5         4        3042.0        3195000  37.795330  -122.436540
436        7.5         6        4721.0        7350000  37.795246  -122.437490
437        1.0         2        1306.0        1349000  37.796588  -122.429641
438        1.0         1        1100.0        1250000  37.795255  -122.432880

     totalrooms
0             7
1             7
2             3
3             6
4             3
..          ...
434           8
435          10
436          13
437           5
438           5

[439 rows x 7 columns]

[ ]: #CORRELATION ESTIMATION

[126]: import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Compute correlation matrix
corr_matrix = df.corr()

# Set custom color palette
colors = sns.color_palette("coolwarm", as_cmap=True)

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap=colors, fmt=".2f", linewidths=0.5,
            cbar=False)
plt.title('Correlation Matrix', fontsize=16, fontweight='bold')
plt.xticks(fontsize=12, rotation=45)
plt.yticks(fontsize=12)
plt.show()

0.2 PRINCIPAL COMPONENT ANALYSIS
1. Standardization
[52]: df_index = df.index

[53]: numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

[54]: import pandas as pd


from sklearn.preprocessing import StandardScaler

[76]: scaler = StandardScaler()
scaled_features = scaler.fit_transform(df)

df_scaled = pd.DataFrame(scaled_features)
df_scaled

[76]: 0 1 2 3 4 5 6
0 -0.285556 -0.399122 -0.403174 -0.211435 1.185443 1.148603 0.105818
1 0.698425 0.140000 0.680420 0.624953 0.874861 0.720970 0.105818
2 -0.941543 -0.938244 -0.883323 -0.689106 0.018791 1.152340 -0.909570
3 0.042437 -0.399122 0.076381 0.080372 0.996031 1.010062 -0.148029
4 -0.941543 -0.938244 -0.774252 -0.545990 0.342496 1.357481 -0.909570
.. … … … … … … …
434 -0.285556 0.140000 0.001099 -0.322953 1.464529 0.118894 0.359665
435 0.698425 0.679121 0.532819 0.251367 1.268994 -0.350380 0.867359
436 3.322374 1.757365 1.528090 1.795897 1.232249 -0.477175 1.628900
437 -0.941543 -0.399122 -0.496240 -0.434844 1.819293 0.570418 -0.401876
438 -0.941543 -0.938244 -0.618352 -0.471645 1.236186 0.138114 -0.401876

[439 rows x 7 columns]
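
Standardization rescales each column to zero mean and unit variance, z = (x - mean) / std. As a rough check of what `StandardScaler` computes, the sketch below applies the formula by hand with the population standard deviation (`ddof=0`), which is the convention `StandardScaler` uses. The two synthetic columns are assumptions chosen only to mimic very different scales.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a small-scale column and a price-like column.
x = np.column_stack([
    rng.normal(2.0, 1.5, 200),
    rng.normal(2_000_000, 2_500_000, 200),
])

# Column-wise standardization with the population standard deviation.
z = (x - x.mean(axis=0)) / x.std(axis=0)

print("column means:", z.mean(axis=0).round(6))
print("column stds: ", z.std(axis=0).round(6))
```

After this step both columns contribute on the same scale, whatever their original units.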

2. Covariance Matrix Computation


[90]: import numpy as np
import pandas as pd

# Fill missing values with the column means (df is the unscaled data here)
data_filled = df.fillna(df.mean())

# Compute the covariance matrix
covariance_matrix = np.cov(data_filled, rowvar=False)
column_names = ['bathrooms', 'bedrooms', 'finishedsqft', 'lastsoldprice',
                'latitude', 'longitude', 'totalrooms']
covariance_df = pd.DataFrame(covariance_matrix, columns=column_names,
                             index=column_names)

# Print the covariance matrix DataFrame
print("Covariance Matrix:")
print(covariance_df)

Covariance Matrix:
bathrooms bedrooms finishedsqft lastsoldprice \
bathrooms 2.329161e+00 2.017295e+00 2.231689e+03 3.157419e+06
bedrooms 2.017295e+00 3.448394e+00 2.356723e+03 3.033648e+06
finishedsqft 2.231689e+03 2.356723e+03 2.845895e+06 3.819981e+09
lastsoldprice 3.157419e+06 3.033648e+06 3.819981e+09 7.253364e+12
latitude 3.289555e-04 8.149072e-05 3.582618e-01 7.795914e+02
longitude -2.819931e-03 -3.999855e-03 -3.881395e+00 -6.205566e+03
totalrooms 4.699683e+00 5.724836e+00 5.793232e+03 7.124227e+06

latitude longitude totalrooms


bathrooms 0.000329 -0.002820 4.699683e+00
bedrooms 0.000081 -0.004000 5.724836e+00

finishedsqft 0.358262 -3.881395 5.793232e+03
lastsoldprice 779.591402 -6205.566010 7.124227e+06
latitude 0.000005 0.000009 2.110409e-04
longitude 0.000009 0.000056 -9.048211e-03
totalrooms 0.000211 -0.009048 1.555414e+01
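
In the matrix above the `lastsoldprice` variance (about 7.25e12) dwarfs every other entry, because the covariance was computed on unscaled data. A minimal sketch of the alternative: the covariance of standardized columns equals the correlation matrix, so no single unit dominates. The synthetic `rooms` and `price` columns are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a room count and a price that depends on it.
rooms = rng.integers(1, 10, size=300).astype(float)
price = 400_000 * rooms + rng.normal(0, 500_000, size=300)
x = np.column_stack([rooms, price])

# Covariance of the raw columns: the price variance is orders of magnitude larger.
cov_raw = np.cov(x, rowvar=False)

# Standardize with the sample standard deviation (ddof=1, matching np.cov).
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
cov_std = np.cov(z, rowvar=False)

print(cov_raw)
print(cov_std)
print(np.corrcoef(x, rowvar=False))
```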

[ ]:

3. Eigen Decomposition
[94]: import pandas as pd

# 'eigenvalues' and 'eigenvectors' come from a cell that is not shown in this export
# Convert eigenvalues and eigenvectors to DataFrame (eigenvectors are the columns)
eigen_df = pd.DataFrame({'Eigenvalue': eigenvalues})
eigen_df['Eigenvector'] = [eigenvectors[:, i] for i in range(eigenvectors.shape[1])]

# Print the DataFrame
print("Eigenvalues and Eigenvectors:")
print(eigen_df)

Eigenvalues and Eigenvectors:


Eigenvalue Eigenvector
0 7.253366e+12 [-4.3530400936506286e-07, -4.182401136422737e-…
1 8.341100e+05 [0.0006819708177894264, 0.0009100232661667038,…
2 3.909019e+00 [-0.07635453461451713, -0.34829146033302216, 0…
3 1.175348e+00 [0.15061016522339143, 0.9222261380364992, -7.0…
4 5.325269e-01 [0.9856396143372855, -0.16790163752836, -0.000…
5 5.091827e-05 [0.0008041469959149176, -0.0004697221463473827…
6 3.293921e-06 [4.783139021784667e-05, -8.824796320393624e-06…
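
The cell that produced `eigenvalues` and `eigenvectors` is not part of this export. As a sketch under that caveat, a symmetric covariance matrix can be decomposed with `np.linalg.eigh`; whether the notebook used this routine or another one is not known. The 2 x 2 matrix below is a made-up example.

```python
import numpy as np

# Hypothetical symmetric "covariance" matrix.
cov = np.array([[2.0, 1.0],
                [1.0, 2.0]])

# eigh is meant for symmetric matrices: it returns real eigenvalues in
# ascending order and orthonormal eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Each column v satisfies cov @ v == eigenvalue * v.
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    assert np.allclose(cov @ v, eigenvalues[i] * v)

print(eigenvalues)
print(eigenvectors)
```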

4. Rearranging eigenvectors by respective eigenvalues
[96]: import pandas as pd

# Assuming 'eigenvectors' contains the eigenvectors computed from the covariance matrix
# Create a DataFrame for eigenvectors
eigenvectors_df = pd.DataFrame(data=eigenvectors,
                               columns=[f'PC{i+1}' for i in range(eigenvectors.shape[1])])

# Print the DataFrame
print("Eigenvectors:")
print(eigenvectors_df)

Eigenvectors:
PC1 PC2 PC3 PC4 PC5 \
0 -4.353040e-07 6.819708e-04 -7.635453e-02 1.506102e-01 9.856396e-01
1 -4.182401e-07 9.100233e-04 -3.482915e-01 9.222261e-01 -1.679016e-01

2 -5.266495e-04 9.999962e-01 2.655423e-03 -7.046468e-05 -4.754296e-04
3 -9.999999e-01 -5.266507e-04 -3.019358e-07 -6.439164e-08 -9.080315e-08
4 -1.074800e-10 -6.271480e-08 1.190288e-04 -2.203851e-05 1.232789e-04
5 8.555430e-10 -7.352065e-07 4.166878e-04 -1.856730e-04 8.704261e-04
6 -9.821963e-07 2.447252e-03 -9.342675e-01 -3.561116e-01 -1.796084e-02

PC6 PC7
0 8.041470e-04 4.783139e-05
1 -4.697221e-04 -8.824796e-06
2 9.181873e-09 -1.982864e-07
3 -6.302828e-10 -2.212895e-10
4 -1.940234e-01 9.809969e-01
5 -9.809964e-01 -1.940235e-01
6 -3.528579e-04 3.782708e-05
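
The cell above labels the eigenvector columns but does not reorder them; the eigenvalues printed earlier already happen to be in descending order. When that is not guaranteed, an explicit sort along these lines could be used. This is a sketch on a made-up 3 x 3 matrix, and the names `order`, `sorted_values` and `sorted_vectors` are illustrative.

```python
import numpy as np

# Hypothetical symmetric matrix with eigenvalues 1, 3 and 5.
cov = np.array([[2.0, 1.0, 0.0],
                [1.0, 2.0, 0.0],
                [0.0, 0.0, 5.0]])

eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending order

# Indices that sort the eigenvalues from largest to smallest.
order = np.argsort(eigenvalues)[::-1]

# Reorder the eigenvalues and the matching eigenvector columns together.
sorted_values = eigenvalues[order]
sorted_vectors = eigenvectors[:, order]

print(sorted_values)
print(sorted_vectors)
```

Sorting values and columns with the same index array keeps each eigenvector paired with its eigenvalue.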

5. Selecting the best features k
[106]: from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

# df still contains missing values, so impute them with the column mean
imputer = SimpleImputer(strategy='mean')
imputed_features = imputer.fit_transform(df)

# Apply PCA
pca = PCA()
pca.fit(imputed_features)

[106]: PCA()

[107]: cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)

# Plot cumulative explained variance ratio
plt.plot(range(1, len(cumulative_variance_ratio) + 1),
         cumulative_variance_ratio, marker='o', linestyle='-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance Ratio by Number of Components')
plt.grid(True)
plt.show()

[158]: k = 3 # Number of selected best features
best_k = eigenvectors[:, :k]

# Print the eigenvectors (note: this loop covers all of them, not only the first k)
for i in range(eigenvectors.shape[1]):
    print("The selected Principle Components are:")
    print(f'PC{i+1}: {eigenvectors[:, i]}')

The selected Principle Components are:


PC1: [-4.35304009e-07 -4.18240114e-07 -5.26649488e-04 -9.99999861e-01
-1.07479952e-10 8.55543039e-10 -9.82196278e-07]
The selected Principle Components are:
PC2: [ 6.81970818e-04 9.10023266e-04 9.99996220e-01 -5.26650652e-04
-6.27147956e-08 -7.35206512e-07 2.44725160e-03]
The selected Principle Components are:
PC3: [-7.63545346e-02 -3.48291460e-01 2.65542278e-03 -3.01935766e-07
1.19028810e-04 4.16687798e-04 -9.34267523e-01]
The selected Principle Components are:
PC4: [ 1.50610165e-01 9.22226138e-01 -7.04646780e-05 -6.43916411e-08
-2.20385057e-05 -1.85673016e-04 -3.56111624e-01]

The selected Principle Components are:
PC5: [ 9.85639614e-01 -1.67901638e-01 -4.75429552e-04 -9.08031536e-08
1.23278868e-04 8.70426120e-04 -1.79608433e-02]
The selected Principle Components are:
PC6: [ 8.04146996e-04 -4.69722146e-04 9.18187349e-09 -6.30282826e-10
-1.94023389e-01 -9.80996398e-01 -3.52857856e-04]
The selected Principle Components are:
PC7: [ 4.78313902e-05 -8.82479632e-06 -1.98286392e-07 -2.21289454e-10
9.80996888e-01 -1.94023456e-01 3.78270850e-05]

[112]: import pandas as pd

# Assuming 'best_k' contains the first k eigenvectors
principal_components = pd.DataFrame(best_k,
                                    columns=[f'PC{i+1}' for i in range(best_k.shape[1])])

print("Selected Principal Components:")
print(principal_components)

Selected Principal Components:


PC1 PC2
0 -4.353040e-07 6.819708e-04
1 -4.182401e-07 9.100233e-04
2 -5.266495e-04 9.999962e-01
3 -9.999999e-01 -5.266507e-04
4 -1.074800e-10 -6.271480e-08
5 8.555430e-10 -7.352065e-07
6 -9.821963e-07 2.447252e-03

6. Projection
[119]: import numpy as np

# df is the original (unscaled) dataset; principal_components holds the
# selected eigenvectors as columns
# Perform projection
projected_data = np.dot(df, principal_components)

# Check for NaN values in the projected data (rows of df with missing values project to NaN)
nan_indices = np.isnan(projected_data)

# Alternatively, you can remove rows with NaN values
# projected_data = projected_data[~nan_indices.any(axis=1)]

# Print the shape of the projected data
print("Shape of projected data:", projected_data.shape)

# Print the projected data
print("Projected data:")
print(projected_data)

Shape of projected data: (439, 2)


Projected data:
[[-1.95000050e+06 4.36046102e+02]
[-4.20000115e+06 1.07907716e+03]
[-6.65000252e+05 3.02783870e+02]
[-2.73500082e+06 8.31620176e+02]
[-1.05000030e+06 2.84022673e+02]
[-1.76000055e+06 5.73112033e+02]
[-9.75000478e+05 6.51522680e+02]
[-1.09500035e+06 3.73325414e+02]
[-1.22700038e+06 3.93807188e+02]
[-5.40000342e+05 5.06613770e+02]
[ nan nan]
[-1.49500040e+06 3.62668778e+02]
[-1.05000033e+06 3.47024883e+02]
[-8.90000248e+05 2.37292167e+02]
[-1.08500074e+06 1.11659562e+03]
[-1.56500052e+06 5.75799499e+02]
[-1.70000042e+06 3.47702254e+02]
[-3.75000039e+06 -2.54917964e+02]
[-8.05000278e+05 3.16050002e+02]
[-7.02000271e+05 3.30298130e+02]
[-8.75000286e+05 3.13186776e+02]
[-1.49000041e+06 3.85301614e+02]
[-1.42500032e+06 2.29534625e+02]
[-1.22500046e+06 5.54863924e+02]
[-1.56000057e+06 6.78432374e+02]
[-1.07500044e+06 5.48858145e+02]
[-1.30500046e+06 5.34731107e+02]
[-1.42700037e+06 3.17480987e+02]
[-8.95000330e+05 3.91653426e+02]
[-1.60000049e+06 5.00369389e+02]
[-1.15000009e+07 -1.25847323e+03]
[-1.21100042e+06 4.77233315e+02]
[ nan nan]
[-2.73950108e+06 1.32425486e+03]
[-2.25100107e+06 1.44552143e+03]
[-5.30000343e+05 5.11881186e+02]
[-1.60300067e+06 8.55791445e+02]
[-8.10000531e+05 7.95423179e+02]
[-1.05000037e+06 4.30026161e+02]
[-2.50000100e+06 1.24738448e+03]

[-1.80000060e+06 6.68041584e+02]
[-1.02500065e+06 9.60192920e+02]
[-6.25000258e+05 3.25849888e+02]
[-2.73500082e+06 8.31620176e+02]
[-1.61150059e+06 7.01309677e+02]
[-1.40000006e+07 -2.47009155e+03]
[-1.76000055e+06 5.73112033e+02]
[-1.18500063e+06 8.75928816e+02]
[-1.02500038e+06 4.60190771e+02]
[-7.90000475e+05 6.92953262e+02]
[-1.15500043e+06 5.06728199e+02]
[ nan nan]
[-8.15000500e+05 7.34787698e+02]
[-1.13000043e+06 5.13893632e+02]
[-2.68100147e+06 2.08806798e+03]
[-7.95000360e+05 4.74318378e+02]
[-1.90000062e+06 6.69376315e+02]
[-9.50000351e+05 4.16691475e+02]
[-7.45000255e+05 2.88651712e+02]
[-1.75000064e+06 7.45373924e+02]
[-6.83500242e+05 2.80038435e+02]
[-8.25000435e+05 6.09520735e+02]
[-1.00000045e+06 5.91359038e+02]
[-1.14000038e+06 4.29629872e+02]
[-9.20000386e+05 4.90490776e+02]
[-1.17500050e+06 6.49196199e+02]
[-1.20000059e+06 8.05025255e+02]
[-5.00000318e+06 4.72676112e+03]
[-1.80000075e+06 9.59037127e+02]
[-9.15000381e+05 4.82122479e+02]
[-6.56000212e+05 2.29524020e+02]
[-3.70000099e+06 9.08405035e+02]
[-4.20000115e+06 1.07907716e+03]
[-1.30100043e+06 4.68839330e+02]
[-5.50000174e+05 1.86344469e+02]
[-1.18000066e+06 9.43561823e+02]
[-8.90000377e+05 4.81286351e+02]
[-1.97500088e+06 1.15987619e+03]
[-1.08000064e+06 9.31227134e+02]
[-1.30000037e+06 3.56363279e+02]
[-8.50000356e+05 4.52356605e+02]
[-1.12612546e+06 5.76936620e+02]
[-1.29000061e+06 8.21626454e+02]
[-9.50000425e+05 5.57693390e+02]
[-7.22000319e+05 4.15766689e+02]
[-8.03000402e+05 5.51108906e+02]
[-1.00000041e+06 5.14358474e+02]
[-8.25000469e+05 6.73522085e+02]

[-1.61150059e+06 7.01309677e+02]
[-1.70000076e+06 1.00470626e+03]
[-2.05000087e+06 1.12037740e+03]
[-9.15000370e+05 4.61124150e+02]
[-1.50000077e+06 1.06003254e+03]
[-5.71000071e+06 -1.61164325e+02]
[-2.52500220e+06 3.50722891e+03]
[-9.20000662e+05 1.01548879e+03]
[-1.20000035e+06 3.48031022e+02]
[-5.37500462e+05 7.34933236e+02]
[-8.40000621e+05 9.57621221e+02]
[-9.80000592e+05 8.65894752e+02]
[-9.30000253e+05 2.35221175e+02]
[-6.25000235e+05 2.80852506e+02]
[-4.00000051e+06 -7.85910660e+01]
[-3.99500080e+06 4.74040790e+02]
[-1.95000050e+06 4.36046102e+02]
[-1.00000389e+05 7.12341065e+02]
[-9.70000819e+05 1.29915509e+03]
[-1.15500043e+06 5.06728199e+02]
[-3.90000231e+06 3.36309129e+03]
[-9.00000233e+05 2.06020865e+02]
[-7.25000308e+05 3.93184370e+02]
[-8.25000473e+05 6.80520467e+02]
[-9.40000486e+05 6.74959473e+02]
[-6.20000441e+05 6.73482179e+02]
[-4.95000247e+05 3.39313771e+02]
[-1.60000095e+06 1.38537464e+03]
[-7.50000565e+05 8.75021809e+02]
[-1.49500074e+06 1.01267002e+03]
[-2.47000206e+06 3.25920161e+03]
[-8.22500360e+05 4.66835459e+02]
[-1.27500071e+06 1.01253292e+03]
[-5.37000263e+05 3.57192755e+02]
[-1.19500073e+06 1.06966247e+03]
[-1.24000068e+06 9.58960159e+02]
[-8.10000471e+05 6.81421845e+02]
[-9.65000503e+05 7.01790267e+02]
[-6.25000305e+05 4.13849556e+02]
[-7.19000401e+05 5.71346969e+02]
[-1.38500076e+06 1.07060034e+03]
[-1.31000055e+06 7.00099492e+02]
[-7.55000443e+05 6.42390335e+02]
[-9.30000687e+05 1.06022210e+03]
[-1.30000067e+06 9.31366910e+02]
[-2.15000103e+06 1.39572002e+03]
[-7.25000531e+05 8.18186120e+02]
[-7.75000359e+05 4.76855460e+02]

[-8.49000447e+05 6.24882605e+02]
[-5.59000212e+05 2.55609228e+02]
[-1.90000074e+06 9.05376105e+02]
[-1.32500085e+06 1.26020140e+03]
[-1.95000050e+06 4.36046102e+02]
[-1.45700039e+06 3.57678809e+02]
[-1.77000065e+06 7.67838339e+02]
[-1.12500035e+06 3.59526569e+02]
[-1.15000058e+06 8.00361262e+02]
[-6.50000113e+06 4.27796070e+02]
[-4.50000076e+06 2.54083462e+02]
[-2.72000067e+06 5.53524033e+02]
[-1.41800050e+06 5.83219175e+02]
[-8.25000311e+05 3.73519180e+02]
[-1.85300048e+06 4.32126187e+02]
[-4.00000114e+06 1.10542164e+03]
[-1.51000072e+06 9.72774593e+02]
[-3.90000091e+06 6.96080545e+02]
[-1.71000012e+07 -2.20571130e+03]
[-1.00000042e+06 5.43356773e+02]
[-3.65000063e+06 2.37735822e+02]
[-9.95000172e+06 6.53841092e+02]
[-4.35000201e+05 2.66910756e+02]
[-1.25000039e+06 4.11696557e+02]
[-2.00000051e+06 4.44708542e+02]
[-2.82500073e+06 6.39226774e+02]
[-1.70000074e+06 9.54702407e+02]
[-1.90500072e+06 8.60744775e+02]
[-1.95000055e+06 5.23038090e+02]
[-8.25000311e+05 3.72519183e+02]
[-8.00000231e+05 2.28686043e+02]
[-8.80000309e+05 3.54557395e+02]
[-1.15500033e+06 3.24726439e+02]
[-2.53200068e+06 6.26540030e+02]
[-6.50000109e+06 3.49784984e+02]
[-4.07500365e+05 5.85399214e+02]
[-3.80000085e+06 6.11744536e+02]
[-1.20000010e+07 -1.31978680e+03]
[-3.69999995e+06 -1.07360136e+03]
[-5.62700147e+06 1.31155559e+03]
[-6.55000100e+06 1.82462146e+02]
[-2.10000073e+06 8.42045475e+02]
[-1.42500063e+06 8.27537259e+02]
[-2.61000064e+06 5.26456608e+02]
[-7.05000090e+06 -1.48860135e+02]
[-1.15000030e+06 2.64361523e+02]
[-3.21000058e+06 2.48462944e+02]
[-2.66000061e+06 4.64118294e+02]

[-2.15000069e+06 7.42717771e+02]
[-7.99500110e+06 -1.05509351e+01]
[-6.41000180e+05 1.73424022e+02]
[-1.19900043e+06 5.07553032e+02]
[-3.60000283e+06 4.42911322e+03]
[-5.60000175e+06 1.84077042e+03]
[-1.26000040e+06 4.25427531e+02]
[-4.97500126e+06 1.08893128e+03]
[-2.41000069e+06 6.74786235e+02]
[-2.38890021e+07 -2.34511635e+03]
[-1.70000064e+06 7.61705584e+02]
[-7.87500145e+06 6.77640060e+02]
[-1.09950019e+07 7.29495682e+02]
[-5.99499996e+06 -1.65525150e+03]
[-3.75000019e+06 -6.33928254e+02]
[-8.30000294e+05 3.39888491e+02]
[-2.25000069e+06 7.16050843e+02]
[-1.46500039e+06 3.53464922e+02]
[-6.80000262e+05 3.18884019e+02]
[-1.13000036e+06 3.82895893e+02]
[-1.24500059e+06 7.86329314e+02]
[-8.85000459e+05 6.37921810e+02]
[-6.50000349e+06 4.91179488e+03]
[-1.02000036e+06 4.09824224e+02]
[-1.99500069e+06 7.82343205e+02]
[-8.90000358e+05 4.45288934e+02]
[-3.35000104e+06 1.08873523e+03]
[-9.90000389e+05 4.78624226e+02]
[-8.05000323e+05 4.02052807e+02]
[-1.25500042e+06 4.69062222e+02]
[-3.60000105e+06 1.04607399e+03]
[-1.25000041e+06 4.41695588e+02]
[-1.97500071e+06 8.29876760e+02]
[-2.52500100e+06 1.22521631e+03]
[-2.40000062e+06 5.36060627e+02]
[-2.22500097e+06 1.26421668e+03]
[-9.41750376e+05 4.65037032e+02]
[-1.30000046e+06 5.23361055e+02]
[-1.15000028e+06 2.22358324e+02]
[-1.80000094e+06 1.31404404e+03]
[-1.95000074e+06 8.98042818e+02]
[-3.45000067e+06 3.59070272e+02]
[-1.84000077e+06 9.69974109e+02]
[-4.15000213e+06 2.94441993e+03]
[-1.70000074e+06 9.54695066e+02]
[-3.75000122e+06 1.32007210e+03]
[-1.08300041e+06 4.91645709e+02]
[-1.90000068e+06 7.87378316e+02]

[-2.67500079e+06 7.91230527e+02]
[-7.75000124e+06 3.04473906e+02]
[-1.63500061e+06 7.24939055e+02]
[-1.07500041e+06 5.03857974e+02]
[-2.10000073e+06 8.31042728e+02]
[-1.52500061e+06 7.55871411e+02]
[-1.45000050e+06 5.66366354e+02]
[-1.16500049e+06 6.32462789e+02]
[-1.31000057e+06 7.29099897e+02]
[-1.15000052e+06 6.80358358e+02]
[-2.15000078e+06 9.10711332e+02]
[-2.90000086e+06 8.72723245e+02]
[-6.41500174e+06 1.62155299e+03]
[-8.60000308e+06 3.59081937e+03]
[-1.01175018e+07 8.46635381e+02]
[-1.10000041e+06 4.92689594e+02]
[-2.01000050e+06 4.19445469e+02]
[-2.07500042e+06 2.57207856e+02]
[-1.07500051e+06 6.85860643e+02]
[-7.15000320e+05 4.18450800e+02]
[-2.15000090e+06 1.15071714e+03]
[-1.40000082e+06 1.18070651e+03]
[-1.00000037e+06 4.42358746e+02]
[-3.20000085e+06 7.79736396e+02]
[-6.50000266e+05 3.34683539e+02]
[-9.95000610e+05 8.96992056e+02]
[-1.02500042e+06 5.22191447e+02]
[-7.40000094e+06 -1.72200080e+02]
[-7.81000310e+05 3.83690320e+02]
[-4.95000075e+06 1.18093644e+02]
[-5.88001921e+05 3.49236230e+03]
[-2.00000076e+06 9.19721485e+02]
[-5.65000114e+06 6.69442220e+02]
[-7.50000620e+05 9.80022322e+02]
[-1.73500078e+06 1.03127753e+03]
[-1.85000082e+06 1.07070498e+03]
[-8.65000136e+06 3.06486074e+02]
[-3.80000085e+06 6.11744536e+02]
[-6.50000109e+06 3.49784984e+02]
[-9.46000514e+05 7.26797769e+02]
[-1.80000076e+06 9.77044455e+02]
[-7.45000281e+05 3.37653974e+02]
[-3.15000133e+06 1.69107816e+03]
[-1.60000073e+06 9.62373106e+02]
[-1.73000090e+06 1.24690508e+03]
[-3.35000104e+06 1.08873523e+03]
[-1.78000081e+06 1.06957403e+03]
[-3.85000126e+06 1.37241552e+03]

[-2.01000116e+06 1.66844979e+03]
[-3.67500085e+06 6.52573174e+02]
[-1.45000065e+06 8.61368368e+02]
[-1.28000049e+06 5.93891394e+02]
[-9.50000426e+05 5.58688491e+02]
[-3.00000110e+06 1.30506067e+03]
[-8.90000387e+05 5.00292765e+02]
[-3.80000279e+06 4.29876503e+03]
[-3.15000140e+06 1.82607247e+03]
[-5.25000104e+06 6.01106210e+02]
[-1.62500028e+06 1.13202089e+02]
[-1.68500082e+06 1.11861228e+03]
[-1.26100074e+06 1.07391487e+03]
[-7.12500518e+05 7.96768790e+02]
[-1.74000086e+06 1.18363948e+03]
[-7.30000357e+05 4.85550757e+02]
[-8.40000320e+05 3.85620026e+02]
[-6.35000307e+05 4.15583023e+02]
[-8.80000359e+06 4.49049300e+03]
[-1.36800084e+06 1.22555410e+03]
[-1.02500096e+06 1.56019242e+03]
[-1.78200060e+06 6.61517999e+02]
[-1.20000081e+06 1.22402771e+03]
[-2.85500192e+06 2.89643477e+03]
[-9.60000394e+06 4.94419228e+03]
[-1.75000051e+06 5.13371784e+02]
[-1.70000074e+06 9.54702407e+02]
[-2.10000083e+06 1.01604937e+03]
[-8.85000376e+05 4.80922744e+02]
[-9.80000417e+05 5.33890202e+02]
[-1.65000031e+06 1.49037403e+02]
[-8.60000276e+06 2.97081705e+03]
[-7.30000228e+05 2.40551683e+02]
[-1.62900053e+06 5.72099207e+02]
[-1.45000080e+06 1.13236490e+03]
[-2.00000101e+06 1.39371076e+03]
[-1.77000065e+06 7.67838339e+02]
[-5.32000208e+06 2.55023014e+03]
[-2.00000070e+06 8.02712312e+02]
[-1.80000083e+06 1.10504642e+03]
[-6.70000352e+05 4.91149895e+02]
[-4.45000128e+06 1.25642084e+03]
[-1.58950079e+06 1.07990286e+03]
[-1.61000072e+06 9.52102065e+02]
[-2.51500109e+06 1.41548826e+03]
[-4.99900132e+06 1.18728965e+03]
[-6.55000186e+05 1.80050860e+02]
[-2.15000069e+06 7.42717771e+02]

[-3.35000194e+06 2.80673454e+03]
[-1.05000029e+06 2.77023382e+02]
[-8.95000424e+05 5.69655201e+02]
[-1.34900042e+06 4.34559451e+02]
[-1.30000080e+06 1.17136265e+03]
[-1.08500067e+06 9.86596109e+02]
[-1.34000061e+06 8.06301277e+02]
[-2.16500102e+06 1.35980990e+03]
[-9.65000281e+05 2.79790611e+02]
[-8.30000502e+05 7.34887680e+02]
[-9.50000939e+05 1.53369061e+03]
[-7.94000449e+05 6.43844390e+02]
[-6.50000277e+06 3.54478218e+03]
[-1.47500063e+06 8.10199798e+02]
[-2.05000134e+06 2.01038171e+03]
[-4.99000099e+06 5.72030316e+02]
[-5.00000149e+06 1.50876014e+03]
[-9.10000458e+05 6.30758537e+02]
[-7.50000105e+06 1.01425589e+01]
[-1.15000050e+06 6.46361844e+02]
[-1.80000080e+06 1.05204594e+03]
[-1.25000075e+06 1.09169962e+03]
[-1.27500062e+06 8.48533539e+02]
[-7.05000433e+05 6.36716502e+02]
[-8.72500461e+05 6.46505505e+02]
[-2.60000116e+06 1.52672146e+03]
[-1.10000037e+06 4.20693564e+02]
[-2.10000097e+06 1.28204694e+03]
[-1.19900043e+06 5.07553032e+02]
[-1.62500100e+06 1.47820308e+03]
[-2.71000073e+06 6.77793788e+02]
[-2.53200068e+06 6.26540030e+02]
[-3.21000058e+06 2.48462944e+02]
[-1.50000077e+06 1.06003254e+03]
[-1.90300029e+06 4.27993523e+01]
[-3.50005829e+04 1.09757852e+03]
[-7.20000459e+05 6.82816538e+02]
[-1.30000046e+06 5.39365034e+02]
[-3.80000279e+06 4.29876503e+03]
[-4.62100210e+06 2.76636720e+03]
[-7.70000562e+05 8.64489024e+02]
[-1.60000075e+06 1.00736747e+03]
[-4.00000168e+06 2.14540963e+03]
[-6.26000357e+05 5.12324978e+02]
[-1.85300048e+06 4.32126187e+02]
[-1.16600078e+06 1.17093655e+03]
[-3.35000373e+06 6.19278769e+03]
[-1.10000047e+06 6.07692175e+02]

[-4.00000113e+06 1.09340763e+03]
[-1.45700039e+06 3.57678809e+02]
[-1.77000065e+06 7.67838339e+02]
[-2.40000077e+06 8.32056547e+02]
[-7.45000014e+06 -1.69352958e+03]
[-1.30750030e+07 2.30405906e+03]
[-2.50000052e+06 3.30386011e+02]
[-4.15000071e+06 2.56414152e+02]
[-2.30000030e+06 -4.02878644e+01]
[-9.00000190e+06 1.23215198e+03]
[-3.50000192e+06 2.72376843e+03]
[-1.25000028e+07 2.09188138e+03]
[-1.30000050e+06 6.12362311e+02]
[-1.51000049e+06 5.29763976e+02]
[-1.67600019e+07 -8.26659911e+02]
[-1.20000016e+07 -2.08787587e+02]
[-2.98900122e+06 1.53084935e+03]
[-2.81000056e+06 3.29121021e+02]
[-2.52500165e+06 2.47022049e+03]
[-2.47500083e+06 9.21551970e+02]
[-1.15000049e+06 6.30358547e+02]
[-1.00900030e+06 2.96617831e+02]
[-7.45000014e+06 -1.69352958e+03]
[-2.40000062e+06 5.36044693e+02]
[-8.30000342e+05 4.29888151e+02]
[-1.41000037e+06 3.32431578e+02]
[-3.00000087e+06 8.70060097e+02]
[-3.50000112e+06 1.20173252e+03]
[-9.50000362e+06 4.37482971e+03]
[-4.15000071e+06 2.56414152e+02]
[-3.62500102e+06 9.90906119e+02]
[-2.50000064e+06 5.53384258e+02]
[-9.20000525e+05 7.55488182e+02]
[-2.50200099e+06 1.22234003e+03]
[-4.75000111e+06 8.56422746e+02]
[-3.40000034e+06 -2.55602169e+02]
[-2.90000037e+06 -6.72769005e+01]
[-5.00000262e+06 3.65976652e+03]
[-1.84500068e+06 7.96341161e+02]
[-1.23000053e+06 6.77231963e+02]
[-6.00000061e+06 -4.25890317e+02]
[-3.17500099e+06 1.04589681e+03]
[-9.65000382e+05 4.70812074e+02]
[-1.33000047e+06 5.49562287e+02]
[-1.00000013e+07 -1.24496790e+02]
[-9.50000182e+06 9.46827929e+02]
[-2.85000104e+06 1.21705736e+03]
[-2.72500105e+06 1.28288961e+03]

[-5.25000238e+06 3.13511444e+03]
[-2.01500055e+06 5.13808492e+02]
[-5.55000131e+06 1.02710488e+03]
[-4.65000138e+06 1.40110678e+03]
[-1.75000058e+06 6.38371312e+02]
[-1.57600071e+06 9.33028246e+02]
[-6.25000159e+06 1.36746268e+03]
[-9.49000623e+05 9.32216179e+02]
[-8.95000549e+05 8.07655211e+02]
[-1.25000055e+06 7.13700706e+02]
[-1.65000090e+06 1.27604208e+03]
[-3.19500116e+06 1.35937026e+03]
[-7.35000147e+06 8.50142342e+02]
[-1.34900050e+06 5.95558160e+02]
[-1.25000041e+06 4.41696443e+02]]

[159]: projected_data.shape

[159]: (439, 2)
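
In the output above the first projected column is almost exactly the negated sale price, since the projection was applied to unscaled, uncentered data, and the rows with missing values came out as NaN. A hedged sketch of the more usual pipeline (standardize, decompose, sort, project) on synthetic data follows; the three columns and their relationships are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200

# Hypothetical correlated features on very different scales.
sqft = rng.normal(2000, 800, n)
price = 900 * sqft + rng.normal(0, 300_000, n)
rooms = sqft / 300 + rng.normal(0, 1, n)
x = np.column_stack([sqft, price, rooms])

# 1. Standardize (sample standard deviation, to match np.cov).
z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)

# 2. Eigen decomposition of the covariance of the standardized data.
cov = np.cov(z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort components by decreasing eigenvalue.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Project onto the first k components.
k = 2
projected = z @ eigenvectors[:, :k]

print(projected.shape)
print(projected.var(axis=0, ddof=1), eigenvalues[:k])
```

The variance of each projected column matches its eigenvalue, which is a quick sanity check that the projection and the decomposition agree.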

[122]: explained_variance_ratio = pca.explained_variance_ratio_

# Print the explained variance ratio values
print("Explained Variance Ratio:", explained_variance_ratio)

Explained Variance Ratio: [9.99999885e-01 1.14996251e-07 5.38924715e-13 1.62041663e-13
7.34178924e-14 7.01994932e-18 4.54122976e-19]
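A first ratio this close to 1 usually means one large-magnitude column dominates the decomposition. Here the first column of the projected data closely matches the negated sale price (about -1.25e6 for the last row, whose `lastsoldprice` is 1250000), which suggests this PCA was fitted on unscaled values. The sketch below uses synthetic data (the column values and the name `X_demo` are made up for illustration, not taken from the housing file) to show how standardizing each column before PCA changes the explained variance ratios.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data (hypothetical values, not the real file)
rng = np.random.default_rng(0)
n = 439
sqft = rng.normal(2000, 800, n)
rooms = sqft / 400 + rng.normal(0, 1, n)
price = 1000 * sqft + rng.normal(0, 300_000, n)
X_demo = np.column_stack([sqft, rooms, price])

# PCA on raw values: the column with the largest variance dominates PC1
raw_ratio = PCA().fit(X_demo).explained_variance_ratio_

# PCA after standardizing each column to zero mean and unit variance
X_std = StandardScaler().fit_transform(X_demo)
std_ratio = PCA().fit(X_std).explained_variance_ratio_

print("Raw:         ", raw_ratio)
print("Standardized:", std_ratio)
```

On the raw matrix the price column absorbs almost all the variance, while the standardized version spreads it according to the correlation structure of the features.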

[123]: import numpy as np
import matplotlib.pyplot as plt

# 'explained_variance_ratio' holds the explained variance ratio of each
# principal component (computed in the previous cell)

# Calculate cumulative explained variance ratio
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)

# Plot the cumulative explained variance against the number of components
plt.plot(range(1, len(cumulative_variance_ratio) + 1),
         cumulative_variance_ratio, marker='o', linestyle='-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Scree Plot')
plt.grid(True)
plt.show()
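A cumulative curve like this is often read against a variance threshold to decide how many components to keep. The helper below is a minimal sketch of that rule; the function name `components_for_variance`, the 0.90 threshold and the example ratios are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.95):
    """Smallest number of components whose cumulative ratio reaches `threshold`."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold
    k = int(np.searchsorted(cumulative, threshold, side="left")) + 1
    # Guard against thresholds above the total (rounding can leave the sum just under 1)
    return min(k, len(cumulative))

# Made-up ratios for illustration: cumulative sums are 0.55, 0.80, 0.92, 0.97, 1.00
ratios = np.array([0.55, 0.25, 0.12, 0.05, 0.03])
print(components_for_variance(ratios, threshold=0.90))  # prints 3
```

Applied to the ratios printed earlier in this notebook, any threshold below 0.9999998 would return a single component, which is another sign that one column dominates the fit.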

2. Biplot
[180]:

---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[180], line 3
1 import pandas as pd
2 import matplotlib.pyplot as plt
----> 3 from prince import PCA
5 # Assuming X contains your standardized data
6 pca = PCA(n_components=2)

ModuleNotFoundError: No module named 'prince'

[183]: pip install prince

Requirement already satisfied: prince in

c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages
(0.13.0)
Requirement already satisfied: altair<6.0.0,>=4.2.2 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
prince) (5.3.0)
Requirement already satisfied: pandas<3.0.0,>=1.4.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
prince) (2.1.3)
Requirement already satisfied: scikit-learn<2.0.0,>=1.0.2 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
prince) (1.4.2)
Requirement already satisfied: jinja2 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (3.1.2)
Requirement already satisfied: packaging in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (23.2)
Requirement already satisfied: jsonschema>=3.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (4.20.0)
Requirement already satisfied: typing-extensions>=4.0.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (4.7.1)
Requirement already satisfied: toolz in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (0.12.1)
Requirement already satisfied: numpy in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
altair<6.0.0,>=4.2.2->prince) (1.26.2)
Requirement already satisfied: pytz>=2020.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
pandas<3.0.0,>=1.4.1->prince) (2023.3.post1)
Requirement already satisfied: python-dateutil>=2.8.2 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
pandas<3.0.0,>=1.4.1->prince) (2.8.2)
Requirement already satisfied: tzdata>=2022.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
pandas<3.0.0,>=1.4.1->prince) (2023.3)
Requirement already satisfied: scipy>=1.6.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
scikit-learn<2.0.0,>=1.0.2->prince) (1.12.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
scikit-learn<2.0.0,>=1.0.2->prince) (3.2.0)
Requirement already satisfied: joblib>=1.2.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
scikit-learn<2.0.0,>=1.0.2->prince) (1.3.2)
Requirement already satisfied: attrs>=22.2.0 in

c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jsonschema>=3.0->altair<6.0.0,>=4.2.2->prince) (23.1.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jsonschema>=3.0->altair<6.0.0,>=4.2.2->prince) (2023.11.1)
Requirement already satisfied: referencing>=0.28.4 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jsonschema>=3.0->altair<6.0.0,>=4.2.2->prince) (0.31.1)
Requirement already satisfied: rpds-py>=0.7.1 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jsonschema>=3.0->altair<6.0.0,>=4.2.2->prince) (0.13.2)
Requirement already satisfied: six>=1.5 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
python-dateutil>=2.8.2->pandas<3.0.0,>=1.4.1->prince) (1.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
jinja2->altair<6.0.0,>=4.2.2->prince) (2.1.3)
Note: you may need to restart the kernel to use updated packages.

[notice] A new release of pip available: 22.2.2 -> 24.0


[notice] To update, run: python.exe -m pip install --upgrade pip

[199]: import pandas as pd
import numpy as np
from prince import PCA
from sklearn.impute import SimpleImputer

# Impute missing values with mean (replace 'mean' with 'median' or
# 'most_frequent' if desired)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(df_scaled)

# Wrap the imputed ndarray in a DataFrame for prince
X_df = pd.DataFrame(X_imputed)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_df)

[209]: import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Impute missing values with mean (replace 'mean' with 'median' or
# 'most_frequent' if desired)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(df_scaled)

# df_scaled is assumed to hold the standardized data
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_imputed)

# Extract loadings of each feature for the first two principal components
loadings = pca.components_.T[:, :2]

# Plot the first two principal components
plt.figure(figsize=(8, 7))
plt.scatter(principal_components[:, 0], principal_components[:, 1], alpha=0.5)

# Plot the loadings as vectors, one arrow and label per feature
for i, feature in enumerate(df_scaled.columns):
    plt.arrow(0, 0, loadings[i, 0], loadings[i, 1], color='r', alpha=0.5)
    plt.text(loadings[i, 0], loadings[i, 1], str(feature), color='g', fontsize=10,
             ha='center', va='center')

# Set labels and title
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Biplot of Principal Components')
plt.grid(True)
plt.show()

[189]: import pandas as pd
import numpy as np
from prince import PCA

# Assuming X is your ndarray
X_df = pd.DataFrame(X)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_df)

3. Pairplot
[130]: import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

sns.pairplot(df)
plt.show()

[132]: from sklearn.impute import SimpleImputer

# Impute missing values with mean (replace 'mean' with 'median' or
# 'most_frequent' if desired)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Now proceed with PCA using X_imputed as the input

[135]: df

[135]: bathrooms bedrooms finishedsqft lastsoldprice latitude longitude \
0 2.0 2 1463.0 1950000 37.795139 -122.425309
1 3.5 3 3291.0 4200000 37.794429 -122.428513
2 1.0 1 653.0 665000 37.792472 -122.425281
3 2.5 2 2272.0 2735000 37.794706 -122.426347
4 1.0 1 837.0 1050000 37.793212 -122.423744
.. … … … … … …
434 2.0 3 2145.0 1650000 37.795777 -122.433024
435 3.5 4 3042.0 3195000 37.795330 -122.436540
436 7.5 6 4721.0 7350000 37.795246 -122.437490
437 1.0 2 1306.0 1349000 37.796588 -122.429641
438 1.0 1 1100.0 1250000 37.795255 -122.432880

totalrooms
0 7
1 7
2 3
3 6
4 3
.. …
434 8
435 10
436 13
437 5
438 5

[439 rows x 7 columns]

[138]: import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Separate features (X) and target variable (y)
X = df.drop(columns=['lastsoldprice'])
y = df['lastsoldprice']

# Handle missing values using imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_imputed)

# Create a DataFrame for the principal components
pc_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

# Create a scatterplot matrix with both principal components
combined_df = pd.concat([pc_df, y], axis=1)
sns.pairplot(combined_df, hue='lastsoldprice')
plt.show()
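Because `lastsoldprice` is continuous, using it directly as `hue` gives nearly one colour level per row. One common alternative is to bin the price into a few bands first. The sketch below illustrates this on a synthetic stand-in; `demo_df`, the `price_band` column and the four band labels are hypothetical names chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for combined_df (PC scores plus the sale price)
rng = np.random.default_rng(1)
demo_df = pd.DataFrame({
    "PC1": rng.normal(size=100),
    "PC2": rng.normal(size=100),
    "lastsoldprice": rng.integers(500_000, 8_000_000, size=100),
})

# Bin the continuous price into four quantile-based bands
demo_df["price_band"] = pd.qcut(
    demo_df["lastsoldprice"], q=4,
    labels=["low", "mid-low", "mid-high", "high"],
)

print(demo_df["price_band"].value_counts())

# The band can then be passed as the hue, for example:
# sns.pairplot(demo_df, vars=["PC1", "PC2"], hue="price_band")
```

Quantile bins keep the groups roughly equal in size, so no single colour dominates the plot; fixed price cut-offs via `pd.cut` would be the alternative if specific thresholds matter.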

[176]: import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Impute missing values with mean (replace 'mean' with 'median' or
# 'most_frequent' if desired)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(df_scaled)

# df_scaled is assumed to hold the standardized data
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_imputed)

# Create a heatmap of the principal components
plt.figure(figsize=(10, 6))
sns.heatmap(principal_components, cmap='coolwarm', annot=True, fmt=".2f",
            cbar=True)
plt.xlabel('Principal Component')
plt.ylabel('Data Point')
plt.title('Heatmap of Principal Components')
plt.show()

[ ]:

Score plot
[152]: import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Assuming your data is stored in a DataFrame called 'df'

# Impute missing values (the strategy can be changed as needed)
imputer = SimpleImputer(strategy='mean')
imputed_data = imputer.fit_transform(df)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(imputed_data)

# Perform PCA
pca = PCA(n_components=2)  # Reduce to 2 components for a 2D plot
principal_components = pca.fit_transform(scaled_data)

# Plot the score plot with quadrants
pc_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
plt.figure(figsize=(8, 6))
plt.scatter(pc_df['PC1'], pc_df['PC2'])
plt.axhline(y=0, color='k', linestyle='--')  # Add horizontal line at y=0
plt.axvline(x=0, color='k', linestyle='--')  # Add vertical line at x=0
plt.text(0.5, 0.5, ' I', fontsize=12, ha='center')
plt.text(-0.5, 0.5, ' II', fontsize=12, ha='center')
plt.text(-0.5, -0.5, ' III', fontsize=12, ha='center')
plt.text(0.5, -0.5, ' IV', fontsize=12, ha='center')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Score Plot')
plt.grid(True)
plt.show()

0.3 ADDITIONAL EXPLORATION
[148]: from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Initialize the chosen model (RandomForestRegressor here; any other
# regressor can be substituted)
model = RandomForestRegressor()

# Perform 5-fold cross-validation on the scaled data
scores = cross_val_score(model, df_scaled, y, cv=5,
                         scoring='neg_mean_squared_error')

# Print the mean cross-validation MSE (the scorer returns negated MSE values)
print("Cross-validation Mean Squared Error:", -scores.mean())

Cross-validation Mean Squared Error: 195630660332.1332
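An MSE of this size is easier to read as a root mean squared error, which here is roughly 442,000 in price units. The sketch below shows one way to cross-validate a model on PCA-reduced features: it wraps imputation, scaling and PCA in a pipeline so each step is refitted inside every fold, and it keeps the target out of the feature matrix. The data is synthetic, and `X_demo`, `y_demo` and the hyperparameters are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features and target (hypothetical, not the housing data)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 6))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Preprocessing and model in one estimator, refitted on each training fold
pipeline = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    PCA(n_components=2),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

scores = cross_val_score(pipeline, X_demo, y_demo, cv=5,
                         scoring="neg_mean_squared_error")

# The scorer returns negated MSE, so flip the sign before taking the root
rmse = np.sqrt(-scores.mean())
print("Cross-validated RMSE:", rmse)
```

If `df_scaled` in the cell above still contains the scaled `lastsoldprice` column, dropping it before fitting would avoid feeding the target to the model.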

[170]: import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Create an imputer object ('mean' can be replaced with 'median' or
# 'most_frequent')
imputer = SimpleImputer(strategy='mean')

# Fit the imputer to df_scaled and transform it
X_imputed = imputer.fit_transform(df_scaled)

# Perform PCA on the standardized data matrix
pca = PCA(n_components=2)  # Choose the number of components you want
principal_components = pca.fit_transform(X_imputed)

# Get the loadings (coefficients) for each PC
loadings = pca.components_

# Create a DataFrame to display the loadings
loadings_df = pd.DataFrame(loadings, columns=df_scaled.columns,
                           index=[f'PC{i+1}' for i in range(pca.n_components_)])

# Display the loadings DataFrame
print("Loadings (Coefficients) for Each Principal Component:")
print(loadings_df)

Loadings (Coefficients) for Each Principal Component:
0 1 2 3 4 5 6
PC1 0.443793 0.414535 0.467587 0.420565 0.017968 -0.195893 0.443845
PC2 -0.092340 -0.005345 -0.067513 -0.077041 -0.750017 -0.646715 -0.013622
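The loadings table labels its columns 0 to 6, which suggests `df_scaled` was built from a NumPy array without column names (an assumption, since its construction is not shown in this section). The sketch below shows one way to carry feature names through scaling so the loadings can be read per feature; the data is synthetic and the four feature names are borrowed from the DataFrame only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with named columns (random values, not the housing data)
rng = np.random.default_rng(7)
feature_names = ["bathrooms", "bedrooms", "finishedsqft", "totalrooms"]
demo = pd.DataFrame(rng.normal(size=(50, 4)), columns=feature_names)

# StandardScaler returns an ndarray, so re-attach the column names afterwards
scaled = pd.DataFrame(StandardScaler().fit_transform(demo),
                      columns=demo.columns, index=demo.index)

pca = PCA(n_components=2).fit(scaled)
loadings_df = pd.DataFrame(pca.components_, columns=scaled.columns,
                           index=[f"PC{i + 1}" for i in range(pca.n_components_)])

# Feature with the largest absolute weight on each component
top_features = loadings_df.abs().idxmax(axis=1)
print(loadings_df.round(3))
print(top_features)
```

With named columns, a pattern like the one in the output above (two columns carrying most of PC2) can be tied to specific features rather than positional indices.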

[164]: df

[164]: bathrooms bedrooms finishedsqft lastsoldprice latitude longitude \
0 2.0 2 1463.0 1950000 37.795139 -122.425309
1 3.5 3 3291.0 4200000 37.794429 -122.428513
2 1.0 1 653.0 665000 37.792472 -122.425281
3 2.5 2 2272.0 2735000 37.794706 -122.426347
4 1.0 1 837.0 1050000 37.793212 -122.423744
.. … … … … … …
434 2.0 3 2145.0 1650000 37.795777 -122.433024
435 3.5 4 3042.0 3195000 37.795330 -122.436540
436 7.5 6 4721.0 7350000 37.795246 -122.437490
437 1.0 2 1306.0 1349000 37.796588 -122.429641
438 1.0 1 1100.0 1250000 37.795255 -122.432880

totalrooms
0 7
1 7
2 3
3 6
4 3
.. …
434 8
435 10
436 13
437 5
438 5

[439 rows x 7 columns]
