Cia Code

April 23, 2024
Importing data
[32]: import pandas as pd
df1 = pd.read_csv('C:\\Users\\lariy\\Downloads\\price prediction.csv')
df = pd.DataFrame(df1)
df
[32]: address bathrooms bedrooms finishedsqft \

0 2243 Franklin St 2.0 2 1463.0
1 2002 Pacific Ave APT 4 3.5 3 3291.0
2 1945 Washington St APT 411 1.0 1 653.0
3 1896 Pacific Ave APT 802 2.5 2 2272.0
4 1840 Washington St APT 603 1.0 1 837.0
.. … … … …
434 2170 Vallejo St APT 101 2.0 3 2145.0
435 2380 Vallejo St 3.5 4 3042.0
436 2430 Vallejo St 7.5 6 4721.0
437 1859 Green St 1.0 2 1306.0
438 2131 Vallejo St APT 3 1.0 1 1100.0
lastsolddate lastsoldprice latitude longitude neighborhood \

0 02-05-2016 1950000 37.795139 -122.425309 Pacific Heights
1 1/22/2016 4200000 37.794429 -122.428513 Pacific Heights
2 12/16/2015 665000 37.792472 -122.425281 Pacific Heights
3 12/17/2014 2735000 37.794706 -122.426347 Pacific Heights
4 12-02-2015 1050000 37.793212 -122.423744 Pacific Heights
.. … … … … …
434 11/14/2012 1650000 37.795777 -122.433024 Pacific Heights
435 10-01-2012 3195000 37.795330 -122.436540 Pacific Heights
436 9/24/2012 7350000 37.795246 -122.437490 Pacific Heights
437 10/18/2011 1349000 37.796588 -122.429641 Pacific Heights
438 03-04-2016 1250000 37.795255 -122.432880 Pacific Heights
totalrooms usecode yearbuilt zipcode

0 7 Condominium 1900 94109
1
.. … … … …
434 8 Condominium 1914 94123
435 10 SingleFamily 1908 94123
436 13 SingleFamily 1905 94123
437 5 Condominium 1900 94123
438 5 Condominium 1900 94123
[439 rows x 13 columns]
0.1 Data Preprocessing

1. Dropping irrelevant columns that add no intrinsic value to the dataset.
[33]: columns_to_drop = ['address', 'lastsolddate', 'neighborhood', 'usecode',
                   'yearbuilt', 'zipcode']
df.drop(columns=columns_to_drop, inplace=True)
[34]: df.columns
[34]: Index(['bathrooms', 'bedrooms', 'finishedsqft', 'lastsoldprice', 'latitude',

'longitude', 'totalrooms'],
dtype='object')
[35]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 bathrooms 439 non-null float64
1 bedrooms 439 non-null int64
2 finishedsqft 438 non-null float64
3 lastsoldprice 439 non-null int64
4 latitude 439 non-null float64
5 longitude 437 non-null float64
6 totalrooms 439 non-null int64
dtypes: float64(4), int64(3)
memory usage: 24.1 KB
2. Detecting and plotting missing values
[36]: missing_values = df.isnull().sum()
print("Missing values in the data:",missing_values)
Missing values in the data: bathrooms 0

bedrooms 0
finishedsqft 1
lastsoldprice 0
latitude 0
longitude 2
totalrooms 0
dtype: int64
[37]: import matplotlib.pyplot as plt
# Calculate the number of missing values in each column

missing_values = df.isnull().sum()
# Plot the missing values

plt.figure(figsize=(10, 6))
missing_values.plot(kind='bar')
plt.title('Missing Values in Each Column')
plt.xlabel('Columns')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)
plt.show()
3. Handling missing values
[38]: missing_values = df.isnull().sum()
print("Missing values before handling:")
print(missing_values)
Missing values before handling:

bathrooms 0
bedrooms 0
finishedsqft 1
lastsoldprice 0
latitude 0
longitude 2
totalrooms 0
dtype: int64
[39]: # Impute missing values with the median of each column

from sklearn.impute import SimpleImputer
numerical_features = df.select_dtypes(include=['float64', 'int64']).columns

imputer = SimpleImputer(strategy='median')
df[numerical_features] = imputer.fit_transform(df[numerical_features])
[40]: missing_values_after = df.isnull().sum()

print("\nMissing values after handling:")
print(missing_values_after)
Missing values after handling:

bathrooms 0
bedrooms 0
finishedsqft 0
lastsoldprice 0
latitude 0
longitude 0
totalrooms 0
dtype: int64
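The imputation cell above uses strategy='median'. A minimal sketch of why the choice between mean and median matters when a column is skewed, using a hypothetical toy column (the values below are illustrative and not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one missing value and one very large value
toy = pd.DataFrame({"finishedsqft": [650.0, 840.0, 1100.0, np.nan, 9000.0]})

# Fill the gap once with the mean and once with the median
mean_filled = toy["finishedsqft"].fillna(toy["finishedsqft"].mean())
median_filled = toy["finishedsqft"].fillna(toy["finishedsqft"].median())

print("mean fill  :", mean_filled[3])
print("median fill:", median_filled[3])
```

The mean is pulled toward the single large value, while the median stays close to the typical rows, which is usually the safer default for price and area columns.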
[22]: #no missing values
4. Detecting Outliers
[41]: import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore
# Load the dataset
df = pd.read_csv('C:\\Users\\lariy\\Downloads\\price prediction.csv')
# Select numerical columns for outlier detection
numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns
# Size the subplot grid from the number of columns (ceiling division);
# a fixed 2 x 3 grid only has room for 6 plots
n_plot_cols = 3
n_plot_rows = -(-len(numerical_columns) // n_plot_cols)
# Create box plots for numerical columns to visualize outliers
plt.figure(figsize=(12, 6))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(n_plot_rows, n_plot_cols, i)
    sns.boxplot(data=df[column])
    plt.title(column)
plt.tight_layout()
plt.show()
# Identify outliers using z-score, ignoring missing values
z_scores = zscore(df[numerical_columns].to_numpy(dtype=float), nan_policy='omit')
outliers = (z_scores > 3) | (z_scores < -3)
# Plot each column against the row index, colored by the outlier flag
plt.figure(figsize=(12, 6))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(n_plot_rows, n_plot_cols, i)
    plt.scatter(df.index, df[column], c=outliers[:, i-1], cmap='coolwarm',
                alpha=0.5)
    plt.title(column)
    plt.xlabel('Index')
    plt.ylabel(column)
plt.tight_layout()
plt.show()
[42]: df_no_outliers = df[~outliers.any(axis=1)]
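The filter above keeps only rows where no column has a z-score beyond 3 in absolute value. A minimal sketch of the same rule on hypothetical toy data, written with plain pandas so each step is visible (scipy's zscore uses the population standard deviation, ddof=0, by default, which the sketch mirrors):

```python
import pandas as pd

# Hypothetical toy data: 29 ordinary sale prices and one extreme sale
prices = pd.DataFrame({"lastsoldprice": [1_000_000.0] * 29 + [30_000_000.0]})

# z-score = (value - column mean) / column standard deviation
z = (prices - prices.mean()) / prices.std(ddof=0)

# Flag rows where any column lies more than 3 standard deviations from the mean
is_outlier = (z.abs() > 3).any(axis=1)
filtered = prices[~is_outlier]

print(len(prices), "rows before,", len(filtered), "rows after")
```

Because a single extreme value also inflates the standard deviation, this rule can miss outliers in very small samples; an IQR-based rule is a common alternative in that case.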
[43]: # Size the subplot grid from the number of columns (ceiling division)
n_plot_cols = 3
n_plot_rows = -(-len(numerical_columns) // n_plot_cols)
plt.figure(figsize=(12, 6))
for i, column in enumerate(numerical_columns, 1):
    plt.subplot(n_plot_rows, n_plot_cols, i)
    sns.boxplot(data=df_no_outliers[column])
    plt.title(column)
plt.tight_layout()
plt.show()
[29]: # The outliers have been reduced as far as possible
[44]: df.columns
[44]: Index(['address', 'bathrooms', 'bedrooms', 'finishedsqft', 'lastsolddate',

'lastsoldprice', 'latitude', 'longitude', 'neighborhood', 'totalrooms',
'usecode', 'yearbuilt', 'zipcode'],
dtype='object')
[45]: columns_to_drop = ['address', 'lastsolddate', 'neighborhood', 'usecode',
                   'yearbuilt', 'zipcode']
df.drop(columns=columns_to_drop, inplace=True)
df
[45]: bathrooms bedrooms finishedsqft lastsoldprice latitude longitude \

0 2.0 2 1463.0 1950000 37.795139 -122.425309
1 3.5 3 3291.0 4200000 37.794429 -122.428513
2 1.0 1 653.0 665000 37.792472 -122.425281
3 2.5 2 2272.0 2735000 37.794706 -122.426347
4 1.0 1 837.0 1050000 37.793212 -122.423744
.. … … … … … …
434 2.0 3 2145.0 1650000 37.795777 -122.433024
435 3.5 4 3042.0 3195000 37.795330 -122.436540
436 7.5 6 4721.0 7350000 37.795246 -122.437490
437 1.0 2 1306.0 1349000 37.796588 -122.429641
438 1.0 1 1100.0 1250000 37.795255 -122.432880
totalrooms
0 7
1 7
2 3
3 6
4 3
.. …
434 8
435 10
436 13
437 5
438 5
[ ]: #CORRELATION ESTIMATION

# Compute correlation matrix

corr_matrix = df.corr()
# Set custom color palette

colors = sns.color_palette("coolwarm", as_cmap=True)
sns.heatmap(corr_matrix, annot=True, cmap=colors, fmt=".2f", linewidths=0.5,
            cbar=False)
plt.title('Correlation Matrix', fontsize=16, fontweight='bold')

plt.xticks(fontsize=12, rotation=45)
plt.yticks(fontsize=12)
plt.show()
0.2 PRINCIPAL COMPONENT ANALYSIS
1. Standardization
[52]: df_index = df.index
[53]: numerical_columns = df.select_dtypes(include=['float64', 'int64']).columns

from sklearn.preprocessing import StandardScaler
[76]: scaler = StandardScaler()

scaled_features = scaler.fit_transform(df)
df_scaled = pd.DataFrame(scaled_features)
df_scaled
[76]: 0 1 2 3 4 5 6
0 -0.285556 -0.399122 -0.403174 -0.211435 1.185443 1.148603 0.105818
1 0.698425 0.140000 0.680420 0.624953 0.874861 0.720970 0.105818
2 -0.941543 -0.938244 -0.883323 -0.689106 0.018791 1.152340 -0.909570
3 0.042437 -0.399122 0.076381 0.080372 0.996031 1.010062 -0.148029
4 -0.941543 -0.938244 -0.774252 -0.545990 0.342496 1.357481 -0.909570
.. … … … … … … …
434 -0.285556 0.140000 0.001099 -0.322953 1.464529 0.118894 0.359665
435 0.698425 0.679121 0.532819 0.251367 1.268994 -0.350380 0.867359
436 3.322374 1.757365 1.528090 1.795897 1.232249 -0.477175 1.628900
437 -0.941543 -0.399122 -0.496240 -0.434844 1.819293 0.570418 -0.401876
438 -0.941543 -0.938244 -0.618352 -0.471645 1.236186 0.138114 -0.401876
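The scaled frame above carries integer column labels 0 to 6 because the array returned by the scaler has no names. A minimal sketch of the same standardization on hypothetical toy values, done directly in pandas so the column labels survive; with scikit-learn the equivalent would be to pass columns=df.columns and index=df.index when wrapping the scaled array in a DataFrame:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two features on very different scales
toy = pd.DataFrame({
    "bedrooms": [1, 2, 3, 4],
    "lastsoldprice": [600_000, 1_200_000, 1_900_000, 4_300_000],
})

# Standardize: subtract the column mean, divide by the population std (ddof=0),
# which is the convention StandardScaler follows
scaled = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled)
print("means:", scaled.mean().round(6).tolist())
print("stds :", scaled.std(ddof=0).round(6).tolist())
```

After this step every column has mean 0 and standard deviation 1, so no single feature dominates purely because of its units.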
2. Covariance Matrix Computation

import numpy as np
# Fill any remaining missing values with the column means
data_filled = df.fillna(df.mean())
# Compute the covariance matrix (note: on the unscaled columns of df)
covariance_matrix = np.cov(data_filled, rowvar=False)
column_names = ['bathrooms', 'bedrooms', 'finishedsqft', 'lastsoldprice',
                'latitude', 'longitude', 'totalrooms']
covariance_df = pd.DataFrame(covariance_matrix, columns=column_names,
                             index=column_names)
# Print the covariance matrix DataFrame

print("Covariance Matrix:")
print(covariance_df)
Covariance Matrix:
bathrooms bedrooms finishedsqft lastsoldprice \
bathrooms 2.329161e+00 2.017295e+00 2.231689e+03 3.157419e+06
bedrooms 2.017295e+00 3.448394e+00 2.356723e+03 3.033648e+06
finishedsqft 2.231689e+03 2.356723e+03 2.845895e+06 3.819981e+09
lastsoldprice 3.157419e+06 3.033648e+06 3.819981e+09 7.253364e+12
latitude 3.289555e-04 8.149072e-05 3.582618e-01 7.795914e+02
longitude -2.819931e-03 -3.999855e-03 -3.881395e+00 -6.205566e+03
totalrooms 4.699683e+00 5.724836e+00 5.793232e+03 7.124227e+06
latitude longitude totalrooms

bathrooms 0.000329 -0.002820 4.699683e+00
bedrooms 0.000081 -0.004000 5.724836e+00
finishedsqft 0.358262 -3.881395 5.793232e+03
lastsoldprice 779.591402 -6205.566010 7.124227e+06
latitude 0.000005 0.000009 2.110409e-04
longitude 0.000009 0.000056 -9.048211e-03
totalrooms 0.000211 -0.009048 1.555414e+01
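In the matrix above, the lastsoldprice variance (about 7.25e12) is many orders of magnitude larger than every other entry, because the covariance was taken on unscaled columns. A minimal sketch, on hypothetical synthetic data, of how standardizing first turns the covariance matrix into the correlation matrix so that all features are weighted comparably:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features on very different scales (square feet vs. dollars)
sqft = rng.normal(2000, 800, size=200)
price = 900 * sqft + rng.normal(0, 300_000, size=200)
X = np.column_stack([sqft, price])

# Covariance of the raw columns: dominated by the large-unit feature
cov_raw = np.cov(X, rowvar=False)

# Covariance of the standardized columns (sample std, ddof=1, to match np.cov)
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
cov_std = np.cov(Z, rowvar=False)

print("raw covariance:\n", cov_raw)
print("standardized covariance:\n", cov_std)
```

The standardized covariance has ones on the diagonal and equals np.corrcoef(X, rowvar=False), which is why PCA on standardized data is often described as PCA on the correlation matrix.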
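3. Eigen Decomposition
# 'eigenvalues' and 'eigenvectors' are assumed to hold the eigendecomposition of
# 'covariance_matrix'; the cell that computes them is not shown
# Convert eigenvalues and eigenvectors to DataFrame
eigen_df = pd.DataFrame({'Eigenvalue': eigenvalues})
eigen_df['Eigenvector'] = [eigenvectors[:, i] for i in range(eigenvectors.shape[1])]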
# Print the DataFrame

print("Eigenvalues and Eigenvectors:")
print(eigen_df)
Eigenvalues and Eigenvectors:

Eigenvalue Eigenvector
0 7.253366e+12 [-4.3530400936506286e-07, -4.182401136422737e-…
1 8.341100e+05 [0.0006819708177894264, 0.0009100232661667038,…
2 3.909019e+00 [-0.07635453461451713, -0.34829146033302216, 0…
3 1.175348e+00 [0.15061016522339143, 0.9222261380364992, -7.0…
4 5.325269e-01 [0.9856396143372855, -0.16790163752836, -0.000…
5 5.091827e-05 [0.0008041469959149176, -0.0004697221463473827…
6 3.293921e-06 [4.783139021784667e-05, -8.824796320393624e-06…
4. Rearranging eigenvectors by respective eigenvalues
# Assuming 'eigenvectors' contains the eigenvectors computed from the covariance matrix
# Create a DataFrame for eigenvectors
eigenvectors_df = pd.DataFrame(data=eigenvectors,
                               columns=[f'PC{i+1}' for i in range(eigenvectors.shape[1])])
# Print the DataFrame

print("Eigenvectors:")
print(eigenvectors_df)
Eigenvectors:
PC1 PC2 PC3 PC4 PC5 \
0 -4.353040e-07 6.819708e-04 -7.635453e-02 1.506102e-01 9.856396e-01
1 -4.182401e-07 9.100233e-04 -3.482915e-01 9.222261e-01 -1.679016e-01
2 -5.266495e-04 9.999962e-01 2.655423e-03 -7.046468e-05 -4.754296e-04
3 -9.999999e-01 -5.266507e-04 -3.019358e-07 -6.439164e-08 -9.080315e-08
4 -1.074800e-10 -6.271480e-08 1.190288e-04 -2.203851e-05 1.232789e-04
5 8.555430e-10 -7.352065e-07 4.166878e-04 -1.856730e-04 8.704261e-04
6 -9.821963e-07 2.447252e-03 -9.342675e-01 -3.561116e-01 -1.796084e-02
PC6 PC7
0 8.041470e-04 4.783139e-05
1 -4.697221e-04 -8.824796e-06
2 9.181873e-09 -1.982864e-07
3 -6.302828e-10 -2.212895e-10
4 -1.940234e-01 9.809969e-01
5 -9.809964e-01 -1.940235e-01
6 -3.528579e-04 3.782708e-05
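The step title refers to rearranging eigenvectors by their eigenvalues, while the code shown only labels the columns. A minimal sketch of the sorting itself, assuming a small hypothetical symmetric matrix `cov` and using np.linalg.eigh, which is intended for symmetric matrices and returns eigenvalues in ascending order:

```python
import numpy as np

# Hypothetical symmetric covariance matrix
cov = np.array([[4.0, 2.0, 0.0],
                [2.0, 3.0, 0.0],
                [0.0, 0.0, 1.0]])

# eigh returns eigenvalues in ascending order, eigenvectors as columns
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the largest eigenvalue (most variance) comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print("sorted eigenvalues:", eigenvalues)
print("first principal direction:", eigenvectors[:, 0])
```

The same index array must be applied to the eigenvalues and to the eigenvector columns, otherwise the pairs no longer match.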
5. Selecting the number of components k
[106]: from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
# 'df' may still contain missing values, so impute them with the column mean
imputer = SimpleImputer(strategy='mean')
imputed_features = imputer.fit_transform(df)
# Apply PCA
pca = PCA()
pca.fit(imputed_features)
[106]: PCA()
[107]: cumulative_variance_ratio = np.cumsum(pca.explained_variance_ratio_)
# Plot cumulative explained variance ratio

plt.plot(range(1, len(cumulative_variance_ratio) + 1),
         cumulative_variance_ratio, marker='o', linestyle='-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance Ratio by Number of Components')
plt.grid(True)
plt.show()
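Reading k off the cumulative curve can also be done numerically. A minimal sketch, assuming hypothetical explained variance ratios and a hypothetical 90% threshold (neither value comes from the notebook):

```python
import numpy as np

# Hypothetical explained variance ratios, sorted from largest to smallest
ratios = np.array([0.62, 0.21, 0.09, 0.05, 0.03])
cumulative = np.cumsum(ratios)

# Hypothetical target: keep enough components to explain 90% of the variance
threshold = 0.90

# searchsorted finds the first position whose cumulative value reaches the
# threshold; add 1 to turn the zero-based index into a component count
k = int(np.searchsorted(cumulative, threshold) + 1)

print("cumulative:", cumulative)
print("components needed:", k)
```

With these toy ratios the first two components explain 83% and the first three explain 92%, so three components are kept.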
[158]: k = 2 # Number of principal components to keep
best_k = eigenvectors[:, :k]
# Print all eigenvectors; the first k columns are the selected components
print("The selected Principal Components are:")
for i in range(eigenvectors.shape[1]):
    print(f'PC{i+1}: {eigenvectors[:, i]}')
The selected Principal Components are:

PC1: [-4.35304009e-07 -4.18240114e-07 -5.26649488e-04 -9.99999861e-01
-1.07479952e-10 8.55543039e-10 -9.82196278e-07]
PC2: [ 6.81970818e-04 9.10023266e-04 9.99996220e-01 -5.26650652e-04
-6.27147956e-08 -7.35206512e-07 2.44725160e-03]
PC3: [-7.63545346e-02 -3.48291460e-01 2.65542278e-03 -3.01935766e-07
1.19028810e-04 4.16687798e-04 -9.34267523e-01]
PC4: [ 1.50610165e-01 9.22226138e-01 -7.04646780e-05 -6.43916411e-08
-2.20385057e-05 -1.85673016e-04 -3.56111624e-01]
PC5: [ 9.85639614e-01 -1.67901638e-01 -4.75429552e-04 -9.08031536e-08
1.23278868e-04 8.70426120e-04 -1.79608433e-02]
PC6: [ 8.04146996e-04 -4.69722146e-04 9.18187349e-09 -6.30282826e-10
-1.94023389e-01 -9.80996398e-01 -3.52857856e-04]
PC7: [ 4.78313902e-05 -8.82479632e-06 -1.98286392e-07 -2.21289454e-10
9.80996888e-01 -1.94023456e-01 3.78270850e-05]
# Assuming 'best_k' contains the first k eigenvectors
principal_components = pd.DataFrame(best_k,
                                    columns=[f'PC{i+1}' for i in range(best_k.shape[1])])
print("Selected Principal Components:")

print(principal_components)
Selected Principal Components:

PC1 PC2
0 -4.353040e-07 6.819708e-04
1 -4.182401e-07 9.100233e-04
2 -5.266495e-04 9.999962e-01
3 -9.999999e-01 -5.266507e-04
4 -1.074800e-10 -6.271480e-08
5 8.555430e-10 -7.352065e-07
6 -9.821963e-07 2.447252e-03
6. Projection
[119]: import numpy as np
# Project the data in 'df' onto the selected principal components
# held in 'principal_components'
projected_data = np.dot(df, principal_components)
# Flag NaN values in the projected data (rows of 'df' with missing values)
nan_indices = np.isnan(projected_data)
# Optionally, remove the rows that contain NaN values
# projected_data = projected_data[~nan_indices.any(axis=1)]
# Print the shape of the projected data
print("Shape of projected data:", projected_data.shape)
# Print the projected data
print("Projected data:")
print(projected_data)
Shape of projected data: (439, 2)

Projected data:
[[-1.95000050e+06 4.36046102e+02]
[-4.20000115e+06 1.07907716e+03]
[-6.65000252e+05 3.02783870e+02]
[-2.73500082e+06 8.31620176e+02]
[-1.05000030e+06 2.84022673e+02]
[-1.76000055e+06 5.73112033e+02]
[-9.75000478e+05 6.51522680e+02]
[-1.09500035e+06 3.73325414e+02]
[-1.22700038e+06 3.93807188e+02]
[-5.40000342e+05 5.06613770e+02]
[ nan nan]
[-1.49500040e+06 3.62668778e+02]
[-1.05000033e+06 3.47024883e+02]
[-8.90000248e+05 2.37292167e+02]
[-1.08500074e+06 1.11659562e+03]
[-1.56500052e+06 5.75799499e+02]
[-1.70000042e+06 3.47702254e+02]
[-3.75000039e+06 -2.54917964e+02]
[-8.05000278e+05 3.16050002e+02]
[-7.02000271e+05 3.30298130e+02]
[-8.75000286e+05 3.13186776e+02]
[-1.49000041e+06 3.85301614e+02]
[-1.42500032e+06 2.29534625e+02]
[-1.22500046e+06 5.54863924e+02]
[-1.56000057e+06 6.78432374e+02]
[-1.07500044e+06 5.48858145e+02]
[-1.30500046e+06 5.34731107e+02]
[-1.42700037e+06 3.17480987e+02]
[-8.95000330e+05 3.91653426e+02]
[-1.60000049e+06 5.00369389e+02]
[-1.15000009e+07 -1.25847323e+03]
[-1.21100042e+06 4.77233315e+02]
[ nan nan]
[-2.73950108e+06 1.32425486e+03]
[-2.25100107e+06 1.44552143e+03]
[-5.30000343e+05 5.11881186e+02]
[-1.60300067e+06 8.55791445e+02]
[-8.10000531e+05 7.95423179e+02]
[-1.05000037e+06 4.30026161e+02]
[-2.50000100e+06 1.24738448e+03]
[-1.80000060e+06 6.68041584e+02]
[-1.02500065e+06 9.60192920e+02]
[-6.25000258e+05 3.25849888e+02]
[-2.73500082e+06 8.31620176e+02]
[-1.61150059e+06 7.01309677e+02]
[-1.40000006e+07 -2.47009155e+03]
[-1.76000055e+06 5.73112033e+02]
[-1.18500063e+06 8.75928816e+02]
[-1.02500038e+06 4.60190771e+02]
[-7.90000475e+05 6.92953262e+02]
[-1.15500043e+06 5.06728199e+02]
[ nan nan]
[-8.15000500e+05 7.34787698e+02]
[-1.13000043e+06 5.13893632e+02]
[-2.68100147e+06 2.08806798e+03]
[-7.95000360e+05 4.74318378e+02]
[-1.90000062e+06 6.69376315e+02]
[-9.50000351e+05 4.16691475e+02]
[-7.45000255e+05 2.88651712e+02]
[-1.75000064e+06 7.45373924e+02]
[-6.83500242e+05 2.80038435e+02]
[-8.25000435e+05 6.09520735e+02]
[-1.00000045e+06 5.91359038e+02]
[-1.14000038e+06 4.29629872e+02]
[-9.20000386e+05 4.90490776e+02]
[-1.17500050e+06 6.49196199e+02]
[-1.20000059e+06 8.05025255e+02]
[-5.00000318e+06 4.72676112e+03]
[-1.80000075e+06 9.59037127e+02]
[-9.15000381e+05 4.82122479e+02]
[-6.56000212e+05 2.29524020e+02]
[-3.70000099e+06 9.08405035e+02]
[-4.20000115e+06 1.07907716e+03]
[-1.30100043e+06 4.68839330e+02]
[-5.50000174e+05 1.86344469e+02]
[-1.18000066e+06 9.43561823e+02]
[-8.90000377e+05 4.81286351e+02]
[-1.97500088e+06 1.15987619e+03]
[-1.08000064e+06 9.31227134e+02]
[-1.30000037e+06 3.56363279e+02]
[-8.50000356e+05 4.52356605e+02]
[-1.12612546e+06 5.76936620e+02]
[-1.29000061e+06 8.21626454e+02]
[-9.50000425e+05 5.57693390e+02]
[-7.22000319e+05 4.15766689e+02]
[-8.03000402e+05 5.51108906e+02]
[-1.00000041e+06 5.14358474e+02]
[-8.25000469e+05 6.73522085e+02]
[-1.61150059e+06 7.01309677e+02]
[-1.70000076e+06 1.00470626e+03]
[-2.05000087e+06 1.12037740e+03]
[-9.15000370e+05 4.61124150e+02]
[-1.50000077e+06 1.06003254e+03]
[-5.71000071e+06 -1.61164325e+02]
[-2.52500220e+06 3.50722891e+03]
[-9.20000662e+05 1.01548879e+03]
[-1.20000035e+06 3.48031022e+02]
[-5.37500462e+05 7.34933236e+02]
[-8.40000621e+05 9.57621221e+02]
[-9.80000592e+05 8.65894752e+02]
[-9.30000253e+05 2.35221175e+02]
[-6.25000235e+05 2.80852506e+02]
[-4.00000051e+06 -7.85910660e+01]
[-3.99500080e+06 4.74040790e+02]
[-1.95000050e+06 4.36046102e+02]
[-1.00000389e+05 7.12341065e+02]
[-9.70000819e+05 1.29915509e+03]
[-1.15500043e+06 5.06728199e+02]
[-3.90000231e+06 3.36309129e+03]
[-9.00000233e+05 2.06020865e+02]
[-7.25000308e+05 3.93184370e+02]
[-8.25000473e+05 6.80520467e+02]
[-9.40000486e+05 6.74959473e+02]
[-6.20000441e+05 6.73482179e+02]
[-4.95000247e+05 3.39313771e+02]
[-1.60000095e+06 1.38537464e+03]
[-7.50000565e+05 8.75021809e+02]
[-1.49500074e+06 1.01267002e+03]
[-2.47000206e+06 3.25920161e+03]
[-8.22500360e+05 4.66835459e+02]
[-1.27500071e+06 1.01253292e+03]
[-5.37000263e+05 3.57192755e+02]
[-1.19500073e+06 1.06966247e+03]
[-1.24000068e+06 9.58960159e+02]
[-8.10000471e+05 6.81421845e+02]
[-9.65000503e+05 7.01790267e+02]
[-6.25000305e+05 4.13849556e+02]
[-7.19000401e+05 5.71346969e+02]
[-1.38500076e+06 1.07060034e+03]
[-1.31000055e+06 7.00099492e+02]
[-7.55000443e+05 6.42390335e+02]
[-9.30000687e+05 1.06022210e+03]
[-1.30000067e+06 9.31366910e+02]
[-2.15000103e+06 1.39572002e+03]
[-7.25000531e+05 8.18186120e+02]
[-7.75000359e+05 4.76855460e+02]
[-8.49000447e+05 6.24882605e+02]
[-5.59000212e+05 2.55609228e+02]
[-1.90000074e+06 9.05376105e+02]
[-1.32500085e+06 1.26020140e+03]
[-1.95000050e+06 4.36046102e+02]
[-1.45700039e+06 3.57678809e+02]
[-1.77000065e+06 7.67838339e+02]
[-1.12500035e+06 3.59526569e+02]
[-1.15000058e+06 8.00361262e+02]
[-6.50000113e+06 4.27796070e+02]
[-4.50000076e+06 2.54083462e+02]
[-2.72000067e+06 5.53524033e+02]
[-1.41800050e+06 5.83219175e+02]
[-8.25000311e+05 3.73519180e+02]
[-1.85300048e+06 4.32126187e+02]
[-4.00000114e+06 1.10542164e+03]
[-1.51000072e+06 9.72774593e+02]
[-3.90000091e+06 6.96080545e+02]
[-1.71000012e+07 -2.20571130e+03]
[-1.00000042e+06 5.43356773e+02]
[-3.65000063e+06 2.37735822e+02]
[-9.95000172e+06 6.53841092e+02]
[-4.35000201e+05 2.66910756e+02]
[-1.25000039e+06 4.11696557e+02]
[-2.00000051e+06 4.44708542e+02]
[-2.82500073e+06 6.39226774e+02]
[-1.70000074e+06 9.54702407e+02]
[-1.90500072e+06 8.60744775e+02]
[-1.95000055e+06 5.23038090e+02]
[-8.25000311e+05 3.72519183e+02]
[-8.00000231e+05 2.28686043e+02]
[-8.80000309e+05 3.54557395e+02]
[-1.15500033e+06 3.24726439e+02]
[-2.53200068e+06 6.26540030e+02]
[-6.50000109e+06 3.49784984e+02]
[-4.07500365e+05 5.85399214e+02]
[-3.80000085e+06 6.11744536e+02]
[-1.20000010e+07 -1.31978680e+03]
[-3.69999995e+06 -1.07360136e+03]
[-5.62700147e+06 1.31155559e+03]
[-6.55000100e+06 1.82462146e+02]
[-2.10000073e+06 8.42045475e+02]
[-1.42500063e+06 8.27537259e+02]
[-2.61000064e+06 5.26456608e+02]
[-7.05000090e+06 -1.48860135e+02]
[-1.15000030e+06 2.64361523e+02]
[-3.21000058e+06 2.48462944e+02]
[-2.66000061e+06 4.64118294e+02]
[-2.15000069e+06 7.42717771e+02]
[-7.99500110e+06 -1.05509351e+01]
[-6.41000180e+05 1.73424022e+02]
[-1.19900043e+06 5.07553032e+02]
[-3.60000283e+06 4.42911322e+03]
[-5.60000175e+06 1.84077042e+03]
[-1.26000040e+06 4.25427531e+02]
[-4.97500126e+06 1.08893128e+03]
[-2.41000069e+06 6.74786235e+02]
[-2.38890021e+07 -2.34511635e+03]
[-1.70000064e+06 7.61705584e+02]
[-7.87500145e+06 6.77640060e+02]
[-1.09950019e+07 7.29495682e+02]
[-5.99499996e+06 -1.65525150e+03]
[-3.75000019e+06 -6.33928254e+02]
[-8.30000294e+05 3.39888491e+02]
[-2.25000069e+06 7.16050843e+02]
[-1.46500039e+06 3.53464922e+02]
[-6.80000262e+05 3.18884019e+02]
[-1.13000036e+06 3.82895893e+02]
[-1.24500059e+06 7.86329314e+02]
[-8.85000459e+05 6.37921810e+02]
[-6.50000349e+06 4.91179488e+03]
[-1.02000036e+06 4.09824224e+02]
[-1.99500069e+06 7.82343205e+02]
[-8.90000358e+05 4.45288934e+02]
[-3.35000104e+06 1.08873523e+03]
[-9.90000389e+05 4.78624226e+02]
[-8.05000323e+05 4.02052807e+02]
[-1.25500042e+06 4.69062222e+02]
[-3.60000105e+06 1.04607399e+03]
[-1.25000041e+06 4.41695588e+02]
[-1.97500071e+06 8.29876760e+02]
[-2.52500100e+06 1.22521631e+03]
[-2.40000062e+06 5.36060627e+02]
[-2.22500097e+06 1.26421668e+03]
[-9.41750376e+05 4.65037032e+02]
[-1.30000046e+06 5.23361055e+02]
[-1.15000028e+06 2.22358324e+02]
[-1.80000094e+06 1.31404404e+03]
[-1.95000074e+06 8.98042818e+02]
[-3.45000067e+06 3.59070272e+02]
[-1.84000077e+06 9.69974109e+02]
[-4.15000213e+06 2.94441993e+03]
[-1.70000074e+06 9.54695066e+02]
[-3.75000122e+06 1.32007210e+03]
[-1.08300041e+06 4.91645709e+02]
[-1.90000068e+06 7.87378316e+02]
[-2.67500079e+06 7.91230527e+02]
[-7.75000124e+06 3.04473906e+02]
[-1.63500061e+06 7.24939055e+02]
[-1.07500041e+06 5.03857974e+02]
[-2.10000073e+06 8.31042728e+02]
[-1.52500061e+06 7.55871411e+02]
[-1.45000050e+06 5.66366354e+02]
[-1.16500049e+06 6.32462789e+02]
[-1.31000057e+06 7.29099897e+02]
[-1.15000052e+06 6.80358358e+02]
[-2.15000078e+06 9.10711332e+02]
[-2.90000086e+06 8.72723245e+02]
[-6.41500174e+06 1.62155299e+03]
[-8.60000308e+06 3.59081937e+03]
[-1.01175018e+07 8.46635381e+02]
[-1.10000041e+06 4.92689594e+02]
[-2.01000050e+06 4.19445469e+02]
[-2.07500042e+06 2.57207856e+02]
[-1.07500051e+06 6.85860643e+02]
[-7.15000320e+05 4.18450800e+02]
[-2.15000090e+06 1.15071714e+03]
[-1.40000082e+06 1.18070651e+03]
[-1.00000037e+06 4.42358746e+02]
[-3.20000085e+06 7.79736396e+02]
[-6.50000266e+05 3.34683539e+02]
[-9.95000610e+05 8.96992056e+02]
[-1.02500042e+06 5.22191447e+02]
[-7.40000094e+06 -1.72200080e+02]
[-7.81000310e+05 3.83690320e+02]
[-4.95000075e+06 1.18093644e+02]
[-5.88001921e+05 3.49236230e+03]
[-2.00000076e+06 9.19721485e+02]
[-5.65000114e+06 6.69442220e+02]
[-7.50000620e+05 9.80022322e+02]
[-1.73500078e+06 1.03127753e+03]
[-1.85000082e+06 1.07070498e+03]
[-8.65000136e+06 3.06486074e+02]
[-3.80000085e+06 6.11744536e+02]
[-6.50000109e+06 3.49784984e+02]
[-9.46000514e+05 7.26797769e+02]
[-1.80000076e+06 9.77044455e+02]
[-7.45000281e+05 3.37653974e+02]
[-3.15000133e+06 1.69107816e+03]
[-1.60000073e+06 9.62373106e+02]
[-1.73000090e+06 1.24690508e+03]
[-3.35000104e+06 1.08873523e+03]
[-1.78000081e+06 1.06957403e+03]
[-3.85000126e+06 1.37241552e+03]
[-2.01000116e+06 1.66844979e+03]
[-3.67500085e+06 6.52573174e+02]
[-1.45000065e+06 8.61368368e+02]
[-1.28000049e+06 5.93891394e+02]
[-9.50000426e+05 5.58688491e+02]
[-3.00000110e+06 1.30506067e+03]
[-8.90000387e+05 5.00292765e+02]
[-3.80000279e+06 4.29876503e+03]
[-3.15000140e+06 1.82607247e+03]
[-5.25000104e+06 6.01106210e+02]
[-1.62500028e+06 1.13202089e+02]
[-1.68500082e+06 1.11861228e+03]
[-1.26100074e+06 1.07391487e+03]
[-7.12500518e+05 7.96768790e+02]
[-1.74000086e+06 1.18363948e+03]
[-7.30000357e+05 4.85550757e+02]
[-8.40000320e+05 3.85620026e+02]
[-6.35000307e+05 4.15583023e+02]
[-8.80000359e+06 4.49049300e+03]
[-1.36800084e+06 1.22555410e+03]
[-1.02500096e+06 1.56019242e+03]
[-1.78200060e+06 6.61517999e+02]
[-1.20000081e+06 1.22402771e+03]
[-2.85500192e+06 2.89643477e+03]
[-9.60000394e+06 4.94419228e+03]
[-1.75000051e+06 5.13371784e+02]
[-1.70000074e+06 9.54702407e+02]
[-2.10000083e+06 1.01604937e+03]
[-8.85000376e+05 4.80922744e+02]
[-9.80000417e+05 5.33890202e+02]
[-1.65000031e+06 1.49037403e+02]
[-8.60000276e+06 2.97081705e+03]
[-7.30000228e+05 2.40551683e+02]
[-1.62900053e+06 5.72099207e+02]
[-1.45000080e+06 1.13236490e+03]
[-2.00000101e+06 1.39371076e+03]
[-1.77000065e+06 7.67838339e+02]
[-5.32000208e+06 2.55023014e+03]
[-2.00000070e+06 8.02712312e+02]
[-1.80000083e+06 1.10504642e+03]
[-6.70000352e+05 4.91149895e+02]
[-4.45000128e+06 1.25642084e+03]
[-1.58950079e+06 1.07990286e+03]
[-1.61000072e+06 9.52102065e+02]
[-2.51500109e+06 1.41548826e+03]
[-4.99900132e+06 1.18728965e+03]
[-6.55000186e+05 1.80050860e+02]
[-2.15000069e+06 7.42717771e+02]
[-3.35000194e+06 2.80673454e+03]
[-1.05000029e+06 2.77023382e+02]
[-8.95000424e+05 5.69655201e+02]
[-1.34900042e+06 4.34559451e+02]
[-1.30000080e+06 1.17136265e+03]
[-1.08500067e+06 9.86596109e+02]
[-1.34000061e+06 8.06301277e+02]
[-2.16500102e+06 1.35980990e+03]
[-9.65000281e+05 2.79790611e+02]
[-8.30000502e+05 7.34887680e+02]
[-9.50000939e+05 1.53369061e+03]
[-7.94000449e+05 6.43844390e+02]
[-6.50000277e+06 3.54478218e+03]
[-1.47500063e+06 8.10199798e+02]
[-2.05000134e+06 2.01038171e+03]
[-4.99000099e+06 5.72030316e+02]
[-5.00000149e+06 1.50876014e+03]
[-9.10000458e+05 6.30758537e+02]
[-7.50000105e+06 1.01425589e+01]
[-1.15000050e+06 6.46361844e+02]
[-1.80000080e+06 1.05204594e+03]
[-1.25000075e+06 1.09169962e+03]
[-1.27500062e+06 8.48533539e+02]
[-7.05000433e+05 6.36716502e+02]
[-8.72500461e+05 6.46505505e+02]
[-2.60000116e+06 1.52672146e+03]
[-1.10000037e+06 4.20693564e+02]
[-2.10000097e+06 1.28204694e+03]
[-1.19900043e+06 5.07553032e+02]
[-1.62500100e+06 1.47820308e+03]
[-2.71000073e+06 6.77793788e+02]
[-2.53200068e+06 6.26540030e+02]
[-3.21000058e+06 2.48462944e+02]
[-1.50000077e+06 1.06003254e+03]
[-1.90300029e+06 4.27993523e+01]
[-3.50005829e+04 1.09757852e+03]
[-7.20000459e+05 6.82816538e+02]
[-1.30000046e+06 5.39365034e+02]
[-3.80000279e+06 4.29876503e+03]
[-4.62100210e+06 2.76636720e+03]
[-7.70000562e+05 8.64489024e+02]
[-1.60000075e+06 1.00736747e+03]
[-4.00000168e+06 2.14540963e+03]
[-6.26000357e+05 5.12324978e+02]
[-1.85300048e+06 4.32126187e+02]
[-1.16600078e+06 1.17093655e+03]
[-3.35000373e+06 6.19278769e+03]
[-1.10000047e+06 6.07692175e+02]
[-4.00000113e+06 1.09340763e+03]
[-1.45700039e+06 3.57678809e+02]
[-1.77000065e+06 7.67838339e+02]
[-2.40000077e+06 8.32056547e+02]
[-7.45000014e+06 -1.69352958e+03]
[-1.30750030e+07 2.30405906e+03]
[-2.50000052e+06 3.30386011e+02]
[-4.15000071e+06 2.56414152e+02]
[-2.30000030e+06 -4.02878644e+01]
[-9.00000190e+06 1.23215198e+03]
[-3.50000192e+06 2.72376843e+03]
[-1.25000028e+07 2.09188138e+03]
[-1.30000050e+06 6.12362311e+02]
[-1.51000049e+06 5.29763976e+02]
[-1.67600019e+07 -8.26659911e+02]
[-1.20000016e+07 -2.08787587e+02]
[-2.98900122e+06 1.53084935e+03]
[-2.81000056e+06 3.29121021e+02]
[-2.52500165e+06 2.47022049e+03]
[-2.47500083e+06 9.21551970e+02]
[-1.15000049e+06 6.30358547e+02]
[-1.00900030e+06 2.96617831e+02]
[-7.45000014e+06 -1.69352958e+03]
[-2.40000062e+06 5.36044693e+02]
[-8.30000342e+05 4.29888151e+02]
[-1.41000037e+06 3.32431578e+02]
[-3.00000087e+06 8.70060097e+02]
[-3.50000112e+06 1.20173252e+03]
[-9.50000362e+06 4.37482971e+03]
[-4.15000071e+06 2.56414152e+02]
[-3.62500102e+06 9.90906119e+02]
[-2.50000064e+06 5.53384258e+02]
[-9.20000525e+05 7.55488182e+02]
[-2.50200099e+06 1.22234003e+03]
[-4.75000111e+06 8.56422746e+02]
[-3.40000034e+06 -2.55602169e+02]
[-2.90000037e+06 -6.72769005e+01]
[-5.00000262e+06 3.65976652e+03]
[-1.84500068e+06 7.96341161e+02]
[-1.23000053e+06 6.77231963e+02]
[-6.00000061e+06 -4.25890317e+02]
[-3.17500099e+06 1.04589681e+03]
[-9.65000382e+05 4.70812074e+02]
[-1.33000047e+06 5.49562287e+02]
[-1.00000013e+07 -1.24496790e+02]
[-9.50000182e+06 9.46827929e+02]
[-2.85000104e+06 1.21705736e+03]
[-2.72500105e+06 1.28288961e+03]
[-5.25000238e+06 3.13511444e+03]
[-2.01500055e+06 5.13808492e+02]
[-5.55000131e+06 1.02710488e+03]
[-4.65000138e+06 1.40110678e+03]
[-1.75000058e+06 6.38371312e+02]
[-1.57600071e+06 9.33028246e+02]
[-6.25000159e+06 1.36746268e+03]
[-9.49000623e+05 9.32216179e+02]
[-8.95000549e+05 8.07655211e+02]
[-1.25000055e+06 7.13700706e+02]
[-1.65000090e+06 1.27604208e+03]
[-3.19500116e+06 1.35937026e+03]
[-7.35000147e+06 8.50142342e+02]
[-1.34900050e+06 5.95558160e+02]
[-1.25000041e+06 4.41696443e+02]]
[159]: projected_data.shape
[159]: (439, 2)
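In the projection above, the first projected column is almost exactly the negative of lastsoldprice, and rows with missing inputs come out as nan. A minimal sketch of the usual PCA projection on hypothetical synthetic data, assuming the data are complete and are centered before multiplying by the top k eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical complete data: 100 rows, 4 features with different spreads
X = rng.normal(size=(100, 4)) * np.array([1.0, 5.0, 0.5, 2.0])

# Center each column so the projection measures variation around the mean
X_centered = X - X.mean(axis=0)

# Eigendecomposition of the covariance matrix, sorted by descending eigenvalue
cov = np.cov(X_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Keep the top k directions and project: (100, 4) @ (4, k) -> (100, k)
k = 2
projection_matrix = eigenvectors[:, :k]
scores = X_centered @ projection_matrix

print("shape:", scores.shape)
print("variance of each score column:", scores.var(axis=0, ddof=1))
print("top eigenvalues:              ", eigenvalues[:k])
```

A quick sanity check on any such projection is that the variance of each score column equals the corresponding eigenvalue; if the inputs still contain missing values, imputing them first avoids nan rows in the result.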
[122]: explained_variance_ratio = pca.explained_variance_ratio_
# Print the explained variance ratio values

print("Explained Variance Ratio:", explained_variance_ratio)
Explained Variance Ratio: [9.99999885e-01 1.14996251e-07 5.38924715e-13 1.62041663e-13
 7.34178924e-14 7.01994932e-18 4.54122976e-19]

# 'explained_variance_ratio' contains the explained variance ratio of each principal component
# Calculate cumulative explained variance ratio
cumulative_variance_ratio = np.cumsum(explained_variance_ratio)
# Plot scree plot
plt.plot(range(1, len(cumulative_variance_ratio) + 1),
         cumulative_variance_ratio, marker='o', linestyle='-')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Scree Plot')
plt.grid(True)
plt.show()
2. Biplot
[180]:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[180], line 3
1 import pandas as pd
2 import matplotlib.pyplot as plt
----> 3 from prince import PCA
5 # Assuming X contains your standardized data
6 pca = PCA(n_components=2)
ModuleNotFoundError: No module named 'prince'
[183]: pip install prince
Requirement already satisfied: prince in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages
(0.13.0)
Requirement already satisfied: altair<6.0.0,>=4.2.2 in
c:\users\lariy\appdata\local\programs\python\python310\lib\site-packages (from
prince) (5.3.0)
Requirement already satisfied: pandas<3.0.0,>=1.4.1 in
prince) (2.1.3)
Requirement already satisfied: scikit-learn<2.0.0,>=1.0.2 in
prince) (1.4.2)
Requirement already satisfied: jinja2 in
altair<6.0.0,>=4.2.2->prince) (3.1.2)
Requirement already satisfied: packaging in
altair<6.0.0,>=4.2.2->prince) (23.2)
Requirement already satisfied: jsonschema>=3.0 in
altair<6.0.0,>=4.2.2->prince) (4.20.0)
Requirement already satisfied: typing-extensions>=4.0.1 in
altair<6.0.0,>=4.2.2->prince) (4.7.1)
Requirement already satisfied: toolz in
altair<6.0.0,>=4.2.2->prince) (0.12.1)
Requirement already satisfied: numpy in
altair<6.0.0,>=4.2.2->prince) (1.26.2)
Requirement already satisfied: pytz>=2020.1 in
pandas<3.0.0,>=1.4.1->prince) (2023.3.post1)
Requirement already satisfied: python-dateutil>=2.8.2 in
pandas<3.0.0,>=1.4.1->prince) (2.8.2)
Requirement already satisfied: tzdata>=2022.1 in
pandas<3.0.0,>=1.4.1->prince) (2023.3)
Requirement already satisfied: scipy>=1.6.0 in
scikit-learn<2.0.0,>=1.0.2->prince) (1.12.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in
Requirement already satisfied: joblib>=1.2.0 in
Requirement already satisfied: attrs>=22.2.0 in
jsonschema>=3.0->altair<6.0.0,>=4.2.2->prince) (23.1.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in
Requirement already satisfied: referencing>=0.28.4 in
Requirement already satisfied: rpds-py>=0.7.1 in
Requirement already satisfied: six>=1.5 in
python-dateutil>=2.8.2->pandas<3.0.0,>=1.4.1->prince) (1.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in
jinja2->altair<6.0.0,>=4.2.2->prince) (2.1.3)
Note: you may need to restart the kernel to use updated packages.
[notice] A new release of pip available: 22.2.2 -> 24.0

[notice] To update, run: python.exe -m pip install --upgrade pip
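An import failure followed by "Requirement already satisfied" can happen when pip and the notebook kernel point at different Python installations, or when the kernel has not been restarted since the install. The snippet below is a small diagnostic sketch using only the standard library; `module_available` is a hypothetical helper name.

```python
import importlib.util
import sys

def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None

# The interpreter the kernel is actually running
print("Kernel interpreter:", sys.executable)
print("prince importable: ", module_available("prince"))
```

If the module is reported as unavailable, installing with the same interpreter (for example by running `python -m pip install prince` with the executable printed above) and restarting the kernel is one way to align the two.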

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from prince import PCA

# Impute missing values with the column mean (replace 'mean' with 'median' or
# 'most_frequent' if desired)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(df_scaled)

# Wrap the imputed ndarray in a DataFrame
X_df = pd.DataFrame(X_imputed)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_df)

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA  # scikit-learn's PCA exposes components_

# X_imputed contains the standardized, imputed data
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_imputed)

# Extract loadings of each feature for the first two principal components
loadings = pca.components_.T[:, :2]

# Plot the first two principal components
plt.scatter(principal_components[:, 0], principal_components[:, 1], alpha=0.5)

# Plot the loadings as vectors, one arrow per feature
for i, feature in enumerate(df_scaled.columns):
    plt.arrow(0, 0, loadings[i, 0], loadings[i, 1], color='r', alpha=0.5)
    plt.text(loadings[i, 0], loadings[i, 1], str(feature), color='g', fontsize=10,
             ha='center', va='center')

# Set labels and title
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Biplot of Principal Components')
plt.grid(True)
plt.show()
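The arrows in a biplot are often drawn from eigenvectors scaled by the square root of their eigenvalues, so that for standardized inputs each arrow coordinate approximates the correlation between a feature and a component. The sketch below shows that scaling on synthetic data; the data, seed and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500

# Two correlated columns driven by a shared factor, plus one independent column
shared = rng.normal(size=n)
X_demo = np.column_stack([
    shared + 0.3 * rng.normal(size=n),
    shared + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
])

X_std = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2)
scores = pca_demo.fit_transform(X_std)

# Scale each eigenvector (column) by the square root of its eigenvalue
scaled_loadings = pca_demo.components_.T * np.sqrt(pca_demo.explained_variance_)
print(scaled_loadings)
```

Each row of `scaled_loadings` could replace the raw `loadings` row when drawing the arrows, which keeps arrow lengths comparable across components.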
import numpy as np
import pandas as pd
from prince import PCA

# Wrap the feature ndarray X in a DataFrame
X_df = pd.DataFrame(X)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_df)
3.Pairplot
[130]: import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

sns.pairplot(df)
plt.show()
[132]: from sklearn.impute import SimpleImputer

# Create the imputer before using it (mean imputation)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
# Now proceed with PCA using X_imputed as the input
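As a quick check of what mean imputation does, the sketch below runs `SimpleImputer` on a tiny hand-made array; the array values are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Small array with one missing value in each column
X_demo = np.array([[1.0, np.nan],
                   [3.0, 4.0],
                   [np.nan, 6.0]])

demo_imputer = SimpleImputer(strategy='mean')
X_filled = demo_imputer.fit_transform(X_demo)

# Column means learned during fit (missing entries are ignored)
print(demo_imputer.statistics_)
print(X_filled)
```

Each missing entry is replaced by the mean of the observed values in its own column, here 2.0 for the first column and 5.0 for the second.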
[135]: df
[135]: bathrooms bedrooms finishedsqft lastsoldprice latitude longitude \
0 2.0 2 1463.0 1950000 37.795139 -122.425309
1 3.5 3 3291.0 4200000 37.794429 -122.428513
2 1.0 1 653.0 665000 37.792472 -122.425281
3 2.5 2 2272.0 2735000 37.794706 -122.426347
4 1.0 1 837.0 1050000 37.793212 -122.423744
.. … … … … … …
434 2.0 3 2145.0 1650000 37.795777 -122.433024
435 3.5 4 3042.0 3195000 37.795330 -122.436540
436 7.5 6 4721.0 7350000 37.795246 -122.437490
437 1.0 2 1306.0 1349000 37.796588 -122.429641
438 1.0 1 1100.0 1250000 37.795255 -122.432880
totalrooms
0 7
1 7
2 3
3 6
4 3
.. …
434 8
435 10
436 13
437 5
438 5

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

# Separate features (X) and target variable (y)
X = df.drop(columns=['lastsoldprice'])
y = df['lastsoldprice']

# Handle missing values using imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)

# Perform PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_imputed)

# Create a DataFrame for the principal components
pc_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

# Create a scatterplot matrix with both principal components
combined_df = pd.concat([pc_df, y], axis=1)
sns.pairplot(combined_df, hue='lastsoldprice')
plt.show()
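Using a continuous column as `hue` gives one colour per distinct price, which is hard to read. One option is to bin the price into quantile bands first. The sketch below uses `pd.qcut` on a handful of sale prices copied from the rows displayed in this notebook; the choice of four bands and the labels are assumptions.

```python
import pandas as pd

# A few sale prices taken from the rows shown in the notebook output
prices = pd.Series(
    [665000, 1050000, 1250000, 1349000, 1650000, 1950000, 2735000, 4200000],
    name='lastsoldprice',
)

# Split into four equally populated bands (quartiles)
price_band = pd.qcut(prices, q=4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
band_counts = price_band.value_counts().sort_index()
print(band_counts)
```

A column built this way could be added to the principal-component DataFrame and passed as `hue`, giving four colours instead of hundreds.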

import seaborn as sns
import matplotlib.pyplot as plt

# principal_components holds the PCA scores computed above
# Create a heatmap of the principal components
sns.heatmap(principal_components, cmap='coolwarm', annot=True, fmt=".2f",
            cbar=True)
plt.xlabel('Principal Component')
plt.ylabel('Data Point')
plt.title('Heatmap of Principal Components')
plt.show()
[ ]:
Score plot
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# The data is stored in a DataFrame called 'df'
# Impute missing values
imputer = SimpleImputer(strategy='mean')  # You can change the strategy as needed
imputed_data = imputer.fit_transform(df)

# Standardize the data
scaler = StandardScaler()
scaled_data = scaler.fit_transform(imputed_data)

# Perform PCA
pca = PCA(n_components=2)  # Reduce to 2 components for a 2D plot
principal_components = pca.fit_transform(scaled_data)

# Plot the score plot with quadrants
pc_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
plt.scatter(pc_df['PC1'], pc_df['PC2'])
plt.axhline(y=0, color='k', linestyle='--')  # Add horizontal line at y=0
plt.axvline(x=0, color='k', linestyle='--')  # Add vertical line at x=0
plt.text(0.5, 0.5, 'I', fontsize=12, ha='center')
plt.text(-0.5, 0.5, 'II', fontsize=12, ha='center')
plt.text(-0.5, -0.5, 'III', fontsize=12, ha='center')
plt.text(0.5, -0.5, 'IV', fontsize=12, ha='center')
plt.xlabel('PC 1')
plt.ylabel('PC 2')
plt.title('Score Plot')
plt.grid(True)
plt.show()
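The quadrant labels on the score plot can be backed by counts. The helper below is a minimal sketch that tallies how many points fall in each quadrant; `quadrant_counts` and the example scores are hypothetical, and points lying exactly on an axis are not counted.

```python
import numpy as np

def quadrant_counts(pc1, pc2):
    """Count points per quadrant of the PC1/PC2 plane (axes excluded)."""
    pc1 = np.asarray(pc1)
    pc2 = np.asarray(pc2)
    return {
        'I': int(np.sum((pc1 > 0) & (pc2 > 0))),
        'II': int(np.sum((pc1 < 0) & (pc2 > 0))),
        'III': int(np.sum((pc1 < 0) & (pc2 < 0))),
        'IV': int(np.sum((pc1 > 0) & (pc2 < 0))),
    }

# Made-up scores for illustration
demo_counts = quadrant_counts([1.2, -0.4, -2.0, 0.7, 3.1],
                              [0.5, 0.9, -1.1, -0.3, 2.2])
print(demo_counts)
```

With the notebook's data, the call would be `quadrant_counts(pc_df['PC1'], pc_df['PC2'])`.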
0.3 ADDITIONAL EXPLORATION
[148]: from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

# Note: the model is evaluated on df_scaled, not on PCA-reduced data

# Initialize the chosen model (any other regressor can be substituted)
model = RandomForestRegressor()

# Perform 5-fold cross-validation
scores = cross_val_score(model, df_scaled, y, cv=5,
                         scoring='neg_mean_squared_error')

# Print the cross-validation score (scikit-learn returns negated MSE, so flip the sign)
print("Cross-validation Mean Squared Error:", -scores.mean())
Cross-validation Mean Squared Error: 195630660332.1332
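The MSE above is in squared dollars; its square root, roughly 442,000, is easier to interpret as a typical error in dollars. The sketch below shows one way to cross-validate a model on PCA-reduced features while reporting RMSE, with scaling and PCA placed inside a pipeline so they are refit on each training fold. It runs on synthetic data from `make_regression`; the dataset, the component count and the model settings are assumptions. With the notebook's data, the feature matrix passed in should not contain the target column, so if `df_scaled` still includes the scaled `lastsoldprice` it would need to be dropped first.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the housing features
X_demo, y_demo = make_regression(n_samples=200, n_features=6,
                                 noise=10.0, random_state=0)

# Scaling and PCA are fitted inside each cross-validation fold
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5,
                            scoring='neg_mean_squared_error')

# Convert the negated MSE into an RMSE in the units of the target
rmse = np.sqrt(-cv_scores.mean())
print("Cross-validation RMSE:", rmse)
```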

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

# Create an imputer object
imputer = SimpleImputer(strategy='mean')  # You can replace 'mean' with 'median' or 'most_frequent'

# Fit the imputer to the standardized data and transform it
X_imputed = imputer.fit_transform(df_scaled)

# Perform PCA
pca = PCA(n_components=2)  # Choose the number of components you want
pca.fit(X_imputed)

# Get the loadings (coefficients) for each PC
loadings = pca.components_

# Create a DataFrame to display the loadings
loadings_df = pd.DataFrame(loadings, columns=df_scaled.columns,
                           index=[f'PC{i+1}' for i in range(pca.n_components_)])

# Display the loadings DataFrame
print("Loadings (Coefficients) for Each Principal Component:")
print(loadings_df)
Loadings (Coefficients) for Each Principal Component:

0 1 2 3 4 5 6
PC1 0.443793 0.414535 0.467587 0.420565 0.017968 -0.195893 0.443845
PC2 -0.092340 -0.005345 -0.067513 -0.077041 -0.750017 -0.646715 -0.013622
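The loadings are printed with integer column labels, which makes them hard to interpret. The sketch below re-labels the printed values with feature names and picks the feature with the largest absolute loading on each component. It assumes that columns 0 to 6 follow the order of `df.columns` shown earlier in the notebook; that mapping is an assumption, not something the output confirms.

```python
import pandas as pd

# Assumed column order, taken from the df.columns output earlier in the notebook
feature_names = ['bathrooms', 'bedrooms', 'finishedsqft', 'lastsoldprice',
                 'latitude', 'longitude', 'totalrooms']

# Loadings copied from the printed output above
named_loadings = pd.DataFrame(
    [[0.443793, 0.414535, 0.467587, 0.420565, 0.017968, -0.195893, 0.443845],
     [-0.092340, -0.005345, -0.067513, -0.077041, -0.750017, -0.646715, -0.013622]],
    index=['PC1', 'PC2'],
    columns=feature_names,
)

# Feature with the largest absolute loading on each component
top_features = named_loadings.abs().idxmax(axis=1)
print(top_features)
```

Under that assumed ordering, PC1 is weighted fairly evenly across the size-related columns, while PC2 is dominated by latitude and longitude.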
[164]: df
[164]: bathrooms bedrooms finishedsqft lastsoldprice latitude longitude \
0 2.0 2 1463.0 1950000 37.795139 -122.425309
1 3.5 3 3291.0 4200000 37.794429 -122.428513
2 1.0 1 653.0 665000 37.792472 -122.425281
3 2.5 2 2272.0 2735000 37.794706 -122.426347
4 1.0 1 837.0 1050000 37.793212 -122.423744
.. … … … … … …
434 2.0 3 2145.0 1650000 37.795777 -122.433024
435 3.5 4 3042.0 3195000 37.795330 -122.436540
436 7.5 6 4721.0 7350000 37.795246 -122.437490
437 1.0 2 1306.0 1349000 37.796588 -122.429641
438 1.0 1 1100.0 1250000 37.795255 -122.432880
totalrooms
0 7
1 7
2 3
3 6
4 3
.. …
434 8
435 10
436 13
437 5
438 5