
Assignment AI/ML

NAME: SHELLY SHARMA


CLASS: BTECH IT B
BATCH: B1
ROLL NO.: 2016820
ID: BTBTI20249

Submitted to: Dr Urvashi Prakash Shukla


The Boston Housing Dataset
Question: A linear regression model needs to be designed to predict the MEDV value for each record in the
Boston Housing dataset.

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning
housing in the area of Boston, MA. The following describes the dataset columns:

● CRIM - per capita crime rate by town


● ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
● INDUS - proportion of non-retail business acres per town.
● CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
● NOX - nitric oxides concentration (parts per 10 million)
● RM - average number of rooms per dwelling
● AGE - proportion of owner-occupied units built prior to 1940
● DIS - weighted distances to five Boston employment centres
● RAD - index of accessibility to radial highways
● TAX - full-value property-tax rate per $10,000
● PTRATIO - pupil-teacher ratio by town
● LSTAT - % lower status of the population
● MEDV - Median value of owner-occupied homes in $1000's
ANSWER

LOGIC EXPLANATION:

Step-1: We have imported libraries such as pandas and NumPy to use their functionality in this
assignment. We have also imported LinearRegression, which uses the relationship between the data
points to draw a straight line through them; this line can then be used to predict future values.
Matplotlib is a cross-platform data visualization and graphical plotting library for Python and its
numerical extension NumPy. As such, it offers a viable open-source alternative to MATLAB.
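
A minimal sketch of the imports this step refers to (the exact import list of the original notebook is assumed):

# Libraries for data handling, numerics, plotting and linear regression
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression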

Step-2: We have defined the column names as per the dataset provided.

Then we have loaded the dataset into the variable df.
We have also taken a variable x to store the column MEDV.
Step-3: The isnull() method returns a DataFrame object where every value is replaced with a Boolean
value: True for NULL values and False otherwise.
Similarly, the isna() method returns a DataFrame object where every value is replaced with a Boolean
value: True for NA (not-a-number) values and False otherwise.
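
A sketch of Steps 2 and 3; the file name housing.csv and the read parameters are assumptions, as the report does not show them:

# Column names as per the dataset description
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
           'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV']

# Load the dataset into the variable df (file name assumed)
df = pd.read_csv('housing.csv', header=None, names=columns)

# Store the MEDV column in x, as described in Step-2
x = df['MEDV']

# Check for missing values; isnull() and isna() return Boolean DataFrames
print(df.isnull().sum())
print(df.isna().sum())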

OUTPUT:
LOGIC EXPLANATION:

Step-4: We have normalized our dataset.


normalize is a function in the sklearn.preprocessing package. Normalization rescales each sample of
the input data set so that it has unit norm. At last, we have printed our normalized data.
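
The exact scaling call is not shown in the report; the following sketch assumes sklearn.preprocessing.normalize was used (which scales each row to unit norm):

from sklearn import preprocessing

# normalize() returns an array in which every sample (row) has unit norm
normalized = preprocessing.normalize(df)
normalized_df = pd.DataFrame(normalized, columns=columns)
print(normalized_df)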

OUTPUT:
LOGIC EXPLANATION:
Step-5: We have calculated the correlation of the column ['MEDV'] with all the other columns.
The corr() method calculates the relationship between each pair of columns in the data set. The result
of the corr() method is a table of numbers that represent how strong the relationship between two
columns is; each number varies from -1 to 1.
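
A sketch of this step: selecting the MEDV column of the correlation matrix gives the correlation of every feature with MEDV.

# Correlation of each column with MEDV, sorted for easier comparison
corr_with_medv = df.corr()['MEDV'].sort_values(ascending=False)
print(corr_with_medv)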

OUTPUT:
After comparing the correlation of each column with MEDV, we have concluded that LSTAT has the
strongest correlation with MEDV. That is why we will work further with the columns LSTAT and MEDV to
build the prediction model for our dataset.
LOGIC EXPLANATION:

Step-6: We have defined a variable y which stores the column values of LSTAT.
Further, we plotted a graph between MEDV and LSTAT, where the X-axis denotes the MEDV values and the
Y-axis denotes the LSTAT values.
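
A sketch of the plot in Step-6, following the report's axis assignment (MEDV on the X-axis, LSTAT on the Y-axis):

# Store the LSTAT column in y, as described in Step-6
y = df['LSTAT']

# Scatter plot of MEDV against LSTAT
plt.scatter(x, y)
plt.xlabel('MEDV')
plt.ylabel('LSTAT')
plt.show()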

OUTPUT:
LOGIC EXPLANATION :

Step-7: We have taken x as the MEDV values and y as the LSTAT values, and using these two variables we
have calculated the mean of each (x_mean and y_mean).

Now we have calculated the slope as b1 and the intercept as b0 and then printed the result. We have
stored the values of the resulting linear equation in the variable y_pred. After this we have used the
plt.scatter() function to plot the data points and overlaid the line given by y_pred.
From the graph we can infer that the scattered points lie close to the plotted line, indicating that the
fit has high accuracy.
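
A sketch of the closed-form least-squares fit in this step, keeping the report's variable roles (x holds MEDV, y holds LSTAT):

# Means of the two variables
x_mean = x.mean()
y_mean = y.mean()

# Slope b1 and intercept b0 of the least-squares line
b1 = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
b0 = y_mean - b1 * x_mean
print('slope b1 =', b1, 'intercept b0 =', b0)

# Predicted values from the fitted line
y_pred = b0 + b1 * x

# Scatter plot of the data with the regression line overlaid
plt.scatter(x, y)
plt.plot(x, y_pred, color='red')
plt.xlabel('MEDV')
plt.ylabel('LSTAT')
plt.show()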
OUTPUT:
LOGIC EXPLANATION :

Step-8: We have imported mean_absolute_error from sklearn.metrics to calculate the Mean Absolute
Error. The Mean Absolute Error is the average absolute difference between the predicted values and the
actual values.

Step-9: We have imported mean_squared_error from sklearn.metrics to calculate the Mean Squared
Error. The Mean Squared Error of an estimator measures the average of the squared errors, i.e. the
average squared difference between the estimated values and the true values.

Step-10: We calculated the Root Mean Squared Error.

RMSE is the square root of the value obtained from the Mean Squared Error function. It expresses the
typical difference between the estimated and actual values of the target, in the same units as the target.

Step-11: We have imported r2_score from sklearn.metrics to calculate the R-squared value for our data. It
is used to evaluate the performance of a linear regression model.

Step-12: We have calculated the Adjusted R-squared of our data.


The Adjusted R-squared takes into account the number of independent variables used for predicting the
target variable.
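
A sketch of the metric calculations in Steps 8-12, computed on the predictions y_pred from Step-7; n is the number of observations and p the number of predictors (1 here):

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y, y_pred)    # Step-8: Mean Absolute Error
mse = mean_squared_error(y, y_pred)     # Step-9: Mean Squared Error
rmse = np.sqrt(mse)                     # Step-10: Root Mean Squared Error
r2 = r2_score(y, y_pred)                # Step-11: R-squared

# Step-12: Adjusted R-squared with n observations and p = 1 predictor
n, p = len(y), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print('MAE =', mae, 'MSE =', mse, 'RMSE =', rmse)
print('R-squared =', r2, 'Adjusted R-squared =', adj_r2)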

OUTPUT:
LOGIC EXPLANATION:

Step-13: We have used the reshape() function, which allows us to reshape an array in Python. Reshaping
basically means changing the shape of an array, and the shape of an array is determined by the number of
elements in each dimension. Reshaping allows us to add or remove dimensions of an array.

LinearRegression fits a linear model with coefficients chosen to minimize the residual sum of squares
between the observed targets in the dataset and the targets predicted by the linear approximation.
Regression analysis is a form of predictive modelling technique which investigates the relationship
between a dependent (target) variable and an independent (predictor) variable. This technique is used for
forecasting, time series modelling and finding the cause-and-effect relationship between variables.
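
A sketch of Step-13; the reshape gives scikit-learn the 2-D input it expects, and the prediction variable name m_pred is taken from the conclusion (its exact definition in the report is assumed):

# scikit-learn expects a 2-D array of shape (n_samples, n_features)
X = x.values.reshape(-1, 1)

# Fit the linear model and predict on the same data
model = LinearRegression()
model.fit(X, y)
m_pred = model.predict(X)

print('coefficient =', model.coef_, 'intercept =', model.intercept_)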

OUTPUT:
CONCLUSION:

After processing the whole dataset, we came to the conclusion that LSTAT has the strongest correlation
with MEDV. The plotted graph conveys the same conclusion about the accuracy of the fit.
Also, at last we have shown that the m_pred and m_predicted values came out to be the same, and hence we
have designed a single linear regression model to predict the MEDV value.
