

Experiment Number: 3

Aim:- Study of Linear Regression in Machine Learning using the Boston
Housing Dataset.
1) MACHINE LEARNING :- Machine Learning is the field of study that gives
computers the capability to learn without being explicitly programmed.
 ML is one of the most exciting technologies one could come across. As the
name suggests, it gives computers an ability that makes them more similar to
humans: the ability to learn.
 Machine learning is actively being used today, perhaps in many more places
than one would expect.

a) SUPERVISED LEARNING :- Supervised learning is where you have input
variables (x) and an output variable (Y), and you use an algorithm to learn the
mapping function from the input to the output, Y = f(X).
 The goal is to approximate the mapping function so well that when you have new
input data (x), you can predict the output variable (Y) for that data, as sketched
below.
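 A minimal sketch of this idea on toy labeled data (the values below are
hypothetical, not from this experiment):
CODE:-
import numpy as np
from sklearn.linear_model import LinearRegression
# Toy labeled pairs (x, Y): Y is roughly 3x + 2 with a little noise
x = np.array([[1], [2], [3], [4], [5]])
Y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])
model = LinearRegression()
model.fit(x, Y)               # learn the mapping Y = f(x)
print(model.predict([[6]]))   # predict Y for the new input x = 6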

i. LINEAR REGRESSION :- A data model explicitly describes a relationship
between predictor and response variables.
 Regression algorithms are used to predict a continuous numerical output. For
example, a regression algorithm could be used to predict the price of a house
based on its size, location, and other features.
 Linear regression fits a data model that is linear in the model coefficients. The
most common type of linear regression is a least-squares fit, which can fit
both lines and polynomials, among other linear models (see the sketch after
this list).
 Linear regression is not merely a predictive tool; it forms the basis for
various advanced models. Techniques like regularization and support vector
machines draw inspiration from linear regression, expanding its utility.
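 A minimal sketch of the least-squares idea on hypothetical one-variable data
(np.polyfit computes exactly this kind of fit):
CODE:-
import numpy as np
# Hypothetical points lying roughly on a line
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
# Least-squares line: the (m, c) that minimize sum((y - (m*x + c))**2)
m, c = np.polyfit(x, y, deg=1)
# The same slope from the closed-form formula cov(x, y) / var(x)
m_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
print(m, c, m_manual)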
ii. CLASSIFICATION (DATA MINING) :- Data mining, in general terms,
means mining or digging deep into data that is in different forms to find
patterns and to gain knowledge from those patterns.
 In the process of data mining, large data sets are first sorted, then patterns
are identified and relationships are established to perform data analysis and
solve problems.
 Classification is a task in data mining that involves assigning a class label to
each instance in a dataset based on its features.
 The goal of classification is to build a model that accurately predicts the
class labels of new instances based on their features, as in the sketch below.
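 A minimal classification sketch on hypothetical instances and labels (a
decision tree is used here only as one possible classifier):
CODE:-
from sklearn.tree import DecisionTreeClassifier
# Hypothetical instances (two features each) and their class labels
X = [[1, 0], [2, 1], [8, 7], [9, 8]]
labels = ['small', 'small', 'large', 'large']
clf = DecisionTreeClassifier()
clf.fit(X, labels)             # build the model from labeled instances
print(clf.predict([[7, 6]]))   # assign a class label to a new instance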

b) UNSUPERVISED LEARNING :- Unsupervised learning is a branch of machine
learning that deals with unlabeled data.
 Unlike supervised learning, where the data is labeled with a specific category or
outcome, unsupervised learning algorithms are tasked with finding patterns and
relationships within the data without any prior knowledge of the data's meaning.
 This makes unsupervised learning a powerful tool for exploratory data analysis,
where the goal is to understand the underlying structure of the data.
i. CLUSTERING :- Clustering is a type of unsupervised learning method, one in
which we draw references from datasets consisting of input data without
labeled responses.
 Generally, it is used as a process to find meaningful structure, explanatory
underlying processes, generative features, and groupings inherent in a set of
examples.
 Clustering is the task of dividing the population or data points into a number
of groups such that data points in the same group are more similar to each
other than to data points in other groups.
 It is basically a collection of objects grouped on the basis of similarity and
dissimilarity between them, as in the sketch below.
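 A minimal clustering sketch on hypothetical unlabeled points (k-means is used
here only as one common clustering algorithm):
CODE:-
import numpy as np
from sklearn.cluster import KMeans
# Hypothetical unlabeled points forming two rough groups
points = np.array([[1, 2], [1, 1], [2, 2], [8, 8], [9, 9], [8, 9]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = kmeans.fit_predict(points)   # no labels supplied; the groups are discovered
print(groups)                         # e.g. [0 0 0 1 1 1]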

2) Applying Linear Regression using the Boston Housing Dataset:-

 There are 506 samples and 13 feature variables in this dataset. The objective is to
predict house prices using the given features.
 The description of all the features of the Boston Housing Dataset is given below:-
i) CRIM :- Per capita crime rate by town
ii) ZN :- Proportion of residential land zoned for lots over 25,000 sq. ft.
iii) INDUS :- Proportion of non-retail business acres per town
iv) CHAS :- Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
v) NOX :- Nitric oxide concentration (parts per 10 million)
vi) RM :- Average number of rooms per dwelling
vii) AGE :- Proportion of owner-occupied units built prior to 1940
viii) DIS :- Weighted distances to five Boston employment centers
ix) RAD :- Index of accessibility to radial highways
x) TAX :- Full-value property tax rate per $10,000
xi) PTRATIO :- Pupil-teacher ratio by town
xii) B :- 1000(Bk − 0.63)², where Bk is the proportion of Black residents by town
xiii) LSTAT :- Percentage of lower status of the population
xiv) MEDV :- Median value of owner-occupied homes in $1000s

i. Importing all the required libraries :-

CODE:-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
ii. Loading the data into a DataFrame and printing the first five rows:-
CODE:-
df = pd.read_csv("boston-housing-dataset.csv")
df.head()
OUTPUT:-
iii. Printing the statistical description:-
CODE:-
df.describe()
OUTPUT:-

iv. Printing the shape and datatypes of the dataset:-

CODE:-
print(df.shape)
df.dtypes
OUTPUT:-
v. Printing the information of the dataset:-
CODE:-
df.info()
OUTPUT:-

vi. Counting the missing values for each feature in the dataset:-


CODE:-
df.isna().sum()
OUTPUT:-
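If any feature did report missing values, a common follow-up (a sketch, not part
of the original steps) is to impute them, for example with each column's median:
CODE:-
# Only needed if df.isna().sum() showed nonzero counts
df = df.fillna(df.median(numeric_only=True))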
vii. Creating the target variable and separating it from the input features:-

CODE:-
target_feature = 'MEDV'
y = df[target_feature]
x = df.drop(target_feature, axis=1)
x.head()
y.head()
OUTPUT:-

viii. Splitting the dataset using train_test_split and training the Linear
Regression model:-


CODE:-
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=7)

from sklearn.linear_model import LinearRegression
regression = LinearRegression()
regression.fit(x_train, y_train)

# train score
train_score = round(regression.score(x_train, y_train) * 100, 2)
print("train score of Linear Regression: ", train_score)
OUTPUT:-
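As a complementary check (a sketch, not part of the original write-up), the fitted
model can also be scored on the held-out test split using sklearn.metrics:
CODE:-
from sklearn.metrics import mean_squared_error, r2_score
test_pred = regression.predict(x_test)
test_score = round(regression.score(x_test, y_test) * 100, 2)   # R^2 on unseen data
print("test score of Linear Regression: ", test_score)
print("MSE on test set:", mean_squared_error(y_test, test_pred))
print("R2 on test set:", r2_score(y_test, test_pred))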

ix. Printing the shape, size, and datatypes of the predictions and the test set:-
CODE:-
# predict on the held-out test features so the result lines up with y_test
y_pred = regression.predict(x_test)
print(y_pred.shape)
print(y_test.shape)
print(f"y_pred size: {y_pred.size}")
print(f"y_test size: {y_test.size}")
print(f"y_pred data type: {type(y_pred)}")
print(f"y_test data type: {type(y_test)}")
OUTPUT:-

x. Creating and calculating the Variance, Actual, and Predicted values:-

CODE:-
# y_pred was computed on x_test, so it already lines up with y_test
df1 = pd.DataFrame({'Actual': np.array(y_test), 'Predicted': y_pred})
# Calculate the variance as a separate step
df1['Variance'] = df1['Actual'] - df1['Predicted']
df1.head()
OUTPUT:-

xi. Printing the first 14 rows of the dataset:-


CODE:-
df.head(14)
OUTPUT:-

xii. Printing the intercept and coefficient values:-

CODE:-
print(regression.intercept_)
print(regression.coef_)

OUTPUT:-
xiii. Creating a new DataFrame and getting the linear coefficients:-
CODE:-
lr_coefficient = pd.DataFrame()
lr_coefficient["columns"] = x_train.columns
lr_coefficient["coefficient Estimate"] = pd.Series(regression.coef_)
print(lr_coefficient)
OUTPUT:-

xiv. Plotting the bar chart of coefficients using the Matplotlib library:-
CODE:-
fig, ax = plt.subplots(figsize=(20, 10))
ax.bar(lr_coefficient["columns"], lr_coefficient["coefficient Estimate"])
ax.spines["bottom"].set_position("zero")
plt.style.use("ggplot")
plt.grid()
plt.show()

fig, ax = plt.subplots(figsize=(20, 10))
x_ax = range(len(x_test))
plt.scatter(x_ax, y_test, s=30, color='green', label='original')
plt.scatter(x_ax, y_pred, s=30, color='red', label='predicted')
plt.legend()
# plt.grid()
plt.show()
OUTPUT:-
xv. Plotting the original and predicted values using scatter and line
plots:-
CODE:-
fig, ax = plt.subplots(figsize=(20, 10))
x_ax = range(len(x_test))
plt.scatter(x_ax, y_test, s=30, color='green', label='original')
plt.plot(x_ax, y_pred, lw=0.8, color='red', label='predicted')
plt.legend()
# plt.grid()
plt.show()

OUTPUT:-

xvi. Using scatter plots to see how the features vary with MEDV:-
CODE:-
plt.figure(figsize=(20, 5))
features = ['LSTAT', 'RM']
target = df['MEDV']
for i, col in enumerate(features):
    plt.subplot(1, len(features), i + 1)
    x = df[col]
    y = target
    plt.scatter(x, y, marker='o')
    plt.title(col)
    plt.xlabel(col)
    plt.ylabel('MEDV')
plt.show()
OUTPUT:-

xvii. Distribution of the target variable (MEDV):-

CODE:-
sns.set(rc={'figure.figsize': (11, 8)})
sns.distplot(df['MEDV'], bins=30)
plt.show()
OUTPUT:-
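Note:- sns.distplot is deprecated in seaborn 0.11 and later; sns.histplot draws
the equivalent figure:
CODE:-
sns.histplot(df['MEDV'], bins=30, kde=True)
plt.show()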
