DSBDL - Write-Ups - 4 To 7
ASSIGNMENT NO : 4
TITLE: Create a Linear Regression Model using Python/R to predict home prices using the
Boston Housing Dataset
PROBLEM STATEMENT:
The Boston Housing dataset (https://www.kaggle.com/c/boston-housing) contains information about
various houses in Boston through different parameters. There are 506 samples and 13 feature
variables in this dataset, plus the target variable (the median home value).
The objective is to predict the price of a house using the given features.
Objective of the Assignment: Students should be able to perform data analysis using linear
regression in Python on any open-source dataset.
Prerequisite:
1. Basic of Python Programming
2. Concept of Regression.
THEORY:
Linear Regression: Linear regression is a machine learning algorithm based on supervised learning.
It predicts a target value on the basis of independent variables, and is mostly used for forecasting
and for finding relationships between variables.
The independent variable (X) is continuous, while the dependent variable (Y) may be continuous or
discrete. A linear relationship must exist between the predictor and the target variable, which is why
the method is known as Linear Regression.
Linear regression is popular because its cost function, the Mean Squared Error (MSE), is simple: it
is the average squared difference between an observation's actual and predicted values.
The model is written as the equation of a line: Y = m*X + b + e
where b is the intercept, m is the slope of the line, and e is the error term.
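As a minimal sketch of this equation in Python (synthetic data is used here, since the Boston dataset appears in the full example later; the slope m = 3 and intercept b = 5 are made-up values for illustration):

# Minimal sketch: fit Y = m*X + b with scikit-learn and evaluate the MSE cost.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))               # independent variable
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, 100)   # Y = m*X + b + e, with m=3, b=5

model = LinearRegression().fit(X, y)
print("slope m:", model.coef_[0])                   # recovered slope
print("intercept b:", model.intercept_)             # recovered intercept
print("MSE:", mean_squared_error(y, model.predict(X)))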
Multivariate Regression: It concerns the study of two or more predictor variables. Usually a
transformation of the original features into polynomial features of a given degree is performed, and
Linear Regression is then applied to them.
A simple linear model Y = a + bX in the original feature is transformed into polynomial features,
and a linear regression applied to them yields a model of the form:
Y = a + bX + cX^2
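A sketch of this transformation with scikit-learn, where PolynomialFeatures expands X into [X, X^2] before an ordinary linear regression is fitted (the coefficients a = 1, b = 2, c = 0.5 below are illustrative):

# Sketch: transform the original feature X into polynomial features [X, X^2],
# then fit linear regression, giving a model of the form Y = a + b*X + c*X^2.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 1.0 + 2.0 * X.ravel() + 0.5 * X.ravel() ** 2    # a=1, b=2, c=0.5

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[2.0]]))                  # prediction at X = 2 (expects ~7.0)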
Generalization
● Generalization is the prediction of future outcomes based on past data.
● It needs to generalize beyond the training data to some future data that it might not have seen
yet.
● The ultimate aim of the machine learning model is to minimize the generalization error.
● The generalization error is essentially the average error for data the model has never seen.
● In general, the dataset is divided into two partitions: a training set and a test set.
● The fit method is called on the training set to build the model.
● The fitted model is then used on the test set to estimate the target values and evaluate
the model's performance.
● The reason the data is divided into training and test sets is that the test set estimates how
well the model trained on the training data would perform on unseen data, as sketched below.
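A sketch of this train/test workflow on the Boston data. The file name BostonHousing.csv and the target column name medv are assumptions about how the downloaded CSV is laid out; adjust them to match your copy.

# Sketch: split the data, fit on the training set, evaluate on the held-out test set.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("BostonHousing.csv")               # assumed local file name
X = df.drop(columns=["medv"])                       # 13 feature variables
y = df["medv"]                                      # target: median home value

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)    # fit on the training set only
y_pred = model.predict(X_test)                      # predict on unseen test data

print("Test MSE:", mean_squared_error(y_test, y_pred))
print("Test R^2:", r2_score(y_test, y_pred))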
Conclusion:
In this way we have done data analysis using linear regression on the Boston Housing dataset and
predicted house prices using its features.
ASSIGNMENT NO : 5
PROBLEM STATEMENT:
1. Implement logistic regression using Python/R to perform classification on
Social_Network_Ads.csv dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision, Recall
on the given dataset.
Objective of the Assignment: Students should be able to perform data analysis using logistic
regression in Python on any open-source dataset.
Prerequisite:
1. Basic of Python Programming
2. Concept of Regression.
THEORY:
Logistic Regression: Classification techniques are an essential part of machine learning and data
mining applications. Approximately 70% of problems in Data Science are classification problems.
Many kinds of classification problems exist, but logistic regression is a common and useful
regression method for solving binary classification problems. Another category of classification is
multinomial classification, which handles the cases where multiple classes are present in the target
variable. For example, the IRIS dataset is a very famous example of multi-class classification.
Other examples are classifying article/blog/document categories.
Logistic Regression can be used for various classification problems such as spam detection, diabetes
prediction, whether a given customer will purchase a particular product or churn to a competitor,
whether a user will click on a given advertisement link, and many more examples in the
bucket.
Logistic Regression is one of the most simple and commonly used Machine Learning algorithms for
two-class classification. It is easy to implement and can be used as the baseline for any binary
classification problem. Its basic fundamental concepts are also constructive in deep learning. Logistic
regression describes and estimates the relationship between one dependent binary variable and
independent variables.
It is a special case of linear regression where the target variable is categorical in nature. It uses the
log of odds as the dependent variable. Logistic Regression predicts the probability of occurrence of
a binary event.
Linear regression gives you a continuous output, but logistic regression provides a discrete output.
Examples of continuous output are house prices and stock prices. Examples of discrete output are
predicting whether a patient has cancer or not, and predicting whether a customer will churn. Linear
regression is estimated using Ordinary Least Squares (OLS), while logistic regression is estimated
using the Maximum Likelihood Estimation (MLE) approach.
Sigmoid Function
The sigmoid function, also called the logistic function, is defined as sigmoid(z) = 1 / (1 + e^(-z)).
It gives an 'S'-shaped curve that can take any real-valued number and map it into a value between
0 and 1. As z goes to positive infinity, the predicted y becomes 1; as z goes to negative infinity, the
predicted y becomes 0.
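A tiny sketch of the sigmoid showing how arbitrary real values are squashed into the interval (0, 1):

# Sketch: the sigmoid (logistic) function maps any real number into (0, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for z in (-10, -1, 0, 1, 10):
    print(z, "->", sigmoid(z))    # near 0 for large negative z, near 1 for large positive z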
Types of Logistic Regression
Binary Logistic Regression: The target variable has only two possible outcomes such as Spam or
Not Spam, Cancer or No Cancer.
Multinomial Logistic Regression: The target variable has three or more nominal categories
such as predicting the type of Wine.
Ordinal Logistic Regression: The target variable has three or more ordinal categories such as a
restaurant or product rating from 1 to 5.
Confusion Matrix
A confusion matrix summarizes classification results: each row indicates the actual classes recorded
in the test data set, and each column indicates the classes as predicted by the classifier.
Numbers on the descending diagonal indicate correct predictions, while the ascending diagonal
concerns prediction errors.
Accuracy: Accuracy is calculated as the number of correctly classified instances divided by total
number of instances.
The ideal value of accuracy is 1, and the worst is 0. It is also calculated as the sum of true positives
and true negatives divided by the total number of instances:
acc = (TP + TN) / (TP + FP + TN + FN) = (TP + TN) / (Pos + Neg)
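A sketch of the whole workflow on Social_Network_Ads.csv. The column names Age, EstimatedSalary, and Purchased are assumptions based on the common Kaggle layout of this file; adjust them if your copy differs.

# Sketch: logistic regression on Social_Network_Ads.csv plus confusion-matrix metrics.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

df = pd.read_csv("Social_Network_Ads.csv")
X = df[["Age", "EstimatedSalary"]]                  # assumed feature columns
y = df["Purchased"]                                 # assumed binary target column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scaler = StandardScaler()                           # features are on very different scales
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

clf = LogisticRegression().fit(X_train, y_train)
y_pred = clf.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("TP:", tp, " FP:", fp, " TN:", tn, " FN:", fn)
print("Accuracy  :", accuracy_score(y_test, y_pred))
print("Error rate:", 1 - accuracy_score(y_test, y_pred))
print("Precision :", precision_score(y_test, y_pred))
print("Recall    :", recall_score(y_test, y_pred))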
Conclusion:
In this way we have done data analysis using logistic regression on the Social Network Ads dataset
and evaluated the performance of the model.
Value Addition: Visualizing Confusion Matrix using Heatmap
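A sketch of that heatmap with seaborn, continuing from the logistic regression example above (y_test and y_pred come from that sketch):

# Sketch: visualize the confusion matrix as an annotated heatmap.
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)               # y_test/y_pred from the example above
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Not Purchased", "Purchased"],
            yticklabels=["Not Purchased", "Purchased"])
plt.xlabel("Predicted class")
plt.ylabel("Actual class")
plt.title("Confusion Matrix")
plt.show()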
ASSIGNMENT NO : 6
PROBLEM STATEMENT:
1. Implement Simple Naïve Bayes classification algorithm using Python/R on iris.csv
dataset.
2. Compute Confusion matrix to find TP, FP, TN, FN, Accuracy, Error rate, Precision,
Recall on the given dataset.
Objective of the Assignment: Students should be able to perform data analysis using the Naïve
Bayes classification algorithm in Python on any open-source dataset.
Prerequisite:
1. Basic of Python Programming
THEORY:
Concepts used in Naïve Bayes classifier :
o The Naïve Bayes Classifier can be used for classification of categorical data.
Let there be j classes, C = {1, 2, …, j}.
Let an input observation be specified by P features; therefore an input observation x is given as
x = {F1, F2, …, Fp}.
The Naïve Bayes classifier depends on Bayes' rule from probability theory.
o Prior probabilities: Probabilities which are calculated for some event based on no other
information are called Prior probabilities.
For example, P(A), P(B), and P(C) are prior probabilities because, while calculating P(A), occurrences
of events B or C are not considered, i.e. no information about the occurrence of any other event is used.
Conditional Probabilities: The probability of event A given that event B has already occurred is
P(A|B) = P(A ∩ B) / P(B) … (1)
and, similarly,
P(B|A) = P(A ∩ B) / P(A) … (2)
From equations (1) and (2), Bayes' rule follows:
P(A|B) = P(B|A) * P(A) / P(B)
The Naïve Bayes classifier applies this rule to compute the probability of each class given the
observation x, with the simplifying ("naïve") assumption that the features F1, …, Fp are
conditionally independent given the class.
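A sketch of the classifier on the Iris data with scikit-learn's GaussianNB (the Gaussian variant, suited to Iris's continuous features). The built-in copy of Iris is used to keep the example self-contained; with the assignment's iris.csv, load it via pandas instead.

# Sketch: Naive Bayes classification on Iris, with confusion matrix and metrics.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

nb = GaussianNB().fit(X_train, y_train)             # one Gaussian per feature per class
y_pred = nb.predict(X_test)

print(confusion_matrix(y_test, y_pred))             # 3x3 matrix, one row/column per class
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))        # per-class precision and recall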
Conclusion:
In this way we have implemented the Naïve Bayes classification algorithm on the iris dataset and
evaluated the performance of the model using the confusion matrix.
ASSIGNMENT NO : 7
PROBLEM STATEMENT:
1. Extract Sample document and apply following document preprocessing methods:
Tokenization, POS Tagging, stop words removal, Stemming and Lemmatization.
2. Create representation of document by calculating Term Frequency and Inverse Document
Frequency.
Objective of the Assignment: Students should be able to perform Text Analysis using the TF-IDF
algorithm.
Prerequisite:
1. Basic of Python Programming
THEORY:
One of the most frequent types of day-to-day communication is text communication. In our everyday
routine, we chat, message, tweet, share status, email, create blogs, and offer opinions and criticism.
All of these actions produce a substantial amount of unstructured text. It is critical to examine such
huge amounts of online and social media data to determine people's opinions.
Text mining is also referred to as text analytics. It is the process of exploring sizable textual
data and finding patterns. Text mining processes the text itself, while NLP works with the
underlying metadata. Finding frequency counts of words, the length of sentences, and the
presence/absence of specific words is text mining. Natural language processing is one of the
components of text mining; NLP helps identify sentiment, find entities in a sentence, and
categorize blogs/articles. Text mining provides preprocessed data for text analytics, in which
statistical and machine learning algorithms are used to classify information.
Text Analysis Operations using natural language toolkit
NLTK (Natural Language Toolkit) is a leading platform for building Python programs that work with
human language data. It provides easy-to-use interfaces and lexical resources such as WordNet, along
with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing,
semantic reasoning, and more.
Analyzing movie reviews is one of the classic examples used to demonstrate a simple NLP
bag-of-words model.
Tokenization:
Tokenization is the first step in text analytics. The process of breaking down a text paragraph into
smaller chunks, such as words or sentences, is called tokenization. A token is a single entity that is a
building block of a sentence or paragraph.
Sentence tokenization: split a paragraph into a list of sentences using the sent_tokenize() method.
Word tokenization: split a sentence into a list of words using the word_tokenize() method.
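A sketch of both tokenizers with NLTK; since the problem statement also asks for POS tagging, nltk.pos_tag is shown on the word tokens as well (the sample sentence is made up):

# Sketch: sentence tokenization, word tokenization, and POS tagging with NLTK.
import nltk
nltk.download("punkt")                              # tokenizer models (first run only)
nltk.download("averaged_perceptron_tagger")         # POS tagger model (first run only)
from nltk.tokenize import sent_tokenize, word_tokenize

text = "Text mining finds patterns in text. NLTK makes this easy."
print(sent_tokenize(text))                          # list of sentences
words = word_tokenize(text)
print(words)                                        # list of word/punctuation tokens
print(nltk.pos_tag(words))                          # (token, part-of-speech tag) pairs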
Stop words removal
Stop words are considered noise in the text. Text may contain stop words such as is, am, are, this, a,
an, the, etc. To remove stop words with NLTK, you create a list of stopwords and filter your list of
tokens against it.
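A sketch of that filtering step (reusing the punkt download from the tokenization sketch above):

# Sketch: remove NLTK's English stop words from a list of tokens.
import nltk
nltk.download("stopwords")                          # stop-word lists (first run only)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words("english"))
tokens = word_tokenize("This is an example showing stop word removal.")
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)                                     # ['example', 'showing', 'stop', 'word', 'removal', '.']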
Stemming and Lemmatization
Stemming is a normalization technique where lists of tokenized words are converted into shortened
root words to remove redundancy. Stemming is the process of reducing inflected (or sometimes
derived) words to their word stem, base or root form.
A computer program that stems words may be called a stemmer. E.g., a stemmer reduces words
like fishing, fished, and fisher to the stem fish.
The stem need not be a word; for example, the Porter algorithm reduces argue, argued, argues,
arguing, and argus to the stem argu.
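A sketch of stemming with NLTK's PorterStemmer:

# Sketch: reduce inflected words to their stems with the Porter algorithm.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["fishing", "fished", "argue", "argued", "argues", "arguing"]:
    print(word, "->", stemmer.stem(word))
# fishing/fished stem to "fish"; the argue family stems to "argu",
# showing that a stem need not itself be a word.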
Lemmatization in NLTK is the algorithmic process of finding the lemma of a word depending on its
meaning and context. Lemmatization usually refers to the morphological analysis of words, which
aims to remove inflectional endings. It helps in returning the base or dictionary form of a word,
known as the lemma.
E.g., the lemma of studies is study.
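A sketch of lemmatization with NLTK's WordNetLemmatizer:

# Sketch: return the dictionary form (lemma) of a word using WordNet.
import nltk
nltk.download("wordnet")                            # WordNet lexicon (first run only)
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studies"))              # -> study
print(lemmatizer.lemmatize("better", pos="a"))      # -> good (with adjective POS hint)

For the second part of the problem statement, TF-IDF weighs each term by how often it appears in a document (term frequency) and down-weights terms that appear in many documents (inverse document frequency). A sketch using scikit-learn's TfidfVectorizer on three made-up documents:

# Sketch: TF-IDF representation of a small document collection.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the house had a tiny little mouse",
    "the cat saw the mouse",
    "the mouse ran away from the house",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)              # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out())           # learned vocabulary
print(tfidf.toarray().round(2))                     # TF-IDF weight of each term per document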
Conclusion:
In this way we have performed text data analysis using the TF-IDF algorithm after preprocessing
the document with tokenization, POS tagging, stop word removal, stemming, and lemmatization.