Data Science - Module 2 (Updated)
▪ Exploratory data analysis, or "EDA", is a critical first step in analyzing the data from an
experiment. Among other things, it supports:
• detection of mistakes
• checking of assumptions
• Each column contains the numeric values for a particular quantitative variable or the levels for a
categorical variable. (Some more complicated experiments require a more complex data layout.)
• People are not very good at looking at a column of numbers or a whole spreadsheet and then
determining important characteristics of the data.
• Hence, exploratory data analysis techniques have been devised as an aid in this situation. Most of
these techniques work in part by hiding certain aspects of the data while making other aspects
clearer.
Typical data format and the types of EDA (Cont.,)
• Exploratory data analysis is generally cross-classified in two ways. First, each method is either
non-graphical or graphical. And second, each method is either univariate or multivariate.
• Univariate methods look at one variable (data column) at a time, while multivariate methods look
at two or more variables at a time to explore relationships. Usually our multivariate EDA will be
bivariate (looking at exactly two variables), but occasionally it will involve three or more
variables.
• Beyond the four categories created by the above cross-classification, each category of
EDA has further divisions based on the role (outcome or explanatory) and type (categorical or
quantitative) of the variable(s) being examined.
Introduction to EDA
⚫ Developed in the 1970s by American mathematician John Tukey, exploratory data
analysis (EDA) is a method of analysing and investigating data sets to summarise
their main characteristics.
Introduction -(Cont.,)
• Scientists often use Data visualization methods to discover patterns, spot anomalies, check
assumptions or test a hypothesis through summary statistics and graphical representations.
• EDA goes beyond the formal modelling or hypothesis to give maximum awareness / insight into
the data set and its structure, and in identifying influential variables.
• It can also help in selecting the most suitable data analysis technique for a given project. Ex: car
seat sales predictions.
• Specific knowledge, such as the creation of a ranked list of relevant factors to be used as
guidelines, can also be obtained using EDA.
• The four types of EDA are univariate non-graphical, multivariate nongraphical, univariate
graphical, and multivariate graphical.
Introduction -(Cont.,)
• The EDA types of techniques are either graphical or quantitative (non-graphical).
• The graphical methods involve summarizing the data in a diagrammatic or visual way.
• The quantitative method, on the other hand, involves the calculation of summary statistics.
• These two types of methods are further categorized / divided into univariate and multivariate methods.
• Univariate methods consider one variable (data column) at a time, while multivariate methods consider two or more variables
at a time to explore relationships.
• The graphical methods provide more subjective analysis, and quantitative methods are more objective.
Univariate non-graphical:
• This is the simplest form of data analysis among the four options.
• In this type of analysis, the data that is being analysed consists of just a single variable.
• The main purpose of this analysis is to describe the data and to find patterns.
• The data come from making a particular measurement on all of the subjects in a
sample. Ex: age, gender, etc.
• The characteristics examined include: categorical data, center, spread, skewness and kurtosis, shape
(including "heaviness of the tails"), and outliers.
Univariate non-graphical (Cont.,)
A simple tabulation of the frequency of each category is the best univariate non-graphical EDA for categorical data.
• Central tendency :
• The central tendency or “location” of a distribution has to do with typical or middle values.
• The common, useful measures of central tendency are the statistics called (arithmetic) mean, median, and
sometimes mode.
• Occasionally other means such as geometric, harmonic, truncated, or Winsorized means are used as measures of
centrality. Most authors use the term "average" as a synonym for the arithmetic mean.
• The median is another measure of central tendency. The sample median is the middle value after all of the values are
put in an ordered list. If there is an even number of values, take the average of the two middle values.
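As a quick illustration, the following sketch computes all three measures with Python's built-in statistics module; the ages list is made-up sample data.

```python
import statistics

ages = [23, 25, 25, 29, 31, 34, 34, 34, 41]  # made-up sample

print(statistics.mean(ages))    # arithmetic mean: 30.666...
print(statistics.median(ages))  # middle value of the ordered list: 31
print(statistics.mode(ages))    # most frequent value: 34
```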
Univariate non-graphical: (Cont.,)
• Spread:
• Several statistics are commonly used as a measure of the spread of a distribution,
including variance, standard deviation, and interquartile range. Spread indicates how
far values lie from the center.
• The variance is a standard measure of spread, and the standard deviation is simply the
square root of the variance.
• Skewness and kurtosis
• Two additional useful univariate descriptors are the skewness and kurtosis of a
distribution. Skewness is a measure of asymmetry. Kurtosis is a measure of “peakedness”
relative to a Gaussian shape.
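As a minimal sketch, these spread and shape statistics can be computed with NumPy and SciPy on a small made-up sample (scipy.stats.kurtosis reports excess kurtosis, i.e., relative to a Gaussian):

```python
import numpy as np
from scipy import stats

data = np.array([2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 5.9])  # made-up sample

print(np.var(data, ddof=1))    # sample variance
print(np.std(data, ddof=1))    # sample standard deviation
q75, q25 = np.percentile(data, [75, 25])
print(q75 - q25)               # interquartile range
print(stats.skew(data))        # asymmetry; > 0 means right-skewed
print(stats.kurtosis(data))    # excess kurtosis (0 for a Gaussian)
```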
Univariate non-graphical: (Cont.,)
• Skewness and kurtosis are commonly assessed using an estimate e of skewness and an estimate u of kurtosis,
together with the corresponding standard errors SE(e) and SE(u).
Univariate graphical EDA
• Univariate graphical: Unlike the non-graphical methods, the graphical methods provide
a fuller picture of the data. The three main methods of analysis under this type are the
histogram, the stem-and-leaf plot, and the box plot.
• The histogram represents the total count of cases for a range of values.
• A histogram is a graph that uses bars to show the distribution of a data set. Unlike a bar
chart, which has a qualitative variable on the x-axis, a histogram can help you to visualize
numerical or quantitative data and identify any patterns.
Univariate graphical EDA (Cont.,)
• Along with the data values, the stem and leaf plot shows the shape of the distribution.
• A qualitative variable is a category that can only be expressed in words. But if you
have quantitative variables on both the x- and y-axes, and there is no space between
the bars, then you are probably looking at a histogram.
• Box plots graphically depict a summary of the minimum, first quartile, median, third
quartile, and maximum.
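Assuming Matplotlib is available, the following sketch draws a histogram and a box plot of the same simulated quantitative variable to illustrate the two plot types:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)  # simulated quantitative variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=15)   # bars count cases per range of values
ax1.set_title("Histogram")
ax2.boxplot(values)         # min, Q1, median, Q3, max summary
ax2.set_title("Box plot")
plt.tight_layout()
plt.show()
```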
• Categorical data:
• For categorical data (and quantitative data with only a few different values), an extension of
tabulation called cross-tabulation is very useful.
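For instance, a cross-tabulation can be produced with pandas.crosstab; the tiny DataFrame below is invented purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "bought": ["yes", "no", "yes", "yes", "no", "no"],
})

# Cross-tabulation: frequency of each (gender, bought) combination
print(pd.crosstab(df["gender"], df["bought"]))
```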
TOOLS REQUIRED FOR EXPLORATORY DATA ANALYSIS:
• Python:
• Its high-level, built-in data structures and dynamic typing and binding make it an attractive
tool for EDA.
• Python provides certain open-source modules that can automate the whole process of
EDA and help in saving time.
TOOLS REQUIRED FOR EXPLORATORY DATA ANALYSIS: (Cont.,)
• R:
• The R language is used widely by data scientists and statisticians for developing
statistical observations and data analysis.
• Apart from the tools described above, EDA can also be used to:
• Perform k-means clustering:
• It is an unsupervised learning algorithm where the data points are assigned to clusters, also
referred to as k-groups.
• k-means clustering is usually utilized in market segmentation, image compression, and
pattern recognition
• EDA is often utilized in predictive models like linear regression, where it is used to
predict outcomes.
• It is also utilized in univariate, bivariate, and multivariate visualization for summary
statistics, establishing relationships between each variable, and understanding how
different fields within the data interact with one another.
Philosophy of EDA
• "Long before worrying about how to convince others, you first have to understand what's
happening yourself." – Andrew Gelman
• Exploratory Data Analysis by John Tukey
• These ideas became especially relevant once companies such as Google started working with large-scale datasets.
• In the context of data in an Internet/engineering company, EDA is done for some of the
same reasons it’s done with smaller datasets.
• Anyone working with data should do EDA. Namely, to gain intuition about the data, to
make comparisons between distributions, for sanity checking, to find out where data is
missing or if there are outliers; and to summarize the data.
• For data generated from logs, EDA also helps with debugging the logging process.
• Plotting data and making comparisons can get you extremely far, and is far better to do
than getting a dataset and immediately running a regression model.
The Data Science Process
• Diverse activities create the raw data.
• Clean data: the data-cleaning process uses various tools such as
Python, shell scripts, R, or SQL, or all of the above, to get the data
into a nicely formatted state.
Case Study: RealDirect (Online real estate firm)
• The founder's goal with RealDirect is to use all the data he can access about real estate to improve the way people sell
and buy houses.
• In the USA, people normally sell their homes about once every seven years, and they do so
with the help of professional brokers and current data. But there is a problem both with the broker system
and with the data quality.
• The brokers are typically "free agents" operating on their own; think of them as home-sales
consultants.
• This means that they guard their data aggressively, and the really good ones have lots of experience. But in
the grand scheme of things, that really means they have only slightly more data than the inexperienced
brokers.
Case Study: RealDirect (Online real estate firm) (Cont.,)
• RealDirect is addressing this problem by hiring a team of licensed real estate agents who work
together and pool their knowledge. To accomplish this, it built an interface for sellers, giving
them useful data-driven tips on how to sell their house. It also uses interaction data to give
real-time recommendations on what to do next.
• The team of brokers also become data experts, learning to use information-collecting tools to
keep tabs on new and relevant data or to access publicly available information.
• The problem with publicly available data is that it is old news: there is a three-month lag between a
sale and when the data about that sale is available.
• RealDirect is working on real-time feeds on things like when people start searching for a home,
what the initial offer is, the time between offer and close, and how people search for a home
online.
How Does RealDirect Make Money?
• First, it offers a subscription to sellers (about $395 a month) to access the selling tools.
Second, it allows sellers to use RealDirect’s agents at a reduced commission, typically 2%
of the sale instead of the usual 2.5% or 3%.
• The site itself is best thought of as a platform for buyers and sellers to manage their sale
or purchase process. There are statuses for each person on site: active, offer made, offer
rejected, showing, in contract, etc. Based on your status, different actions are suggested
by the software.
Three Basic Machine Learning Algorithms:
Data samples can be thought of as, and used for, classification and prediction problems when we
express them mathematically.
Users can develop models and algorithms, which can then be used to classify, predict, and
make decisions.
Once users become familiar with building models, they have to decide which model is really
required, and the choice of statistical model depends on the context of the problem.
Choosing a statistical machine learning model also depends on the data scientist's experience.
A common myth in data science is: "I am well versed in linear regression, so I always use
the same algorithm for exploring diverse applications."
Three Basic Machine Learning Algorithms: (Cont.,)
As a good data scientist, it is wise to talk things through with someone who is familiar with
diverse ML algorithms: speak with a coworker or head to a meetup group before adopting a
statistical model for exploring real-world applications.
Here we focus on:
1. Linear regression – supervised machine learning – prediction
2. k-Nearest Neighbours (k-NN) – supervised machine learning – classification
3. k-means – unsupervised machine learning – clustering / association
Linear Regression
Introduction
• Linear regression analysis is used to predict the value of a variable based on the value
of another variable.
• The variable you are using to predict the other variable's value is called the
independent variable; the variable being predicted is called the dependent variable.
Introduction (Cont.,)
• Users can perform the linear regression method in a variety of programs and
environments, including:
• R(Tool) linear regression.
• MATLAB linear regression.
• Sklearn linear regression.
• Linear regression Python.
• Excel linear regression.
Why Linear Regression (LR)
• Worked example: fit the line y = Mx + B to five data points.

X    Y     Deviation of X   Deviation of Y   Product of Deviations (XY)   Square of Deviations of X
1    1.2   -2               -1.32            2.64                         4
2    1.8   -1               -0.72            0.72                         1
3    2.6    0                0.08            0.00                         0
4    3.2    1                0.68            0.68                         1
5    3.8    2                1.28            2.56                         4

Mean X = 3, Mean Y = 2.52; Sum of Product of Deviations = 6.6; Sum of Squares of Deviations of X = 10.

• Calculate M = Sum of Product of Deviations / Sum of Squares of Deviations of X = 6.6 / 10 = 0.66
• Calculate B = Mean of Y – (M * Mean of X) = 2.52 – 0.66 * 3 = 0.54
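The same values of M and B can be verified in a few lines of Python (assuming NumPy):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1.2, 1.8, 2.6, 3.2, 3.8])

dev_x = x - x.mean()  # deviations of X from its mean (3.0)
dev_y = y - y.mean()  # deviations of Y from its mean (2.52)

m = np.sum(dev_x * dev_y) / np.sum(dev_x ** 2)  # 6.6 / 10 = 0.66
b = y.mean() - m * x.mean()                     # 2.52 - 0.66 * 3 = 0.54
print(m, b)                                     # fitted line: y = 0.66x + 0.54
```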
LR Graph
Example: Car Seat Sales Prediction

Sales Percentage   Price
6.83               78.05
6.56               76.74
7.53               81.32
5.37               68.34
8.67               91.35
Standard Error:
• Residuals, or errors, are the differences between the actual value (y) and the predicted value (ŷ).
• If the difference between the actual and predicted value is zero, the model
fits the data perfectly.
Validation for Regression Methods
• Mean Absolute Error (MAE)
• MAE is the mean of the absolute residuals (errors): the difference between estimated/predicted
target values and actual target values. It can be written mathematically as:
• MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
• Here ŷᵢ is the predicted value, yᵢ is the actual target value, and n is the total number of
samples used in the regression analysis.
Validation for Regression Methods
• Mean Squared Error (MSE)
• It is the mean of the squared residuals (errors). This value is always non-negative, and values closer to 0 indicate a better fit.
• MSE is written mathematically as:
• MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
• The related R² score is the ratio of the explained sum of squares (SSR, the numerator) to the total sum of squares (the denominator).
• MAE and MSE depend on the context, as we have seen, whereas the R² score is independent of context.
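As a sketch, all three validation metrics can be computed with scikit-learn's metrics module; the actual and predicted values below are made up:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])  # actual target values (made up)
y_pred = np.array([2.8, 5.4, 7.1, 9.3])  # model predictions (made up)

print(mean_absolute_error(y_true, y_pred))  # MAE: mean of |y - y_hat|
print(mean_squared_error(y_true, y_pred))   # MSE: mean of (y - y_hat)^2
print(r2_score(y_true, y_pred))             # R^2: context-independent fit score
```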
Validation for Regression Methods
• P-value
• The linear regression coefficients describe the mathematical relationship between each
independent variable and the dependent variable. The p-values for the coefficients
indicate whether these relationships are statistically significant.
Regression Problem statement
• Problem statement: predict the sales of items as shown in the table.
• Consider two new items, I6 and I7, whose actual values are 80 and 75 respectively.
• The regression model predicts values for items I6 and I7, which are then compared
with these actual values.
KNN Algorithm
Glimpse of KNN working Principle
• The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it
predicts the label or value of a new data point by considering the labels or values of its
K nearest neighbors in the training dataset.
Case study
• Empty circles represent low income and dark circles represent high income.
• The test instance is a person aged 57 with an income of Rs 38,000.
• Does this person belong to the low-income group or the high-income group?
K-NN
• KNN is instance-based learning: it works on the basis of "memorize and apply".
• Evelyn Fix and Joseph Hodges developed this algorithm in 1951, which was subsequently
expanded by Thomas Cover.
• The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine
learning algorithm that can be used to solve both classification and regression problems.
• Instance-based learning does not create a general pattern. Instead of creating a general pattern,
it compares the new problem/pattern with the existing instances.
• Ex: the spam filter in Gmail keeps memory patterns of spam mails and filters new spam mail
based on them.
• A general analogy for instance-based learning:
• A student who learns and understands the concepts and writes the exam remembers the concepts
even after the exam.
• A student who crams the concepts from memory forgets them after the exam; this is
instance-based learning.
• Instance-based learning constructs the target function only when a new instance must be classified.
K-NN (Cont.,)
• Every time a new query instance is encountered, its relationship to the previously stored
examples is examined in order to assign a target value to the new instance.
• Hence, instance-based learning is also called lazy learning or memory-based learning.
• K-NN is a simple and powerful non-parametric algorithm that predicts the category of a test
instance according to the K training samples that are closest to the test instance, and
classifies it into the category with the largest probability.
• To compute similarity, KNN uses distance-measuring techniques such as
Euclidean distance, Hamming distance, and the city-block distance approach.
• The better neighbours are those at the smallest distances; here, with K = 3, the three training
instances at the minimum distances are selected.
K-NN (Cont.,)
• What is the K-Nearest Neighbors algorithm, and what are its applications?
• KNN is one of the most basic yet essential classification algorithms in machine learning, and
it belongs to the supervised learning domain.
• KNN is useful for various demanding applications such as pattern recognition, data mining,
and intrusion detection.
• KNN compares new problems with the instances in the training sample (previous
samples, or the whole sample, may be considered), which are stored in memory.
K-NN algorithm
• Input: training dataset T, test instance t, number of nearest neighbours K, and
distance metric D.
• Output: predicted class.
• Step 1: Compute the distance between the test instance t and every instance in
the training dataset T using the distance formula.
• The distance formula may be Euclidean distance, Hamming distance, or city-block distance.
• Step 2: Sort the distances in ascending order and select the first K nearest training data
instances.
• Step 3: Predict the class of the new test instance by majority voting among the K nearest neighbours.
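A minimal from-scratch sketch of these three steps, using Euclidean distance and majority voting; the toy training data mirror the earlier age/income case study and are invented for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, test_point, k=3):
    """Predict the class of test_point by majority vote of its k nearest neighbours."""
    # Step 1: Euclidean distance from the test instance to every training instance
    dists = np.linalg.norm(train_X - test_point, axis=1)
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Step 3: majority vote among the k nearest labels
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: (age, income in thousands of Rs), labelled low/high income (made up)
train_X = np.array([[25, 20], [30, 25], [55, 40], [60, 45], [28, 22], [58, 42]])
train_y = np.array(["low", "low", "high", "high", "low", "high"])
print(knn_predict(train_X, train_y, np.array([57, 38]), k=3))  # -> "high"
```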
Example 1
• In a KPSC exam, a candidate secured the following marks. Based on the marks obtained,
categorize whether the candidate is eligible or not using K nearest
neighbours. The given instance is:
• General Studies X(O1) = 6 and Computer Science Y(O2) = 8.
• K = 3 nearest neighbours. Training data: (Note: O is the observed value, a is the actual value.)
• Drawbacks
• The cost of classifying a new instance can be high.
• The instance-based approach (KNN) considers all the attributes in the training samples when
classifying a new instance; hence it retrieves similar kinds of instances from memory.
• If the target concept depends on only a few of the many available attributes, there is a
chance of a wrong target function output.
Distance Measuring
• Distance measures play an important role in machine learning.
• A distance measure is an objective score that summarizes the relative difference between
two objects / samples in a problem domain.
• Most commonly, the two objects are rows of data that describe a subject (such as a
person, car, or house), or an event (such as a purchase, a claim, or a diagnosis).
• Perhaps the most likely way you will encounter distance measures is through the KNN classifier.
• In the KNN algorithm, a classification or regression prediction is made for new examples
by calculating the distance between the new example (row) and all examples (rows) in the
training dataset.
• Popular Distance Measuring algorithm are :
• Hamming Distance
• Euclidean Distance
• Manhattan Distance
• Minkowski Distance
Distance Measuring
• Euclidean Distance
• The Euclidean distance is the most widely used distance measure in clustering.
• It is also called the L2 norm.
• It calculates the straight-line distance between two points in n-dimensional space. The
formula for Euclidean distance is:
• d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
Distance Measuring
• Manhattan Distance
• Another name for Manhattan distance is city-block distance.
• It is also known as boxcar distance, absolute-value distance, or the L1 norm.
• d(x, y) = Σᵢ |xᵢ − yᵢ|
Distance Measuring
• Chebyshev Distance:
• This approach is also known as the maximum-value distance.
• It computes the maximum absolute difference between the coordinates of a pair of
objects:
• d(x, y) = maxᵢ |xᵢ − yᵢ|
Example for distance metrics:
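A small NumPy sketch of the three metrics introduced above, evaluated on two made-up points:

```python
import numpy as np

p = np.array([2.0, 3.0, 5.0])
q = np.array([6.0, 1.0, 4.0])

diff = np.abs(p - q)              # |x_i - y_i| per coordinate: [4, 2, 1]
print(np.sqrt(np.sum(diff ** 2))) # Euclidean (L2): straight-line distance, sqrt(21)
print(np.sum(diff))               # Manhattan (L1 / city block): 7
print(np.max(diff))               # Chebyshev (max value distance): 4
```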
Distance Measuring
• Hamming Distance :
• Can be used to find the distance between two strings or pairs of words or DNA sequences
of the same length.
• The distance between olive and ocean is 4 because aside from the “o” the other 4 letters
are different.
• The distance between shoe and hose is 3 because aside from the “e” the other 3 letters are
different.
• Just go through each position and check whether the letters are the same in that position;
if not, increment your count by 1.
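A tiny helper function illustrating this counting procedure (a sketch, not a library routine):

```python
def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    assert len(a) == len(b), "Hamming distance needs equal-length inputs"
    return sum(1 for ca, cb in zip(a, b) if ca != cb)

print(hamming("olive", "ocean"))  # 4: only the leading 'o' matches
print(hamming("shoe", "hose"))    # 3: only the trailing 'e' matches
```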
Distance Measuring
• Mahalanobis distance :
• The Mahalanobis distance (MD) is the distance between two points in multivariate
space.
• The MD measures the relative distance between two variables with respect to the
centroid.
• It has excellent applications in multivariate anomaly detection, classification on highly
imbalanced datasets and one-class classification and more untapped use cases.
• It is effectively a multivariate equivalent of the Euclidean distance.
• The Mahalanobis distance of an observation x = (x₁, x₂, …, x_N)ᵀ from a set of
observations with mean μ = (μ₁, μ₂, …, μ_N)ᵀ and covariance matrix S is defined as:
• D_M(x) = √( (x − μ)ᵀ S⁻¹ (x − μ) )
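A minimal NumPy sketch of this formula on a made-up sample (the covariance matrix S must be invertible):

```python
import numpy as np

# Small multivariate sample (made up): rows are observations, columns are variables
X = np.array([[2.0, 2.1], [2.5, 2.4], [3.0, 3.2], [3.5, 3.0], [4.0, 4.1]])

mu = X.mean(axis=0)          # mean vector
S = np.cov(X, rowvar=False)  # covariance matrix
S_inv = np.linalg.inv(S)

x = np.array([3.2, 2.0])     # observation to score
d = x - mu
md = np.sqrt(d @ S_inv @ d)  # sqrt((x - mu)^T S^-1 (x - mu))
print(md)
```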
K-Means
• K-Means Clustering is an Unsupervised Machine Learning algorithm, which groups the
unlabeled dataset into different clusters.
• It is an iterative partitional algorithm; here k stands for the user-specified number of requested
clusters, and the user may not be aware of how many clusters are actually present in the available dataset.
• In a dataset, each individual sample ends up belonging to exactly one cluster.
• Unsupervised Machine Learning is the process of teaching a computer to use unlabeled,
unclassified data and enabling the algorithm to operate on that data without supervision.
• Without any previous data training, the machine’s job in this case is to organize unsorted
data according to parallels, patterns, and variations.
• K means clustering, assigns data points to one of the K clusters depending on their
distance from the center of the clusters.
K-Means (Cont.,)
• The K-means algorithm requires initial values; the algorithm can select K data points
randomly or use prior knowledge of the data.
• It starts by randomly placing the cluster centroids in the space. Each data point is then assigned to
the cluster whose centroid lies at the minimal distance.
• After assigning each point to one of the clusters, new cluster centroids are computed. This process
runs iteratively until it finds good clusters. In this analysis we assume that the number of clusters is
given in advance and we have to put the points into one of the groups.
• The K-means iterative process continues until no change in the assignment of instances to clusters is noticed;
only then does the algorithm terminate, and termination is guaranteed.
• Drawback:
• In some cases, K is not clearly defined, and we have to think about the optimal value of K.
K-means clustering performs best when the data are well separated.
K-Means Algorithm
• The algorithm works as follows:
• Step 1: Choose K initial centroids (for example, K randomly selected data points).
• Step 2: Assign each data point to its nearest centroid.
• Step 3: Recompute each centroid as the mean of the points assigned to it.
• Step 4: Repeat Steps 2 and 3 until the assignments no longer change.
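A compact from-scratch sketch of these steps in plain NumPy; the toy 2-D data are invented for illustration, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: returns (centroids, labels). A teaching sketch, not production code."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # Step 4: no change, terminate
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 1], [1.5, 2], [1, 1.5], [8, 8], [8.5, 9], [9, 8]])
centroids, labels = kmeans(X, k=2)
print(labels)  # two well-separated groups
```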
K-Means Algorithm
• Advantages
• It is a very simple algorithm.
• It is easy to implement.
• Drawbacks
• It is sensitive to the initial values; if the user chooses random or imprecise starting points,
this can lead to wrong clustering.
• If the data sample size is large, the algorithm takes a long time to process.
K-Means (Cont.,)
• How to choose the K value:
• K is a user-specified value; it specifies the number of clusters.
• There is no gold-standard rule for choosing the value of K.
• The K-means algorithm is run with multiple values of K, and the within-group variance for each K
is plotted as a line graph; this kind of plot is called an elbow curve.
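A sketch of how an elbow curve might be produced with scikit-learn's KMeans, where the inertia_ attribute gives the within-cluster sum of squares; the three blobs are simulated:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in (0, 5, 10)])  # 3 blobs

inertias = []
ks = range(1, 8)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Within-group variance (inertia)")
plt.title("Elbow curve")
plt.show()
```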
End of Module 2