Data Science - Module 2 (Updated)
▪ Exploratory data analysis, or "EDA", is a critical first step in analyzing the data from an
experiment. Among other things, it supports:
• detection of mistakes
• checking of assumptions
• Each column contains the numeric values for a particular quantitative variable or the levels for a
categorical variable. (Some more complicated experiments require a more complex data layout.)
• People are not very good at looking at a column of numbers or a whole spreadsheet and then
determining important characteristics of the data.
• Hence, exploratory data analysis techniques have been devised as an aid in this situation. Most of
these techniques work in part by hiding certain aspects of the data while making other aspects
clearer.
Typical data format and the types of EDA (Cont.,)
• Exploratory data analysis is generally cross-classified in two ways. First, each method is either
non-graphical or graphical. And second, each method is either univariate or multivariate.
• Univariate methods look at one variable (data column) at a time, while multivariate methods look
at two or more variables at a time to explore relationships. Usually our multivariate EDA will be
bivariate (looking at exactly two variables), but occasionally it will involve three or more
variables.
• Beyond the four categories created by the above cross-classification, each category of
EDA has further divisions based on the role (outcome or explanatory) and type (categorical or
quantitative) of the variable(s) being examined.
Introduction to EDA
⚫ Developed in the 1970s by American mathematician John Tukey, exploratory data
analysis (EDA) is a method of analysing and investigating data sets to summarise
their main characteristics.
Introduction -(Cont.,)
• Scientists often use Data visualization methods to discover patterns, spot anomalies, check
assumptions or test a hypothesis through summary statistics and graphical representations.
• EDA goes beyond the formal modelling or hypothesis to give maximum awareness / insight into
the data set and its structure, and in identifying influential variables.
• It can also help in selecting the most suitable data analysis technique for a given project. Ex: car
seat sales predictions.
• Specific knowledge, such as the creation of a ranked list of relevant factors to be used as
guidelines, can also be obtained using EDA.
• The four types of EDA are univariate non-graphical, multivariate nongraphical, univariate
graphical, and multivariate graphical.
Introduction -(Cont.,)
• The EDA types of techniques are either graphical or quantitative (non-graphical).
• The graphical methods involve summarizing the data in a diagrammatic or visual way.
• The quantitative method, on the other hand, involves the calculation of summary statistics.
• These two types of methods are further categorized / divided into univariate and multivariate methods.
• Univariate methods consider one variable (data column) at a time, while multivariate methods consider two or more variables
at a time to explore relationships.
• The graphical methods provide more subjective analysis, and quantitative methods are more objective.
Univariate non-graphical:
• This is the simplest form of data analysis among the four options.
• In this type of analysis, the data that is being analysed consists of just a single variable.
• The main purpose of this analysis is to describe the data and to find patterns.
• The data come from making a particular measurement on all of the subjects in a
sample. Ex: age, gender, etc.
• The characteristics examined include: categorical data, center, spread, skewness and kurtosis, shape
(including "heaviness of the tails"), and outliers.
Univariate non-graphical (Cont.,)
A simple tabulation of the frequency of each category is the best univariate non-graphical EDA for categorical data.
• Central tendency :
• The central tendency or “location” of a distribution has to do with typical or middle values.
• The common, useful measures of central tendency are the statistics called (arithmetic) mean, median, and
sometimes mode.
• Occasionally other means such as geometric, harmonic, truncated, or Winsorized means are used as measures of
centrality. Most authors use the term "average" as a synonym for the arithmetic mean.
• The median is another measure of central tendency. The sample median is the middle value after all of the values are
put in an ordered list. If there is an even number of values, take the average of the two middle values.
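As a quick illustration, the following sketch computes all three measures with Python's built-in statistics module; the ages list is made-up sample data.

```python
import statistics

ages = [23, 25, 25, 29, 31, 34, 34, 34, 41]  # made-up sample

print(statistics.mean(ages))    # arithmetic mean: 30.666...
print(statistics.median(ages))  # middle value of the ordered list: 31
print(statistics.mode(ages))    # most frequent value: 34
```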
Univariate non-graphical: (Cont.,)
• Spread:
• Several statistics are commonly used as a measure of the spread of a distribution,
including variance, standard deviation, and interquartile range. Spread indicates how
far values lie from the center.
• The variance is a standard measure of spread, and the standard deviation is simply the
square root of the variance.
• Skewness and kurtosis
• Two additional useful univariate descriptors are the skewness and kurtosis of a
distribution. Skewness is a measure of asymmetry. Kurtosis is a measure of “peakedness”
relative to a Gaussian shape.
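As a minimal sketch, these spread and shape statistics can be computed with NumPy and SciPy on a small made-up sample (scipy.stats.kurtosis reports excess kurtosis, i.e., relative to a Gaussian):

```python
import numpy as np
from scipy import stats

data = np.array([2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 5.9])  # made-up sample

print(np.var(data, ddof=1))    # sample variance
print(np.std(data, ddof=1))    # sample standard deviation
q75, q25 = np.percentile(data, [75, 25])
print(q75 - q25)               # interquartile range
print(stats.skew(data))        # asymmetry; > 0 means right-skewed
print(stats.kurtosis(data))    # excess kurtosis (0 for a Gaussian)
```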
Univariate non-graphical: (Cont.,)
• Skewness and kurtosis are commonly assessed using an estimate e of skewness and an estimate u of kurtosis,
together with the corresponding standard errors SE(e) and SE(u).
Univariate graphical EDA
• Univariate graphical: Unlike the non-graphical methods, the graphical methods provide
a fuller picture of the data. The three main methods of analysis under this type are the
histogram, the stem-and-leaf plot, and the box plot.
• The histogram represents the total count of cases for a range of values.
• A histogram is a graph that uses bars to show the distribution of a data set. Unlike a bar
chart, which has a qualitative variable on the x-axis, a histogram can help you to visualize
numerical or quantitative data and identify any patterns.
Univariate graphical EDA (Cont.,)
• Along with the data values, the stem and leaf plot shows the shape of the distribution.
• A qualitative variable is a category that can only be expressed in words. But if you
have quantitative variables on both the x- and y-axes, and there is no space between
the bars, then you are probably looking at a histogram.
• Box plots graphically depict a summary of the minimum, first quartile, median, third
quartile, and maximum.
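Assuming Matplotlib is available, the following sketch draws a histogram and a box plot of the same simulated quantitative variable to illustrate the two plot types:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
values = rng.normal(loc=50, scale=10, size=200)  # simulated quantitative variable

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(values, bins=15)   # bars count cases per range of values
ax1.set_title("Histogram")
ax2.boxplot(values)         # min, Q1, median, Q3, max summary
ax2.set_title("Box plot")
plt.tight_layout()
plt.show()
```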
• Categorical data:
• For categorical data (and quantitative data with only a few different values), an extension of
tabulation called cross-tabulation is very useful.
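For instance, a cross-tabulation can be produced with pandas.crosstab; the tiny DataFrame below is invented purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "bought": ["yes", "no", "yes", "yes", "no", "no"],
})

# Cross-tabulation: frequency of each (gender, bought) combination
print(pd.crosstab(df["gender"], df["bought"]))
```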
TOOLS REQUIRED FOR EXPLORATORY DATA ANALYSIS:
• Python:
• Its high-level, built-in data structures and dynamic typing and binding make it an attractive
tool for EDA.
• Python provides certain open-source modules that can automate the whole process of
EDA and help in saving time.
TOOLS REQUIRED FOR EXPLORATORY DATA ANALYSIS: (Cont.,)
• R:
• The R language is used widely by data scientists and statisticians for developing
statistical observations and data analysis.
• Apart from the tools described above, EDA can also be used to:
• Perform k-means clustering:
• It is an unsupervised learning algorithm where the data points are assigned to clusters, also
referred to as k-groups.
• k-means clustering is usually utilized in market segmentation, image compression, and
pattern recognition
• EDA is often utilized in predictive models like linear regression, where it is used to
predict outcomes.
• It is also utilized in univariate, bivariate, and multivariate visualization for summary
statistics, establishing relationships between each variable, and understanding how
different fields within the data interact with one another.
Philosophy of EDA
• "Long before worrying about how to convince others, you first have to understand what's
happening yourself." – Andrew Gelman
• Exploratory Data Analysis by John Tukey
• These ideas became especially relevant once companies such as Google started working with large-scale datasets.
• In the context of data in an Internet/engineering company, EDA is done for some of the
same reasons it’s done with smaller datasets.
• Anyone working with data should do EDA. Namely, to gain intuition about the data, to
make comparisons between distributions, for sanity checking, to find out where data is
missing or if there are outliers; and to summarize the data.
• For data generated from logs, EDA also helps with debugging the logging process.
• Plotting data and making comparisons can get you extremely far, and is far better to do
than getting a dataset and immediately running a regression model.
The Data Science Process
• Diverse activities create the raw data.
• Clean data: the data-cleaning process uses various tools such as
Python, shell scripts, R, or SQL, or all of the above, to get the data
into a nicely formatted state.
Case Study: RealDirect (Online real estate firm)
• The founder's goal with RealDirect is to use all the data he can access about real estate to improve the way people sell
and buy houses.
• In the USA, people normally sell their homes about once every seven years, and they do so
with the help of professional brokers and current data. But there is a problem both with the broker system
and with the data quality.
• The brokers are typically "free agents" operating on their own; think of them as home-sales
consultants.
• This means that they guard their data aggressively, and the really good ones have lots of experience. But in
the grand scheme of things, that really means they have only slightly more data than the inexperienced
brokers.
Case Study: RealDirect (Online real estate firm) (Cont.,)
• RealDirect is addressing this problem by hiring a team of licensed real estate agents who work
together and pool their knowledge. To accomplish this, it built an interface for sellers, giving
them useful data-driven tips on how to sell their house. It also uses interaction data to give
real-time recommendations on what to do next.
• The team of brokers also become data experts, learning to use information-collecting tools to
keep tabs on new and relevant data or to access publicly available information.
• The problem with publicly available data is that it is old news: there is a three-month lag between a
sale and when the data about that sale is available.
• RealDirect is working on real-time feeds on things like when people start searching for a home,
what the initial offer is, the time between offer and close, and how people search for a home
online.
How Does RealDirect Make Money?
• First, it offers a subscription to sellers (about $395 a month) to access the selling tools.
Second, it allows sellers to use RealDirect’s agents at a reduced commission, typically 2%
of the sale instead of the usual 2.5% or 3%.
• The site itself is best thought of as a platform for buyers and sellers to manage their sale
or purchase process. There are statuses for each person on site: active, offer made, offer
rejected, showing, in contract, etc. Based on your status, different actions are suggested
by the software.
Three Basic Machine Learning Algorithms:
Data samples can be thought of as, and used for, classification and prediction problems when we
express them mathematically.
Users can develop models and algorithms, which can then be used to classify, predict, and
make decisions.
Once users become familiar with building models, they have to decide which model is really
required, and the choice of statistical model depends on the context of the problem.
Choosing a statistical machine learning model also depends on the data scientist's experience.
A common myth in data science is: "I am well versed in linear regression, so I always use
the same algorithm for exploring diverse applications."
Three Basic Machine Learning Algorithms: (Cont.,)
As a good data scientist, it is wise to talk things through with someone who is familiar with
diverse ML algorithms: speak with a coworker or head to a meetup group before adopting a
statistical model for exploring real-world applications.
Here we focus on:
1. Linear regression – supervised machine learning – prediction
2. k-Nearest Neighbours (k-NN) – supervised machine learning – classification
3. k-means – unsupervised machine learning – clustering / association
Linear Regression
Introduction
• Linear regression analysis is used to predict the value of a variable based on the value
of another variable.
• The variable you are using to predict the other variable's value is called the
independent variable; the variable being predicted is called the dependent variable.
Introduction (Cont.,)
• Users can perform the linear regression method in a variety of programs and
environments, including:
• R(Tool) linear regression.
• MATLAB linear regression.
• Sklearn linear regression.
• Linear regression Python.
• Excel linear regression.
Why Linear Regression (LR)
• Worked example: fit the line y = Mx + B to five data points.

X    Y     Deviation of X   Deviation of Y   Product of Deviations (XY)   Square of Deviations of X
1    1.2   -2               -1.32            2.64                         4
2    1.8   -1               -0.72            0.72                         1
3    2.6    0                0.08            0.00                         0
4    3.2    1                0.68            0.68                         1
5    3.8    2                1.28            2.56                         4

Mean X = 3, Mean Y = 2.52; Sum of Product of Deviations = 6.6; Sum of Squares of Deviations of X = 10.

• Calculate M = Sum of Product of Deviations / Sum of Squares of Deviations of X = 6.6 / 10 = 0.66
• Calculate B = Mean of Y – (M * Mean of X) = 2.52 – 0.66 * 3 = 0.54
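The same values of M and B can be verified in a few lines of Python (assuming NumPy):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([1.2, 1.8, 2.6, 3.2, 3.8])

dev_x = x - x.mean()  # deviations of X from its mean (3.0)
dev_y = y - y.mean()  # deviations of Y from its mean (2.52)

m = np.sum(dev_x * dev_y) / np.sum(dev_x ** 2)  # 6.6 / 10 = 0.66
b = y.mean() - m * x.mean()                     # 2.52 - 0.66 * 3 = 0.54
print(m, b)                                     # fitted line: y = 0.66x + 0.54
```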
LR Graph
Example: Car Seat Sales Prediction

Sales Percentage   Price
6.83               78.05
6.56               76.74
7.53               81.32
5.37               68.34
8.67               91.35
Standard Error:
• Residuals, or errors, are the differences between the actual value (y) and the predicted value (ŷ).
• If the difference between the actual and predicted value is zero, the model
fits the data perfectly.
Validation for Regression Methods
• Mean Absolute Error (MAE)
• MAE is the mean of the absolute residuals (errors): the difference between estimated/predicted
target values and actual target values. It can be written mathematically as:
• MAE = (1/n) Σᵢ |yᵢ − ŷᵢ|
• Here ŷᵢ is the predicted value, yᵢ is the actual target value, and n is the total number of
samples used in the regression analysis.
Validation for Regression Methods
• Mean Squared Error (MSE)
• It is the mean of the squared residuals (errors). This value is always non-negative, and values closer to 0 indicate a better fit.
• MSE is written mathematically as:
• MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
• The related R² score is the ratio of the explained sum of squares (SSR, the numerator) to the total sum of squares (the denominator).
• MAE and MSE depend on the context, as we have seen, whereas the R² score is independent of context.
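As a sketch, all three validation metrics can be computed with scikit-learn's metrics module; the actual and predicted values below are made up:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 9.0])  # actual target values (made up)
y_pred = np.array([2.8, 5.4, 7.1, 9.3])  # model predictions (made up)

print(mean_absolute_error(y_true, y_pred))  # MAE: mean of |y - y_hat|
print(mean_squared_error(y_true, y_pred))   # MSE: mean of (y - y_hat)^2
print(r2_score(y_true, y_pred))             # R^2: context-independent fit score
```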
Validation for Regression Methods
• P-value
• The linear regression coefficients describe the mathematical relationship between each
independent variable and the dependent variable. The p-values for the coefficients
indicate whether these relationships are statistically significant.
Regression Problem statement
• Problem statement: predict the sales of items as shown in the table.
• Consider two new items, I6 and I7, whose actual values are 80 and 75 respectively.
• The regression model predicts values for items I6 and I7, which are then compared
with these actual values.
KNN Algorithm
Glimpse of KNN working Principle
• The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it
predicts the label or value of a new data point by considering the labels or values of its
K nearest neighbors in the training dataset.
Case study
• Empty circles represent low income and dark circles represent high income.
• The test instance is a person aged 57 with an income of Rs 38,000.
• Does this person belong to the low-income group or the high-income group?
K-NN
• KNN is instance-based learning: it works on the basis of "memorize and apply".
• Evelyn Fix and Joseph Hodges developed this algorithm in 1951, which was subsequently
expanded by Thomas Cover.
• The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine
learning algorithm that can be used to solve both classification and regression problems.
• Instance-based learning does not create a general pattern. Instead of creating a general pattern,
it compares the new problem/pattern with the existing instances.
• Ex: the spam filter in Gmail keeps memory patterns of spam mails and filters new spam mail
based on them.
• A general analogy for instance-based learning:
• A student who learns and understands the concepts and writes the exam remembers the concepts
even after the exam.
• A student who crams the concepts from memory forgets them after the exam; this is
instance-based learning.
• Instance-based learning constructs the target function only when a new instance must be classified.
K-NN (Cont.,)
• Every time a new query instance is encountered, its relationship to the previously stored
examples is examined in order to assign a target value to the new instance.
• Hence, instance-based learning is also called lazy learning or memory-based learning.
• K-NN is a simple and powerful non-parametric algorithm that predicts the category of a test
instance according to the K training samples that are closest to the test instance, and
classifies it into the category with the largest probability.
• To compute similarity, KNN uses distance-measuring techniques such as
Euclidean distance, Hamming distance, and the city-block distance approach.
• The better neighbours are those at the smallest distances; here, with K = 3, the three training
instances at the minimum distances are selected.
K-NN (Cont.,)
• What is the K-Nearest Neighbors algorithm, and what are its applications?
• KNN is one of the most basic yet essential classification algorithms in machine learning, and
it belongs to the supervised learning domain.
• KNN is useful for various demanding applications such as pattern recognition, data mining,
and intrusion detection.
• KNN compares new problems with the instances in the training sample (previous
samples, or the whole sample, may be considered), which are stored in memory.
K-NN algorithm
• Input: training dataset T, test instance t, number of nearest neighbours K, and
distance metric D.
• Output: predicted class.
• Step 1: Compute the distance between the test instance t and every instance in
the training dataset T using the distance formula.
• The distance formula may be Euclidean distance, Hamming distance, or city-block distance.
• Step 2: Sort the distances in ascending order and select the first K nearest training data
instances.
• Step 3: Predict the class of the new test instance by majority voting among the K nearest neighbours.
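A minimal from-scratch sketch of these three steps, using Euclidean distance and majority voting; the toy training data mirror the earlier age/income case study and are invented for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, test_point, k=3):
    """Predict the class of test_point by majority vote of its k nearest neighbours."""
    # Step 1: Euclidean distance from the test instance to every training instance
    dists = np.linalg.norm(train_X - test_point, axis=1)
    # Step 2: indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Step 3: majority vote among the k nearest labels
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: (age, income in thousands of Rs), labelled low/high income (made up)
train_X = np.array([[25, 20], [30, 25], [55, 40], [60, 45], [28, 22], [58, 42]])
train_y = np.array(["low", "low", "high", "high", "low", "high"])
print(knn_predict(train_X, train_y, np.array([57, 38]), k=3))  # -> "high"
```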
Example 1
• In a KPSC exam, a candidate secured the following marks. Based on the marks obtained,
categorize whether the candidate is eligible or not using K nearest
neighbours. The given instance is:
• General Studies X(O1) = 6 and Computer Science Y(O2) = 8.
• K = 3 nearest neighbours. Training data: (Note: O is the observed value, a is the actual value.)
• Drawbacks
• The cost of classifying a new instance can be high.
• The instance-based approach (KNN) considers all the attributes in the training samples when
classifying a new instance; hence it retrieves similar kinds of instances from memory.
• If the target concept depends on only a few of the many available attributes, there is a
chance of a wrong target function output.
Distance Measuring
• Distance measures play an important role in machine learning.
• A distance measure is an objective score that summarizes the relative difference between
two objects / samples in a problem domain.
• Most commonly, the two objects are rows of data that describe a subject (such as a
person, car, or house), or an event (such as a purchase, a claim, or a diagnosis).
• Perhaps the most likely way you will encounter distance measures is through the KNN classifier.
• In the KNN algorithm, a classification or regression prediction is made for new examples
by calculating the distance between the new example (row) and all examples (rows) in the
training dataset.
• Popular Distance Measuring algorithm are :
• Hamming Distance
• Euclidean Distance
• Manhattan Distance
• Minkowski Distance
Distance Measuring
• Euclidean Distance
• The Euclidean distance is the most widely used distance measure in clustering.
• It is also called the L2 norm.
• It calculates the straight-line distance between two points in n-dimensional space. The
formula for Euclidean distance is:
• d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
Distance Measuring
• Manhattan Distance
• Another name for Manhattan distance is city-block distance.
• It is also known as boxcar distance, absolute-value distance, or the L1 norm.
• d(x, y) = Σᵢ |xᵢ − yᵢ|
Distance Measuring
• Chebyshev Distance:
• This approach is also known as the maximum-value distance.
• It computes the maximum absolute difference between the coordinates of a pair of
objects:
• d(x, y) = maxᵢ |xᵢ − yᵢ|
Example for distance metrics:
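A small NumPy sketch of the three metrics introduced above, evaluated on two made-up points:

```python
import numpy as np

p = np.array([2.0, 3.0, 5.0])
q = np.array([6.0, 1.0, 4.0])

diff = np.abs(p - q)              # |x_i - y_i| per coordinate: [4, 2, 1]
print(np.sqrt(np.sum(diff ** 2))) # Euclidean (L2): straight-line distance, sqrt(21)
print(np.sum(diff))               # Manhattan (L1 / city block): 7
print(np.max(diff))               # Chebyshev (max value distance): 4
```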
Distance Measuring
• Hamming Distance :
• Can be used to find the distance between two strings or pairs of words or DNA sequences
of the same length.
• The distance between olive and ocean is 4 because aside from the “o” the other 4 letters
are different.
• The distance between shoe and hose is 3 because aside from the “e” the other 3 letters are
different.
• Just go through each position and check whether the letters are the same in that position;
if not, increment your count by 1.
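A tiny helper function illustrating this counting procedure (a sketch, not a library routine):

```python
def hamming(a, b):
    """Count positions where two equal-length strings differ."""
    assert len(a) == len(b), "Hamming distance needs equal-length inputs"
    return sum(1 for ca, cb in zip(a, b) if ca != cb)

print(hamming("olive", "ocean"))  # 4: only the leading 'o' matches
print(hamming("shoe", "hose"))    # 3: only the trailing 'e' matches
```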
Distance Measuring
• Mahalanobis distance :
• The Mahalanobis distance (MD) is the distance between two points in multivariate
space.
• The MD measures the relative distance between two variables with respect to the
centroid.
• It has excellent applications in multivariate anomaly detection, classification on highly
imbalanced datasets and one-class classification and more untapped use cases.
• It is effectively a multivariate equivalent of the Euclidean distance.
• The Mahalanobis distance of an observation x = (x₁, x₂, …, x_N)ᵀ from a set of
observations with mean μ = (μ₁, μ₂, …, μ_N)ᵀ and covariance matrix S is defined as:
• D_M(x) = √( (x − μ)ᵀ S⁻¹ (x − μ) )
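A minimal NumPy sketch of this formula on a made-up sample (the covariance matrix S must be invertible):

```python
import numpy as np

# Small multivariate sample (made up): rows are observations, columns are variables
X = np.array([[2.0, 2.1], [2.5, 2.4], [3.0, 3.2], [3.5, 3.0], [4.0, 4.1]])

mu = X.mean(axis=0)          # mean vector
S = np.cov(X, rowvar=False)  # covariance matrix
S_inv = np.linalg.inv(S)

x = np.array([3.2, 2.0])     # observation to score
d = x - mu
md = np.sqrt(d @ S_inv @ d)  # sqrt((x - mu)^T S^-1 (x - mu))
print(md)
```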
K-Means
• K-Means Clustering is an Unsupervised Machine Learning algorithm, which groups the
unlabeled dataset into different clusters.
• It is an iterative partitional algorithm; here k stands for the user-specified number of requested
clusters, and the user may not be aware of how many clusters are actually present in the available dataset.
• In a dataset, each individual sample ends up belonging to exactly one cluster.
• Unsupervised Machine Learning is the process of teaching a computer to use unlabeled,
unclassified data and enabling the algorithm to operate on that data without supervision.
• Without any previous data training, the machine’s job in this case is to organize unsorted
data according to parallels, patterns, and variations.
• K means clustering, assigns data points to one of the K clusters depending on their
distance from the center of the clusters.
K-Means (Cont.,)
• The K-means algorithm requires initial values; the algorithm can select K data points
randomly or use prior knowledge of the data.
• It starts by randomly placing the cluster centroids in the space. Each data point is then assigned to
the cluster whose centroid lies at the minimal distance.
• After assigning each point to one of the clusters, new cluster centroids are computed. This process
runs iteratively until it finds good clusters. In this analysis we assume that the number of clusters is
given in advance and we have to put the points into one of the groups.
• The K-means iterative process continues until no change in the assignment of instances to clusters is noticed;
only then does the algorithm terminate, and termination is guaranteed.
• Drawback:
• In some cases, K is not clearly defined, and we have to think about the optimal value of K.
K-means clustering performs best when the data are well separated.
K-Means Algorithm
• The algorithm works as follows:
• Step 1: Choose K initial centroids (for example, K randomly selected data points).
• Step 2: Assign each data point to its nearest centroid.
• Step 3: Recompute each centroid as the mean of the points assigned to it.
• Step 4: Repeat Steps 2 and 3 until the assignments no longer change.
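A compact from-scratch sketch of these steps in plain NumPy; the toy 2-D data are invented for illustration, and empty clusters are not handled:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain k-means: returns (centroids, labels). A teaching sketch, not production code."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # Step 4: no change, terminate
            break
        centroids = new_centroids
    return centroids, labels

X = np.array([[1, 1], [1.5, 2], [1, 1.5], [8, 8], [8.5, 9], [9, 8]])
centroids, labels = kmeans(X, k=2)
print(labels)  # two well-separated groups
```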
K-Means Algorithm
• Advantages
• It is a very simple algorithm.
• It is easy to implement.
• Drawbacks
• It is sensitive to the initial values; if the user chooses random or imprecise starting points,
this can lead to wrong clustering.
• If the data sample size is large, the algorithm takes a long time to process.
K-Means (Cont.,)
• How to choose the K value:
• K is a user-specified value; it specifies the number of clusters.
• There is no gold-standard rule for choosing the value of K.
• The K-means algorithm is run with multiple values of K, and the within-group variance for each K
is plotted as a line graph; this kind of plot is called an elbow curve.
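A sketch of how an elbow curve might be produced with scikit-learn's KMeans, where the inertia_ attribute gives the within-cluster sum of squares; the three blobs are simulated:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in (0, 5, 10)])  # 3 blobs

inertias = []
ks = range(1, 8)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squares

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Within-group variance (inertia)")
plt.title("Elbow curve")
plt.show()
```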
End of Module 2