
Unit 2: IDS

Descriptive Statistics
Statistics is the science, or a branch of mathematics, that involves collecting, classifying, analyzing,
interpreting, and presenting numerical facts and data.
The study of numerical and graphical ways to describe and display your data is called descriptive
statistics. It describes the data and helps us understand the features of the data by summarizing the
given sample set or population of data.
Types of descriptive statistics
There are 3 main types of descriptive statistics:
[1] The distribution concerns the frequency of each value.
[2] The central tendency concerns the averages of the values.
[3] The variability or dispersion concerns how spread out the values are.
Distribution (also called Frequency Distribution)
Datasets consist of a distribution of scores or values. Statisticians use graphs and tables to summarize
the frequency of every possible value of a variable, rendered in percentages or numbers.
Central Tendency
Measures of central tendency estimate a dataset's average or center, finding the result using three
methods: mean, mode, and median.
Variability (also called Dispersion)
The measure of variability gives the statistician an idea of how spread out the responses are. The
spread has three aspects — range, standard deviation, and variance.
Research example
You want to study the popularity of different leisure activities by gender. You distribute a survey and
ask participants how many times they did each of the following in the past year:
• Go to a library
• Watch a movie at a theater
• Visit a national park
Your data set is the collection of responses to the survey. Now you can use descriptive statistics to
find out the overall frequency of each activity (distribution), the averages for each activity (central
tendency), and the spread of responses for each activity (variability).
Measures of central tendency
Advantages and disadvantages of mean, median and mode.
The mean is the most commonly used measure of central tendency. It represents the average of the given
collection of data. The median is the middle value among the observed set of values; it is found by
arranging the values in ascending or descending order and choosing the middle value.
The most frequently occurring value in the data set is known as the mode.
Measure | Advantages | Disadvantages
Mean | Takes account of all values to calculate the average. | A very small or very large value can distort the mean.
Median | Not affected by very large or very small values. | Since the median is a positional average, arranging the data in ascending or descending order is time-consuming for a large number of observations.
Mode | The only average that can be used if the data set is not numerical. | There can be more than one mode, or no mode at all, so the mode is not always representative of the data.
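As a quick illustration, here is a minimal Python sketch using the built-in statistics module (the sample values are made up for demonstration):

import statistics

data = [12, 15, 15, 18, 20, 22, 15, 30]   # hypothetical sample values

mean = statistics.mean(data)       # arithmetic average of all values
median = statistics.median(data)   # middle value after sorting
mode = statistics.mode(data)       # most frequently occurring value

print("Mean:", mean)      # 18.375
print("Median:", median)  # 16.5 (average of the two middle values 15 and 18)
print("Mode:", mode)      # 15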
Measures of Deviation:
Mean Deviation: In statistics, deviation means the difference between the observed and expected
values of a variable. In simple words, the deviation is the distance from the centre point, where the centre
point can be the median, mean, or mode. The mean deviation, or mean absolute deviation, is used to
compute how far, on average, the values fall from the middle of the data set.
Mean deviation, also known as mean absolute deviation (MAD), is the average absolute deviation of the
data points from the data set's mean, median, or mode.
Mean Deviation Formula
Mean Deviation = Σ |xᵢ − a| / n, where a is the chosen measure of central tendency (mean, median, or mode)
and n is the number of observations. For grouped data, Mean Deviation = Σ fᵢ|xᵢ − a| / Σ fᵢ.
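A minimal Python sketch of this computation (the data values below are arbitrary and are not taken from the examples that follow):

import statistics

data = [4, 7, 9, 10, 15]

mean = statistics.mean(data)      # 9.0
median = statistics.median(data)  # 9

# Mean deviation about the mean: average of absolute deviations from the mean
md_mean = sum(abs(x - mean) for x in data) / len(data)

# Mean deviation about the median: average of absolute deviations from the median
md_median = sum(abs(x - median) for x in data) / len(data)

print(md_mean)    # 2.8
print(md_median)  # 2.8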
Example 1: Calculate the Mean Deviation about the median using the Data given below:
(Ungrouped Data)

Example 2:

Example 3: Find the mean deviation for the following data set. (Grouped Data)

Example 4: Find the mean deviation about the median for the following data.

*The coefficient of mean deviation is calculated by dividing mean deviation by the average.
Standard Deviation and Variance
Standard deviation is the positive square root of the variance. In descriptive statistics, the standard
deviation measures the degree of dispersion, or scatter, of the data points relative to the mean. It tells how
the values are spread across the data sample and measures the variation of the data points from the mean.
Variance measures the average squared deviation of the data points from the mean, whereas the standard
deviation expresses the same spread in the original units of the data. The basic difference between
variance and standard deviation is therefore in their units: the standard deviation is expressed in the same
units as the data, while the variance is expressed in squared units.
Population variance: σ² = Σ(xᵢ − μ)² / N, with standard deviation σ = √σ².
Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1), with standard deviation s = √s².
Population Variance - All the members of a group are known as the population. When we want to find
how each data point in a given population varies or is spread out then we use the population variance. It is
used to give the squared distance of each data point from the population mean.
Sample Variance - If the size of the population is too large then it is difficult to take each data point into
consideration. In such a case, a select number of data points are picked up from the population to form the
sample that can describe the entire group. Thus, the sample variance is essentially the average of the
squared distances from the sample mean, with n − 1 in the denominator to correct for bias. The sample
variance is always calculated with respect to the sample mean.
Standard Deviation of Ungrouped Data
Example: If a die is rolled, then find the variance and standard deviation of the possibilities.
Solution: When a die is rolled, the possible number of outcomes is 6. So the sample space has n = 6
outcomes and the data set = {1, 2, 3, 4, 5, 6}.
To find the variance, first we need to calculate the mean of the data set.
Mean, x̅ = (1+2+3+4+5+6)/6 = 3.5
Putting the data values and the mean in the population variance formula:
Σ(xᵢ − x̅)² = (−2.5)² + (−1.5)² + (−0.5)² + (0.5)² + (1.5)² + (2.5)² = 17.5
Variance, σ² = 17.5/6 ≈ 2.92
Standard deviation, σ = √2.92 ≈ 1.71
Example: There are 39 plants in the garden. A few plants were selected randomly and their heights
in cm were recorded as follows: 51, 38, 79, 46, and 57. Calculate the standard deviation of their
heights. N=5.
Mean, x̅ = (51 + 38 + 79 + 46 + 57)/5 = 271/5 = 54.2
Σ(xᵢ − x̅)² = (−3.2)² + (−16.2)² + (24.8)² + (−8.2)² + (2.8)² = 962.8
Since the five plants are a sample of the 39 plants, the sample formula is used:
s² = 962.8/(5 − 1) = 240.7 and s = √240.7 ≈ 15.5 cm.
(If the population formula with N = 5 were used instead, σ = √(962.8/5) ≈ 13.9 cm.)
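These computations can be checked with Python's statistics module (a minimal sketch; pvariance/pstdev use the population formula with denominator n, while variance/stdev use the sample formula with denominator n − 1):

import statistics

die = [1, 2, 3, 4, 5, 6]
print(statistics.pvariance(die))  # 2.9166... (population variance, divide by n)
print(statistics.pstdev(die))     # 1.7078... (population standard deviation)

heights = [51, 38, 79, 46, 57]    # sample of plant heights in cm
print(statistics.variance(heights))  # 240.7    (sample variance, divide by n - 1)
print(statistics.stdev(heights))     # 15.51... (sample standard deviation)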
Standard Deviation of Grouped Data
For grouped data, σ = √[ Σ fᵢ(xᵢ − x̅)² / Σ fᵢ ], where xᵢ is the class mark (midpoint) of each class,
fᵢ is its frequency, and x̅ = Σ fᵢxᵢ / Σ fᵢ.
Coefficient of Variation
Coefficient of variation is a type of relative measure of dispersion. It is expressed as the ratio of the
standard deviation to the mean. A measure of dispersion is a quantity that is used to gauge the extent of
variability of data.
Coefficient of Variation (CV) = (Standard deviation / Mean) × 100% = (σ / x̅) × 100%.
A higher CV indicates greater relative variability of the data around the mean.
Shape of data: Skewness and Kurtosis
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. If one tail is longer than
the other, the distribution is skewed. Such distributions are called asymmetric or asymmetrical
distributions because they do not show symmetry. Symmetry means that one half of the distribution
is a mirror image of the other half. For example, the normal distribution is a symmetric distribution with
no skew; its tails are exactly the same. A symmetrical distribution has zero skewness, as all measures of
central tendency lie in the middle.
Skewness measures the symmetry or asymmetry of a data distribution, while kurtosis measures whether
the data are heavy-tailed or light-tailed relative to a normal distribution. Data can be positively skewed
(the long tail extends towards the right) or negatively skewed (the long tail extends towards the left).

When data is symmetrically distributed, the left-hand side and the right-hand side contain the same number
of observations. (If the dataset has 90 values, then each side has 45 observations.) But what if the data are
not symmetrically distributed? Such data are called asymmetrical data, and that is where skewness comes
into the picture.
Types of skewness
1. Positively skewed (right-skewed)
In symmetrically distributed data, all measures of central tendency (mean, median, and mode) are equal to
each other. In a positively skewed distribution these measures separate: the long tail extends towards the
right, and the few large values pull the mean to the right, so that typically mean > median > mode.

2. Negative skewed or left-skewed
A negatively skewed distribution is the exact reverse of a positively skewed distribution. In statistics, a
negatively skewed distribution is one in which more values are plotted on the right side of the graph and
the tail of the distribution spreads out on the left side. In a negatively skewed distribution the mean of the
data is less than the median, because a small number of low values pulls the mean to the left, so that
typically mean < median < mode.

Calculate the skewness coefficient of the sample
Pearson's first coefficient of skewness: Sk₁ = (Mean − Mode) / Standard deviation
Pearson's second coefficient of skewness: Sk₂ = 3 × (Mean − Median) / Standard deviation
Pearson's correlation coefficient ranges from -1 (perfect negative linear relationship) to +1 (perfect
positive linear relationship), with 0 indicating no linear relationship. Dividing the covariance by the
product of the standard deviations scales the value down to this limited range of -1 to +1, which is exactly
the range of the correlation values.
Pearson's first coefficient of skewness is helpful when the data have a strong, well-defined mode. If the
data have a weak mode or multiple modes, Pearson's first coefficient is not preferred, and Pearson's
second coefficient may be superior, as it does not rely on the mode.

If the skewness is between -0.5 and 0.5, the data are nearly symmetrical.
If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the
data are slightly skewed.
If the skewness is lower than -1 (negatively skewed) or greater than 1 (positively skewed), the data are
extremely skewed.
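A minimal Python sketch of Pearson's two skewness coefficients, computed directly from the definitions above (the data values are made up):

import statistics

data = [2, 3, 3, 4, 5, 6, 6, 6, 8, 12]

mean = statistics.mean(data)      # 5.5
median = statistics.median(data)  # 5.5
mode = statistics.mode(data)      # 6
sd = statistics.pstdev(data)      # population standard deviation

sk1 = (mean - mode) / sd          # Pearson's first coefficient (uses the mode)
sk2 = 3 * (mean - median) / sd    # Pearson's second coefficient (uses the median)
print(sk1, sk2)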

Kurtosis
Kurtosis is a statistical measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution.

In finance, kurtosis is used as a measure of financial risk. A large kurtosis is associated with a high level
of risk for an investment because it indicates that there are high probabilities of extremely large and
extremely small returns. On the other hand, a small kurtosis signals a moderate level of risk because the
probabilities of extreme returns are relatively low.
Excess Kurtosis
Excess kurtosis is used in statistics and probability theory to compare the kurtosis coefficient with that of
the normal distribution. Excess kurtosis can be positive (leptokurtic distribution), negative (platykurtic
distribution), or near zero (mesokurtic distribution). Since the normal distribution has a kurtosis of 3,
excess kurtosis is calculated by subtracting 3 from the kurtosis.
Excess kurtosis = Kurt – 3
Types of excess kurtosis
1. Leptokurtic (kurtosis > 3)
A leptokurtic distribution has long, heavy tails, which means there is a greater chance of outliers. Positive
excess kurtosis indicates that the distribution is sharply peaked and possesses thick tails. An extreme
positive kurtosis indicates a distribution where more of the values are located in the tails of the
distribution instead of around the mean.
2. Platykurtic (kurtosis < 3)
A platykurtic distribution has short, thin tails and is spread around the centre, which means most of the
data points lie in close proximity to the mean. A platykurtic distribution is flatter (less peaked) than the
normal distribution.
3. Mesokurtic (kurtosis = 3)
A mesokurtic distribution has the same kurtosis as the normal distribution, which means its excess
kurtosis is near zero. Mesokurtic distributions are moderate in breadth, and their curves have a medium
peak height.
In summary:
Leptokurtic distribution: kurtosis greater than that of the normal distribution (positive excess kurtosis).
Mesokurtic distribution: kurtosis the same as the normal distribution (excess kurtosis near zero).
Platykurtic distribution: kurtosis less than that of the normal distribution (negative excess kurtosis).
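Where SciPy is available, sample skewness and excess kurtosis can be computed directly (a minimal sketch on synthetic, roughly normal data; note that scipy.stats.kurtosis returns excess kurtosis by default, i.e. kurtosis minus 3):

import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=1, size=10_000)   # synthetic, roughly normal data

print(skew(data))                    # close to 0 for symmetric data
print(kurtosis(data))                # excess kurtosis, close to 0 (mesokurtic)
print(kurtosis(data, fisher=False))  # "raw" kurtosis, close to 3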
Calculate Population Skewness and Population Kurtosis from the following grouped data
Example-1


Karl Pearson Coefficient of Correlation


The Karl Pearson coefficient of correlation is an important and widely used tool in statistics. It is defined
as a linear correlation coefficient that falls in the numeric range of -1 to +1. It is a quantitative method
that gives a numeric value for the intensity of the linear relationship between the variables X and Y. The
following sections examine this topic, the Karl Pearson Coefficient of Correlation, in more detail.
What is a Correlation Coefficient?
Before delving into details about Karl Pearson Coefficient of Correlation, it is vital to brush up on
fundamental concepts about correlation and its coefficient in general.
The correlation coefficient can be defined as a measure of the relationship between two quantitative or
qualitative variables, i.e., X and Y. It serves as a statistical tool that helps to analyze and in turn, measure
the degree of the linear relationship between the variables.
For example, a change in the monthly income (X) of a person leads to a change in their monthly
expenditure (Y). With the help of correlation, you can measure the degree to which a change in one
variable affects the other variable.
Types of Correlation Coefficient
Depending on the direction of the relationship between variables, correlation can be of three types,
namely –
Positive Correlation (0 to +1)
In this case, the direction of change between X and Y is the same. For instance, an increase in the
duration of a workout leads to an increase in the number of calories one burns.
Negative Correlation (0 to -1)
Here, the direction of change between the X and Y variables is opposite. For example, when the price of a
commodity increases, its demand decreases.
Zero Correlation (0)
There is no relationship between the variables in this case. For instance, an increase in height has no
impact on one's intelligence.
What is Karl Pearson’s Coefficient of Correlation?
This method is also known as the Product Moment Correlation Coefficient and was developed by Karl
Pearson. It is one of the three most widely used methods of measuring the level of correlation, besides the
scatter diagram and Spearman's rank correlation.
The Karl Pearson correlation coefficient method is quantitative and offers a numerical value to establish
the intensity of the linear relationship between X and Y. The coefficient of correlation is denoted by 'r'.

The Karl Pearson Coefficient of Correlation formula is expressed as:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)² ]
Equivalently, without taking deviations from the mean:
r = [ nΣxy − (Σx)(Σy) ] / √[ (nΣx² − (Σx)²)(nΣy² − (Σy)²) ]
Direct Method/ Actual Mean Method

The marks obtained by 5 students in algebra and trigonometry are given below. Calculate the Karl Pearson
coefficient of correlation without taking deviations from the mean.

r= -0.424
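As a quick cross-check, Pearson's r can be computed with NumPy (a minimal sketch; the marks below are hypothetical and are not the data of the example above):

import numpy as np

algebra      = np.array([15, 18, 21, 24, 27])   # hypothetical marks in algebra
trigonometry = np.array([25, 25, 27, 31, 32])   # hypothetical marks in trigonometry

# np.corrcoef returns the 2x2 correlation matrix; r is the off-diagonal element
r = np.corrcoef(algebra, trigonometry)[0, 1]
print(r)   # close to +1 here, since both sets of marks increase together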
Assumed Mean Method

Box Plot
When we display the data distribution in a standardized way using the five-number summary (minimum,
Q1 (first quartile), median, Q3 (third quartile), and maximum), it is called a box plot. It is also termed a
box-and-whisker plot. It is a type of chart that depicts a group of numerical data through their quartiles. It
is a simple way to visualize the shape of our data and makes comparing characteristics of data between
categories very easy.
Parts of Box Plots

Minimum: The minimum value in the given dataset.
First Quartile (Q1): The first quartile is the median of the lower half of the data set.

Median: The median is the middle value of the dataset, which divides the given dataset into two equal
parts. The median is considered as the second quartile.
Third Quartile (Q3): The third quartile is the median of the upper half of the data.
Maximum: The maximum value in the given dataset.
Apart from these five terms, the other terms used in the box plot are:
Interquartile Range (IQR): The difference between the third quartile and the first quartile, i.e.,
IQR = Q3 − Q1.
Outlier: A data point that falls far to the left or right of the ordered data. Generally, outliers fall more than
a specified distance from the first and third quartiles, i.e., outliers are greater than Q3 + (1.5 × IQR) or
less than Q1 − (1.5 × IQR).
Positively Skewed: If the distance from the median to the maximum is greater than the distance from the
median to the minimum, then the box plot is positively skewed.
Negatively Skewed: If the distance from the median to minimum is greater than the distance from the
median to the maximum, then the box plot is negatively skewed.
Symmetric: The box plot is said to be symmetric if the median is equidistant from the maximum and
minimum values.
The median ( Q2 ) divides the data set into two parts, the upper set and the lower set. The lower
quartile ( Q1 ) is the median of the lower half, and the upper quartile ( Q3 ) is the median of the upper
half.
Example -1:
Find Q1 , Q2 , and Q3 for the following data set, and draw a box-and-whisker plot.
{2,6,7,8,8,11,12,13,14,15,22,23}
There are 12 data points. The middle two are 11 and 12, so the median, Q2, is 11.5.
The "lower half" of the data set is {2, 6, 7, 8, 8, 11}. The median here is 7.5, so Q1 = 7.5.
The "upper half" of the data set is {12, 13, 14, 15, 22, 23}. The median here is 14.5, so Q3 = 14.5.
A box-and-whisker plot displays the values Q1 , Q2 , and Q3 , along with the extreme values of the data
set ( 2 and 23 , in this case):

A box & whisker plot shows a "box" with left edge at Q1 , right edge at Q3 , the "middle" of the box
at Q2 (the median) and the maximum and minimum as "whiskers".
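A minimal Python sketch that reproduces these values using the median-of-halves method described above (note that library routines such as numpy.percentile use interpolation rules that may give slightly different quartiles):

from statistics import median

data = sorted([2, 6, 7, 8, 8, 11, 12, 13, 14, 15, 22, 23])
n = len(data)

q2 = median(data)                 # 11.5
lower = data[: n // 2]            # lower half (excludes the median when n is odd)
upper = data[(n + 1) // 2 :]      # upper half
q1 = median(lower)                # 7.5
q3 = median(upper)                # 14.5

iqr = q3 - q1                     # 7.0
low_fence  = q1 - 1.5 * iqr       # values below this are outliers
high_fence = q3 + 1.5 * iqr       # values above this are outliers
print(q1, q2, q3, iqr, low_fence, high_fence)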
Example -2: Let us take a sample data to understand how to create a box plot.
Here are the runs scored by a cricket team in a league of 12 matches.

Note: If the total number of values is odd then we exclude the Median while calculating Q1 and Q3. Here
since there were two central values we included them. Now, we need to calculate the Inter Quartile
Range.

The value outside this range, i.e. the outlier, is 220.

Example-3: Calculate Box and Whisker Plots from the following grouped data

Uses of a Box Plot
Box plots provide a visual summary of the data with which we can quickly identify the central value of
the data (the median), how dispersed the data are, and whether the data are skewed (skewness).

Pivot Tables:
A pivot table is a powerful data summarization tool that can automatically sort, count, and sum up data
stored in tables and display the summarized data. Pivot tables are useful to quickly create crosstabs (a
process or function that combines and/or summarizes data from one or more sources into a concise format
for analysis or reporting) to display the joint distribution of two or more variables.
Typically, with a pivot table the user sets up and changes the data summary's structure by dragging and
dropping fields graphically. This "rotation" or pivoting of the summary table gives the concept its name.
Three key reasons for organizing data into a pivot table are:
• To summarize the data contained in a lengthy list into a compact format.
• To find relationships within the data that are otherwise hard to see because of the amount of detail.
• To organize the data into a format that is easy to read.
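For illustration, a pivot table can be produced in Python with pandas (a minimal sketch; the column names and values are hypothetical):

import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "amount":  [100, 150, 200, 120, 130, 90],
})

# Cross-tabulate total sales amount by region (rows) and product (columns)
pivot = pd.pivot_table(sales, values="amount", index="region",
                       columns="product", aggfunc="sum", fill_value=0)
print(pivot)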
HeatMap
Heatmaps visualize the data in a 2-dimensional format in the form of colored maps. The color maps use
hue, saturation, or luminance to achieve color variation to display various details. This color variation
gives visual cues to the readers about the magnitude of the numeric values. Heatmaps essentially replace
numbers with colors because the human brain understands visuals more easily than numbers, text, or any
other written data.
A heatmap (or heat map) is a graphical representation of numerical data, where individual data points
contained in the data set are represented using different colors. The key benefit of heatmaps is that they
simplify complex numerical data into visualizations that can be understood at a glance. For example, on
website heatmaps, "hot" colors depict high user engagement, while "cold" colors depict low engagement.
Uses of HeatMap
1. Business Analytics: A heat map is used as a visual business analytics tool. A heat map gives quick
visual cues about the current results, performance, and scope for improvements. Heatmaps can analyze
the existing data and find areas of intensity that might reflect where most customers reside, areas of risk
of market saturation, or cold sites and sites that need a boost.
2. Website: Heatmaps are used on websites to visualize data about visitors' behavior. This visualization
helps business owners and marketers identify the best- and worst-performing sections of a webpage.
3. Exploratory Data Analysis: EDA is a task performed by data scientists to get familiar with the data.
All the initial studies are done to understand the data are known as EDA. EDA is done to summarize their
main features, often with visual methods, which includes Heatmaps.
4. Molecular Biology: Heat maps are used to study disparity and similarity patterns in DNA, RNA, etc.
5. Marketing and Sales: The heatmap's capability to detect warm and cold spots is used to improve
marketing response rates through targeted marketing. Heatmaps allow the detection of areas that respond to
campaigns, under-served markets, customer residence, and high sale trends, which helps optimize product
lineups, capitalize on sales, create targeted customer segments, and assess regional demographics.
Types of HeatMaps
Typically, there are two types of Heatmaps:
Grid Heatmap: The magnitudes of values shown through colors are laid out into a matrix of rows and
columns, mostly by a density-based function. Below are the types of Grid Heatmaps.
Clustered Heatmap: The goal of a clustered heatmap is to build associations between both the data points
and their features. This type of heatmap implements clustering as part of the process of grouping similar
features. Clustered heatmaps are widely used in the biological sciences for studying gene similarities
across individuals.
Correlogram: A correlogram replaces each of the variables on the two axes with numeric variables in the
dataset. Each square depicts the relationship between the two intersecting variables, which helps to build
descriptive or predictive statistical models.
Spatial Heatmap: Each square in a Heatmap is assigned a color representation according to the nearby
cells‘ value. The location of color is according to the magnitude of the value in that particular space.
These heatmaps are a data-driven "paint by numbers" canvas overlaid on top of an image. The cells with
higher values than other cells are given a hot color, while cells with lower values are assigned a cold
color.
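As an illustration, a correlogram-style heatmap can be drawn with seaborn (a minimal sketch; the DataFrame and its column names are hypothetical):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])

# Color each cell by the pairwise correlation between the two intersecting variables
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation heatmap (correlogram)")
plt.show()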
ANOVA, which stands for Analysis of Variance, is a statistical test used to analyze the difference
between the means of more than two groups. A one-way ANOVA uses one independent variable, while a
two-way ANOVA uses two independent variables.
A One-Way ANOVA is used to determine how one factor impacts a response variable. For example, we
might want to know if three different studying techniques lead to different mean exam scores. To see if
there is a statistically significant difference in mean exam scores, we can conduct a one-way ANOVA.

A Two-Way ANOVA is used to determine how two factors impact a response variable, and to determine
whether or not there is an interaction between the two factors on the response variable. For example, we
might want to know how gender and different levels of exercise impact average weight loss. We would
conduct a two-way ANOVA to find out.

ANOVA Real Life Example #1


A large scale farm is interested in understanding which of three different fertilizers leads to the highest
crop yield. They sprinkle each fertilizer on ten different fields and measure the total yield at the end of the
growing season. To understand whether there is a statistically significant difference in the mean yield that
results from these three fertilizers, researchers can conduct a one-way ANOVA, using "type of fertilizer"
as the factor and "crop yield" as the response.
ANOVA Real Life Example #2
An example to understand this can be prescribing medicines. Suppose, there is a group of patients who
are suffering from fever. They are being given three different medicines that have the same functionality
i.e. to cure fever. To understand the effectiveness of each medicine and choose the best among them, the
ANOVA test is used.
ANOVA is used in a wide variety of real-life situations, but the most common include:
Retail: Stores are often interested in understanding whether different types of promotions, store layouts,
advertisement tactics, etc. lead to different sales. This is the exact type of analysis that ANOVA is built for.
Medical: Researchers are often interested in whether or not different medications affect patients
differently, which is why they often use one-way or two-way ANOVAs in these situations.
Environmental Sciences: Researchers are often interested in understanding how different levels of factors
affect plants and wildlife. Because of the nature of these types of analyses, ANOVAs are often used.
What is ANOVA Test
ANOVA test, in its simplest form, is used to check whether the means of three or more populations are
equal or not. The ANOVA test applies when there are more than two independent groups. The goal of the
ANOVA test is to check for variability within the groups as well as the variability among the groups. The
ANOVA test statistic is the F statistic, obtained from an F test.
ANOVA Test Definition
ANOVA test can be defined as a type of test used in hypothesis testing to compare whether the means of
two or more groups are equal or not. This test is used to check if the null hypothesis can be rejected or not
depending upon the statistical significance exhibited by the parameters. The decision is made by
comparing the ANOVA test statistic with the critical value.
The steps to perform the one-way ANOVA test are given below:
Step 1: Calculate the mean of each group.
Step 2: Calculate the total (grand) mean. This is done by adding all the group means and dividing by the
number of groups.
Step 3: Calculate the SSB/SSC (sum of squares between groups/columns).
Step 4: Calculate the between-groups degrees of freedom.
Step 5: Calculate the SSE (sum of squares due to error, i.e., within groups).
Step 6: Calculate the degrees of freedom of the errors.
Step 7: Determine the MSB (mean square between groups) and the MSE (mean square error).
Step 8: Find the F test statistic, f = MSB / MSE.
Step 9: Using the F table for the specified level of significance α, find the critical value. This is given by
F(α, df1, df2).
Step 10: If f > F, then reject the null hypothesis.
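In practice, the F statistic and p-value for a one-way ANOVA can be obtained with SciPy (a minimal sketch; the three groups below are made-up exam scores for three studying techniques):

from scipy.stats import f_oneway

technique_a = [85, 86, 88, 75, 78, 94, 98]
technique_b = [91, 92, 93, 85, 87, 84, 82]
technique_c = [79, 78, 88, 94, 92, 85, 83]

f_stat, p_value = f_oneway(technique_a, technique_b, technique_c)
print(f_stat, p_value)

# If p_value is below the chosen significance level (e.g. 0.05),
# reject the null hypothesis that all group means are equal.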
Assumptions for ANOVA
• Each group sample is drawn from a normally distributed population
• All populations have a common variance
• All samples are drawn independently of each other
• Within each sample, the observations are sampled randomly and independently of each other

ANOVA also uses a Null hypothesis and an Alternate hypothesis. The Null hypothesis in ANOVA is
valid when all the sample means are equal, or they don't have any significant difference. Thus, they can
be considered as a part of a larger set of the population. On the other hand, the alternate hypothesis is
valid when at least one of the sample means is different from the rest of the sample means. In
mathematical form, they can be represented as:
H₀: μ₁ = μ₂ = μ₃ = ... = μₖ (all group means are equal)
H₁: at least one group mean μᵢ differs from the others
Sums of Squares: In statistics, the sum of squares is defined as a statistical technique that is used in
regression analysis to determine the dispersion of data points. In the ANOVA test, it is used while
computing the value of F. As the sum of squares tells you about the deviation from the mean, it is also
known as variation.
Degrees of Freedom: Degrees of Freedom refer to the maximum numbers of logically independent
values that have the freedom to vary in a data set.
The Mean Squared Error: It tells us about the average error in a data set. To find the mean squared
error, we just divide the sum of squares by the degrees of freedom.
Example-1

Example-2:

F-Table

Data Pre-processing
Data preprocessing is the process of converting raw data into clean data that is suitable for modeling. A
model can fail for various reasons, and preprocessing steps such as imputing missing values and handling
outliers can significantly impact model results. Data preprocessing is done to improve the quality of data
in a data warehouse. It:
• Increases efficiency
• Eases the data mining process
• Removes noisy data, inconsistent data and incomplete data
Data mining and data warehousing are very powerful and popular techniques for analyzing and storing
data, respectively. Data warehousing is all about compiling and organizing data in a common database,
while data mining refers to the process of extracting important data from the databases. With the
definition, we can conclude that the data mining process is dependent on the data warehouse for
identifying patterns in the data and for drawing relevant conclusions. The process of data mining involves the use of
statistical models and algorithms to find hidden patterns in the data.
Major tasks of Data preprocessing-
Data Cleaning: It is a process to clean the data in such a way that the data can be easily integrated.
Data Integration: It is a process to integrate/combine all the data.
Data Reduction: It is a process to reduce large data into a smaller representation in such a way that the
data can be easily transformed further.
Data Transformation: It is a process to transform the data into a reliable shape.
Data Discretization: It converts a large number of data values into smaller ones, so that data evaluation
and data management become very easy.
Data Cleaning
After you load the data, the first thing to do is check how many variables there are, the types of the
variables, their distributions, and any data errors. Data cleaning fills in missing values, smooths noisy
data, resolves inconsistencies and removes outliers.
Things to pay attention to are:
• There are some missing values.
• There are outliers for store expenses (store_exp). The maximum value is 50000. Who would spend
$50,000 a year buying clothes? Is it a data-entry or imputation error?
• There is a negative value (-500) in store_exp, which is not logical.
• Someone is 300 years old.
• Phone numbers entered in the wrong format.
How can we clean Data:
Data validation: Apply constraints to make sure you have valid and consistent data. Data validation is
the process of ensuring the data have undergone data cleansing, that is, that they are both correct and
useful.
Data screening: Data screening is a method applied to remove errors from the data and make it suitable
for statistical analysis.
De-duplication: Delete the duplicate data.
String matching method: Identify close matches between your data and the valid values.
Approaches in Data Cleaning
1. Missing values
2. Noisy Data
1. Missing Values:
A missing value is a value that is not stored for some variable in the given dataset.
How can you go about filling in the missing values for an attribute? Let's look at the following methods.
• Ignore the data row: This method is suggested for records where the maximum amount of data is
missing, rendering the record meaningless. It is usually avoided when only a few attribute values are
missing, because if all rows with missing values are ignored, i.e. removed, it can result in poor
performance.
• Fill in the missing values manually: This is a very time-consuming method and hence infeasible for
almost all scenarios.
• Use a global constant to fill in the missing values: A global constant like "NA" or 0 can be used to fill
all the missing data. This method is used when missing values are difficult to predict.
• Use the attribute mean or median: The mean or median of the attribute is used to fill the missing value.
• Use forward fill or backward fill: Either the previous value or the next value is used to fill the missing
value. A mean of the preceding and succeeding values may also be used.
• Use the most probable value to fill in the missing value (decision tree or regression methods).
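A minimal pandas sketch of several of these strategies (the DataFrame and its column names are hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31, 300, np.nan],
                   "income": [40000, 52000, np.nan, 61000, 45000]})

dropped = df.dropna()                                 # ignore (remove) rows with missing values
const   = df.fillna(0)                                # fill with a global constant
mean_f  = df.fillna({"income": df["income"].mean()})  # fill with the attribute mean
ffilled = df.ffill()                                  # forward fill with the previous value
print(mean_f)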

2. Noisy Data: Noise is a random error or variance in a measured variable.


Approaches for Noisy Data
1. Binning  2. Regression  3. Clustering
1. Binning Methods for Data Smoothing
The binning method can be used for smoothing the data.
Real-world data is often full of noise. Data smoothing is a data pre-processing technique that uses
different kinds of algorithms to remove noise from the data set. This allows important patterns to stand out.
Unsorted data for price in dollars
Before sorting: 8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34
First of all, sort the data.
After sorting: 8, 9, 15, 16, 21, 21, 24, 26, 27, 30, 30, 34
Smoothing the data by equal-frequency bins
Bin 1: 8, 9, 15, 16
Bin 2: 21, 21, 24, 26
Bin 3: 27, 30, 30, 34
Smoothing by bin means
For Bin 1:
(8 + 9 + 15 + 16) / 4 = 12
(4 is the number of values in the bin: 8, 9, 15, 16)
Bin 1 = 12, 12, 12, 12
For Bin 2:
(21 + 21 + 24 + 26) / 4 = 23
Bin 2 = 23, 23, 23, 23
For Bin 3:
(27 + 30 + 30 + 34) / 4 = 30.25 ≈ 30
Bin 3 = 30, 30, 30, 30
Smoothing by bin boundaries
Bin 1: 8, 8, 16, 16
Bin 2: 21, 21, 26, 26
Bin 3: 27, 27, 27, 34
How to smooth data by bin boundaries?
You need to pick the minimum and maximum value. Put the minimum on the left side and maximum on
the right side.
Now, what happens to the middle values?
Each middle value in a bin moves to the boundary value (minimum or maximum) that is closer to it.
Smooth data after bin Boundary
Before bin Boundary: Bin 1: 8, 9, 15, 16
Here, 8 is the minimum value and 16 is the maximum value. 9 is nearer to 8, so 9 is treated as 8; 15 is
nearer to 16 and farther away from 8, so 15 is treated as 16.
After bin Boundary: Bin 1: 8, 8, 16, 16
Before bin Boundary: Bin 2: 21, 21, 24, 26,
After bin Boundary: Bin 2: 21, 21, 26, 26,
Before bin Boundary: Bin 3: 27, 30, 30, 34
After bin Boundary: Bin 3: 27, 27, 27, 34
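A short Python sketch of both smoothing methods, using the price data from the example above:

data = sorted([8, 16, 9, 15, 21, 21, 24, 30, 26, 27, 30, 34])
bins = [data[i:i + 4] for i in range(0, len(data), 4)]   # equal-frequency bins of size 4

# Smoothing by bin means: replace every value with its bin's mean
by_means = [[round(sum(b) / len(b), 2)] * len(b) for b in bins]

# Smoothing by bin boundaries: replace every value with the nearer of min/max of its bin
by_bounds = [[min(b) if x - min(b) <= max(b) - x else max(b) for x in b] for b in bins]

print(by_means)   # [[12.0, 12.0, 12.0, 12.0], [23.0, 23.0, 23.0, 23.0], [30.25, 30.25, 30.25, 30.25]]
print(by_bounds)  # [[8, 8, 16, 16], [21, 21, 26, 26], [27, 27, 27, 34]]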
2. Regression:
Here data can be made smooth by fitting it to a regression function. The regression used may be linear
(having one independent variable) or multiple (having multiple independent variables).
3. Clustering:
This approach groups similar data values into clusters. Values that fall outside the clusters may be
considered outliers (noise).
Data Integration
Data integration is the process of merging data from several disparate sources. While performing data
integration, you must work on data redundancy, inconsistency, duplicity, etc.
Data integration is important because it gives a uniform view of scattered data while also maintaining data
accuracy.
Data Integration Approaches
There are mainly two types of approaches for data integration. These are as follows:
Tight Coupling: It is the process of using ETL (Extraction, Transformation, and Loading) to combine
data from various sources into a single physical location.
Loose Coupling: With loose coupling, the data are kept in the actual source databases. This approach
provides an interface that takes a query from the user, transforms it into a format that the source databases
can understand, and then sends the query directly to the source databases to obtain the result.
Data Integration Techniques
There are various data integration techniques in data mining. Some of them are as follows:

Manual Integration: This method avoids using automation during data integration. The data analyst
collects, cleans, and integrates the data to produce meaningful information. This strategy is suitable for a
small organization with a limited data set, but it is a time-consuming operation.
Middleware Integration: The middleware software is used to take data from many sources, normalize it,
and store it in the resulting data set.
Application-based integration: It uses software applications to extract, transform, and load data from
disparate sources. This strategy saves time and effort, but it is a little more complicated because building
such an application necessitates technical understanding.
Uniform Access Integration: This method combines data from more disparate sources. However, the
data's position is not altered in this scenario; the data stays in its original location. This technique merely
generates a unified view of the integrated data. The integrated data does not need to be stored separately
because the end user only sees the integrated view.
Data Transformation:
Data transformation changes the format, structure, or values of the data and converts them into clean,
usable data.
Data Transformation Techniques
1. Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some algorithms. It
allows important features present in the dataset to stand out and helps in predicting patterns. When
collecting data, the data can be manipulated to eliminate or reduce variance or other forms of noise.
2. Attribute Construction
In the attribute construction method, new attributes are constructed from the existing attributes to build a
new data set that eases data mining. The new attributes are created and applied to assist the mining
process, which simplifies the original data and makes the mining more efficient.
For example, suppose we have a data set containing measurements of different plots, i.e., the height and
width of each plot. We can construct a new attribute 'area' from the attributes 'height' and 'width'. This
also helps in understanding the relations among the attributes in a data set.
3. Data Aggregation
Data collection or aggregation is the method of storing and presenting data in a summary format. The data
may be obtained from multiple data sources to integrate these data sources into a data analysis
description. This is a crucial step since the accuracy of data analysis insights is highly dependent on the
quantity and quality of the data used.
For example, we have a data set of sales reports of an enterprise that has quarterly sales of each year. We
can aggregate the data to get the enterprise's annual sales report.
4. Data Normalization: Normalizing the data refers to scaling the data values to a much smaller range
such as [-1, 1] or [0.0, 1.0]. There are different methods to normalize the data, as discussed below.
Consider that we have a numeric attribute A and we have n number of observed values for attribute A that
are V1, V2, V3, ….Vn.
Min-max normalization: This method implements a linear transformation of the original data. Let us
consider that minA and maxA are the minimum and maximum values observed for attribute A, and Vi is
the value of attribute A that has to be normalized. Min-max normalization maps Vi to V'i in a new, smaller
range [new_minA, new_maxA].
The formula for min-max normalization is given below:
V'i = ((Vi − minA) / (maxA − minA)) × (new_maxA − new_minA) + new_minA
For example, suppose we have $12,000 and $98,000 as the minimum and maximum values for the attribute income,

and [0.0, 1.0] is the range in which we have to map a value of $73,600.
The value $73,600 would be transformed using min-max normalization as follows:
V' = ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0.0) + 0.0 ≈ 0.716
Z-score normalization: This method normalizes the value for attribute A using the mean and standard
deviation. The following formula is used for Z-score normalization:
V'i = (Vi − Ā) / σA
Here Ᾱ and σA are the mean and standard deviation for attribute A, respectively.
For example, we have a mean and standard deviation for attribute A as $54,000 and $16,000. And we
have to normalize the value $73,600 using z-score normalization.
V' = (73,600 − 54,000) / 16,000 = 1.225
Decimal Scaling: This method normalizes the value of attribute A by moving the decimal point in the
value. This movement of a decimal point depends on the maximum absolute value of A. The formula for
the decimal scaling is given below:
V'i = Vi / 10^j
Here j is the smallest integer such that max(|V'i|) < 1.


Salary bonus | Formula | Normalized value after decimal scaling
400 | 400 / 1000 | 0.4
310 | 310 / 1000 | 0.31
We check the maximum absolute value of the attribute "salary bonus". Here the maximum value is 400,
which contains three digits, so j = 3 and every value is divided by 10^3 = 1000.
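A small Python sketch of the three normalization methods (a minimal sketch; the arrays are chosen to match the worked examples above):

import numpy as np

income = np.array([12_000.0, 54_000.0, 73_600.0, 98_000.0])

# Min-max normalization to the new range [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization using the stated mean ($54,000) and standard deviation ($16,000)
z_score = (income - 54_000) / 16_000

# Decimal scaling: divide by 10^j, where j is the smallest integer with max(|v'|) < 1
bonus = np.array([400, 310])
j = len(str(int(abs(bonus).max())))   # 400 has 3 digits -> j = 3
decimal_scaled = bonus / (10 ** j)

print(min_max)         # 73,600 maps to about 0.716
print(z_score)         # 73,600 maps to 1.225
print(decimal_scaled)  # [0.4, 0.31]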
5. Data Discretization
This is a process of converting continuous data into a set of data intervals. Continuous attribute values are
substituted by small interval labels. This makes the data easier to study and analyze. If a data mining task
handles a continuous attribute, its continuous values can be replaced by discrete interval labels, which
improves the efficiency of the task.
For example, the values for the age attribute can be replaced by the interval labels such as (0-10, 11-20…)
or (kid, youth, adult, senior).
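A minimal pandas sketch of this kind of discretization (the ages, bin edges, and labels are illustrative):

import pandas as pd

ages = pd.Series([3, 9, 15, 24, 37, 52, 71])

# Replace continuous ages with interval labels
labels = pd.cut(ages, bins=[0, 10, 20, 60, 100],
                labels=["kid", "youth", "adult", "senior"])
print(labels.tolist())   # ['kid', 'kid', 'youth', 'adult', 'adult', 'adult', 'senior']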
6. Data Generalization
It converts low-level data attributes to high-level data attributes using a concept hierarchy. This conversion
from a lower conceptual level to a higher one is useful for getting a clearer picture of the data. Data
generalization can be approached in two ways: a data cube (OLAP-based) approach and an attribute-
oriented induction approach.
For example, age data can be in the form of (20, 30) in a dataset. It is transformed into a higher
conceptual level into a categorical value (young, old).
Data Reduction
Data reduction techniques ensure the integrity of data while reducing the data. Data reduction is a process
that reduces the volume of original data and represents it in a much smaller volume. Data reduction
techniques are used to obtain a reduced representation of the dataset that is much smaller in volume by
maintaining the integrity of the original data. By reducing the data, the efficiency of the data mining
process is improved, which produces the same analytical results.
Techniques of Data Reduction
1. Dimensionality reduction is the process in which we reduce the number of unwanted variables or
attributes. It is a very important stage of data pre-processing and is considered a significant task in data
mining applications. For example, suppose you have a dataset with a lot of dimensions (features or
columns in your database).

In this example, we can see that if we know the mobile number, then we can determine the mobile
network or SIM provider, so the mobile-network dimension can be dropped. When we reduce dimensions,
we combine or remove attributes in such a way that the significant characteristics of the original dataset,
which is going to be used for data mining, are not lost.
2. Numerosity Reduction:
Numerosity reduction is a data reduction technique that replaces the original data with a smaller form of
data representation. There are two families of numerosity reduction techniques: parametric and
non-parametric methods.
Parametric Methods –
For parametric methods, data is represented using some model. The model is used to estimate the data, so
that only parameters of data are required to be stored, instead of actual data. Regression and Log-Linear
methods are used for creating such models.
Non-Parametric Methods –
These methods store reduced representations of the data and include histograms, clustering, sampling
and data cube aggregation.
Histograms: Histogram is the data representation in terms of frequency.
Clustering: Clustering divides the data into groups/clusters.
Sampling: Sampling can be used for data reduction because it allows a large data set to be represented by
a much smaller random data sample (or subset).
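For example, with pandas a large table can be represented by a small random sample (a minimal sketch; the DataFrame is synthetic):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
big = pd.DataFrame({"value": rng.normal(size=100_000)})

# Keep a 1% simple random sample (without replacement) as the reduced representation
small = big.sample(frac=0.01, random_state=42)
print(len(small))             # 1000 rows
print(small["value"].mean())  # close to the mean of the full data set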
3. Data Cube Aggregation
This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a multidimensional
aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus
achieving data reduction.

Data cube aggregation is a multidimensional aggregation that eases multidimensional analysis. The
data cube holds precomputed and summarized data, which gives data mining fast access to the summary.
4. Data Compression
Data compression employs modification, encoding, or converting the structure of data in a way that
consumes less space. Data compression involves building a compact representation of information by
removing redundancy and representing data in binary form.
Compression from which the original data can be restored exactly is called lossless compression. In
contrast, when it is not possible to restore the original form exactly from the compressed form, the
compression is lossy. Dimensionality and numerosity reduction methods can also be considered forms of
data compression.
This technique reduces the size of the files using different encoding mechanisms, such as Huffman
Encoding and run-length Encoding. We can divide it into two types based on their compression
techniques.
Lossless Compression: Encoding techniques (Run Length Encoding) allow a simple and minimal data
size reduction. Lossless data compression uses algorithms to restore the precise original data from the
compressed data.
Lossy Compression: In lossy data compression, the decompressed data may differ from the original data
but are still useful enough to retrieve information from. For example, the JPEG image format uses lossy
compression, but the result conveys essentially the same meaning as the original image. Methods such as
the discrete wavelet transform and PCA (principal component analysis) are examples of this kind of
compression.
5. Discretization & Concept Hierarchy Operation:
Techniques of data discretization are used to divide the attributes of the continuous nature into data with
intervals. We replace many constant values of the attributes by labels of small intervals.
Concept Hierarchies:
It reduces the data size by collecting and then replacing low-level concepts (such as the age value 43)
with high-level concepts (categorical values such as middle-aged or senior).
Data Discretization
Data discretization refers to a method of converting a huge number of data values into smaller ones so
that the evaluation and management of data become easy. In other words, data discretization is a method
of converting attributes values of continuous data into a finite set of intervals with minimum data loss.
There are two forms of data discretization: supervised discretization and unsupervised discretization.
Supervised discretization uses the class information, whereas unsupervised discretization does not; it is
characterized by the direction in which the operation proceeds, i.e., a top-down splitting strategy or a
bottom-up merging strategy.
Now, we can understand this concept with the help of an example
Suppose we have an attribute of Age with the given values
Age 1,5,9,4,7,11,14,17,13,18, 19,31,33,36,42,44,46,70,74,78,77
Table before and after discretization
Attribute Age (before): 1, 5, 4, 9, 7 | 11, 14, 17, 13, 18, 19 | 31, 33, 36, 42, 44, 46 | 70, 74, 77, 78
After discretization:   Child | Young | Mature | Old

Another example is web analytics, where we gather statistics about website visitors. For example, all
visitors who visit the site from an IP address in India are shown under the country India.
Some Famous techniques of data discretization:
Histogram analysis: A histogram is a plot used to represent the underlying frequency distribution of a
continuous data set. Histograms assist in inspecting the data distribution, for example for outliers,
skewness, and closeness to (or departure from) a normal distribution.
Binning: Binning refers to a data smoothing technique that groups a huge number of continuous values
into a smaller number of bins. This technique can also be used for data discretization and for developing
concept hierarchies.
Cluster Analysis
Cluster analysis is a form of data discretization. A clustering algorithm can be applied to discretize a
numeric attribute x by partitioning the values of x into clusters or groups.
Data discretization using decision tree analysis
Discretization using decision tree analysis applies a top-down splitting technique and is a supervised
procedure. To discretize a numeric attribute, the split that gives the least entropy is selected, and the
procedure is then applied recursively. The recursive process divides the attribute into discretized disjoint
intervals, from top to bottom, using the same splitting criterion.
Data discretization using correlation analysis
Discretizing data by correlation analysis (for example, the ChiMerge technique) finds the best
neighbouring intervals and then merges these intervals recursively into larger ones. It is a supervised,
bottom-up procedure.
Data discretization and concept hierarchy generation
The term hierarchy represents an organizational structure or mapping in which items are ranked according
to their levels of importance. For example, in computer science there are different types of hierarchical
systems: a document placed in a folder in Windows, at a specific place in the tree structure, is a good
example of a hierarchical tree model. There are two types of hierarchy mapping: top-down mapping and
bottom-up mapping.
Let's understand this concept hierarchy for the dimension location with the help of an example.
A particular city can be mapped to the country it belongs to. For example, New Delhi can be mapped to India,
and India can be mapped to Asia.
Top-down mapping
Top-down mapping generally starts at the top with general information and ends at the bottom with the
specialized information.
Bottom-up mapping
Bottom-up mapping generally starts at the bottom with specialized information and ends at the top with
the generalized information.

