Ids Unit 2 Notes Ckm-1
Descriptive Statistics
Statistics is the science, or a branch of mathematics, that involves collecting, classifying, analyzing,
interpreting, and presenting numerical facts and data.
The study of numerical and graphical ways to describe and display your data is called descriptive
statistics. It describes the data and helps us understand the features of the data by summarizing the
given sample set or population of data.
Types of descriptive statistics
There are 3 main types of descriptive statistics:
[1] The distribution concerns the frequency of each value.
[2] The central tendency concerns the averages of the values.
[3] The variability or dispersion concerns how spread out the values are.
Distribution (also called Frequency Distribution)
Datasets consist of a distribution of scores or values. Statisticians use graphs and tables to summarize
the frequency of every possible value of a variable, rendered in percentages or numbers.
Central Tendency
Measures of central tendency estimate a dataset's average or center, finding the result using three
methods: mean, median, and mode.
Variability (also called Dispersion)
The measure of variability gives the statistician an idea of how spread out the responses are. The
spread has three aspects — range, standard deviation, and variance.
Research example
You want to study the popularity of different leisure activities by gender. You distribute a survey and
ask participants how many times they did each of the following in the past year:
Go to a library
Watch a movie at a theater
Visit a national park
Your data set is the collection of responses to the survey. Now you can use descriptive statistics to
find out the overall frequency of each activity (distribution), the averages for each activity (central
tendency), and the spread of responses for each activity (variability).
Measures of central tendency
Advantages and disadvantages of mean, median and mode.
The mean is the most commonly used measure of central tendency. It represents the average of the given
collection of data. The median is the middle value among the observed set of values and is calculated by
arranging the values in ascending or descending order and then choosing the middle value.
The most frequent number occurring in the data set is known as the mode.
Mean
Advantages: Takes account of all values to calculate the average.
Disadvantages: A very small or very large value can affect the mean.
Median
Advantages: The median is not affected by very large or very small values.
Disadvantages: Since the median is an average of position, arranging the data in ascending or descending order of magnitude is time-consuming in the case of a large number of observations.
Mode
Advantages: The only average that can be used if the data set is not numerical.
Disadvantages: There can be more than one mode, and there can also be no mode, which means the mode is not always representative of the data.
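A quick Python sketch of the three measures; the data values here are illustrative:

    import statistics

    data = [2, 3, 5, 5, 7, 8, 9]  # illustrative values

    print(statistics.mean(data))    # arithmetic average: 39/7 = 5.571...
    print(statistics.median(data))  # middle value after sorting: 5
    print(statistics.mode(data))    # most frequent value: 5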
Measures of Deviation:
Mean Deviation: In statistics, deviation means the difference between the observed and expected
values of a variable. In simple words, the deviation is the distance from the centre point. The centre
point can be the median, mean, or mode. Similarly, the mean deviation, or mean absolute deviation,
is used to compute how far the values fall from the middle of the data set.
Mean deviation, also known as mean absolute deviation (MAD), is the average deviation of a data
point from the data set's mean, median, or mode.
Mean Deviation Formula
MAD = Σ |xᵢ − m| / n
where m is the chosen centre (mean, median, or mode) and n is the number of observations.
Example 1: Calculate the mean deviation about the median using the data given below (ungrouped data).
Example 3: Find the mean deviation for the following data set (grouped data).
*The coefficient of mean deviation is calculated by dividing mean deviation by the average.
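A minimal Python sketch of the mean deviation and its coefficient; the data values are illustrative, not taken from the missing example tables:

    import statistics

    data = [4, 7, 8, 9, 10, 12, 13, 17]  # illustrative values

    mean = statistics.mean(data)      # 80/8 = 10.0
    median = statistics.median(data)  # (9 + 10)/2 = 9.5

    # Mean deviation about the mean: average absolute distance from the mean
    mad_mean = sum(abs(x - mean) for x in data) / len(data)

    # Mean deviation about the median
    mad_median = sum(abs(x - median) for x in data) / len(data)

    # Coefficient of mean deviation: mean deviation divided by the average used
    coefficient = mad_mean / mean

    print(mad_mean, mad_median, coefficient)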
Standard Deviation and Variance
Standard deviation is the positive square root of the variance. In descriptive statistics, standard deviation
is the degree of dispersion, or scatter, of the data points relative to the mean. It tells how the values are
spread across the data sample and is a measure of the variation of the data points from the mean.
Variance measures the average squared deviation of the data points from the mean, whereas standard
deviation measures that spread in the original units. This is the basic difference between the two:
the standard deviation is expressed in the same units as the mean of the data, while the variance is
expressed in squared units.
Population Variance - All the members of a group are known as the population. When we want to find
how each data point in a given population varies or is spread out then we use the population variance. It is
used to give the squared distance of each data point from the population mean.
Sample Variance - If the size of the population is too large, it is difficult to take every data point into
consideration. In such a case, a select number of data points is picked from the population to form a
sample that can describe the entire group. Thus, the sample variance can be defined as the average of the
squared distances from the mean (conventionally computed with the divisor n − 1 rather than n, to correct
for bias), and it is always calculated with respect to the sample mean.
Standard Deviation of Ungrouped Data
Example: If a die is rolled, then find the variance and standard deviation of the possibilities.
Solution: When a die is rolled, the possible number of outcomes is 6. So the sample space is n = 6 and the
data set is {1, 2, 3, 4, 5, 6}.
To find the variance, first we need to calculate the mean of the data set.
Mean, x̅ = (1+2+3+4+5+6)/6 = 3.5
Putting the data values and the mean into the formula:
Variance, σ² = Σ(xᵢ − x̅)²/n = (6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25)/6 = 17.5/6 ≈ 2.92
Standard deviation, σ = √2.92 ≈ 1.71
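The same computation in Python; statistics.pvariance and statistics.pstdev treat the data as a complete population, which matches this example:

    import statistics

    outcomes = [1, 2, 3, 4, 5, 6]

    variance = statistics.pvariance(outcomes)  # population variance: 2.916...
    sd = statistics.pstdev(outcomes)           # population standard deviation: 1.707...
    print(variance, sd)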
When data are symmetrically distributed, the left-hand side and the right-hand side contain the same
number of observations. (If the dataset has 90 values, then the left-hand side has 45 observations and the
right-hand side has 45 observations.) But what if the distribution is not symmetrical? Such data are called
asymmetrical, and that is where skewness comes into the picture.
Types of skewness
1. Positively skewed (right-skewed)
In statistics, a positively skewed distribution is a distribution whose longer tail extends to the right.
Unlike symmetrically distributed data, where all measures of central tendency (mean, median, and mode)
equal each other, in positively skewed data the measures separate, typically with mean > median > mode,
because the large values in the right tail pull the mean upward.
Pearson's correlation coefficient ranges from -1 (perfect negative linear relationship) to +1 (perfect
positive linear relationship), with a value of 0 indicating no linear relationship. Dividing the covariance
by the product of the standard deviations scales the value down to this limited range of -1 to +1; that is
precisely the range of the correlation values.
Pearson's first coefficient of skewness, (mean − mode) / standard deviation, is helpful if the data have a
pronounced mode. But if the data have a weak mode or several modes, Pearson's first coefficient is not
preferred, and Pearson's second coefficient, 3 × (mean − median) / standard deviation, may be superior,
as it does not rely on the mode.
If the skewness is between -0.5 and 0.5, the data are nearly symmetrical.
If the skewness is between -1 and -0.5 (negatively skewed) or between 0.5 and 1 (positively skewed), the
data are slightly skewed.
If the skewness is less than -1 (negatively skewed) or greater than 1 (positively skewed), the data are
extremely skewed.
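A small Python sketch that computes Pearson's second coefficient on illustrative data and applies the rule-of-thumb ranges above:

    import statistics

    data = [2, 3, 3, 4, 5, 6, 7, 9, 12, 15]  # illustrative right-tailed data

    mean = statistics.mean(data)
    median = statistics.median(data)
    sd = statistics.stdev(data)

    # Pearson's second coefficient of skewness: 3(mean - median) / sd
    sk2 = 3 * (mean - median) / sd

    if -0.5 <= sk2 <= 0.5:
        label = "nearly symmetrical"
    elif -1 <= sk2 <= 1:
        label = "slightly skewed"
    else:
        label = "extremely skewed"

    print(round(sk2, 3), label)  # about 0.78: slightly skewed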
In finance, kurtosis is used as a measure of financial risk. A large kurtosis is associated with a high level
of risk for an investment because it indicates that there are high probabilities of extremely large and
extremely small returns. On the other hand, a small kurtosis signals a moderate level of risk because the
probabilities of extreme returns are relatively low.
Excess Kurtosis
The excess kurtosis is used in statistics and probability theory to compare the kurtosis coefficient with
that of the normal distribution. Excess kurtosis can be positive (leptokurtic distribution), negative
(platykurtic distribution), or near zero (mesokurtic distribution). Since normal distributions have a
kurtosis of 3, excess kurtosis is calculated by subtracting 3 from the kurtosis.
Excess kurtosis = Kurt – 3
Types of excess kurtosis
1. Leptokurtic (kurtosis > 3)
A leptokurtic distribution has very long, heavy tails, which means there is a greater chance of outliers.
Positive excess kurtosis indicates that the distribution is peaked and possesses thick tails. An extreme
positive kurtosis indicates a distribution where more of the values are located in the tails rather than
around the mean.
2. Platykurtic (kurtosis < 3)
A platykurtic distribution has thinner tails, with the data stretched around the centre, meaning most of the
data points lie in close proximity to the mean. A platykurtic distribution is flatter (less peaked) than the
normal distribution.
3. Mesokurtic (kurtosis = 3)
A mesokurtic distribution is the same as the normal distribution; its excess kurtosis is near zero.
Mesokurtic distributions are moderate in breadth, with curves of medium peak height.
Leptokurtic distribution (kurtosis more than normal distribution).
Mesokurtic distribution (kurtosis same as the normal distribution).
Platykurtic distribution (kurtosis less than normal distribution).
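A hedged SciPy sketch: scipy.stats.kurtosis returns excess kurtosis by default (Fisher's definition), so values near 0 indicate a mesokurtic shape. The sample here is illustrative:

    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=0, scale=1, size=10_000)  # illustrative normal sample

    excess = kurtosis(sample)  # Fisher definition: kurtosis - 3
    print(round(excess, 3))    # close to 0, i.e. mesokurtic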
Calculate Population Skewness and Population Kurtosis from the following grouped data
Example-1
Marks obtained by 5 students in algebra and trigonometry are given below. Calculate the Karl Pearson
coefficient of correlation without taking deviations from the mean.
r = -0.424
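The raw-score formula used when deviations from the mean are not taken is
r = (nΣxy − ΣxΣy) / √[(nΣx² − (Σx)²)(nΣy² − (Σy)²)].
A sketch with hypothetical marks, since the original table is not reproduced here:

    import math

    # Hypothetical marks of 5 students (not the original data)
    x = [75, 80, 65, 90, 70]  # algebra
    y = [60, 55, 70, 50, 65]  # trigonometry

    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)

    # Karl Pearson coefficient of correlation without deviations from the mean
    r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    print(round(r, 3))  # strongly negative for these illustrative marks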
Assumed Mean Method
Box Plot
When we display the data distribution in a standardized way using the five-number summary (minimum,
Q1 (first quartile), median, Q3 (third quartile), and maximum), it is called a box plot. It is also termed a
box-and-whisker plot. It is a type of chart that depicts a group of numerical data through their quartiles. It
is a simple way to visualize the shape of our data, and it makes comparing characteristics of data between
categories very easy.
Parts of Box Plots
A box-and-whisker plot shows a "box" with its left edge at Q1, its right edge at Q3, and the "middle" of
the box at Q2 (the median), with the minimum and maximum drawn as "whiskers".
Example -2: Let us take a sample data to understand how to create a box plot.
Here are the runs scored by a cricket team in a league of 12 matches.
The value 220 falls outside the range [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR], so it is an outlier.
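A sketch of the 1.5 × IQR outlier rule with hypothetical scores; the original match data are not reproduced here, but 220 is included so it falls outside the fences:

    # Hypothetical runs scored in 12 matches (illustrative, not the original data)
    runs = sorted([110, 120, 125, 130, 135, 140, 145, 150, 155, 160, 165, 220])
    n = len(runs)

    def median(vals):
        m = len(vals)
        mid = m // 2
        return vals[mid] if m % 2 else (vals[mid - 1] + vals[mid]) / 2

    q1 = median(runs[: n // 2])        # median of the lower half
    q3 = median(runs[(n + 1) // 2 :])  # median of the upper half
    iqr = q3 - q1

    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in runs if x < lower_fence or x > upper_fence]
    print(q1, q3, iqr, outliers)  # here the only outlier is 220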
Example-3: Calculate Box and Whisker Plots from the following grouped data
Pivot Tables:
A pivot table is a powerful data summarization tool that can automatically sort, count, and sum up data
stored in tables and display the summarized data. Pivot tables are useful to quickly create crosstabs (a
process or function that combines and/or summarizes data from one or more sources into a concise format
for analysis or reporting) to display the joint distribution of two or more variables.
Typically, with a pivot table the user sets up and changes the data summary's structure by dragging and
dropping fields graphically. This "rotation" or pivoting of the summary table gives the concept its name.
Three key reasons for organizing data into a pivot table are:
[1] To summarize the data contained in a lengthy list into a compact format.
[2] To find relationships within the data that are otherwise hard to see because of the amount of detail.
[3] To organize the data into a format that's easy to read.
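A minimal pandas sketch, with illustrative column names, showing how a pivot table turns a long list into a compact crosstab:

    import pandas as pd

    # Illustrative survey responses (hypothetical columns and values)
    df = pd.DataFrame({
        "gender":   ["M", "F", "M", "F", "M", "F"],
        "activity": ["library", "library", "movies", "movies", "park", "park"],
        "visits":   [3, 5, 10, 8, 2, 4],
    })

    # Rows = activity, columns = gender, cells = average number of visits
    table = pd.pivot_table(df, values="visits", index="activity",
                           columns="gender", aggfunc="mean")
    print(table)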
HeatMap
Heatmaps visualize data in a 2-dimensional format in the form of colored maps. The color maps use
hue, saturation, or luminance to achieve color variation and display various details. This color variation
gives readers visual cues about the magnitude of the numeric values. Heatmaps replace numbers with
colors because the human brain understands visuals better than numbers, text, or any written data.
A heatmap (or heat map) is a graphical representation of numerical data, where individual data points
contained in the data set are represented using different colors. The key benefit of heatmaps is that they
simplify complex numerical data into visualizations that can be understood at a glance. For example, on
website heatmaps 'hot' colours depict high user engagement, while 'cold' colours depict low engagement.
Uses of HeatMap
1. Business Analytics: A heat map is used as a visual business analytics tool. A heat map gives quick
visual cues about the current results, performance, and scope for improvements. Heatmaps can analyze
the existing data and find areas of intensity that might reflect where most customers reside, areas of risk
of market saturation, or cold sites and sites that need a boost.
2. Website: Heatmaps are used in websites to visualize data on visitors' behavior. This visualization helps
business owners and marketers to identify the best and worst-performing sections of a webpage.
3. Exploratory Data Analysis: EDA is a task performed by data scientists to get familiar with the data.
All the initial studies done to understand the data are known as EDA. EDA is done to summarize the
data's main features, often with visual methods, which include heatmaps.
4. Molecular Biology: Heat maps are used to study disparity and similarity patterns in DNA, RNA, etc.
5. Marketing and Sales: The heatmap's capability to detect warm and cold spots is used to improve
marketing response rates by targeted marketing. Heatmaps allow the detection of areas that respond to
campaigns, under-served markets, customer residence, and high sale trends, which helps optimize product
lineups, capitalize on sales, create targeted customer segments, and assess regional demographics.
Types of HeatMaps
Typically, there are two types of Heatmaps:
Grid Heatmap: The magnitudes of the values are shown through colors laid out in a matrix of rows and
columns, mostly by a density-based function. There are two common kinds of grid heatmap:
Clustered Heatmap: The goal of a clustered heatmap is to build associations between both the data points
and their features. This type of heatmap implements clustering as part of the process of grouping similar
features. Clustered heatmaps are widely used in the biological sciences for studying gene similarities
across individuals.
Correlogram: A correlogram replaces each of the variables on the two axes with the numeric variables in
the dataset. Each square depicts the relationship between the two intersecting variables, which helps to
build descriptive or predictive statistical models.
Spatial Heatmap: Each square in the heatmap is assigned a color according to the values of nearby cells,
with the location of a color corresponding to the magnitude of the value in that particular space. These
heatmaps are a data-driven "paint by numbers" canvas overlaid on top of an image. Cells with higher
values than other cells are given a hot color, while cells with lower values are assigned a cold color.
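A minimal correlogram-style grid heatmap with seaborn; the DataFrame here is illustrative:

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(1)
    df = pd.DataFrame(rng.normal(size=(100, 4)),
                      columns=["height", "weight", "age", "income"])  # illustrative

    # Correlogram: a grid heatmap of the pairwise correlation matrix
    sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.title("Correlation heatmap")
    plt.show()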
ANOVA, which stands for Analysis of Variance, is a statistical test used to analyze the difference
between the means of more than two groups. A one-way ANOVA uses one independent variable, while a
two-way ANOVA uses two independent variables.
A One-Way ANOVA is used to determine how one factor impacts a response variable. For example, we
might want to know if three different studying techniques lead to different mean exam scores. To see if
there is a statistically significant difference in mean exam scores, we can conduct a one-way ANOVA.
A Two-Way ANOVA is used to determine how two factors impact a response variable, and to determine
whether or not there is an interaction between the two factors on the response variable. For example, we
might want to know how gender and different levels of exercise impact average weight loss. We would
conduct a two-way ANOVA to find out.
Sums of Squares: In statistics, the sum of squares is defined as a statistical technique that is used in
regression analysis to determine the dispersion of data points. In the ANOVA test, it is used while
computing the value of F. As the sum of squares tells you about the deviation from the mean, it is also
known as variation.
Degrees of Freedom: Degrees of freedom refer to the maximum number of logically independent values
that are free to vary in a data set.
The Mean Squared Error: It tells us about the average error in a data set. To find the mean squared
error, we just divide the sum of squares by the degrees of freedom.
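A one-way ANOVA sketch with SciPy; the three groups of exam scores are hypothetical:

    from scipy.stats import f_oneway

    # Hypothetical exam scores under three studying techniques
    technique_a = [85, 86, 88, 75, 78, 94, 98, 79, 71, 80]
    technique_b = [91, 92, 93, 85, 87, 84, 82, 88, 95, 96]
    technique_c = [79, 78, 88, 94, 92, 85, 83, 85, 82, 81]

    # F = mean square between groups / mean square within groups;
    # a small p-value suggests the group means differ
    f_stat, p_value = f_oneway(technique_a, technique_b, technique_c)
    print(round(f_stat, 3), round(p_value, 4))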
Example-1
F-Table
Min-Max Normalization: This method performs a linear transformation on the original data, rescaling a
value v of attribute A from its original range [min_A, max_A] to a new range [new_min_A, new_max_A]:
v' = (v − min_A) / (max_A − min_A) × (new_max_A − new_min_A) + new_min_A
For example, we have $1200 and $9800 as the minimum and maximum values for the attribute income,
and any income value can be mapped onto the new range with this formula.
Z-score normalization: This method normalizes the value for attribute A using the mean and standard
deviation. The following formula is used for z-score normalization:
v' = (v − Ā) / σ_A
Here Ā and σ_A are the mean and standard deviation of attribute A, respectively.
For example, suppose the mean and standard deviation for attribute A are $54,000 and $16,000, and we
have to normalize the value $73,600 using z-score normalization:
v' = (73,600 − 54,000) / 16,000 = 1.225
Decimal Scaling: This method normalizes the value of attribute A by moving the decimal point in the
value. The movement of the decimal point depends on the maximum absolute value of A. The formula
for decimal scaling is:
v' = v / 10^j
where j is the smallest integer such that max(|v'|) < 1.
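A combined sketch of the three normalization methods on illustrative income values:

    import math

    values = [1200, 2500, 5000, 7300, 9800]  # illustrative incomes

    # Min-max normalization to the new range [0, 1]
    lo, hi = min(values), max(values)
    min_max = [(v - lo) / (hi - lo) for v in values]

    # Z-score normalization using the mean and (population) standard deviation
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    z_scores = [(v - mean) / sd for v in values]

    # Decimal scaling: divide by 10^j, the smallest power keeping all |v'| < 1
    j = len(str(int(max(abs(v) for v in values))))
    decimal_scaled = [v / 10 ** j for v in values]

    print(min_max, z_scores, decimal_scaled, sep="\n")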
1. Dimensionality Reduction:
In this example, we can see that if we know the mobile number, then we can determine the mobile
network (SIM provider), so the mobile-network dimension is redundant and can be reduced away. When
we reduce dimensions, we combine or drop attribute dimensions in such a way that the reduced dataset
does not lose the significant characteristics of the original dataset that is going to be used for data mining.
2. Numerosity Reduction:
Numerosity reduction is a data reduction technique that replaces the original data with a smaller form of
data representation. There are two techniques for numerosity reduction: parametric and non-parametric
methods.
Parametric Methods –
For parametric methods, the data are represented using some model. The model is used to estimate the
data, so that only the parameters of the data need to be stored instead of the actual data. Regression and
log-linear methods are used for creating such models.
Non-Parametric Methods –
These methods store reduced representations of the data; they include histograms, clustering, sampling,
and data cube aggregation.
Histograms: Histogram is the data representation in terms of frequency.
Clustering: Clustering divides the data into groups/clusters.
Sampling: Sampling can be used for data reduction because it allows a large data set to be represented by
a much smaller random data sample (or subset).
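A small sketch of sampling as numerosity reduction, using simple random sampling without replacement on an illustrative dataset:

    import random

    # Illustrative large dataset: 100,000 records
    population = list(range(100_000))

    random.seed(42)  # for reproducibility
    sample = random.sample(population, k=1_000)  # a 1% simple random sample

    print(len(sample), sample[:5])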
3. Data Cube Aggregation
This technique is used to aggregate data in a simpler form. Data Cube Aggregation is a multidimensional
aggregation that uses aggregation at various levels of a data cube to represent the original data set, thus
achieving data reduction.
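A pandas sketch in the spirit of data cube aggregation, rolling hypothetical quarterly sales up to yearly totals:

    import pandas as pd

    # Hypothetical quarterly sales data
    sales = pd.DataFrame({
        "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
        "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
        "amount":  [200, 350, 300, 400, 250, 380, 320, 450],
    })

    # Aggregate from the quarter level up to the year level: fewer rows
    # represent the same data at a coarser granularity
    yearly = sales.groupby("year", as_index=False)["amount"].sum()
    print(yearly)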