UNIT II DESCRIPTIVE DATA ANALYSIS
Dataset Construction - Sampling of data - Stem and Leaf Plots - Frequency table - Time Series data - Central Tendency - Measures of the location of data - Dispersion measures
Sampling of data
A sample can be one of two kinds:
1) An unbiased sample
• An unbiased sample represents the population fairly: every group in the population has a proportionate chance of being included.
2) A biased sample
• A biased sample does not represent the population. One or more groups of people are given preferential screening.
In Statistics, there are different sampling techniques available to get relevant results from the population. The two different types of sampling methods are:
• Probability Sampling
• Non-probability Sampling
What is Probability Sampling?
The probability sampling method utilizes some form of random selection. In this method, all eligible individuals in the population have a chance of being selected for the sample. This method is more time consuming and expensive than the non-probability sampling method, but its benefit is that it helps ensure the sample is representative of the population.
Probability Sampling methods are further classified into different types, such as simple
random sampling, systematic sampling, stratified sampling, and clustered sampling.
Let us discuss the different types of probability sampling methods along with illustrative examples here in detail.
Simple Random Sampling
In the simple random sampling technique, every item in the population has an equal and likely chance of being selected in the sample. Since the item selection depends entirely on chance, this method is known as the “Method of Chance Selection”. Because a reasonably large sample chosen this way tends to reflect the population, it is also known as “Representative Sampling”.
Example:
Suppose we want to select a simple random sample of 200 students from a school.
Here, we can assign a number to every student in the school database from 1 to 500
and use a random number generator to select a sample of 200 numbers.
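A minimal sketch of this selection in Python (the student IDs and the use of the random module are illustrative assumptions, not part of the original notes):
• Python3
import random

# Hypothetical student IDs 1..500, as in the example above
student_ids = list(range(1, 501))

# Simple random sampling: every student has an equal chance of being chosen
random.seed(1)                              # for reproducibility
sample = random.sample(student_ids, 200)    # select 200 students without replacement
print(sample[:10])                          # first few selected IDs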
Systematic Sampling
In the systematic sampling method, items are selected from the target population by choosing a random starting point and then selecting every subsequent item after a fixed sampling interval. The interval is calculated by dividing the total population size by the desired sample size.
Example:
Suppose the names of 300 students of a school are sorted in reverse alphabetical order and we want a systematic sample of 15 students. The sampling interval is 300 ÷ 15 = 20, so we randomly select a starting number between 1 and 20, say 5. From number 5 onwards, we select every 20th person from the sorted list, and we end up with a sample of 15 students.
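A small sketch of the same procedure in Python (assuming 300 numbered students and a desired sample of 15, as in the example above):
• Python3
import random

population_size = 300
sample_size = 15
interval = population_size // sample_size      # sampling interval k = 300 / 15 = 20

random.seed(1)
start = random.randint(1, interval)            # random starting point between 1 and k
sample = list(range(start, population_size + 1, interval))  # every k-th student
print(sample)                                  # 15 selected positions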
Stratified Sampling
In the stratified sampling method, the total population is divided into smaller groups (strata) to complete the sampling process. The groups are formed based on shared characteristics in the population. After separating the population into these smaller groups, statisticians randomly select a sample from each group.
For example, there are three bags (A, B and C), each with different balls. Bag A has 50
balls, bag B has 100 balls, and bag C has 200 balls. We have to choose a sample of balls
from each bag proportionally. Suppose 5 balls from bag A, 10 balls from bag B and 20
balls from bag C.
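A minimal sketch of proportional (10%) stratified sampling for the three bags, assuming pandas is used for illustration (not part of the original notes):
• Python3
import pandas as pd

# Strata sizes from the example: bag A = 50, bag B = 100, bag C = 200 balls
balls = pd.DataFrame({
    "bag": ["A"] * 50 + ["B"] * 100 + ["C"] * 200,
    "ball_id": range(350),
})

# Draw 10% from each bag so the sample keeps the population proportions (5, 10, 20)
sample = balls.groupby("bag").sample(frac=0.10, random_state=1)
print(sample["bag"].value_counts())    # C: 20, B: 10, A: 5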
Clustered Sampling
In the clustered sampling method, clusters or groups of people are formed from the population set. Each group has similar significant characteristics, and all clusters have an equal chance of being part of the sample. This method uses simple random sampling on the clusters of the population.
Example:
An educational institution has ten branches across the country with almost the same number of students. If we want to collect some data regarding facilities and other things, we can’t travel to every unit to collect the required data. Hence, we can use random sampling to select three or four branches as clusters.
All these four methods can be understood in a better manner with the help of a figure showing how samples are taken from the population using the different techniques.
What is Non-Probability Sampling?
The non-probability sampling method is a technique in which the researcher selects the
sample based on subjective judgment rather than the random selection. In this method,
not all the members of the population have a chance to participate in the study.
Non-Probability Sampling Types
Non-probability sampling methods are further classified into different types, such as convenience sampling, consecutive sampling, quota sampling, judgmental sampling, and snowball sampling. Here, let us discuss some of these types of non-probability sampling in detail.
Convenience Sampling
In the convenience sampling method, the samples are selected from the population directly because they are conveniently available to the researcher. The samples are easy to select, and the researcher does not choose a sample that represents the entire population.
Example: A researcher surveys people at a nearby shopping mall simply because those respondents are easily accessible.
Quota Sampling
In the quota sampling method, the researcher forms a sample of individuals chosen to represent the population based on specific traits or qualities. The researcher chooses sample subsets that yield a useful collection of data that generalizes to the entire population.
The below table shows a few differences between probability sampling methods and non-probability sampling methods.
Probability Sampling: These are also known as random sampling methods. These are used for research which is conclusive.
Non-probability Sampling: These are also called non-random sampling methods. These are used for research which is exploratory.
Stem and Leaf Plots
The Stem and Leaf plot is a way of organizing data into a form that makes it easy to see the frequency of different values. In other words, a Stem and Leaf plot is a table in which each data value is split into a “stem” and a “leaf.” The “stem” is the left-hand column that holds the tens digits. The “leaves” are listed in the right-hand column, showing all the ones digits for each of the tens, twenties, thirties, and forties.
Remember that Stem and Leaf plots are a pictorial representation of grouped data, but they can also be called a modal representation, because by a quick visual inspection of the Stem and Leaf plot we can determine the mode.
To construct one:
• Draw a table with two columns and name them “Stem” and “Leaf”.
• Fill in the stem values in the left-hand column.
• Fill in the leaf data in the right-hand column.
• Remember, a Stem and Leaf plot can have multiple sets of leaves.
Consider we have to make a Stem and Leaf plot for the data:
71, 43, 65, 76, 98, 82, 95, 83, 84, 96.
We’ll use the tens digits as the stem values and the ones digits as the leaves. For better understanding, let’s order the list, but this is optional:
43, 65, 71, 76, 82, 83, 84, 95, 96, 98.
Now, let’s draw a table with two columns and mark the left-hand column as “Stem” and the right-hand column as “Leaf”:
Stem | Leaf
4 | 3
6 | 5
7 | 1 6
8 | 2 3 4
9 | 5 6 8
The above is one of the simple cases for Stem and Leaf plots.
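The same grouping can be produced programmatically. Here is a small illustrative sketch (not part of the original notes) that builds the stem-and-leaf table for the data above:
• Python3
data = [43, 65, 71, 76, 82, 83, 84, 95, 96, 98]

# Group each value by its tens digit (stem); the ones digit becomes a leaf
plot = {}
for value in sorted(data):
    stem, leaf = divmod(value, 10)
    plot.setdefault(stem, []).append(leaf)

for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
# 4 | 3
# 6 | 5
# 7 | 1 6
# 8 | 2 3 4
# 9 | 5 6 8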
If we have a number like 13.4, we make “13” the Stem and “4” the Leaf. That’s right, the decimal doesn’t matter: since the decimal sits where the vertical line separating the Stem and Leaf would be, we don’t have to worry about it.
Now that you have an idea about stem and leaf plots, can you answer the following questions about a given stem-and-leaf plot? Let’s see.
1. What are the leaf numbers for stem 3?
2. What are the data values for stem 3?
3. What are the data values for stem 1?
4. Which data values in the plot are greater than 30?
Solutions:
1. First, look at the left-hand “Stem” column and locate stem 3. Then look at the corresponding numbers in the right-hand “Leaf” column. The leaf numbers are 8 and 6.
2. Combining the stem value of 3 with the corresponding leaves 8 and 6 from part (1) above, we find that the data values for stem 3 are 38 and 36.
3. For stem 1, the data values are obtained by combining the stem and its leaves to get 14 and 15.
4. Starting with stem = 3, we have data values of 38 and 36. Moving on to stem 4, we get
corresponding data values of 49, 41, 47 and 44. That brings us to the end of the stem-
and-leaf plot.
So, the data values greater than 30, according to the list above, are 38, 36, 49, 41, 47
and 44.
We can also display and compare two sets of data using Two-sided Stem and Leaf Plots, which are also often called back-to-back stem-and-leaf plots. With the help of Two-sided Stem and Leaf Plots, we can determine the Range, Median and Mode.
Other Alternatives apart from Stem and Leaf plots to organise and group data are:
1. Frequency distribution
2. Histogram
Frequency Table
Frequency distribution tables can be made using tally marks for both discrete and continuous data values. The way of preparing a discrete frequency table is different from that of a continuous frequency distribution table.
In this section, you will learn how to make a discrete frequency distribution table with the help of examples.
Example: Suppose the runs scored by the players of the Indian cricket team in a match are given as follows: 25, 65, 03, 12, 35, 46, 67, 56, 00, 17.
The number of times a value occurs in a data set is known as the frequency of that value. For example, if the marks scored by 20 students in a test are tabulated against the number of students who obtained each mark, that tabulation is known as an ungrouped frequency table.
What happens if, instead of 20 students, 200 students took the same test? Would it have been easy to represent such data in the format of an ungrouped frequency distribution table? Obviously not. To represent a vast amount of information, the data is subdivided into groups of similar sizes known as classes or class intervals, and the size of each class is known as the class width or class size.
The frequency distribution table for grouped data is also known as the continuous
frequency distribution table. This is also known as the grouped frequency distribution
table. Here, we need to make the frequency distribution table by dividing the data
values into a suitable number of classes and with the appropriate class height. Let’s
understand this with the help of the solved example given below:
Question:
The heights of 50 students, measured to the nearest centimetres, have been found to be
as follows:
161, 150, 154, 165, 168, 161, 154, 162, 150, 151, 162, 164, 171, 165, 158, 154, 156, 172,
160, 170, 153, 159, 161, 170, 162, 165, 166, 168, 165, 164, 154, 152, 153, 156, 158, 162,
160, 161, 173, 166, 161, 159, 162, 167, 168, 159, 158, 153, 154, 159
(i) Represent the data given above by a grouped frequency distribution table, taking the
class intervals as 160 – 165, 165 – 170, etc.
(ii) What can you conclude about their heights from the table?
Solution:
(i) Let us make the grouped frequency distribution table with the classes 150 – 155, 155 – 160, 160 – 165, 165 – 170 and 170 – 175:
Class interval (height in cm) | Frequency (number of students)
150 – 155 | 12
155 – 160 | 9
160 – 165 | 14
165 – 170 | 10
170 – 175 | 5
Total | 50
(ii) From the given data and above table, we can observe that 35 students, i.e. more than
50% of the total students, are shorter than 165 cm.
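The same grouped frequency table can be reproduced with pandas. The following is a minimal sketch (not from the original notes); pd.cut with right-open intervals is assumed to match the 150 – 155, 155 – 160, … classes used above:
• Python3
import pandas as pd

heights = pd.Series([161, 150, 154, 165, 168, 161, 154, 162, 150, 151, 162, 164, 171,
                     165, 158, 154, 156, 172, 160, 170, 153, 159, 161, 170, 162, 165,
                     166, 168, 165, 164, 154, 152, 153, 156, 158, 162, 160, 161, 173,
                     166, 161, 159, 162, 167, 168, 159, 158, 153, 154, 159])

# Class intervals 150-155, 155-160, ..., 170-175 (lower limit included)
bins = [150, 155, 160, 165, 170, 175]
classes = pd.cut(heights, bins=bins, right=False)
print(classes.value_counts().sort_index())
# [150, 155): 12, [155, 160): 9, [160, 165): 14, [165, 170): 10, [170, 175): 5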
Time Series Data
The utilization of time series visualization and analytics facilitates the extraction
of insights from data, enabling the generation of forecasts and a comprehensive
understanding of the information at hand. Organizations find substantial value in
time series data as it allows them to analyze both real-time and historical
metrics.
• Trend: A trend represents the general direction in which a time series is moving
over an extended period. It indicates whether the values are increasing,
decreasing, or staying relatively constant.
1. Continuous Time Series Data: Continuous time series data consists of measurements that can take any value within a continuous range over time, for example temperature readings or stock prices.
2. Discrete Time Series Data: Discrete time series data, on the other hand, consists of measurements or observations that are limited to specific values or categories. Unlike continuous data, discrete data does not have a continuous range of possible values but instead comprises distinct and separate data points. Common examples include:
• Count Data: Tracking the number of occurrences or events within a specific time
period.
• Binary Data: Recording data with only two possible outcomes or states.
• To show patterns and distributions within discrete time series data, bar charts,
histograms, and stacked bar plots are frequently utilized. These methods provide
insights into the distribution and frequency of particular occurrences or
categories throughout time.
We will use Python libraries for visualizing the data. The dataset used here is a stock price dataset (stock_data.csv). We will perform the visualization step by step, as we would in any time-series data project.
We will import all the libraries that we will be using throughout this section in one place, so that we do not have to import them every time we need them; this saves both time and effort.
• Python3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
To load the dataset into a dataframe we will use the pandas read_csv() function.
We will use head() function to print the first five rows of the dataset. Here we will
use the ‘parse_dates’ parameter in the read_csv function to convert the ‘Date’
column to the DatetimeIndex format. By default, Dates are stored in string format
which is not the right format for time series data analysis.
• Python3
df = pd.read_csv("stock_data.csv",
parse_dates=True,
index_col="Date")
df.head()
Output: the first five rows of the dataframe, indexed by the parsed Date column.
We will drop columns from the dataset that are not important for our
visualization.
• Python3
# deleting columns that are not needed for the visualization
# (the column names below are assumed for illustration; they are not listed in the original notes)
df.drop(columns=['Open', 'Low', 'Close'], inplace=True, errors='ignore')
df.head()
Output: the first five rows of the dataframe after dropping the unwanted columns, still indexed by Date.
Since the price columns (such as ‘High’) contain continuous data, we will use a line graph to visualize them.
• Python3
plt.plot(df.index, df['High'])   # line plot of the 'High' price over time (plot call assumed; not shown in the original notes)
plt.xlabel('Date')
plt.ylabel('High')
plt.show()
Output: a line plot of the ‘High’ price against the Date index.
Measures of Central Tendency
The representative value of a data set, generally the central value or the most frequently occurring value, that gives a general idea of the whole data set is called a Measure of Central Tendency. The most important measures of central tendency are:
• Mean
• Median
• Mode
Mean
The term mean generally refers to the arithmetic mean of the data, but besides the arithmetic mean there are also the geometric mean and the harmonic mean, which are calculated using different formulas. Here we will first discuss the arithmetic mean.
The arithmetic mean (x̄) is defined as the sum of the individual observations (xi) divided by the total number of observations N. In other words, the mean is given by the sum of all observations divided by the total number of observations.
x̄ = (Σ xi) / N
OR
Mean = (Sum of all observations) / (Total number of observations)
Example: If there are 5 observations, which are 27, 11, 17, 19, and 21, then the mean (x̄) is given by
x̄ = (27 + 11 + 17 + 19 + 21) ÷ 5
⇒ x̄ = 95 ÷ 5
⇒ x̄ = 19
Mean for Grouped Data
The mean (x̄) for grouped data is defined as the sum of the products of the observations (xi) and their corresponding frequencies (fi), divided by the sum of all the frequencies (fi).
x̄ = (Σ fi·xi) / (Σ fi)
Example: If the values (xi) of the observations and their frequencies (fi) are given as follows:
xi: 4   6   15   10   9
fi: 5   10   8    7   10
then x̄ = (4×5 + 6×10 + 15×8 + 10×7 + 9×10) / (5 + 10 + 8 + 7 + 10) = 360 / 40 = 9.
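As a quick check of this grouped mean, np.average with the frequencies as weights computes Σfixi / Σfi (a small sketch, not part of the original notes):
• Python3
import numpy as np

x = np.array([4, 6, 15, 10, 9])    # observation values xi
f = np.array([5, 10, 8, 7, 10])    # frequencies fi

mean = np.average(x, weights=f)    # (4*5 + 6*10 + 15*8 + 10*7 + 9*10) / 40
print(mean)                        # 9.0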
Types of Mean
Mean can be classified into three different classes:
• Arithmetic Mean: x̄ = (Σ xi) / N, where xi are the individual observations and N is the total number of observations.
• Geometric Mean: G.M. = (x1 · x2 · x3 · … · xn)^(1/n), i.e. the nth root of the product of the n observations x1, x2, …, xn.
• Harmonic Mean: H.M. = n / (1/x1 + 1/x2 + … + 1/xn) = n / Σ(1/xi), where x1, x2, …, xn are the n observations.
There are various properties of Arithmetic Mean, some of which are as follows:
• The algebraic sum of the deviations from the arithmetic mean is zero, i.e. Σ(xi − x̄) = 0.
• If x̄ is the arithmetic mean of the observations and a is subtracted from each observation, then the new arithmetic mean is x̄′ = x̄ − a.
Although the mean is the most common way to describe the central tendency of a dataset, it does not always give an accurate picture, especially when the data contains extreme values or large gaps.
Median
Median of any distribution is that value that divides the distribution into two
equal parts such that the number of observations above it is equal to the number
of observations below it. Thus, the median is called the central value of any given
data either grouped or ungrouped.
Case 1: N is Odd – the median is the value of the ((N + 1)/2)th observation after arranging the data in ascending order.
Case 2: N is Even – the median is the arithmetic mean of the (N/2)th and ((N/2) + 1)th observations after arranging the data in ascending order.
Example 1: If the observations are 25, 36, 31, 23, 22, 26, 38, 28, 20, 32, then the Median is given by:
Arranging the data in ascending order: 20, 22, 23, 25, 26, 28, 31, 32, 36, 38
Here N = 10 (even), so Median = arithmetic mean of the values at the (10 ÷ 2)th and [(10 ÷ 2) + 1]th positions = (26 + 28) ÷ 2 = 27.
Example 2: For the nine observations 20, 22, 23, 25, 26, 28, 31, 36, 38 (N = 9, odd, already arranged in ascending order), the median is the value at the ((9 + 1) ÷ 2)th, i.e. 5th, position, which is 26.
Median for Grouped Data
Median = l + ((N/2 − cf) / f) × h
Where l = lower limit of the median class, N = total number of observations, cf = cumulative frequency of the class preceding the median class, f = frequency of the median class, and h = class width.
Example: Find the median of the following frequency distribution:
Class:     10 – 20   20 – 30   30 – 40   40 – 50   50 – 60
Frequency:     5        10        12         8         5
Solution:
Class      Frequency (f)   Cumulative Frequency (cf)
10 – 20         5                 5
20 – 30        10                15
30 – 40        12                27
40 – 50         8                35
50 – 60         5                40
Here N = 40, so N/2 = 20. The median class is 30 – 40 (the first class whose cumulative frequency exceeds 20), so l = 30, cf = 15, f = 12 and h = 10.
⇒ Median = 30 + ((20 − 15)/12) × 10
⇒ Median = 30 + (5/12) × 10
⇒ Median = 30 + 4.17
⇒ Median = 34.17
So, the median value for this data set is 34.17
Mode
The mode of a data set is the value that occurs most frequently, i.e. the observation with the highest frequency.
Example: Find the mode of the following data:
xi: 5   3   4   7
fi: 2   4   2   1
Solution:
Since 3 has occurred the maximum number of times, i.e. 4 times, in the given data, the mode is 3.
Mode for Grouped Data
Mode = l + [(f1 − f0) / (2f1 − f0 − f2)] × h
Where l = lower limit of the modal class, h = class width, f1 = frequency of the modal class, f0 = frequency of the class preceding the modal class, and f2 = frequency of the class succeeding the modal class.
Example: Find the mode of the following frequency distribution:
Class:     10 – 20   20 – 30   30 – 40   40 – 50   50 – 60
Frequency:     5         8        12        16        10
Solution:
The class interval with the highest frequency is 40 – 50, with a frequency of 16. Thus, 40 – 50 is the modal class.
Thus, l = 40, h = 10, f1 = 16, f0 = 12, f2 = 10
⇒ Mode = 40 + [(16 − 12) / (2×16 − 12 − 10)] × 10
⇒ Mode = 40 + (4/10)×10
⇒ Mode = 40 + 4
⇒ Mode = 44
Therefore, the mode for this set of data is 44.
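For ungrouped data, the three measures can be computed directly with Python’s built-in statistics module. A minimal sketch (not part of the original notes) using the ungrouped data from the mode example above, with each xi repeated fi times:
• Python3
import statistics

# Ungrouped data: xi = 5, 3, 4, 7 repeated fi = 2, 4, 2, 1 times
data = [5, 5, 3, 3, 3, 3, 4, 4, 7]

print(statistics.mean(data))     # 4.111... (arithmetic mean)
print(statistics.median(data))   # 4
print(statistics.mode(data))     # 3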
Empirical Relation between Mean, Median and Mode
The three central tendencies are related to each other by the empirical formula:
Mode = 3 Median − 2 Mean
This formula is used to calculate one of the central tendencies when the other two are given.
What is the Mean?
The mean is the average value of the dataset and can be calculated arithmetically, geometrically or harmonically; generally, the term “mean” refers to the arithmetic mean of the data.
When is the Mean a good measure of Central Tendency?
The mean is a good measure of central tendency when the data is fairly symmetric and does not contain extreme outliers.
What is the Median?
The median is the middle value of the data set when arranged in increasing or decreasing order, i.e. in either ascending or descending order there is an equal number of observations on both sides of the median.
When is the Mode a good measure of Central Tendency?
The mode is a good measure of central tendency when there are clear peaks in the frequencies of observations in the dataset.
Can a dataset have more than one Mode?
Yes, a dataset can have more than one mode, as two or more observations can share the same highest frequency.
The primary goal of central tendency is to offer a single value that effectively represents a set of collected data. This value aims to capture the core or typical aspect of the data, providing a concise summary of the overall information.
Dispersion Measures
Dispersion is the state of being dispersed or spread out. Statistical dispersion means the extent to which numerical data is likely to vary about an average value. In other words, dispersion helps to understand the distribution of the data.
Measures of Dispersion
In statistics, the measures of dispersion help to interpret the variability of data, i.e. to know how homogeneous or heterogeneous the data is. In simple terms, they show how squeezed or scattered the variable is.
There are two main types of dispersion measures in statistics:
• Absolute measures of dispersion
• Relative measures of dispersion
Absolute Measures of Dispersion
An absolute measure of dispersion is expressed in the same unit as the original data set. The absolute dispersion methods express the variation in terms of the average of the deviations of the observations, like the standard deviation or mean deviation. They include the range, standard deviation, quartile deviation, etc.
1. Range: It is simply the difference between the maximum value and the minimum value given in a data set. Example: for 1, 3, 5, 6, 7, Range = 7 − 1 = 6.
2. Variance: Subtract the mean from each value in the data set, square each of these deviations, add the squares, and finally divide by the total number of values in the data set to get the variance. Variance: σ² = Σ(X − μ)² / N
3. Standard Deviation: The square root of the variance is known as the standard deviation, i.e. S.D. = √(σ²) = σ.
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of
numbers into quarters. The quartile deviation is half of the distance between the
third and the first quartile.
5. Mean and Mean Deviation: The average of numbers is known as the mean and
the arithmetic mean of the absolute deviations of the observations from a
measure of central tendency is known as the mean deviation (also called mean
absolute deviation).
Relative Measures of Dispersion
The relative measures of dispersion are used to compare the distributions of two or more data sets. These measures compare values without units. Common relative dispersion methods include:
1. Co-efficient of Range
2. Co-efficient of Variation
Co-efficient of Dispersion
The coefficients of dispersion are calculated (along with the measure of dispersion)
when two series are compared, that differ widely in their averages. The dispersion
coefficient is also used when two series with different measurement units are
compared. It is denoted as C.D.
The most important formulas for the relative dispersion methods are:
• Coefficient of Range = (Xmax − Xmin) / (Xmax + Xmin)
• Coefficient of Quartile Deviation = (Q3 − Q1) / (Q3 + Q1)
• Coefficient of Variation = (Standard Deviation / Mean) × 100
Solved Examples
Example 1: Find the Variance and Standard Deviation of the Following Numbers: 1,
3, 5, 5, 6, 7, 9, 10.
Solution:
Step 1: The mean of the data is (1 + 3 + 5 + 5 + 6 + 7 + 9 + 10) ÷ 8 = 46 ÷ 8 = 5.75, so the deviations from the mean are −4.75, −2.75, −0.75, −0.75, 0.25, 1.25, 3.25, 4.25.
Step 2: Squaring the above values we get 22.563, 7.563, 0.563, 0.563, 0.063, 1.563, 10.563, 18.063.
Step 3: Variance = sum of the squared deviations ÷ N = 61.5 ÷ 8 = 7.69 (approx.)
Step 4: Standard Deviation = √7.69 ≈ 2.77.
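The same result can be checked with NumPy, whose np.var and np.std use the population formulas Σ(X − μ)²/N and its square root by default (a small sketch, not part of the original notes):
• Python3
import numpy as np

data = np.array([1, 3, 5, 5, 6, 7, 9, 10])
print(np.var(data))   # 7.6875   (population variance, ddof=0)
print(np.std(data))   # 2.77...  (population standard deviation)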
Example 2: Find the range and the coefficient of range for the data: 45, 55, 63, 76, 67, 84, 75, 48, 62, 65.
Solution:
Let the Xi values be: 45, 55, 63, 76, 67, 84, 75, 48, 62, 65.
Here, Xmax = 84 and Xmin = 45.
Range = Xmax − Xmin = 84 − 45 = 39
Coefficient of Range = (Xmax − Xmin) / (Xmax + Xmin) = 39 / 129 = 0.302 (approx.)
Practice Problems
1. Find the coefficient of standard deviation for the data set: 32, 35, 37, 30, 33, 36,
35 and 37
2. The mean and variance of seven observations are 8 and 16, respectively. If five of
these are 2, 4, 10, 12 and 14, find the remaining two observations.
3. In a town, 25% of the persons earned more than Rs 45,000, whereas 75% earned more than Rs 18,000. Compute the absolute and relative measures of dispersion.
Q1: Why are the measures of dispersion important?
The measures of dispersion are important as they help in understanding how much the data is spread (i.e. its variation) around a central value.
Q2: How is dispersion calculated?
Dispersion can be calculated using various measures like the range, mean deviation, standard deviation, variance, etc.
Q3: What is the variance of the values 3, 8, 6, 10, 12, 9, 11, 10, 12, 7?
The mean of these values is 8.8, so the variance is Σ(X − μ)²/N = 73.6 ÷ 10 = 7.36.
Correlation in Statistics
This section shows how to calculate and interpret correlation coefficients for ordinal
and interval level scales. Methods of correlation summarize the relationship between
two variables in a single number called the correlation coefficient. The correlation
coefficient is usually represented using the symbol r, and it ranges from -1 to +1.
A correlation coefficient quite close to 0, but either positive or negative, implies little or
no relationship between the two variables. A correlation coefficient close to plus 1
means a positive relationship between the two variables, with increases in one of the
variables being associated with increases in the other variable.
For ordinal scales, the correlation coefficient can be calculated by using Spearman’s
rho. For interval or ratio level scales, the most commonly used correlation coefficient is
Pearson’s r, ordinarily referred to as simply the correlation coefficient.
In statistics, correlation studies and measures the direction and extent of the relationship among variables; correlation measures co-variation, not causation. Therefore, we should never interpret correlation as implying a cause and effect relationship. For example, if a correlation exists between two variables X and Y, it means that when the value of one variable changes in one direction, the value of the other variable changes either in the same direction (a positive change) or in the opposite direction (a negative change). Furthermore, if the correlation exists, it is linear, i.e. we can represent the relative movement of the two variables by drawing a straight line on graph paper.
Correlation Coefficient
The correlation coefficient, r, is a summary measure that describes the extent of the
statistical relationship between two interval or ratio level variables. The correlation
coefficient is scaled so that it is always between -1 and +1. When r is close to 0 this
means that there is little relationship between the variables and the farther away from 0
r is, in either the positive or negative direction, the greater the relationship between the
two variables.
The two variables are often given the symbols X and Y. In order to illustrate how the two
variables are related, the values of X and Y are pictured by drawing the scatter diagram,
graphing combinations of the two variables. The scatter diagram is given first, and then the method of determining Pearson’s r is presented. In the following examples, relatively small sample sizes are used. Later, data from larger samples are given.
Scatter Diagram
A scatter diagram is a diagram that shows the values of two variables X and Y, along with
the way in which these two variables relate to each other. The values of variable X are
given along the horizontal axis, with the values of the variable Y given on the vertical
axis.
Later, when the regression model is used, one of the variables is defined as an
independent variable, and the other is defined as a dependent variable. In regression,
the independent variable X is considered to have some effect or influence on the
dependent variable Y. Correlation methods are symmetric with respect to the two
variables, with no indication of causation or direction of influence being part of the
statistical consideration. A scatter diagram is given in the following example. The same
example is later used to determine the correlation coefficient.
Types of Correlation
The scatter plot explains the correlation between the two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables –
• Positive Correlation – when the values of the two variables move in the same
direction so that an increase/decrease in the value of one variable is followed by
an increase/decrease in the value of the other variable.
• Negative Correlation – when the values of the two variables move in opposite directions, so that an increase/decrease in the value of one variable is followed by a decrease/increase in the value of the other variable.
• No Correlation – when a change in the value of one variable is not associated with any systematic change in the value of the other variable.
Correlation Formula
Correlation shows the relation between two variables. Correlation coefficient shows the
measure of correlation. To compare two datasets, we use the correlation formulas.
The most common formula is the Pearson correlation coefficient, used for linear dependency between the data sets. The value of the coefficient lies between −1 and +1. When the coefficient is zero, the data are considered unrelated, while a value of +1 indicates a perfect positive correlation and −1 a perfect negative correlation.
r = [n(Σxy) − (Σx)(Σy)] / √{[nΣx² − (Σx)²] [nΣy² − (Σy)²]}
Where n = the number of paired observations, Σxy = the sum of the products of the paired x and y values, Σx and Σy = the sums of the x and y values, and Σx², Σy² = the sums of the squared x and y values.
Equivalently, in terms of sample statistics,
rxy = Sxy/SxSy
Where Sx and Sy are the sample standard deviations, and Sxy is the sample covariance.
or, in terms of population parameters, rxy = σxy / (σx σy).
Years of Education and Age of Entry into the Labour Force. Table 1 gives the number of years of formal education (X) and the age of entry into the labour force (Y) for 12 males from the Regina Labour Force Survey. Both variables are measured in years, a ratio level of measurement and the highest level of measurement. All of the males are aged close to 30, so most of these males are likely to have completed their formal education.
Respondent   Years of Education (X)   Age of Entry into Labour Force (Y)
1            10                       16
2            12                       17
3            15                       18
4            8                        15
5            20                       18
6            17                       22
7            12                       19
8            15                       22
9            12                       18
10           10                       15
11           8                        18
12           10                       16
Table 1. Years of Education and Age of Entry into Labour Force for 12 Regina Males
Since most males enter the labour force soon after they leave formal schooling, a close
relationship between these two variables is expected. By looking through the table, it
can be seen that those respondents who obtained more years of schooling generally
entered the labour force at an older age. The mean years of schooling is x̄ = 12.4 years and the mean age of entry into the labour force is ȳ = 17.8 years, a difference of 5.4 years.
This difference roughly reflects the age of entry into formal schooling, that is, age five or
six. It can be seen, though, that the relationship between years of schooling and age of
entry into the labour force is not perfect. Respondent 11, for example, has only 8 years
of schooling but did not enter the labour force until the age of 18. In contrast,
respondent 5 has 20 years of schooling but entered the labour force at the age of 18.
The scatter diagram provides a quick way of examining the relationship between X and
Y.
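A short sketch (assuming NumPy and matplotlib, which are used elsewhere in these notes) that draws the scatter diagram and computes Pearson’s r for the Table 1 data:
• Python3
import numpy as np
import matplotlib.pyplot as plt

x = [10, 12, 15, 8, 20, 17, 12, 15, 12, 10, 8, 10]    # years of education (X)
y = [16, 17, 18, 15, 18, 22, 19, 22, 18, 15, 18, 16]  # age of entry into labour force (Y)

plt.scatter(x, y)
plt.xlabel("Years of education (X)")
plt.ylabel("Age of entry into labour force (Y)")
plt.show()

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(round(r, 3))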
Q1: What is a correlation of 1?
A correlation of 1 means the two variables are perfectly positively associated: increases in one variable are accompanied by proportional increases in the other.
Q2: What does a correlation of 0.45 mean?
We know that a correlation of 1 means the two variables are associated positively, whereas a correlation coefficient of 0 means there is no linear relationship between the two variables. A correlation of 0.45 therefore indicates a moderate positive relationship; note that the proportion of variance in one variable, say x, accounted for by the other, say y, is given by r², which here is about 0.45² ≈ 0.20, or 20%.
Data reduction
Data reduction is a technique used in data mining to reduce the size of a dataset while
still preserving the most important information. This can be beneficial in situations
where the dataset is too large to be processed efficiently, or where the dataset contains
a large amount of irrelevant or redundant information.
There are several different data reduction techniques that can be used in data
mining, including:
1. Data Sampling: This technique involves selecting a subset of the data to work
with, rather than using the entire dataset. This can be useful for reducing the size
of a dataset while still preserving the overall trends and patterns in the data.
Note: data reduction can involve a trade-off between accuracy and the size of the data. The more the data is reduced, the less accurate and the less generalizable the resulting model may be.
2. Dimension Reduction:
Whenever we come across data in which some attributes are only weakly important, we keep only the attributes required for our analysis. Dimension reduction reduces the data size by eliminating outdated or redundant features. For example, suppose there are several attributes in the data set, a few of which are redundant; we can then select the required attributes step by step, starting with the single best attribute (Step 1: {X1}) and adding the next best attribute at each step.
3. Data Compression:
The data compression technique reduces the size of the files using different encoding
mechanisms (Huffman Encoding & run-length Encoding). We can divide it into two
types based on their compression techniques.
• Lossless Compression –
Encoding techniques (Run Length Encoding) allow a simple and minimal data
size reduction. Lossless data compression uses algorithms to restore the precise
original data from the compressed data.
• Lossy Compression –
Methods such as the Discrete Wavelet transform technique, PCA (principal
component analysis) are examples of this compression. For e.g., the JPEG image
format is a lossy compression, but we can find the meaning equivalent to the
original image. In lossy-data compression, the decompressed data may differ
from the original data but are useful enough to retrieve information from them.
4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a smaller representation of the data. For parametric methods, only the model parameters need to be stored; non-parametric methods such as clustering, histograms, and sampling store reduced representations instead.
5. Discretization:
Data discretization reduces the data by dividing the values of a continuous attribute into intervals. There are two approaches:
• Top-down discretization –
If we first consider one or a couple of points (so-called breakpoints or split points) to divide the whole range of attribute values, and then repeat this on the resulting intervals until the end, the process is known as top-down discretization, also known as splitting.
• Bottom-up discretization –
If we first consider all the continuous values as potential split points and then discard some of them by merging neighbourhood values into intervals, the process is called bottom-up discretization, also known as merging.
Concept Hierarchies:
It reduces the data size by collecting and then replacing the low-level concepts (such as
43 for age) with high-level concepts (categorical variables such as middle age or
Senior).
• Binning –
Binning is the process of changing numerical variables into categorical
counterparts. The number of categorical counterparts depends on the number of
bins specified by the user.
• Histogram analysis –
Like binning, histogram analysis is used to partition the values of an attribute X into disjoint ranges called buckets (brackets). There are several partitioning rules, the most common being equal-width partitioning (each bucket covers an equal range of values) and equal-frequency partitioning (each bucket holds roughly the same number of values).
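A brief illustrative sketch of binning in pandas (the age values below are made up for illustration); pd.cut gives equal-width bins and pd.qcut gives equal-frequency bins, matching the partitioning rules mentioned above:
• Python3
import pandas as pd

ages = pd.Series([23, 25, 31, 35, 38, 43, 47, 52, 58, 64])

# Equal-width binning: each bin covers the same range of values
print(pd.cut(ages, bins=3, labels=["young", "middle age", "senior"]))

# Equal-frequency (equal-depth) binning: each bin holds roughly the same number of values
print(pd.qcut(ages, q=3, labels=["low", "mid", "high"]))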
Advantages:
• Reduced storage costs: Data reduction can help to reduce the storage costs associated with large datasets by reducing the size of the data.
• More efficient processing: Smaller datasets can be processed and analyzed more quickly.
Disadvantages:
• Potential loss of information: as noted above, the more the data is reduced, the less accurate and generalizable the resulting model may be.
Principal Component Analysis (PCA)
Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set by finding a new, smaller set of variables that retains most of the information in the sample and is useful for the regression and classification of data.
• The first principal component captures the most variation in the data; the second principal component captures the maximum variance that is orthogonal to the first principal component, and so on.
• Data Scaling: Principal Component Analysis is sensitive to the scale of the data. If the data is not properly scaled, PCA may not work well, so it is important to scale the data before applying Principal Component Analysis.
Principal components are linear combinations of the original features that PCA finds
and uses to capture the most variance in the data. In order of the amount of variance
they explain, these orthogonal components are arranged.
New axes are represented in the feature space by each principal component. An
indicator of a component’s significance in capturing data variability is its capacity to
explain a larger variance.
Principal components represent the directions in which the data varies the most. The
first few components typically capture the majority of the data’s variance, allowing for a
more concise representation.
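A minimal sketch of PCA with scikit-learn on synthetic data (the data, the scaler, and the choice of two components are illustrative assumptions, not part of the original notes); the scaling step follows the note above about PCA being sensitive to scale:
• Python3
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 5))                              # synthetic data: 100 samples, 5 features
X[:, 1] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)    # make two features strongly correlated

X_scaled = StandardScaler().fit_transform(X)               # scale the data before PCA

pca = PCA(n_components=2)                                  # keep the first two principal components
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)                       # share of variance captured by each component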
Independent Component Analysis
The heart of ICA lies in the principle of statistical independence: ICA identifies components within mixed signals that are statistically independent of each other. If S denotes the source signals and A the mixing matrix, the observed signals are X = A·S, or equivalently the sources are recovered as S = W·X, where W is the unmixing matrix.
Assumptions in ICA
1. The first assumption asserts that the source signals (original signals) are statistically independent of each other.
2. The second assumption is that the source signals have non-Gaussian distributions (see the limitations discussed below).
The observed random vector is X = (x1, x2, …, xm)ᵀ, representing the observed data with m components. The hidden components are represented by the random vector S = (s1, s2, …, sn)ᵀ, where n is the number of hidden sources.
The observed data X is transformed into the hidden components S using a linear static transformation represented by the matrix W, i.e. S = W·X.
The goal is to transform the observed data X in such a way that the resulting hidden components are independent. The independence is measured by some objective function F(s1, …, sn). The task is to find the optimal transformation matrix W that maximizes the independence of the hidden components.
Advantages of Independent Component Analysis:
• ICA is a powerful tool for separating mixed signals into their independent components. This is useful in a variety of applications, such as signal processing, image analysis, and data compression.
• ICA can be used for feature extraction, which means that it can identify
important features in the data that can be used for other tasks, such as
classification.
Disadvantages of Independent Component Analysis:
• ICA assumes that the underlying sources are non-Gaussian, which may not always be true. If the underlying sources are Gaussian, ICA may not be effective.
• ICA assumes that the sources are mixed linearly, which may not always be the
case. If the sources are mixed nonlinearly, ICA may not be effective.
• ICA can be computationally expensive, especially for large datasets. This can
make it difficult to apply ICA to real-world problems.
• ICA can suffer from convergence issues, which means that it may not always be
able to find a solution. This can be a problem for complex datasets with many
sources.
Consider Cocktail Party Problem or Blind Source Separation problem to understand the
problem which is solved by independent component analysis.
Problem: To extract independent sources’ signals from a mixed signal composed of the
signals from those sources.
Here, a number of speakers (Source 1, Source 2, …, Source 5) are speaking simultaneously in a room, and several microphones placed at different positions record mixtures of their voices.
Now, using these microphones’ recordings, we want to separate all the ‘n’ speakers’
voice signals in the room, given that each microphone recorded the voice signals
coming from each speaker of different intensity due to the difference in distances
between them.
The goal is to recover new features [Y1, Y2, …, Yn] from the recorded mixtures, where X1, X2, …, Xn are the original signals present in the mixed signal and Y1, Y2, …, Yn are the new features: the independent components, which are independent of each other.
• Python3
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import FastICA
• Synthetic signals are generated and then combined into a single matrix “S”.
• The observed signals are obtained by multiplying the matrix “S” by the transpose of the mixing matrix “A”.
• Python3
np.random.seed(42)
samples = 200
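The generation and mixing step itself is not shown in the notes. The following is a plausible completion (the exact waveforms and mixing matrix are assumptions) that continues the code above and produces S, A and X as described:
• Python3
# A possible completion of the missing generation/mixing step (assumed waveforms)
time = np.linspace(0, 8, samples)

s1 = np.sin(2 * time)                       # source 1: sinusoidal signal
s2 = np.sign(np.sin(3 * time))              # source 2: square-wave signal
s3 = np.random.normal(size=samples)         # source 3: noise signal

S = np.c_[s1, s2, s3]                       # combine the sources into a single matrix S
A = np.array([[1.0, 0.5, 1.5],
              [0.5, 2.0, 1.0],
              [1.5, 1.0, 2.0]])              # assumed mixing matrix A
X = S.dot(A.T)                              # observed mixed signals: X = S · A^T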
• Fast ICA algorithm is applied to the observed mixed signals ‘X’. This fits the
model to the data and transforms the data to obtain the estimated independent
sources (S_).
• Python3
ica = FastICA(n_components=3)
S_ = ica.fit_transform(X)   # estimated independent sources
• Python3
plt.figure(figsize=(8, 6))
plt.subplot(3, 1, 1)
plt.title('Original Sources')
plt.plot(S)
plt.subplot(3, 1, 2)
plt.title('Observed Signals')
plt.plot(X)
plt.subplot(3, 1, 3)
plt.plot(S_)
plt.tight_layout()
plt.show()
Output: three stacked plots showing the original sources, the observed mixed signals, and the estimated independent sources.
Difference between PCA and ICA
Both the techniques are used in signal processing and dimensionality reduction, but
they have different goals.
PCA: It reduces the dimensions to avoid the problem of overfitting. It deals with the Principal Components.
ICA: It decomposes the mixed signal into its independent sources’ signals. It deals with the Independent Components.
Hypothesis Testing
The first step is to formulate your research question into two competing hypotheses:
1. Null Hypothesis (H0): This is the default assumption that there is no effect or difference.
2. Alternative Hypothesis (Ha): This is the claim that there is an effect or a difference.
For example:
• H0: The mean height of men is equal to the mean height of women.
• Ha: The mean height of men is not equal to the mean height of women.
Gather data through experiments, surveys, or observational studies. Ensure the data collection method is designed to test the hypothesis and is representative of the population.
Select a statistical test based on the type of data and the hypothesis. The choice depends on factors such as:
• Sample size
• Whether the data is normally distributed
• The number of groups being compared
• The data type (continuous, ordinal, or categorical)
Use statistical software or formulas to compute the test statistic and corresponding p-
value. This step quantifies how much the sample data deviates from the null
hypothesis.
The p-value is an important concept in hypothesis testing. It represents the probability of observing results at least as extreme as the sample data, assuming the null hypothesis is true.
Compare the p-value to the predetermined significance level (α), which is typically set at 0.05. The decision rule is as follows:
• If p-value ≤ α: Reject the null hypothesis, suggesting the data provide evidence in favour of the alternative hypothesis.
• If p-value > α: Fail to reject the null hypothesis, suggesting insufficient evidence to support the alternative hypothesis.
It's important to note that failing to reject the null hypothesis doesn't prove it's true; it
simply means there's not enough evidence to conclude otherwise.
Report the results, including the test statistic, p-value, and conclusion. Discuss whether the findings support the initial hypothesis and their implications.
Parametric tests
Parametric tests assume that the data follows a specific probability distribution, typically the normal distribution. These tests are generally more powerful when the assumptions are met. Common parametric tests include the t-test, ANOVA, and Pearson’s correlation test.
Non-parametric tests
Non-parametric tests don't assume a specific distribution of the data. They are useful when dealing with ordinal data or when the assumptions of parametric tests are violated. Examples include:
• Mann-Whitney U test
• Kruskal-Wallis test
To choose the right test, consider:
1. Data Distribution: Determine whether your data is (approximately) normally distributed.
2. Number of Groups: Identify how many groups you're comparing (e.g., one group, two groups, or more).
3. Relationship between Groups: Determine whether the groups are independent or paired.
4. Data Type: Identify whether the data is continuous, ordinal, or categorical.
Based on these categories, you can select the appropriate statistical test. For instance,
if your data is normally distributed and you have two independent groups with
continuous data, you would use an Independent t-test. If your data is not normally
distributed with two independent groups and ordinal data, a Mann-Whitney U test is
recommended.
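A short sketch of that decision in code, assuming scipy.stats and synthetic group data for illustration (not part of the original notes):
• Python3
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
group_a = rng.normal(loc=170, scale=8, size=40)   # e.g. heights of men (synthetic)
group_b = rng.normal(loc=165, scale=8, size=40)   # e.g. heights of women (synthetic)

# Normally distributed, two independent groups, continuous data -> independent t-test
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)

# If the data were not normally distributed (or ordinal), use the Mann-Whitney U test instead
u_stat, p_value = stats.mannwhitneyu(group_a, group_b)
print(u_stat, p_value)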
To help choose the appropriate test, consider using a hypothesis test flow chart as a
general guide:
(Flow chart: choosing the right hypothesis test for normally distributed data.)
(Flow chart: choosing the right hypothesis test for non-normally distributed data.)
Permutation tests
These tests involve randomly shuffling the observed data many times to create a
distribution of possible outcomes under the null hypothesis. They are particularly useful
when dealing with small sample sizes or when the assumptions of parametric tests are
not met.
Bootstrapping
Bootstrapping involves repeatedly resampling the observed data with replacement to approximate the sampling distribution of a statistic, which can then be used to build confidence intervals or test hypotheses.
Monte Carlo methods
Monte Carlo methods use repeated random sampling to obtain numerical results. In
hypothesis testing, they can be used to estimate p-values for complex statistical
models or when analytical solutions are difficult to obtain.
When conducting hypothesis tests, it's best to understand and control for potential
errors:
• Type I Error: Rejecting the null hypothesis when it's actually true (false positive).
• Type II Error: Failing to reject the null hypothesis when it's actually false (false
negative).
The significance level (α) directly controls the probability of a Type I error. Decreasing α
reduces the chance of Type I errors but increases the risk of Type II errors.
To balance these errors, you can:
1. Adjust the significance level based on the consequences of each error type.
2. Increase the sample size, which raises the statistical power and reduces the risk of Type II errors.
The file drawer effect refers to the publication bias whereby studies with significant results are more likely to be published than those with non-significant results. This can lead to an overestimation of effects in the literature. To mitigate this, researchers are encouraged to pre-register their studies and to report non-significant results as well.
• P-value: The probability of observing the test results under the null hypothesis.
• Significance Level (α): The threshold for rejecting the null hypothesis,
commonly set at 0.05.
• Test Statistic: A standardized value used to compare the observed data with the
null hypothesis.
• Confidence Interval: A range of values that likely contains the true population
parameter.
1. Descriptive Statistics
It is used to describe the basic features of data, providing a summary of the given data set, which can represent either the entire population or a sample of the population. It is derived from calculations that include:
• Mean: It is the average of the values, i.e. the sum of all values divided by the number of values.
• Mode: It refers to the value that appears most often in a data set.
• Median: It is the middle value of the ordered set that divides it in exactly half.
2. Variability
• Range: This is defined as the difference between the largest and smallest value
of a dataset.
• Percentile: It refers to the measure used in statistics that indicates the value
below which the given percentage of observation in the dataset falls.
• Quartile: It is defined as the value that divides the data points into quarters.
• Interquartile Range: It measures the middle half of your data. In general terms, it
is the middle 50% of the dataset.
3. Correlation
It is one of the major statistical techniques that measure the relationship between two
variables. The correlation coefficient indicates the strength of the linear relationship
between two variables.
4. Probability Distribution
It specifies the likelihood of all possible events. In simple terms, an event refers to the result of an experiment like tossing a coin. Events are of two types, dependent and independent.
• Dependent event: An event is said to be dependent when its occurrence depends on earlier events. For example, when a ball is drawn from a bag that contains red and blue balls and is not replaced, the colour available for the second draw depends on the first draw.
• Independent event: An event is said to be independent when its occurrence is not affected by earlier events, for example successive tosses of a fair coin.
5. Regression
It is a statistical technique used to model the relationship between a dependent (response) variable and one or more independent (predictor) variables.
• Linear regression: It is used to fit a regression model that explains the relationship between a numeric response variable and one or more predictor variables.
6. Normal Distribution
The normal distribution is used to define the probability density function of a continuous random variable in a system. The standard normal distribution has two parameters, the mean and the standard deviation, which were discussed above. When the distribution of a random variable is unknown, the normal distribution is often used as an approximation; the central limit theorem justifies why the normal distribution is used in such cases.
7. Bias
• Confirmation bias: It occurs when the person performing the statistical analysis
has some predefined assumption.