
DEPARTMENT OF COMPUTER SCIENCE AND

ENGINEERING

Faculty Preparatory Program


2024 – 25 EVEN Semester

Sub Code: AD23411


Subject Name: DATA ANALYTICS

UNIT-II
UNIT II DESCRIPTIVE DATA ANALYSIS

Dataset Construction - Sampling of data - Stem and Leaf Plots - Frequency table - Time Series data - Central Tendency - Measures of the location of data - Dispersion measures - Correlation analysis - Data reduction techniques - Principal Component Analysis - Independent Component Analysis - Hypothesis testing - Statistical Tests

Sampling of data

A sample is a subset of individuals from a larger population. Sampling means selecting


the group that you will actually collect data from in your research. For example, if you
are researching the opinions of students in your university, you could survey a sample of
100 students.

There are two types of samples

1) An unbiased sample

2) A biased sample

• An unbiased sample represents the population. It is chosen at random and is


large enough to provide reliable information.

• A biased sample does not represent the population. One or more groups of
people are given preferential treatment in the selection.

Types of Sampling Method

In Statistics, there are different sampling techniques available to get relevant results
from the population. The two different types of sampling methods are:

• Probability Sampling

• Non-probability Sampling
What is Probability Sampling?

The probability sampling method utilizes some form of random selection. In this method,
every eligible individual in the population has a known chance of being selected for the
sample. This method is more time-consuming and expensive than the non-probability
sampling method, but its benefit is that it tends to produce a sample that is
representative of the population.

Probability Sampling Types

Probability sampling methods are further classified into different types, such as simple
random sampling, systematic sampling, stratified sampling, and cluster sampling.
Let us discuss each of these probability sampling methods in detail, with illustrative
examples.

Simple Random Sampling

In the simple random sampling technique, every item in the population has an equal
chance of being selected in the sample. Since the selection of items depends entirely
on chance, this method is known as the "method of chance selection". When the sample
size is large and the items are chosen randomly, it is also known as "representative
sampling".

Example:

Suppose we want to select a simple random sample of 200 students from a school of 500
students. Here, we can assign a number from 1 to 500 to every student in the school
database and use a random number generator to select 200 of these numbers.
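A minimal sketch of this procedure in Python, assuming the students are simply identified by the numbers 1 to 500 (the variable names are illustrative):

• Python3

import random

population = list(range(1, 501))          # student IDs 1..500 (illustrative)
sample = random.sample(population, 200)   # 200 distinct IDs, each equally likely
print(len(sample), sample[:10])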

Systematic Sampling

In the systematic sampling method, the items are selected from the target population
by choosing a random starting point and then picking every k-th item after a fixed
sampling interval. The interval is calculated by dividing the total population size by the
desired sample size.

Example:

Suppose the names of 300 students of a school are sorted in reverse alphabetical order.
To select a sample of 15 students with systematic sampling, the sampling interval is
300 ÷ 15 = 20. We randomly choose a starting number, say 5, and from there we select
every 20th person on the sorted list (positions 5, 25, 45, ...), ending up with a sample of
15 students.
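The same procedure can be sketched in Python (the sorted_names list below is an illustrative stand-in for the sorted list of 300 names):

• Python3

import random

sorted_names = [f"Student_{i}" for i in range(1, 301)]   # stand-in for 300 sorted names
k = len(sorted_names) // 15      # sampling interval = 300 / 15 = 20
start = random.randint(1, k)     # random starting position, e.g. 5
# take every k-th name starting from position `start` (1-based)
sample = sorted_names[start - 1::k]
print(k, start, len(sample))     # interval, starting point, 15 selected names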
Stratified Sampling

In a stratified sampling method, the total population is divided into smaller groups to
complete the sampling process. The small group is formed based on a few
characteristics in the population. After separating the population into a smaller group,
the statisticians randomly select the sample.

For example, there are three bags (A, B and C), each with a different number of balls.
Bag A has 50 balls, bag B has 100 balls, and bag C has 200 balls. We have to choose a
sample of balls from each bag proportionally, say 10% of each: 5 balls from bag A,
10 balls from bag B and 20 balls from bag C.
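A short sketch of this proportional (10%) stratified sampling with pandas; the DataFrame below simply labels every ball with the bag (stratum) it belongs to:

• Python3

import pandas as pd

# one row per ball, labelled by the bag (stratum) it belongs to
balls = pd.DataFrame({"bag": ["A"] * 50 + ["B"] * 100 + ["C"] * 200})

# draw 10% from every stratum: 5 from A, 10 from B, 20 from C
sample = balls.groupby("bag", group_keys=False).sample(frac=0.10, random_state=1)
print(sample["bag"].value_counts())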

Clustered Sampling

In the cluster sampling method, clusters or groups of people are formed from the
population set. Each group has similar significant characteristics, and all clusters have
an equal chance of being part of the sample. Simple random sampling is then applied to
the clusters of the population.

Example:

An educational institution has ten branches across the country with almost the same
number of students. If we want to collect some data regarding facilities and other things,
we can't travel to every unit to collect the required data. Hence, we can use random
sampling to select three or four branches as clusters.
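A minimal sketch of the cluster idea (branch names and student counts are illustrative): whole branches are selected at random, and then every record from the chosen branches is used.

• Python3

import random
import pandas as pd

branches = [f"Branch_{i}" for i in range(1, 11)]   # 10 branches (clusters)
chosen = random.sample(branches, 3)                # randomly select 3 clusters

# illustrative data: one row per student with the branch they belong to
students = pd.DataFrame({"branch": [random.choice(branches) for _ in range(1000)]})
cluster_sample = students[students["branch"].isin(chosen)]   # all students of chosen branches
print(chosen, len(cluster_sample))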

Figure (omitted): examples of how samples are taken from the population using each of the four probability sampling techniques.

What is Non-Probability Sampling?

The non-probability sampling method is a technique in which the researcher selects the
sample based on subjective judgment rather than the random selection. In this method,
not all the members of the population have a chance to participate in the study.
Non-Probability Sampling Types

Non-probability Sampling methods are further classified into different types, such as
convenience sampling, consecutive sampling, quota sampling, judgmental sampling,
snowball sampling. Here, let us discuss all these types of non-probability sampling in
detail.

Convenience Sampling

In the convenience sampling method, samples are selected from the population simply
because they are conveniently available to the researcher. Such samples are easy to
select, but the researcher does not choose a sample that represents the entire
population.

Example:

When researching customer support services in a particular region, we ask a few of our
customers to complete a survey on the products after purchase. This is a convenient way
to collect data, but since we surveyed only customers who bought the same product, the
sample is not representative of all the customers in that area.
Consecutive Sampling

Consecutive sampling is similar to convenience sampling, with a slight variation. The
researcher picks a single person or a group of people for sampling, studies them for a
period of time to analyse the results, and then moves on to another group if needed.

Quota Sampling

In the quota sampling method, the researcher forms a sample of individuals chosen to
represent the population based on specific traits or qualities. The researcher chooses
sample subsets that yield a useful collection of data that generalizes to the entire
population.

Probability sampling vs Non-probability Sampling Methods

The following points summarise the differences between probability sampling methods
and non-probability sampling methods.

Probability Sampling Methods:

• Samples taken from a larger population are chosen based on probability theory.
• Also known as random sampling methods.
• Used for research which is conclusive.
• Involve a longer time to collect the data.
• There is an underlying hypothesis before the study starts, and the objective of the method is to validate that hypothesis.

Non-probability Sampling Methods:

• The researcher chooses samples based on subjective judgment rather than random selection.
• Also called non-random sampling methods.
• Used for research which is exploratory.
• An easy way to collect data quickly.
• The hypothesis is derived later, by conducting the research study.

Stem and Leaf Plots

Stem and Leaf Plot Definition

The Stem and Leaf plot is a way of organizing data into a form that makes it easy to see
the frequency of different values. In other words, a Stem and Leaf Plot is a table in which
each data value is split into a "stem" and a "leaf." The "stem" is the left-hand column,
which contains the tens digits. The "leaves" are listed in the right-hand column and show
the ones digit of every value belonging to each stem.

Remember that Stem and Leaf plots are a pictorial representation of grouped data, but
they can also be called a modal representation, because a quick visual inspection of the
Stem and Leaf plot lets us determine the mode.

Steps for Making Stem-and-Leaf Plots

• First, determine the smallest and largest number in the data.

• Identify the stems.

• Draw a table with two columns and name them "Stem" and "Leaf".
• Fill in the leaf data.

• Remember, a Stem and Leaf plot can have multiple sets of leaves.

Let us understand with an example:

Consider we have to make a Stem and Leaf plot for the data:
71, 43, 65, 76, 98, 82, 95, 83, 84, 96.

We'll use the tens digits as the stem values and the ones digits as the leaves. For better
understanding, let's order the list, although this is optional:

43, 65, 71, 76, 82, 83, 84, 95, 96, 98.

Now, let's draw a table with two columns and mark the left-hand column as "Stem" and
the right-hand column as "Leaf":

Stem | Leaf
  4  | 3
  6  | 5
  7  | 1 6
  8  | 2 3 4
  9  | 5 6 8

The above is one of the simplest cases of a Stem and Leaf plot.

Here,

Stem "4" Leaf "3" means 43.

Stem "7" Leaf "6" means 76.

Stem "9" Leaf "6" means 96.

What if we Have to Make a Stem and Leaf Plot for Decimals?

If we have a number like 13.4, we will make “13” the Stem and “4” the Leaf. That’s right,
the decimal doesn’t matter. Since the decimal will be in place of the vertical line
separating the Stem and Leaf, we don’t have to worry about it.
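The steps above can be sketched in a few lines of Python for two-digit data, reusing the example values (a simple illustration, not a general plotting routine):

• Python3

from collections import defaultdict

data = [71, 43, 65, 76, 98, 82, 95, 83, 84, 96]

# group the ones digit (leaf) of each value under its tens digit (stem)
plot = defaultdict(list)
for value in sorted(data):
    plot[value // 10].append(value % 10)

for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))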

Activity on Stem and Leaf Plot

Now that you have an idea about stem and leaf plots, can you answer the following
questions, considering the data given? Let’s see.
1. What are the leaf numbers for stem 3?

2. What are the data values for stem 3?

3. What are the data values for stem 1?

4. List the data values greater than 30.

Solutions:

1. First, look at the left-hand “Stems” column and locate stem 3. Then we'll look at the
corresponding numbers in the right-hand “Leaves” column. The numbers are 8 and 6.

2. Combining the stem value of 3 with the corresponding leaves 8 and 6 in the right-hand
"Leaves" column (using the information from part 1 above), we find that the data values
for stem 3 are 38 and 36.

3. The row where the stem = 1 gives us leaves = 4 and 5.

Hence, the data values are obtained by combining the stem and the leaves to get 14 and
15.

The data values are 14 and 15.

4. Starting with stem = 3, we have data values of 38 and 36. Moving on to stem 4, we get
corresponding data values of 49, 41, 47 and 44. That brings us to the end of the stem-
and-leaf plot.

So, the data values greater than 30, according to the list above, are 38, 36, 49, 41, 47
and 44.

We can also combine and display two data sets in a single plot. Such plots are called Two-
sided Stem and Leaf Plots, which are also often called back-to-back stem-and-leaf
plots. With the help of Two-sided Stem and Leaf Plots, we can determine the Range,
Median and Mode.

Other Alternatives apart from Stem and Leaf plots to organise and group data are:

1. Frequency distribution

2. Histogram

What is a Frequency Distribution Table in Statistics?

In statistics, a frequency distribution table is a comprehensive way of representing the
organisation of raw data of a quantitative variable. This table shows how the various
values of a variable are distributed and their corresponding frequencies. We can make
two types of frequency distribution tables:

(i) Discrete frequency distribution

(ii) Continuous frequency distribution (Grouped frequency distribution)

How to Make a Frequency distribution table?

Frequency distribution tables can be made using tally marks for both discrete and
continuous data values. The ways of preparing discrete frequency tables and continuous
frequency distribution tables are different from each other.

In this section, you will learn how to make a discrete frequency distribution table with
the help of examples.

Example: Suppose the runs scored by players of the Indian cricket team in a match are
given as follows: 25, 65, 03, 12, 35, 46, 67, 56, 00, 17.

The number of times a value occurs in a data set is known as the frequency of that value.
In the above example, the frequency of each score is the number of players who scored
those runs. This type of tabular data collection is known as an ungrouped frequency
table.
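A quick sketch of an ungrouped (discrete) frequency table built with pandas, using the runs data above:

• Python3

import pandas as pd

runs = [25, 65, 3, 12, 35, 46, 67, 56, 0, 17]
freq = pd.Series(runs).value_counts().sort_index()   # frequency of each distinct value
print(freq)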

What happens if, instead of a dozen values, 200 students took the same test? Would it
have been easy to represent such data in the format of an ungrouped frequency
distribution table? Obviously not. To represent a vast amount of information, the data is
subdivided into groups of similar size known as classes or class intervals, and the size of
each class is known as the class width or class size.

Frequency Distribution table for Grouped data

The frequency distribution table for grouped data is also known as the continuous
frequency distribution table, or the grouped frequency distribution table. Here, we need
to make the frequency distribution table by dividing the data values into a suitable
number of classes with an appropriate class width. Let's understand this with the help of
the solved example given below:

Question:

The heights of 50 students, measured to the nearest centimetres, have been found to be
as follows:

161, 150, 154, 165, 168, 161, 154, 162, 150, 151, 162, 164, 171, 165, 158, 154, 156, 172,
160, 170, 153, 159, 161, 170, 162, 165, 166, 168, 165, 164, 154, 152, 153, 156, 158, 162,
160, 161, 173, 166, 161, 159, 162, 167, 168, 159, 158, 153, 154, 159

(i) Represent the data given above by a grouped frequency distribution table, taking the
class intervals as 160 – 165, 165 – 170, etc.

(ii) What can you conclude about their heights from the table?

Solution:

(i) Let us make the grouped frequency distribution table with classes:

150 – 155, 155 – 160, 160 – 165, 165 – 170, 170 – 175

Class intervals and the corresponding frequencies are tabulated as:

Class Interval    Frequency

150 – 155             12
155 – 160              9
160 – 165             14
165 – 170             10
170 – 175              5
Total                 50

(ii) From the given data and above table, we can observe that 35 students, i.e. more than
50% of the total students, are shorter than 165 cm.
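The same grouped table can be produced with pandas, a sketch using the 50 heights listed in the question; pd.cut with right=False makes the classes [150, 155), [155, 160), ..., so a value such as 165 falls in the 165 – 170 class, as in the table above.

• Python3

import pandas as pd

heights = [161, 150, 154, 165, 168, 161, 154, 162, 150, 151,
           162, 164, 171, 165, 158, 154, 156, 172, 160, 170,
           153, 159, 161, 170, 162, 165, 166, 168, 165, 164,
           154, 152, 153, 156, 158, 162, 160, 161, 173, 166,
           161, 159, 162, 167, 168, 159, 158, 153, 154, 159]

bins = [150, 155, 160, 165, 170, 175]
classes = pd.cut(heights, bins=bins, right=False)           # [150,155), [155,160), ...
freq_table = pd.Series(classes).value_counts().sort_index()
print(freq_table)                                           # 12, 9, 14, 10, 5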

Practice Problems

1. The scores (out of 100) obtained by 33 students in a mathematics test are as follows:
69, 48, 84, 58, 48, 73, 83, 48, 66, 58, 84, 66, 64, 71, 64, 66, 69, 66, 83, 66, 69, 71,
81, 71, 73, 69, 66, 66, 64, 58, 64, 69, 69
Represent this data in the form of a frequency distribution.
2. The following are the marks (out of 100) of 60 students in mathematics.
16, 13, 5, 80, 86, 7, 51, 48, 24, 56, 70, 19, 61, 17, 16, 36, 34, 42, 34, 35, 72, 55, 75,
31, 52, 28, 72, 97, 74, 45, 62, 68, 86, 35, 85, 36, 81, 75, 55, 26, 95, 31, 7, 78, 92,
62, 52, 56, 15, 63, 25, 36, 54, 44, 47, 27, 72, 17, 4, 30.
Construct a grouped frequency distribution table with width 10 of each class
starting from 0 – 9.

3. The value of π up to 50 decimal places is given below:


3.14159265358979323846264338327950288419716939937510
(i) Make a frequency distribution of the digits from 0 to 9 after the decimal point.
(ii) What are the most and the least frequently occurring digits?

What are Time Series Visualization and Analytics?
Time series visualization and analytics empower users to graphically represent
time-based data, enabling the identification of trends and the tracking of
changes over different periods. This data can be presented through various
formats, such as line graphs, gauges, tables, and more.

The utilization of time series visualization and analytics facilitates the extraction
of insights from data, enabling the generation of forecasts and a comprehensive
understanding of the information at hand. Organizations find substantial value in
time series data as it allows them to analyze both real-time and historical
metrics.

What is Time Series Data?

Time series data is a sequential arrangement of data points organized in


consecutive time order. Time-series analysis consists of methods for analyzing
time-series data to extract meaningful insights and other valuable
characteristics of the data.

Importance of time series analysis

Time-series data analysis is becoming very important in so many industries, like


financial industries, pharmaceuticals, social media companies, web service
providers, research, and many more. To understand the time-series data,
visualization of the data is essential. In fact, any type of data analysis is not
complete without visualizations, because one good visualization can provide
meaningful and interesting insights into the data.
Basic Time Series Concepts

• Trend: A trend represents the general direction in which a time series is moving
over an extended period. It indicates whether the values are increasing,
decreasing, or staying relatively constant.

• Seasonality: Seasonality refers to recurring patterns or cycles that occur at


regular intervals within a time series, often corresponding to specific time units
like days, weeks, months, or seasons.

• Moving average: The moving average method is a common technique used in


time series analysis to smooth out short-term fluctuations and highlight longer-
term trends or patterns in the data. It involves calculating the average of a set of
consecutive data points, referred to as a “window” or “rolling window,” as it
moves through the time series

• Noise: Noise, or random fluctuations, represents the irregular and unpredictable


components in a time series that do not follow a discernible pattern. It
introduces variability that is not attributable to the underlying trend or
seasonality.

• Differencing: Differencing replaces each value with the difference between it and the
value a specified interval (lag) earlier. By default the lag is one, but other values can be
specified. It is the most popular method for removing trends from the data (see the
sketch after this list).
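As a small sketch of the moving average and differencing ideas with pandas (the series below is illustrative; rolling() and diff() are the standard pandas methods for these operations):

• Python3

import pandas as pd

s = pd.Series([10, 12, 15, 14, 18, 21, 20, 25],
              index=pd.date_range("2024-01-01", periods=8, freq="D"))

rolling_mean = s.rolling(window=3).mean()   # 3-point moving average smooths short-term noise
first_diff = s.diff()                       # difference with the previous value (lag = 1)
print(rolling_mean)
print(first_diff)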

Types of Time Series Data

Time series data can be broadly classified into two sections:

1. Continuous Time Series Data: Continuous time series data involves measurements or
observations that are recorded at regular intervals, forming a seamless and
uninterrupted sequence. This type of data is characterized by a continuous range of
possible values and is commonly encountered in various domains, including:

• Temperature Data: Continuous recordings of temperature at consistent intervals


(e.g., hourly or daily measurements).

• Stock Market Data: Continuous tracking of stock prices or values throughout


trading hours.

• Sensor Data: Continuous measurements from sensors capturing variables like


pressure, humidity, or air quality.

2. Discrete Time Series Data: Discrete time series data, on the other hand,
consists of measurements or observations that are limited to specific values or
categories. Unlike continuous data, discrete data does not have a continuous
range of possible values but instead comprises distinct and separate data
points. Common examples include:

• Count Data: Tracking the number of occurrences or events within a specific time
period.

• Categorical Data: Classifying data into distinct categories or classes (e.g.,


customer segments, product types).

• Binary Data: Recording data with only two possible outcomes or states.

Visualization Approach for Different Data Types:

• Plotting data in a continuous time series can be effectively represented


graphically using line, area, or smooth plots, which offer insights into the
dynamic behavior of the trends being studied.

• To show patterns and distributions within discrete time series data, bar charts,
histograms, and stacked bar plots are frequently utilized. These methods provide
insights into the distribution and frequency of particular occurrences or
categories throughout time.

Time Series Data Visualization using Python

We will use Python libraries for visualizing the data. The dataset used below is a stock
price dataset (stock_data.csv). We will perform the visualization step by step, as we do
in any time-series data project.

Importing the Libraries

We will import all the libraries that we will be using throughout this article in one place,
so that we do not have to import them every time; this will save both time and effort.

• Numpy – A Python library that is used for numerical mathematical computation


and handling multidimensional ndarray, it also has a very large collection of
mathematical functions to operate on this array.

• Pandas – A Python library built on top of NumPy for efficient dataframe manipulation;
it is also used for data cleaning, data merging, data reshaping, and data aggregation.

• Matplotlib – It is used for plotting 2D and 3D visualization plots, it also supports


a variety of output formats including graphs for data.

• Python3
import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

from statsmodels.graphics.tsaplots import plot_acf

from statsmodels.tsa.stattools import adfuller

Loading The Dataset

To load the dataset into a dataframe we will use the pandas read_csv() function.
We will use head() function to print the first five rows of the dataset. Here we will
use the ‘parse_dates’ parameter in the read_csv function to convert the ‘Date’
column to the DatetimeIndex format. By default, Dates are stored in string format
which is not the right format for time series data analysis.

• Python3

# reading the dataset using read_csv

df = pd.read_csv("stock_data.csv",

parse_dates=True,

index_col="Date")

# displaying the first five rows of dataset

df.head()

Output:

Unnamed: 0 Open High Low Close Volume Name

Date

2006-01-03 NaN 39.69 41.22 38.79 40.91 24232729 AABA

2006-01-04 NaN 41.22 41.90 40.77 40.97 20553479 AABA

2006-01-05 NaN 40.93 41.73 40.85 41.53 12829610 AABA

2006-01-06 NaN 42.88 43.57 42.80 43.21 29422828 AABA


2006-01-09 NaN 43.10 43.66 42.82 43.42 16268338 AABA

Dropping Unwanted Columns

We will drop columns from the dataset that are not important for our
visualization.

• Python3

# deleting column

df.drop(columns='Unnamed: 0', inplace =True)

df.head()

Output:

Open High Low Close Volume Name

Date

2006-01-03 39.69 41.22 38.79 40.91 24232729 AABA

2006-01-04 41.22 41.90 40.77 40.97 20553479 AABA

2006-01-05 40.93 41.73 40.85 41.53 12829610 AABA

2006-01-06 42.88 43.57 42.80 43.21 29422828 AABA

2006-01-09 43.10 43.66 42.82 43.42 16268338 AABA

Plotting a Line plot for Time Series data:

Since the price columns are continuous data, we will use a line graph to visualize them;
below we plot the 'High' column over time.

• Python3

# Assuming df is your DataFrame with a DatetimeIndex named 'Date'
sns.set(style="whitegrid")   # whitegrid style gives a clean background

plt.figure(figsize=(12, 6))  # setting the figure size
sns.lineplot(data=df, x=df.index, y='High', label='High Price', color='blue')

# Adding labels and title
plt.xlabel('Date')
plt.ylabel('High')
plt.title('Share Highest Price Over Time')
plt.show()

Output:

Measures of Central Tendency Meaning

The representative value of a data set, generally the central value or the most frequently
occurring value, that gives a general idea of the whole data set is called a Measure of
Central Tendency.

Measures of Central Tendency

Some of the most commonly used measures of central tendency are:

• Mean

• Median

• Mode
Mean

Mean in general terms is used for the arithmetic mean of the data, but other than
the arithmetic mean there are geometric mean and harmonic mean as well that
are calculated using different formulas. Here in this article, we will discuss the
arithmetic mean.

Mean for Ungrouped Data

Arithmetic mean (x̄) is defined as the sum of the individual observations (xi) divided by
the total number of observations N. In other words, the mean is given by the sum of all
observations divided by the total number of observations.

x̄ = Σxi / N

OR

Mean = Sum of all Observations ÷ Total number of Observations

Example: If there are 5 observations, which are 27, 11, 17, 19, and 21, then the
mean (x̄) is given by

x̄ = (27 + 11 + 17 + 19 + 21) ÷ 5
⇒ x̄ = 95 ÷ 5
⇒ x̄ = 19
Mean for Grouped Data

Mean (x̄) is defined for grouped data as the sum of the products of the observations (xi)
and their corresponding frequencies (fi) divided by the sum of all the frequencies (fi).

x̄ = Σ fi xi / Σ fi

Example: If the values (xi) of the observations and their frequencies (fi) are given
as follows:

xi 4 6 15 10 9

fi 5 10 8 7 10

then the arithmetic mean (x̄) of the above distribution is given by

x̄ = (4×5 + 6×10 + 15×8 + 10×7 + 9×10) ÷ (5 + 10 + 8 + 7 + 10)
⇒ x̄ = (20 + 60 + 120 + 70 + 90) ÷ 40
⇒ x̄ = 360 ÷ 40
⇒ x̄ = 9
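A tiny sketch checking this grouped-data calculation with NumPy:

• Python3

import numpy as np

x = np.array([4, 6, 15, 10, 9])    # observations
f = np.array([5, 10, 8, 7, 10])    # frequencies

mean = (f * x).sum() / f.sum()     # grouped (frequency-weighted) mean
print(mean)                        # 9.0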
Related Resources,

• Mean Using Direct Method

• Shortcut Method for Arithmetic Mean

• Mean Using Step Deviation Method

Types of Mean

Mean can be classified into three different class groups which are

• Arithmetic Mean

• Geometric Mean

• Harmonic Mean

Arithmetic Mean: The formula for Arithmetic Mean is given by

x̄ = Σxi / N

Where,

• x1, x2, x3, . . ., xn are the observations, and

• N is the number of observations.

Geometric Mean: The formula for Geometric Mean is given by

G.M. = (x1 · x2 · x3 · … · xn)^(1/n), i.e. the nth root of the product of the observations

Where,

• x1, x2, x3, . . ., xn are the observations, and

• n is the number of observations.

Harmonic Mean: The formula for Harmonic Mean is given by

H.M. = n / (1/x1 + 1/x2 + … + 1/xn)

OR

H.M. = n / Σ(1/xi)

Where,

• x1, x2, . . ., xn are the observations, and


• n is the number of observations.

Properties of Mean (Arithmetic)

There are various properties of Arithmetic Mean, some of which are as follows:

• The algebraic sum of deviations from the arithmetic mean is zero, i.e. Σ(xi – x̄) = 0.

• If x̄ is the arithmetic mean of the observations and a is added to each of the
observations, then the new arithmetic mean is x̄' = x̄ + a.

• If x̄ is the arithmetic mean of the observations and a is subtracted from each of the
observations, then the new arithmetic mean is x̄' = x̄ − a.

• If x̄ is the arithmetic mean of the observations and each of the observations is
multiplied by a, then the new arithmetic mean is x̄' = x̄ × a.

• If x̄ is the arithmetic mean of the observations and each of the observations is
divided by a, then the new arithmetic mean is x̄' = x̄ ÷ a.

Disadvantage of Mean as Measure of Central Tendency

Although the mean is the most common way to summarise the central tendency of a
dataset, it does not always give a correct picture, especially when there are extreme
values or large gaps in the data.

Median

Median of any distribution is that value that divides the distribution into two
equal parts such that the number of observations above it is equal to the number
of observations below it. Thus, the median is called the central value of any given
data either grouped or ungrouped.

Median of Ungrouped Data

To calculate the Median, the observations must be arranged in ascending or


descending order. If the total number of observations is N then there are two
cases

Case 1: N is Odd

Median = Value of observation at [(n + 1) ÷ 2]th Position


Case 2: N is Even

Median = Arithmetic mean of Values of observations at (n ÷ 2)th and [(n ÷ 2) +


1]th Position


Example 1: If the observations are 25, 36, 31, 23, 22, 26, 38, 28, 20, 32 then
the Median is given by

Arranging the data in ascending order: 20, 22, 23, 25, 26, 28, 31, 32, 36, 38

N = 10 which is even then

Median = Arithmetic mean of values at (10 ÷ 2)th and [(10 ÷ 2) + 1]th position

⇒ Median = (Value at 5th position + Value at 6th position) ÷ 2


⇒ Median = (26 + 28) ÷ 2
⇒ Median = 27
Example 2: If the observations are 25, 36, 31, 23, 22, 26, 38, 28, 20 then the
Median is given by

Arranging the data in ascending order: 20, 22, 23, 25, 26, 28, 31, 36, 38

N = 9 which is odd then

Median = Value at [(9 + 1) ÷ 2]th position

⇒ Median = Value at 5th position


⇒ Median = 26
Median of Grouped Data

Median of Grouped Data is given as follows:

Median = l + ((n/2 – cf) / f) × h

Where,

• l is the lower limit of median class,

• n is the total number of observations,

• cf is the cumulative frequency of the preceding class,

• f is the frequency of each class, and

• h is the class size.

Example: Calculate the median for the following data.

Class        10 – 20    20 – 30    30 – 40    40 – 50    50 – 60

Frequency        5          10         12          8          5

Solution:

Create the following table for the given data.


Class Frequency Cumulative Frequency

10 – 20 5 5

20 – 30 10 15

30 – 40 12 27

40 – 50 8 35

50 – 60 5 40

As n = 40 and n/2 = 20,

Thus, 30 – 40 is the median class.

l = 30, cf = 15, f = 12, and h = 10

Putting the values in the formula Median = l + ((n/2 – cf) / f) × h:

Median = 30 + (20 – 15)/12) × 10

⇒ Median = 30 + (5/12) × 10
⇒ Median = 30 + 4.17
⇒ Median = 34.17
So, the median value for this data set is 34.17
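The grouped-median formula can also be written as a short Python function (a sketch; the lower class boundaries, frequencies and class width below are those of the example above):

• Python3

import numpy as np

def grouped_median(lower_bounds, frequencies, class_width):
    """Median = l + ((n/2 - cf) / f) * h, applied to the median class."""
    freq = np.array(frequencies, dtype=float)
    cum = freq.cumsum()
    n = freq.sum()
    i = np.searchsorted(cum, n / 2)        # index of the median class
    cf = cum[i - 1] if i > 0 else 0.0      # cumulative frequency of the preceding class
    return lower_bounds[i] + ((n / 2 - cf) / freq[i]) * class_width

print(grouped_median([10, 20, 30, 40, 50], [5, 10, 12, 8, 5], 10))   # ≈ 34.17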

Mode

Mode is the value of the observation that has the maximum frequency corresponding to
it. In other words, it is the observation that occurs the maximum number of times in a
dataset.

Mode of Ungrouped Data

The mode of ungrouped data can be calculated simply by observing which observation
has the highest frequency: the mode of the data set is the term that occurs most often.
Let's see an example of the calculation of the mode of ungrouped data.

Example: Find the mode of observations 5, 3, 4, 3, 7, 3, 5, 4, 3.

Solution:

Create a table with each observation with its frequency as follows:

xi 5 3 4 7

fi 2 4 2 1

Since 3 has occurred a maximum number of times i.e. 4 times in the given data;

Hence, Mode of the given ungrouped data is 3.

Mode of Grouped Data

Formula to find the mode of the grouped data is:

Mode = l + [(f1 − f0) / (2f1 − f0 − f2)] × h

Where,

• l is the lower class limit of modal class,

• h is the class size,

• f1 is the frequency of modal class,

• f0 is the frequency of the class which precedes the modal class, and

• f2 is the frequency of class which succeeds the modal class.

Example: Find the mode of the dataset which is given as follows.


Class Interval    10 – 20    20 – 30    30 – 40    40 – 50    50 – 60

Frequency             5          8         12         16         10

Solution:

As the class interval with the highest frequency is 40-50, which has a frequency
of 16. Thus, 40-50 is the modal class.

Thus, l = 40 , h = 10 , f1 = 16 , f0 = 12 , f2 = 10

Plugging the values into the formula Mode = l + [(f1 − f0) / (2f1 − f0 − f2)] × h, we get

Mode = 40 + (16 – 12)/(2 × 16 – 12 – 10) × 10

⇒ Mode = 40 + (4/10)×10
⇒ Mode = 40 + 4
⇒ Mode = 44
Therefore, the mode for this set of data is 44.

Learn more about Mean, Median, and Mode of Grouped Data

Empirical Relation Between Measures of Central Tendency

The three central tendencies are related to each other by the empirical formula
which is given as follows:

2 × Mean + Mode = 3 × Median

This formula is used to calculate one of the central tendencies when two other
central tendencies are given.

FAQs on Measures of Central Tendency

What is a Measure of Central Tendency in Statistics?

The Measure of Central Tendency of a dataset represents a central or typical value for
the dataset, which can be used for further analysis of the data.

What is the Mean?

The mean is the average value of the dataset and can be calculated arithmetically,
geometrically, or harmonically. Generally, the term "mean" refers to the arithmetic mean
of the data.
When is the Mean a good measure of Central Tendency?

The mean is a good measure of central tendency when the data is normally distributed
and there are no extreme values or outliers in the dataset.

What is the Median?

The median is the middle value of the data set when arranged in increasing or decreasing
order, i.e., in either ascending or descending order there are an equal number of
observations on both sides of the median.

When is the Median a good measure of Central Tendency?

The median is a good measure of central tendency when the dataset is skewed or
there are extreme values or outliers in the dataset.

What is the Mode?

The mode is the observation with the highest frequency in the given dataset.

When is the Mode a good measure of Central Tendency?

The mode is a good measure of central tendency when there are clear peaks in the
frequencies of the observations in the dataset.

Can a dataset have more than one mode?

Yes, a dataset can have more than one mode, as two or more observations can share the
same highest frequency.

What is the purpose of central tendency?

The primary goal of central tendency is to offer a single value that effectively represents
a set of collected data. This value aims to capture the core or typical aspect of the data,
providing a concise summary of the overall information.

What is Dispersion in Statistics?

Dispersion is the state of getting dispersed or spread. Statistical dispersion means the
extent to which numerical data is likely to vary about an average value. In other words,
dispersion helps to understand the distribution of the data.

Measures of Dispersion

In statistics, the measures of dispersion help to interpret the variability of data i.e. to
know how much homogenous or heterogeneous the data is. In simple terms, it shows
how squeezed or scattered the variable is.

Types of Measures of Dispersion

There are two main types of dispersion methods in statistics which are:

• Absolute Measure of Dispersion

• Relative Measure of Dispersion

Absolute Measure of Dispersion

An absolute measure of dispersion contains the same unit as the original data set. The
absolute dispersion method expresses the variations in terms of the average of
deviations of observations like standard or means deviations. It includes
range, standard deviation, quartile deviation, etc.

The types of absolute measures of dispersion are:

1. Range: It is simply the difference between the maximum value and the minimum
value in a data set. Example: 1, 3, 5, 6, 7 ⇒ Range = 7 − 1 = 6

2. Variance: Subtract the mean from each value in the set, square each difference,
add the squares, and finally divide by the total number of values in the data set
to get the variance. Variance (σ²) = Σ(X − μ)² / N

3. Standard Deviation: The square root of the variance is known as the standard
deviation, i.e. S.D. (σ) = √(σ²).
4. Quartiles and Quartile Deviation: The quartiles are values that divide a list of
numbers into quarters. The quartile deviation is half of the distance between the
third and the first quartile.

5. Mean and Mean Deviation: The average of numbers is known as the mean and
the arithmetic mean of the absolute deviations of the observations from a
measure of central tendency is known as the mean deviation (also called mean
absolute deviation).

Also, read:

• Variance

• Quartiles

• Mean

Relative Measure of Dispersion

The relative measures of dispersion are used to compare the distribution of two or more
data sets. This measure compares values without units. Common relative dispersion
methods include:

1. Co-efficient of Range

2. Co-efficient of Variation

3. Co-efficient of Standard Deviation

4. Co-efficient of Quartile Deviation

5. Co-efficient of Mean Deviation

Co-efficient of Dispersion

The coefficients of dispersion are calculated (along with the measure of dispersion)
when two series are compared, that differ widely in their averages. The dispersion
coefficient is also used when two series with different measurement units are
compared. It is denoted as C.D.

The common coefficients of dispersion are:

Measure of Dispersion          Coefficient of Dispersion (C.D.)

Range                          C.D. = (Xmax – Xmin) ⁄ (Xmax + Xmin)

Quartile Deviation             C.D. = (Q3 – Q1) ⁄ (Q3 + Q1)

Standard Deviation (S.D.)      C.D. = S.D. ⁄ Mean

Mean Deviation                 C.D. = Mean Deviation ⁄ Average

Measures of Dispersion Formulas

The most important formulas for the different dispersion methods are:

• Arithmetic Mean Formula
• Quartile Formula
• Standard Deviation Formula
• Variance Formula
• Interquartile Range Formula
• All Statistics Formulas

Solved Examples

Example 1: Find the Variance and Standard Deviation of the Following Numbers: 1,
3, 5, 5, 6, 7, 9, 10.

Solution:

The mean = (1+ 3+ 5+ 5+ 6+ 7+ 9+ 10)/8 = 46/ 8 = 5.75

Step 1: Subtract the mean value from individual value

(1 – 5.75), (3 – 5.75), (5 – 5.75), (5 – 5.75), (6 – 5.75), (7 – 5.75), (9 – 5.75), (10 – 5.75)

= -4.75, -2.75, -0.75, -0.75, 0.25, 1.25, 3.25, 4.25

Step 2: Squaring the above values we get, 22.563, 7.563, 0.563, 0.563, 0.063, 1.563,
10.563, 18.063

Step 3: 22.563 + 7.563 + 0.563 + 0.563 + 0.063 + 1.563 + 10.563 + 18.063


= 61.504

Step 4: n = 8, therefore variance (σ2) = 61.504/ 8 = 7.69

Now, Standard deviation (σ) = 2.77
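The same result can be checked quickly with NumPy (ddof=0 gives the population variance used above):

• Python3

import numpy as np

values = np.array([1, 3, 5, 5, 6, 7, 9, 10])
variance = values.var(ddof=0)        # population variance, Σ(x − mean)² / n
std_dev = np.sqrt(variance)
print(round(variance, 2), round(std_dev, 2))   # 7.69 2.77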


Example 2: Calculate the range and coefficient of range for the following data
values.

45, 55, 63, 76, 67, 84, 75, 48, 62, 65

Solution:

Let Xi values be: 45, 55, 63, 76, 67, 84, 75, 48, 62, 65

Here,

Maximum value (Xmax) = 84

Minimum or least value (Xmin) = 45

Range = Maximum value − Minimum value

= 84 – 45

= 39

Coefficient of range = (Xmax – Xmin)/(Xmax + Xmin)

= (84 – 45)/(84 + 45)

= 39/129

= 0.302 (approx)

Practice Problems

1. Find the coefficient of standard deviation for the data set: 32, 35, 37, 30, 33, 36,
35 and 37

2. The mean and variance of seven observations are 8 and 16, respectively. If five of
these are 2, 4, 10, 12 and 14, find the remaining two observations.

3. In a town, 25% of the persons earned more than Rs 45,000 whereas 75% earned
more than 18,000. Compute the absolute and relative values of dispersion.

Frequently Asked Questions – FAQs

Q1

Why Is Dispersion Important in Statistics?

The measures of dispersion are important as it helps in understanding how much data
is spread (i.e. its variation) around a central value.
Q2

How To Calculate Dispersion?

Dispersion can be calculated using various measures like range, mean deviation,
variance, standard deviation, etc.

Q3

What is the Variance of the values 3, 8, 6, 10, 12, 9, 11, 10, 12, 7?

The variance of the given values is 7.36.

Q4

What are the examples of dispersion measures?

Standard deviation, range, mean absolute difference, median absolute deviation,
interquartile range, and average deviation are examples of measures of dispersion.

Correlation in Statistics

This section shows how to calculate and interpret correlation coefficients for ordinal
and interval level scales. Methods of correlation summarize the relationship between
two variables in a single number called the correlation coefficient. The correlation
coefficient is usually represented using the symbol r, and it ranges from -1 to +1.

A correlation coefficient quite close to 0, but either positive or negative, implies little or
no relationship between the two variables. A correlation coefficient close to plus 1
means a positive relationship between the two variables, with increases in one of the
variables being associated with increases in the other variable.

A correlation coefficient close to -1 indicates a negative relationship between two


variables, with an increase in one of the variables being associated with a decrease in
the other variable. A correlation coefficient can be produced for ordinal, interval or ratio
level variables, but has little meaning for variables which are measured on a scale
which is no more than nominal.

For ordinal scales, the correlation coefficient can be calculated by using Spearman’s
rho. For interval or ratio level scales, the most commonly used correlation coefficient is
Pearson’s r, ordinarily referred to as simply the correlation coefficient.

Also, read: Correlation and Regression

What Does Correlation Measure?

In statistics, correlation studies and measures the direction and extent of the relationship
among variables; correlation measures co-variation, not causation. Therefore, we should
never interpret correlation as implying a cause-and-effect relation. For example, if there
exists a correlation between two variables X and Y, then when the value of one variable
changes in one direction, the value of the other variable is found to change either in the
same direction (a positive change) or in the opposite direction (a negative change).
Furthermore, if the correlation exists, it is linear, i.e. we can represent the relative
movement of the two variables by drawing a straight line on graph paper.

Correlation Coefficient

The correlation coefficient, r, is a summary measure that describes the extent of the
statistical relationship between two interval or ratio level variables. The correlation
coefficient is scaled so that it is always between -1 and +1. When r is close to 0 this
means that there is little relationship between the variables and the farther away from 0
r is, in either the positive or negative direction, the greater the relationship between the
two variables.

The two variables are often given the symbols X and Y. In order to illustrate how the two
variables are related, the values of X and Y are pictured by drawing a scatter diagram,
graphing combinations of the two variables. The scatter diagram is given first, and then
the method of determining Pearson's r is presented. In the following examples, relatively
small sample sizes are used; later, data from larger samples are given.

Scatter Diagram

A scatter diagram is a diagram that shows the values of two variables X and Y, along with
the way in which these two variables relate to each other. The values of variable X are
given along the horizontal axis, with the values of the variable Y given on the vertical
axis.

Later, when the regression model is used, one of the variables is defined as an
independent variable, and the other is defined as a dependent variable. In regression,
the independent variable X is considered to have some effect or influence on the
dependent variable Y. Correlation methods are symmetric with respect to the two
variables, with no indication of causation or direction of influence being part of the
statistical consideration. A scatter diagram is given in the following example. The same
example is later used to determine the correlation coefficient.

Types of Correlation

The scatter plot explains the correlation between the two attributes or variables. It
represents how closely the two variables are connected. There can be three such
situations to see the relation between the two variables –

• Positive Correlation – when the values of the two variables move in the same
direction so that an increase/decrease in the value of one variable is followed by
an increase/decrease in the value of the other variable.
• Negative Correlation – when the values of the two variables move in the opposite
direction so that an increase/decrease in the value of one variable is followed by
decrease/increase in the value of the other variable.

• No Correlation – when there is no linear dependence or no relation between the


two variables.

Correlation Formula

Correlation shows the relation between two variables. Correlation coefficient shows the
measure of correlation. To compare two datasets, we use the correlation formulas.

Pearson Correlation Coefficient Formula

The most common formula is the Pearson Correlation coefficient used for linear
dependency between the data sets. The value of the coefficient lies between -1 to
+1. When the coefficient comes down to zero, then the data is considered as not
related. While, if we get the value of +1, then the data are positively correlated, and -1
has a negative correlation.
The Pearson correlation coefficient is given by:

r = [n Σxy − (Σx)(Σy)] / √{[n Σx² − (Σx)²] [n Σy² − (Σy)²]}

Where n = Quantity of Information

Σx = Total of the First Variable Values

Σy = Total of the Second Variable Values

Σxy = Sum of the Products of the First and Second Values

Σx² = Sum of the Squares of the First Values

Σy² = Sum of the Squares of the Second Values

Linear Correlation Coefficient Formula

The linear correlation coefficient is the same Pearson correlation coefficient r given
above; it measures the strength and direction of a linear relationship between two
variables.

Sample Correlation Coefficient Formula

The formula is given by:

rxy = Sxy / (Sx Sy)

Where Sx and Sy are the sample standard deviations, and Sxy is the sample covariance.

Population Correlation Coefficient Formula

The population correlation coefficient uses σx and σy as the population standard


deviations and σxy as the population covariance.

rxy = σxy / (σx σy)

• Pearson Correlation Formula

• Correlation Coefficient Formula

• Linear Correlation Coefficient Formula


Correlation Example

Years of Education and Age of Entry into the Labour Force. Table 1 gives the number of
years of formal education (X) and the age of entry into the labour force (Y) for 12 males
from the Regina Labour Force Survey. Both variables are measured in years, a ratio level
of measurement and the highest level of measurement. All of the males are aged close to
30, so most of them are likely to have completed their formal education.

Respondent Number Years of Education, X Age of Entry into Labour Force, Y

1 10 16

2 12 17

3 15 18

4 8 15

5 20 18

6 17 22

7 12 19

8 15 22

9 12 18

10 10 15

11 8 18
12 10 16

Table 1. Years of Education and Age of Entry into Labour Force for 12 Regina Males

Since most males enter the labour force soon after they leave formal schooling, a close
relationship between these two variables is expected. By looking through the table, it
can be seen that those respondents who obtained more years of schooling generally
entered the labour force at an older age. The mean years of schooling are x̄ = 12.4 years
and the mean age of entry into the labour force is ȳ = 17.8 years, a difference of 5.4 years.

This difference roughly reflects the age of entry into formal schooling, that is, age five or
six. It can be seen, though, that the relationship between years of schooling and age of
entry into the labour force is not perfect. Respondent 11, for example, has only 8 years
of schooling but did not enter the labour force until the age of 18. In contrast,
respondent 5 has 20 years of schooling but entered the labour force at the age of 18.
The scatter diagram provides a quick way of examining the relationship between X and
Y.
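Pearson's r for the data of Table 1 can be computed directly with NumPy; this is a small sketch, with np.corrcoef returning the 2×2 correlation matrix whose off-diagonal entry is r:

• Python3

import numpy as np

x = np.array([10, 12, 15, 8, 20, 17, 12, 15, 12, 10, 8, 10])    # years of education (Table 1)
y = np.array([16, 17, 18, 15, 18, 22, 19, 22, 18, 15, 18, 16])  # age of entry into labour force

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(round(r, 3))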

Frequently Asked Questions on Correlation – FAQs

Q1

What is a correlation in statistics?


In statistics, correlation is a statistic that establishes the relationship between two
variables. In other words, it is the measure of association of variables.

Q2

What is a correlation of 1?

A correlation of 1 or +1 shows a perfect positive correlation, which means both the


variables move in the same direction.
A correlation of -1 shows a perfect negative correlation, which means as one variable
goes down, the other goes up.

Q3

What does a correlation of 0.45 mean?

We know that a correlation of 1 means the two variables are perfectly positively
associated, whereas a correlation coefficient of 0 means there is no linear correlation
between the two variables. A correlation of 0.45 therefore indicates a moderate positive
relationship; the proportion of variance in one variable, say x, accounted for by the other
variable, say y, is r² = 0.45² ≈ 0.20, i.e. about 20%.

Q4

What are the 4 types of correlation?

The four types of correlation coefficients are given by:


Pearson Correlation Coefficient
Linear Correlation Coefficient
Sample Correlation Coefficient
Population Correlation Coefficient

Q5

What is a correlation example?

Positive, negative, or no correlation can be observed between two variables. An example
of a positive correlation would be the dimensions and weight of objects: bigger objects
tend to be heavier, and smaller objects tend to be lighter.

Data reduction
Data reduction is a technique used in data mining to reduce the size of a dataset while
still preserving the most important information. This can be beneficial in situations
where the dataset is too large to be processed efficiently, or where the dataset contains
a large amount of irrelevant or redundant information.

There are several different data reduction techniques that can be used in data
mining, including:
1. Data Sampling: This technique involves selecting a subset of the data to work
with, rather than using the entire dataset. This can be useful for reducing the size
of a dataset while still preserving the overall trends and patterns in the data.

2. Dimensionality Reduction: This technique involves reducing the number of


features in the dataset, either by removing features that are not relevant or by
combining multiple features into a single feature.

3. Data Compression: This technique involves using techniques such as lossy or


lossless compression to reduce the size of a dataset.

4. Data Discretization: This technique involves converting continuous data into


discrete data by partitioning the range of possible values into intervals or bins.

5. Feature Selection: This technique involves selecting a subset of features from


the dataset that are most relevant to the task at hand.

It is important to note that data reduction involves a trade-off between accuracy and
the size of the data: the more the data is reduced, the more information may be lost,
which can make the model less accurate and less generalizable.

In conclusion, data reduction is an important step in data mining, as it can help to


improve the efficiency and performance of machine learning algorithms by reducing the
size of the dataset. However, it is important to be aware of the trade-off between the
size and accuracy of the data, and carefully assess the risks and benefits before
implementing it.
Methods of data reduction:
These are explained as following below.

1. Data Cube Aggregation:


This technique is used to aggregate data into a simpler form. For example, imagine the
information you gathered for your analysis for the years 2012 to 2014 includes the
revenue of your company every three months. If you are interested in annual sales rather
than the quarterly figures, you can summarize the data so that the resulting data shows
total sales per year instead of per quarter.

2. Dimension reduction:
Whenever we come across attributes that are only weakly relevant, we keep only the
attributes required for our analysis. Dimension reduction reduces data size by eliminating
outdated or redundant features.

• Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step we add the best of the
remaining original attributes, judged by their relevance (for example, using a p-value in
statistics).

Suppose there are the following attributes in the data set, of which a few are redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}

Initial reduced attribute set: { }

Step-1: {X1}

Step-2: {X1, X2}

Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

• Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original data and, at each
step, eliminates the worst remaining attribute from the set.

Suppose there are the following attributes in the data set, of which a few are redundant.

Initial attribute Set: {X1, X2, X3, X4, X5, X6}

Initial reduced attribute set: {X1, X2, X3, X4, X5, X6 }

Step-1: {X1, X2, X3, X4, X5}

Step-2: {X1, X2, X3, X5}

Step-3: {X1, X2, X5}

Final reduced attribute set: {X1, X2, X5}

• Combination of Forward and Backward Selection –
Combining both approaches lets us select the best attributes and remove the worst ones
at each step, saving time and making the process faster. A minimal sketch of step-wise
forward selection is given below.
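Below is a minimal sketch of step-wise forward selection under stated assumptions: scikit-learn's LinearRegression and cross_val_score provide the scoring, the data X and target y are synthetic placeholders, and attributes are added greedily as long as the cross-validated score keeps improving. It illustrates the idea rather than any particular textbook procedure.

• Python3

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# illustrative data: X1, X2 and X5 carry signal, the rest are noise
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 6)),
                 columns=["X1", "X2", "X3", "X4", "X5", "X6"])
y = 3 * X["X1"] - 2 * X["X2"] + X["X5"] + rng.normal(scale=0.1, size=200)

selected, remaining = [], list(X.columns)
best_score = -np.inf
while remaining:
    # score every candidate attribute when added to the current set
    scores = {c: cross_val_score(LinearRegression(), X[selected + [c]], y, cv=5).mean()
              for c in remaining}
    candidate, score = max(scores.items(), key=lambda kv: kv[1])
    if score <= best_score:      # stop when no candidate improves the score
        break
    selected.append(candidate)
    remaining.remove(candidate)
    best_score = score

print("Reduced attribute set:", selected)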

3. Data Compression:
The data compression technique reduces the size of the files using different encoding
mechanisms (Huffman Encoding & run-length Encoding). We can divide it into two
types based on their compression techniques.

• Lossless Compression –
Encoding techniques (Run Length Encoding) allow a simple and minimal data
size reduction. Lossless data compression uses algorithms to restore the precise
original data from the compressed data.

• Lossy Compression –
Methods such as the Discrete Wavelet Transform and PCA (principal component analysis)
are examples of this type of compression. For example, the JPEG image format uses lossy
compression, but the decompressed image remains meaningfully equivalent to the
original. In lossy data compression, the decompressed data may differ from the original
data but is still useful enough to retrieve information from.

4. Numerosity Reduction:
In this reduction technique, the actual data is replaced with a mathematical model or a
smaller representation of the data; only the model parameters need to be stored
(parametric methods). Alternatively, non-parametric methods such as clustering,
histograms, and sampling can be used.

5. Discretization & Concept Hierarchy Operation:


Techniques of data discretization are used to divide the attributes of the continuous
nature into data with intervals. We replace many constant values of the attributes by
labels of small intervals. This means that mining results are shown in a concise, and
easily understandable way.

• Top-down discretization –
If you first consider one or a couple of points (so-called breakpoints or split
points) to divide the whole set of attributes and repeat this method up to the end,
then the process is known as top-down discretization also known as splitting.

• Bottom-up discretization –
If you first consider all the constant values as split points, some are discarded
through a combination of the neighborhood values in the interval, that process is
called bottom-up discretization.

Concept Hierarchies:
It reduces the data size by collecting and then replacing the low-level concepts (such as
43 for age) with high-level concepts (categorical variables such as middle age or
Senior).

For numeric data following techniques can be followed:

• Binning –
Binning is the process of changing numerical variables into categorical
counterparts. The number of categorical counterparts depends on the number of
bins specified by the user.

• Histogram analysis –
Like binning, a histogram is used to partition the values of the attribute X into disjoint
ranges called brackets. There are several partitioning rules (a short binning sketch in
Python follows this list):

1. Equal Frequency partitioning: Partitioning the values based on their


number of occurrences in the data set.
2. Equal Width Partitioning: Partitioning the values into intervals of a fixed
width, based on the number of bins, e.g. intervals covering values from 0 to 20.

3. Clustering: Grouping similar data together.
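A short sketch of equal-width and equal-frequency binning with pandas (pd.cut performs equal-width partitioning and pd.qcut equal-frequency partitioning; the ages list is illustrative):

• Python3

import pandas as pd

ages = pd.Series([5, 12, 18, 23, 31, 37, 43, 45, 52, 58, 64, 70])

equal_width = pd.cut(ages, bins=3)    # three intervals of equal width
equal_freq = pd.qcut(ages, q=3)       # three intervals with (roughly) equal counts

print(equal_width.value_counts().sort_index())
print(equal_freq.value_counts().sort_index())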

Advantages and Disadvantages of Data Reduction in Data Mining:

Data reduction in data mining can have a number of advantages and


disadvantages.

Advantages:

1. Improved efficiency: Data reduction can help to improve the efficiency of


machine learning algorithms by reducing the size of the dataset. This can make it
faster and more practical to work with large datasets.

2. Improved performance: Data reduction can help to improve the performance of


machine learning algorithms by removing irrelevant or redundant information
from the dataset. This can help to make the model more accurate and robust.

3. Reduced storage costs: Data reduction can help to reduce the storage costs
associated with large datasets by reducing the size of the data.

4. Improved interpretability: Data reduction can help to improve the interpretability


of the results by removing irrelevant or redundant information from the dataset.

Disadvantages:

1. Loss of information: Data reduction can result in a loss of information, if


important data is removed during the reduction process.

2. Impact on accuracy: Data reduction can impact the accuracy of a model, as


reducing the size of the dataset can also remove important information that is
needed for accurate predictions.

3. Impact on interpretability: Data reduction can make it harder to interpret the


results, as removing irrelevant or redundant information can also remove context
that is needed to understand the results.

4. Additional computational costs: Data reduction can add additional


computational costs to the data mining process, as it requires additional
processing time to reduce the data.

In conclusion, data reduction can have both advantages and disadvantages. It can
improve the efficiency and performance of machine learning algorithms by reducing the
size of the dataset. However, it can also result in a loss of information and make the
results harder to interpret. It is important to weigh the pros and cons of data reduction
and carefully assess the risks and benefits before implementing it.

Principal Component Analysis(PCA)

• Principal Component Analysis (PCA) is a statistical procedure that uses an
orthogonal transformation to convert a set of correlated variables into a set of
uncorrelated variables. PCA is one of the most widely used tools in exploratory data
analysis and in machine learning for predictive models.

• Principal Component Analysis (PCA) is an unsupervised learning technique used to
examine the interrelations among a set of variables. It is also known as general
factor analysis, where regression determines a line of best fit.

• The main goal of Principal Component Analysis (PCA) is to reduce the dimensionality
of a dataset while preserving the most important patterns or relationships between
the variables, without any prior knowledge of the target variables.

Principal Component Analysis (PCA) is used to reduce the dimensionality of a data set
by finding a new set of variables, smaller than the original set of variables, retaining
most of the sample’s information, and useful for the regression and classification of
data.

Principal Component Analysis

1. Principal Component Analysis (PCA) is a technique for dimensionality reduction
that identifies a set of orthogonal axes, called principal components, that capture
the maximum variance in the data. The principal components are linear combinations of
the original variables in the dataset and are ordered in decreasing order of
importance. The total variance captured by all the principal components is equal to
the total variance in the original dataset.

2. The first principal component captures the most variation in the data, while the
second principal component captures the maximum variance that is orthogonal to the
first principal component, and so on.

3. Principal Component Analysis can be used for a variety of purposes, including data
visualization, feature selection, and data compression. In data visualization, PCA
can be used to plot high-dimensional data in two or three dimensions, making it easier
to interpret. In feature selection, PCA can be used to identify the most important
variables in a dataset. In data compression, PCA can be used to reduce the size of a
dataset without losing important information.

4. In Principal Component Analysis, it is assumed that the information is carried in
the variance of the features; that is, the higher the variation in a feature, the more
information that feature carries (a short sketch follows this list).
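
As a hedged sketch of the points above, the code below applies scikit-learn's PCA to
standardized synthetic data; the synthetic variables, the correlation between x1 and x2,
and the choice of two components are assumptions made only for illustration.

• Python3

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Hypothetical correlated data: 100 samples, 3 features (illustrative only)
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=100)   # correlated with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

# PCA is sensitive to scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

# Keep the two components that capture the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of total variance captured by each principal component
print(pca.explained_variance_ratio_)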

Advantages of Principal Component Analysis

1. Dimensionality Reduction: Principal Component Analysis is a popular technique used
for dimensionality reduction, which is the process of reducing the number of variables
in a dataset. By reducing the number of variables, PCA simplifies data analysis,
improves performance, and makes it easier to visualize data.

2. Feature Selection: Principal Component Analysis can be used for feature selection,
which is the process of selecting the most important variables in a dataset. This is
useful in machine learning, where the number of variables can be very large, and it is
difficult to identify the most important variables.

3. Data Visualization: Principal Component Analysis can be used for data
visualization. By reducing the number of variables, PCA can plot high-dimensional data
in two or three dimensions, making it easier to interpret.

Disadvantages of Principal Component Analysis

1. Interpretation of Principal Components: The principal components created by
Principal Component Analysis are linear combinations of the original variables, and it
is often difficult to interpret them in terms of the original variables. This can make
it difficult to explain the results of PCA to others.

2. Data Scaling: Principal Component Analysis is sensitive to the scale of the data.
If the data is not properly scaled, then PCA may not work well. Therefore, it is
important to scale the data before applying Principal Component Analysis.

3. Information Loss: Principal Component Analysis can result in information loss.
While Principal Component Analysis reduces the number of variables, it can also lead
to loss of information. The degree of information loss depends on the number of
principal components selected. Therefore, it is important to carefully select the
number of principal components to retain.

4. Non-linear Relationships: Principal Component Analysis assumes that the
relationships between variables are linear. However, if there are non-linear
relationships between variables, Principal Component Analysis may not work well.

Frequently Asked Questions (FAQs)

1. What is Principal Component Analysis (PCA)?

PCA is a dimensionality reduction technique used in statistics and machine learning to
transform high-dimensional data into a lower-dimensional representation, preserving
the most important information.

2. How does PCA work?

Principal components are linear combinations of the original features that PCA finds
and uses to capture the most variance in the data. These orthogonal components are
ordered by the amount of variance they explain.

3. When should PCA be applied?

Using PCA is advantageous when working with multicollinear or high-dimensional
datasets. Feature extraction, noise reduction, and data preprocessing are prominent
uses for it.

4. How are principal components interpreted?

Each principal component represents a new axis in the feature space. The larger the
share of variance a component explains, the more significant it is in capturing the
variability of the data.

5. What is the significance of principal components?

Principal components represent the directions in which the data varies the most. The
first few components typically capture the majority of the data’s variance, allowing for a
more concise representation.
Independent Component Analysis

Independent Component Analysis (ICA) is a statistical and computational technique used
in machine learning to separate a multivariate signal into its independent
non-Gaussian components. The goal of ICA is to find a linear transformation of the
data such that the transformed data is as close to being statistically independent as
possible.

The heart of ICA lies in the principle of statistical independence: ICA identifies
components within mixed signals that are statistically independent of each other.

Statistical Independence Concept:

Two random variables X and Y are statistically independent if the joint probability
distribution of the pair is equal to the product of their individual probability
distributions, which means that knowing the outcome of one variable does not change
the probability of the other outcome:

P(X = x, Y = y) = P(X = x) · P(Y = y), or equivalently p(x, y) = p(x) · p(y).
Assumptions in ICA

1. The first assumption asserts that the source signals (original signals) are
statistically independent of each other.

2. The second assumption is that each source signal exhibits a non-Gaussian
distribution.

Mathematical Representation of Independent Component Analysis

The observed random vector is x = (x1, x2, …, xm), representing the observed data with
m components. The hidden components are represented by the random vector
s = (s1, s2, …, sn), where n is the number of hidden sources.

Linear Static Transformation

The observed data x is transformed into the hidden components s using a linear static
transformation represented by the matrix W:

s = W x

Here, W is the transformation (unmixing) matrix.

The goal is to transform the observed data x in such a way that the resulting hidden
components are independent. The independence is measured by some objective function
F(s1, …, sn). The task is to find the optimal transformation matrix W that maximizes
the independence of the hidden components.

Advantages of Independent Component Analysis (ICA):

• ICA is a powerful tool for separating mixed signals into their independent
components. This is useful in a variety of applications, such as signal
processing, image analysis, and data compression.

• ICA is a non-parametric approach, which means that it does not require assumptions
about the underlying probability distribution of the data.

• ICA is an unsupervised learning technique, which means that it can be applied to
data without the need for labeled examples. This makes it useful in situations where
labeled data is not available.

• ICA can be used for feature extraction, which means that it can identify
important features in the data that can be used for other tasks, such as
classification.

Disadvantages of Independent Component Analysis (ICA):

• ICA assumes that the underlying sources are non-Gaussian, which may not
always be true. If the underlying sources are Gaussian, ICA may not be effective.
• ICA assumes that the sources are mixed linearly, which may not always be the
case. If the sources are mixed nonlinearly, ICA may not be effective.

• ICA can be computationally expensive, especially for large datasets. This can
make it difficult to apply ICA to real-world problems.

• ICA can suffer from convergence issues, which means that it may not always be
able to find a solution. This can be a problem for complex datasets with many
sources.

Cocktail Party Problem

Consider Cocktail Party Problem or Blind Source Separation problem to understand the
problem which is solved by independent component analysis.

Problem: To extract independent sources’ signals from a mixed signal composed of the
signals from those sources.

Given: Mixed signal from five different independent sources.

Aim: To decompose the mixed signal into independent sources:

• Source 1

• Source 2

• Source 3

• Source 4

• Source 5

Solution: Independent Component Analysis


Here, a party is going on in a room full of people. There are ‘n’ speakers in that
room, and they are speaking simultaneously at the party. In the same room, there are
also ‘n’ microphones placed at different distances from the speakers, which are
recording the ‘n’ speakers’ voice signals. Hence, the number of speakers is equal to
the number of microphones in the room.

Now, using these microphones’ recordings, we want to separate all the ‘n’ speakers’
voice signals in the room, given that each microphone recorded the voice signals
coming from each speaker of different intensity due to the difference in distances
between them.

Decomposing the mixed signal of each microphone’s recording into an independent
source’s speech signal can be done by using the machine learning technique,
independent component analysis.

Here, X1, X2, …, Xn are the original signals present in the mixed signal, and Y1, Y2,
…, Yn are the new features: independent components that are independent of each other.

Implementing ICA in Python

FastICA is a specific implementation of the Independent Component Analysis (ICA)
algorithm that is designed for efficiency and speed.

Step 1: Import necessary libraries

The implementation requires importing numpy, FastICA from sklearn.decomposition, and matplotlib.

• Python3

import numpy as np

from sklearn.decomposition import FastICA

import matplotlib.pyplot as plt

Step 2: Generate Random Data and Mix the Signals

In the following code snippet,

• Random seed is set to generate random numbers.

• Samples and Time parameters are defined.

• Synthetic signals are generated and then combined to single matrix “S”.

• Noise is added to each element of the matrix.


• Matrix “A” is defined with coefficients that represent how the original signals are
combined to form the observed signals.

• The observed signals are obtained by multiplying the matrix “S” by the transpose
of the mixing matrix “A”.

• Python3

# Generate synthetic mixed signals

np.random.seed(42)

samples = 200

time = np.linspace(0, 8, samples)

signal_1 = np.sin(2 * time)

signal_2 = np.sign(np.sin(3 * time))

signal_3 = np.random.laplace(size= samples)

S = np.c_[signal_1, signal_2, signal_3]

S += 0.2 * np.random.normal(size=S.shape) # Add noise

# Mix the signals

A = np.array([[1, 1, 1], [0.5, 2, 1], [1.5, 1, 2]]) # Mixing matrix

X = S.dot(A.T) # Observed mixed signals

Step 3: Apply ICA to unmix the signals

In the following code snippet,

• An instance of the FastICA class is created and the number of independent
components is set to 3.

• Fast ICA algorithm is applied to the observed mixed signals ‘X’. This fits the
model to the data and transforms the data to obtain the estimated independent
sources (S_).

• Python3
ica = FastICA(n_components=3)

S_ = ica.fit_transform(X) # Estimated sources


Step 4: Visualize the signals

• Python3

# Plot the results

plt.figure(figsize=(8, 6))

plt.subplot(3, 1, 1)

plt.title('Original Sources')

plt.plot(S)

plt.subplot(3, 1, 2)

plt.title('Observed Signals')

plt.plot(X)

plt.subplot(3, 1, 3)

plt.title('Estimated Sources (FastICA)')

plt.plot(S_)

plt.tight_layout()

plt.show()

Output: three stacked plots showing the original sources, the observed mixed signals, and the estimated sources recovered by FastICA.
Difference between PCA and ICA

Both the techniques are used in signal processing and dimensionality reduction, but
they have different goals.

• PCA reduces the dimensions to avoid the problem of overfitting, while ICA
decomposes the mixed signal into its independent sources’ signals.

• PCA deals with the principal components, while ICA deals with the independent
components.

• PCA focuses on maximizing the variance, while ICA does not focus on the issue of
variance among the data points.

• PCA focuses on the mutual orthogonality property of the principal components, while
ICA does not focus on the mutual orthogonality of the components.

• PCA does not focus on the mutual independence of the components, while ICA focuses
on the mutual independence of the components.


Frequently Asked Questions (FAQs)


Q. What is the difference between PCA and ICA?

PCA emphasizes capturing maximum variance and provides uncorrelated components, while
ICA focuses on extracting statistically independent components, even if they are
correlated; hence, ICA is suitable for blind source separation and signal extraction
tasks.

Q. What is the application of ICA?

ICA is applied in diverse fields such as signal processing, neuroscience, and finance.
It is employed for blind source separation, extracting independent components from
mixed signals, leading to advancements in areas like biomedical signal processing,
communication systems, and environmental monitoring.

Q. Is ICA for dimensionality reduction?

ICA is a linear dimensionality reduction method, converting a dataset into sets of
independent components. The number of independent components extracted by ICA
corresponds to the dimensions or features present in the original dataset.


Hypothesis Testing

Hypothesis testing is a statistical procedure used to test assumptions or hypotheses
about a population parameter. It involves formulating a null hypothesis (H0) and an
alternative hypothesis (Ha), collecting data, and determining whether the evidence is
strong enough to reject the null hypothesis.

The primary purpose of hypothesis testing is to make inferences about a population
based on a sample of data. It allows researchers and analysts to quantify the
likelihood that observed differences or relationships in the data occurred by chance
rather than reflecting a true effect in the population.

Steps of Hypothesis Testing

Let’s walk through how to do a hypothesis test, one step at a time.

Step 1: State your hypotheses

The first step is to formulate your research question into two competing hypotheses:
1. Null Hypothesis (H0): This is the default assumption that there is no effect or
difference.

2. Alternative Hypothesis (Ha): This is the hypothesis that there is an effect or
difference.

For example:

• H0: The mean height of men is equal to the mean height of women.

• Ha: The mean height of men is not equal to the mean height of women.

Step 2: Collect and prepare data

Gather data through experiments, surveys, or observational studies. Ensure the data
collection method is designed to test the hypothesis and is representative of the
population. This step often involves:

• Defining the population of interest.

• Selecting an appropriate sampling method.

• Determining the sample size.

• Collecting and organizing the data.

Step 3: Choose the appropriate statistical test

Select a statistical test based on the type of data and the hypothesis. The choice
depends on factors such as:

• Data type (continuous, categorical, etc.)

• Distribution of the data (normal, non-normal)

• Sample size

• Number of groups being compared

Common tests include:

• t-tests (for comparing means)

• chi-square tests (for categorical data)

• ANOVA (for comparing means of multiple groups)

Step 4: Calculate the test statistic and p-value

Use statistical software or formulas to compute the test statistic and corresponding p-
value. This step quantifies how much the sample data deviates from the null
hypothesis.
The p-value is an important concept in hypothesis testing. It represents the probability
of observing results as extreme as the sample data, assuming the null hypothesis is
true.

Step 5: Make a decision

Compare the p-value to the predetermined significance level (α), which is typically set
at 0.05. The decision rule is as follows:

• If p-value ≤ α: Reject the null hypothesis, suggesting evidence supports the
alternative hypothesis.

• If p-value > α: Fail to reject the null hypothesis, suggesting insufficient evidence
to support the alternative hypothesis.

It's important to note that failing to reject the null hypothesis doesn't prove it's true; it
simply means there's not enough evidence to conclude otherwise.
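
A minimal sketch of Steps 4 and 5 using an independent-samples t-test from scipy; the two
synthetic samples and the significance level of 0.05 are assumptions chosen only for
illustration.

• Python3

import numpy as np
from scipy import stats

# Hypothetical samples: heights (in cm) of two groups (illustrative only)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=175, scale=7, size=40)
group_b = rng.normal(loc=168, scale=7, size=40)

# Step 4: compute the test statistic and p-value
t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Step 5: compare the p-value with the significance level alpha
alpha = 0.05
if p_value <= alpha:
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}: reject the null hypothesis")
else:
    print(f"t = {t_stat:.3f}, p = {p_value:.4f}: fail to reject the null hypothesis")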

Step 6: Present your findings

Report the results, including the test statistic, p-value, and conclusion. Discuss
whether the findings support the initial hypothesis and their implications. When
presenting results, consider:

• Providing context for the study.

• Clearly stating the hypotheses.

• Reporting the test statistic and p-value.

• Interpreting the results in plain language.

• Discussing the practical significance of the findings.

Types of Hypothesis Tests

Hypothesis tests can be broadly categorized into two main types:

Parametric tests

Parametric tests assume that the data follows a specific probability distribution,
typically the normal distribution. These tests are generally more powerful when the
assumptions are met. Common parametric tests include:

• t-tests (one-sample, independent samples, paired samples)

• ANOVA (one-way, two-way, repeated measures)

• Z-tests (one-sample, two-sample)

• F-tests (one-way, two-way)


Non-parametric tests

Non-parametric tests don't assume a specific distribution of the data. They are useful
when dealing with ordinal data or when the assumptions of parametric tests are
violated. Examples include:

• Mann-Whitney U test

• Wilcoxon signed-rank test

• Kruskal-Wallis test

Selecting the appropriate test

When choosing a hypothesis test, researchers consider a few broad categories:

1. Data Distribution: Determine if your data is normally distributed, as many tests
assume normality.

2. Number of Groups: Identify how many groups you're comparing (e.g., one group,
two groups, or more).

3. Group Independence: Decide if your groups are independent (different subjects) or
dependent (same subjects measured multiple times).

4. Data Type:

• Continuous (e.g., height, weight),

• Ordinal (e.g., rankings),

• Nominal (e.g., categories without order).

Based on these categories, you can select the appropriate statistical test. For instance,
if your data is normally distributed and you have two independent groups with
continuous data, you would use an Independent t-test. If your data is not normally
distributed with two independent groups and ordinal data, a Mann-Whitney U test is
recommended.

To help choose the appropriate test, consider using a hypothesis test flow chart as a
general guide:

Flow chart: choosing the right hypothesis test for normally distributed data.

Flow chart: choosing the right hypothesis test for non-normally distributed data.

Modern Approaches to Hypothesis Testing

In addition to traditional hypothesis testing methods, there are several modern
approaches:
Permutation or randomization tests

These tests involve randomly shuffling the observed data many times to create a
distribution of possible outcomes under the null hypothesis. They are particularly useful
when dealing with small sample sizes or when the assumptions of parametric tests are
not met.
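
A minimal permutation-test sketch on the difference in sample means, assuming two small
hypothetical samples and 5,000 random shuffles; the values are illustrative only.

• Python3

import numpy as np

rng = np.random.default_rng(0)
a = np.array([12.1, 11.8, 13.0, 12.4, 12.9])   # hypothetical sample A
b = np.array([11.2, 11.5, 11.9, 11.1, 11.6])   # hypothetical sample B

observed = a.mean() - b.mean()
pooled = np.concatenate([a, b])

count = 0
n_shuffles = 5000
for _ in range(n_shuffles):
    rng.shuffle(pooled)                        # randomly reassign group labels
    diff = pooled[:len(a)].mean() - pooled[len(a):].mean()
    if abs(diff) >= abs(observed):             # at least as extreme as observed
        count += 1

p_value = count / n_shuffles                   # two-sided permutation p-value
print(p_value)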

Bootstrapping

Bootstrapping is a resampling technique that involves repeatedly sampling with
replacement from the original dataset. It can be used to estimate the sampling
distribution of a statistic and construct confidence intervals.
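
A minimal bootstrap sketch that estimates a 95% confidence interval for the mean by
resampling with replacement; the sample values and the 10,000 resamples are assumptions
made for illustration.

• Python3

import numpy as np

rng = np.random.default_rng(1)
sample = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.3, 4.4])   # hypothetical data

# Resample with replacement and record the mean of each bootstrap sample
boot_means = np.array([
    rng.choice(sample, size=len(sample), replace=True).mean()
    for _ in range(10_000)
])

# Percentile bootstrap 95% confidence interval for the mean
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(lower, upper)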

Monte Carlo simulation

Monte Carlo methods use repeated random sampling to obtain numerical results. In
hypothesis testing, they can be used to estimate p-values for complex statistical
models or when analytical solutions are difficult to obtain.

Controlling for Errors

When conducting hypothesis tests, it's best to understand and control for potential
errors:

Type I and Type II errors

• Type I Error: Rejecting the null hypothesis when it's actually true (false positive).

• Type II Error: Failing to reject the null hypothesis when it's actually false (false
negative).

The significance level (α) directly controls the probability of a Type I error. Decreasing α
reduces the chance of Type I errors but increases the risk of Type II errors.

To balance these errors:

1. Adjust the significance level based on the consequences of each error type.

2. Increase sample size to improve the power of the test.

3. Use one-tailed tests when appropriate.

The file drawer effect

The file drawer effect refers to the publication bias where studies with significant results
are more likely to be published than those with non-significant results. This can lead to
an overestimation of effects in the literature. To mitigate this:

• Consider pre-registering studies.

• Publish all results, significant or not.


• Conduct meta-analyses that account for publication bias.

• Simulate data beforehand.

Glossary of Key Terms and Definitions

• Null Hypothesis (H0): The default assumption that there is no effect or difference.

• Alternative Hypothesis (Ha): The hypothesis that there is an effect or difference.

• P-value: The probability of observing the test results under the null hypothesis.

• Significance Level (α): The threshold for rejecting the null hypothesis,
commonly set at 0.05.

• Test Statistic: A standardized value used to compare the observed data with the
null hypothesis.

• Type I Error: Rejecting a true null hypothesis (false positive).

• Type II Error: Failing to reject a false null hypothesis (false negative).

• Statistical Power: The probability of correctly rejecting a false null hypothesis.

• Confidence Interval: A range of values that likely contains the true population
parameter.

• Effect Size: A measure of the magnitude of the difference or relationship being
tested.

7 Basic Statistics Concepts For Data Science

1. Descriptive Statistics

Descriptive statistics are used to describe the basic features of data, providing a
summary of a given data set that may represent either the entire population or a
sample of it. They are derived from calculations that include the following (a short
sketch follows this list):

• Mean: It is the central value which is commonly known as arithmetic average.

• Mode: It refers to the value that appears most often in a data set.

• Median: It is the middle value of the ordered set that divides it in exactly half.
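
A short sketch of the three measures above using Python's built-in statistics module; the
data values are assumed purely for illustration.

• Python3

import statistics

data = [12, 15, 15, 18, 20, 22, 22, 22, 25]   # hypothetical data set

print(statistics.mean(data))     # arithmetic average
print(statistics.mode(data))     # most frequent value (22)
print(statistics.median(data))   # middle value of the ordered data (20)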

2. Variability

Variability includes the following parameters (a computational sketch follows this list):

• Standard Deviation: It is a statistic that calculates the dispersion of a data set
relative to its mean.
• Variance: It is a statistical measure of the spread between the numbers in a data
set. In general terms, it is the average of the squared differences from the mean. A
large variance indicates that the numbers are far from the mean or average value; a
small variance indicates that the numbers are close to the average value; zero
variance indicates that all the values are identical.

• Range: This is defined as the difference between the largest and smallest value
of a dataset.

• Percentile: It is a measure used in statistics that indicates the value below which
the given percentage of observations in the dataset falls.

• Quartile: It is defined as the value that divides the data points into quarters.

• Interquartile Range: It measures the middle half of your data. In general terms, it
is the middle 50% of the dataset.
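
The sketch below computes the variability measures listed above with numpy; the data
values are hypothetical, and using the population formulas (ddof=0) is an assumed
convention.

• Python3

import numpy as np

data = np.array([12, 15, 15, 18, 20, 22, 22, 22, 25])   # hypothetical data set

print(np.std(data))                      # standard deviation (population, ddof=0)
print(np.var(data))                      # variance (population, ddof=0)
print(data.max() - data.min())           # range
print(np.percentile(data, 90))           # 90th percentile
q1, q2, q3 = np.percentile(data, [25, 50, 75])   # quartiles
print(q3 - q1)                           # interquartile range (middle 50%)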

3. Correlation

It is one of the major statistical techniques that measure the relationship between two
variables. The correlation coefficient indicates the strength of the linear relationship
between two variables.

• A correlation coefficient that is more than zero indicates a positive relationship.

• A correlation coefficient that is less than zero indicates a negative relationship.

• A correlation coefficient of zero indicates that there is no linear relationship
between the two variables (a short computation sketch follows this list).
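
A minimal sketch computing the Pearson correlation coefficient with scipy; the two
variables are synthetic, and the positive linear relationship between them is built in
purely for illustration.

• Python3

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
x = rng.normal(size=50)
y = 0.8 * x + 0.3 * rng.normal(size=50)   # roughly linear, positive relationship

r, p_value = stats.pearsonr(x, y)
print(r)         # close to +1 => strong positive linear relationship
print(p_value)   # small p-value => relationship unlikely to be due to chance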

4. Probability Distribution

It specifies the likelihood of all possible events. In simple terms, an event refers
to the result of an experiment, like tossing a coin. Events are of two types:
dependent and independent.

• Independent event: An event is said to be independent when it is not affected by
earlier events. For example, consider tossing a coin: if the first outcome is a head,
then when the coin is tossed again the outcome may be a head or a tail, entirely
independently of the first trial.

• Dependent event: The event is said to be dependent when the occurrence of the
event is dependent on the earlier events. For example when a ball is drawn from
a bag that contains red and blue balls. If the first ball drawn is red, then the
second ball may be red or blue; this depends on the first trial.

The probability of independent events occurring together is calculated by simply
multiplying the probability of each event, while for dependent events it is calculated
using conditional probability.
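
A tiny worked sketch of the two rules above, using the coin-toss and the red/blue ball
examples; the counts of three red and two blue balls are assumptions made for
illustration.

• Python3

# Independent events: two coin tosses, P(head) = 0.5 each time
p_two_heads = 0.5 * 0.5                    # multiply the individual probabilities
print(p_two_heads)                         # 0.25

# Dependent events: drawing 2 balls without replacement from a bag
# with 3 red and 2 blue balls (hypothetical counts)
p_first_red = 3 / 5
p_second_red_given_first_red = 2 / 4       # conditional probability after one red is removed
p_two_reds = p_first_red * p_second_red_given_first_red
print(p_two_reds)                          # 0.3
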
5. Regression

Regression is a method used to determine the relationship between one or more
independent variables and a dependent variable. Regression is mainly of two types (a
short sketch follows this list):

• Linear regression: It is used to fit a regression model that explains the
relationship between a numeric response variable and one or more predictor variables.

• Logistic regression: It is used to fit a regression model that explains the
relationship between a binary response variable and one or more predictor variables.
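
A hedged sketch of both regression types with scikit-learn; the synthetic predictor, the
coefficients, and the way the binary response is constructed are assumptions made only
for this example.

• Python3

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 1))                         # one predictor variable

# Linear regression: numeric response with a linear relationship plus noise
y_numeric = 2.0 * X[:, 0] + 1.0 + 0.1 * rng.normal(size=100)
lin = LinearRegression().fit(X, y_numeric)
print(lin.coef_, lin.intercept_)                      # close to 2.0 and 1.0

# Logistic regression: binary response derived from the predictor
y_binary = (X[:, 0] > 0).astype(int)
log = LogisticRegression(max_iter=1000).fit(X, y_binary)
print(log.predict(X[:5]))                             # predicted classes for 5 rows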

6. Normal Distribution

The normal distribution is used to define the probability density function for a
continuous random variable in a system. It has two parameters – the mean and the
standard deviation – which are discussed above. When the distribution of a random
variable is unknown, the normal distribution is often used as an approximation; the
central limit theorem justifies why the normal distribution is used in such cases.
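
A short sketch of working with a normal distribution in scipy; the mean of 50 and
standard deviation of 5 are assumed values chosen for illustration.

• Python3

from scipy import stats

# Normal distribution with assumed mean 50 and standard deviation 5
dist = stats.norm(loc=50, scale=5)

print(dist.pdf(50))          # density at the mean
print(dist.cdf(55))          # P(X <= 55), about 0.84 (one standard deviation above the mean)
print(dist.ppf(0.975))       # value below which 97.5% of the distribution lies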

7. Bias

In statistical terms, bias means that a statistic or model is not representative of
the complete population. Bias needs to be minimized to get the desired outcome.

The three most common types of bias are:

• Selection bias: It is the phenomenon of selecting a group of data for statistical
analysis in such a way that the data is not randomized, resulting in the data being
unrepresentative of the whole population.

• Confirmation bias: It occurs when the person performing the statistical analysis
has some predefined assumption.

• Time interval bias: It is caused intentionally by specifying a certain time range to
favor a particular outcome.
