
Statistics For Data Science

Statistics is like a toolkit we use to understand and make sense of information. It helps us collect, organize, analyze, and interpret data to find patterns, trends, and relationships in the world around us.

In this Statistics cheat sheet, you will find complex statistical concepts simplified, with clear explanations, practical examples, and essential formulas. This cheat sheet will make things easy when you are getting ready for an interview or just starting with data science. It explains topics like the mean, median, and hypothesis testing with examples, so you will pick them up in no time. With this cheat sheet, you will feel more confident about your stats skills and do well in interviews and real-life data jobs.

What is Statistics?

Statistics is the branch of mathematics that deals with collecting, analyzing, interpreting, presenting, and organizing data. It involves the study of methods for gathering, summarizing, and interpreting data to make informed decisions and draw meaningful conclusions.

Statistics is widely used in fields such as science, economics, social sciences, business, engineering, and even sports to provide insights, make predictions, and guide decision-making. It is a tool that helps us see patterns, trends, and relationships in the world around us. Whether it is counting how many people like pizza or figuring out the average score on a test, statistics helps us learn more about the world and make better choices based on data.

Types of Statistics

There are commonly two types of statistics, which are discussed below:

1. Descriptive Statistics: Descriptive statistics helps us simplify and organize big chunks of data, making large amounts of data easier to understand.

2. Inferential Statistics: Inferential statistics is a little different. It uses a smaller sample of data to draw conclusions about a larger group, helping us make predictions and draw conclusions about a population.

Descriptive Statistics
Statistics serves as the backbone of data science, providing tools and methodologies to extract meaningful insights from raw data. Data scientists rely on statistics for every crucial task, from cleaning messy datasets and creating powerful visualizations to building predictive models that glimpse into the future. Without statistics we cannot transform raw data into actionable insights that drive business success. Descriptive statistics help summarize and organize data so it becomes more understandable.
Types of Descriptive Statistics
There are three categories for standard classification of
descriptive statistics methods, each serving different purposes
in summarizing and describing data. They help us understand:
1. Where the data centers (Measures of Central Tendency)
2. How spread out the data is (Measures of Variability)
3. How the data is distributed (Measures of Frequency
Distribution)
1. Measures of Central Tendency
Statistical values that describe the central position within
a dataset. There are three main measures of central tendency:

Mean: The mean is the sum of the observations divided by the total number of observations, i.e. the average.
x̄ = Σx / n
where,
 x = observations
 n = number of observations
Let's look at an example of how we can find the mean of a dataset using a Python code implementation.
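A minimal sketch (the data values below are made up purely for illustration):

# Computing the mean: sum of observations divided by the number of observations.
data = [10, 20, 30, 40, 50]   # hypothetical observations

mean = sum(data) / len(data)
print(mean)                   # 30.0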

Mode: The most frequently occurring value in the dataset. It is useful for categorical data and in cases where knowing the most common choice is crucial.

Median: The middle value in the dataset, which splits the data into two halves. If the number of elements in the dataset is odd, the center element is the median; if it is even, the median is the average of the two central elements. For skewed data, the median often provides a better measure of central tendency than the mean.
Central tendency measures are the foundation for
understanding data distribution and identifying
anomalies. For example, the mean can reveal trends, while
the median highlights skewed distributions.
2. Measures of Variability: Understanding Data Dispersion
Knowing not just where the data centers but also how it
spreads out is crucial. Measures of variability, also called
measures of dispersion, help quantify the spread or distribution
of observations in a dataset. This is particularly important in
identifying outliers, assessing model assumptions, and
understanding data variability in relation to its mean. The key
measures of variability include:
1. Range: The range is the difference between the largest and smallest data points in our dataset. The bigger the range, the greater the spread of the data, and vice versa. While easy to compute, the range is sensitive to outliers. It can provide a quick sense of the data spread but should be complemented with other statistics.
Range = Largest data value − Smallest data value

2. Variance: The variance is the average squared deviation from the mean. It is calculated by finding the difference between every data point and the mean, squaring those differences, adding them all up, and then dividing by the number of data points in the dataset.
σ² = Σ(x − μ)² / N
where,
 x = observation under consideration
 N = number of terms
 μ = mean

3. Standard deviation: The standard deviation is widely used to measure the extent of variation or dispersion in data. It is especially important when assessing model performance (e.g., residuals) or comparing datasets with different means.
It is defined as the square root of the variance: find the mean, subtract it from each value, square the results, add them all up, divide by the number of terms, and take the square root.
σ = √( Σ(x − μ)² / N )
where,
 x = observation under consideration
 N = number of terms
 μ = mean

Variability measures are important in residual analysis to check
how well a model fits the data.
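As a quick illustration of the formulas above, Python's built-in statistics module can compute the population variance and standard deviation of a small made-up sample:

import statistics

# Hypothetical observations; pvariance/pstdev use the population formulas (divide by N),
# while statistics.variance/stdev would use the sample formulas (divide by n - 1).
data = [4, 8, 6, 5, 3, 7]

print(statistics.pvariance(data))   # population variance, sigma squared
print(statistics.pstdev(data))      # population standard deviation, sigma
print(max(data) - min(data))        # range = largest value - smallest value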
3. Measures of Frequency Distribution
A frequency distribution table is a powerful way to summarize how data points are distributed across different categories or intervals. It helps identify patterns, outliers, and the overall structure of the dataset, and it is often the first step in understanding the data before applying more advanced analytical methods or creating visualizations like histograms or pie charts.
A frequency distribution table includes measures like:
 Data intervals or categories
 Frequency counts
 Relative frequencies (percentages)
 Cumulative frequencies when needed
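A minimal sketch of building such a table, assuming pandas is available and using made-up exam scores and intervals:

import pandas as pd

# Hypothetical exam scores, used only for illustration.
scores = pd.Series([55, 62, 67, 71, 74, 78, 82, 85, 88, 93])

# Bin the scores into intervals and count how many fall into each one.
bins = [50, 60, 70, 80, 90, 100]
freq = pd.cut(scores, bins=bins).value_counts().sort_index()

table = pd.DataFrame({
    "frequency": freq,
    "relative_frequency": freq / freq.sum(),   # share of the total per interval
    "cumulative_frequency": freq.cumsum(),     # running total of counts
})
print(table)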
Basics of Statistics
Basic formulas of statistics are:

Parameter | Definition | Formula
Population Mean (μ) | The entire group for which information is required. | μ = Σx / N
Sample Mean (x̄) | A subset of the population, used when the entire population is too large to handle. | x̄ = Σx / n
Sample/Population Standard Deviation | Standard deviation is a measure that shows how much variation from the mean exists. | s = √( Σ(x − x̄)² / (n − 1) )
Sample/Population Variance | Variance is the measure of the spread of data around its central value. | Variance (Population) = Σ(x − x̄)² / n; Variance (Sample) = Σ(x − x̄)² / (n − 1)
Class Interval (CI) | A class interval is the range of values assigned to a group of data points. | Class Interval = Upper Limit − Lower Limit
Frequency (f) | The number of times a particular value appears in a dataset is called the frequency of that value. | f = count of occurrences of a value in the dataset
Range (R) | The range is the difference between the largest and smallest values of the dataset. | Range = Largest Data Value − Smallest Data Value

What is Data in Statistics?


Data is a collection of observations; it can be in the form of numbers, words, measurements, or statements.
Types of Data
1. Qualitative Data: This data is descriptive. For example: she is beautiful, he is tall, etc.
2. Quantitative Data: This is numerical information. For example: a horse has four legs.
Types of Quantitative Data
1. Discrete Data: It has a particular fixed value and can be
counted.
2. Continuous Data: It is not fixed but has a range of data
and can be measured.
Measures of Central Tendency
 Mean: The mean is calculated by summing all values present in the sample and dividing by the total number of values in the sample or population.
Formula: Mean (μ) = Sum of Values / Number of Values
 Median: The median is the middle value of a dataset when it is arranged from lowest to highest (or highest to lowest); to find the median, the data must be sorted. For an odd number of data points the median is the middle value, and for an even number of data points the median is the average of the two middle values.
o For an odd number of data points: Median = value at position (n + 1) / 2
o For an even number of data points: Median = average of the values at positions n / 2 and n / 2 + 1
 Mode: The most frequently occurring value in the sample or population is called the mode.
Measure of Dispersion

 Range: Range is the difference between the maximum
and minimum values of the Sample.
 Variance (σ²): Variance measures how spread out values are around the mean, i.e. the dispersion about the mean.
Formula: σ² = Σ(X − μ)² / n
 Standard Deviation (σ): Standard Deviation is the
square root of variance. The measuring unit of S.D. is
same as the Sample values' unit. It indicates the average
distance of data points from the mean and is widely used
due to its intuitive interpretation.
Formula: σ = √(σ²) = √( Σ(X − μ)² / n )
 Interquartile Range (IQR): The range between the first
quartile (Q1) and the third quartile (Q3). It is less sensitive
to extreme values than the range.
Formula: IQR = Q3 − Q1
To compute the IQR, arrange the data in ascending order and split it into a lower and an upper half; Q1 is the median of the lower half and Q3 is the median of the upper half.
 Quartiles: Quartiles divide the dataset into four equal parts:
o Q1 is the median of the lower half (25th percentile)
o Q2 is the overall median (50th percentile)
o Q3 is the median of the upper half (75th percentile)
 Mean Absolute Deviation: The average of the absolute
differences between each data point and the mean. It
provides a measure of the average deviation from the
mean.
Formula: Mean Absolute Deviation = Σ|X − μ| / n
 Coefficient of Variation (CV):
CV is the ratio of the standard deviation to the mean,
expressed as a percentage. It is useful for comparing the
relative variability of different datasets.
CV = (σ / μ) × 100
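As a rough illustration of these dispersion measures (numpy assumed, data values made up):

import numpy as np

# Illustrative sample; numpy's var/std default to the population formulas (ddof=0).
x = np.array([12, 15, 17, 19, 22, 24, 29, 35])

data_range = x.max() - x.min()               # range
variance = x.var()                           # variance (sigma squared)
std_dev = x.std()                            # standard deviation (sigma)
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                                # interquartile range
mad = np.mean(np.abs(x - x.mean()))          # mean absolute deviation
cv = std_dev / x.mean() * 100                # coefficient of variation, in percent

print(data_range, variance, std_dev, iqr, mad, cv)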
Probability Theory
Here are some basic concepts or terminologies used in
probability:

Term Definition

The set of all possible outcomes in a


Sample
probability experiment. For instance, in a coin
Space
toss, it’s “head” and “tail”.

One of the possible results in an experiment.


Sample
For example, in rolling a fair six-sided dice,
Point
sample points are 1 to 6.

A process or trial with uncertain results.


Experiment Examples include coin tossing, card selection,
or rolling a die.

A subset of the sample space representing


Event certain outcomes. Example: getting “1” when
rolling a die.

Favorable An outcome that produces the desired or


Outcome expected consequence.

Various other probability formulas are:

Concept | Meaning | Formula
Joint Probability (Intersection of Events) | Probability of events A and B occurring together | P(A and B) = P(A) × P(B) (for independent events)
Union of Events | Probability of event A or event B occurring | P(A or B) = P(A) + P(B) − P(A and B)
Conditional Probability | Probability of event A occurring given that event B has occurred | P(A | B) = P(A and B) / P(B)

Bayes Theorem
Bayes' Theorem is a fundamental concept in probability theory
that relates conditional probabilities. It is named after the
Reverend Thomas Bayes, who first introduced the theorem.
Bayes' Theorem is a mathematical formula that provides a way
to update probabilities based on new evidence. The formula is
as follows:
P(A|B) = P(B|A) × P(A) / P(B)
where
 P(A∣B): Probability of event A given that event B has
occurred (posterior probability).
 P(B∣A): Probability of event B given that event A has
occurred (likelihood).
 P(A): Probability of event A occurring (prior probability).
 P(B): Probability of event B occurring.
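A small worked sketch of Bayes' Theorem, using hypothetical screening-test numbers chosen only to show the arithmetic:

# Hypothetical prior and likelihoods (assumptions for illustration).
p_disease = 0.01            # P(A): prior probability of having the disease
p_pos_given_disease = 0.95  # P(B|A): probability of a positive test if diseased
p_pos_given_healthy = 0.05  # probability of a positive test if healthy

# P(B): total probability of a positive test (law of total probability)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# P(A|B): posterior probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))   # about 0.161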
Types of Probability Functions
 Probability Mass Function(PMF)
Probability Mass Function is a concept in probability theory that describes the probability distribution of a discrete random variable. The PMF gives the probability of each
possible outcome of a discrete random variable.
 Probability Density Function (PDF)
Probability Density Function describes the likelihood of a
continuous random variable falling within a particular
range. It's the derivative of the cumulative distribution
function (CDF).
 Cumulative Distribution Function (CDF)
Cumulative Distribution Function gives the probability that
a random variable will take a value less than or equal to a
given value. It's the integral of the probability density
function (PDF).
 Empirical Distribution Function (EDF):
Empirical Distribution Function is a non-parametric
estimator of the cumulative distribution function (CDF)
based on observed data. For a given set of data points,
the EDF represents the proportion of observations less
than or equal to a specific value. It is constructed by
sorting the data and assigning a cumulative probability to
each data point.
Probability Distribution Functions
Normal or Gaussian Distribution
The normal distribution is a continuous probability distribution characterized by its bell-shaped curve and described by its mean (μ) and standard deviation (σ).
Formula: f(X | μ, σ) = (1 / (σ√(2π))) · e^(−0.5 · ((X − μ) / σ)²)
There is an empirical rule in the normal distribution, which states that:
 Approximately 68% of the data falls within one standard
deviation (σ) of the mean in both directions. This is often
referred to as the 68-95-99.7 rule.

 About 95% of the data falls within two standard deviations
(2σ) of the mean.
 Approximately 99.7% of the data falls within three
standard deviations (3σ) of the mean.
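One way to sanity-check the empirical rule (a sketch, assuming scipy is installed) is to evaluate the normal CDF at ±1, ±2 and ±3 standard deviations:

from scipy import stats

mu, sigma = 0, 1   # standard normal; any mu and sigma give the same percentages

for k in (1, 2, 3):
    # probability mass within k standard deviations of the mean
    p = stats.norm.cdf(mu + k * sigma, mu, sigma) - stats.norm.cdf(mu - k * sigma, mu, sigma)
    print(k, round(p, 4))   # ~0.6827, ~0.9545, ~0.9973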
Central Limit Theorem
The Central Limit Theorem (CLT) states that, regardless of the shape of the original population distribution, the sampling distribution of the sample mean becomes approximately normal as the sample size grows large.
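A quick simulation sketch of the CLT (numpy assumed; the population, sample size, and repetition counts are arbitrary choices): means of repeated samples drawn from a skewed exponential population cluster around the population mean in a roughly bell-shaped way.

import numpy as np

rng = np.random.default_rng(0)

# A skewed (exponential) population, far from normal.
population = rng.exponential(scale=2.0, size=100_000)

# Sampling distribution of the mean for samples of size 50.
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]

# The mean of the sample means is close to the population mean, and a histogram
# of sample_means would look approximately bell-shaped.
print(round(population.mean(), 3), round(float(np.mean(sample_means)), 3))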
Student t-distribution
The t-distribution, also known as Student's t-distribution, is a
probability distribution that is used in statistics.
f(t) = [ Γ((df + 1) / 2) / ( √(df · π) · Γ(df / 2) ) ] · (1 + t² / df)^(−(df + 1) / 2)
where,
 Γ(.) is the gamma function
 df = Degrees of freedom
Chi-square Distribution
The chi-squared distribution, denoted χ², is a probability distribution used in statistics; it is related to the sum of squared standard normal deviates.
f(x; k) = (1 / (2^(k/2) · Γ(k/2))) · x^(k/2 − 1) · e^(−x/2)
Binomial Distribution
The binomial distribution models the number of successes in a
fixed number of independent Bernoulli trials, where each trial
has the same probability of success (p).
Formula: P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)
Assuming each trial is an independent event with a success probability of p = 0.5, the probability of getting 3 successes in 6 trials is: P(X = 3) = C(6, 3) · (0.5)³ · (1 − 0.5)³ = 0.3125
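The same value can be checked with a short sketch, assuming scipy is installed:

from scipy import stats

# Probability of exactly 3 successes in 6 trials with p = 0.5, as in the example above.
print(stats.binom.pmf(k=3, n=6, p=0.5))   # 0.3125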
Poisson Distribution
The Poisson distribution models the number of events that
occur in a fixed interval of time or space. It's characterized by a
single parameter (λ), the average rate of occurrence.
Formula: P(X = k) = e^(−λ) · λ^k / k!
For example, assuming the average rate is λ = 10 events per interval, the probability of observing exactly 12 events is: P(X = 12) = e^(−10) · 10¹² / 12! ≈ 0.0948
Uniform Distribution
The uniform distribution represents a constant probability for all
outcomes in a given range.
Formula: f(X) = 1 / (b − a)
For example, assuming a bus arrives uniformly between 5 and 18 minutes, the probability of waiting less than 15 minutes is: P(X < 15) = ∫₅¹⁵ 1 / (18 − 5) dX = 10 / 13 ≈ 0.7692
Parameter estimation for Statistical Inference
 Population: The population is the group of individuals, objects, or measurements about which you want to draw conclusions.
 Sample: A sample is a subset of the population; the group chosen from the larger population to gather information and make inferences about the entire population.
 Expectation: Expectation, in statistics and probability theory, represents the anticipated or average value of a random variable. It is denoted E(x).
 Parameter: A parameter is a numerical characteristic of a
population that is of interest in statistical analysis.
Examples of parameters include the population mean (μ), population standard deviation (σ), or the success
probability in a binomial distribution.
 Statistic: A statistic is a numerical value or measure
calculated from a sample of data. It is used to estimate or
infer properties of the corresponding population.
 Estimation: Estimation involves using sample data to
make inferences or predictions about population
parameters.
 Estimator: An estimator is a statistic used to estimate an
unknown parameter in a statistical model.
 Bias: Bias in parameter estimation refers to the
systematic error or deviation of the estimated value from
the true value of the parameter.
Bias(θ̂) = E(θ̂) − θ
An estimator is considered unbiased if, on average, it
produces parameter estimates that are equal to the true
parameter value. Bias is measured as the difference
between the expected value of the estimator and the true
parameter value.
E(θ̂) = θ

Hypothesis Testing
Hypothesis testing makes inferences about a population
parameter based on sample statistic.

Null Hypothesis (H₀) and Alternative Hypothesis (H₁)

 H₀: There is no significant difference or effect.
 H₁: There is a significant difference or effect (i.e. the null statement is false).
Degrees of freedom
Degrees of freedom (df) in statistics represent the number of values or quantities in the final calculation of a statistic that are free to vary. It is commonly computed as the sample size minus one (n − 1).
Level of Significance (α)
This is the threshold used to determine statistical significance. Common values are 0.05, 0.01, or 0.10.
p-value
The p-value, short for probability value, is a fundamental
concept in statistics that quantifies the evidence against a null
hypothesis.
 If p-value ≤ α: Reject the null hypothesis.
 If p-value > α: Fail to reject the null hypothesis (meaning
there isn't enough evidence to reject it).
Type I Error and Type II Error
A Type I Error occurs when the null hypothesis is true but the statistical test incorrectly rejects it. It is often referred to as a "false positive" or "alpha error."
A Type II Error occurs when the null hypothesis is false but the statistical test fails to reject it. It is often referred to as a "false negative."
Confidence Intervals
A confidence interval is a range of values that is used to
estimate the true value of a population parameter with a
certain level of confidence. It provides a measure of the
uncertainty or margin of error associated with a sample
statistic, such as the sample mean or proportion.
Example of Hypothesis testing:
Let us consider an e-commerce company that wants to assess whether a recent website redesign has a significant impact on the average time users spend on its website.
The company collects the following data:
 Data on user session durations before and after the
redesign.
 Before redesign: Sample mean (x̄) = 3.5 minutes, sample standard deviation (s) = 1.2 minutes, sample size (n) = 50.
 After redesign: Sample mean (x̄) = 4.2 minutes, sample standard deviation (s) = 1.5 minutes, sample size (n) = 60.
The hypotheses are defined as:
 Null Hypothesis (H₀): The website redesign has no impact on the average user session duration: μ_after − μ_before = 0
 Alternative Hypothesis (Hₐ): The website redesign has a positive impact on the average user session duration: μ_after − μ_before > 0
Significance Level:
Choose a significance level, α = 0.05 (commonly used).
Test Statistic and P-Value:
 Conduct a test for the difference in means.
 Calculate the test statistic and p-value.
Result:
 If the p-value is less than the chosen significance level,
reject the null hypothesis.
 If the p-value is greater than or equal to the significance
level, fail to reject the null hypothesis.
Interpretations:

Based on the analysis, the company draws conclusions about
whether the website redesign has a statistically significant
impact on user session duration.
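A hedged sketch of this test using scipy's ttest_ind_from_stats, which works directly from the summary statistics above; equal_var=False gives Welch's unequal-variance form, and the alternative keyword needs a reasonably recent SciPy (1.6 or later):

from scipy import stats

# Summary statistics from the example above (after vs. before the redesign).
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=4.2, std1=1.5, nobs1=60,   # after redesign
    mean2=3.5, std2=1.2, nobs2=50,   # before redesign
    equal_var=False, alternative="greater",
)

alpha = 0.05
print(round(t_stat, 3), round(p_value, 4))
print("Reject H0" if p_value < alpha else "Fail to reject H0")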
Statistical Tests:
Parametric tests are statistical methods that assume the data follows a normal distribution.

Z-test: Tests whether the mean of a sample is significantly different from a known or hypothesized population mean. Used when the population standard deviation is known and the sample size is sufficiently large.
One-sample: Z = (X̄ − μ) / (σ / √n)
Two-sample: Z = (X̄₁ − X̄₂) / √(σ₁² / n₁ + σ₂² / n₂)

t-test: Compares the means of two independent samples, or tests whether the mean of a sample is significantly different from a known population mean. Used when the population standard deviation is unknown or when dealing with small sample sizes.
One-sample: t = (X̄ − μ) / (s / √n)
Two-sample: t = (X̄₁ − X̄₂) / √(s₁² / n₁ + s₂² / n₂)
Paired: t = d̄ / (s_d / √n), where d = difference between paired observations

F-test: Compares the variances of two or more groups to assess whether they are significantly different.
F = s₁² / s₂²

ANOVA (Analysis Of Variance)

Source of Variation | Sum of Squares | Degrees of Freedom | Mean Squares | F-Value
Between Groups | SSB = Σ nᵢ (x̄ᵢ − x̄)² | df₁ = k − 1 | MSB = SSB / (k − 1) | F = MSB / MSE
Error (Within Groups) | SSE = Σ Σ (xᵢⱼ − x̄ᵢ)² | df₂ = N − k | MSE = SSE / (N − k) |
Total | SST = SSB + SSE | df₃ = N − 1 | |

There are mainly two types of ANOVA:


1. One-way ANOVA: Used to compare the means of three or
more groups to determine if there are statistically
significant differences among them.
here,
 H0: The means of all groups are equal.
 H1: At least one group mean is different.

23
2. Two-way ANOVA: It assesses the influence of two categorical independent variables on a dependent
variable, examining the main effects of each variable and
their interaction effect.
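A minimal one-way ANOVA sketch with scipy's f_oneway, using three made-up groups:

from scipy import stats

# Hypothetical scores for three groups, used only to illustrate the call.
group_a = [85, 90, 88, 75, 95]
group_b = [70, 65, 80, 72, 68]
group_c = [90, 92, 94, 89, 91]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(round(f_stat, 2), round(p_value, 4))
# A small p-value (< 0.05) suggests at least one group mean differs.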
Chi-Squared Test
The chi-squared test is a statistical test used to determine if
there is a significant association between two categorical
variables. It compares the observed frequencies in a
contingency table with the frequencies. Formula:
X2=Σ(Oij−Eij)2EijX2=ΣEij(Oij−Eij)2.
This test is also performed on big data with multiple number of
observations.
Non-Parametric Test
Non-parametric tests do not make assumptions about the distribution of the data. They are useful when the data does not meet the assumptions required for parametric tests.
 Mann-Whitney U Test: The Mann-Whitney U test is used to determine whether there is a difference between two independent groups when the dependent variable is ordinal or continuous. It is applicable when the assumptions for a t-test are not met. In this test we rank all data points, combine the ranks, and calculate the test statistic.
 Kruskal-Wallis Test: The Kruskal-Wallis test is used to determine whether there are differences among three or more independent groups when the dependent variable is ordinal or continuous. It is the non-parametric alternative to one-way ANOVA.
A/B Testing or Split Testing
A/B testing, also known as split testing, is a method used to
compare two versions (A and B) of a webpage, app, or
marketing asset to determine which one performs better.
Example: a product manager changes a website's "Shop Now" button color from green to blue to improve the click-through rate (CTR). After formulating null and alternative hypotheses, users are divided into A and B groups, and CTRs are recorded. Statistical tests like the chi-square or t-test are applied at a 5% significance level. If the p-value is below 5%, the manager may conclude that changing the button color significantly affects CTR, informing decisions about permanent implementation.
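A rough sketch of such an A/B analysis with a chi-square test of independence; the click counts below are hypothetical and scipy is assumed:

from scipy import stats

# Hypothetical click counts for the two button colours (clicked vs. not clicked).
observed = [[120, 880],    # group A: green button
            [150, 850]]    # group B: blue button

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

alpha = 0.05
print(round(chi2, 2), round(p_value, 4))
print("Significant difference in CTR" if p_value < alpha else "No significant difference")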
Regression
Regression is a statistical technique used to model the
relationship between a dependent variable and one or more
independent variables.
The equation for regression:
y = α + βx
Where,
 y is the dependent variable,
 x is the independent variable,
 α is the intercept,
 β is the regression coefficient.
Regression coefficient is a measure of the strength and
direction of the relationship between a predictor variable
(independent variable) and the response variable (dependent
variable).
β = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²
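A small sketch implementing these two formulas directly with numpy (the x and y values are made up):

import numpy as np

# Hypothetical predictor (x) and response (y) values, for illustration only.
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])

# beta = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2); alpha = y_bar - beta * x_bar
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()

print(round(alpha, 3), round(beta, 3))   # intercept and slope
print(alpha + beta * 6)                  # predicted y for a new x value of 6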

Difference between Descriptive and Inferential statistics


Statistics is a key field that helps us make sense of data
through collection, analysis, and presentation. It plays an
important role in many areas, from business to healthcare, by
guiding decision-making and drawing conclusions. This process
is made easier with the help of two main branches of statistics:
descriptive and inferential.
Descriptive Statistics

Descriptive statistics refers to the process of summarizing and
analyzing data to describe its main features in a clear and
meaningful way. It is used to present raw data in a form that
makes it easier to understand and interpret. Descriptive
statistics involves both graphical representations (such as
charts and plots) and numerical measures to summarize data
effectively. Unlike inferential statistics, which makes predictions
about a population based on a sample, Descriptive statistics is
applied to data that is already known.
Inferential Statistics
Inferential statistics involves using data from a sample to make
predictions, generalizations, or conclusions about a larger
population. Unlike descriptive statistics, which simply
summarizes known data, inferential statistics makes inferences
or draws conclusions that go beyond the available data. It uses
probability theory to estimate population parameters and test
hypotheses. By working with a sample, inferential
statistics allows researchers to make informed decisions
without having to gather data from an entire population.

Difference between Descriptive and Inferential statistics

Descriptive Statistics | Inferential Statistics
It gives information about raw data, describing the data in some manner. | It makes inferences about the population using data drawn from the population.
It helps in organizing, analyzing, and presenting data in a meaningful manner. | It allows us to compare data and make hypotheses and predictions.
It is used to describe a situation. | It is used to explain the chance of occurrence of an event.
It explains already known data and is limited to a sample or population of small size. | It attempts to reach conclusions about the population.
Examples include: mean, median, mode, range, variance, histograms, pie charts. | Examples include: confidence intervals, hypothesis testing, regression models, p-values.
Limited to presenting and analyzing known data. | Allows predictions and conclusions that go beyond the data at hand.
Used for describing trends and organizing data for presentation. | Used for predicting trends, testing hypotheses, and generalizing from a sample to a population.
It can be achieved with the help of charts, graphs, tables, etc. | It is achieved with the help of probability.

Covariance and Correlation


Covariance and correlation are the two key concepts in
Statistics that help us analyze the relationship between
two variables. Covariance measures how two variables
change together, indicating whether they move in the
same or opposite directions.
In this article, we will learn about the differences and
similarities between covariance and correlation, explore their
applications, and also provide examples to illustrate their use.
What is Covariance?
Covariance is a statistical measure that indicates the
direction of the linear relationship between two
variables. It assesses how much two variables change
together from their mean values.
Types of Covariance:
 Positive Covariance: When one variable increases, the
other variable tends to increase as well, and vice versa.
 Negative Covariance: When one variable increases, the
other variable tends to decrease.
 Zero Covariance: There is no linear relationship between
the two variables; they move independently of each other.
Covariance is calculated by taking the average of the product
of the deviations of each variable from their respective means.
It is useful for understanding the direction of the relationship but not its strength, as its magnitude depends on the units of
the variables.
It is an essential tool for understanding how variables change
together and is widely used in various fields, including finance,
economics, and science.
Covariance:
1. It describes the relationship between a pair of random variables, where a change in one variable is associated with a change in the other variable.
2. It can take any value between – infinity to +infinity, where
the negative value represents the negative relationship
whereas a positive value represents the positive
relationship.
3. It is used for the linear relationship between variables.
4. It gives the direction of relationship between variables.
Covariance Formula
For Population:
Cov(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / N
For Sample:
Cov(x, y) = Σ(xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
Here, x̄ and ȳ = means of the given sample sets, n = total number of samples, xᵢ and yᵢ = individual samples of the sets.

What is Correlation?
Correlation is a standardized measure of the strength
and direction of the linear relationship between two
variables. It is derived from covariance and ranges between
-1 and 1. Unlike covariance, which only indicates the direction
of the relationship, correlation provides a standardized
measure.
 Positive Correlation (close to +1): As one variable
increases, the other variable also tends to increase.
 Negative Correlation (close to -1): As one variable
increases, the other variable tends to decrease.
 Zero Correlation: There is no linear relationship between
the variables.
The correlation coefficient ρ (rho) for variables X and Y is defined as ρ = Cov(X, Y) / (σ_X · σ_Y).
1. It shows whether and how strongly pairs of variables are related to each other.

2. Correlation takes values between −1 and +1, where values close to +1 represent a strong positive correlation and values close to −1 represent a strong negative correlation.
3. It is a standardized (unitless) measure, so it does not depend on the scale of the variables.
4. It gives both the direction and the strength of the relationship between variables.
Correlation Formula
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √( Σ(xᵢ − x̄)² · Σ(yᵢ − ȳ)² )
Here, x̄ and ȳ = means of the given sample sets, n = total number of samples, xᵢ and yᵢ = individual samples of the sets.
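As a sketch, numpy can compute both quantities directly; the values below reuse the temperature and ice-cream figures from the bivariate example later in this module:

import numpy as np

temperature = np.array([20, 25, 35])
sales = np.array([2000, 2500, 5000])

cov_matrix = np.cov(temperature, sales)        # sample covariance (divides by n - 1)
corr_matrix = np.corrcoef(temperature, sales)  # Pearson correlation

print(cov_matrix[0, 1])    # covariance between the two variables
print(corr_matrix[0, 1])   # correlation, always between -1 and +1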

Difference between Covariance and Correlation


This table shows the difference between Covariance and Correlation:

Covariance | Correlation
Covariance is a measure of how much two random variables vary together. | Correlation is a statistical measure that indicates how strongly two variables are related.
Involves the relationship between two variables or data sets. | Involves the relationship between multiple variables as well.
Lies between −infinity and +infinity. | Lies between −1 and +1.
Measure of correlation. | Scaled version of covariance.
Provides the direction of the relationship. | Provides the direction and strength of the relationship.
Dependent on the scale of the variables. | Independent of the scale of the variables.
Has dimensions. | Dimensionless.

Applications of Covariance and Correlation


Applications of Covariance
 Portfolio Management in Finance: Covariance is used
to measure how different stocks or financial assets move
together, aiding in portfolio diversification to minimize
risk.

 Genetics: In genetics, covariance can help understand
the relationship between different genetic traits and how
they vary together.
 Econometrics: Covariance is employed to study the
relationship between different economic indicators, such
as the relationship between GDP growth and inflation
rates.
 Signal Processing: Covariance is used to analyze and
filter signals in various forms, including audio and image
signals.
 Environmental Science: Covariance is applied to study
relationships between environmental variables, such as
temperature and humidity changes over time.
Applications of Correlation
 Market Research: Correlation is used to identify
relationships between consumer behavior and sales
trends, helping businesses make informed marketing
decisions.
 Medical Research: Correlation helps in understanding
the relationship between different health indicators, such
as the correlation between blood pressure and cholesterol
levels.
 Weather Forecasting: Correlation is used to analyze the
relationship between various meteorological variables,
such as temperature and humidity, to improve weather
predictions.
 Machine Learning: Correlation analysis is used in feature
selection to identify which variables have strong
relationships with the target variable, improving model
accuracy.
 Univariate and multivariate normal distributions are very robust and useful in
most statistical procedures. Understanding their form and function will help
you learn a lot about most statistical routines.

 Univariate Distributions
 A univariate distribution is defined as a distribution that involves just one
random variable. For instance, suppose we wish to model the distribution of
returns on an asset, such as a holding of stocks; such a model would be a
univariate distribution.

 In previous learning outcome statements, we have been focusing on univariate distributions such as the binomial, uniform, and normal distributions. Let us now look at multivariate distributions.

Multivariate Distributions
A multivariate distribution describes the probabilities for a group of continuous
random variables, particularly if the individual variables follow a normal distribution.
Each variable has its own mean and variance. In this regard, the strength of the
relationship between the variables (correlation) is very important. As you will recall, a
linear combination of 2 normal random variables results in another normal random
variable.

Univariate, Bivariate and Multivariate data and its
analysis
Univariate data:
Univariate data refers to a type of data in which each
observation or data point corresponds to a single variable. In
other words, it involves the measurement or observation of a
single characteristic or attribute for each individual or item in
the dataset. Analyzing univariate data is the simplest form of
analysis in statistics.

Heights (in cm): 164, 167.3, 170, 174.2, 178, 180, 186

Suppose that the heights of seven students in a class are recorded (above). There is only one variable, height, and it does not deal with any cause or relationship.
Key points in Univariate analysis:
1. No Relationships: Univariate analysis focuses solely on
describing and summarizing the distribution of the single
variable. It does not explore relationships between
variables or attempt to identify causes.

2. Descriptive Statistics: Descriptive statistics, such
as measures of central tendency (mean, median, mode)
and measures of dispersion (range, standard deviation),
are commonly used in the analysis of univariate data.
3. Visualization: Histograms, box plots, and other graphical
representations are often used to visually represent the
distribution of the single variable.
Bivariate data
Bivariate data involves two different variables, and the analysis
of this type of data focuses on understanding the relationship
or association between these two variables. Example of
bivariate data can be temperature and ice cream sales in
summer season.

Temperature | Ice Cream Sales
20 | 2000
25 | 2500
35 | 5000

Suppose temperature and ice cream sales are the two variables of a bivariate dataset (above table). Here, the relationship is visible from the table: temperature and sales are directly proportional to each other and thus related, because as the temperature increases, the sales also increase.
Key points in Bivariate analysis:
1. Relationship Analysis: The primary goal of analyzing
bivariate data is to understand the relationship between
the two variables. This relationship could be positive (both
variables increase together), negative (one variable increases while the other decreases), or show no clear
pattern.
2. Scatterplots: A common visualization tool for bivariate
data is a scatterplot, where each data point represents a
pair of values for the two variables. Scatterplots help
visualize patterns and trends in the data.
3. Correlation Coefficient: A quantitative measure called
the correlation coefficient is often used to quantify the
strength and direction of the linear relationship between
two variables. The correlation coefficient ranges from -1 to
1.
Multivariate data
Multivariate data refers to datasets where each observation or
sample point consists of multiple variables or features. These
variables can represent different aspects, characteristics, or
measurements related to the observed phenomenon. When
dealing with three or more variables, the data is specifically
categorized as multivariate.
Example of this type of data is suppose an advertiser wants to
compare the popularity of four advertisements on a website.

Advertisement | Gender | Click rate
Ad1 | Male | 80
Ad3 | Female | 55
Ad2 | Female | 123
Ad1 | Male | 66
Ad3 | Male | 35

The click rates could be measured for both men and women
and relationships between variables can then be examined. It is
similar to bivariate but contains more than one dependent
variable.
Key points in Multivariate analysis:
1. Analysis Techniques:The ways to perform analysis on
this data depends on the goals to be achieved. Some of
the techniques are regression analysis, principal
component analysis, path analysis, factor analysis
and multivariate analysis of variance (MANOVA).
2. Goals of Analysis: The choice of analysis technique
depends on the specific goals of the study. For example,
researchers may be interested in predicting one variable
based on others, identifying underlying factors that
explain patterns, or comparing group means across
multiple variables.
3. Interpretation: Multivariate analysis allows for a more
nuanced interpretation of complex relationships within the
data. It helps uncover patterns that may not be apparent
when examining variables individually.
There are lots of different tools, techniques, and methods that can be used to conduct your analysis. You could use software libraries, visualization tools, and statistical testing methods. Here, however, we will compare univariate, bivariate, and multivariate analysis.

Difference between Univariate, Bivariate and
Multivariate data

Univariate | Bivariate | Multivariate
It only summarizes a single variable at a time. | It only summarizes two variables. | It summarizes more than two variables.
It does not deal with causes and relationships. | It deals with causes and relationships, and analysis is done. | It does not deal with causes and relationships, and analysis is done.
It does not contain any dependent variable. | It contains only one dependent variable. | It is similar to bivariate but contains more than two variables.
The main purpose is to describe. | The main purpose is to explain. | The main purpose is to study the relationships among the variables.
Example: height. | Example: temperature and ice cream sales in the summer season. | Example: an advertiser wants to compare the popularity of four advertisements on a website; click rates can be measured for both men and women and the relationships between variables examined.

Covariance Matrix
A Covariance Matrix is a type of matrix used to describe the
covariance values between two items in a random vector. It is
also known as the variance-covariance matrix because the
variance of each element is represented along the matrix’s
major diagonal and the covariance is represented among the
non-diagonal elements.
It’s particularly important in fields like data science, machine
learning, and finance, where understanding relationships
between multiple variables is crucial and comes in handy when
it comes to stochastic modeling and principal component
analysis.
In this article, we will discuss various things related to the
Covariance Matrix such as its definition, example, and formula.
What is Covariance Matrix?
The variance-covariance matrix is a square matrix with
diagonal elements that represent the variance and the non-
diagonal components that express covariance. The covariance between a pair of variables can take any real value: positive, negative, or zero. A positive covariance suggests that the two variables have a positive relationship, whereas a negative covariance indicates an inverse relationship. If two elements do not vary together, they have zero covariance.
Properties of Covariance Matrix
The Properties of Covariance Matrix are mentioned below:
 A covariance matrix is always square, implying that the
number of rows in a covariance matrix is always equal to
the number of columns in it.
 A covariance matrix is always symmetric, implying that
the transpose of a covariance matrix is always equal to
the original matrix.
 A covariance matrix is always positive semi-definite.
 The eigenvalues of a covariance matrix are always real
and non-negative.
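A minimal numpy sketch (made-up data for three variables) that builds a covariance matrix and checks its symmetry:

import numpy as np

# Three variables observed on five samples (illustrative values only).
data = np.array([
    [2.0, 8.0, 10.0],
    [3.0, 7.5, 12.0],
    [4.0, 6.0, 14.0],
    [5.0, 5.5, 15.0],
    [6.0, 4.0, 18.0],
])

# rowvar=False treats columns as variables; the result is a square, symmetric matrix
# with variances on the diagonal and covariances off the diagonal.
cov_matrix = np.cov(data, rowvar=False)
print(cov_matrix)
print(np.allclose(cov_matrix, cov_matrix.T))   # symmetry check: True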

Understanding Hypothesis Testing


Hypothesis testing compares two opposite statements about a population and uses sample data to decide which one is more likely to be correct. To test an assumption, we first take a sample from the population, analyze it, and use the results of the analysis to decide whether the claim is valid or not.
Suppose a company claims that its website gets an average
of 50 user visits per day. To verify this we use hypothesis
testing to analyze past website traffic data and determine if the
claim is accurate. This helps us decide whether the observed
data supports the company’s claim or if there is a significant
difference.
Defining Hypotheses
 Null hypothesis (H₀): The null hypothesis is the starting assumption in statistics. It says there is no effect or no difference. For example, if a company claims its website gets an average of 50 user visits per day, then:
Null Hypothesis: H₀: the mean number of daily visits (μ) = 50.
 Alternative hypothesis (H₁): The alternative hypothesis is the opposite of the null hypothesis; it suggests there is a difference. If the company's average daily visits are not equal to 50, the alternative hypothesis would be:
H₁: the mean number of daily visits (μ) ≠ 50.
Key Terms of Hypothesis Testing
To understand the Hypothesis testing firstly we need to
understand the key terms which are given below:
 Level of significance: It refers to the threshold at which we accept or reject the null hypothesis. Since 100% accuracy is not possible when accepting or rejecting a hypothesis, we select a level of significance. It is normally denoted by α and is generally 0.05 or 5%, which means the result should be 95% likely to hold in each similar sample.
 P-value: When analyzing data, the p-value tells you the likelihood of seeing a result at least as extreme as yours if the null hypothesis is true. If the p-value is less than the chosen significance level, you reject the null hypothesis; otherwise, you fail to reject it.
 Test Statistic: The test statistic is the number that helps you decide whether your result is significant. It is calculated from the sample data you collect; for example, it could be used to test whether a machine learning model performs better than a random guess.
 Critical value: The critical value is a boundary or threshold that helps you decide whether your test statistic is extreme enough to reject the null hypothesis.
 Degrees of freedom: Degrees of freedom are important when we conduct statistical tests; they describe how many values in the calculation are free to vary.
Types of Hypothesis Testing

It involves basically two types of testing:

1. One-Tailed Test
A one-tailed test is used when we expect a change in only
one direction—either an increase or a decrease but not
both. Let’s say if we’re analyzing data to see if a new algorithm
improves accuracy we would only focus on whether the
accuracy goes up not down.
The test looks at just one side of the data to decide if the result
is enough to reject the null hypothesis. If the data falls in the
critical region on that side then we reject the null hypothesis.
There are two types of one-tailed test:
 Left-Tailed (Left-Sided) Test: If the alternative hypothesis says that the true parameter value is less than in the null hypothesis, it is a left-tailed test. Example: H₀: μ ≥ 50 and H₁: μ < 50.
 Right-Tailed (Right-Sided) Test: When the alternative hypothesis says that the true parameter value is greater than in the null hypothesis, it is called a right-tailed test. Example: H₀: μ ≤ 50 and H₁: μ > 50.
2. Two-Tailed Test

A two-tailed test is used when we want to check for a
significant difference in both directions—whether the
result is greater than or less than a specific value. We
use this test when we don’t have a specific expectation about
the direction of change.
If we are testing whether a new marketing strategy affects
sales we want to know if sales increase or decrease so we look
at both possibilities.
Example: H₀: μ = 50 and H₁: μ ≠ 50.
What are Type 1 and Type 2 errors in Hypothesis
Testing?
In hypothesis testing Type I and Type II errors are two possible
errors that can happen when we are finding conclusions about a
population based on a sample of data. These errors are
associated with the decisions we made regarding the null
hypothesis and the alternative hypothesis.
 Type I error: When we reject the null hypothesis although it is true. The Type I error rate is denoted by alpha (α).
 Type II error: When we fail to reject the null hypothesis although it is false. The Type II error rate is denoted by beta (β).

Null Hypothesis Null Hypothesis


is True is False

Null Hypothesis is Type II Error (False


Correct Decision
True (Accept) Negative)

Alternative
Type I Error (False
Hypothesis is Correct Decision
Positive)
True (Reject)

How does Hypothesis Testing work?

45
Working of Hypothesis testing involves various steps:

Step 1: Define Null and Alternative Hypothesis


We start by defining the null hypothesis (H₀) which
represents the assumption that there is no difference.
The alternative hypothesis (H₁) suggests there is a
difference. These hypotheses should be contradictory to one
another. Imagine we want to test if a new recommendation
algorithm increases user engagement.
 Null Hypothesis (H₀): The new algorithm has no effect
on user engagement.
 Alternative Hypothesis (H₁): The new algorithm
increases user engagement.
Step 2 – Choose significance level
 Next we choose a significance level (α) commonly set
at 0.05. This level defines the threshold for deciding if the
results are statistically significant. It also tells us the
probability of making a Type I error—rejecting a true null
hypothesis.
 In this step we also calculate the p-value which is used to
assess the evidence against the null hypothesis.

46
Step 3 – Collect and Analyze data.
 Now we gather data; this could come from user observations or an experiment. Once collected, we analyze the data using appropriate statistical methods to calculate the test statistic.
 Example: We collect data on user engagement before and
after implementing the algorithm. We can also find the
mean engagement scores for each group.
Step 4-Calculate Test Statistic
The test statistic is a measure used to determine whether the sample data support rejecting the null hypothesis. The choice of test statistic depends on the type of hypothesis test being conducted; it could be a Z-test, chi-square test, t-test, and so on. For our example we use a t-test because:
 We have a smaller sample size.
 The population standard deviation is unknown.
T-statistic is a measure of the difference between the means of
two groups relative to the variability within each group. It is
calculated as the difference between the sample means
divided by the standard error of the difference. It is also known
as the t-value or t-score.
Step 5 – Comparing Test Statistic
Now we compare the test statistic to either the critical
value or the p-value to decide whether to reject the null
hypothesis or not.
Method A: Using Critical values: We refer to a statistical
distribution table like the t-distribution in this case to find
the critical value based on the chosen significance level (α).
 If Test Statistic > Critical Value, we reject the null hypothesis.
 If Test Statistic ≤ Critical Value, we fail to reject the null hypothesis.
Example: If the p-value is 0.03 and α is 0.05 then we reject
the null hypothesis because the p-value is smaller than the
significance level.
Note: Critical values are predetermined threshold values used to make a decision in hypothesis testing. To determine critical values, we typically refer to a statistical distribution table, such as the normal distribution or t-distribution tables, depending on the test being used.
Method B: Using P-values: We can also come to a conclusion using the p-value.
 If the p-value is less than or equal to the significance level (p ≤ α), you reject the null hypothesis.
 If the p-value is greater than the significance level (p > α), you fail to reject the null hypothesis.
Note: To determine the p-value for hypothesis testing, we typically refer to a statistical distribution table, such as the normal distribution or t-distribution tables, depending on the test being used.
Step 6 – Interpret the Results
Based on the comparison of the test statistic to the critical
value or p-value we can conclude whether there is enough
evidence to reject the null hypothesis or not.
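A compact sketch of the whole workflow for the website-visits example (scipy assumed; the daily visit counts are made up):

from scipy import stats

# Hypothetical daily visit counts; H0: mean visits = 50, H1: mean visits != 50.
visits = [48, 52, 55, 49, 53, 47, 56, 51, 54, 50]

t_stat, p_value = stats.ttest_1samp(visits, popmean=50)

alpha = 0.05
print(round(t_stat, 3), round(p_value, 4))
print("Reject H0" if p_value <= alpha else "Fail to reject H0")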

Limitations of Hypothesis Testing
Although hypothesis testing is a useful technique, it has some limitations as well:
 Limited Scope: Hypothesis testing focuses on specific questions or assumptions and may not capture the full complexity of the problem being studied.
 Data Quality Dependence: The accuracy of the results depends on the quality of the data. Poor-quality or inaccurate data can lead to incorrect conclusions.
 Missed Patterns: By focusing only on testing specific hypotheses, important patterns or relationships in the data might be missed.
 Context Limitations: It does not always consider the bigger picture, which can oversimplify results and lead to incomplete insights.
 Need for Additional Methods: To get a better understanding of the data, hypothesis testing should be combined with other analytical methods, such as data visualization or machine learning techniques.

Confidence Interval
A Confidence Interval (CI) is a range of values that estimates where the true population value is likely to fall. Instead of just saying "the average height of students is 165 cm", a confidence interval allows us to say "we are 95% confident that the true average height is between 160 cm and 170 cm".
Before diving into confidence intervals you should be familiar
with:
 t-test
 z-test

Interpreting Confidence Intervals


Let’s say we take a sample of 50 students and calculate a 95%
confidence interval for their average height which turns out
to be 160–170 cm. This means If we repeatedly take similar
samples 95% of those intervals would contain the true
average height of all students in the population

Confidence level tells us how sure we are that the true value
is within a calculated range. If we have to repeat the sampling
process many times we expect that a certain percentage of
those intervals will include the true value.

 90% Confidence: 90% of intervals would include the true
population value.
 95% Confidence: 95% of intervals would include the true
value which is commonly used in data science.
 99% Confidence: 99% of intervals would include the true
value but the intervals would be wider.
Why are Confidence Intervals Important in Data
Science?
 They help measure uncertainty in predictions and estimates.
 They let data scientists report a reliable range instead of just giving a single number.
 They are widely used in A/B testing, machine learning, and survey analysis to check whether results are meaningful.
Steps for Constructing a Confidence Interval
To calculate a confidence interval follow these simple 4 steps:
Step 1: Identify the sample problem.
Define the population parameter you want to estimate e.g.,
mean height of students. Choose the right statistic such as
the sample mean.
Step 2: Select a confidence level.
In this step we select the confidence level some common
choices are 90%, 95% or 99%. It represents how sure we are
about our estimate.
Step 3: Find the margin of error.
To find the Margin of Error, you use the formula:
Margin of Error = Critical Value × Standard Error
The Critical Value is found using Z-tables, or T-tables for small samples. First you choose the significance level (α), which is typically 0.05 for a 95% confidence level. Then decide whether you are performing a one-tailed or two-tailed test, with two-tailed being the more common choice. After this you look up the corresponding value in the Z-table or T-table based on your significance level and test type.
The Standard Error measures the variability of the sample and is calculated by dividing the sample's standard deviation by the square root of the sample size. Combining the Critical Value and the Standard Error gives you the Margin of Error, which tells you the range within which you expect the true value to fall.
Step 4: Specify the confidence interval.
To find a Confidence Interval, we use this formula:
Confidence Interval = Point Estimate ± Margin of Error
Now the Point Estimate is usually the average or mean from
your sample. It’s the best guess of the true value based on the
sample data. The Margin of Error tells you how much the
sample data might vary from the true value that we have
calculated in previous step.
So when you add or subtract the margin of error from your
point estimate you get a range. This range tells you where the
true value is likely to fall.
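Putting the four steps together, a sketch of a 95% t-based confidence interval for a made-up sample of heights (numpy and scipy assumed):

import numpy as np
from scipy import stats

# Hypothetical student heights (cm); we want a 95% CI for the true mean height.
heights = np.array([160, 162, 165, 167, 168, 170, 171, 173, 174, 176])

n = len(heights)
point_estimate = heights.mean()                     # Step 1: sample mean
confidence = 0.95                                   # Step 2: confidence level
standard_error = heights.std(ddof=1) / np.sqrt(n)   # sample SD / sqrt(n)
critical_value = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)  # two-tailed t critical value
margin_of_error = critical_value * standard_error   # Step 3

lower = point_estimate - margin_of_error            # Step 4: point estimate +/- margin
upper = point_estimate + margin_of_error
print(round(lower, 1), round(upper, 1))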
Types of Confidence Intervals
Common types of confidence intervals include intervals for a population mean (using the z-distribution when σ is known, or the t-distribution when it is not) and intervals for a population proportion.
