BigDataAnalytics _ Unit2

The document provides an overview of inferential and descriptive statistics, detailing measures of central tendency (mean, median, mode) and measures of dispersion (range, variance, standard deviation). It also covers concepts of skewness, kurtosis, normal distribution, and binomial distribution, including their characteristics and applications. Key features of normal distribution and conditions for binomial distribution are also highlighted.

Uploaded by

21ucs048
Copyright
© All Rights Reserved

Unit - 2

Inferential and Descriptive Statistics


Statistics is broadly classified into two categories:

1. Descriptive Statistics: Summarizes and organizes data in a meaningful way.


2. Inferential Statistics: Makes inferences or predictions about a population based on sample data.

Measures of Central Tendency

● Measures of central tendency summarize a dataset by identifying a single central value that represents
the entire dataset.

a) Mean

The average value of a dataset.

Mean (x̄) = (x1 + x2 + ... + xn) / n

Where xi are the data points and n is the number of data points.

b) Median

The middle value in an ordered dataset.

● For odd n: Median = middle value.


● For even n: Median = average of two middle values.

c) Mode

The most frequently occurring value(s) in the dataset.

● Suitable for categorical data.
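
The three measures above can be computed with Python's standard `statistics` module. This is a minimal sketch; the lists `scores` and `colors` are made-up data used only for illustration.

```python
from statistics import mean, median, multimode

# Hypothetical sample data, chosen only for illustration.
scores = [2, 3, 3, 5, 7, 10]

mean_value = mean(scores)        # sum of the values divided by their count
median_value = median(scores)    # n is even, so the two middle values are averaged
mode_values = multimode(scores)  # list of the most frequent value(s)

print("mean:", mean_value)
print("median:", median_value)
print("mode(s):", mode_values)

# The mode also works on categorical data, where mean and median do not apply.
colors = ["red", "blue", "red", "green"]
color_mode = multimode(colors)
print("most common color(s):", color_mode)
```

`multimode` is used instead of `mode` because a dataset may have more than one most frequent value.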


Measures of Dispersion
● Measures of dispersion describe how widely the data are scattered.
● They are numbers that quantify different aspects of the spread of the data, such as the gap between the extreme values or the typical distance from the center.

Types of Measures of Dispersion


Measures of dispersion can be classified into the following two types :
● Absolute Measure of Dispersion - Measures of dispersion that are expressed in the same units as the data themselves, for example meters, dollars or kilograms.
● Relative Measure of Dispersion - Unit-free measures of dispersion, used to compare the scattering of two data sets that are measured in different units.

These measures of dispersion can be further divided into the categories described below.

Absolute Measure of Dispersion

Range - The range is the difference between the largest and the smallest values in the distribution.
Thus, it can be written as
R=L–S

where,
L is the largest value in the Distribution
S is the smallest value in the Distribution
● A higher value of range implies higher variation in the data set.
● One drawback of this measure is that it only takes into account the maximum and the minimum value.
These two values are not always a good indicator of how the rest of the distribution is scattered.

Variance (σ²)

The average of the squared differences from the mean.

σ² = [(x1 – μ)² + (x2 – μ)² + ... + (xn – μ)²] / n

Where μ is the mean of the data and n is the number of data points.

Standard Deviation (σ)

The square root of the variance.

σ = √(σ²)

Interquartile Range (IQR)

The range between the first quartile (Q1) and the third quartile (Q3).
IQR = Q3 – Q1
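
The absolute measures above can be sketched in Python as follows. The list `data` is made up for illustration. The example assumes the population versions of variance and standard deviation (dividing by the number of data points) and the "inclusive" quartile method of the `statistics` module; other quartile conventions give slightly different values.

```python
from statistics import pstdev, pvariance, quantiles

# Hypothetical sample data, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)  # R = L - S
variance = pvariance(data)          # average squared difference from the mean
std_dev = pstdev(data)              # square root of the variance

# Quartiles; the "inclusive" method is an assumption of this sketch.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                       # spread of the middle half of the data

print("range:", data_range)
print("variance:", variance)
print("standard deviation:", std_dev)
print("IQR:", iqr)
```

All four results are in the same units as the data, except the variance, which is in squared units.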
Relative Measure of Dispersion

● Coefficient of Range: It is defined as the ratio of the difference between the highest and lowest value
in a data set to the sum of the highest and lowest value.
● Coefficient of Variation: It is defined as the ratio of the standard deviation to the mean of the data
set. It is usually expressed as a percentage.
● Coefficient of Mean Deviation: It is defined as the ratio of the mean deviation to the value of the
central point of the data set.
● Coefficient of Quartile Deviation: It is defined as the ratio of the difference between the third
quartile and the first quartile to the sum of the third and first quartiles.
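
The four coefficients above can be sketched in Python as follows. The list `data` is made up. The example assumes the mean as the "central point" for the mean deviation (the median is another common choice), the population standard deviation, and the "inclusive" quartile method.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical sample data, chosen only for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

largest, smallest = max(data), min(data)
center = mean(data)

# (L - S) / (L + S)
coeff_range = (largest - smallest) / (largest + smallest)

# standard deviation / mean, expressed as a percentage
coeff_variation = pstdev(data) / center * 100

# mean absolute deviation about the mean, divided by the mean
mean_deviation = sum(abs(x - center) for x in data) / len(data)
coeff_mean_deviation = mean_deviation / center

# (Q3 - Q1) / (Q3 + Q1)
q1, _, q3 = quantiles(data, n=4, method="inclusive")
coeff_quartile_deviation = (q3 - q1) / (q3 + q1)

print(f"coefficient of range: {coeff_range:.3f}")
print(f"coefficient of variation: {coeff_variation:.1f}%")
print(f"coefficient of mean deviation: {coeff_mean_deviation:.3f}")
print(f"coefficient of quartile deviation: {coeff_quartile_deviation:.3f}")
```

Because each coefficient is a ratio of two quantities in the same units, the units cancel, which is what allows data sets measured in different units to be compared.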
Quantile and Rank

● Quantiles: Points that divide the data into equal parts.


○ Quartiles: Divide data into four equal parts (Q1,Q2,Q3).
○ Percentiles: Divide data into 100 equal parts.
● Rank: The relative position of a value within a dataset.
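
A short Python sketch of quartiles, percentiles and rank follows. The list `data` and the helper `rank_of` are hypothetical. The example assumes the "inclusive" interpolation method and a 1-based ascending rank; other conventions exist.

```python
from statistics import quantiles

# Hypothetical sample data, chosen only for illustration.
data = [15, 20, 35, 40, 50]

# Three cut points split the data into four equal parts.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# Ninety-nine cut points split the data into 100 equal parts.
percentiles = quantiles(data, n=100, method="inclusive")
p90 = percentiles[89]  # index 89 holds the 90th percentile


def rank_of(value, values):
    """Return the 1-based position of value in ascending order.

    Tied values all receive the smallest applicable rank.
    """
    return sorted(values).index(value) + 1


print("quartiles:", q1, q2, q3)
print("90th percentile:", p90)
print("rank of 35:", rank_of(35, data))
```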

Skewness and Kurtosis

a) Skewness

Measures the asymmetry of the probability distribution.

● Positive Skew: Tail on the right side.


● Negative Skew: Tail on the left side.
Symmetric Distribution:
● A perfectly symmetric distribution is one in which the frequency distribution is the same on both sides of the
center point of the frequency curve.
● In this case, Mean = Median = Mode.
● There is no skewness in a perfectly symmetrical distribution.
Asymmetric (Skewed) Distribution:
● An asymmetrical or skewed distribution is one in which the spread of the frequencies differs on the two
sides of the center point, so the frequency curve is stretched more towards one side. The mean, median and
mode fall at different points.
● The two types of skewness are:

➔ Positive Skewness: The frequencies are concentrated at the lower values of the variable and the curve
stretches towards the higher values, i.e. the right tail is longer than the left tail.
➔ Negative Skewness: The frequencies are concentrated at the higher values of the variable and the curve
stretches towards the lower values, i.e. the left tail is longer than the right tail.
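
The sign of the skewness can be checked numerically. The sketch below assumes one common definition, the moment coefficient of skewness (third central moment divided by the 1.5th power of the second central moment); the notes do not say which formula is intended. The function name and the three data sets are made up for illustration.

```python
from statistics import mean


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    center = mean(values)
    n = len(values)
    m2 = sum((x - center) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - center) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


# Hypothetical data sets, chosen only for illustration.
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]  # one large value stretches the right tail
left_tailed = [-x for x in right_tailed]  # mirror image, so the left tail is longer
symmetric = [1, 2, 3, 4, 5]

print("right-tailed:", skewness(right_tailed))  # positive
print("left-tailed:", skewness(left_tailed))    # negative
print("symmetric:", skewness(symmetric))        # zero
```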
b) Kurtosis
● It is also a characteristic of the frequency distribution. It gives an idea about the shape of a
frequency distribution.
● The measure of kurtosis is the extent to which a frequency distribution is peaked in
comparison with a normal curve. It is the degree of peakedness of a distribution.
Types of Kurtosis
Kurtosis is classified into three types:
1. Leptokurtic: A leptokurtic curve has a higher peak than the normal curve. In this curve, the items are
heavily concentrated near the central value.
2. Mesokurtic: A mesokurtic curve has the same peak as the normal curve. In this curve, the items are
distributed around the central value as in a normal distribution.
3. Platykurtic: A platykurtic curve has a lower peak than the normal curve. In this curve, there is less
concentration of items around the central value.
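
The three types can be told apart numerically. The sketch below assumes one common measure, excess kurtosis (fourth central moment divided by the squared variance, minus 3), which is 0 for a normal curve, positive for a leptokurtic curve and negative for a platykurtic one. The function name and data sets are made up for illustration.

```python
from statistics import mean


def excess_kurtosis(values):
    """Excess kurtosis: m4 / m2**2 - 3 (population form); 0 for a normal curve."""
    center = mean(values)
    n = len(values)
    m2 = sum((x - center) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - center) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


# Hypothetical data sets, chosen only for illustration.
flat = [1, 2, 3, 4, 5]                    # evenly spread, no pronounced peak
peaked = [0, 0, 0, 0, 0, 0, 0, 0, -5, 5]  # most items sit at the central value

print("flat:", excess_kurtosis(flat))      # negative, platykurtic
print("peaked:", excess_kurtosis(peaked))  # positive, leptokurtic
```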
Normal Distribution
● A symmetric, bell-shaped distribution.
● Defined by the mean (μ) and standard deviation (σ).
● The area under the curve of the normal distribution
represents probabilities for the data.
● The area under the whole curve is equal to 1, or 100%

[Figure: graph of a normal distribution with probabilities between standard deviations (σ)]

● Roughly 68.3% of the data is within 1 standard deviation of the average (from μ-1σ to μ+1σ)
● Roughly 95.5% of the data is within 2 standard deviations of the average (from μ-2σ to μ+2σ)
● Roughly 99.7% of the data is within 3 standard deviations of the average (from μ-3σ to μ+3σ)
Key Features of Normal Distribution
● Symmetry: The normal distribution is symmetric around its mean. This means the left side of the
distribution mirrors the right side.
● Mean, Median, and Mode: In a normal distribution, the mean, median, and mode are all equal and
located at the center of the distribution.
● Bell-shaped Curve: The curve is bell-shaped, indicating that most of the observations cluster
around the central peak, and the probabilities for values further away from the mean taper off equally
in both directions.
● Standard Deviation: The spread of the distribution is determined by the standard deviation. About
68% of the data falls within one standard deviation of the mean, 95% within two standard deviations,
and 99.7% within three standard deviations.
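
The shares quoted above can be reproduced from the cumulative distribution function of the normal distribution. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Probability that a value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.2%}")
```

The result is the same for any mean and standard deviation, because the distance from the mean is measured in units of σ.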

Normal Distribution Examples


The normal distribution is a good model for various types of data, including:
● Distribution of the heights of people.
● Distribution of errors in a measurement.
● Distribution of blood pressure among patients, etc.
Binomial Distribution
● Binomial Distribution is a probability distribution used to model the number of successes in a fixed
number of independent trials, where each trial has only two possible outcomes: success or failure.
● This distribution is useful for calculating the probability of a specific number of successes in scenarios like
flipping coins, quality control, or survey predictions.
● Binomial Distribution is based on Bernoulli trials, where each trial has an independent and identical
chance of success. The probability distribution for a Bernoulli trial is called the Bernoulli Distribution.

Conditions for Binomial Distribution


The Binomial distribution can be used in scenarios where the following conditions are satisfied:
1. Fixed Number of Trials: There are a set number of trials or experiments (denoted by n), such as
flipping a coin 10 times.
2. Two Possible Outcomes: Each trial has only two possible outcomes, often labeled as “success” and
“failure.” For example, getting heads or tails in a coin flip.
3. Independent Trials: The outcome of each trial is independent of the others, meaning the result of
one trial does not affect the result of another.
4. Constant Probability: The probability of success (denoted by p) remains the same for each trial. For
example, if you’re flipping a fair coin, the probability of getting heads is always 0.5.
Binomial Distribution Calculation
The binomial distribution is used to compute the probability of exactly r successes in n trials using the
following formula:

P(X = r) = nCr × p^r × q^(n – r)

Where nCr = n! / (r! (n – r)!) is the number of ways of choosing r successes out of n trials.
To calculate the probability using the binomial distribution, follow these steps:
● Step 1: Find the number of trials and assign it as ‘n’
● Step 2: Find the probability of success in each trial and assign it as ‘p’
● Step 3: Find the probability of failure and assign it as q where q = 1-p
● Step 4: Find the value r of the random variable X for which the probability is to be
calculated
● Step 5: Calculate the probability for X = r using the Binomial
Distribution Formula.
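
The steps above can be sketched in Python with the standard binomial formula P(X = r) = nCr × p^r × q^(n – r). The function name `binomial_pmf` is hypothetical, and the example reuses the coin-flip scenario from the conditions (10 flips of a fair coin).

```python
from math import comb


def binomial_pmf(n, p, r):
    """Probability of exactly r successes in n independent trials."""
    q = 1 - p                                  # Step 3: probability of failure
    return comb(n, r) * p ** r * q ** (n - r)  # Step 5: apply the formula


# Steps 1, 2 and 4: 10 flips of a fair coin, exactly 5 heads.
prob_five_heads = binomial_pmf(n=10, p=0.5, r=5)
print(f"P(X = 5) = {prob_five_heads:.4f}")

# Sanity check: the probabilities of all possible outcomes add up to 1.
total = sum(binomial_pmf(10, 0.5, r) for r in range(11))
print(f"sum over r = 0..10: {total:.4f}")
```

`math.comb(n, r)` computes nCr directly, which avoids evaluating large factorials by hand.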
