Normal Distribution

Normal Distribution for machine learning


Normal Distribution is an important concept in statistics and the
backbone of Machine Learning. A Data Scientist needs to understand the
Normal Distribution when working with Linear Models (which perform well
when the data is normally distributed), the Central Limit Theorem, and
exploratory data analysis.

As discovered by Carl Friedrich Gauss, the Normal
Distribution (or Gaussian Distribution) is a continuous probability
distribution. It has a bell-shaped curve that is symmetrical about the
mean, with each half of the curve mirroring the other.


Mathematical Definition:

A continuous random variable x is said to follow a normal
distribution with parameters μ (mean) and σ (standard deviation) if its
probability density function is given by

f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)),  −∞ < x < ∞

Such an x is also called a normal variate.

Standard Normal Variate:

If x is a normal variable with mean μ and standard deviation σ, then

z = (x − μ) / σ

where z is the standard normal variate.
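
As a quick illustration, here is a minimal Python sketch (with a
made-up sample, not data from the article) that converts a variable
into its standard normal variate:

```python
import numpy as np

# Hypothetical sample; any roughly normal variable x works here.
x = np.array([4.0, 5.5, 6.2, 7.1, 8.3])

mu = x.mean()         # estimate of the mean
sigma = x.std()       # estimate of the standard deviation

z = (x - mu) / sigma  # standard normal variate: mean ~0, sd ~1
print(z.mean(), z.std())
```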

Standard Normal Distribution:

The simplest case of the normal distribution, known as the Standard
Normal Distribution, has a mean μ of 0 and a standard deviation σ of 1,
and is described by the probability density function

f(z) = (1 / √(2π)) e^(−z² / 2)

Distribution Curve Characteristics:

1. The total area under the normal curve is equal to 1.

2. It is a continuous distribution.

3. It is symmetrical about the mean. Each half of the distribution is a
mirror image of the other half.

4. It is asymptotic to the horizontal axis.

5. It is unimodal.

Area Properties:

The normal distribution is completely specified by two parameters:
the mean and the standard deviation. Once these two are known, the
density at every point on the curve is determined.

The empirical rule is a handy quick estimate of the data's spread given
the mean and standard deviation of a data set that follows a normal
distribution. It states that:

• 68.26% of the data will fall within 1 sd of the mean (μ ± 1σ)

• 95.44% of the data will fall within 2 sd of the mean (μ ± 2σ)

• 99.73% of the data will fall within 3 sd of the mean (μ ± 3σ)

• 95% of the data will fall within μ ± 1.96σ

• 99% of the data will fall within μ ± 2.58σ

Thus, almost all the data lies within 3 standard deviations of the
mean. This rule enables us to check for outliers and is very helpful
when assessing the normality of a distribution.
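
These coverage figures can be verified directly from the standard
normal CDF; a minimal sketch using scipy:

```python
from scipy.stats import norm

# For any normal X, P(mu - k*sigma < X < mu + k*sigma) = Phi(k) - Phi(-k).
for k in (1, 2, 3, 1.96, 2.58):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd of the mean: {coverage:.2%}")
# within 1 sd: ~68.27%, 2 sd: ~95.45%, 3 sd: ~99.73%
# within 1.96 sd: ~95%, 2.58 sd: ~99%
```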

Application in Machine Learning:

In Machine Learning, data that satisfies a Normal Distribution is
beneficial for model building: it makes the math easier. Models like
LDA and Gaussian Naive Bayes are derived explicitly from the assumption
that the features are bivariate or multivariate normal, while Linear
Regression assumes Gaussian errors and Logistic Regression tends to
behave better on roughly normal inputs. Also, sigmoid functions work
most naturally with normally distributed data.

Many natural phenomena in the world follow a log-normal
distribution, such as financial data and forecasting data. By applying
transformation techniques (for example, a log transform, as sketched
below), we can convert such data into a normal distribution. Also,
many processes follow normality directly, such as measurement errors
in an experiment or the position of a particle undergoing diffusion.
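
A minimal sketch of the log transform, using simulated (not real
financial) data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
prices = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)  # simulated log-normal data

log_prices = np.log(prices)  # the log of log-normal data is normal

# Skewness near 0 after the transform suggests the data is now symmetric.
print(stats.skew(prices), stats.skew(log_prices))
```

Box-Cox (scipy.stats.boxcox) is a more general alternative when a
plain log transform is not enough.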

So it’s better to critically explore the data and check the underlying
distribution of each variable before fitting the model.

Note: Normality is an assumption for some ML models. It is not
mandatory that data always follow normality; ML models can work
very well on non-normally distributed data too. Models like decision
trees and XGBoost don’t assume any normality and work on raw data as
well. Also, Linear Regression is statistically sound as long as the
model errors are Gaussian; the entire dataset need not be.

Here I have analyzed the Boston Housing Price Dataset. I
explain the visualization techniques and the conversion
techniques, along with plots that can validate the normality
of a distribution.

Visualization Techniques:

13 numerical features and 1 categorical feature (chas) are present.
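
One way to load the data. Note that sklearn's original load_boston
helper has since been removed, so fetching the dataset from OpenML is
an assumption here (it also requires network access):

```python
from sklearn.datasets import fetch_openml

# Fetch the Boston Housing data from OpenML as a pandas DataFrame.
boston = fetch_openml(name="boston", version=1, as_frame=True)
df = boston.frame

print(df.shape)  # (506, 14): 13 numerical columns plus the categorical chas
```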

Histograms: A histogram is a kind of bar graph that estimates the
probability distribution of a continuous variable. It takes numerical
data and divides it into uniform bins, which are consecutive, non-
overlapping intervals of the variable.
[Figure: histograms of all numerical features]
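
A minimal sketch that produces such a grid of histograms, assuming the
DataFrame df loaded above:

```python
import matplotlib.pyplot as plt

# One histogram per numerical column; bins are the uniform intervals
# described above.
df.hist(bins=30, figsize=(14, 10))
plt.tight_layout()
plt.show()
```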

kdeplot: A Kernel Density Estimate plot depicts the probability
density function of continuous or non-parametric data variables;
we can plot a single variable or several variables together.
[Figure: kdeplots of all numerical features]
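
A sketch of the same idea with seaborn, again assuming df from above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# One KDE plot per numerical column; each curve is an estimate of that
# feature's probability density function.
numeric_cols = df.select_dtypes("number").columns
fig, axes = plt.subplots(4, 4, figsize=(14, 10))
for ax, col in zip(axes.ravel(), numeric_cols):
    sns.kdeplot(df[col], ax=ax)
plt.tight_layout()
plt.show()
```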

Feature Analysis:

Let’s take the feature rm (average number of rooms per dwelling)
as an example, since it closely resembles a normal distribution.
Though it has some distortion in the right tail, we need to check how
closely it resembles a normal distribution. For that, we need to check
the Q-Q plot.

When the quantiles of two variables are plotted against each other,
the resulting plot is known as a quantile-quantile plot, or qqplot.
This plot summarizes whether the distributions of two variables are
similar or not with respect to their locations.

Note: the “rm” feature is standardized before plotting the qqplot, as
in the sketch below.
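
A minimal sketch of that standardization plus the Q-Q plot (assuming
the column is named RM in df, as in the OpenML copy of the data):

```python
from scipy import stats
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Standardize the feature, then plot its quantiles against the
# standard normal's quantiles.
rm_scaled = StandardScaler().fit_transform(df[["RM"]]).ravel()

stats.probplot(rm_scaled, dist="norm", plot=plt)
plt.title("Q-Q plot of standardized RM")
plt.show()
```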

Here we can clearly see that the feature is not exactly normally
distributed, but it somewhat resembles normality. We can conclude that
standardizing this feature (e.g., with StandardScaler) before feeding
it to a model can produce good results.

Central Limit Theorem and Normal Distribution:

The CLT states that when we sum a large number of independent random
variables, irrespective of those variables' original distributions,
their normalized sum tends towards a Gaussian distribution.

Machine Learning models generally treat training data as a mix
of deterministic and random parts. Let the dependent
variable (Y) consist of these parts. Models try to express
the dependent variable (Y) as some function of several independent
variables (X). If that function is a sum (or can be expressed as a sum
of other functions) and the number of X variables is really high, then
Y should have a normal distribution.
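
A small simulation makes this concrete: sums of many independent,
decidedly non-normal (uniform) variables already look Gaussian.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Each row sums n = 100 independent uniform variables; by the CLT the
# normalized sums should be approximately standard normal.
n = 100
sums = rng.uniform(0, 1, size=(5_000, n)).sum(axis=1)
normalized = (sums - sums.mean()) / sums.std()

# A high p-value here is consistent with normality.
print(stats.normaltest(normalized))
```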

Here, ML models try to express the deterministic part as a sum over
deterministic independent variables (X):

deterministic + random = func(deterministic(1)) + … + func(deterministic(n)) + model_error

If the whole deterministic part of Y is explained by X, then
the model_error captures only the random part and should have a
normal distribution.

So if the error distribution is normal, we may conclude that the
model is successful. Otherwise, either some features with a large
enough influence on Y are absent from the model, or the model itself
is incorrect.
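
A hedged sketch of that residual check, reusing the df loaded above
(assumes the target column is named MEDV, as in the OpenML copy):

```python
from scipy import stats
from sklearn.linear_model import LinearRegression

# Fit a linear model on the numerical features and inspect whether the
# residuals (the "random part") look Gaussian.
X = df.select_dtypes("number").drop(columns=["MEDV"])
y = df["MEDV"]

residuals = y - LinearRegression().fit(X, y).predict(X)

# Shapiro-Wilk: a low p-value suggests non-normal errors, hinting at
# missing features or a mis-specified model.
print(stats.shapiro(residuals))
```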
