Normal Distribution for Machine Learning

Normal Distribution is an important concept in statistics and the backbone of Machine Learning. A Data Scientist needs to understand the Normal Distribution when working with Linear Models (which perform well if the data is normally distributed), the Central Limit Theorem, and exploratory data analysis.

The Normal Distribution (or Gaussian Distribution), named after Carl Friedrich Gauss, is a continuous probability distribution. It has a bell-shaped curve that is symmetrical about the mean, with each half of the curve a mirror image of the other.


Mathematical Definition:

A continuous random variable x is said to follow a normal distribution with parameters μ (mean) and σ (standard deviation) if its probability density function is given by

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)),   −∞ < x < ∞

Such a variable x is also called a normal variate.

Standard Normal Variate:

If x is a normal variable with mean μ and standard deviation σ, then

z = (x − μ) / σ

where z is the standard normal variate.

Standard Normal Distribution:

The simplest case of the normal distribution, known as the Standard Normal Distribution, has mean μ = 0 and standard deviation σ = 1, and is described by the probability density function

f(z) = (1 / √(2π)) · e^(−z² / 2)

Distribution Curve Characteristics:

1. The total area under the normal curve is equal to 1.

2. It is a continuous distribution.

3. It is symmetrical about the mean; each half of the distribution is a mirror image of the other half.

4. It is asymptotic to the horizontal axis.

5. It is unimodal.

Area Properties:

The normal distribution can be completely specified by two parameters: the mean and the standard deviation. If the mean and standard deviation are known, the entire curve is determined and the probability of any interval can be computed.

The empirical rule is a handy quick estimate of the data's spread given the mean and standard deviation of a data set that follows a normal distribution. It states that:

• 68.26% of the data will fall within 1 sd of the mean (μ±1σ)

• 95.44% of the data will fall within 2 sd of the mean (μ±2σ)

• 99.73% of the data will fall within 3 sd of the mean (μ±3σ)

• 95% — (μ±1.96σ)

• 99% — (μ±2.58σ)

Thus, almost all the data lies within 3 standard deviations of the mean. This rule enables us to check for outliers and is very helpful when assessing the normality of a distribution.
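As a quick illustration, here is a minimal Python sketch (using NumPy on simulated data as a stand-in for a real dataset) that verifies the empirical rule and flags points beyond 3 standard deviations as potential outliers:

import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=100_000)  # simulated normal data

mu, sigma = data.mean(), data.std()

# Fraction of points within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    within = np.mean(np.abs(data - mu) <= k * sigma)
    print(f"within {k} sd: {within:.4f}")  # ~0.6827, ~0.9545, ~0.9973

# Flag potential outliers using the 3-sigma rule
z = (data - mu) / sigma
outliers = data[np.abs(z) > 3]
print("flagged outliers:", outliers.size)  # ~0.27% of the points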

Application in Machine Learning:


In Machine Learning, data satisfying a Normal Distribution is beneficial for model building: it makes the math easier. Models like LDA and Gaussian Naive Bayes are derived explicitly from the assumption that the features follow a (bivariate or multivariate) normal distribution, and methods such as Logistic Regression and Linear Regression also tend to behave well when their distributional assumptions are approximately met.

Many natural phenomena in the world follow a log-normal distribution, such as financial and forecasting data. By applying transformation techniques (for example, a log transform), we can bring such data closer to a normal distribution. Many other processes follow normality directly, such as measurement errors in an experiment or the position of a particle undergoing diffusion.

So it’s better to critically explore the data and check the underlying distribution of each variable before fitting the model.
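For instance, here is a minimal sketch of such a transformation, assuming a positively skewed, log-normally distributed feature (the variable names are illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # log-normal sample

# A log transform often brings log-normal data close to normal
log_transformed = np.log(skewed)

# Box-Cox is a more general power transform (requires strictly positive data)
boxcox_transformed, lam = stats.boxcox(skewed)

print("skewness before:", stats.skew(skewed))
print("skewness after log:", stats.skew(log_transformed))
print("Box-Cox lambda:", lam)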

Note: Normality is an assumption of some ML models. It is not mandatory that data always follow normality; ML models can work very well with non-normally distributed data too. Models like decision trees and XGBoost don’t assume any normality and work on raw data. Also, linear regression is statistically efficient only if the model errors are Gaussian, not necessarily the entire dataset.

Here I have analyzed the Boston Housing Price dataset. I have explained the visualization techniques and the conversion techniques, along with plots that can validate the normality of the distribution.

Visualization Techniques:

13 numerical features and 1 categorical feature (chas) are present.

Histograms: A histogram is a kind of bar graph that estimates the probability distribution of a continuous variable. It divides numerical data into uniform bins: consecutive, non-overlapping intervals of the variable.
[Figure: histograms of all numerical features]
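A minimal sketch of how such histograms might be produced. Since load_boston was removed from recent scikit-learn versions, fetching the data from OpenML is an assumption about availability:

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

# Fetch the Boston housing data from OpenML (assumes it is hosted there)
boston = fetch_openml(name="boston", version=1, as_frame=True)
df = boston.frame
df.columns = df.columns.str.lower()  # normalize names to match the article (rm, medv, ...)

# Histogram of every numerical feature, 30 bins each
df.hist(bins=30, figsize=(14, 10))
plt.tight_layout()
plt.show()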

kdeplot: A kernel density estimate (KDE) plot depicts the probability density function of continuous data; it can be drawn for a single variable or for multiple variables together.
[Figure: kdeplot of all numerical features]
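A corresponding sketch using seaborn (again assuming the DataFrame df loaded above):

import seaborn as sns
import matplotlib.pyplot as plt

# KDE plot for each numerical feature on its own subplot
numeric_cols = df.select_dtypes("number").columns
fig, axes = plt.subplots(4, 4, figsize=(14, 10))
for ax, col in zip(axes.flat, numeric_cols):
    sns.kdeplot(data=df, x=col, fill=True, ax=ax)
plt.tight_layout()
plt.show()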

Feature Analysis:

Let’s take the feature rm (average number of rooms per dwelling) as an example, since it closely resembles a normal distribution. Though it has some distortion in the right tail, we need to check how closely it resembles a normal distribution. For that, we use the Q-Q plot.

When the quantiles of two variables are plotted against each other, the resulting plot is known as a quantile-quantile plot, or Q-Q plot. This plot provides a summary of whether the distributions of the two variables are similar or not with respect to their locations.

Note: the “rm” feature is standardized before plotting the Q-Q plot.

Here we can clearly see that the feature is not perfectly normally distributed, but it closely resembles one. We can conclude that standardizing this feature (e.g., with StandardScaler) before feeding it to a model can produce good results.
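A minimal sketch of this check, standardizing rm and drawing a Q-Q plot against the standard normal (assumes the df defined earlier, with lowercased column names):

import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import StandardScaler

# Standardize the "rm" feature, then compare its quantiles to N(0, 1)
rm_scaled = StandardScaler().fit_transform(df[["rm"]]).ravel()
stats.probplot(rm_scaled, dist="norm", plot=plt)
plt.title("Q-Q plot of standardized rm")
plt.show()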

Central Limit Theorem and Normal Distribution:


The CLT states that when we sum a large number of independent random variables, irrespective of their original distributions, their normalized sum tends towards a Gaussian distribution.
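A quick simulation makes this concrete; here is a sketch with NumPy in which sums of uniform variables, far from normal individually, look Gaussian once many are added:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Each sample is the sum of n independent Uniform(0, 1) variables
n = 50
sums = rng.uniform(0, 1, size=(100_000, n)).sum(axis=1)

# Normalize: subtract the mean (n/2) and divide by the sd (sqrt(n/12))
z = (sums - n / 2) / np.sqrt(n / 12)

plt.hist(z, bins=100, density=True)
plt.title("Normalized sum of 50 uniforms vs. the bell curve")
plt.show()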

Machine Learning models generally treat training data as a mix of deterministic and random parts. Let the dependent variable Y consist of these parts. Models always want to express the dependent variable Y as some function of several independent variables X. If that function is a sum (or can be expressed as a sum of other functions) and the number of X variables is really high, then by the CLT, Y should tend towards a normal distribution.

Here ML models try to express the deterministic part as a sum of functions of the deterministic independent variables X:

deterministic + random = func(deterministic(1)) + … + func(deterministic(n)) + model_error

If the whole deterministic part of Y is explained by X, then model_error captures only the random part and should follow a normal distribution.

So if the error distribution is normal, we may conclude that the model is successful. Otherwise, either some features that have a large influence on Y are absent from the model, or the model itself is incorrect.
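A minimal sketch of this diagnostic: fit an ordinary linear regression on the df from earlier and inspect the residuals. The target column name medv is an assumption about how the dataset loads:

import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

# Features X -> target y (medv = median home value; cast everything numeric,
# since some columns may load as categorical)
X = df.drop(columns=["medv"]).astype(float)
y = df["medv"].astype(float)
model = LinearRegression().fit(X, y)

# Residuals = actual - predicted; if the model captures the deterministic
# part, these should look approximately normal
residuals = y - model.predict(X)

stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of regression residuals")
plt.show()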
Characteristics of the Standard Normal Distribution:

The characteristics of the Standard Normal Distribution have several important implications for machine learning:

1. Symmetry: The Standard Normal Distribution is symmetric, with the peak at the mean value of 0. In machine learning, this symmetry can be useful when dealing with features that have a balanced influence on the outcome. It ensures that positive and negative deviations from the mean are treated equally, which is important in algorithms like support vector machines and logistic regression.
2. Bell-Shaped Curve: The bell-shaped curve of the Standard
Normal Distribution represents how data tends to cluster around the
mean, with fewer data points as you move away from the center.
Machine learning models often make assumptions about the
distribution of data, and when data approximates a normal
distribution, these assumptions can lead to more accurate
predictions.

3. Standardization: Standardizing features to have a mean of 0 and a standard deviation of 1, as per the Standard Normal Distribution, is a common preprocessing step in machine learning. It ensures that all features contribute equally to model training, preventing one feature from dominating the learning process. This standardization helps algorithms like k-means clustering and principal component analysis perform optimally.


4. Z-Scores for Outlier Detection: In machine learning, detecting outliers is crucial for building robust models. Z-scores, calculated using the Standard Normal Distribution, provide a standardized way to identify and handle outliers. Data points with extreme Z-scores are considered potential outliers and can be treated accordingly.

5. Probabilistic Models: Certain machine learning algorithms, particularly those based on probabilistic models, assume that data follows a normal distribution. For example, Gaussian Naive Bayes assumes that features are normally distributed within each class, making it suitable for text classification and spam detection (a small sketch follows this list).
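As an illustration, a minimal Gaussian Naive Bayes sketch with scikit-learn (the dataset here is synthetic, not from the article):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data; GaussianNB models each feature within each
# class as a univariate normal distribution
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))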

Real-World Applications of the Standard Normal Distribution:

The Standard Normal Distribution, with its well-understood properties, finds numerous real-world applications in machine learning and data science. Here are some key areas where it plays a crucial role:

1. Anomaly Detection: In machine learning, identifying anomalies or outliers is essential for quality control, fraud detection, and network security. The standard normal distribution helps establish thresholds for what is considered normal, and data points falling far from the mean in terms of standard deviations can be flagged as anomalies.

2. Feature Engineering: Standardizing features to have a mean of 0 and a standard deviation of 1 is a common preprocessing step. This ensures that all features contribute equally to machine learning models, preventing one feature from dominating the learning process. Algorithms like k-means clustering and principal component analysis (PCA) heavily rely on this standardization (see the sketch after this list).

3. Model Evaluation: Many machine learning models, such as regression models, assume that the residuals (the differences between predicted and actual values) follow a normal distribution. By examining the distribution of residuals, data scientists can assess whether the model’s assumptions are met and make necessary adjustments.

4. Hypothesis Testing: Hypothesis tests, like the Z-test and t-test, assume a normal distribution of data. In machine learning, these tests are used for tasks such as comparing the performance of different models or assessing the significance of features in regression analysis.

5. Time Series Analysis: While time series data may not always
strictly follow a normal distribution, understanding the normal
distribution’s properties can be helpful in modeling and
forecasting time series data, especially when dealing with
residuals in models like ARIMA (AutoRegressive Integrated
Moving Average).
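A minimal standardization sketch with scikit-learn, using synthetic features on very different scales (the feature names are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Two features on wildly different scales, e.g. rooms (~6) and tax (~400)
X = np.column_stack([rng.normal(6, 1, 500), rng.normal(400, 170, 500)])

# StandardScaler rescales each column to mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
print("means:", X_scaled.mean(axis=0).round(6))  # ~[0, 0]
print("stds: ", X_scaled.std(axis=0).round(6))   # ~[1, 1]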

The standard normal distribution serves as a cornerstone of machine learning, providing the statistical foundation for numerous techniques and practices. From feature standardization to outlier detection, hypothesis testing, and model evaluation, its significance cannot be overstated. As machine learning continues to shape our world, understanding the core concepts of statistics, including the standard normal distribution, empowers data scientists and machine learning engineers to extract valuable insights, build robust models, and make data-driven decisions that drive progress and innovation in various domains.
