Normal Distribution

Normal Distribution for machine learning


Normal Distribution is an important concept in statistics and the
backbone of Machine Learning. A Data Scientist needs to understand the
Normal Distribution when working with Linear Models (which perform well
when the data is normally distributed), the Central Limit Theorem, and
exploratory data analysis.

As discovered by Carl Friedrich Gauss, the Normal
Distribution (or Gaussian Distribution) is a continuous probability
distribution. It has a bell-shaped curve that is symmetrical about the
mean, with each half of the curve mirroring the other.


Mathematical Definition:

A continuous random variable x is said to follow a normal
distribution with parameters μ (mean) and σ (standard deviation) if its
probability density function is given by

f(x) = (1 / (σ√(2π))) e^(−(x − μ)² / (2σ²)),  −∞ < x < ∞

Such an x is also called a normal variate.

Standard Normal Variate:

If x is a normal variable with mean μ and standard deviation σ, then

z = (x − μ) / σ

where z is the standard normal variate.
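
As a quick illustration, here is a minimal Python sketch (with a
made-up sample, not data from the article) that converts a variable
into its standard normal variate:

```python
import numpy as np

# Hypothetical sample; any roughly normal variable x works here.
x = np.array([4.0, 5.5, 6.2, 7.1, 8.3])

mu = x.mean()         # estimate of the mean
sigma = x.std()       # estimate of the standard deviation

z = (x - mu) / sigma  # standard normal variate: mean ~0, sd ~1
print(z.mean(), z.std())
```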

Standard Normal Distribution:

The simplest case of the normal distribution, known as the Standard
Normal Distribution, has a mean μ of 0 and a standard deviation σ of 1,
and is described by the probability density function

f(z) = (1 / √(2π)) e^(−z² / 2)

Distribution Curve Characteristics:

1. The total area under the normal curve is equal to 1.

2. It is a continuous distribution.

3. It is symmetrical about the mean. Each half of the distribution is a
mirror image of the other half.

4. It is asymptotic to the horizontal axis.

5. It is unimodal.

Area Properties:

The normal distribution is completely specified by two parameters:
the mean and the standard deviation. Once these two are known, the
density at every point on the curve is determined.

The empirical rule is a handy quick estimate of the data's spread given
the mean and standard deviation of a data set that follows a normal
distribution. It states that:

• 68.26% of the data will fall within 1 sd of the mean (μ ± 1σ)

• 95.44% of the data will fall within 2 sd of the mean (μ ± 2σ)

• 99.73% of the data will fall within 3 sd of the mean (μ ± 3σ)

• 95% of the data will fall within μ ± 1.96σ

• 99% of the data will fall within μ ± 2.58σ

Thus, almost all the data lies within 3 standard deviations of the
mean. This rule enables us to check for outliers and is very helpful
when assessing the normality of a distribution.
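
These coverage figures can be verified directly from the standard
normal CDF; a minimal sketch using scipy:

```python
from scipy.stats import norm

# For any normal X, P(mu - k*sigma < X < mu + k*sigma) = Phi(k) - Phi(-k).
for k in (1, 2, 3, 1.96, 2.58):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sd of the mean: {coverage:.2%}")
# within 1 sd: ~68.27%, 2 sd: ~95.45%, 3 sd: ~99.73%
# within 1.96 sd: ~95%, 2.58 sd: ~99%
```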

Application in Machine Learning:

In Machine Learning, data that satisfies a Normal Distribution is
beneficial for model building: it makes the math easier. Models like
LDA and Gaussian Naive Bayes are derived explicitly from the assumption
that the features are bivariate or multivariate normal, while Linear
Regression assumes Gaussian errors and Logistic Regression tends to
behave better on roughly normal inputs. Also, sigmoid functions work
most naturally with normally distributed data.

Many natural phenomena in the world follow a log-normal
distribution, such as financial data and forecasting data. By applying
transformation techniques (for example, a log transform, as sketched
below), we can convert such data into a normal distribution. Also,
many processes follow normality directly, such as measurement errors
in an experiment or the position of a particle undergoing diffusion.
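
A minimal sketch of the log transform, using simulated (not real
financial) data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
prices = rng.lognormal(mean=0.0, sigma=1.0, size=1_000)  # simulated log-normal data

log_prices = np.log(prices)  # the log of log-normal data is normal

# Skewness near 0 after the transform suggests the data is now symmetric.
print(stats.skew(prices), stats.skew(log_prices))
```

Box-Cox (scipy.stats.boxcox) is a more general alternative when a
plain log transform is not enough.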

So it’s better to critically explore the data and check the underlying
distribution of each variable before fitting the model.

Note: Normality is an assumption for some ML models. It is not
mandatory that data always follow normality; ML models can work
very well on non-normally distributed data too. Models like decision
trees and XGBoost don’t assume any normality and work on raw data as
well. Also, Linear Regression is statistically sound as long as the
model errors are Gaussian; the entire dataset need not be.

Here I have analyzed the Boston Housing Price Dataset. I
explain the visualization techniques and the conversion
techniques, along with plots that can validate the normality
of a distribution.

Visualization Techniques:

13 numerical features and 1 categorical feature (chas) are present.
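
One way to load the data. Note that sklearn's original load_boston
helper has since been removed, so fetching the dataset from OpenML is
an assumption here (it also requires network access):

```python
from sklearn.datasets import fetch_openml

# Fetch the Boston Housing data from OpenML as a pandas DataFrame.
boston = fetch_openml(name="boston", version=1, as_frame=True)
df = boston.frame

print(df.shape)  # (506, 14): 13 numerical columns plus the categorical chas
```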

Histograms: A histogram is a kind of bar graph that estimates the
probability distribution of a continuous variable. It takes numerical
data and divides it into uniform bins, which are consecutive, non-
overlapping intervals of the variable.
[Figure: histograms of all numerical features]
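
A minimal sketch that produces such a grid of histograms, assuming the
DataFrame df loaded above:

```python
import matplotlib.pyplot as plt

# One histogram per numerical column; bins are the uniform intervals
# described above.
df.hist(bins=30, figsize=(14, 10))
plt.tight_layout()
plt.show()
```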

kdeplot: A Kernel Density Estimate plot depicts the probability
density function of continuous or non-parametric data variables;
we can plot a single variable or several variables together.
[Figure: kdeplots of all numerical features]
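
A sketch of the same idea with seaborn, again assuming df from above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# One KDE plot per numerical column; each curve is an estimate of that
# feature's probability density function.
numeric_cols = df.select_dtypes("number").columns
fig, axes = plt.subplots(4, 4, figsize=(14, 10))
for ax, col in zip(axes.ravel(), numeric_cols):
    sns.kdeplot(df[col], ax=ax)
plt.tight_layout()
plt.show()
```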

Feature Analysis:

Let’s take the feature rm (average number of rooms per dwelling)
as an example, since it closely resembles a normal distribution.
Though it has some distortion in the right tail, we need to check how
closely it resembles a normal distribution. For that, we need to check
the Q-Q plot.

When the quantiles of two variables are plotted against each other,
the resulting plot is known as a quantile-quantile plot, or qqplot.
This plot summarizes whether the distributions of two variables are
similar or not with respect to their locations.

Note: the “rm” feature is standardized before plotting the qqplot, as
in the sketch below.
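
A minimal sketch of that standardization plus the Q-Q plot (assuming
the column is named RM in df, as in the OpenML copy of the data):

```python
from scipy import stats
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Standardize the feature, then plot its quantiles against the
# standard normal's quantiles.
rm_scaled = StandardScaler().fit_transform(df[["RM"]]).ravel()

stats.probplot(rm_scaled, dist="norm", plot=plt)
plt.title("Q-Q plot of standardized RM")
plt.show()
```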

Here we can clearly see that the feature is not exactly normally
distributed, but it somewhat resembles normality. We can conclude that
standardizing this feature (e.g., with StandardScaler) before feeding
it to a model can produce good results.

Central Limit Theorem and Normal Distribution:

The CLT states that when we sum a large number of independent random
variables, irrespective of those variables' original distributions,
their normalized sum tends towards a Gaussian distribution.

Machine Learning models generally treat training data as a mix
of deterministic and random parts. Let the dependent
variable (Y) consist of these parts. Models try to express
the dependent variable (Y) as some function of several independent
variables (X). If that function is a sum (or can be expressed as a sum
of other functions) and the number of X variables is really high, then
Y should have a normal distribution.
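
A small simulation makes this concrete: sums of many independent,
decidedly non-normal (uniform) variables already look Gaussian.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Each row sums n = 100 independent uniform variables; by the CLT the
# normalized sums should be approximately standard normal.
n = 100
sums = rng.uniform(0, 1, size=(5_000, n)).sum(axis=1)
normalized = (sums - sums.mean()) / sums.std()

# A high p-value here is consistent with normality.
print(stats.normaltest(normalized))
```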

Here, ML models try to express the deterministic part as a sum over
deterministic independent variables (X):

deterministic + random = func(deterministic(1)) + … + func(deterministic(n)) + model_error

If the whole deterministic part of Y is explained by X, then
the model_error captures only the random part and should have a
normal distribution.

So if the error distribution is normal, we may conclude that the
model is successful. Otherwise, either some features with a large
enough influence on Y are absent from the model, or the model itself
is incorrect.
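
A hedged sketch of that residual check, reusing the df loaded above
(assumes the target column is named MEDV, as in the OpenML copy):

```python
from scipy import stats
from sklearn.linear_model import LinearRegression

# Fit a linear model on the numerical features and inspect whether the
# residuals (the "random part") look Gaussian.
X = df.select_dtypes("number").drop(columns=["MEDV"])
y = df["MEDV"]

residuals = y - LinearRegression().fit(X, y).predict(X)

# Shapiro-Wilk: a low p-value suggests non-normal errors, hinting at
# missing features or a mis-specified model.
print(stats.shapiro(residuals))
```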
