Normal Distribution for Machine Learning

Normal Distribution is an important concept in statistics and the backbone of Machine Learning. A Data Scientist needs to understand the Normal Distribution when working with Linear Models (which perform well if the data is normally distributed), the Central Limit Theorem, and exploratory data analysis.

The Normal Distribution (or Gaussian Distribution), named after Carl Friedrich Gauss, is a continuous probability distribution. It has a bell-shaped curve that is symmetrical about the mean, with each half of the curve a mirror image of the other.


Mathematical Definition:

A continuous random variable x is said to follow a normal distribution with parameters μ (mean) and σ (standard deviation) if its probability density function is given by

f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²)),   −∞ < x < ∞

Such a variable x is also called a normal variate.

Standard Normal Variate:

If x is a normal variable with mean μ and standard deviation σ, then

z = (x − μ) / σ

where z is the standard normal variate.

Standard Normal Distribution:

The simplest case of the normal distribution, known as the Standard Normal Distribution, has mean μ = 0 and standard deviation σ = 1, and is described by the probability density function

f(z) = (1 / √(2π)) · e^(−z² / 2)

Distribution Curve Characteristics:

1. The total area under the normal curve is equal to 1.

2. It is a continuous distribution.

3. It is symmetrical about the mean; each half of the distribution is a mirror image of the other half.

4. It is asymptotic to the horizontal axis.

5. It is unimodal.

Area Properties:

The normal distribution can be completely specified by two parameters: the mean and the standard deviation. If the mean and standard deviation are known, the entire curve is determined and the probability of any interval can be computed.

The empirical rule is a handy quick estimate of the data's spread given the mean and standard deviation of a data set that follows a normal distribution. It states that:

• 68.26% of the data will fall within 1 sd of the mean (μ±1σ)

• 95.44% of the data will fall within 2 sd of the mean (μ±2σ)

• 99.73% of the data will fall within 3 sd of the mean (μ±3σ)

• 95% — (μ±1.96σ)

• 99% — (μ±2.58σ)

Thus, almost all the data lies within 3 standard deviations of the mean. This rule enables us to check for outliers and is very helpful when assessing the normality of a distribution.
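As a quick illustration, here is a minimal Python sketch (using NumPy on simulated data as a stand-in for a real dataset) that verifies the empirical rule and flags points beyond 3 standard deviations as potential outliers:

import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=100_000)  # simulated normal data

mu, sigma = data.mean(), data.std()

# Fraction of points within 1, 2, and 3 standard deviations of the mean
for k in (1, 2, 3):
    within = np.mean(np.abs(data - mu) <= k * sigma)
    print(f"within {k} sd: {within:.4f}")  # ~0.6827, ~0.9545, ~0.9973

# Flag potential outliers using the 3-sigma rule
z = (data - mu) / sigma
outliers = data[np.abs(z) > 3]
print("flagged outliers:", outliers.size)  # ~0.27% of the points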

Application in Machine Learning:


In Machine Learning, data satisfying a Normal Distribution is beneficial for model building: it makes the math easier. Models like LDA and Gaussian Naive Bayes are derived explicitly from the assumption that the features follow a (bivariate or multivariate) normal distribution, and methods such as Logistic Regression and Linear Regression also tend to behave well when their distributional assumptions are approximately met.

Many natural phenomena in the world follow a log-normal distribution, such as financial and forecasting data. By applying transformation techniques (for example, a log transform), we can bring such data closer to a normal distribution. Many other processes follow normality directly, such as measurement errors in an experiment or the position of a particle undergoing diffusion.

So it’s better to critically explore the data and check the underlying distribution of each variable before fitting the model.
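For instance, here is a minimal sketch of such a transformation, assuming a positively skewed, log-normally distributed feature (the variable names are illustrative):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # log-normal sample

# A log transform often brings log-normal data close to normal
log_transformed = np.log(skewed)

# Box-Cox is a more general power transform (requires strictly positive data)
boxcox_transformed, lam = stats.boxcox(skewed)

print("skewness before:", stats.skew(skewed))
print("skewness after log:", stats.skew(log_transformed))
print("Box-Cox lambda:", lam)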

Note: Normality is an assumption of some ML models. It is not mandatory that data always follow normality; ML models can work very well with non-normally distributed data too. Models like decision trees and XGBoost don’t assume any normality and work on raw data. Also, linear regression is statistically efficient only if the model errors are Gaussian, not necessarily the entire dataset.

Here I have analyzed the Boston Housing Price dataset. I have explained the visualization techniques and the conversion techniques, along with plots that can validate the normality of the distribution.

Visualization Techniques:

13 numerical features and 1 categorical feature (chas) are present.

Histograms: A histogram is a kind of bar graph that estimates the probability distribution of a continuous variable. It divides numerical data into uniform bins: consecutive, non-overlapping intervals of the variable.
[Figure: histograms of all numerical features]
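A minimal sketch of how such histograms might be produced. Since load_boston was removed from recent scikit-learn versions, fetching the data from OpenML is an assumption about availability:

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml

# Fetch the Boston housing data from OpenML (assumes it is hosted there)
boston = fetch_openml(name="boston", version=1, as_frame=True)
df = boston.frame
df.columns = df.columns.str.lower()  # normalize names to match the article (rm, medv, ...)

# Histogram of every numerical feature, 30 bins each
df.hist(bins=30, figsize=(14, 10))
plt.tight_layout()
plt.show()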

kdeplot: A kernel density estimate (KDE) plot depicts the probability density function of continuous data; it can be drawn for a single variable or for multiple variables together.
[Figure: kdeplot of all numerical features]
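A corresponding sketch using seaborn (again assuming the DataFrame df loaded above):

import seaborn as sns
import matplotlib.pyplot as plt

# KDE plot for each numerical feature on its own subplot
numeric_cols = df.select_dtypes("number").columns
fig, axes = plt.subplots(4, 4, figsize=(14, 10))
for ax, col in zip(axes.flat, numeric_cols):
    sns.kdeplot(data=df, x=col, fill=True, ax=ax)
plt.tight_layout()
plt.show()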

Feature Analysis:

Let’s take the feature rm (average number of rooms per dwelling) as an example, since it closely resembles a normal distribution. Though it has some distortion in the right tail, we need to check how closely it resembles a normal distribution. For that, we use the Q-Q plot.

When the quantiles of two variables are plotted against each other, the resulting plot is known as a quantile-quantile plot, or Q-Q plot. This plot provides a summary of whether the distributions of the two variables are similar or not with respect to their locations.

Note: the “rm” feature is standardized before plotting the Q-Q plot.

Here we can clearly see that the feature is not perfectly normally distributed, but it closely resembles one. We can conclude that standardizing this feature (e.g., with StandardScaler) before feeding it to a model can produce good results.
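A minimal sketch of this check, standardizing rm and drawing a Q-Q plot against the standard normal (assumes the df defined earlier, with lowercased column names):

import matplotlib.pyplot as plt
from scipy import stats
from sklearn.preprocessing import StandardScaler

# Standardize the "rm" feature, then compare its quantiles to N(0, 1)
rm_scaled = StandardScaler().fit_transform(df[["rm"]]).ravel()
stats.probplot(rm_scaled, dist="norm", plot=plt)
plt.title("Q-Q plot of standardized rm")
plt.show()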

Central Limit Theorem and Normal Distribution:


The CLT states that when we sum a large number of independent random variables, irrespective of their original distributions, their normalized sum tends towards a Gaussian distribution.
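A quick simulation makes this concrete; here is a sketch with NumPy in which sums of uniform variables, far from normal individually, look Gaussian once many are added:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Each sample is the sum of n independent Uniform(0, 1) variables
n = 50
sums = rng.uniform(0, 1, size=(100_000, n)).sum(axis=1)

# Normalize: subtract the mean (n/2) and divide by the sd (sqrt(n/12))
z = (sums - n / 2) / np.sqrt(n / 12)

plt.hist(z, bins=100, density=True)
plt.title("Normalized sum of 50 uniforms vs. the bell curve")
plt.show()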

Machine Learning models generally treat training data as a mix of deterministic and random parts. Let the dependent variable Y consist of these parts. Models always want to express the dependent variable Y as some function of several independent variables X. If that function is a sum (or can be expressed as a sum of other functions) and the number of X variables is really high, then by the CLT, Y should tend towards a normal distribution.

Here ML models try to express the deterministic part as a sum of functions of the deterministic independent variables X:

deterministic + random = func(deterministic(1)) + … + func(deterministic(n)) + model_error

If the whole deterministic part of Y is explained by X, then model_error captures only the random part and should follow a normal distribution.

So if the error distribution is normal, we may conclude that the model is successful. Otherwise, either some features that have a large influence on Y are absent from the model, or the model itself is incorrect.
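A minimal sketch of this diagnostic: fit an ordinary linear regression on the df from earlier and inspect the residuals. The target column name medv is an assumption about how the dataset loads:

import matplotlib.pyplot as plt
from scipy import stats
from sklearn.linear_model import LinearRegression

# Features X -> target y (medv = median home value; cast everything numeric,
# since some columns may load as categorical)
X = df.drop(columns=["medv"]).astype(float)
y = df["medv"].astype(float)
model = LinearRegression().fit(X, y)

# Residuals = actual - predicted; if the model captures the deterministic
# part, these should look approximately normal
residuals = y - model.predict(X)

stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q plot of regression residuals")
plt.show()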
Characteristics of the Standard Normal Distribution:

The characteristics of the Standard Normal Distribution have several important implications for machine learning:

1. Symmetry: The Standard Normal Distribution is symmetric, with the peak at the mean value of 0. In machine learning, this symmetry can be useful when dealing with features that have a balanced influence on the outcome. It ensures that positive and negative deviations from the mean are treated equally, which is important in algorithms like support vector machines and logistic regression.
2. Bell-Shaped Curve: The bell-shaped curve of the Standard
Normal Distribution represents how data tends to cluster around the
mean, with fewer data points as you move away from the center.
Machine learning models often make assumptions about the
distribution of data, and when data approximates a normal
distribution, these assumptions can lead to more accurate
predictions.

3. Standardization: Standardizing features to have a mean of 0 and a standard deviation of 1, as per the Standard Normal Distribution, is a common preprocessing step in machine learning. It ensures that all features contribute equally to model training, preventing one feature from dominating the learning process. This standardization helps algorithms like k-means clustering and principal component analysis perform optimally.


4. Z-Scores for Outlier Detection: In machine learning, detecting outliers is crucial for building robust models. Z-scores, calculated using the Standard Normal Distribution, provide a standardized way to identify and handle outliers. Data points with extreme Z-scores are considered potential outliers and can be treated accordingly.

5. Probabilistic Models: Certain machine learning algorithms, particularly those based on probabilistic models, assume that data follows a normal distribution. For example, Gaussian Naive Bayes assumes that features are normally distributed within each class, making it suitable for text classification and spam detection (a small sketch follows this list).
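As an illustration, a minimal Gaussian Naive Bayes sketch with scikit-learn (the dataset here is synthetic, not from the article):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data; GaussianNB models each feature within each
# class as a univariate normal distribution
X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = GaussianNB().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))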

Real-World Applications of the Standard Normal Distribution:

The Standard Normal Distribution, with its well-understood properties, finds numerous real-world applications in machine learning and data science. Here are some key areas where it plays a crucial role:

1. Anomaly Detection: In machine learning, identifying anomalies or outliers is essential for quality control, fraud detection, and network security. The standard normal distribution helps establish thresholds for what is considered normal, and data points falling far from the mean in terms of standard deviations can be flagged as anomalies.

2. Feature Engineering: Standardizing features to have a mean of 0 and a standard deviation of 1 is a common preprocessing step. This ensures that all features contribute equally to machine learning models, preventing one feature from dominating the learning process. Algorithms like k-means clustering and principal component analysis (PCA) heavily rely on this standardization (see the sketch after this list).

3. Model Evaluation: Many machine learning models, such as regression models, assume that the residuals (the differences between predicted and actual values) follow a normal distribution. By examining the distribution of residuals, data scientists can assess whether the model’s assumptions are met and make necessary adjustments.

4. Hypothesis Testing: Hypothesis tests, like the Z-test and t-test, assume a normal distribution of data. In machine learning, these tests are used for tasks such as comparing the performance of different models or assessing the significance of features in regression analysis.

5. Time Series Analysis: While time series data may not always
strictly follow a normal distribution, understanding the normal
distribution’s properties can be helpful in modeling and
forecasting time series data, especially when dealing with
residuals in models like ARIMA (AutoRegressive Integrated
Moving Average).
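A minimal standardization sketch with scikit-learn, using synthetic features on very different scales (the feature names are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Two features on wildly different scales, e.g. rooms (~6) and tax (~400)
X = np.column_stack([rng.normal(6, 1, 500), rng.normal(400, 170, 500)])

# StandardScaler rescales each column to mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
print("means:", X_scaled.mean(axis=0).round(6))  # ~[0, 0]
print("stds: ", X_scaled.std(axis=0).round(6))   # ~[1, 1]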

The standard normal distribution serves as a cornerstone of machine learning, providing the statistical foundation for numerous techniques and practices. From feature standardization to outlier detection, hypothesis testing, and model evaluation, its significance cannot be overstated. As machine learning continues to shape our world, understanding the core concepts of statistics, including the standard normal distribution, empowers data scientists and machine learning engineers to extract valuable insights, build robust models, and make data-driven decisions that drive progress and innovation in various domains.
