Lesson 4 Notes
Diploma in Data Analysis
Introduction to Data Analysis
Lesson 4: Summary Notes
DATA ANALYSIS
Contents
Lesson 4 objectives
Introduction
Distributions
Conclusion
References
Lesson Objectives
By the end of this lesson, you will have travelled
a few more steps along your data analyst
journey through the thought-provoking woods
of distributions, to the captivating lands of
measures of symmetry. You will understand the
difference between leptokurtic and platykurtic
forms and be able to identify distortions from
normality.
Lesson Introduction
What is the key to reaching your goals? We often
start off with a bang but lose steam quickly
once the initial euphoria has worn off. Why is
that? Surprisingly, it’s not a lack of motivation.
Researchers have found that you are more likely
to reach your goal if you state when and where
new behaviours are going to happen. In other words,
you need to plan when and where your
changed behaviour will take place.
The mean, median and mode are all measures of central location and aim to tell more about the central position of
the data.
Mean:
• The mean or average is the sum of all the values of the observations divided by the number of observations.
• Outliers, or extreme values, have a substantial impact on the mean.
• The mean is best to use with symmetrical continuous data.
• Mean = (Sum of all data points)/(Number of data points)
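Median:
• The median is the middle value when the observations are sorted from smallest to largest.
• The median is not as affected by outliers as the mean.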
Mode:
• The mode is the value that appears most frequently in the data.
• The mode is also not as affected by outliers as the mean.
• The mode is best to use for data that is categorical, ordinal or discrete.
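The three measures above can be compared on a small example. The sketch below uses Python's built-in statistics module; the grades are made-up values chosen for illustration, with 95 acting as an outlier.

```python
from statistics import mean, median, mode

# Hypothetical quiz grades (illustrative only); 95 is an extreme value.
grades = [55, 60, 60, 62, 65, 68, 95]

grades_mean = mean(grades)      # sum of all values / number of values
grades_median = median(grades)  # middle value of the sorted data
grades_mode = mode(grades)      # most frequent value

print(grades_mean, grades_median, grades_mode)

# Removing the outlier moves the mean much more than the median.
trimmed = grades[:-1]
trimmed_mean = mean(trimmed)
trimmed_median = median(trimmed)
print(trimmed_mean, trimmed_median)
```

In this example the single outlier pulls the mean up by several marks, while the median barely moves.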
Measures of dispersion
Measures of dispersion or spread, which include the range, variance, standard deviation and interquartile range, are
summary statistics we use to describe the amount of dispersion in a dataset. The higher the variability in a dataset, the
more the observations differ from one another and the more likely it is that the dataset contains extreme values.
Range:
• The range indicates the difference between the highest and lowest value in the dataset.
• This value is affected by outliers.
• The range will likely increase the more the sample size increases, so it’s something to keep in mind with larger
datasets.
Interquartile range:
• Quartiles divide the sorted data into four equal parts using three cut points: Q1, Q2 and Q3.
• Q1 is the 25th percentile and Q3 is the 75th percentile. Q2 is the median.
• The interquartile range is the difference between Q3 and Q1, so it covers the middle 50% of the data points.
• It is not as influenced by extreme values and is a good measure to use for skewed distributions.
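As a rough illustration of the range and interquartile range, the sketch below again uses Python's statistics module on made-up grades. Note that there are several conventions for computing quartiles; the "inclusive" method used here is one assumption, and other tools may return slightly different values.

```python
from statistics import quantiles

# Hypothetical quiz grades (illustrative only); 95 is an extreme value.
grades = [55, 60, 60, 62, 65, 68, 95]

# Range: highest value minus lowest value.
data_range = max(grades) - min(grades)

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
q1, q2, q3 = quantiles(grades, n=4, method="inclusive")

# Interquartile range: spread of the middle 50% of the data.
iqr = q3 - q1

print(data_range, q1, q2, q3, iqr)
```

The outlier stretches the range to 40, while the interquartile range stays small because it ignores the lowest and highest quarters of the data.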
Variance:
• Variance is the average squared difference of the values from the mean. It tells us how far, on average, the
numbers in the dataset lie from the mean.
• The variance is influenced by outliers.
• The variance is expressed in squared units, which makes it difficult to interpret, so we usually use the standard deviation instead.
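Standard deviation:
• The standard deviation is the square root of the variance, so it is expressed in the same units as the data.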
Standard error:
• The standard error of the mean tells us how much the sample mean is likely to deviate from the actual mean of the
population.
• The standard error tells us how reliable the sample mean is as an estimate.
• The standard error is also a measure of variability because it shows how much variation there is between the
calculated mean of the sample and the unknown mean of the population.
• The more observations in the dataset, the smaller the standard error is likely to be.
• A small standard error means the sample mean is likely to be closer to the true mean.
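The relationship between variance, standard deviation and standard error can be sketched as follows. This assumes the data is a sample, so Python's variance and stdev divide by n - 1 rather than n (pvariance and pstdev are the population versions). The standard error is computed with the usual formula, standard deviation divided by the square root of n. The grades are made up.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical quiz grades (illustrative only).
grades = [55, 60, 60, 62, 65, 68, 95]
n = len(grades)

sample_variance = variance(grades)   # squared units, n - 1 denominator
sample_sd = stdev(grades)            # square root of the variance
standard_error = sample_sd / sqrt(n) # variability of the sample mean

print(round(sample_variance, 2), round(sample_sd, 2), round(standard_error, 2))
```

Because the standard error divides by the square root of n, collecting more observations shrinks it even when the standard deviation stays the same.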
Distributions
Let’s draw a random sample to obtain grades of learners for the quiz. As we obtain the grades, we can create a
distribution of the grades. This is useful to us when we need to know which outcomes are the most likely to occur
and what the spread of the potential values is. In other words, in either table, formula or graphical format we can
see what the likelihood of each event or outcome is.
Just like data types help us determine how to further analyse the dataset, distributions determine the appropriate
statistical test to choose. Distribution is one of the key concepts for analytics and provides a basis for inferential
statistics. Thus far, we have mostly dealt with descriptive statistics, which entails describing the data. Inferential
statistics allows us to go beyond the dataset at hand: it aims to make generalisations and predictions
about the population from the sample dataset.
Distribution defined
• A probability distribution is a table, function or graph that describes the values of a random variable and the
probabilities associated with these values.
• A probability distribution describes all the possible values of the random variable in a range and how likely each one is.
• The sum of all the probabilities of the distribution will always be equal to one.
• Think of it this way: if you flip a coin and you have a 50% chance of heads and a 50% chance of tails,
then the sum of these two probabilities is equal to 100% or 1.
• Probability distributions, just like data types, can be divided into 2 categories: discrete and continuous.
2 dice example
Let’s use our dice example again to further understand the concept of probability distributions.
A die has a ⅙ probability of rolling any single number, one to six.
We can represent the distribution of the event (rolling a die) in a table that lists each outcome with its probability.
If we throw 2 dice, the sum can be anything from 2 to 12 as we saw in the previous lesson.
A sum of 7 is the most likely outcome, with a probability of 6/36, or about 0.17.
A sum of 2 or 12 is the least likely outcome, each with a probability of 1/36, or about 0.03.
The sum of the probabilities of all possible sums equals 1.
The probability of any outcome not in this table is 0.
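The distribution of the sum of 2 dice can be rebuilt by listing all 36 equally likely combinations and counting how often each sum occurs. The sketch below does this with exact fractions; the variable names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely (die 1, die 2) combinations and count each sum.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))

# Convert counts to probabilities: occurrences / 36.
distribution = {total: Fraction(count, 36) for total, count in sorted(counts.items())}

for total, probability in distribution.items():
    print(total, probability, round(float(probability), 2))

# The probabilities of all possible sums add up to one.
total_probability = sum(distribution.values())
print(total_probability)
```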
Visual representation
Bernoulli distribution
This distribution is named after Jacob Bernoulli, a Swiss mathematician who analysed the Bernoulli process in the
17th century, hence Bernoulli distribution.
The Bernoulli distribution describes a single trial with only two possible outcomes, usually labelled success and failure.
Think about our coin flipping example again: there are only two possible outcomes for the coin flip, heads or tails.
The two outcomes do not have to be equally likely. For example, surviving and not surviving
were not equally likely outcomes for a passenger on the Titanic. The probability of what we
would define as success was not the same as the probability of failure.
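A Bernoulli trial can be simulated in a few lines. In the sketch below, the success probability of 0.3 is a made-up value chosen for illustration (it is not a figure from the Titanic data), and bernoulli_trial is a hypothetical helper name.

```python
import random

def bernoulli_trial(p, rng):
    """Return 1 (success) with probability p, otherwise 0 (failure)."""
    return 1 if rng.random() < p else 0

rng = random.Random(42)  # fixed seed so the run is repeatable
p_success = 0.3          # hypothetical probability of success

trials = [bernoulli_trial(p_success, rng) for _ in range(10_000)]

# Over many trials, the share of successes approaches p.
observed_share = sum(trials) / len(trials)
print(observed_share)
```

Each individual trial has only two outcomes, but the long-run share of successes settles close to the chosen probability.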
Normal distribution
The most widely used distribution is the normal distribution. Many naturally occurring measurements are
approximately normally distributed. It is also known as the Gaussian distribution or the bell-shaped curve.
The normal distribution is symmetric around the mean and it is depicted as a bell-shaped curve when plotted.
This means that half of the values are to the left of the centre and the other half to the right of the centre of the
distribution. Because of this symmetry, the mean, median and mode are equal to each other for this distribution.
The normal distribution is continuous and can take on any value from negative infinity to positive infinity.
Once again, note that the total area under the curve, the sum of the probabilities of all possible outcomes, is equal to one.
The normal distribution can be fully described by only 2 parameters, its mean and standard deviation.
For the standard normal distribution, the mean is zero and the standard deviation is 1. Every normal distribution
has a skewness of zero and a kurtosis of 3, but more on this a bit later in this lesson.
When dealing with the normal distribution, the standard deviation becomes especially useful.
The empirical rule says that when a random variable is normally distributed:
• 68.27% of data lies within 1 standard deviation of the mean
• 95.45% of data lies within 2 standard deviations of the mean
• 99.73% of data lies within 3 standard deviations of the mean
From this rule, we see that almost all the data falls within 3 standard deviations of the mean.
This rule also helps us to identify outliers or extreme values: observations more than 3 standard deviations from the mean are rare in this distribution.
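The percentages in the empirical rule can be checked numerically. The sketch below uses the NormalDist class from Python's statistics module; share_within is a hypothetical helper name introduced here for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Probability that a value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 2))
```

The same shares hold for any normal distribution, because the rule is expressed in units of standard deviations from the mean.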
The central limit theorem states that if the sample size is large enough (a common rule of thumb is more than 30 observations),
the distribution of the sample mean will be approximately normal, regardless of the shape of the population distribution.
Let’s say that we decide to flip the coin 100 times and count the heads. Maybe the first time we do this we get 80 heads. The second time
we get 45 heads. The third time we get 30 heads. Let’s say we repeat this process 50 times. Eventually, the more we
repeat the experiment of flipping the coin 100 times, the more closely the distribution of the number of heads resembles
a normal distribution. As the number of trials we conduct grows, the distribution tends to normality. The central limit
theorem is one of the most useful results in statistics and will form the foundation for the hypothesis testing
techniques that we will discuss later on in this module.
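The coin-flipping experiment can be simulated to see the bell shape emerge. The sketch below assumes a fair coin and repeats the 100-flip experiment 5,000 times (more than the 50 repetitions mentioned above, so that the shape is easier to see). The function name and the rough text histogram are illustrative choices.

```python
import random
from collections import Counter
from statistics import mean, stdev

rng = random.Random(0)  # fixed seed so the run is repeatable

def heads_in_100_flips(rng):
    """Flip a fair coin 100 times and return the number of heads."""
    return sum(rng.random() < 0.5 for _ in range(100))

# Repeat the 100-flip experiment many times.
results = [heads_in_100_flips(rng) for _ in range(5_000)]

print(round(mean(results), 2), round(stdev(results), 2))

# Rough text histogram: group the head counts into bins of width 5.
bins = Counter((r // 5) * 5 for r in results)
for start in sorted(bins):
    print(f"{start:3d}-{start + 4:3d} {'#' * (bins[start] // 50)}")
```

The head counts cluster symmetrically around 50 with a standard deviation close to 5, and the histogram shows the familiar bell shape.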
There are many methods you can use to check for normality. A few of them are histograms, QQ plots, skewness
and kurtosis.
Histogram:
A histogram is useful to visualise the distribution of the data over a continuous interval. It groups the values into
intervals (bins) and shows how many observations fall into each one, so we can quickly see the shape of a variable's distribution.
QQ plots:
QQ (quantile-quantile) plots divide the data into quantiles and plot the quantiles of a theoretical distribution against the actual
quantiles of the variable. If the data follows the theoretical distribution, the points fall approximately on a straight line.
We will explore this plot in more detail in Module 2 when we start working in R.
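The numbers behind a QQ plot can be sketched without any plotting library. The example below uses simulated grades (an assumption, not real data) and compares each sorted observation with the matching quantile of a normal distribution fitted to the sample. The plotting positions (i + 0.5) / n are one common convention; other tools use slightly different ones.

```python
import random
from statistics import NormalDist, mean, stdev

rng = random.Random(1)  # fixed seed so the run is repeatable

# Simulated grades (illustrative only), sorted to give the actual quantiles.
sample = sorted(rng.gauss(70, 10) for _ in range(200))
n = len(sample)

# Normal distribution with the same mean and standard deviation as the sample.
fitted = NormalDist(mean(sample), stdev(sample))

# Theoretical quantiles at probabilities strictly between 0 and 1.
theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]

# Each pair is one point on the QQ plot: (theoretical quantile, actual quantile).
pairs = list(zip(theoretical, sample))
for t, s in pairs[::40]:
    print(round(t, 1), round(s, 1))
```

When the data is close to normal, the two values in each pair are similar, which is what produces the straight line on the plot.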
Skewness and kurtosis are further measures we can use to assess the distribution of the data.
The Shapiro-Wilk test is used to determine whether the distribution is normal. This test was designed specifically for
this purpose.
The test returns the test statistic W and a p-value. A small p-value (typically below 0.05) suggests that the data is not normally distributed.
Characterisation of data
Location, variability, skewness and kurtosis are all measures that tell us more about the dataset, in other words,
they help us characterise the dataset.
We know that measures of central tendency have to do with data clustering around a certain point, and variability
tells us more about the spread around the central value.
Skewness tells us more about the asymmetry of the distribution and is usually present when the mean, median and
mode are not equal to each other. As we discovered earlier, the skewness for the normal distribution is zero,
because the normal distribution is symmetric and its mean, median and mode are equal. Kurtosis tells us more about the peak
and tails of a distribution.
Skewness
So we know that the normal distribution is symmetric, which means that when we look at a histogram or frequency
distribution, our tails are mirror images of each other. However, generally, there tends to be a cluster of data points
on one side of the mean that creates what we call a skewed distribution. In other words, when the left and right side
of the distribution are shaped dissimilarly, the distribution is skewed. A distribution can be skewed to the left or
right.
Left-skewed distribution:
• If the distribution is skewed to the left, then the data points will accumulate on the right side of the distribution
and the tail to the left will be longer. The curve looks as if it is leaning towards the right.
• Also referred to as a negatively-skewed distribution.
• Typically, mean < median < mode
Right-skewed distribution:
• Alternatively, if the distribution is skewed to the right, the concentration of data points will be on the left side of
the distribution and the tail will be longer to the right.
• Also referred to as a positively-skewed distribution.
• Typically, mean > median > mode
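Skewness can also be expressed as a single number. The sketch below uses the moment coefficient of skewness (the third central moment divided by the second central moment raised to the power 1.5), which is one common definition; statistical packages may apply small-sample corrections and return slightly different values. The datasets are made up.

```python
from statistics import mean, median

def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (population form)."""
    n = len(values)
    m = mean(values)
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - m) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]     # long tail to the right
left_skewed = [-x for x in right_skewed]     # mirror image, long tail to the left
symmetric = [1, 2, 3, 4, 5]

for name, data in [("right", right_skewed), ("left", left_skewed), ("symmetric", symmetric)]:
    print(name, round(skewness(data), 2), mean(data), median(data))
```

The right-skewed data gives a positive value and a mean above the median, the left-skewed data gives a negative value and a mean below the median, and the symmetric data gives zero.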
Kurtosis
Kurtosis is another measure we can use to check normality of the distribution and provides us with information
about the distribution along the tails.
Kurtosis also helps explain more about the shape of the probability distribution, i.e. whether there are any extreme values
in the tails of the distribution.
Kurtosis can also be seen as the combined weight of the tails in comparison to the rest of the distribution.
As the tails become heavier, kurtosis increases and as the tails become lighter, kurtosis decreases. In other words, if
the distribution has a low kurtosis, the data in the tails is less extreme than that of the normal distribution and if the
distribution has a high kurtosis, the data in the tails is more extreme than that of the normal distribution.
Mesokurtic:
• A mesokurtic distribution has the same kurtosis as the normal distribution, which is equal to 3.
Leptokurtic:
• If the distribution has tails that are heavier or fatter, the peak is typically higher and sharper than mesokurtic, and
we say that the distribution is leptokurtic.
• Leptokurtic distributions have kurtosis greater than 3.
• Leptokurtic distributions have more extreme outliers than the normal distribution.
• The tails will be fatter and denser in comparison to those of the normal distribution.
Platykurtic:
• A platykurtic distribution occurs when extreme values are less likely than in the normal distribution, so
there are fewer data points along the tails.
• The kurtosis value is less than 3, that of the normal distribution, so the excess kurtosis (kurtosis minus 3) is less than zero.
• The tails of the platykurtic distribution are thinner than those of the normal distribution.
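The three kurtosis categories can be illustrated numerically. The sketch below uses the moment definition of kurtosis (the fourth central moment divided by the squared second central moment), for which the normal distribution gives 3; some software reports excess kurtosis (this value minus 3) instead. All three datasets are made up for illustration.

```python
import random

def kurtosis(values):
    """Moment kurtosis: m4 / m2 ** 2 (a normal distribution gives about 3)."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((x - m) ** 2 for x in values) / n  # second central moment
    m4 = sum((x - m) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2

rng = random.Random(7)  # fixed seed so the run is repeatable

normal_like = [rng.gauss(0, 1) for _ in range(20_000)]  # roughly mesokurtic
flat = list(range(1, 101))                              # evenly spread, thin tails
heavy_tailed = [0] * 18 + [-10, 10]                     # mostly central, two extremes

for name, data in [("normal", normal_like), ("flat", flat), ("heavy", heavy_tailed)]:
    k = kurtosis(data)
    print(name, round(k, 2), "excess:", round(k - 3, 2))
```

The evenly spread data comes out platykurtic (below 3), the simulated normal data is close to 3, and the data with two extreme values is strongly leptokurtic.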
Conclusion
In order to become a data analyst, employers will typically expect you to be proficient in tools such as SQL, Python, R,
Excel, PowerBI or Tableau, to name but a few.
Of course, some of these tools are essential to master and to list on your resume. Skills such as communication,
critical thinking and attention to detail are equally important, if not more so, and are often the more difficult
to master.
We develop these skills through practice. Good old-fashioned trial and error. There is no easy way to master the skill of
data analysis, so if you are truly passionate about the field, it’s up to you to make it happen. Throughout this course,
we practically apply the theory we study in order to develop critical thinking and attention to detail, but I challenge
you to apply the skills you learn here to other datasets as well in order to build a solid foundation for yourself.
Resources
365DataScience, 2020, What is a distribution in statistics?, https://365datascience.com/explainer-video/distribution-in-statistics/
Meena, S., 2020, Statistics for Data Science: What is Normal Distribution?, Analytics Vidhya, https://www.analyticsvidhya.com/blog/2020/04/statistics-data-science-normal-distribution/
Measures of Skewness and Kurtosis, https://www.itl.nist.gov/div898/handbook/eda/section3/eda35b.htm
Kenton, W., 2020, Standard error, Investopedia, https://www.investopedia.com/terms/s/standard-error.asp
Frost, J., 2018, Measures of Central Tendency: Mean, Median, and Mode, Statistics by Jim, https://statisticsbyjim.com/basics/measures-central-tendency-mean-median-mode/
Frost, J., 2018, Measures of Variability: Range, Interquartile Range, Variance, and Standard Deviation, Statistics by Jim, https://statisticsbyjim.com/basics/variability-range-interquartile-variance-standard-deviation/
Greenbook, How to Interpret Standard Deviation and Standard Error in Survey Research, Greenbook.org, https://www.greenbook.org/marketing-research/how-to-interpret-standard-deviation-and-standard-error-in-survey-research-03377
Holmes, A., Opentextbc.ca, Skewness and the Mean, Median, and Mode, BCCampus, https://opentextbc.ca/introbusinessstatopenstax/chapter/skewness-and-the-mean-median-and-mode/#:~:text=To%20summarize%2C%20generally%20if%20the,is%20less%20than%20the%20mean.
Lane, D.M., Introduction to Normal Distribution, Onlinestatbook.com, http://onlinestatbook.com/2/normal_distribution/intro.html
Brownlee, J., 2018, A gentle introduction to statistical data distributions, Machine Learning Mastery, https://machinelearningmastery.com/statistical-data-distributions/
Hayes, A., 2019, Probability Distributions, Investopedia, https://www.investopedia.com/terms/p/probabilitydistribution.asp
Sheats, R.D., 2002, Understanding distributions and data types, Seminars in Orthodontics, Volume 8, Issue 2, pp. 62-66, https://www.sciencedirect.com/science/article/abs/pii/1073874602800342
Vollmer, C., 2017, Normal Approximation to Binomial Distributions, Colorado State University, https://www.stat.colostate.edu/~vollmer/stat307pdfs/Binom_to_Normal.pdf
McNeese, B., Are the skewness and kurtosis useful statistics?, BPI Consulting, https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics#:~:text=Skewness%20essentially%20measures%20the%20relative,which%20is%20equal%20to%203.
Bassett, E.E., et al., 1986, Statistics: Problems and Solutions, 2nd ed, London, Edward Arnold publishers, pp. 32-34