Data Analysis Fundamentals
Lesson Overview
This lesson will cover some of the foundational statistical topics needed to use
statistics in practice. You will learn how to:
Lesson Prerequisites
This lesson does not have any hard prerequisites, although having some
knowledge of the following areas may be useful:
Use Cases
Descriptive statistics is useful in many different jobs and activities. Having a
good understanding of descriptive statistics will help anyone working in:
Business Analytics
Data Analysis
Data Engineering
Product Management
and so much more. Once you are finished with this course and its concepts,
you'll be able to apply them in ways you didn't even think about before.
Data is used to understand and improve nearly every facet of our lives. So, no
matter what field you are in, you can utilize data to make better decisions and
accomplish your goals.
We will start this lesson with an overview of data types and the most common
statistics used when analyzing data.
We'll discuss:
Categorical is used to label a group or set of items (like dog breeds - Collies,
Labs, Poodles, etc.).
Question 1 of 2
Variable                                            Data Type
Zip Code                                            Categorical
Age                                                 Quantitative
Income                                              Quantitative
Marital Status (Single, Married, Divorced, etc.)    Categorical
Height                                              Quantitative
Nice! You know your variables! The zip code is tricky. Even though this is a number, it
isn't a number with which we can perform mathematical operations (add, subtract, etc.)
and get another value that makes sense. Therefore, we consider it a categorical
variable, not quantitative.
Categorical Nominal data do not have an order or ranking (like the breeds of
the dog).
Continuous data can be split into smaller and smaller units, and still a smaller
unit exists. An example of this is the age of the dog - we can measure the
units of the age in years, months, days, hours, seconds, but there are still
smaller units that could be associated with the age.
Another Look
To break down our data types, there are two main blocks:
You should have now mastered what types of data in the world around us falls
into each of these four buckets: Discrete, Continuous, Nominal, and Ordinal.
In the next sections, we will work through the numeric summaries that relate
specifically to quantitative variables.
Height, Age, the Number of Pages in a Book, and Annual Income all take
on values that we can add, subtract and perform other operations with to gain
useful insight. Hence, these are quantitative.
Gender, Letter Grade, Breakfast Type, Marital Status, and Zip Code can
be thought of as labels for a group of items or individuals. Hence, these
are categorical.
Continuous vs. Discrete
To consider if we have continuous or discrete data, we should see if we can
split our data into smaller and smaller units. Consider time - we could
measure an event in years, months, days, hours, minutes, or seconds, and
even at seconds we know there are smaller units we could measure time in.
Therefore, we know this data type is continuous. Height, age,
and income are all examples of continuous data. Alternatively, the number of
pages in a book, dogs I count outside a coffee shop, or trees in a
yard are discrete data. We would not want to split our dogs in half.
Final Words
In this section, we looked at the different data types we might work with in the
world around us. When we work with data in the real world, it might not be
very clean - sometimes there are typos or missing values. When this is the
case, simply having some expertise regarding the data and knowing the data
type can assist in our ability to ‘clean’ this data. Understanding data types can
also assist in our ability to build visuals to best explain the data. But more on
this very soon!
That's right! Remember ordinal variables have a rank ordering associated with each.
Nominal variables could be placed in any order.
The shortest time might be just a few weeks and the longest might be a
couple of years. What proportion of students finishes within two months and
what proportion takes longer than eight months?
Using a variety of measures gives you a fuller picture. Measures of center give
you an idea of the average student. Measures of spread give you an idea of how
students differ. Visuals provide a more complete picture of how long it takes any
student to complete a course or program.
Measures of Center (Mean)
Analyzing Quantitative Data
Four Aspects for Quantitative Data
1. Measures of Center
2. Measures of Spread
3. The Shape of the data.
4. Outliers
Analyzing Categorical Data
Though not discussed in the video, analyzing categorical data has fewer parts
to consider. Categorical data is analyzed usually by looking at the counts or
proportion of individuals that fall into each group. For example, if we were
looking at the breeds of the dogs, we would care about how many dogs are of
each breed, or what proportion of dogs are of each breed type.
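The counts and proportions described above can be sketched in a few lines of Python. This is only an illustration: the list of breeds is made up, and the helper name `counts_and_proportions` is an assumption, not something from the lesson.

```python
from collections import Counter

# Hypothetical sample of dog breeds (made up for illustration).
breeds = ["Collie", "Lab", "Poodle", "Lab", "Lab", "Collie"]

def counts_and_proportions(labels):
    """Return {label: (count, proportion)} for a list of category labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (count, count / total) for label, count in counts.items()}

summary = counts_and_proportions(breeds)
for breed, (count, proportion) in summary.items():
    print(f"{breed}: count={count}, proportion={proportion:.2f}")
```

The proportions always add up to 1, which is a quick sanity check when summarizing a categorical variable.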
Measures of Center
There are three measures of center:
1. Mean
2. Median
3. Mode
The Mean
In this video, we focused on the calculation of the mean. The mean is often
called the average or the expected value in mathematics. We calculate the
mean by adding all of our values together and dividing by the number of
values in our dataset.
The remaining measures of the median and mode will be discussed in detail
in the upcoming quizzes and videos.
That's right! There are 3 M's to our measures of center - means, medians, and modes.
Nice! That's right, the median is the middle number when our data are ordered.
No Mode
If all observations in our dataset are observed with the same frequency, there
is no mode. If we have the dataset:
1, 1, 2, 2, 3, 3, 4, 4
There is no mode because all observations occur the same number of times.
Many Modes
If two (or more) numbers share the maximum frequency, then there is more than
one mode. If we have the dataset:
1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9
There are two modes, 3 and 6, because these values share the maximum
frequency of 3 times, while all other values appear only once.
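As a rough sketch, the "no mode" and "many modes" rules can be written as a small Python function. The function name `modes` and the choice to return an empty list for "no mode" are assumptions made for this illustration.

```python
from collections import Counter

def modes(values):
    """Return the list of modes, or [] when every value is equally frequent."""
    counts = Counter(values)
    highest = max(counts.values())
    # If several distinct values all appear equally often, we say "no mode".
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    # Otherwise every value that reaches the highest frequency is a mode.
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([1, 1, 2, 2, 3, 3, 4, 4]))                  # []
print(modes([1, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 8, 9]))   # [3, 6]
```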
Question 1 of 5
We want to summarize the number of dogs our friends have into a single
number. We will use the measures of center for this problem. Ashley has 1
dog, Steve has 1 dog, Jeff has 2 dogs, Kylie has 3 dogs, and Lisa has 8 dogs.
There is no measure of center that is always best, so we need to try all three
to see what makes sense in this situation.
What is the mean, median, and mode for the number of dogs our friends
have?
Mean=3, median=2, and mode=1
Nice! You know your measures of center! The mode is the most frequent value in our
dataset.
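If you want to check the dog example with code, one option is Python's built-in `statistics` module. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import statistics

dogs = [1, 1, 2, 3, 8]  # Ashley, Steve, Jeff, Kylie, Lisa

mean_dogs = statistics.mean(dogs)       # (1 + 1 + 2 + 3 + 8) / 5
median_dogs = statistics.median(dogs)   # middle value once the data are ordered
mode_dogs = statistics.mode(dogs)       # most frequent value

print(mean_dogs, median_dogs, mode_dogs)  # 3 2 1
```

Notice how Lisa's 8 dogs pull the mean above the median, which is one reason to look at all three measures.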
Question 4 of 5
For the dataset below match the correct measure to the value:
Nice! That's right there are two modes in this dataset. If all the values appear the same
number of times, we usually say there is no mode. However, if more than one value
appears the most number of times, we count all of these values as modes.
What is Notation?
Notation
Notation is a common language used to communicate mathematical
ideas. Think of notation as a universal language used by academic and
industry professionals to convey mathematical ideas. In the next videos,
you might see things that seem confusing. Use the quizzes to assist with your
understanding of the concepts.
You likely already know some notation. Plus, minus, multiply, division, and
equal signs all have mathematical symbols that you are likely familiar with.
Each of these symbols replaces an idea for how numbers interact with one
another. In the coming concepts, you will be introduced to some additional
ideas related to notation. Though you will not need to use notation to complete
the project, it does have the following properties:
If you aren't familiar with spreadsheets, this will be covered in detail in future
lessons. Spreadsheets are a common way to hold data. They are composed
of rows and columns. Rows run horizontally, while columns run vertically.
Each column in a spreadsheet commonly holds a specific variable, while
each row is commonly called an instance or individual.
Time Spent On Site (X)
5
10
20
Random Variables
We might have the random variable X, which is a holder for the possible
values of the amount of time someone spends on our site. Or the random
variable Y, which is a holder for the possible values of whether or not an
individual purchases a product.
X is 'a holder' of the values that could possibly occur for the amount of time
spent on our website. Any number from 0 to infinity really.
Example 1
For example, the amount of time someone spends on our site is a random
variable (we are not sure what the outcome will be for any particular visitor),
and we would notate this with X. Then when the first person visits the website,
if they spend 5 minutes, we have now observed this outcome of our random
variable. We would notate any outcome as a lowercase letter with a subscript
associated with the order that we observed the outcome.
If 5 individuals visit our website, and the first spends 10 minutes, the second
spends 20 minutes, the third spends 45 minutes, the fourth spends 12 minutes,
and the fifth spends 8 minutes, we can notate this problem in the following
way:
$x_1 = 10$, $x_2 = 20$, $x_3 = 45$, $x_4 = 12$, $x_5 = 8$
Example 2
We could find this in the above example by noticing that only one of the 5
observations exceeds 20. So, we would say there is a 1 (the 45) in 5 or
20% chance that an individual spends more than 20 minutes on our website
(based on this dataset).
Example 3
We could then find this by noticing there are two out of the five individuals that
spent 20 or more minutes on the website. So this probability is 2 out of 5 or
40%.
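The two proportions in Examples 2 and 3 can be reproduced with a short Python sketch using the five observed times. The helper name `proportion` is an assumption for this illustration.

```python
times = [10, 20, 45, 12, 8]  # observed minutes on the site for 5 visitors

def proportion(values, condition):
    """Fraction of observations for which condition(value) is true."""
    return sum(1 for v in values if condition(v)) / len(values)

p_more_than_20 = proportion(times, lambda x: x > 20)   # only 45 qualifies
p_at_least_20 = proportion(times, lambda x: x >= 20)   # 20 and 45 qualify

print(p_more_than_20, p_at_least_20)  # 0.2 0.4
```

The only difference between the two results is whether the visitor who spent exactly 20 minutes is counted.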
$Y$ = Department
$Z$ = Part/Full-Time
Match the following notation to their corresponding:
A. $x_1$
B. $y_2$
C. $z_3$
D. $n$
Quiz Question
Use the information above to match the correct notation label to its
corresponding value.
These are the correct matches.
Notation
Value
A Better Way?
Notation for Calculating the Mean
We know that the mean is calculated as the sum of all our values divided by
the number of values in our dataset.
In our current notation, adding all of our values together can be extremely
tedious. If we want to add 3 values of some random variable together, we
would use the notation:
$x_1 + x_2 + x_3$
If we want to add 6 values together, we would use the notation:
$x_1 + x_2 + x_3 + x_4 + x_5 + x_6$
To extend this to add one hundred, one thousand, or one million values would
be ridiculous! How can we make this easier to communicate?!
Summation
Aggregations
An aggregation is a way to turn multiple numbers into fewer numbers
(commonly one number).
Summation is a common aggregation. The notation used to sum our values
is the Greek letter sigma, $\Sigma$.
Example 1
If we want to sum the first three values together in our previous notation, we
write:
$x_1 + x_2 + x_3$
In our new notation, we can write:
$\sum_{i=1}^{3} x_i$.
Notice, our notation starts at the first observation ($i = 1$) and ends at 3
(the number at the top of our summation).
$\sum_{i=1}^{3} x_i = x_1 + x_2 + x_3 = 10 + 20 + 45 = 75$
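For readers who find code easier to follow than symbols, here is a small Python sketch of the same sum. The helper `sigma` is a hypothetical name; it takes the bottom and top numbers of the summation sign and uses the five website times from the earlier example.

```python
x = [10, 20, 45, 12, 8]  # x_1 ... x_5 from the website example

def sigma(values, start, stop):
    """Sum x_start through x_stop, using 1-based inclusive indices like the notation."""
    # Python lists are 0-based, so x_i is stored at values[i - 1].
    return sum(values[i - 1] for i in range(start, stop + 1))

print(sigma(x, 1, 3))  # 10 + 20 + 45 = 75
```

The index `i` plays the same role as the loop variable: it starts at the bottom number and stops at the top number.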
Example 2
$x_7 + x_8 + x_9$
In our new notation, we can write:
$\sum_{i=7}^{9} x_i$.
Notice, our notation starts at the seventh observation ($i = 7$) and ends at
9 (the number at the top of our summation).
Other Aggregations
The $\Sigma$ sign is used for aggregating using summation, which is one of the
most common ways to aggregate. However, we might need to aggregate in other ways.
If we wanted to multiply all of our values together, we would use a product
sign $\Pi$, the capital Greek letter pi. The way we aggregate continuous
values is with something known as integration (a common technique in
calculus), which uses the symbol $\int$, which is just a long s. We will
not be using integrals or products for quizzes in this class, but you may see
them in the future!
Notation for the Mean
$\frac{1}{n}\sum_{i=1}^{n} x_i$
Instead of writing out all of the above, we commonly write $\bar{x}$ to represent
the mean of a dataset. As in the first video, we could use any
variable. Therefore, we might also write $\bar{y}$, or any other letter.
We also could index using any other letter, not just $i$. We could just as easily
use $j$, $k$, or $m$ to index each of our data values. The quizzes on the next
concept will help reinforce this idea.
Notice
$x_1 = 5$
$x_2 = 15$
$x_3 = 3$
$x_4 = 3$
$x_5 = 8$
$x_6 = 10$
$x_7 = 12$
These are the correct matches.
Expression                            Value
$n$                                   7
$\sum_{i=1}^{n} x_i$                  56
$\sum_{j=2}^{7} x_j + 6$              57
$x_5$                                 8
$\frac{\sum_{i=3}^{6} x_i}{n-1}$      4
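Each of these matches can be checked with a few lines of Python. This is a sketch only: the helper `sigma` is a hypothetical name, and the expression with "+ 6" is read as adding 6 once after summing, which is the reading that agrees with the value 57.

```python
x = [5, 15, 3, 3, 8, 10, 12]  # x_1 ... x_7
n = len(x)

def sigma(values, start, stop):
    """Sum x_start through x_stop, using 1-based inclusive indices."""
    return sum(values[i - 1] for i in range(start, stop + 1))

total = sigma(x, 1, n)                     # sum of all seven values
shifted = sigma(x, 2, 7) + 6               # sum of x_2..x_7, then add 6
fifth = x[5 - 1]                           # x_5 (lists are 0-based)
partial_ratio = sigma(x, 3, 6) / (n - 1)   # (x_3 + x_4 + x_5 + x_6) / (n - 1)

print(n, total, shifted, fifth, partial_ratio)  # 7 56 57 8 4.0
```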
For the below quiz, let the following letters denote the corresponding notation:
A. $\sum_{i=1}^{n} x_i$
B. $\frac{\sum_{i=1}^{n} x_i}{n}$
C. $\bar{x}$
D. $\bar{y}$
E. $\frac{\sum_{j=1}^{n} y_j}{n}$
Question 2 of 2
If we wanted to provide notation for the mean of a particular dataset, which of the following
letters would correspond to the notation attached to calculating the mean? (Mark all that apply.)
Notation Recap
Notation is an essential tool for communicating mathematical ideas. We have
introduced the fundamentals of notation in this lesson that will allow you to
read, write, and communicate with others using your new skills!
$\frac{1}{n}\sum_{i=1}^{n} x_i$
In the next section, you will see this notation used to assist in your
understanding of calculating various measures of spread. Notation can take
time to fully grasp. Understanding notation not only helps in conveying
mathematical ideas but also in writing computer programs - if you decide you
want to learn that too! Soon you will analyze data using spreadsheets. When
that happens, many of these operations will be hidden by the functions you
will be using. But until we get to spreadsheets, it is important to understand
how mathematical ideas are commonly communicated. This isn't easy, but
you can do it!
Lesson Recap
This lesson covered some of the foundational statistical topics needed to use
statistics in practice. You can now:
Throughout this lesson, you will learn how to calculate these, as well as why
we would use one measure of spread over another.
Histograms
Histograms are super useful for understanding the different aspects of data
and they are the most common visual used for quantitative data. In the
upcoming concepts, you will see histograms used all the time to help you
understand the four aspects we outlined earlier regarding a quantitative
variable:
center
spread
shape
outliers
How are Histograms constructed?
First, we need to bin our data. Each bin represents a range of values in a
dataset. The number of values that fall in the range of each bin determines the
height of each histogram bar. As shown in the video above, changing the
range of our bins can result in slightly different visuals. However, there is no
right or wrong answer in choosing how to bin, and in most cases, the software
you use will choose the appropriate bins for you.
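To make binning concrete, here is a minimal Python sketch that counts how many values fall in each bin for two different bin widths. The data are made up for illustration, and the helper `bin_counts` is a hypothetical name; real plotting software usually handles this step for you.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count values per bin, where each bin covers [start, start + bin_width)."""
    # Integer division finds which bin a value belongs to; multiplying back gives the bin's start.
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical number of dogs seen per day (made up for illustration).
dogs_seen = [1, 2, 2, 3, 4, 4, 5, 7, 8, 11]

print(bin_counts(dogs_seen, 2))  # {0: 1, 2: 3, 4: 3, 6: 1, 8: 1, 10: 1}
print(bin_counts(dogs_seen, 5))  # {0: 6, 5: 3, 10: 1}
```

Each key is the start of a bin and each count is the height of that histogram bar. The same data look smoother or lumpier depending on the bin width.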
Visually, the difference between the histograms is the range or spread of dogs Josh
sees during each time period. In the upcoming lessons, we will discuss the most
common ways to measure the spread of our data.
Range
The range is then calculated as the difference between the maximum and
the minimum.
IQR
The interquartile range is calculated as the difference
between $Q_3$ and $Q_1$.
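The range and IQR can be sketched in Python as follows. This sketch assumes the "median of each half" way of finding quartiles (the median itself is left out of both halves when the number of values is odd); other software may use slightly different quartile rules. The data and the function name `five_number_summary` are made up for illustration.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves method."""
    data = sorted(values)              # don't forget to order the values
    n = len(data)
    half = n // 2
    lower = data[:half]                # values below the median position
    upper = data[half + (n % 2):]      # values above the median position
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

# Hypothetical data (made up for illustration).
values = [5, 8, 3, 2, 1, 3, 10, 12, 6, 9, 7]
low, q1, q2, q3, high = five_number_summary(values)

data_range = high - low   # maximum minus minimum
iqr = q3 - q1             # Q3 minus Q1

print(low, q1, q2, q3, high)  # 1 3 6 9 12
print(data_range, iqr)        # 11 6
```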
In the upcoming sections, you will practice this with Katie and on your own.
The third quartile doesn't look right. You should be finding the median of the
five largest values. Don't forget to order the values!
What if We Only Want One Number?
Looking back at the histograms Josh created for the number of dogs he
recorded seeing on weekdays and weekends, we can use the histograms to
mark the values of the 5 number summary and create a box plot.
Box plots are useful for quickly comparing the spread of two data sets across
some key metrics, like quartiles, maximum, and minimum.
1. The beginning of the line to the left of the box and the end of the line to the
right of the box represent the minimum and maximum values in a dataset.
2. The visual distance between these markings is an indication of the range of
the values.
3. The box itself represents the IQR. The box begins at the Q1 value, ends at the
Q3 value, and Q2, or the median, is represented by a line within the box.
From both the histograms and box plots, we can see that the number of dogs seen on weekends
varies much more than on weekdays.
However, instead of depending on a visual of the 5 number summary to compare our data, in the
next lesson, we will learn about using a single value to compare the two distribution spreads -
standard deviation.
The standard deviation is one of the most common measures for talking
about the spread of data. It can be thought of as roughly the average distance
of each observation from the mean.
In the above video, we saw this as how far individuals were from the average
distance from work. (The distances shown are examples from the full data set;
the mean of just those 4 numbers is 38.5. The mean of 18 shown later in the
video is the mean of the full data set, which is not shown in the video.)
In the next video, you will see exactly how this is calculated.
1. First, calculate the mean of the four values (10, 14, 10, 6):
$\bar{x} = \frac{\sum_{i=1}^{4} x_i}{n} = \frac{40}{4} = 10$
2. Next, calculate the distance of each observation from the mean and square
the value:
$(x_i - \bar{x})^2$:
$(10 - 10)^2 = 0^2 = 0$
$(14 - 10)^2 = 4^2 = 16$
$(10 - 10)^2 = 0^2 = 0$
$(6 - 10)^2 = (-4)^2 = 16$
3. Then calculate the variance, the average squared difference of each
observation from the mean:
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{4}(0 + 16 + 0 + 16) = \frac{32}{4} = 8$
4. Finally, calculate the standard deviation, the square root of the variance:
$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2} = \sqrt{8} \approx 2.83$
The standard deviation is, roughly, how far each point in our dataset is
from the mean on average.
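The same four steps can be followed in Python. This is a minimal sketch using the four values from the calculation above (10, 14, 10, 6); the variable names are chosen here for illustration.

```python
import math

data = [10, 14, 10, 6]
n = len(data)

mean = sum(data) / n                               # step 1: the mean
squared_diffs = [(x - mean) ** 2 for x in data]    # step 2: squared distance from the mean
variance = sum(squared_diffs) / n                  # step 3: average squared distance
std_dev = math.sqrt(variance)                      # step 4: square root of the variance

print(mean, squared_diffs, variance, round(std_dev, 2))
# 10.0 [0.0, 16.0, 0.0, 16.0] 8.0 2.83
```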
Quiz: Measures of Spread (Calculation and Units)
Question 1 of 2
If we measure the variance associated with our sales in dollars for each month for 3 years, what
are the units associated with the variance?
Dollars squared
That's right - the units of the variance are the square of the original units of your data.
Question 2 of 2
Remember, to find the variance we first find the mean of the values,
then subtract the mean from each value, then square each of these values,
then add them up, then divide by the number of values. (Round your answer
to two decimal places at the end of your calculation - don't round along the
way.)
1, 5, 10, 3, 8, 12, 4
Variance 13.55
SD=3.68
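One way to double-check this answer is with Python's `statistics` module. The sketch below uses `pvariance` and `pstdev`, which divide by the number of values just as described above.

```python
import statistics

values = [1, 5, 10, 3, 8, 12, 4]

variance = statistics.pvariance(values)   # average squared distance from the mean
std_dev = statistics.pstdev(values)       # square root of that variance

# Round only at the end, as the question asks.
print(round(variance, 2), round(std_dev, 2))  # 13.55 3.68
```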
In the previous sections, we have seen how to calculate the values associated
with the five-number summary (min, $Q_1$, $Q_2$, $Q_3$, max), as well
as the measures of spread associated with these values (range and IQR).
For datasets that are not symmetric, the five-number summary and a
corresponding box plot are a great way to get started with understanding the
spread of your data. Although I still prefer a histogram in most cases, box
plots can make it easier to compare two or more groups. You will see this in
the quizzes towards the end of this lesson.
Two additional measures of spread that are used all the time are
the variance and standard deviation. At first glance, the variance and
standard deviation can seem overwhelming. If you do not understand the
expressions below, don't panic! In this section, I just want to give you an
overview of what the next sections will cover. We will walk through each of
these parts thoroughly in the next few sections, but the big picture goal is to
generally understand the following:
Calculation
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
The variance is the average squared difference of each observation from
the mean.
The standard deviation is the square root of the variance. Therefore, the
formula for the standard deviation is the following:
$\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
In the same spreadsheet as above, to find the standard deviation of our same
set of 10 data values, we would use another cell like C13 to take the square
root of our variance measure, by typing in =sqrt(C12).
The standard deviation is a measurement that has the same units as our
original data, while the units of the variance are the square of the units in our
original data. For example, if the units in our original data were dollars, then
units of the standard deviation would also be dollars, while the units of the
variance would be dollars squared.
These applications are beyond the scope of this lesson as they pertain to
specific fields, but know that understanding the spread of a particular set of
data is extremely important to many areas. In this lesson, you mastered the
calculation of the most common measures of spread.
Quiz: Standard Deviation and Variance
Question 1 of 3
Assume d1 and d2 are datasets both measured in the same units. We know
that the standard deviation of d1 is 5 and the variance of d2 is 36. Which of the
following are certainly true? Mark all that apply.
Remember the Standard Deviation is the square root of the variance. So if the
Variance is 4 the Standard Deviation would be 2
Think of two histograms that look the same, but that are centered on different
values. Will they have the same mean? Will they have the same variance?
Remember the units are identical. Think about what the standard deviation
and variance would be for each dataset.
This is tricky - does having a larger range necessarily indicate a dataset has a
larger variance?
That's right! We can only talk about specific measures of spread, and not measures of
center. Additionally, the range isn't directly associated with the standard deviation, so
we can't make a claim that is always true like the final option.
- The variance for d2 is larger than for d1
- The standard deviation of d2 is larger than for d1
Question 2 of 3
That's right! Since the standard deviation is a measure of spread, a zero value suggests
that all of our data points are the same value.
Question 3 of 3
For each of the below: If the statement is true, mark the box next to the
statement.
Oops! That isn't quite right. Remember the standard deviation is always the
square root of the variance.
Oops! Just because two investment options have the same mean return, this
does not mean they have the same risk of gain or loss. What else might we
want to know?
This one is tricky! The standard deviation of two investments could be the
same despite having different maximums. Consider two datasets: 1, 2, 3, 4
and 5, 6, 7, 8. Different max, but same st. dev.
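This hint is easy to confirm with a short Python sketch using the two datasets it mentions.

```python
import statistics

a = [1, 2, 3, 4]
b = [5, 6, 7, 8]   # every value in a, shifted up by 4

# Shifting every value by the same amount moves the center and the maximum,
# but the distances from the mean (and so the spread) stay the same.
print(max(a), max(b))                                # 4 8
print(statistics.pstdev(a) == statistics.pstdev(b))  # True
```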
Returns by Year
Investment      Year 1  Year 2  Year 3  Year 4  Year 5  Year 6
Investment 1    5%      5%      5%      5%      5%      5%
Investment 2    12%     -2%     10%     0%      7%      3%
The returns for 6 consecutive years for each investment are shown above.
Use this information to answer the questions below.
Question 1 of 3
Use the information above to match the mean/expected return for each
investment.
8%
6%
4%
9%
5%
5%
These are the correct matches.
Investment      Return
Investment 1    5%
Investment 2    5%
Investment Data
In the previous two questions, you should have found that these investments
have the same mean! That is, regardless of which investment opportunity you
choose, you are expected to earn the same amount. So how are they
different? Let's look at some additional questions to see if we can find some
differences.
Oops! Beyond knowing the mean, we should also consider the standard deviation (and
other values associated with the spread) for the return on our investments.
That's right! Because the return is the same year over year for Investment 1, it has 'no
spread' or a standard deviation of 0. This smaller standard deviation is associated with
smaller risk. Understanding the spread of values we could earn is just as important as
understanding the expected return (mean return).
Question 3 of 3
Based on the observed data, which of the above two investments has the best
opportunity of earning more than 7%?
That's right! Only Investment 2 has earned more than 7%, so it is more likely (with a 2/6 = 1/3
chance), whereas Investment 1 has a 0/6 chance of earning more than 7% based on our
observed data.
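The answers to the three investment questions can be reproduced with a short Python sketch. The returns are typed in as percentages from the table, and the helper name `share_above` is an assumption for this illustration.

```python
import statistics

investment_1 = [5, 5, 5, 5, 5, 5]       # yearly returns, in percent
investment_2 = [12, -2, 10, 0, 7, 3]

def share_above(returns, threshold):
    """Fraction of years in which the return was strictly above the threshold."""
    return sum(1 for r in returns if r > threshold) / len(returns)

for name, returns in [("Investment 1", investment_1), ("Investment 2", investment_2)]:
    mean_return = statistics.mean(returns)
    spread = statistics.pstdev(returns)      # standard deviation, dividing by n
    print(name, mean_return, round(spread, 2), round(share_above(returns, 7), 2))
# Investment 1 5 0.0 0.0
# Investment 2 5 5.1 0.33
```

Both investments have the same mean, so the standard deviation and the share of years above 7% are what tell them apart.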
Useful Insight
The above example is a simplified version of the real world but does point out
something useful that you may have heard before. Notice that if you were not fully
invested in Investment 1 or in Investment 2, but instead were diversified across
both, you would keep the same expected return while seeing smaller year-to-year
swings than with Investment 2 alone. This is the benefit of diversifying your
portfolio for long-term gains. For short-term gains, you might not need or want
to diversify. You could get lucky and hit short-term gains associated with the
upswings (12%, 10%, or 7%) of Investment 2. However, you might also get
unlucky, and hit a down term and earn nothing or even lose money on your
investment using this same strategy.
Final Quiz on Measures of Spread
Question 1 of 2
For the following dataset, match each value to the appropriate label:
Term              Value
n                 13
median            7
first quartile    3
third quartile    13.5
mean              8.4
mode              3
Question 2 of 2
For the following dataset, match each value to the appropriate label:
Term                   Value
interquartile range    10.5
range                  20
variance               33.9
standard deviation     5.8
minimum                2
maximum                22
Variable Types
We have covered a lot up to this point! We started with identifying data types
as either categorical or quantitative. We then learned we could identify
quantitative variables as either continuous or discrete. We also found we could
identify categorical variables as either ordinal or nominal.
Categorical Variables
Quantitative Variables
1. Measures of Center
2. Measures of Spread
3. Shape of the Distribution
4. Outliers
Measures of Center
1. Means
2. Medians
3. Modes
Measures of Spread
1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance
Calculating Variance
$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
You will also see:
$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
The reason for this is beyond the scope of what we have covered thus far, but
you can find an explanation here.
You can commonly find answers to your questions with a quick Google
search. Now is a great time to get started with this
practice! This answer should make more sense at the completion of this
lesson.
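If you are curious how the two versions differ in practice, Python's `statistics` module offers both: `pvariance` divides by n, and `variance` divides by n - 1. The sketch below uses the four values 10, 14, 10, and 6 purely as an example.

```python
import statistics

data = [10, 14, 10, 6]   # squared differences from the mean add up to 32

population_variance = statistics.pvariance(data)   # 32 / 4
sample_variance = statistics.variance(data)        # 32 / 3

print(population_variance, round(sample_variance, 2))  # 8 10.67
```

Dividing by n - 1 always gives a slightly larger value, and the gap shrinks as the number of values grows.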
The standard deviation is the square root of the variance. In practice, you
usually use the standard deviation rather than the variance. The reason for
this is because the standard deviation shares the same units with our original
data, while the variance has squared units.
What Next?
In the next sections, we will be looking at the last two aspects of quantitative
variables: shape and outliers. What we know about measures of center and
measures of spread will assist in your understanding of these final two
aspects.
Shape
Histograms
We learned how to build a histogram in this video, as this is the most popular
visual for quantitative data.
Shape
From a histogram, we can quickly identify the shape of our data, which
influences all of the measures we learned in the previous concepts. We learned
that the distribution of our data is frequently associated with one of the
three shapes:
1. Right-skewed
2. Left-skewed
3. Symmetric (frequently normally distributed)
Summary
Shape                Mean vs. Median            Real-World Applications
Symmetric (Normal)   Mean equals Median         Height, Weight, Errors, Precipitation
Right-skewed         Mean greater than Median   Amount of drug remaining in a bloodstream, Time between
                                                phone calls at a call center, Time until light bulb dies
Left-skewed          Mean less than Median      Grades as a percentage in many universities, Age of death,
                                                Asset price changes
The mode of a distribution is essentially the tallest bar in a histogram. There
may be multiple modes depending on the number of peaks in our histogram.
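The relationship between the mean and the median for each shape can be seen with a small Python sketch. The three datasets are made up for illustration, and the function name `compare_center` is an assumption. Keep in mind this comparison is a rule of thumb: a histogram is still the best way to judge shape.

```python
import statistics

def compare_center(values):
    """Return (mean, median, relation) for a list of numbers."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        relation = "mean > median"
    elif mean < median:
        relation = "mean < median"
    else:
        relation = "mean = median"
    return mean, median, relation

# Hypothetical datasets (made up for illustration).
right_skewed = [1, 1, 2, 2, 3, 4, 20]        # long tail of large values
left_skewed = [1, 17, 18, 19, 19, 20, 20]    # long tail of small values
symmetric = [1, 2, 3, 4, 5]

for data in (right_skewed, left_skewed, symmetric):
    print(compare_center(data)[2])
# mean > median
# mean < median
# mean = median
```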
The Shape For Data In The World
When working with data, building a quick plot lets you quickly see the shape
of your data.
Distribution Shape   Types of Data
Bell Shaped          Heights, Weight, Scores
Left Skewed          GPA, Age of Death, Price
Right Skewed         Distribution of Wealth, Athletic Abilities
References
These are the references used to pull the applications of each shape.
Quora
University of Texas
Stack Exchange
Question 2 of 2
For every dataset the mean equals the median, so every data set is normally
distributed.
Oops! This one is tricky. If data are normally distributed, the mean must equal
the median. However, any perfectly symmetric distribution will have a mean
equal to the median.
Oops! Remember a box-plot has 5 lines that relate to the minimum, first
quartile, second quartile, third quartile, and maximum.
Oops! All normal distributions are perfectly symmetric. What do you think that
means about how the mean relates to the median?
Oops! Histograms and box plots are used only for quantitative data. We have not
discussed any plots for categorical variables, but more on that soon!
Identifying Outliers
There are a number of different techniques for identifying outliers. A full paper
on this topic is provided here. In general, I usually just
look at a picture and see if something looks suspicious!
3. Understanding why they exist, and the impact on questions we are trying to
answer about our data.
4. Reporting the 5 number summary values is often a better indication than
measures like the mean and standard deviation when we have outliers.
3. If no outliers and your data follow a normal distribution - use the mean and
standard deviation to describe your dataset, and report that the data are
normally distributed.
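To see why the 5 number summary is often preferred when outliers are present, here is a minimal Python sketch comparing the same made-up data with and without one extreme value. The data and the helper name `summarize` are assumptions for illustration.

```python
import statistics

def summarize(values):
    """Return the mean, median, and standard deviation of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.pstdev(values),
    }

# Hypothetical data (made up for illustration).
typical = [10, 11, 12, 13, 14]
with_outlier = typical + [100]   # the same data plus one extreme value

before = summarize(typical)
after = summarize(with_outlier)

print(before["mean"], before["median"])           # 12 12
print(round(after["mean"], 2), after["median"])   # 26.67 12.5
```

One extreme value more than doubles the mean and inflates the standard deviation, while the median barely moves.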
Side note
If you aren't sure if your data are normally distributed, there are plots
called normal quantile plots and statistical methods
like the Kolmogorov-Smirnov test that are designed to
help you understand whether or not your data are normally distributed.
Implementing this test is beyond the scope of this class, but it is worth
knowing that these tools exist.
The IQR is the distance between the first and third quartiles, which are the edges of the
box. They are about 4.8 for the first quartile and 5.2 for the third.
Image Summary
In the below image, we have three box-plots. Each box-plot is for a different
Iris flower: setosa, versicolor, or virginica. On the y-axis, we are given the sepal
length. Notice that virginica has an outlier towards the bottom of the plot.
Therefore, the minimum is not given by the bottom line here; rather, it is
provided by this point.
Question 1 of 2
Match the appropriate Iris type to the statement(s) that are true for its Sepal
Length.
These are the correct matches.
Sepal Length
Iris Type
Oops! Remember the third quartile is the top part of the box in the box-plot.
Approximately Symmetric
Oops! We can tell the shape of our distribution by looking at the middle portion
of the box-plot. If it is square, the plot is symmetric. If it is more rectangular it
is skewed.
Question 2 of 2
Using the same flower data, select all of the below statements that MUST be
true.
[Figure: Box Plots of Sepal Length for 3 Iris Flower Species]
- More than 75% of the virginica flowers have a larger sepal
length than the largest setosa flower.
- More than 50% of setosa flowers have larger sepal length
than the shortest versicolor flower.
Notice that we cannot tell how many of each flower type were collected based on
this plot. However, we do know the max, min, and quartiles for each flower
type. Since all of the box-plots overlap along the y-axis, no one flower type
always has a larger sepal length than another.
Quiz Image
Question 1 of 5
Bar Chart
Box Plot
Histogram
Pie Chart
Question 2 of 5
Right skewed
Left skewed
Symmetric
Bi-modal
That's right! This is a bi-modal distribution. Notice it has two areas where there
are peaks in our dataset.
Question 3 of 5
Bar Chart
Box Plot
Histogram
Pie Chart
Question 4 of 5
Right skewed
Left skewed
Symmetric
Bi-modal
Question 5 of 5
That's right! Because the distribution is left-skewed, we expect the mean to be
less than the median.
Pay attention to the scale of these two graphs. The first is dealing with much
higher numbers.
The mean takes every value into account, so outliers pull the mean towards
them.
A left-skewed distribution has its long tail on the left: the graph starts with
a low frequency and then slopes up. A right-skewed distribution has its long
tail on the right: the graph starts with a high frequency and then slopes
down.
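The link between skew and the mean can be checked with a short sketch. Both
data sets below are hypothetical: the first has a long right tail, and the
second is its mirror image, which gives a long left tail.

```python
import statistics

# Hypothetical right-skewed data: mostly small values, long tail to the right.
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 6, 10, 21]
# Mirroring the values produces a left-skewed data set with a long left tail.
left_skewed = [22 - v for v in right_skewed]

for label, data in (("right-skewed", right_skewed),
                    ("left-skewed", left_skewed)):
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{label}: mean={mean}, median={median}")
```

In the right-skewed data the large values in the tail pull the mean above the
median; in the left-skewed data the small values pull the mean below it.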
Variable Types
We have covered a lot up to this point! We started with identifying data types
as either categorical or quantitative. We then learned we could identify
quantitative variables as either continuous or discrete. We also found we could
identify categorical variables as either ordinal or nominal.
Categorical Variables
Quantitative Variables
1. Measures of Center
2. Measures of Spread
3. Shape of the Distribution
4. Outliers
Measures of Center
1. Means
2. Medians
3. Modes
Measures of Spread
1. Range
2. Interquartile Range
3. Standard Deviation
4. Variance
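As a quick refresher, the measures of center and spread listed above can all
be computed with Python's standard library. This is a minimal sketch on
made-up data. Note that statistics.variance and statistics.stdev use the
sample formulas (dividing by n - 1), while statistics.pvariance and
statistics.pstdev divide by n; which one you want depends on whether your data
are a sample or the whole population.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical values

# Measures of center
center = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
}

# Measures of spread
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
spread = {
    "range": max(data) - min(data),
    "iqr": q3 - q1,
    "sample_variance": statistics.variance(data),       # divides by n - 1
    "sample_std_dev": statistics.stdev(data),
    "population_variance": statistics.pvariance(data),  # divides by n
    "population_std_dev": statistics.pstdev(data),
}

print(center)
print(spread)
```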
Shape
We learned that the distribution of our data is frequently associated with one
of the three shapes:
1. Right-skewed
2. Left-skewed
3. Symmetric
Outliers
We learned that outliers have a larger influence on measures like the mean
than on measures like the median. We learned that we should handle outliers on
a case-by-case basis. Common techniques include:
3. Understand why they exist, and the impact on questions we are trying to
answer about our data.
We also looked at histograms and box plots to visualize our quantitative data.
Identifying outliers and the shape associated with the distribution of our data
are easier when using a visual as opposed to using summary statistics.
What Next?
Descriptive Statistics
Descriptive statistics is about describing our collected data.
Inferential Statistics
Inferential statistics is about using our collected data to draw conclusions
about a larger population.
Identify the population, parameter, sample, and statistic for the below
scenario:
Term          Description
Population    All Udacity students
Parameter     We cannot know for sure.
Sample        5,000 Udacity students
Statistic     6.8 hours of sleep
Essentially, we have two populations: one is all the bagels from our
competitor, and the other is all the bagels from our shop. We know the
diameter of all the bagels at our shop, so this is a parameter. The 100 bagels
from the competitor are a sample, and the numeric summary from that sample,
6 inches, is a statistic.
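The difference between a parameter and a statistic can be made concrete with a
small simulation. This is only a sketch: the population below is generated
artificially, and its size, average, and spread are made-up assumptions. In
practice we rarely have access to the whole population, which is exactly why
we rely on a sample.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch gives the same result every run

# Hypothetical population: hours of sleep for 100,000 people.
population = [random.gauss(7.0, 1.2) for _ in range(100_000)]

# The parameter summarizes the entire population (usually unknown in practice).
parameter = statistics.mean(population)

# The statistic summarizes only the sample we actually collected.
sample = random.sample(population, 5_000)
statistic = statistics.mean(sample)

print("population mean (parameter):", round(parameter, 2))
print("sample mean (statistic):", round(statistic, 2))
```

The sample mean will usually land close to the population mean without
matching it exactly. Inferential statistics is about reasoning from the
statistic back to the parameter.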
Looking Ahead
Though we will not be diving deep into inferential statistics within this course,
you are now aware of the difference between these two branches of statistics.
If you have ever conducted a hypothesis test or built a confidence interval,
you have performed inferential statistics. The way we perform inferential
statistics is changing as technology evolves. Many career paths
involving Machine Learning and Artificial Intelligence are aimed at using
collected data to draw conclusions about entire populations at an individual
level. It is an exciting time to be a part of this space, and you are now well on
your way to joining the other practitioners!
Lesson Recap
Lesson Review
Congratulations on completing this lesson on descriptive statistics. You
learned some foundational metrics for understanding data, including how to:
Intro