Data Science Unit 2 Notes
Types of Data - Types of Variables - Describing Data with Tables and Graphs - Describing Data
with Averages - Describing Variability - Normal Distributions and Standard (z) Scores.
I. TYPES OF DATA
Ranked Data
A set of observations where any single observation is a number that indicates relative
standing. Ranked data consist of numbers (1st, 2nd, . . . 40th place) that represent relative
standing within a group.
Ex: Third place
Quantitative Data
A set of observations where any single observation is a number that represents an amount
or a count. Quantitative
data consist of numbers (weights of 238, 170, . . . 185 lbs) that represent an amount or a count.
Ex: Age, Family size, IQ Score, Temperature
II. TYPES OF VARIABLES
General Definition
A variable is a characteristic or property that can take on different values.
A continuous variable consists of numbers whose values, at least in theory, have no restrictions
(a continuous variable can take on infinitely many possible values).
Examples include amounts, such as weights of male statistics students.
Ex: Temperature is a Continuous variable
Height of boys in a class
Weight of Students in a class
Income
Age
Approximate Numbers
An approximate number, time, or position is close to the correct number, time, or position but is not
exact.
Whenever values are rounded off, as is always the case with actual values for continuous
variables, the resulting numbers are approximate.
Ex: A student whose weight is listed as 150 lbs could actually weigh between 149.5 and
150.5 lbs. In effect, any value for a continuous variable, such as 150 lbs, must be identified with
a range of values from 149.5 to 150.5 rather than with a solitary value.
Independent and Dependent Variables
Independent Variable
A variable that is manipulated to determine the value of the dependent variable.
In an experiment, an independent variable is the treatment manipulated by the investigator. The
independent variable is the variable the experimenter manipulates or changes, and it is assumed to
have a direct effect on the dependent variable.
Ex: The liquid used to water each plant
Dependent variable
It is the variable being tested and measured in an experiment.
When a variable is believed to have been influenced by the independent variable, it is called a
dependent variable. In an experimental setting, the dependent variable is measured, counted, or
recorded by the investigator.
Ex: Change in the height or health of plant.
Ex: Independent Variable – Stress (Cause); Dependent Variable – Mental state of a human being (Effect).
Confounding Variable
A confounding variable is a third variable that influences both the independent and the dependent variable.
III. DESCRIBING DATA WITH TABLES AND GRAPHS
An ungrouped frequency distribution displays the frequency of each individual data value rather
than groups of data values.
Ex: Suppose we conduct a survey in which we ask 15 households how many pets they have in
their home. The results are as follows:
1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8
Here’s an example of an ungrouped frequency distribution for our survey data:

Number of pets    Frequency
1                 4
2                 3
3                 2
4                 1
5                 2
6                 1
7                 1
8                 1

This type of frequency distribution allows us to directly see how often different values occurred
in our dataset. For example:
4 households have 1 pet.
3 households have 2 pets.
2 households have 3 pets.
And so on.
Ungrouped frequency distributions can be useful when you want to see how often each
individual value occurs in a dataset.
Note that ungrouped frequency distributions work best with small datasets in which there are
only a few unique values.
For example, in our survey data from earlier there were only 8 unique values so it made sense to
create an ungrouped frequency distribution.
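A minimal Python sketch (not part of the original notes) that builds this ungrouped frequency distribution with the standard library's collections.Counter:

```python
from collections import Counter

# Number of pets reported by each of the 15 surveyed households (data from above)
pets = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]

# Counter tallies how often each individual value occurs
freq = Counter(pets)

for value in sorted(freq):
    print(f"{value} pet(s): {freq[value]} household(s)")
```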
The easiest way to visualize the values in an ungrouped frequency distribution is to create
a frequency polygon, which displays the frequencies of each individual value in a simple chart.
Here’s what a frequency polygon would look like for our sample data:
Frequency polygon
This helps us quickly gain an understanding of how often each value occurs in the dataset.
Alternatively, we could create a bar chart to display the exact same data using bars rather than
a single line:
Bar chart
Both charts allow us to quickly understand the distribution of values in our dataset.
Consider the same survey data again: 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8
One way to summarize these results is to create a frequency distribution, which tells us how
frequently different values occur in a dataset.
Often we use grouped frequency distributions, in which we create groups of values and then
summarize how many observations from a dataset fall into those groups.
Here’s an example of a grouped frequency distribution for our survey data:

Number of pets    Frequency
1-2               7
3-4               3
5-6               3
7-8               2

We first created groups of size 2, then we counted how many individual observations from the
dataset fell in each group. For example:
7 households have 1 or 2 pets.
3 households have 3 or 4 pets.
And so on.
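As a sketch, the same grouping can be reproduced in Python; the interval boundaries below simply mirror the grouped table above:

```python
# Survey data from above: number of pets per household
pets = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]

# Class intervals of width 2, matching the grouped table above (an illustrative choice)
groups = [(1, 2), (3, 4), (5, 6), (7, 8)]

for low, high in groups:
    count = sum(low <= x <= high for x in pets)
    print(f"{low}-{high}: {count}")
```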
A related distribution is known as a relative frequency distribution, which shows the relative
frequency of each value in a dataset as a proportion (or percentage) of all frequencies.
For example, our survey covered 15 households in total. To find the relative frequency of each
value in the distribution, we simply divide each individual frequency by 15.
Each relative frequency should lie between 0 and 1, and all the relative frequencies should sum
to 1. If these conditions are not met, then the relative frequency distribution is not valid.
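A short Python sketch of the same calculation, including a check of the validity conditions just mentioned:

```python
from collections import Counter

pets = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8]

freq = Counter(pets)
total = sum(freq.values())  # 15 households in the survey

# Relative frequency = class frequency divided by the total number of observations
rel_freq = {value: count / total for value, count in sorted(freq.items())}
print(rel_freq)

# Check the validity conditions: each value lies between 0 and 1 and the total is 1
assert all(0 <= r <= 1 for r in rel_freq.values())
assert abs(sum(rel_freq.values()) - 1.0) < 1e-9
```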
The most common way to visualize a relative frequency distribution is to create a relative
frequency histogram, which displays the individual data values along the x-axis of a graph and
uses bars to represent the relative frequencies of each class along the y-axis.
For example, here’s what a relative frequency histogram would look like for the data in our
previous example:
The x-axis displays the number of pets in the household and the y-axis displays the relative
frequency of households that have that number of pets.
This histogram is a useful way for us to visualize the distribution of relative frequencies.
Cumulative frequency distributions show the total number of observations in each class and in all
lower-ranked classes.
Constructing Cumulative Frequency Distributions
To convert a frequency distribution into a cumulative frequency distribution, add to the
frequency of each class the sum of the frequencies of all classes ranked below it.
Types of Cumulative Frequency Distribution
The cumulative frequency distribution is classified into two different types namely: less
than cumulative frequency and more/greater than cumulative frequency.
Less Than Cumulative Frequency:
The less than cumulative frequency distribution is obtained by adding successively the
frequencies of all the previous classes along with the class against which it is written. In this
type, the cumulation begins at the lowest class and proceeds to the highest.
Greater Than Cumulative Frequency:
The greater than cumulative frequency is also known as the more than type cumulative
frequency. Here, the greater than cumulative frequency distribution is obtained by determining
the cumulative total frequencies starting from the highest class to the lowest class.
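The two cumulative types can be illustrated with a short Python sketch; the class frequencies below are assumed values, since the entries of Tables 1 and 2 are not reproduced in these notes:

```python
from itertools import accumulate

# Illustrative class frequencies for eight ordered classes
# (assumed values; not taken from the tables in the notes)
frequencies = [4, 3, 2, 1, 2, 1, 1, 1]

# Less-than type: running total accumulated from the lowest class upward
less_than = list(accumulate(frequencies))

# Greater-than type: running total accumulated from the highest class downward
greater_than = list(accumulate(frequencies[::-1]))[::-1]

print(less_than)     # [4, 7, 9, 10, 12, 13, 14, 15]
print(greater_than)  # [15, 11, 8, 6, 5, 3, 2, 1]
```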
To draw a cumulative frequency distribution graph of less than type, consider the following
cumulative frequency distribution table which gives the number of participants in any level of
essay writing competition according to their age:
Table 1: Cumulative frequency distribution table of less than type, with columns for Level of Essay, Age group, Number of participants (Frequency), Age group (class interval), and Cumulative Frequency.
To draw a cumulative frequency distribution graph of more than type, consider the same
cumulative frequency distribution table, which gives the number of participants in any level of
essay writing competition according to their age:
Table 2: Cumulative frequency distribution table of more than type, with the same columns: Level of Essay, Age group, Number of participants (Frequency), Age group (class interval), and Cumulative Frequency.
These graphs are helpful in figuring out the median of a given data set. The median can be found
by drawing both types of cumulative frequency distribution curves on the same graph. The value
of the point of intersection of both the curves gives the median of the given set of data. For the
data given in Table 1, the median is the age value at which the two curves intersect.
GRAPHS
GRAPHS FOR QUANTITATIVE DATA
Histograms
A bar-type graph for quantitative data. The common boundaries between adjacent bars
emphasize the continuity of the data, as with continuous variables.
The body of the histogram consists of a series of bars whose heights reflect the
frequencies for the various classes.
Equal units along the horizontal axis (the X axis, or abscissa) reflect the various
class intervals of the frequency distribution.
Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in
frequency.
The intersection of the two axes defines the origin at which both numerical scales
equal 0.
Numerical scales always increase from left to right along the horizontal axis and
from bottom to top along the vertical axis.
Frequency Polygon
A line graph for quantitative data that also emphasizes the continuity of continuous
variables.
TYPICAL SHAPES
Some of the more typical shapes for smoothed frequency polygons (which ignore the
inevitable irregularities of real data) are described below.
A. NORMAL
The familiar bell-shaped silhouette of the normal curve can be superimposed on many
frequency distributions, including those for uninterrupted gestation periods of human fetuses
and scores on standardized tests.
B. BIMODAL
A bimodal distribution reflects the coexistence of two different types of observations in the same distribution.
C. POSITIVELY SKEWED
A lopsided distribution that includes a few extreme observations in the positive direction (to the
right of the majority of observations) is a positively skewed distribution.
D. NEGATIVELY SKEWED
A lopsided distribution that includes a few extreme observations in the negative direction (to the
left of the majority of observations) is a negatively skewed distribution.
The distribution, based on replies to the question “Do you have a Facebook profile?”, appears as
a bar graph. A glance at this graph confirms that Yes replies occur approximately twice as often
as No replies.
A person’s answer to the question “Do you have a Facebook profile?” is either Yes or No, not
some impossible intermediate value, such as 40 percent Yes and 60 percent No. Gaps are placed
between adjacent bars of bar graphs to emphasize the discontinuous nature of qualitative data. A
bar graph also can be used with quantitative data to emphasize the discontinuous nature of a
discrete variable.
Bar Chart
IV. DESCRIBING DATA WITH AVERAGES
Statistics is the branch of mathematics dealing with the collection, analysis, interpretation, and
presentation of masses of numerical data. It has two broad areas:
Descriptive Statistics – tools for organizing and summarizing collections of actual observations.
Inferential Statistics – tools (a variety of tests and estimates) for generalizing beyond
collections of actual observations. This more advanced area is known as inferential statistics.
For example: an assertion about the relationship between job satisfaction and overall happiness.
MODE
The mode reflects the value of the most frequently occurring score.
It is the value in a set of numbers that appears most often.
Ex: Determine the mode of the following retirement ages: 60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63.
(Here, 63 occurs the most, so the mode is 63.)
MEDIAN
The median reflects the middle value when observations are ordered from least to
most.
The median splits a set of ordered observations into two equal parts, the upper and
lower halves.
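Both averages can be checked with Python's standard statistics module, using the retirement-ages example from the mode section (a sketch, not part of the original notes):

```python
import statistics

# Retirement ages from the mode example above
ages = [60, 63, 45, 63, 65, 70, 55, 63, 60, 65, 63]

print(statistics.mode(ages))    # 63 (most frequent value)
print(statistics.median(ages))  # 63 (middle value of the 11 ordered scores)
```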
MEAN
The mean is the most common average, one you have doubtless calculated many times.
The mean is found by adding all scores and then dividing by the number of scores. That is,

Mean = (sum of all scores) / (number of scores)

It’s usually more efficient to substitute symbols for words in statistical formulas,
including the word formula given above for the mean. When symbols are used, X̄ designates the
sample mean, and the formula becomes

X̄ = ΣX / n

and reads: “X-bar equals the sum of the variable X divided by the sample size n.” [Note that the
uppercase Greek letter sigma (Σ) is read as “the sum of,” not as “sigma.”]
The formula for the population mean differs from that for the sample mean only because
of a change in some symbols. In statistics, Greek symbols usually describe population
characteristics, such as the population mean, while English letters usually describe sample
characteristics, such as the sample mean. The population mean is represented by μ (pronounced
“mu”), the lowercase Greek letter m for mean:

μ = ΣX / N

where the uppercase letter N refers to the population size. Otherwise, the calculations are the
same as those for the sample mean.
Sample question: All 57 residents in a nursing home were surveyed to see how many meals a day
they eat.
1 meal (2 people)
2 meals (7 people)
3 meals (28 people)
4 meals (12 people)
5 meals (8 people)
What is the population mean for the number of meals eaten per day?
μ = [(1 × 2) + (2 × 7) + (3 × 28) + (4 × 12) + (5 × 8)] / 57
= (2 + 14 + 84 + 48 + 40) / 57
= 188 / 57
= 3.30 (approximately 3.3)
The population mean is 3.3.
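The same arithmetic, sketched in Python:

```python
# Expand the survey counts into one score per resident (57 scores in total)
meals = [1] * 2 + [2] * 7 + [3] * 28 + [4] * 12 + [5] * 8

# Population mean: sum of all scores divided by the population size N
mu = sum(meals) / len(meals)
print(len(meals), sum(meals), round(mu, 2))  # 57 188 3.3
```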
WHICH AVERAGE
If Distribution Is Not Skewed
When a distribution of scores is not too skewed, the values of the mode, median, and
mean are similar, and any of them can be used to describe the central tendency of the
distribution.
If Distribution Is Skewed
Positively skewed distribution
In a positively skewed distribution, the mean is greater than the median, and the distribution is
skewed in the positive direction (to the right).
Negatively skewed distribution
In a negatively skewed distribution, the mean is less than the median, and the distribution is
skewed in the negative direction (to the left).
V. DESCRIBING VARIABILITY
Descriptions of the amount by which scores are dispersed or scattered in a distribution.
This chapter describes several measures of variability, including the range, the interquartile
range, the variance, and most important, the standard deviation.
INTUITIVE APPROACH
RANGE
VARIANCE
STANDARD DEVIATION
DEGREES OF FREEDOM (df )
INTERQUARTILE RANGE (IQR)
a. INTUITIVE APPROACH
You probably already possess an intuitive feel for differences in variability.
Consider three frequency distributions, A, B, and C, each consisting of seven scores
with the same mean (10) but with different variabilities. Before reading on, rank the three
distributions from least to most variable. Your intuition was correct if you concluded that
distribution A has the least variability, distribution B has intermediate variability, and
distribution C has the most variability.
Ex:
For distribution A with the least (zero) variability, all seven scores have the same
value (10).
For distribution B with intermediate variability, the values of scores vary slightly
(one 9 and one 11), and
For distribution C with most variability, they vary even more (one 7, two 9s, two
11s, and one 13).
b. RANGE
The range is the difference between the largest and smallest scores.
Distribution A, the least variable, has the smallest range of 0 (from 10 to 10); distribution B,
the moderately variable, has an intermediate range of 2 (from 9 to 11); and distribution C, the
most variable, has the largest range of 6 (from 7 to 13), in agreement with our intuitive
judgments about differences in variability. The range is a handy measure of variability that can
readily be calculated and understood.
c. VARIANCE (a type of mean)
The variance is the mean of all squared deviation scores.
It is important particularly because of its square root, the standard deviation, since these
measures serve as key components for other important statistical measures.
Its value equals 0.00 for the least variable distribution, A; 0.29 for the moderately variable
distribution, B; and 3.14 for the most variable distribution, C, in agreement with our intuitive
judgments about the relative variability of these three distributions.
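A short Python sketch that verifies these three variance values (the score lists follow the description of distributions A, B, and C given above):

```python
def population_variance(scores):
    """Mean of all squared deviation scores."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores) / len(scores)

# Distributions A, B and C from the intuitive example (seven scores each, mean 10)
A = [10, 10, 10, 10, 10, 10, 10]
B = [9, 10, 10, 10, 10, 10, 11]
C = [7, 9, 9, 10, 11, 11, 13]

print(round(population_variance(A), 2))  # 0.0
print(round(population_variance(B), 2))  # 0.29
print(round(population_variance(C), 2))  # 3.14
```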
d. STANDARD DEVIATION
A rough measure of the average (or standard) amount by which scores deviate on either side of
their mean.
The standard deviation is the square root of the mean of all squared deviations from the
mean, that is,

Variance: σ² = SS / N, where SS = Σ(X − μ)² is the sum of squares
Standard Deviation: σ = √(SS / N)

Sample
Definition Formula: SS = Σ(X − X̄)²
Computational Formula: SS = ΣX² − (ΣX)² / n
Variance: s² = SS / (n − 1)
Standard Deviation: s = √(SS / (n − 1))
Sum of Squares Formulas for Population
Standard Deviation for Population σ
A rough measure of the average amount by which scores in the population deviate on either side
of their population mean.
Recall that, most generally, a mean is defined as the sum of all scores divided by the number of
scores. Since the variance is the mean of all squared deviation scores, it can be defined as the
sum of all squared deviation scores divided by the number of scores:

σ² = SS / N

where σ² (pronounced “sigma squared”) represents the population variance, SS is the sum of squared
deviations for the population, and N is the population size.
The definition formula provides the most accessible version of the population sum of squares:

SS = Σ(X − μ)²

where SS represents the sum of squares, Σ directs us to sum over the expression to its right, and
(X − μ)² denotes each of the squared deviation scores. “The sum of squares equals the sum of all
squared deviation scores.” You can reconstruct this formula by remembering the following three
steps:
1. Subtract the population mean, μ, from each original score, X, to obtain a deviation score, X − μ.
2. Square each deviation score, (X − μ)², to eliminate negative signs.
3. Sum all squared deviation scores, Σ(X − μ)².
Standard Deviation for Sample (s)
A rough measure of the average amount by which scores in the sample deviate on either side
of their sample mean.
Although the sum of squares term remains essentially the same for both populations and
samples, there is a small but important change in the formulas for the variance and standard
deviation for samples. This change appears in the denominator of each formula, where N, the
population size, is replaced not by n, the sample size, but by n − 1, as shown:

s² = SS / (n − 1)
s = √(SS / (n − 1))

where s² and s represent the sample variance and sample standard deviation, and SS is the sample
sum of squares.
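As a sketch, Python's statistics module exposes both versions: pstdev divides SS by N, while stdev divides SS by n − 1. Applied to distribution C from the intuitive example:

```python
import statistics

# Distribution C from the intuitive example
scores = [7, 9, 9, 10, 11, 11, 13]

# Population standard deviation: SS is divided by N
print(round(statistics.pstdev(scores), 2))  # 1.77

# Sample standard deviation: SS is divided by n - 1
print(round(statistics.stdev(scores), 2))   # 1.91
```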
e. DEGREES OF FREEDOM (df)
Degrees of freedom refers to the number of values that are free to vary, given one or more
mathematical restrictions on the data.
Ex: Consider three scores: A = 10, B = 5, C = 15.
Step 1: Mean = (10 + 5 + 15) / 3 = 30 / 3 = 10
Step 2: Find the deviation from the mean for A, B, and C.
A: 10 − 10 = 0
B: 5 − 10 = −5
C: 15 − 10 = ?
The deviations from the mean must always add up to 0, so once the first two deviations are known,
the third is fixed (here it must be +5). Only two of the three deviations are free to vary.
Hence, the degrees of freedom is 2 (that is, n − 1).
z SCORES
A unit-free, standardized score that indicates how many standard deviations a score is above or
below the mean of its distribution.
To obtain a z score, express any original score, whether measured in inches, milliseconds,
dollars, IQ points, etc., as a deviation from its mean (by subtracting its mean) and then split this
deviation into standard deviation units (by dividing by its standard deviation), that is,

z = (X − μ) / σ

where X is the original score and μ and σ are the mean and the standard deviation, respectively,
for the normal distribution of the original scores.
A z score consists of two parts:
1. a positive or negative sign indicating whether it’s above or below the mean; and
2. a number indicating the size of its deviation from the mean in standard deviation units.
Converting to z Scores
To answer the question about eligible FBI applicants, replace X with 66 (the maximum
permissible height), μ with 69 (the mean height), and σ with 3 (the standard deviation of heights)
and solve for z as follows:

z = (66 − 69) / 3 = −3 / 3 = −1
Ex: A score of 470 on the SAT math test, given a mean of 500 and a standard deviation of 100:
z = (470 − 500) / 100
= −30 / 100
= −0.30
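Both worked examples can be checked with a small Python sketch of the z-score formula:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations by which x lies above or below the mean."""
    return (x - mu) / sigma

# FBI height example: X = 66, mu = 69, sigma = 3
print(z_score(66, 69, 3))      # -1.0

# SAT math example: X = 470, mu = 500, sigma = 100
print(z_score(470, 500, 100))  # -0.3
```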
STANDARD NORMAL CURVE or STANDARD NORMAL DISTRIBUTION
The tabled normal curve for z scores, with a mean of 0 and a standard deviation of 1.
The standard normal distribution is one of the forms of the normal distribution. It occurs when
a normal random variable has a mean equal to zero and a standard deviation equal to one. In
other words, a normal distribution with a mean of 0 and a standard deviation of 1 is called the
standard normal distribution. Also, the standard normal distribution is centred at zero, and the
standard deviation gives the degree to which a given measurement deviates from the mean.
The random variable of a standard normal distribution is known as the standard score or a z-
score. It is possible to transform every normal random variable X into a z score using the
following formula:
z = (X – μ) / σ
where X is a normal random variable, μ is the mean of X, and σ is the standard deviation of X.
P(Z > a)
The probability P(Z > a) is 1 – Φ(a). You know Φ(a), the area under the standard normal curve to
the left of a, and you know that the total area under the curve is 1, so by subtraction:
P(Z > a) = 1 – Φ(a).
P(Z > –a)
The probability P(Z > –a) is Φ(a). To see this, we have to appreciate the symmetry of the standard
normal distribution curve: the area to the right of –a equals the area to the left of a, so
P(Z > –a) = 1 – Φ(–a) = Φ(a).
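As a sketch, these two probabilities can be computed with Python's statistics.NormalDist, whose cdf method gives Φ; the cut-off a below is an arbitrary illustrative value:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)
a = 1.5  # illustrative cut-off value

phi_a = std_normal.cdf(a)       # Phi(a) = P(Z <= a)

print(1 - phi_a)                # P(Z > a)  = 1 - Phi(a)
print(1 - std_normal.cdf(-a))   # P(Z > -a), computed directly
print(phi_a)                    # equals Phi(a) by symmetry
```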
The standard normal distribution is a tool to translate a normal distribution into numbers.
We may use it to get more information about the data set than was initially known.
The standard normal distribution allows us to quickly estimate the probability of specific
values falling in our distribution, or to compare data sets with varying means and standard
deviations.
Also, the z-score of the standard normal distribution is interpreted as the number of
standard deviations a data point falls above or below the mean.