basics of data science
Unit – II
DESCRIBING DATA
Types of Data - Types of Variables -Describing Data with Tables and Graphs –Describing Data with
Averages - Describing Variability - Normal Distributions and Standard (z) Scores
group.
Quantitative data consist of numbers (weights of 238, 170, . . . 185 lbs) that represent an amount or a
count. To determine the type of data, focus on a single observation in any collection of observations
TYPES OF VARIABLES
A variable is a characteristic or property that can take on different values.
The weights can be described not only as quantitative data but also as observations for a quantitative
variable, since the various weights take on different numerical values.
By the same token, the replies can be described as observations for a qualitative variable, since the
replies to the Facebook profile question take on different values of either Yes or No.
Given this perspective, any single observation can be described as a constant, since it takes on only
one value.
Discrete and Continuous Variables
Quantitative variables can be further distinguished as discrete or continuous.
A discrete variable consists of isolated numbers separated by gaps.
Discrete variables can only assume specific values that you cannot subdivide. Typically, you count discrete
values, and the results are integers.
Examples
Counts, such as the number of children in a family (1, 2, 3, etc., but never 1.5).
These variables cannot have fractional or decimal values. You can have 20 or 21 cats, but not 20.5.
The number of heads in a sequence of coin tosses.
The result of rolling a die.
A continuous variable consists of numbers whose values, at least in theory, have no restrictions.
Continuous variables can assume any numeric value and can be meaningfully split into smaller parts.
Consequently, they have valid fractional and decimal values. In fact, continuous variables have an infinite
number of potential values between any two points. Generally, you measure them using a scale.
Examples of continuous variables include weight, height, length, time, and temperature.
Other examples include durations, such as the reaction times of grade school children to a fire alarm, and
standardized test scores, such as those on the Scholastic Aptitude Test (SAT).
For example, someone's age might be an independent variable. Other factors (such as what they eat, how
much they go to school, how much television they watch)
The impartial creation of distinct groups, which differ only in terms of the independent variable, has a most
desirable consequence. Once the data have been collected, any difference between the groups can be
interpreted as being caused by the independent variable.
Dependent Variable
When a variable is believed to have been influenced by the independent variable, it is called a dependent
variable. In an experimental setting, the dependent variable is measured, counted, or recorded by the
investigator.
The dependent variable (DV) is what you want to use the model to explain or predict. The values of
this variable depend on other variables.
It’s also known as the response variable, outcome variable, and left-hand variable. Graphs place
dependent variables on the vertical, or Y, axis.
A dependent variable is exactly what it sounds like: something that depends on other factors.
For example, the result of a blood sugar test depends on what food you ate, at what time you ate, and so on.
Unlike the independent variable, the dependent variable isn’t manipulated by the investigator. Instead, it
represents an outcome: the data produced by the experiment.
Confounding Variable
An uncontrolled variable that compromises the interpretation of a study is known as a confounding variable.
Sometimes a confounding variable occurs because it’s impossible to assign subjects randomly to different
conditions.
Grouped Data
A frequency distribution organizes observations according to their frequency of occurrence. When
observations are sorted into classes of more than one value, the result is referred to as a frequency
distribution for grouped data (shown in Table 2.2).
In the general structure of this frequency distribution, the data are
grouped into class intervals with 10 possible values each.
The frequency (f) column shows the frequency of observations in
each class and, at the bottom, the total number of observations in all classes.
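As a rough illustration of sorting observations into classes of 10 values each, the sketch below builds a grouped frequency distribution in Python. The weights and the helper name `grouped_frequency` are hypothetical; they are not the data behind Table 2.2.

```python
from collections import Counter

def grouped_frequency(scores, width=10):
    """Return {(lower, upper): frequency} for classes holding `width` whole-number values."""
    # Map every score to the lower limit of its class, e.g. 193 -> 190.
    counts = Counter((s // width) * width for s in scores)
    lowest, highest = min(counts), max(counts)
    # Include empty classes so that the table has no gaps.
    return {(lo, lo + width - 1): counts.get(lo, 0)
            for lo in range(lowest, highest + width, width)}

# Hypothetical weights in pounds (assumed for illustration only).
weights = [160, 193, 226, 152, 170, 238, 185, 172, 205, 157]
table = grouped_frequency(weights)

# List the highest class first, then the total.
for (lo, hi), f in sorted(table.items(), reverse=True):
    print(f"{lo}-{hi}: {f}")
print("Total:", sum(table.values()))
```

The total at the bottom should equal the number of observations, which is a quick check that no score was dropped.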
GUIDELINES
OUTLIERS
An outlier is an extremely high or extremely low data point relative to the nearest data point and the rest
of the neighboring co-existing values in a data graph or dataset you're working with.
Outliers are extreme values that stand out greatly from the overall pattern of values in a dataset or graph.
Constructing Relative Frequency Distributions
To convert a frequency distribution into a relative frequency
distribution, divide the frequency for each class by the total
frequency for the entire distribution.
Table 2.5 illustrates a relative frequency distribution based on
the weight distribution of Table 2.2.
Percentages or Proportions
Some people prefer to deal with percentages rather than proportions because percentages usually lack
decimal points. A proportion always varies between 0 and 1, whereas a percentage always varies between
0 percent and 100 percent.
To convert the relative frequencies to percentages, multiply each proportion by 100; that is, move the
decimal point two places to the right.
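The conversion from frequencies to proportions and percentages can be sketched in a few lines of Python. The class labels and frequencies below are hypothetical, chosen only so the arithmetic is easy to follow.

```python
# Hypothetical class frequencies (assumed values, not those of Table 2.2).
frequencies = {
    "140-149": 2,
    "150-159": 5,
    "160-169": 8,
    "170-179": 4,
    "180-189": 1,
}

total = sum(frequencies.values())

# Relative frequency (proportion) = class frequency / total frequency.
relative = {cls: f / total for cls, f in frequencies.items()}

# Percentage = proportion * 100.
percent = {cls: 100 * p for cls, p in relative.items()}

for cls in frequencies:
    print(f"{cls}: f={frequencies[cls]}, proportion={relative[cls]:.2f}, percent={percent[cls]:.0f}%")
```

The proportions should sum to 1 and the percentages to 100, apart from small rounding differences.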
Cumulative frequency distributions show the total number of observations in each class and in all lower-
ranked classes.
Cumulative frequencies are usually converted, in turn, to cumulative percentages. Cumulative percentages
are often referred to as percentile ranks.
To convert a frequency distribution into a cumulative frequency distribution, add to the frequency of each
class the sum of the frequencies of all classes ranked below it.
Cumulative Percentages
As has been suggested, if relative standing within a distribution is particularly important, then cumulative
frequencies are converted to cumulative percentages.
To obtain a cumulative percentage, divide the cumulative frequency of the class by the total
frequency of the entire distribution and multiply by 100.
Percentile Ranks
When used to describe the relative position of any score within its parent distribution, cumulative
percentages are referred to as percentile ranks.
The percentile rank of a score indicates the percentage of scores in the entire distribution with equal or
smaller values than that score. Thus a weight has a percentile rank of 80 if equal or lighter weights
constitute 80 percent of the entire distribution.
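A minimal sketch of cumulative frequencies and cumulative percentages follows, assuming classes are listed from lowest to highest. The frequencies are hypothetical.

```python
from itertools import accumulate

# Hypothetical classes, ordered from lowest to highest, with their frequencies.
classes = ["140-149", "150-159", "160-169", "170-179", "180-189"]
frequencies = [2, 5, 8, 4, 1]

total = sum(frequencies)

# Cumulative frequency: the frequency of a class plus all lower-ranked classes.
cumulative_f = list(accumulate(frequencies))

# Cumulative percentage: cumulative frequency / total frequency * 100.
cumulative_pct = [100 * cf / total for cf in cumulative_f]

for cls, f, cf, cp in zip(classes, frequencies, cumulative_f, cumulative_pct):
    print(f"{cls}: f={f}, cumulative f={cf}, cumulative %={cp:.0f}")
```

Read as percentile ranks, the last column says, for instance, that scores in the 160-169 class or lower make up 75 percent of this hypothetical distribution.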
FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA
Frequency distributions for qualitative data are easy to construct.
Simply determine the frequency with which observations occupy each class.
If measurement is ordinal, observations can be ordered from least to most, so cumulative frequencies
(and cumulative percentages) can be used.
GRAPHS
Data can be described clearly and concisely with the aid of a well-constructed frequency distribution. And
data can often be described even more vividly by converting frequency distributions into graphs.
Important features of histograms
Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class intervals of the
frequency distribution.
Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in frequency. (The units
along the vertical axis do not have to be the same width as those along the horizontal axis.)
The intersection of the two axes defines the origin at which both numerical scales equal 0.
Numerical scales always increase from left to right along the horizontal axis and from bottom to top
along the vertical axis.
The body of the histogram consists of a series of bars whose heights reflect the frequencies for the
various classes.
The adjacent bars in histograms have common boundaries that emphasize the continuity of
quantitative data for continuous variables.
The introduction of gaps between adjacent bars would suggest an artificial disruption in the data more
appropriate for discrete quantitative variables or for qualitative variables.
Figure: Histogram
Frequency Polygon
An important variation on a histogram is the frequency polygon, or line graph. Frequency polygons may
be constructed directly from frequency distributions.
the frequency polygon is enclosed completely.
D. Finally, erase all of the histogram bars, leaving only the frequency polygon.
(representing the 240s). Draw a vertical line to separate the stems, which represent multiples of 10, from the
space to be occupied by the leaves, which represent multiples of 1.
For example
Enter each raw score into the stem and leaf display. As suggested by the shaded coding in Table 2.9, the first
raw score of 160 reappears as a leaf of 0 on a stem of 16. The next raw score of 193 reappears as a leaf of 3 on
a stem of 19, and the third raw score of 226 reappears as a leaf of 6 on a stem of 22, and so on, until each raw
score reappears as a leaf on its appropriate stem.
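The stem and leaf construction described above can be sketched as follows: each raw score is split into a stem (multiples of 10) and a leaf (multiples of 1). Only the three raw scores quoted in the text are used; the function name `stem_and_leaf` is an illustrative choice.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Group whole-number scores by stem (tens and above), keeping leaves (units) in entry order."""
    display = defaultdict(list)
    for score in scores:
        stem, leaf = divmod(score, 10)  # e.g. 193 -> stem 19, leaf 3
        display[stem].append(leaf)
    return dict(display)

# The first three raw scores mentioned in the text.
scores = [160, 193, 226]
display = stem_and_leaf(scores)

for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

With more scores, several leaves accumulate on the same stem, and the lengths of the rows give the same visual impression as the bars of a histogram.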
TYPICAL SHAPES
Whether expressed as a histogram, a frequency polygon, or a stem and leaf display, an important
characteristic of a frequency distribution is its shape. The figure below shows some of the more typical
shapes for smoothed frequency polygons (which ignore the inevitable irregularities of real data).
A GRAPH FOR QUALITATIVE (NOMINAL) DATA
As with histograms, equal segments along the horizontal axis are allocated to the different words or
classes that appear in the frequency distribution for qualitative data. Likewise, equal segments along
the vertical axis reflect increases in frequency. The body of the bar graph consists of a series of bars
whose heights reflect the frequencies for the various words or classes.
A person’s answer to the question “Do you have a Facebook profile?” is either Yes or No, not some
MISLEADING GRAPHS
Graphs can be constructed in an unscrupulous manner to support a particular point of view.
Popular sayings include “Numbers don’t lie, but statisticians do” and “There are three kinds of lies:
lies, damned lies, and statistics.”
Types of Modes
Bimodal, Trimodal & Multimodal (More than one mode)
When there are two modes in a data set, then the set is called bimodal.
For example, the modes of Set A = {2,2,2,3,4,4,5,5,5} are 2 and 5, because both 2 and 5 are repeated three
times in the given set.
When there are three modes in a data set, then the set is called trimodal
For example, the mode of set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} is 2, 5 and 8
When there are four or more modes in a data set, then the set is called multimodal
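The two example sets can be checked with Python's standard `statistics.multimode`, which returns every value tied for the highest frequency. The helper `describe` is a hypothetical name added here to label the result using the terms above.

```python
from statistics import multimode

def describe(modes):
    """Label a list of modes using the terms in the text."""
    names = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return names.get(len(modes), "multimodal")

set_a = [2, 2, 2, 3, 4, 4, 5, 5, 5]
set_b = [2, 2, 2, 3, 4, 4, 5, 5, 5, 7, 8, 8, 8]

modes_a = multimode(set_a)
modes_b = multimode(set_b)

print(modes_a, describe(modes_a))  # [2, 5] bimodal
print(modes_b, describe(modes_b))  # [2, 5, 8] trimodal
```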
Example: The following table represents the number of wickets taken by a bowler in 10 matches. Find the
mode of the given set of data.
It can be seen that 2 wickets were taken by the bowler frequently in different matches. Hence, the mode of the
given data is 2.
MEDIAN
The median reflects the middle value when observations are ordered from least to most.
The median splits a set of ordered observations into two equal parts, the upper and lower halves.
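If the total number of observations n is odd, the median is the ((n + 1)/2)th value in the ordered list.
If the total number of observations n is even, the median is the mean of the (n/2)th and ((n/2) + 1)th
values in the ordered list.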
Example 1:
4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29
Solution:
n= 15
When we put those numbers in the order we have:
4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92
Since n = 15 is odd, the median is the middle (8th) value: Median = 24
Example 2:
Find the median of the following:
9,7,2,11,18,12,6,4
Solution
n=8
When we put those numbers in the order we have:
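2, 4, 6, 7, 9, 11, 12, 18
Since n = 8 is even, the median is the mean of the two middle (4th and 5th) values: Median = (7 + 9) / 2 = 8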
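Both examples can be verified with Python's standard `statistics.median`, which sorts the data and applies the odd or even rule automatically. This is only a check of the arithmetic, using the numbers given in the two examples.

```python
from statistics import median

example_1 = [4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29]  # n = 15 (odd)
example_2 = [9, 7, 2, 11, 18, 12, 6, 4]                                  # n = 8 (even)

median_1 = median(example_1)  # middle (8th) value of the ordered list
median_2 = median(example_2)  # mean of the 4th and 5th ordered values

print(sorted(example_1), "->", median_1)  # 24
print(sorted(example_2), "->", median_2)  # 8.0
```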
MEAN
The mean is found by adding all scores and then dividing by the number of scores.
In other words, the mean is the average of the given numbers: the sum of the numbers divided by how many
numbers there are.
Types of means
Sample mean
Population mean
Sample Mean
The sample mean is a measure of central tendency. It is the arithmetic average computed from a sample, that
is, from values drawn at random from the population. It is evaluated as the sum of all the sample values
divided by the total number of values in the sample.
Population Mean
The population mean can be calculated by the sum of all values in the given data/population divided by a total
number of values in the given data/population.
AVERAGES FOR QUALITATIVE AND RANKED DATA
Mode
The mode always can be used with qualitative data.
Median
The median can be used whenever it is possible to order qualitative data from least to most because the level
of measurement is ordinal.
Describing Variability
RANGE
The range is the difference between the largest and smallest scores.
The range in statistics for a given data set is the difference between the highest and lowest values. For
example, if the given data set is {2, 5, 8, 10, 3}, then the range will be 10 – 2 = 8.
Example 1: Find the range of given observations: 32, 41, 28, 54, 35, 26, 23, 33, 38, 40.
Since 23 is the lowest value and 54 is the highest value, the range of the observations is:
Range (X) = Max (X) – Min (X)
= 54 – 23
= 31
VARIANCE
Variance is a measure of how data points differ from the mean. A variance is a measure of how far a set of
data (numbers) are spread out from their mean (average) value.
Formula
Variance = (Standard deviation)² = σ² = Σ(x − μ)² / n
That is, the squared deviations of all scores from the mean are added and then divided by the total number
of scores.
Example
X = 5, 8, 6, 10, 12, 9, 11, 10, 12, 7
Solution
Mean = sum (x)/ n
n= 10
sum (x) = 5+8+6+10+12+9+11+10+12+7
= 90
Mean=> μ = 90 / 10 = 9
Deviation from mean
x − μ = −4, −1, −3, 1, 3, 0, 2, 1, 3, −2
(x − μ)² = 16, 1, 9, 1, 9, 0, 4, 1, 9, 4
Σ(x − μ)² = 16+1+9+1+9+0+4+1+9+4
= 54
σ² = Σ(x − μ)² / n
= 54/10
= 5.4
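The worked example can be reproduced step by step in Python. The sketch below follows the same order as the solution (mean, deviations, squared deviations, sum of squares, variance) and also takes the square root to give the standard deviation; the variable names are illustrative.

```python
import math

scores = [5, 8, 6, 10, 12, 9, 11, 10, 12, 7]

n = len(scores)
mean = sum(scores) / n                      # 90 / 10 = 9.0

deviations = [x - mean for x in scores]     # x - mu
squared_deviations = [d ** 2 for d in deviations]
sum_of_squares = sum(squared_deviations)    # 54.0

variance = sum_of_squares / n               # population variance, divides by n
std_dev = math.sqrt(variance)               # square root of the variance

print(f"mean = {mean}, sum of squares = {sum_of_squares}")
print(f"variance = {variance:.1f}, standard deviation = {std_dev:.2f}")
```

The standard library's `statistics.pvariance` and `statistics.pstdev` compute the same two quantities directly.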
STANDARD DEVIATION
The standard deviation is the square root of the mean of all squared deviations from the mean, that is,
Standard deviation = √variance
Standard Deviation: A rough measure of the average (or standard) amount by which scores deviate from their mean.
“The sum of squares equals the sum of all squared deviation scores.” You can reconstruct this formula by
Formula
Degree of freedom df = n-1
Example
Consider a data set that consists of five positive integers whose sum must be a multiple of 6.
Suppose the first four values are randomly selected as 3, 8, 5, and 4.
The sum of these four values is 20, so the fifth integer is no longer free to vary: it must be chosen to
make the sum divisible by 6. For example, a fifth value of 10 gives a sum of 30.
Degrees of Freedom (df): The number of values free to vary, given one or more mathematical restrictions.
The number of degrees of freedom, n − 1, appears in the denominator of the formulas for s² and s. In fact,
we can use degrees of freedom to rewrite the formulas for the sample variance and standard deviation:
s² = SS / (n − 1) = SS / df
s = √(SS / (n − 1)) = √(SS / df)
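To see the effect of dividing by degrees of freedom rather than by n, the sketch below reuses the ten scores from the variance example and treats them as a sample. Treating them as a sample is an assumption made only for illustration; the notes use them as a population.

```python
import math

# Scores from the variance example, treated here as a sample (an assumption).
scores = [5, 8, 6, 10, 12, 9, 11, 10, 12, 7]

n = len(scores)
df = n - 1                                   # degrees of freedom

mean = sum(scores) / n
ss = sum((x - mean) ** 2 for x in scores)    # sum of squares, SS

sample_variance = ss / df                    # SS / df
sample_sd = math.sqrt(sample_variance)       # square root of SS / df

print(f"df = {df}, SS = {ss}")
print(f"sample variance = {sample_variance:.2f}, sample sd = {sample_sd:.2f}")
```

Dividing by 9 instead of 10 gives a slightly larger variance (6.0 rather than 5.4), which is the usual consequence of using n − 1.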
The interquartile range (IQR) is simply the range for the middle 50 percent of the scores. More specifically,
the IQR equals the distance between the third quartile (or 75th percentile) and the first quartile (or 25th
percentile), that is, after the highest quarter (or top 25 percent) and the lowest quarter (or bottom 25 percent)
have been trimmed from the original set of scores. Since most distributions are spread more widely in their
extremities than their middle, the IQR tends to be less than half the size of the range.
Put simply, the IQR describes the middle 50% of values when ordered from lowest to highest. To find the
interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These
values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
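The median-of-halves procedure just described can be sketched as below. The function name `iqr_by_halves` is hypothetical, the data are the ten scores from the variance example, and leaving the middle value out of both halves when n is odd is one common convention assumed here.

```python
from statistics import median

def iqr_by_halves(scores):
    """Return (Q1, Q3, IQR) using the medians of the lower and upper halves.

    Needs at least two scores. When n is odd, the middle value is left out
    of both halves (an assumed convention).
    """
    ordered = sorted(scores)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]
    q1, q3 = median(lower), median(upper)
    return q1, q3, q3 - q1

scores = [5, 8, 6, 10, 12, 9, 11, 10, 12, 7]
q1, q3, iqr = iqr_by_halves(scores)
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")  # Q1 = 7, Q3 = 11, IQR = 4
```

Other quartile conventions, such as those used by statistical software, can give slightly different values for small data sets.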
The normal curve is a theoretical curve defined for a continuous variable, as described in Section 1.6,
and noted for its symmetrical bell-shaped form, as shown in the figure below.
Because the normal curve is symmetrical, its lower half is the mirror image of its upper half.
The normal curve peaks above a point midway along the horizontal spread and then tapers off
gradually in either direction from the peak (without actually touching the horizontal axis, since, in
theory, the tails of a normal curve extend infinitely far).
The values of the mean, median (or 50th percentile), and mode, located at a point midway along the
horizontal spread, are the same for the normal curve.
Different Normal Curves
As a theoretical exercise, it is instructive to note the various types of normal curves that are produced
by an arbitrary change in the value of either the mean (μ) or the standard deviation (σ).
Obvious differences in appearance among normal curves are less important than you might suspect.
Because of their common mathematical origin, every normal curve can be interpreted in exactly the same way
once any distance from the mean is expressed in standard deviation units.
z SCORES
A z score is a unit-free, standardized score that, regardless of the original units of measurement, indicates how
many standard deviations a score is above or below the mean of its distribution.
A z score can be defined as a measure of the number of standard deviations by which a score is below or
above the mean of a distribution. In other words, it is used to determine the distance of a score from the mean.
If the z score is positive it indicates that the score is above the mean. If it is negative then the score will be
below the mean. However, if the z score is 0 it denotes that the data point is the same as the mean.
To obtain a z score, express any original score, whether measured in inches, milliseconds, dollars, IQ points,
etc., as a deviation from its mean (by subtracting its mean) and then split this deviation into standard deviation
units (by dividing by its standard deviation):
z = (X − μ) / σ
where X is the original score and μ and σ are the mean and the standard deviation, respectively, for the
normal distribution of the original scores. Since identical units of measurement appear in both the numerator
and denominator of the ratio for z, the original units of measurement cancel each other and the z score
emerges as a unit-free or standardized number, often referred to as a standard score.
Converting to z Scores
Example
Suppose on a GRE test a score of 1100 is obtained. The mean score for the GRE test is 1026 and the
population standard deviation is 209. In order to find how well a person scored with respect to the score of an
average test taker, the z score will have to be determined.
The steps to calculate the z score are as follows:
Step 1: Write the value of the raw score in the z score equation. z = (1100 − μ) / σ
Step 2: Write the mean and standard deviation of the population in the z score formula.
z = (1100 − 1026) / 209
Step 3: Perform the calculations to get the required z score. z = 74 / 209 ≈ 0.354
Step 4: A z score table can be used to find the percentage of test-takers that are below the score of the
person. Using the first two digits of the z score (0.3), determine the row of the z table containing these
digits. Now, using the second digit after the decimal point (5), find the corresponding column. The
intersection of this row and column gives a value, which is 0.6368 for the given example.
Step 5: Multiply the value from step 4 by 100 to get the required percentage. 0.6368 × 100 =
63.68%. This shows that 63.68% of test-takers scored lower than the given raw score.
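The GRE example can be checked in Python. The sketch below uses `statistics.NormalDist` from the standard library in place of a printed z table; rounding z to two decimals before the lookup is done only to mimic how a table is read.

```python
from statistics import NormalDist

raw_score = 1100
mean = 1026
sd = 209

# z = (X - mu) / sigma
z = (raw_score - mean) / sd

# A printed z table is indexed by z rounded to two decimal places.
z_table = round(z, 2)

# Proportion of the standard normal curve below z, then as a percentage.
proportion_below = NormalDist().cdf(z_table)
percent_below = 100 * proportion_below

print(f"z = {z:.3f} (table lookup at {z_table})")
print(f"proportion below = {proportion_below:.4f}, percent below = {percent_below:.2f}%")
```

Using the unrounded z gives a marginally different proportion, because the table lookup works with z to two decimal places only.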
Although there is an infinite number of different normal curves, each with its own mean and standard
deviation, there is only one standard normal curve, with a mean of 0 and a standard deviation of 1.
Mean = 0
Standard deviation = 1
Given a z score of zero or more, columns B and C indicate how the z score splits the area in the upper half of
the normal curve. As suggested by the shading in the top legend, column B indicates the proportion of area
between the mean and the z score, and column C indicates the proportion of area beyond the z score, in the
upper tail of the standard normal curve.
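The two table columns can be reproduced numerically. In the sketch below, the function name `table_columns` is hypothetical, and `statistics.NormalDist` stands in for the printed table: column B is the area between the mean and z, and column C is the area beyond z.

```python
from statistics import NormalDist

def table_columns(z):
    """Return (column B, column C) for a z score of zero or more."""
    if z < 0:
        raise ValueError("this sketch covers z scores of zero or more only")
    below = NormalDist().cdf(z)   # total area to the left of z
    column_b = below - 0.5        # area between the mean and z
    column_c = 1 - below          # area beyond z, in the upper tail
    return column_b, column_c

b, c = table_columns(1.00)
print(f"z = 1.00: B = {b:.4f}, C = {c:.4f}")  # B = 0.3413, C = 0.1587
```

For any z of zero or more, B and C add up to 0.5, the area of the upper half of the curve.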
FINDING PROPORTIONS
Finding Proportions for One Score
Sketch a normal curve and shade in the target area.
Plan your solution according to the normal table.
Convert X to z.
Find the target area.
Sketch a normal curve and shade in the target area (for example, find the proportion between 245 and 255).
FINDING SCORES
So far, we have concentrated on normal curve problems for which Table A must be consulted to find
the unknown proportion (of area) associated with some known score or pair of known scores
Now we will concentrate on the opposite type of normal curve problem for which Table A must be
consulted to find the unknown score or scores associated with some known proportion.
This type of problem requires that we reverse our use of Table A by entering proportions in
columns B, C, B′, or C′ and finding z scores listed in columns A or A′.
It’s often helpful to visualize the target score as splitting the total area into two sectors: one to the left of
(below) the target score and one to the right of (above) the target score.
Find z.
Convert z to the target score.
When converting z scores to original scores, you will probably find it more efficient to use the following
equation:
X = μ + (z)(σ)
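As a sketch of this reverse problem, the code below finds the score that separates the lower 90 percent from the upper 10 percent. The mean and standard deviation are borrowed from the GRE example, the 0.90 proportion is a hypothetical target, and `NormalDist.inv_cdf` stands in for reading Table A in reverse.

```python
from statistics import NormalDist

# Mean and standard deviation from the GRE example; the target proportion
# below the score (0.90) is a hypothetical choice for illustration.
mean = 1026
sd = 209
proportion_below = 0.90

# Find z: the z score with the given proportion of area below it.
z = NormalDist().inv_cdf(proportion_below)

# Convert z to the target score: X = mu + z * sigma.
target_score = mean + z * sd

print(f"z = {z:.2f}, target score = {target_score:.1f}")
```

A positive z places the target score above the mean, as expected for a score that cuts off the top of the distribution.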
Find z.
Convert z to the target score.
Plan your solution according to the normal table.
Points to Remember
1. range = largest value – smallest value in a list
2. class interval = range / desired no of classes
3. relative frequency = frequency (f) / Σf
4. Cumulative frequency - add to the frequency of each class the sum of the frequencies of all
classes ranked below it.
5. Cumulative percentage = (cumulative f / total f) × 100
6. Histograms
7. Construction of frequency polygon
8. Stem and leaf display
Variance = (Standard deviation)² = σ² = Σ(x − μ)² / n
12. Range (X) = Max (X) – Min (X)
13. Degree of freedom df = n-1
15. z – score
2. For between two score
Two scores
Unit – III
DESCRIBING RELATIONSHIPS
Correlation – Scatter plots – correlation coefficient for quantitative data – computational formula for correlation
coefficient – Regression – regression line – least squares regression line – Standard error of estimate –
interpretation of r2 – multiple regression equations – regression towards the mean
Correlation
Correlation refers to a process for establishing the relationship between two variables. One way to get a
general idea about whether or not two variables are related is to plot them on a “scatter plot”. While there
are many measures of association for variables that are measured at the ordinal or higher level of
measurement, correlation is the most commonly used approach.
Types of Correlation
Positive Correlation – when the values of the two variables move in the same direction, so that an
increase/decrease in the value of one variable is accompanied by an increase/decrease in the value of the
other variable.
Negative Correlation – when the values of the two variables move in opposite directions, so that an
increase/decrease in the value of one variable is accompanied by a decrease/increase in the value of the other
variable.
No Correlation – when there is no linear dependence or no relation between the two variables.
SCATTERPLOTS
A scatter plot is a graph containing a cluster of dots that represents all pairs of scores. In other words,
scatter plots are graphs that present the relationship between two variables in a data set. They represent data
points on a two-dimensional plane or on a Cartesian system.
Use each pair of scores to locate a dot within the scatter plot
The first step is to note the tilt or slope, if any, of a dot cluster.
A dot cluster that has a slope from the lower left to the upper right, as in panel A of the figure below,
reflects a positive relationship.
A dot cluster that has a slope from the upper left to the lower right, as in panel B of the figure below,
reflects a negative relationship.
A dot cluster that lacks any apparent slope, as in panel C of the figure below, reflects little or no relationship.
Perfect Relationship
A dot cluster that equals (rather than merely approximates) a straight line reflects a perfect relationship between
two variables.
Curvilinear Relationship
The previous discussion assumes that a dot cluster approximates a straight line and, therefore, reflects a linear
relationship. But this is not always the case. Sometimes a dot cluster approximates a bent or curved line, as in
the figure below, and therefore reflects a curvilinear relationship.
Properties of r
The correlation coefficient is scaled so that it is always between -1 and +1.
When r is close to 0, there is little relationship between the variables; the farther r is from 0, in
either the positive or negative direction, the stronger the relationship between the two variables.
The sign of r indicates the type of linear relationship, whether positive or negative.
The numerical value of r, without regard to sign, indicates the strength of the linear relationship.
A number with a plus sign (or no sign) indicates a positive relationship, and a number with a minus sign
indicates a negative relationship.
COMPUTATION FORMULA FOR r
Calculate a value for r by using the following computation formula:
r = SPxy / √(SSx · SSy)
where the two sum of squares terms in the denominator are defined as
SSx = ΣX² − (ΣX)² / n
SSy = ΣY² − (ΣY)² / n
The sum of the products term in the numerator, SPxy, is defined as
SPxy = ΣXY − (ΣX)(ΣY) / n
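A minimal sketch of the computation follows, assuming the usual computational definitions: SSx = ΣX² − (ΣX)²/n, SSy = ΣY² − (ΣY)²/n, SPxy = ΣXY − (ΣX)(ΣY)/n, and r = SPxy / √(SSx · SSy). The five pairs of scores are hypothetical, chosen so that the arithmetic comes out evenly.

```python
import math

# Hypothetical pairs of scores (assumed for illustration only).
x = [13, 9, 7, 5, 1]
y = [14, 18, 12, 10, 6]
n = len(x)

sum_x, sum_y = sum(x), sum(y)

# Sum of squares terms for the denominator.
ss_x = sum(v ** 2 for v in x) - sum_x ** 2 / n   # 325 - 245 = 80
ss_y = sum(v ** 2 for v in y) - sum_y ** 2 / n   # 800 - 720 = 80

# Sum of products term for the numerator.
sp_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n   # 484 - 420 = 64

r = sp_xy / math.sqrt(ss_x * ss_y)
print(f"SSx = {ss_x}, SSy = {ss_y}, SPxy = {sp_xy}, r = {r:.2f}")  # r = 0.80
```

The positive sign of r matches the upward tilt these pairs would show in a scatter plot, and its size (0.80) indicates a fairly strong linear relationship.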