basics of data science

The document provides an overview of data types, including qualitative, ranked, and quantitative data, alongside their corresponding variables. It discusses the distinctions between discrete and continuous variables, as well as independent and dependent variables in experimental contexts. Additionally, it covers methods for describing data using tables, graphs, frequency distributions, and the importance of outliers and cumulative frequencies.

Uploaded by

Samay Rajput
CS3352 – DOWNLOADED FROM STUCOR APP | FDS Unit – II III SEM CSE

Unit – II
DESCRIBING DATA

Types of Data - Types of Variables - Describing Data with Tables and Graphs - Describing Data with Averages - Describing Variability - Normal Distributions and Standard (z) Scores

THREE TYPES OF DATA

• Qualitative data consist of words (Yes or No), letters (Y or N), or numerical codes (0 or 1) that represent a class or category.
• Ranked data consist of numbers (1st, 2nd, . . . 40th place) that represent relative standing within a group.
• Quantitative data consist of numbers (weights of 238, 170, . . . 185 lbs) that represent an amount or a count.
To determine the type of data, focus on a single observation in any collection of observations.

TYPES OF VARIABLES
A variable is a characteristic or property that can take on different values.
• The weights can be described not only as quantitative data but also as observations for a quantitative variable, since the various weights take on different numerical values.
• By the same token, the replies can be described as observations for a qualitative variable, since the replies to the Facebook profile question take on different values of either Yes or No.
• Given this perspective, any single observation can be described as a constant, since it takes on only one value.

Discrete and Continuous Variables
Quantitative variables can be further distinguished as discrete or continuous.
A discrete variable consists of isolated numbers separated by gaps.
Discrete variables can assume only specific values that cannot be subdivided. Typically, you count discrete values, and the results are integers. These variables cannot have fractional or decimal values: you can have 20 or 21 cats, but not 20.5.

Examples
• Counts, such as the number of children in a family (1, 2, 3, etc., but never 1.5).
• The number of heads in a sequence of coin tosses.
• The result of rolling a die.
• The number of patients in a hospital.
• The population of a country.
While discrete variables have no decimal places, the average of these values can be fractional. For example, families can have only a discrete number of children: 1, 2, 3, etc. However, the average number of children per family can be 2.2.

A continuous variable consists of numbers whose values, at least in theory, have no restrictions.
Continuous variables can assume any numeric value and can be meaningfully split into smaller parts.
Consequently, they have valid fractional and decimal values. In fact, continuous variables have an infinite
number of potential values between any two points. Generally, you measure them using a scale.

Examples of continuous variables include weight, height, length, time, and temperature. Other examples are durations, such as the reaction times of grade school children to a fire alarm, and standardized test scores, such as those on the Scholastic Aptitude Test (SAT).

Independent and Dependent Variables

Independent Variable
In an experiment, an independent variable is the treatment manipulated by the investigator.
• Independent variables (IVs) are the ones that you include in the model to explain or predict changes in the dependent variable.
• Independent indicates that they stand alone and other variables in the model do not influence them.
• Independent variables are also known as predictors, factors, treatment variables, explanatory variables, input variables, x-variables, and right-hand variables, because they appear on the right side of the equals sign in a regression equation.
• An independent variable stands alone and isn't changed by the other variables you are trying to measure.

For example, someone's age might be an independent variable. Other factors (such as what they eat, how much they go to school, how much television they watch) are not going to change a person's age.

The impartial creation of distinct groups, which differ only in terms of the independent variable, has a most desirable consequence. Once the data have been collected, any difference between the groups can be interpreted as being caused by the independent variable.

Dependent Variable
When a variable is believed to have been influenced by the independent variable, it is called a dependent variable. In an experimental setting, the dependent variable is measured, counted, or recorded by the investigator.
• The dependent variable (DV) is what you want to use the model to explain or predict. The values of this variable depend on other variables.
• It’s also known as the response variable, outcome variable, and left-hand variable. Graphs place dependent variables on the vertical, or Y, axis.
• A dependent variable is exactly what it sounds like: something that depends on other factors.

For example, the result of a blood sugar test depends on what food you ate, when you ate it, and so on.

Unlike the independent variable, the dependent variable isn’t manipulated by the investigator. Instead, it
represents an outcome: the data produced by the experiment.

Confounding Variable
An uncontrolled variable that compromises the interpretation of a study is known as a confounding variable. Sometimes a confounding variable occurs because it’s impossible to assign subjects randomly to different conditions.

Describing Data with Tables and Graphs

Frequency Distributions for Quantitative Data

• A frequency distribution is a collection of observations produced by sorting observations into classes and showing their frequency (f) of occurrence in each class.
• When observations are sorted into classes of single values, as in Table 2.1, the result is referred to as a frequency distribution for ungrouped data.
• The frequency distribution shown in Table 2.1 is only partially displayed because there are more than 100 possible values between the largest and smallest observations.
• A frequency distribution for ungrouped data is much more informative when the number of possible values is less than about 20. When there are more possible values, grouped data are used.

Grouped Data
When observations are sorted into classes of more than one value, the result is referred to as a frequency distribution for grouped data (shown in Table 2.2).
• In this frequency distribution, the data are grouped into class intervals with 10 possible values each.
• The frequency (f) column shows the frequency of observations in each class and, at the bottom, the total number of observations in all classes.
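
The grouping described above can be sketched in a few lines of Python. This is only an illustration: the function name `grouped_frequencies`, the choice to start each class at a multiple of 10, and the weights themselves are assumptions, not the data behind Table 2.2.

```python
from collections import Counter

def grouped_frequencies(values, width=10):
    """Count integer observations in class intervals of `width` values.

    Assumption: each class starts at a multiple of `width`
    (for example 130-139, 140-149). Returns (lower, upper, f) tuples,
    lowest class first, including empty classes in between.
    """
    counts = Counter((v // width) * width for v in values)
    lowest, highest = min(counts), max(counts)
    return [(lo, lo + width - 1, counts.get(lo, 0))
            for lo in range(lowest, highest + width, width)]

# Hypothetical weights in pounds (illustrative only).
weights = [160, 193, 226, 152, 166, 135, 158, 171, 165, 180]
table = grouped_frequencies(weights)
for lo, hi, f in table:
    print(f"{lo}-{hi}: {f}")
print("Total:", sum(f for _, _, f in table))
```

Empty classes are kept in the output so that gaps in the data stay visible in the table.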

GUIDELINES
[Guidelines for frequency distributions not reproduced]

OUTLIERS
An outlier is an extremely high or extremely low data point relative to the nearest data points and the rest of the values in the dataset or graph you are working with. Outliers stand out greatly from the overall pattern of values.

RELATIVE FREQUENCY DISTRIBUTIONS

Relative frequency distributions show the frequency of each class as a part or fraction of the total frequency for the entire distribution.
This type of distribution is especially helpful when you must compare two or more distributions based on different total numbers of observations.
The conversion to relative frequencies allows a direct comparison of the shapes of two distributions without having to adjust for their different total numbers of observations.

Constructing Relative Frequency Distributions
To convert a frequency distribution into a relative frequency distribution, divide the frequency for each class by the total frequency for the entire distribution.
Table 2.5 illustrates a relative frequency distribution based on the weight distribution of Table 2.2.

Percentages or Proportions
Some people prefer to deal with percentages rather than proportions because percentages usually lack decimal points. A proportion always varies between 0 and 1, whereas a percentage always varies between 0 percent and 100 percent.
To convert relative frequencies from proportions to percentages, multiply each proportion by 100; that is, move the decimal point two places to the right.
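
As a minimal sketch of the two conversions just described, the following Python divides each class frequency by the total and then multiplies by 100. The class labels and frequencies are hypothetical (they are not taken from Table 2.2 or Table 2.5), and the function names are illustrative.

```python
def relative_frequencies(freqs):
    """Divide each class frequency by the total frequency (proportions)."""
    total = sum(freqs.values())
    return {cls: f / total for cls, f in freqs.items()}

def as_percentages(proportions):
    """Multiply each proportion by 100."""
    return {cls: p * 100 for cls, p in proportions.items()}

# Hypothetical frequency distribution (illustrative only).
freqs = {"130-139": 2, "140-149": 5, "150-159": 8, "160-169": 5}
proportions = relative_frequencies(freqs)
percentages = as_percentages(proportions)
for cls in freqs:
    print(f"{cls}: f={freqs[cls]}  proportion={proportions[cls]:.2f}  "
          f"percent={percentages[cls]:.0f}%")
```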

CUMULATIVE FREQUENCY DISTRIBUTIONS

Cumulative frequency distributions show the total number of observations in each class and in all lower-ranked classes.
Cumulative frequencies are usually converted, in turn, to cumulative percentages. Cumulative percentages are often referred to as percentile ranks.

Constructing Cumulative Frequency Distributions

To convert a frequency distribution into a cumulative frequency distribution, add to the frequency of each class the sum of the frequencies of all classes ranked below it.
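
A running total is all that this conversion needs. The sketch below assumes the class frequencies are listed from the lowest-ranked class to the highest; the frequencies are hypothetical and the function names are illustrative.

```python
from itertools import accumulate

def cumulative_frequencies(freqs):
    """Running total of frequencies, lowest-ranked class first."""
    return list(accumulate(freqs))

def cumulative_percentages(freqs):
    """Cumulative frequency divided by the total frequency, times 100."""
    total = sum(freqs)
    return [100 * c / total for c in cumulative_frequencies(freqs)]

# Hypothetical class frequencies, lowest class first (illustrative only).
freqs = [2, 5, 8, 5]
print(cumulative_frequencies(freqs))
print(cumulative_percentages(freqs))
```

The last cumulative frequency always equals the total number of observations, which is a quick check on the arithmetic.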
Cumulative Percentages
As has been suggested, if relative standing within a distribution is particularly important, then cumulative frequencies are converted to cumulative percentages.
To obtain a cumulative percentage, divide the cumulative frequency of the class by the total frequency of the entire distribution and multiply by 100.

Percentile Ranks
When used to describe the relative position of any score within its parent distribution, cumulative percentages are referred to as percentile ranks.
The percentile rank of a score indicates the percentage of scores in the entire distribution with equal or smaller values than that score. Thus a weight has a percentile rank of 80 if equal or lighter weights constitute 80 percent of the entire distribution.
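
The definition of a percentile rank can be checked directly on raw scores. This sketch counts the scores that are equal to or smaller than a given score; the weights are hypothetical and the function name `percentile_rank` is an illustrative choice.

```python
def percentile_rank(scores, score):
    """Percentage of scores with values equal to or smaller than `score`."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Hypothetical weights in pounds (illustrative only).
weights = [135, 152, 158, 160, 165, 166, 171, 180, 193, 226]
print(percentile_rank(weights, 180))
```

Here 8 of the 10 weights are 180 lbs or lighter, so 180 has a percentile rank of 80.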

FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA
Frequency distributions for qualitative data are easy to construct. Simply determine the frequency with which observations occupy each class, and report these frequencies as shown in Table 2.7 for the Facebook profile survey.
When qualitative data have an ordinal level of measurement because observations can be ordered from least to most, that order should be preserved in the frequency table.

Relative and Cumulative Distributions for Qualitative Data

Frequency distributions for qualitative variables can always be converted into relative frequency distributions.
If measurement is ordinal, because observations can be ordered from least to most, cumulative frequencies (and cumulative percentages) can also be used.

GRAPHS

Data can be described clearly and concisely with the aid of a well-constructed frequency distribution. And
data can often be described even more vividly by converting frequency distributions into graphs.

GRAPHS FOR QUANTITATIVE DATA


Histograms
A histogram is a bar-type graph for quantitative data. It uses rectangles to show the frequency of data items in successive numerical intervals of equal size. The common boundaries between adjacent bars emphasize the continuity of the data, as with continuous variables.

Important features of histograms
• Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class intervals of the frequency distribution.
• Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in frequency. (The units along the vertical axis do not have to be the same width as those along the horizontal axis.)
• The intersection of the two axes defines the origin at which both numerical scales equal 0.
• Numerical scales always increase from left to right along the horizontal axis and from bottom to top along the vertical axis.
• The body of the histogram consists of a series of bars whose heights reflect the frequencies for the various classes.
• The adjacent bars in histograms have common boundaries that emphasize the continuity of quantitative data for continuous variables.
• The introduction of gaps between adjacent bars would suggest an artificial disruption in the data more appropriate for discrete quantitative variables or for qualitative variables.

Figure: Histogram
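
For a quick look at a distribution without plotting software, a rough text rendering can stand in for a histogram. The sketch below is only an approximation of the features listed above: the bars run horizontally rather than vertically, and the class intervals and frequencies are hypothetical.

```python
def text_histogram(classes, freqs, symbol="#"):
    """Return one line per class interval; bar length equals the frequency."""
    width = max(len(c) for c in classes)
    return [f"{c:<{width}} | {symbol * f} ({f})"
            for c, f in zip(classes, freqs)]

# Hypothetical class intervals and frequencies (illustrative only).
classes = ["130-139", "140-149", "150-159", "160-169"]
freqs = [2, 5, 8, 5]
lines = text_histogram(classes, freqs)
print("\n".join(lines))
```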

Frequency Polygon
An important variation on a histogram is the frequency polygon, or line graph. Frequency polygons may be constructed directly from frequency distributions.

Step-by-step transformation of a histogram into a frequency polygon

A. Start with the histogram for the weight distribution.
B. Place dots at the midpoints of each bar top or, in the absence of bar tops, at midpoints for classes on the horizontal axis, and connect them with straight lines.
C. Anchor the frequency polygon to the horizontal axis. First, extend the upper tail to the midpoint of the first unoccupied class on the upper flank of the histogram. Then extend the lower tail to the midpoint of the first unoccupied class on the lower flank of the histogram. Now all of the area under the frequency polygon is enclosed completely.
D. Finally, erase all of the histogram bars, leaving only the frequency polygon.

Stem and Leaf Displays

Another technique for summarizing quantitative data is a stem and leaf display. Stem and leaf displays are ideal for summarizing distributions, such as that for weight data, without destroying the identities of individual observations.

Constructing Stem and Leaf Display

The leftmost panel of Table 2.9 re-creates the weights.

To construct the stem and leaf display for these weights, first note that, when counting by tens, the weights range from the 130s to the 240s.
Arrange a column of numbers, the stems, beginning with 13 (representing the 130s) and ending with 24 (representing the 240s). Draw a vertical line to separate the stems, which represent multiples of 10, from the space to be occupied by the leaves, which represent multiples of 1.

Enter each raw score into the stem and leaf display. For example, as suggested by the shaded coding in Table 2.9, the first raw score of 160 reappears as a leaf of 0 on a stem of 16. The next raw score of 193 reappears as a leaf of 3 on a stem of 19, and the third raw score of 226 reappears as a leaf of 6 on a stem of 22, and so on, until each raw score reappears as a leaf on its appropriate stem.
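
The construction above can be sketched in Python by splitting each score into a stem (its multiple of 10) and a leaf (its units digit). Only the first three weights below (160, 193, 226) come from the text; the remaining values are hypothetical, chosen to span the 130s to the 240s, and the function name is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(scores):
    """Build a stem and leaf display for non-negative integer scores.

    Leaves are entered in the order the raw scores appear, as in the
    text, and stems with no leaves are still listed.
    """
    stems = defaultdict(list)
    for score in scores:
        stems[score // 10].append(score % 10)
    lo, hi = min(stems), max(stems)
    return [f"{stem} | {''.join(str(leaf) for leaf in stems.get(stem, []))}"
            for stem in range(lo, hi + 1)]

# 160, 193 and 226 are from the text; the other weights are hypothetical.
weights = [160, 193, 226, 152, 166, 135, 245, 171, 165, 180]
display = stem_and_leaf(weights)
print("\n".join(display))
```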

TYPICAL SHAPES
Whether expressed as a histogram, a frequency polygon, or a stem and leaf display, an important characteristic of a frequency distribution is its shape. The accompanying figure shows some of the more typical shapes for smoothed frequency polygons (which ignore the inevitable irregularities of real data).

A GRAPH FOR QUALITATIVE (NOMINAL) DATA
• As with histograms, equal segments along the horizontal axis are allocated to the different words or classes that appear in the frequency distribution for qualitative data. Likewise, equal segments along the vertical axis reflect increases in frequency. The body of the bar graph consists of a series of bars whose heights reflect the frequencies for the various words or classes.
• A person’s answer to the question “Do you have a Facebook profile?” is either Yes or No, not some impossible intermediate value, such as 40 percent Yes and 60 percent No.
• Gaps are placed between adjacent bars of bar graphs to emphasize the discontinuous nature of qualitative data.

MISLEADING GRAPHS
Graphs can be constructed in an unscrupulous manner to support a particular point of view. Popular sayings reflect this suspicion, including “Numbers don’t lie, but statisticians do” and “There are three kinds of lies: lies, damned lies, and statistics.”

Describing Data with Averages


MODE
The mode reflects the value of the most frequently occurring score. In other words, the mode is the value that has the highest frequency in a given set of values, that is, the value that appears the greatest number of times.
Example:
In the data set 2, 4, 5, 5, 6, 7, the mode is 5, since it appears twice and every other value appears once.

Types of Modes
Bimodal, Trimodal & Multimodal (More than one mode)
• When there are two modes in a data set, the set is called bimodal.
For example, the modes of set A = {2,2,2,3,4,4,5,5,5} are 2 and 5, because both 2 and 5 are repeated three times in the given set.
• When there are three modes in a data set, the set is called trimodal.
For example, the modes of set A = {2,2,2,3,4,4,5,5,5,7,8,8,8} are 2, 5 and 8.
• When there are four or more modes in a data set, the set is called multimodal.

Example: The following table represents the number of wickets taken by a bowler in 10 matches. Find the mode of the given set of data.
[Table not reproduced]
It can be seen that 2 wickets were taken by the bowler most frequently across the matches. Hence, the mode of the given data is 2.
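
The unimodal, bimodal, and trimodal examples above can be checked with Python's standard library. `statistics.multimode` (available from Python 3.8) returns every value that shares the highest frequency; the variable names are illustrative.

```python
from statistics import multimode

unimodal = [2, 4, 5, 5, 6, 7]
bimodal = [2, 2, 2, 3, 4, 4, 5, 5, 5]
trimodal = [2, 2, 2, 3, 4, 4, 5, 5, 5, 7, 8, 8, 8]

for data in (unimodal, bimodal, trimodal):
    modes = multimode(data)  # all values tied for the highest frequency
    print(data, "->", modes)
```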

MEDIAN
The median reflects the middle value when observations are ordered from least to most.
The median splits a set of ordered observations into two equal parts, the upper and lower halves.

Finding the Median

• Order scores from least to most.
• If the total number of observations, n, is odd, the median is:

Median = {(n+1)/2}th term

• If the total number of observations, n, is even, the median is:

Median = 1/2 [(n/2)th term + {(n/2)+1}th term]

Example 1:

Find the median of the following:

4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29

Solution:
n = 15
Putting the numbers in order gives:

4, 12, 14, 17, 22, 23, 23, 24, 25, 29, 40, 67, 77, 82, 92

Median = {(n+1)/2}th term
= {(15+1)/2}th term
= 8th term
The 8th term in the list is 24.
The median value of this set of numbers is 24.

Example 2:

Find the median of the following:
9, 7, 2, 11, 18, 12, 6, 4

Solution:
n = 8
Putting the numbers in order gives:
2, 4, 6, 7, 9, 11, 12, 18

Median = 1/2 [(n/2)th term + {(n/2)+1}th term]
= 1/2 [(8/2)th term + {(8/2)+1}th term]
= 1/2 [4th term + 5th term] (in our list the 4th term is 7 and the 5th term is 9)
= 1/2 [7 + 9]
= 1/2 (16)
= 8
The median value of this set of numbers is 8.
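
Both median formulas translate directly into code. The sketch below applies them to the two examples above; the function name `median_by_formula` is illustrative, and the only subtlety is converting the 1-based "nth term" of the formulas into Python's 0-based indexing.

```python
def median_by_formula(values):
    """Median using the odd and even formulas from the text."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # {(n+1)/2}th term; subtract 1 for 0-based indexing
        return ordered[(n + 1) // 2 - 1]
    # mean of the (n/2)th and {(n/2)+1}th terms
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

example_1 = [4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29]
example_2 = [9, 7, 2, 11, 18, 12, 6, 4]
print(median_by_formula(example_1))
print(median_by_formula(example_2))
```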

MEAN
The mean is found by adding all scores and then dividing by the number of scores.

Types of means
• Sample mean
• Population mean

Sample Mean
The sample mean is a measure of central tendency computed from a sample, that is, from values drawn from the population. It equals the sum of all values in the sample divided by the number of values in the sample.

Population Mean
The population mean equals the sum of all values in the population divided by the total number of values in the population.

AVERAGES FOR QUALITATIVE AND RANKED DATA

Mode
The mode can always be used with qualitative data.
Median
The median can be used whenever it is possible to order qualitative data from least to most because the level of measurement is ordinal.

Describing Variability

RANGE
The range is the difference between the largest and smallest scores. For example, if the given data set is {2, 5, 8, 10, 3}, then the range is 10 − 2 = 8.

Example 1: Find the range of the given observations: 32, 41, 28, 54, 35, 26, 23, 33, 38, 40.

Solution: First arrange the given values in ascending order.

23, 26, 28, 32, 33, 35, 38, 40, 41, 54

Since 23 is the lowest value and 54 is the highest value, the range of the observations is:
Range (X) = Max (X) − Min (X)
= 54 − 23
= 31

VARIANCE
Variance is a measure of how far a set of data (numbers) is spread out from its mean (average) value.
Formula
Variance = (Standard deviation)² = σ² = Σ(x − μ)² / n

That is, the squared deviations of all scores from the mean are added and then divided by the total number of scores.

Example
X = 5, 8, 6, 10, 12, 9, 11, 10, 12, 7
Solution
Mean = sum(x) / n
n = 10
sum(x) = 5 + 8 + 6 + 10 + 12 + 9 + 11 + 10 + 12 + 7
= 90
Mean: μ = 90 / 10 = 9
Deviations from the mean
x − μ = −4, −1, −3, 1, 3, 0, 2, 1, 3, −2

(x − μ)² = 16, 1, 9, 1, 9, 0, 4, 1, 9, 4

Σ(x − μ)² = 16 + 1 + 9 + 1 + 9 + 0 + 4 + 1 + 9 + 4
= 54

σ² = Σ(x − μ)² / n
= 54 / 10
= 5.4
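
The worked example can be reproduced step by step in Python. This sketch follows the same order as the solution (mean, deviations, squared deviations, variance) and compares the result with `statistics.pvariance` from the standard library; the variable names are illustrative.

```python
from statistics import pvariance

x = [5, 8, 6, 10, 12, 9, 11, 10, 12, 7]
n = len(x)
mu = sum(x) / n                          # mean: 90 / 10
deviations = [xi - mu for xi in x]       # x - mu
squared = [d ** 2 for d in deviations]   # (x - mu)^2
variance = sum(squared) / n              # population variance: 54 / 10

print(mu, sum(squared), variance)
print(pvariance(x))                      # library result for comparison
```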

STANDARD DEVIATION
The standard deviation is the square root of the mean of all squared deviations from the mean, that is,
Standard deviation = √variance
Standard Deviation: A rough measure of the average (or standard) amount by which scores deviate from their mean.

Standard Deviation: A Measure of Distance
The mean is a measure of position, but the standard deviation is a measure of distance (on either side of the mean of the distribution).

Sum of Squares (SS)
Calculating the standard deviation requires that we first obtain a value for the variance. However, calculating the variance requires, in turn, that we obtain the sum of the squared deviation scores.
The sum of squared deviation scores, or more simply the sum of squares, is symbolized by SS:

SS = Σ(X − μ)²

“The sum of squares equals the sum of all squared deviation scores.” You can reconstruct this formula by remembering the following three steps:

1. Subtract the population mean, μ, from each original score, X, to obtain a deviation score, X − μ.
2. Square each deviation score, (X − μ)², to eliminate negative signs.
3. Sum all squared deviation scores, Σ(X − μ)².
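
The three steps above can be sketched as a short function. The second function uses the standard computational identity SS = ΣX² − (ΣX)²/N, which does not survive in this copy of the notes and is included here as an assumption about what the missing formula showed. The scores are reused from the variance example.

```python
def sum_of_squares_definition(scores):
    """SS by the three steps: subtract the mean, square, then sum."""
    mu = sum(scores) / len(scores)
    return sum((x - mu) ** 2 for x in scores)

def sum_of_squares_computation(scores):
    """SS by the computational identity: sum(X^2) - (sum X)^2 / N."""
    n = len(scores)
    return sum(x * x for x in scores) - sum(scores) ** 2 / n

scores = [5, 8, 6, 10, 12, 9, 11, 10, 12, 7]
print(sum_of_squares_definition(scores))
print(sum_of_squares_computation(scores))
```

Both routes give the same value; the computational form avoids calculating each deviation separately.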
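Sum of Squares Formulas for Sample

Sample notation can be substituted for population notation in the definition and computation formulas for the sum of squares without causing any essential changes. With the sample mean X̄ in place of μ and the sample size n in place of N:
SS = Σ(X − X̄)² (definition formula)
SS = ΣX² − (ΣX)² / n (computation formula)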
DEGREES OF FREEDOM (df)

• Degrees of freedom (df) refers to the number of values that are free to vary, given one or more mathematical restrictions, in a sample being used to estimate a population characteristic.
• Degrees of freedom are the number of independent values that can be estimated in a statistical analysis. These values are without constraint, although they do impose restrictions on other values if the data set is to comply with estimated parameters.

Formula
Degrees of freedom: df = n − 1

Example
Consider a data set that consists of five positive integers whose mean must be 6, so that their sum must be 30.
Four of the values are selected freely as 3, 8, 5, and 4. The sum of these four values is 20, so the fifth integer must be 10. Because only four of the five values are free to vary, df = 5 − 1 = 4.

When a sample is used to estimate the population variance or standard deviation, the number of degrees of freedom, n − 1, appears in the denominator, as in the formulas for s² and s. In fact, we can use degrees of freedom to rewrite the formulas for the sample variance and standard deviation:

s² = SS / (n − 1) = SS / df
s = √(SS / (n − 1)) = √(SS / df)
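
As a hedged illustration of dividing by degrees of freedom, the sketch below treats the ten scores from the variance example as a sample (an assumption made only for this illustration) and computes s² = SS/df and s = √(SS/df), then compares with `statistics.variance` and `statistics.stdev`, which also divide by n − 1.

```python
from math import sqrt
from statistics import stdev, variance

sample = [5, 8, 6, 10, 12, 9, 11, 10, 12, 7]
n = len(sample)
df = n - 1                                    # degrees of freedom
mean = sum(sample) / n
ss = sum((x - mean) ** 2 for x in sample)     # sum of squares
s2 = ss / df                                  # sample variance
s = sqrt(s2)                                  # sample standard deviation

print(df, ss, s2, round(s, 3))
print(variance(sample), round(stdev(sample), 3))  # library comparison
```

Dividing the same sum of squares by df = 9 rather than n = 10 gives 6.0 instead of 5.4.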
INTERQUARTILE RANGE (IQR)

The interquartile range (IQR) is simply the range for the middle 50 percent of the scores. More specifically, the IQR equals the distance between the third quartile (or 75th percentile) and the first quartile (or 25th percentile), that is, after the highest quarter (or top 25 percent) and the lowest quarter (or bottom 25 percent) have been trimmed from the original set of scores. Since most distributions are spread more widely in their extremities than their middle, the IQR tends to be less than half the size of the range.

Simply put, the IQR describes the middle 50% of values when ordered from lowest to highest. To find the interquartile range (IQR), first find the median (middle value) of the lower and upper half of the data. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1.
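
The median-of-halves procedure can be sketched as follows. One detail is an assumption: when the number of scores is odd, this version leaves the overall median out of both halves, and other conventions for quartiles give slightly different values. The data are reused from the two median examples.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves."""
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[-half:]   # skips the middle value when n is odd
    return median(lower), median(upper)

def iqr(values):
    """Interquartile range: Q3 minus Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1

odd_n = [4, 17, 77, 25, 22, 23, 92, 82, 40, 24, 14, 12, 67, 23, 29]
even_n = [9, 7, 2, 11, 18, 12, 6, 4]
print(quartiles(odd_n), iqr(odd_n))
print(quartiles(even_n), iqr(even_n))
```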

Normal Distributions and Standard (z) Scores


THE NORMAL CURVE
The normal distribution is a continuous probability distribution that is symmetrical on both sides of the mean, so the right side of the center is a mirror image of the left side.

Properties of the Normal Curve

• The normal curve is a theoretical curve defined for a continuous variable, as described in Section 1.6, and noted for its symmetrical bell-shaped form.
• Because the normal curve is symmetrical, its lower half is the mirror image of its upper half.
• The normal curve peaks above a point midway along the horizontal spread and then tapers off gradually in either direction from the peak (without actually touching the horizontal axis, since, in theory, the tails of a normal curve extend infinitely far).
• The values of the mean, median (or 50th percentile), and mode, located at a point midway along the horizontal spread, are the same for the normal curve.

Properties of a normal distribution

• The mean, mode and median are all equal.
• The curve is symmetric at the center (i.e. around the mean, μ).
• Exactly half of the values are to the left of center and exactly half the values are to the right.
• The total area under the curve is 1.

Different Normal Curves
As a theoretical exercise, it is instructive to note the various types of normal curves that are produced by an arbitrary change in the value of either the mean (μ) or the standard deviation (σ).
Obvious differences in appearance among normal curves are less important than you might suspect. Because of their common mathematical origin, every normal curve can be interpreted in exactly the same way once any distance from the mean is expressed in standard deviation units.

z SCORES

A z score is a unit-free, standardized score that, regardless of the original units of measurement, indicates how many standard deviations a score is above or below the mean of its distribution. If the z score is positive, the score is above the mean. If it is negative, the score is below the mean. If the z score is 0, the score equals the mean.

To obtain a z score, express any original score, whether measured in inches, milliseconds, dollars, IQ points, etc., as a deviation from its mean (by subtracting its mean) and then split this deviation into standard deviation units (by dividing by its standard deviation):

z = (X − μ) / σ

where X is the original score and μ and σ are the mean and the standard deviation, respectively, for the normal distribution of the original scores. Since identical units of measurement appear in both the numerator and denominator of the ratio for z, the original units of measurement cancel each other and the z score emerges as a unit-free or standardized number, often referred to as a standard score.

A z score consists of two parts:


1. A positive or negative sign indicating whether it’s above or below the mean; and
2. A number indicating the size of its deviation from the mean in standard deviation units.

Converting to z Scores
Example
Suppose a score of 1100 is obtained on a GRE test. The mean score for the GRE test is 1026 and the population standard deviation is 209. To find how well a person scored relative to an average test taker, the z score must be determined.

The steps to calculate the z score are as follows:

• Step 1: Write the value of the raw score in the z score equation. z = (1100 − μ) / σ
• Step 2: Write the mean and standard deviation of the population in the z score formula. z = (1100 − 1026) / 209
• Step 3: Perform the calculations to get the required z score. z = 74 / 209 = 0.354, or 0.35 to two decimal places.
• Step 4: A z score table can be used to find the percentage of test-takers who scored below this person. Using the first two digits of the z score (0.3), determine the row of the z table containing these digits. Then, using the 2nd digit after the decimal point (5), find the corresponding column. The intersection of this row and column gives the value 0.6368 for this example.
• Step 5: Multiply the value from step 4 by 100 to get the required percentage: 0.6368 × 100 = 63.68%. This shows that 63.68% of test-takers scored lower than the given raw score.
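
The GRE example can be checked with Python's standard library, using `statistics.NormalDist` in place of a printed z table. This is a sketch: the function name `z_score` is illustrative, and z is rounded to two decimals only to mirror the table lookup.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Deviation from the mean expressed in standard deviation units."""
    return (x - mean) / sd

z = z_score(1100, 1026, 209)        # 74 / 209
z_table = round(z, 2)               # z tables are indexed to two decimals
below = NormalDist().cdf(z_table)   # proportion of scores below z

print(round(z, 3), z_table, round(below, 4))
print(f"{below * 100:.2f}% of test-takers scored lower")
```

Using the unrounded z instead of the two-decimal table value would give a slightly larger proportion.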

STANDARD NORMAL CURVE


If the original distribution approximates a normal curve, then the shift to standard or z scores will always
produce a new distribution that approximates the standard normal curve. This is the one normal curve for
No

which a table is actually available.

Although there is an infinite number of different normal curves, each with its own mean and standard
deviation, there is only one standard normal curve, with a mean of 0 and a standard deviation of 1.

For a standard normal curve


w.

Mean = 0

Standard deviation = 1
ww

Standard Normal Table


The standard normal table consists of columns of z scores coordinated with columns of proportions

Using the Top Legend of the Table


Notice that columns are arranged in sets of three, designated as A, B, and C in the legend at the top of the
table. When using the top legend, all entries refer to the upper half of the standard normal curve. The entries
in column A are z scores, beginning with 0.00 and ending with 4.00.


Given a z score of zero or more, columns B and C indicate how the z score splits the area in the upper half of
the normal curve. As suggested by the shading in the top legend, column B indicates the proportion of area
between the mean and the z score, and column C indicates the proportion of area beyond the z score, in the
upper tail of the standard normal curve.

Using the Bottom Legend of the Table


Now the columns are designated as A′, B′, and C′ in the legend at the bottom of the table. When using the
bottom legend, all entries refer to the lower half of the standard normal curve.
Given a negative z score, columns B′ and C′ indicate how that z score splits the lower half of the normal curve. As
suggested by the shading in the bottom legend of the table, column B′ indicates the proportion of area between
the mean and the negative z score, and column C′ indicates the proportion of area beyond the negative z score,
in the lower tail of the standard normal curve.
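
The relationship between columns A, B, and C can be sketched in Python. This is an illustration under assumptions, not a reproduction of the table: `phi` and `table_columns` are hypothetical helper names, and the areas are computed from the error function instead of being read from the table.

```python
import math

def phi(z):
    """Cumulative area under the standard normal curve to the left of z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def table_columns(z):
    """Return (B, C) for the z score in column A (or A' when z is negative).

    B is the area between the mean and z; C is the area beyond z in the tail.
    Because the curve is symmetric, B' and C' for -z equal B and C for +z.
    """
    a = abs(z)
    b = phi(a) - 0.5   # between the mean and z
    c = 1.0 - phi(a)   # beyond z, in the tail
    return b, c

b, c = table_columns(1.00)
print(round(b, 4), round(c, 4))

# The same call with a negative z score gives the lower-half entries B' and C'.
b_neg, c_neg = table_columns(-1.00)
print(round(b_neg, 4), round(c_neg, 4))
```

For any z score, B and C add up to 0.5, since together they cover exactly one half of the curve.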


FINDING PROPORTIONS
Finding Proportions for One Score
 Sketch a normal curve and shade in the target area.
 Plan your solution according to the normal table.
 Convert X to z.

 Find the target area.

Finding Proportions between Two Scores

 Sketch a normal curve and shade in the target area (for example, find the proportion between 245 and 255).

 Plan your solution according to the normal table.


 Convert X to z by expressing 245 and 255 as z scores, using z = (X − μ) / σ.
 Find the target area.
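
The steps for finding a proportion between two scores can be sketched as follows. The notes do not state the mean and standard deviation for the 245 to 255 example, so the values below (mean 270, standard deviation 15) are assumptions chosen only for illustration. The sketch uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) in place of the table.

```python
from statistics import NormalDist

# ASSUMED values for illustration only; the notes do not give them.
mean, sd = 270, 15
low, high = 245, 255

# Convert each X to z with z = (X - mean) / sd.
z_low = (low - mean) / sd
z_high = (high - mean) / sd

# Find the target area: the area below the higher score
# minus the area below the lower score.
standard_normal = NormalDist(0, 1)
proportion = standard_normal.cdf(z_high) - standard_normal.cdf(z_low)

print(round(z_low, 2), round(z_high, 2))
print(round(proportion, 4))
```

Subtracting the two cumulative areas is equivalent to subtracting the two column C′ entries when both scores lie below the mean.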


FINDING SCORES
So far, we have concentrated on normal curve problems for which Table A must be consulted to find
the unknown proportion (of area) associated with some known score or pair of known scores.
Now we will concentrate on the opposite type of normal curve problem, for which Table A must be
consulted to find the unknown score or scores associated with some known proportion.
This type of problem requires that we reverse our use of Table A by entering proportions in
columns B, C, B′, or C′ and finding z scores listed in columns A or A′.

Finding One Score


 Sketch a normal curve and, on the correct side of the mean, draw a line representing the target
score, as in figure

It’s often helpful to visualize the target score as splitting the total area into two sectors: one to the left of
(below) the target score and one to the right of (above) the target score.

 Plan your solution according to the normal table.


In problems of this type, you must plan how to find the z score for the target score. Because the target score is
on the right side of the mean, concentrate on the area in the upper half of the normal curve, as described in
columns B and C.

 Find z.
 Convert z to the target score.

When converting z scores to original scores, you will probably find it more efficient to use the following
equation:
X = μ + (z)(σ)
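
A minimal Python sketch of finding one score from a known proportion is given below. The mean of 100, the standard deviation of 15, and the target of the upper 20 percent are hypothetical values, not taken from the notes. The inverse table lookup is done with `NormalDist.inv_cdf` from the standard library, and the conversion back to a raw score uses X = μ + zσ, the inverse of the z score formula.

```python
from statistics import NormalDist

# HYPOTHETICAL distribution and target proportion, for illustration only.
mean, sd = 100, 15
upper_tail = 0.20  # find the score that cuts off the upper 20 percent

# Find z: the area below the target score is 1 - upper_tail.
# This replaces searching column C of the table for 0.2000.
z = NormalDist(0, 1).inv_cdf(1 - upper_tail)

# Convert z to the target score with X = mean + z * sd.
target_score = mean + z * sd

print(round(z, 2))
print(round(target_score, 2))
```

Because the target score is above the mean, z comes out positive, matching the use of the upper half of the curve (columns B and C).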


Finding Two Scores


 Sketch a normal curve. On either side of the mean, draw two lines representing the two target scores,
as in figure




 Plan your solution according to the normal table.
 Find z.
 Convert z to the target score.
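
Finding two scores works the same way on both sides of the mean. The sketch below is an illustration with hypothetical values (mean 100, standard deviation 15, middle 95 percent of the area); none of these numbers come from the notes.

```python
from statistics import NormalDist

# HYPOTHETICAL values, for illustration only.
mean, sd = 100, 15
middle = 0.95  # proportion of area between the two target scores

# Each tail holds half of the remaining area.
tail = (1 - middle) / 2

# Find z: the two z scores are symmetric about the mean.
z_upper = NormalDist(0, 1).inv_cdf(1 - tail)
z_lower = -z_upper

# Convert each z to a target score with X = mean + z * sd.
lower_score = mean + z_lower * sd
upper_score = mean + z_upper * sd

print(round(z_lower, 2), round(z_upper, 2))
print(round(lower_score, 1), round(upper_score, 1))
```

Only one lookup is needed, because the lower z score is the negative of the upper one.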
Points to Remember
1. range = largest value – smallest value in a list
2. class interval = range / desired number of classes
3. relative frequency = frequency (f) / Σf
4. Cumulative frequency - add to the frequency of each class the sum of the frequencies of all
classes ranked below it.
5. Cumulative percentage = (cumulative f / Σf) * 100
6. Histograms
7. Construction of frequency polygon
8. Stem and leaf display

9. Mode - The value of the most frequent score.


10. For an odd number of terms, Median = {(n+1)/2}th term (observation). For an even number of terms,
Median = 1/2[(n/2)th term + {(n/2)+1}th term]
11. Mean = sum of all scores / number of scores
Variance σ² = Σ(x−μ)² / n
Variance = (Standard deviation)² = σ², so σ = √[Σ(x−μ)² / n]
12. Range (X) = Max (X) – Min (X)
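
Several of the formulas above can be tried out with Python's standard `statistics` module. The data list is made up for illustration, and the population versions of variance and standard deviation are used (division by n, as in point 11).

```python
import statistics

# Made-up scores, for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(scores) - min(scores)     # range = largest - smallest
mode = statistics.mode(scores)             # most frequent score
median = statistics.median(scores)         # average of the two middle terms (n is even)
mean = statistics.mean(scores)             # sum of scores / number of scores
variance = statistics.pvariance(scores)    # sum of squared deviations / n
std_dev = statistics.pstdev(scores)        # square root of the variance

print(data_range, mode, median, mean, variance, std_dev)
```

`statistics.variance` and `statistics.stdev` divide by n − 1 instead of n, which corresponds to the degrees of freedom used for sample estimates.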

13. Degrees of freedom: df = n − 1
14. Types of normal curve



15. z – score


16. Standard normal curve; mean = 0, standard deviation = 1


17. Finding proportions
1. For one score
2. Between two scores

18. Finding scores
1. One score
2. Two scores

Unit – III
DESCRIBING RELATIONSHIPS
Correlation – Scatter plots – correlation coefficient for quantitative data – computational formula for correlation
coefficient – Regression – regression line – least squares regression line – Standard error of estimate –
interpretation of r² – multiple regression equations – regression towards the mean

Correlation
Correlation refers to a process for establishing the relationship between two variables. One way to get a
general idea of whether or not two variables are related is to plot them on a “scatter plot”. While there
are many measures of association for variables measured at the ordinal or higher level of
measurement, correlation is the most commonly used approach.

Types of Correlation
 Positive Correlation – when the values of the two variables move in the same direction, so that an
increase/decrease in the value of one variable is accompanied by an increase/decrease in the value of the
other variable.
 Negative Correlation – when the values of the two variables move in opposite directions, so that an
increase/decrease in the value of one variable is accompanied by a decrease/increase in the value of the other
variable.
 No Correlation – when there is no linear dependence or no relation between the two variables.

SCATTERPLOTS
A scatter plot is a graph containing a cluster of dots that represents all pairs of scores. In other words,
scatter plots are graphs that present the relationship between two variables in a data set. They represent data
points on a two-dimensional plane or Cartesian coordinate system.

Construction of scatter plots


 The independent variable or attribute is plotted on the X-axis (Fig 6.1).
 The dependent variable is plotted on the Y-axis.
 Use each pair of scores to locate a dot within the scatter plot.

Positive, Negative, or Little or No Relationship?

The first step is to note the tilt or slope, if any, of a dot cluster.
A dot cluster that slopes from the lower left to the upper right, as in panel A of the figure below, reflects a
positive relationship.

A dot cluster that slopes from the upper left to the lower right, as in panel B of the figure below, reflects a
negative relationship.

A dot cluster that lacks any apparent slope, as in panel C of the figure below, reflects little or no relationship.

Perfect Relationship
A dot cluster that equals (rather than merely approximates) a straight line reflects a perfect relationship between
two variables.
Curvilinear Relationship
The previous discussion assumes that a dot cluster approximates a straight line and, therefore, reflects a linear
relationship. But this is not always the case. Sometimes a dot cluster approximates a bent or curved line, as in
the figure below, and therefore reflects a curvilinear relationship.

A CORRELATION COEFFICIENT FOR QUANTITATIVE DATA : r


The correlation coefficient, r, is a summary measure that describes the extent of the statistical
relationship between two interval or ratio level variables.

Properties of r
 The correlation coefficient is scaled so that it is always between -1 and +1.
 When r is close to 0, there is little relationship between the variables; the farther r is
from 0, in either the positive or negative direction, the stronger the relationship between the two
variables.
 The sign of r indicates the type of linear relationship, whether positive or negative.
 The numerical value of r, without regard to sign, indicates the strength of the linear relationship.

 A number with a plus sign (or no sign) indicates a positive relationship, and a number with a minus sign
indicates a negative relationship.

COMPUTATION FORMULA FOR r
Calculate a value for r by using the following computation formula:

r = SPxy / √(SSx · SSy)

Where the two sum of squares terms in the denominator are defined as

SSx = Σ(x − x̄)² = Σx² − (Σx)² / n
SSy = Σ(y − ȳ)² = Σy² − (Σy)² / n

The sum of the products term in the numerator, SPxy, is defined as

SPxy = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy) / n

Or the formula is written as

r = [nΣxy − (Σx)(Σy)] / √([nΣx² − (Σx)²] [nΣy² − (Σy)²])

Where n = number of pairs of scores
Σx = total of the first variable's values
Σy = total of the second variable's values
Σxy = sum of the products of the first and second values
Σx² = sum of the squares of the first variable's values
Σy² = sum of the squares of the second variable's values
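
The computation of r from the sums listed above can be sketched in Python. This is an illustration, not a library routine: the function name `pearson_r` and the sample data are made up, and the code uses the standard computational forms SSx = Σx² − (Σx)²/n, SSy = Σy² − (Σy)²/n, and SPxy = Σxy − (Σx)(Σy)/n.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient r from the computational formula."""
    n = len(xs)                                   # number of pairs of scores
    sum_x, sum_y = sum(xs), sum(ys)               # Σx and Σy
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # Σxy
    sum_x2 = sum(x * x for x in xs)               # Σx²
    sum_y2 = sum(y * y for y in ys)               # Σy²

    sp_xy = sum_xy - sum_x * sum_y / n            # sum of products
    ss_x = sum_x2 - sum_x ** 2 / n                # sum of squares for x
    ss_y = sum_y2 - sum_y ** 2 / n                # sum of squares for y
    return sp_xy / math.sqrt(ss_x * ss_y)

# Made-up pairs of scores, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))
```

The function divides by zero if either variable has no variability (all values equal), in which case r is undefined.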
