Data Management

This document discusses various statistical concepts and terminology. It begins by defining key population and sample terms used in statistics. It then discusses different types of variables and levels of measurement. Various measures of center are introduced, including the mean, median and mode, and examples are provided to illustrate when each would be most appropriate. Measures of dispersion like range, standard deviation and variance are also defined. The document then covers percentiles, quartiles and z-scores as measures of relative position. It provides an overview of the normal distribution and its key characteristics. Finally, it introduces the empirical rule for interpreting intervals within 1, 2 and 3 standard deviations of the mean for a normal distribution.

Some Statistical Terminologies…

Statistics involves the collection, organization, summarization,
presentation, and interpretation of data (Aufmann et al., 2013).
Population – refers to an entire group that is being studied.
Parameter – is a value calculated using all the data from a
population.
Census – is a survey of an entire population.
Sample – is a smaller subset of the population, ideally one that is
fairly representative of the population.
Statistic – is a value calculated using the data from the sample.
Some Statistical Terminologies…
Classifying variables
Variable – a particular characteristic or trait of the units of the
population that can take on different values.
Qualitative – when a characteristic can be placed into
well-defined groups or categories.
Quantitative – when a characteristic is expressed as a numerical
value.
Discrete – can take only countably many values; that is, a count.
Continuous – can take all possible values within a range; that
is, a measurement.
Some Statistical Terminologies…
Levels of Measurement – a classification of data first proposed by the
American psychologist Stanley Smith Stevens in 1946.
Nominal – the values of the variable are names or labels.
Ordinal – the values are categories that convey a meaningful
order.
Interval – the values show order, and the differences between the
values are also meaningful.
Ratio – the ratio between any two values has meaning, because
the data include an absolute zero value.
Measures of Center
Once the data are collected, it is useful to summarize the data
set by identifying a value around which the data are centered.

Mode – is the most frequently occurring number in a data set.

Median – is the middle number or the mean of the two middle
numbers in an ordered set of data.

Mean – is the numerical balancing point of the data set.


Measures of Center
Which measure of center is most useful?
❖ A teacher wants to know about her students’ family situations.
She asks for the number of children in their families:
6 3 2 3 4 1 2 2 4 3 1 2 2 4

❖ A shoe manufacturer wants to know the average shoe size of
women.

❖ Another teacher wants to know how well her class performed
in a long test.
Measures of Center
Mean – Median – Mode
➢ The mean is easy to compute: add the values and divide by
how many there are. The median requires ordering the data first.
➢ The mean is affected by outliers while the median is
resistant. In a sense, the median is able to resist the pull of a
far away value, but the mean is drawn to such values.
➢ A change in any of the numbers changes the mean, and the
mean can be changed drastically by changing an extreme
value.
➢ In contrast, the median and the mode of a set of data are
usually not changed by changing an extreme value.
➢ The mean, the median, and the mode are all averages;
however, they are generally not equal.
Measures of Center
Compare the mean, the median, and the mode for the
salaries of 5 employees of a small company.

Salaries: P370,000 P60,000 P36,000 P20,000 P20,000

Mean = P101,200
Median = P 36,000
Mode = P 20,000

Most of the employees of this company would probably
agree that the median of P36,000 better represents the average of
the salaries than does either the mean or the mode.
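The comparison above can be checked with a short Python sketch. It uses only the standard `statistics` module; the replacement salary of P70,000 is a hypothetical value, introduced only to show how each measure reacts when the extreme value is removed.

```python
from statistics import mean, median, mode

# Salaries (in pesos) of the 5 employees from the example.
salaries = [370_000, 60_000, 36_000, 20_000, 20_000]

print("mean  :", mean(salaries))    # 101200
print("median:", median(salaries))  # 36000
print("mode  :", mode(salaries))    # 20000

# Hypothetical change: replace the extreme salary with a more typical one.
adjusted = [70_000 if s == 370_000 else s for s in salaries]

print("mean after change  :", mean(adjusted))    # 41200
print("median after change:", median(adjusted))  # 36000
```

Only the mean moves when the extreme value changes, which is what it means for the median to be resistant.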
Measures of Center
Consider the data in the following table.

▪ In the first game, Barry has the best average.
▪ In the second game, Barry has the best average.
▪ If the statistics for the games are combined, Warren has the best
average.
In statistics, an example such as this is known as Simpson’s
paradox.
Measures of Center
The Simpson-Yule paradox means that a comparison made within
each group of a data set can be reversed when the groups are
combined and viewed as a whole.
Consider the following data on the test scores for two
students.

         English                       History                           English and History combined
Maria    84, 65, 70, 90, 99, 84        89, 75, 85                        Average: ?
Sarah    66, 84, 75, 77, 94, 96, 81    72, 78, 98, 81, 68, 92, 88, 86    Average: ?
Is this an example of Simpson’s paradox? Explain.
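One way to explore the question is to compute the three averages for each student. The sketch below assumes the table's scores split into English and History as listed in the code (six and three scores for Maria, seven and eight for Sarah); the dictionary layout and the function name `averages` are illustrative choices, not part of the original slides.

```python
from statistics import mean

# Assumed split of the table into English and History scores.
scores = {
    "Maria": {"English": [84, 65, 70, 90, 99, 84],
              "History": [89, 75, 85]},
    "Sarah": {"English": [66, 84, 75, 77, 94, 96, 81],
              "History": [72, 78, 98, 81, 68, 92, 88, 86]},
}

def averages(student):
    """Return (English, History, combined) averages for one student."""
    english = scores[student]["English"]
    history = scores[student]["History"]
    return mean(english), mean(history), mean(english + history)

for student in scores:
    eng, hist, both = averages(student)
    print(f"{student}: English {eng:.2f}, History {hist:.2f}, combined {both:.2f}")
```

Under this split, Maria has the higher average in each subject, while Sarah has the higher combined average. The reversal comes from the unequal number of tests each student took in each subject.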


Measures of Dispersion
Another important feature that can help us understand more
about a data set is the manner in which the data are distributed.
➢ Range is the difference between the largest value (maximum)
and the smallest value (minimum) in the data.

➢ Standard deviation is an extremely important measure of
spread that is based on the mean. It measures the typical
deviation of the data points from the mean.

➢ Variance is the square of the standard deviation of the data. It
does not use the same unit of measure as the original data; its
unit is the square of the original unit.
Measures of Dispersion
Consider the following data sets:
1. 5 5 5 5 5 5 5 5 5 5

2. 0 0 0 0 0 10 10 10 10 10

3. 4 4 4 5 5 5 6 6 6

4. 0 5 5 5 5 5 5 10
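The four data sets share the same mean but differ in spread. A minimal sketch follows, assuming each set is treated as a whole population (so `statistics.pstdev` is used rather than the sample version `statistics.stdev`); the function name `summarize` is illustrative.

```python
from statistics import mean, pstdev

data_sets = {
    1: [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    2: [0, 0, 0, 0, 0, 10, 10, 10, 10, 10],
    3: [4, 4, 4, 5, 5, 5, 6, 6, 6],
    4: [0, 5, 5, 5, 5, 5, 5, 10],
}

def summarize(values):
    """Return (mean, range, population standard deviation)."""
    return mean(values), max(values) - min(values), pstdev(values)

for label, values in data_sets.items():
    center, value_range, sd = summarize(values)
    print(f"Set {label}: mean={center}, range={value_range}, sd={sd:.3f}")
```

Sets 2 and 4 have the same range, yet set 2 has twice the standard deviation of set 4, because in set 2 every value sits far from the mean.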
Measures of Dispersion
Properties that determine the usefulness of the standard
deviation:
➢ It is used to describe the variability of the distribution only
when the mean is used to describe the center.
➢ It is equal to zero when there is no variability. This happens
only when all observations are of the same value.
➢ It has the same units of measurement as the original
observations.
➢ Like the mean, it can be influenced by outliers.
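for population: $\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$
for sample: $s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
Measures of Dispersion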
Example. A consumer group has tested a sample of 8 size-AAA
batteries from each of 3 companies. The results of the tests are
shown in the following table. According to these tests, which
company produces batteries for which the values representing
hours of constant use have the smallest standard deviation?
Measures of Relative Position
Percentiles and Quartiles
are useful when you want to know where a score is located
relative to the other scores.
➢ Percentile is a data value for which the specified percentage
of the data is below that value.
➢ The median is the 50th percentile.
➢ The 25th, 50th, and 75th percentiles divide the data into lower
quartile Q1, middle quartile Q2, and upper quartile Q3,
respectively.
➢ In using quartiles, there are five numbers to be used
altogether: min value, Q1, median, Q3, and max value.
➢ Quartiles are useful for box plots.
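As an illustration, the five-number summary can be computed for the family-size data from the measures-of-center slide. This sketch uses `statistics.quantiles` with its default "exclusive" method; textbooks and software follow slightly different quartile conventions, so other tools may give somewhat different Q1 and Q3 values. The function name `five_number_summary` is illustrative.

```python
from statistics import quantiles

# Number of children per family, from the measures-of-center example.
children = [6, 3, 2, 3, 4, 1, 2, 2, 4, 3, 1, 2, 2, 4]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, q2, q3 = quantiles(values, n=4)  # default method="exclusive"
    return min(values), q1, q2, q3, max(values)

print(five_number_summary(children))  # (1, 2.0, 2.5, 4.0, 6)
```

These five values are exactly what a box plot draws: the box spans Q1 to Q3, with a line at the median.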
Measures of Relative Position
z-score
The z-score for a given data value x is the number of standard
deviations that x is above or below the mean of the data.

z-score of $x_i$ in a population: $z_i = \dfrac{x_i - \mu}{\sigma}$

z-score of $x_i$ in a sample: $z_i = \dfrac{x_i - \bar{x}}{s}$
Problem. (Task: Discuss your solutions to each of the 3 problems.)
1. The mean time to download a file is 12 minutes, with a standard
deviation of 4 minutes. Your download time is 20 minutes.
Your friend’s download time is 6 minutes. How does your
download time compare with your friend’s?

2. Raul takes 2 tests in Chemistry. He scored 72 in long test 1 (LT 1), for
which the class mean score was 65 with a standard deviation of 8. He
received a score of 60 in long test 2 (LT 2), for which the mean was 45
and the standard deviation was 12. In comparison to other students,
did Raul do better in LT 1 or LT 2?

3. A consumer group tested a sample of 100 light bulbs. The
mean life expectancy of the light bulbs was 842 hours, with a standard
deviation of 90 hours. One particular light bulb from the
company has a life-expectancy z-score of 1.2. What was the life
span of the bulb?
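The three problems all rest on the same two conversions: from a data value to a z-score, and from a z-score back to a data value. A minimal sketch, with helper names (`z_score`, `value_from_z`) chosen here for illustration:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Data value that corresponds to a given z-score."""
    return mean + z * sd

# Problem 1: download times
you = z_score(20, 12, 4)      # 2.0  -> 2 standard deviations above the mean
friend = z_score(6, 12, 4)    # -1.5 -> 1.5 standard deviations below the mean

# Problem 2: Raul's two long tests
lt1 = z_score(72, 65, 8)      # 0.875
lt2 = z_score(60, 45, 12)     # 1.25

# Problem 3: life span of the light bulb, in hours
bulb = value_from_z(1.2, 842, 90)  # about 950

print(you, friend, lt1, lt2, round(bulb, 1))
```

Because z-scores are unit-free, the larger z-score in Problem 2 identifies the test on which Raul did better relative to his classmates, even though his raw score on that test was lower.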
Normal Distribution and Probability
Normal Distribution
is an extremely important concept, because it occurs so often
in the data we collect from the natural world, as well as in many
of the more theoretical ideas that are the foundation of statistics.
Normal Distribution and Probability
Characteristics of a Normal Distribution
Shape
A normal distribution is a perfectly symmetric, mound-shaped
distribution. It is commonly referred to as a normal curve, or
bell curve.
Normal Distribution and Probability
Characteristics of a Normal Distribution
Center
Due to the exact symmetry of a normal curve, the center of a
normal distribution is located at the highest point of the
distribution, and all the statistical measures of center are equal.
Normal Distribution and Probability
Characteristics of a Normal Distribution
Center
It is also important to realize that this center peak divides the
data into two equal parts.
Normal Distribution and Probability
Characteristics of a Normal Distribution
Spread
In an idealized normal distribution of a continuous random
variable, the distribution continues infinitely in both directions.
Normal Distribution and Probability
Characteristics of a Normal Distribution
Area under the Curve
➢ Areas under the curve that are symmetric about the mean are
equal.
➢ The total area under the curve is 1.
Normal Distribution and Probability
Empirical Rule for a Normal Distribution
In a normal distribution, approximately
➢ 68% of the data lie within 1 standard deviation of the mean.
➢ 95% of the data lie within 2 standard deviations of the mean.
➢ 99.7% of the data lie within 3 standard deviations of the mean.
Normal Distribution and Probability
Empirical Rule for a Normal Distribution
Example. The heights of a large group of people are assumed to be
normally distributed. Their mean height is 66.5 inches, and the
standard deviation is 2.4 inches. Find and interpret the intervals
within one, two, and three standard deviations of the
mean.

Within one standard deviation of the mean:
$66.5 \pm 2.4$, that is, from 64.1 to 68.9 inches.
Approximately 68% of the people are between 64.1 and 68.9 inches tall.
Within two standard deviations of the mean:
$66.5 \pm 2(2.4)$, that is, from 61.7 to 71.3 inches.
Therefore, approximately 95% of the people are between 61.7 and 71.3 inches tall.
Within three standard deviations of the mean:
$66.5 \pm 3(2.4)$, that is, from 59.3 to 73.7 inches.
Nearly all of the people (approximately 99.7%) are between 59.3 and 73.7 inches tall.
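The intervals in this example, and the percentages in the empirical rule itself, can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper names `interval` and `share_within` are illustrative.

```python
from statistics import NormalDist

# Heights: mean 66.5 inches, standard deviation 2.4 inches.
heights = NormalDist(mu=66.5, sigma=2.4)

def interval(k):
    """Interval within k standard deviations of the mean."""
    return heights.mean - k * heights.stdev, heights.mean + k * heights.stdev

def share_within(k):
    """Proportion of a normal population that lies inside interval(k)."""
    low, high = interval(k)
    return heights.cdf(high) - heights.cdf(low)

for k in (1, 2, 3):
    low, high = interval(k)
    print(f"{k} sd: {low:.1f} to {high:.1f} inches, {share_within(k):.1%} of people")
```

The computed proportions are close to, but not exactly, 68%, 95%, and 99.7%, which is why the empirical rule is stated with the word "approximately".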
Problem. (Use the Empirical rule)
A vegetable distributor knows that during the month of August, the
weights of its tomatoes are normally distributed with a mean of 0.61 kg and a
standard deviation of 0.15 kg.
a. What percent of the tomatoes weigh less than 0.76 kg?
b. In a shipment of 6000 tomatoes, how many tomatoes can be expected to
weigh more than 0.31 kg?
c. In a shipment of 4500 tomatoes, how many tomatoes can be expected to
weigh from 0.31 kg to 0.91 kg?

a. 0.76 kg is 1 standard deviation above the mean of 0.61 kg. In a normal distribution, 34% of all data lie
between the mean and 1 standard deviation above the mean, and 50% of all data lie below the mean. Thus,
34% + 50% = 84% of the tomatoes weigh less than 0.76 kg.
b. 0.31 kg is 2 standard deviations below the mean of 0.61 kg. In a normal distribution, 47.5% of all data lie
between the mean and 2 standard deviations below the mean, and 50% of all data lie above the mean. This
gives a total of 47.5% + 50% = 97.5% of the tomatoes that weigh more than 0.31 kg. Therefore 97.5% of
6000 = 5850 of the tomatoes can be expected to weigh more than 0.31 kg.
c. 0.31 kg is 2 standard deviations below the mean of 0.61 kg and 0.91 kg is 2 standard deviations above the
mean of 0.61 kg. In a normal distribution, 95% of all data lie within 2 standard deviations of the mean.
Therefore 95% of 4500 = 4275 of the tomatoes can be expected to weigh from 0.31 kg to 0.91 kg.
Normal Distribution and Probability
Standard Normal Distribution

If the original distribution of x values is a normal distribution, then the
corresponding distribution of z-scores will also be a normal
distribution. This normal distribution of z-scores is called the standard
normal distribution.

The standard normal distribution is the normal distribution that
has a mean of 0 and a standard deviation of 1.
Normal Distribution and Probability
Standard Normal Distribution

In the standard normal distribution, the area of the distribution
from z = a to z = b represents

➢ the percentage of z-values that lie in the interval from a to b.

➢ the probability that z lies in the interval from a to b.


Problem
1. A soda machine dispenses soda into 12-ounce cups. Tests show
that the actual amount of soda dispensed is normally distributed,
with a mean of 11.5 oz and a standard deviation of 0.2 oz.
a. What percent of cups will receive less than 11.25 oz of soda?
b. What percent of cups will receive between 11.2 oz and 11.55 oz
of soda?
c. If a cup is chosen at random, what is the probability that the
machine will overflow the cup?
2. The OnTheGo company manufactures laptop computers. A study
indicates that the life spans of their computers are normally
distributed, with a mean of 4.0 years and a standard deviation of
1.2 years. How long should the company warrant its computers if
the company wishes less than 4% of its computers to fail during the
warranty period?
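Problems of this kind are usually solved with a table of areas under the standard normal curve. As a hedged alternative, the sketch below obtains the same areas from `statistics.NormalDist` (Python 3.8 or later). It assumes that "overflow" means more than 12 oz is dispensed; the variable names are illustrative.

```python
from statistics import NormalDist

# Problem 1: soda dispensed, mean 11.5 oz, standard deviation 0.2 oz.
soda = NormalDist(mu=11.5, sigma=0.2)

p_less = soda.cdf(11.25)                     # (a) less than 11.25 oz
p_between = soda.cdf(11.55) - soda.cdf(11.2) # (b) between 11.2 oz and 11.55 oz
p_overflow = 1 - soda.cdf(12)                # (c) more than the 12-oz cup holds

print(f"(a) {p_less:.1%}  (b) {p_between:.1%}  (c) {p_overflow:.4f}")

# Problem 2: laptop life spans, mean 4.0 years, standard deviation 1.2 years.
laptops = NormalDist(mu=4.0, sigma=1.2)

# Life span below which 4% of the computers fail; the warranty
# should be no longer than this.
warranty_years = laptops.inv_cdf(0.04)
print(f"warranty limit: about {warranty_years:.2f} years")
```

`cdf` gives the area to the left of a value, so areas between two values are differences of two `cdf` calls, and `inv_cdf` reverses the lookup by returning the value that has a given area to its left.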
Statistical Hypotheses

A hypothesis is simply a conjecture about a characteristic or
set of facts.
When performing statistical analyses, our hypotheses provide
the general framework of what we are testing and how to perform
the test.

Hypothesis testing involves testing the difference between a
hypothesized value of a population parameter and the estimate of
that parameter which is calculated from a sample.
Statistical Hypotheses
Overview of the Process

The hypothesis to be tested is called the null hypothesis and is
given the symbol H0. The alternative hypothesis is given the
symbol H1.
Statistical Hypotheses
Sample null and alternative hypotheses
1. H0: μ = μ0 (Mean is equal to a reference value)
   H1: μ ≠ μ0 or H1: μ > μ0 or H1: μ < μ0
2. H0: μ1 = μ2 (Two population means are equal)
   H1: μ1 ≠ μ2 or H1: μ1 > μ2 or H1: μ1 < μ2
3. H0: μ1 = μ2 = . . . = μk (The k population means are equal)
   H1: at least two means are not equal
4. H0: π1 = π2 (Two population proportions are equal)
   H1: π1 ≠ π2 or H1: π1 > π2 or H1: π1 < π2
5. H0: ρ = 0 (There is no linear correlation between the two variables)
   H1: ρ ≠ 0 or H1: ρ > 0 or H1: ρ < 0

If H1 contains either > or <, the test is referred to as a one-sided test.
If H1 contains ≠, it is a two-sided test.
Statistical Hypotheses
Tests Concerning the Mean
To test whether the difference between a mean and a reference
value, or between two means, is significant or can be
attributed to chance, the following statistical tests are used.
The z–test is used if the population standard deviation is
known. If it is not known, the sample standard deviation can be used as an
estimate of the population standard deviation, provided that the
sample size is large; that is, n ≥ 30.
The t–test is used if the sample size is less than 30 and only the
sample standard deviation is known.
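To make the z–test concrete, here is a minimal sketch of a one-sample, two-sided test of H0: μ = μ0. The numbers (sample mean 52, reference value 50, population standard deviation 6, sample size 36) are hypothetical and are not taken from the slides; the function name `one_sample_z` is also illustrative.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_z(sample_mean, mu0, sigma, n):
    """Test statistic z = (sample mean - mu0) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / sqrt(n))

# Hypothetical data: n = 36 observations with mean 52, tested against
# mu0 = 50 with a known population standard deviation of 6.
z = one_sample_z(sample_mean=52, mu0=50, sigma=6, n=36)

# Two-sided p-value: area in both tails beyond |z| under the standard normal curve.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"z = {z:.2f}, two-sided p-value = {p_two_sided:.4f}")
```

With a significance level of α = 0.05, a p-value below 0.05 would lead to rejecting H0. For a small sample with only the sample standard deviation available, the same statistic would be compared against a t distribution with n − 1 degrees of freedom instead.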
Statistical Hypotheses
Tests Concerning the Mean
The purpose of analysis of variance (ANOVA) is much the
same as that of the t–test; however, if a series of several t–tests is used
to evaluate several mean differences, the risk of a Type I error
increases; that is, the α-levels accumulate over the series of tests so
that the final experimentwise α-level can be quite large.
The ANOVA is necessary to protect researchers from
excessive risk of a Type I error.
The ANOVA allows researchers to evaluate all of the mean
differences in a single hypothesis test using a single α-level and,
thereby, keeps the risk of a Type I error under control no matter
how many different means are being compared.
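The accumulation of α-levels can be illustrated with a rough calculation. The sketch below assumes the individual tests are independent, which pairwise t–tests on shared groups are not, so the figures are an approximation rather than exact error rates. The function name `familywise_alpha` is illustrative.

```python
from math import comb

def familywise_alpha(alpha, num_tests):
    """Chance of at least one Type I error in num_tests independent tests."""
    return 1 - (1 - alpha) ** num_tests

alpha = 0.05
for groups in (3, 4, 5):
    pairs = comb(groups, 2)  # number of pairwise t-tests among the groups
    rate = familywise_alpha(alpha, pairs)
    print(f"{groups} groups -> {pairs} t-tests -> experimentwise alpha of about {rate:.3f}")
```

Under this approximation, comparing 5 group means pairwise at α = 0.05 gives roughly a 40% chance of at least one false rejection, which is the risk a single ANOVA test avoids.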
Statistical Hypotheses
Tests Concerning the Mean
The ANOVA tests the homogeneity of a set of means, but if the
null hypothesis is rejected in favor of the alternative hypothesis
that the means are not all equal, further (post hoc) tests should be
done to determine which pairs of means are significantly different.
The following post hoc tests are available in most statistical
software:
1. Duncan’s multiple range test
2. Tukey’s procedure
3. Scheffé’s test
4. Fisher’s least significant difference
Linear Regression and Correlation
Correlation measures the relationship between bivariate data.

Bivariate data are data sets in which each subject has two
observations associated with it.

A response variable measures an outcome or result of a study.

An explanatory variable is a variable that we think explains or
causes changes in the response variable.

Linear regression is an approach for modeling the relationship
between a dependent variable (outcome) and one or more
explanatory variables. The case of one explanatory variable is
called simple linear regression.
Linear Regression and Correlation
A scatterplot is a graph of plotted points showing the relationship
between two numerical variables.
Linear Regression and Correlation

Examining a Scatterplot
1. Describe the overall pattern of a scatterplot by the form,
direction, and strength of the relationship.
2. Then look for any striking deviations from the pattern. Identify
each occurrence of an outlier.
Linear Regression and Correlation
Linear Regression
– involves using data to calculate the line that best fits the data
and then using that line to predict scores.
Least-Squares Regression Line
– is the line that minimizes the sum of the squares of the
vertical deviations from each data point to the line.
The equation of the least-squares line is

$\hat{y} = ax + b$

where $a = \dfrac{n\sum xy - (\sum x)(\sum y)}{n\sum x^2 - (\sum x)^2}$ and $b = \bar{y} - a\bar{x}$
Linear Regression and Correlation
Linear Correlation Coefficient
– measures the strength of a linear relationship between two
variables and is denoted by r.

If the linear correlation coefficient r is positive, the
relationship between the variables has a positive correlation. In this
case, if one variable increases, the other variable also tends to
increase.
If r is negative, the linear relationship between the
variables has a negative correlation. In this case, if one variable
increases, the other variable tends to decrease.
Linear Regression and Correlation
Happiness vs Life Expectancy

Country        Happiness    Life Expectancy
Japan          6.8          80.80
South Korea    6.2          74.20
China          6.3          70.40
Taiwan         6.2          76.40
Indonesia      6.6          78.00
Philippines    6.4          69.00
Singapore      6.8          77.60
Vietnam        6.1          69.40
India          6.2          63.00
Bangladesh     5.7          59.50

Source: CHED GenEd 1st Generation Training


Linear Regression and Correlation
Happiness vs Life Expectancy
a = 16.661
b = −33.635

Will the line give accurate predictions?

Correlation coefficient r = 0.82

Predict the life expectancy for the following countries:

                                  Actual LE
a) Zimbabwe: happiness = 4.2      35.40
b) Ghana: happiness = 5.4         57.90
c) Belarus: happiness = 6.1       68.60
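The values of a, b, and r can be reproduced from the ten data points with the standard least-squares sums. The sketch below assumes that a is the slope and b is the intercept of the fitted line (life expectancy = a × happiness + b), which matches the values quoted above; the function name `least_squares` is illustrative.

```python
from math import sqrt

happiness = [6.8, 6.2, 6.3, 6.2, 6.6, 6.4, 6.8, 6.1, 6.2, 5.7]
life_exp = [80.8, 74.2, 70.4, 76.4, 78.0, 69.0, 77.6, 69.4, 63.0, 59.5]

def least_squares(xs, ys):
    """Return slope a, intercept b, and correlation coefficient r."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = sy / n - a * sx / n
    r = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))
    return a, b, r

a, b, r = least_squares(happiness, life_exp)
print(f"a = {a:.3f}, b = {b:.3f}, r = {r:.2f}")

# Predictions for the three countries, next to the actual life expectancy.
for country, x, actual in [("Zimbabwe", 4.2, 35.40),
                           ("Ghana", 5.4, 57.90),
                           ("Belarus", 6.1, 68.60)]:
    print(f"{country}: predicted {a * x + b:.1f}, actual {actual}")
```

Note that the happiness scores for Zimbabwe and Ghana fall below every value used to fit the line (5.7 to 6.8), so those two predictions are extrapolations and deserve extra caution.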
Linear Regression and Correlation
Example. Unemployment and family income are undoubtedly related; we would
assume that as the annual unemployment rate increases, average
annual family income decreases. The table on the next slide gives the annual
unemployment rate and the average annual family income for each
region of the Philippines, from the Philippine Statistics Authority.

a. Use linear regression to predict the average annual family income of
the Philippines if the annual unemployment rate is 6.3%.

b. Use linear regression to predict the annual unemployment rate if the
average annual family income of the Philippines is P267,000.

c. Are the predictions in parts (a) and (b) reliable? Why or why not?
Linear Regression and Correlation
Region                        Annual Unemployment    Ave. Annual Family
                              Rate (%)               Income (P100,000)
NCR                           8.5                    4.25
Cordillera                    4.8                    2.82
I - Ilocos Region             8.4                    2.38
II - Cagayan Valley           3.2                    2.37
III - Central Luzon           7.8                    2.99
IVA - CALABARZON              8.0                    3.12
IVB - MIMAROPA                3.3                    2.22
V - Bicol Region              5.6                    1.87
VI - Western Visayas          5.4                    2.26
VII - Central Visayas         5.9                    2.39
VIII - Eastern Visayas        5.4                    1.97
IX - Zamboanga Peninsula      3.5                    1.90
X - Northern Mindanao         5.6                    2.21
XI - Davao Region             5.8                    2.47
XII - SOCCSKSARGEN            3.5                    1.88
Caraga                        5.7                    1.98
ARMM                          3.5                    1.39
References:
Aufmann et al. (2013). Mathematical Excursions, 3rd ed. Brooks/Cole, Cengage
Learning.
Bluman, A. G. (2012). Elementary statistics: a step by step approach, 8th ed. New
York: McGraw-Hill.
COMAP, Inc. (2013). For all practical purposes: mathematical literacy in
today’s world. New York: W.H. Freeman and Company.
Johnson & Mowry (2012). Mathematics: a practical odyssey. Brooks/Cole,
Cengage Learning.
Lawsky et al. (2014). CK-12 advanced probability and statistics, 2nd ed. CK-12
Foundation.
Nocon, R. & Nocon, E. (2016). Essential mathematics for the modern world.
QC: C & E Publishing, Inc.
Vistro-Yu, C. and Gozon, A. (2016). Statistics: a review (ppt). CHED’s GE First
Generation Training.
