(Student's Handouts) Data Management
Data Management
Student’s Handouts
Introduction to Statistics
Some of the most commonly used statistical treatments are percentages, measures of central
tendency (such as the mean, median and mode), measures of variation (such as range, average
deviation, standard deviation, variance and coefficient of variation) and measures of skewness and
kurtosis.
Definition 3: Inferential Statistics is used to draw conclusions and make predictions based on the
analysis of numeric data. It is a set of methods used to make a generalization, estimate, prediction
or decision.
A pair consisting of one measure of central tendency and one measure of variation can be used to
draw a conclusion; the most commonly used pair is the mean and the standard deviation.
Exercise 4: Classify whether each statement belongs to the area of Descriptive Statistics or
Inferential Statistics.
1. Ninety-two percent of the class are between 16 and 18 years old.
2. Ninety-five percent of the class may pass Basic Statistics.
3. According to the local survey, the top three popular courses are: Psychology (23%),
Hospitality (19%) and Computer-related courses (10%).
4. The normal blood sugar level of a human is 70 mg/dL to 120 mg/dL.
5. Drinking pineapple juice may boost our immune system.
In the study of statistics, two terms are commonly used: population and sample.
Definition 7: Variables are characteristics or values that vary across individuals. They can be
qualitative or quantitative.
Definition 8: Qualitative Variables, also known as categorical variables, represent a
characteristic, class or kind rather than an amount. Some examples of qualitative variables are gender,
religion, nationality, favorite color and birthplace.
Definition 10: A quantitative variable is discrete if it uses natural numbers or counting numbers.
Some examples of discrete variables are the number of students enrolled in STA111, the number of
iPad units in a store and the number of buildings in Metro Manila.
Definition 11: A quantitative variable is continuous if it uses decimals or fractions. Some
examples of continuous variables are height, weight, length, width and speed of a bullet.
Definition 12: Levels of measurement are used to determine the statistical tool that can be used
to describe a set of data. There are four levels of measurement: Nominal, Ordinal, Interval and
Ratio.
Definition 13: The first level is called the Nominal level. In this level, names are assigned to
objects to identify them or to indicate the group or category they belong to. The data cannot be
arranged in an ordering system. Examples of data under this level are religion, nationality or race,
gender, birthplace and course.
Definition 14: The second level is the Ordinal level. In this level, words or numbers are
assigned to objects to represent the rank or order between them. It implies ranking, order or
inequalities. Examples are class rank, contest winners, degree of burn and cancer stages.
Definition 15: The Interval level is the third level of measurement. It refers to quantitative
measurements used to identify and rank, but in this scale the differences between two items can
also be determined. Operations such as multiplication and division are not meaningful because
interval scales do not have a true zero point. An example of interval data is temperature in
degrees Celsius or Fahrenheit.
Definition 16: Lastly, the fourth level of measurement is the Ratio level. It is similar to the
interval scale, but a ratio scale has a true zero point, so operations such as multiplication and
division are meaningful. Examples of data under ratio are income, age, height, weight, area and volume.
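A short worked example may help show why division is not meaningful on an interval scale. It relies on the standard Celsius-to-Kelvin conversion (add 273.15), which is an outside fact and not part of the handout. The same two temperatures give different ratios depending on where the scale places its zero:

```latex
\frac{20\,^{\circ}\mathrm{C}}{10\,^{\circ}\mathrm{C}} = 2
\qquad\text{but}\qquad
\frac{293.15\ \mathrm{K}}{283.15\ \mathrm{K}} \approx 1.035
```

Only the Kelvin figure compares the actual amounts, because the Kelvin scale starts at a true zero. Differences, on the other hand, agree on both scales: 20 − 10 = 10 degrees Celsius and 293.15 − 283.15 = 10 kelvins.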
6. Gender
7. Land area
8. Contest winners
9. Kids' height in cm
10. Athletes' age in years
Definition 18: Sampling is the process of choosing elements, such as persons, objects or groups,
from a known population of interest to be included in a study in order to generate a fair result.
Sampling is done to reduce cost, since it is less expensive to conduct a survey on a sample than on
the whole population. Another advantage of using a sample instead of a population is that data can
be obtained faster. Also, greater scope and accuracy are expected since the volume of work in
encoding and computing is reduced.
There are two types of sampling techniques: probability sampling and non-probability sampling.
Definition 19: Probability sampling or random sampling gives all members of the population a
known and equal chance of being part of the sample. In other words, the selection of one individual
does not affect the chance of anyone else in the population being selected.
Definition 20: Simple random sampling is also called the lottery or fishbowl method. It uses a
scientific calculator or computer program to generate random numbers, or a table of random
numbers, to select the elements to include in the sample.
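As an illustration of the "computer program" route, here is a minimal Python sketch of simple random sampling. The function name `simple_random_sample`, the numbering of students from 1 to 1,000 and the seed are all assumptions made for the example; they do not come from the handout.

```python
import random


def simple_random_sample(population, sample_size, seed=None):
    """Return sample_size distinct elements, chosen without replacement."""
    rng = random.Random(seed)  # the seed only makes the draw repeatable
    return rng.sample(list(population), sample_size)


# Hypothetical sampling frame: students numbered 1 to 1,000.
student_numbers = range(1, 1001)
chosen = simple_random_sample(student_numbers, 200, seed=2024)
print(len(chosen))          # how many students were drawn
print(sorted(chosen)[:5])   # a few of the selected student numbers
```

Because the draw is made without replacement, no student can appear twice in the sample.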
Definition 21: In Systematic Skip Sampling, elements are listed numerically and then every “kth”
element in the list is selected, beginning from a randomly selected starting point.
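The procedure can be sketched in Python as follows. This sketch assumes the skip interval is taken as k = N // n (population size divided by sample size, rounded down) and that the random start falls among the first k elements; the handout does not state how k is chosen, and the function name `systematic_sample` is hypothetical.

```python
import random


def systematic_sample(elements, sample_size, seed=None):
    """Select every k-th element after a random start, with k = N // n."""
    elements = list(elements)
    k = len(elements) // sample_size  # sampling (skip) interval
    if k < 1:
        raise ValueError("sample_size cannot exceed the number of elements")
    start = random.Random(seed).randrange(k)  # random start among the first k
    return elements[start::k][:sample_size]


# Hypothetical numbered list of 1,000 students; k = 1000 // 200 = 5.
student_numbers = list(range(1, 1001))
picked = systematic_sample(student_numbers, 200, seed=7)
print(picked[:4])  # consecutive picks are 5 apart
```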
Definition 22: Stratified Random Sampling is a method where the population is divided into
sub-groups (called strata) based on some well-known characteristics of the population, such as age,
gender or socio-economic status, and a random sample is then taken from each stratum. The selection
of elements is made separately within each stratum, usually by random or systematic
sampling methods.
Remark 23: In stratified random sampling, the number of samples per stratum may be equal or
proportional.
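A minimal Python sketch of stratified random sampling is shown here, drawing a simple random sample separately within each stratum. The student IDs, the 250 students per year level and the name `stratified_sample` are made up for illustration; the equal allocation of 50 per stratum is only one of the two options the remark mentions.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Draw a simple random sample of sizes[name] members from each stratum."""
    rng = random.Random(seed)
    return {
        name: rng.sample(members, sizes[name])
        for name, members in strata.items()
    }


# Hypothetical strata: 250 made-up student IDs per year level.
year_levels = ["First Year", "Second Year", "Third Year", "Fourth Year"]
strata = {
    level: [f"{level}-{i:03d}" for i in range(1, 251)]
    for level in year_levels
}
equal_sizes = {level: 50 for level in year_levels}  # equal allocation
sample = stratified_sample(strata, equal_sizes, seed=1)
print({level: len(members) for level, members in sample.items()})
```

Passing a different `sizes` dictionary gives a proportional allocation instead, without changing the sampling step.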
Example 24: A study is conducted on 1,000 college students of the University of the East. Two
hundred students will be selected to be part of the study. How many samples are needed per year
level using equal distribution?
k = 4 (four groups: First Year, Second Year, Third Year and Fourth Year)
$n_i = \frac{n}{k} = \frac{200}{4} = 50$ students per year level, where $n = 200$ is the total sample needed.
Example 25: A study is conducted on 1,000 college students of the University of the East. Two
hundred students will be selected to be part of the study. The number of students per year level is
presented in a table. How many samples are needed per year level using proportional allocation?
Use the formula $n_i = \frac{n \, N_i}{N}$, where $n_i$ is the number of samples per year level,
$N_i$ is the population of students per year level, $N$ is the total population of students and
$n$ is the total sample needed.
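The two allocation rules can be compared with a short Python sketch. The enrolment figures are hypothetical (chosen only so that they total 1,000), and the function names and the symbol k for the number of strata are assumptions made for this illustration.

```python
def equal_allocation(total_sample, num_strata):
    """Same sample size for every stratum: n_i = n / k."""
    return total_sample / num_strata


def proportional_allocation(stratum_sizes, total_sample):
    """Sample size proportional to stratum size: n_i = n * N_i / N."""
    population = sum(stratum_sizes.values())
    return {
        name: round(total_sample * size / population)
        for name, size in stratum_sizes.items()
    }


# Hypothetical enrolment per year level, totalling 1,000 students.
enrolment = {
    "First Year": 400,
    "Second Year": 300,
    "Third Year": 200,
    "Fourth Year": 100,
}
print(equal_allocation(200, 4))
print(proportional_allocation(enrolment, 200))
```

With these made-up figures the proportional sample sizes are 80, 60, 40 and 20. In general, rounding each $n_i$ to a whole number can make the total differ slightly from $n$, so the result should be checked and adjusted by hand if needed.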
Definition 26: Cluster Sampling is a method where the researcher divides the population into
groups, or clusters. Elements within a cluster are heterogeneous, or dissimilar. Clusters are then
selected at random, and all units in the selected clusters are used as the sample.
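Cluster sampling can be sketched in Python as below. The class sections and names are invented for the example, and `cluster_sample` is a hypothetical helper, not something defined in the handout.

```python
import random


def cluster_sample(clusters, num_clusters, seed=None):
    """Pick whole clusters at random and keep every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)
    units = [unit for name in chosen for unit in clusters[name]]
    return chosen, units


# Hypothetical clusters: class sections and their members.
sections = {
    "Section A": ["Ana", "Ben", "Carla"],
    "Section B": ["Dan", "Ella"],
    "Section C": ["Faye", "Gio", "Hana", "Ivan"],
    "Section D": ["Jo", "Kim", "Leo"],
}
chosen_sections, sample = cluster_sample(sections, 2, seed=3)
print(chosen_sections)
print(sample)
```

The randomness applies to the clusters only; once a section is selected, all of its members enter the sample.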
Definition 27: Unlike probability sampling, non-probability sampling does not give everyone
an equal chance of being selected to be part of the sample. Non-probability sampling procedures
are much less desirable, as they will almost certainly contain sampling biases.
Some of the methods under non-probability sampling are quota, convenience and purposive
sampling.
Descriptive Measures
Definition 28: Measures of Central Tendency are descriptive measures that are used to describe
the center of a set of data arranged numerically. Three different types of “average” will be
discussed: the mean, the median and the mode.
Definition 29: The most commonly used measure of central tendency is the mean. It is also
called the computed average. It is defined as the sum of the values divided by the total number of
items.
Definition 30: The median is the middle value in a set of data. It is the value that divides the
distribution into two equal parts, with one half of the values lower than the median and the other
half higher than the median.
Definition 31: The third measure of central tendency is the mode. It is easily found by inspection.
It is the value in the distribution whose frequency is higher than that of any other value.
Definition 32: A distribution with only one mode is called unimodal, while if it has two modes,
then it is called bimodal. If it has more than two modes, the distribution is called multimodal. The
mode does not exist in a distribution if no value is repeated.
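The three measures and the modality rule above can be sketched in Python. The `scores` list is hypothetical data, not one of the handout's data sets, and the helper names `modes` and `classify` are assumptions. The mode helper follows the handout's convention that no mode exists when no value is repeated.

```python
from collections import Counter
from statistics import mean, median


def modes(values):
    """Return the most frequent value(s); empty if no value is repeated."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


def classify(values):
    """Name the distribution by its number of modes."""
    labels = {0: "no mode", 1: "unimodal", 2: "bimodal"}
    return labels.get(len(modes(values)), "multimodal")


scores = [9, 12, 12, 15, 18, 24]  # hypothetical data
print(mean(scores), median(scores), modes(scores), classify(scores))
```

For this made-up list the mean is 90 / 6 = 15, the median is the average of the two middle values (12 and 15), and 12 is the only repeated value.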
Exercise 33: Determine the mean, median and mode of the given set of data.
The mean is computed if the values are on an interval or ratio scale. The mean is influenced
by outliers that may be at the extremes of the data set. The median is used for data on an ordinal
scale. Unlike the mean, the median is not influenced by outliers at the extremes of the data set.
The mode is practical for nominal data, although in some data sets the mode may not exist or may
not be very meaningful.
Using the measures of central tendency, it seems that the sets are equal (that is, 15). But
obviously, the sets of data are different. For instance, the values of Set A are more dispersed or
scattered than those of Sets B and C. These measures alone are not enough to describe a given set
of data; we need other descriptive measures to further describe a distribution.
Definition 35: Measures of Dispersion or Variability describe the spread or scatter of
the values around the mean.
Definition 36: The range is the difference between the highest and lowest value/observation.
Example 37: Using the data above, the range of Set A is $24 - 9 = 15$.
Definition 38: The average deviation is the average distance of the values from the mean.
The formula is given by $AD = \frac{\sum |x - \bar{x}|}{n}$, where $\bar{x}$ is the mean, $x$ are
the values and $n$ is the number of values.
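A direct Python translation of the average deviation formula might look like this. The list `data` is hypothetical and is not the handout's Set A; the function name is likewise an assumption.

```python
def average_deviation(values):
    """AD = sum of |x - mean| divided by the number of values."""
    n = len(values)
    x_bar = sum(values) / n
    return sum(abs(x - x_bar) for x in values) / n


data = [9, 12, 12, 15, 18, 24]  # hypothetical data with mean 15
print(average_deviation(data))
```

Here the absolute deviations from the mean are 6, 3, 3, 0, 3 and 9, which add up to 24, so AD = 24 / 6 = 4.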
Exercise 39: Compute the average deviation of set A in the data above.
Definition 40: Variance measures how much variability there is in the entire distribution. The
standard deviation is the most commonly used measure of dispersion. It is the positive square
root of the variance. The formulas are as follows:
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ and $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
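The sample variance and standard deviation can be computed step by step in Python, as in the sketch below. The data are hypothetical (not the handout's Set A), and the function names are assumptions. Python's standard `statistics` module also divides by n − 1 in `variance` and `stdev`, so it is used here as a cross-check.

```python
import math
import statistics


def sample_variance(values):
    """s^2 = sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """s = positive square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [9, 12, 12, 15, 18, 24]  # hypothetical data with mean 15
print(sample_variance(data), sample_std(data))
print(statistics.variance(data), statistics.stdev(data))  # cross-check
```

For this list the squared deviations are 36, 9, 9, 0, 9 and 81, which sum to 144, so the variance is 144 / 5 = 28.8.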
Exercise 41: Using the above data, compute the variance and the standard deviation of set A.
Definition 42: In a symmetrical or normal distribution, the mean, median and mode all fall at
the same point; that is, they are equal.
Definition 43: In a positively skewed distribution, the extreme scores are larger, thus the mean
is larger than the median.
Definition 44: In a negatively skewed distribution, the order of the measures of central tendency
would be the opposite of the positively skewed distribution, with the mean being smaller than the
median, which is smaller than the mode.
Definition 45: Skewness measures the degree of asymmetry of a distribution. One of the formulas
for skewness is given by
$Sk = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}} = \frac{3(\bar{x} - md)}{s}$.
Remark 46: When Sk = 0, the distribution is Normal or Symmetrical, when Sk > 0, the distribution
is Positively Skewed and when Sk < 0, the distribution is Negatively Skewed.
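The skewness coefficient and the sign rule of Remark 46 can be sketched in Python. The two data lists are hypothetical, the function names are assumptions, and the standard deviation is taken as the sample standard deviation (n − 1 divisor) to match Definition 40.

```python
from statistics import mean, median, stdev


def pearson_skewness(values):
    """Sk = 3 * (mean - median) / standard deviation."""
    return 3 * (mean(values) - median(values)) / stdev(values)


def describe_skew(sk):
    """Interpret the sign of Sk as in Remark 46."""
    if sk > 0:
        return "positively skewed"
    if sk < 0:
        return "negatively skewed"
    return "symmetrical"


right_tailed = [9, 12, 12, 15, 18, 24]  # hypothetical: mean 15 > median 13.5
balanced = [1, 2, 3, 4, 5]              # hypothetical: mean = median = 3
print(describe_skew(pearson_skewness(right_tailed)))
print(describe_skew(pearson_skewness(balanced)))
```

With real data Sk is rarely exactly 0, so in practice a value close to 0 is usually read as approximately symmetrical.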
Hypothesis Testing
Definition 47: A statistical hypothesis is a conjecture concerning one or more populations whose
veracity can be established using sample data.
Definition 48: Parametric tests are applied to data that are normally distributed. Moreover, it is
assumed that the variables are measured at either the interval or the ratio level.
Definition 49: Nonparametric tests do not require a normal distribution, and the variables of
interest are at the nominal or ordinal level.
Table 50: Below is the summary of some of the different statistical tests.
https://www.google.com/url?sa=i&source=imgres&cd=&cad=rja&uact=8&ved=2ahUKEwi9vdqMq7biAhUXZt4KHQSSD2IQjRx6BAgBEAU&url=http%3A%2F%
2Fmethods.sagepub.com%2Fbook%2Funderstanding-social-science-research%2Fn10.xml&psig=AOvVaw3r5_gFGEBctk29Qmey76r_&ust=1558861859448779
Correlation
Definition 51: Correlation measures the strength of the linear association between two
quantitative variables: the independent variable and the dependent variable. The independent
variables are variables that can be manipulated or controlled while dependent variables are those
that cannot be controlled.
Definition 52: The most commonly used technique for calculating the coefficient of correlation is
the Pearson Product Moment Correlation Coefficient. The formula is given by
$r = \frac{N\sum XY - \sum X \sum Y}{\sqrt{[N\sum X^2 - (\sum X)^2][N\sum Y^2 - (\sum Y)^2]}}$
where 𝑋 = the observed data from the independent variable, 𝑌 = the observed data from the
dependent variable, 𝑁 = sample size and 𝑟 = degree of relationship of 𝑋 and 𝑌.
Remark 53: The correlation coefficient ranges from -1 to +1. A value of -1.00 represents a
perfect negative correlation, while a value of +1.00 represents a perfect positive correlation.
Values close to -1.00 or +1.00 indicate a strong linear relationship. If the value is equal to
0.00, it means that there is no linear relation between the variables.
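The Pearson formula can be evaluated term by term in Python, as in this sketch. The absences and grades are made-up values for seven students and are not the data of any exercise in the handout; `pearson_r` is a hypothetical function name.

```python
import math


def pearson_r(xs, ys):
    """Pearson product moment correlation computed from the raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("r is undefined when either variable is constant")
    return numerator / denominator


# Hypothetical data: absences (X) and final grades (Y) of seven students.
absences = [0, 1, 2, 3, 4, 5, 6]
grades = [95, 93, 90, 86, 84, 80, 75]
r = pearson_r(absences, grades)
print(round(r, 3))
```

Since the made-up grades fall as absences rise, r comes out negative and close to -1, which Remark 53 reads as a strong negative linear relationship.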
Exercise 54: Determine the degree of relationship between the number of absences incurred by a
student and the student's final grade in a Statistics 111 class. The data were obtained in a study
of seven randomly selected students of a Statistics 111 class.