Data Management
LEARNING MATERIAL
The process is actually what statistics is all about. Statistics is an art and science that
deals with the collection, organization, presentation, analysis, and interpretation of data.
Fields of Statistics
∙ Descriptive Statistics - methods concerned with the collection, description, and analysis of a
set of data without drawing conclusions or inferences about a larger set
- the main concern is simply to describe the set of data such that
otherwise obscure information is brought out clearly
- conclusions apply only to the data on hand
∙ Inferential Statistics - methods concerned with making predictions or inferences about a
larger set of data using only the information gathered from a subset of this larger set
- the main concern is not merely to describe but actually predict and make
inferences based on the information gathered
- conclusions are applicable to a larger set of data, of which the data on hand
is only a subset
Definition of Some Basic Statistical Terms
1) Population is a collection of all the elements under consideration in a statistical study.
Example: The researcher would like to determine the average age of patients infected with
dengue fever for the month of October at XYZ Hospital.
Population: the set of all patients infected with dengue fever at XYZ Hospital for the month of October.
2) Sample is a part or subset of the population from which the information is collected.
3) Variable is a characteristic of interest measurable on each and every individual in the
universe, denoted by any capital letter in the English alphabet.
TYPES OF VARIABLES
∙ Qualitative Variable consists of categories or attributes, which have non-numerical
characteristics.
∙ Quantitative Variable consists of numbers representing counts or measurements.
CLASSIFICATION OF QUANTITATIVE VARIABLE
∙ Discrete Quantitative Variable has numerical values that arise from a counting
process.
∙ Continuous Quantitative Variable produces numerical responses that arise from a
measuring process.
4) Data are the different values associated with a variable.
5) Operational definition is the description of some observable event in terms of the specific
process or manner by which it was observed or measured.
6) Parameter is a numerical measurement describing some characteristic of a population.
7) Statistic is a numerical measurement describing some characteristic of a sample.
Example: Identify the population, variable of interest and type of variable in the following:
1) The dean of a certain college would like to determine the average weekly allowance of BS
Computer Science students.
2) From all students enrolled this semester, the Mathematics Department would like to know how
many students like mathematics.
Data Organization and Presentation
Data collected or obtained by any means are called raw data. Data collected can
be classified according to the scale of measurement used.
LEVELS OF MEASUREMENT
Categorical or Qualitative Variables
∙ Nominal - classifies data into distinct categories in which no ranking is implied.
Examples: name, civil status
∙ Ordinal - classifies values into distinct categories in which ranking is implied.
Examples: military rank, job position
Numerical or Quantitative Variables
∙ Interval - is an ordered scale in which the difference between measurements is a
meaningful quantity but does not involve a true zero point.
Examples: IQ score, temperature (in °C)
∙ Ratio - an ordered scale in which the difference between the measurements
involves a true zero point.
Examples: height, width
Methods of Data Collection
∙ Interview Method – this is a personal communication with the individual you want to
interview.
∙ Questionnaire Method – this is done by sending questionnaires to the person from whom
you would like to get the information.
∙ The Registration Method – the data or information needed in the research study are kept
and recorded by government agencies authorized by law to be responsible for the
safekeeping of such information or data.
∙ The Observation Method – the data or information needed are obtained by using
observation procedures. It is a method of obtaining data by seeing, hearing, tasting,
touching, and smelling.
∙ The Experiment Method – the data are gathered by performing a series of experiments
on some controlled and experimental variables. This method of data gathering is used
to find cause-and-effect relationships.
Make use of words, sentences, and paragraphs in presenting the data collected. Always
remember that in descriptive statistics, the data collected are not always in numerical form;
examples are sex, religion, and educational attainment. When there are only a few
numbers in the data, these figures can be presented in textual form, incorporated in
paragraphs of discussion. In particular, if numerical data need to be compared with other data, or
need to be enumerated, the textual form of data presentation may very well serve this purpose.
Example: The Board of Investments (BOI) approved 151 projects in January with a total
product cost of 10.3 billion pesos, more than 12% of the 85 billion target for the whole of 1991. The
amount does not include an additional 15 billion pesos worth of investments already in the pipeline.
Another method of presenting data is the tabular form, a way of classifying related numerical
facts in horizontal arrays called rows and vertical arrays called columns. The space common to a
particular row and column is called a cell. A table consists of the following parts:
(a) The table heading - found above the table. It consists of the title of the table and the table
number. The table heading gives advance information on what the table is all about.
(b) The stub – describes the data found in the rows of the table. It gives the classification into
which the figures fall.
(c) The caption or box heading – gives the designation of the column, or identifies the
figures found in that column.
(d) The body – constitutes the main part of the table and contains figures to be presented.
The most important figures to be compared are put in columns or rows. In tabulating
long columns of figures, a blank space should be left after every five or ten rows. Long
unbroken columns are often confusing and do not allow ease in the comparison of
two numbers or figures. Interpretive figures such as percentages and averages should
be included in the body.
Example:
Strongly Favorable 10
Favorable 11
Slightly Favorable 12
Slightly Unfavorable 14
Unfavorable 22
Strongly Unfavorable 31
TOTAL 100
2.1) Qualitative or Categorical FDT – a frequency distribution table where the data are grouped
according to some qualitative characteristics; data are grouped into non-numerical categories.
Example:
Male 38
Female 62
TOTAL 100
2.2) Quantitative FDT – a frequency distribution table where the data are grouped according to
some numerical or quantitative characteristics.
Example:
7-9 2
10-12 8
13-15 14
16-18 19
19-21 7
TOTAL 50
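The grouping step behind a quantitative FDT can be sketched in Python. This is only an illustration: the class intervals are the ones in the example above, while the function name `build_fdt` and the raw observations are hypothetical and invented for the demonstration. Percentages are printed beside each frequency as an interpretive figure.

```python
def build_fdt(values, class_limits):
    """Count how many values fall in each closed class interval (lower, upper)."""
    counts = {limits: 0 for limits in class_limits}
    for value in values:
        for lower, upper in class_limits:
            if lower <= value <= upper:
                counts[(lower, upper)] += 1
                break  # each value belongs to exactly one class
    return counts


# Class intervals taken from the example table.
class_limits = [(7, 9), (10, 12), (13, 15), (16, 18), (19, 21)]

# Hypothetical raw observations, invented only for illustration.
raw = [8, 11, 14, 14, 17, 20, 16, 12, 15, 18]

fdt = build_fdt(raw, class_limits)
total = sum(fdt.values())
for (lower, upper), frequency in fdt.items():
    print(f"{lower}-{upper}\t{frequency}\t{100 * frequency / total:.0f}%")
print(f"TOTAL\t{total}")
```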
3. Graphical Presentation.
Another method of data presentation is by using graphs, symbols or visual aids. Kinds of
graphs most commonly used in data presentation:
(a) Line Graph – is the simplest graph to construct. The graph makes use of the rectangular
coordinate system, where two perpendicular lines are drawn with the vertical dimension bearing a
reasonable proportion to the horizontal. The points are plotted along the horizontal axis and
against the vertical axis. The plotted points are connected by broken lines. Line graphs are
usually used for the following purposes:
a.1 when data cover a long period of time
a.2 when several series are compared on the same chart
a.3 when emphasis is on the movement rather than on the actual amount
a.4 when trends of frequency distribution are presented
a.5 when a multiple-amount scale is used
a.6 when estimates, forecasts, interpolation, or extrapolation are to be shown
(b) Bar Graph – comparisons of the numerical values of a given item over a period of time may be
made by the use of bars. A vertical bar graph consists of equally spaced vertical rectangles placed on a
common horizontal base line. The rectangles should all be the same width and begin at zero. The heights of
the rectangles are proportional to the magnitudes represented.
Example: The graphs show that first-year college students spend the most on electronic
equipment including computers.
(c) Pie Graph – is used to form a percentage chart or to represent the component
parts of a whole. It is a circle whose area is divided into component parts or sectors.
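As a rough illustration of how a circle is divided into sectors, the short Python sketch below converts frequencies into central angles (each category's share of 360 degrees). The function name `sector_angles` is hypothetical; the counts are the male/female frequencies (38 and 62) from the qualitative FDT example earlier in this material.

```python
def sector_angles(frequencies):
    """Return the central angle, in degrees, of each category's sector."""
    total = sum(frequencies.values())
    return {category: 360 * f / total for category, f in frequencies.items()}


# Counts from the qualitative FDT example (Male 38, Female 62).
counts = {"Male": 38, "Female": 62}
angles = sector_angles(counts)

for category, angle in angles.items():
    share = 100 * counts[category] / sum(counts.values())
    print(f"{category}: {share:.0f}% of the circle, {angle:.1f} degrees")
```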
Example: Super Bowl Snack Foods. This frequency distribution shows the number of pounds of
each snack food eaten during the Super Bowl. Construct a pie graph for the data.
(d) Pictogram or Pictograph – makes use of pictures and symbols to immediately suggest
the nature of the data being shown. It combines the attention-getting quality of
the picture and the accuracy of the bar graph. Appropriate pictures or symbols,
arranged in rows, present the quantities to be compared.
Example: [Pictograph not reproduced. Rows are labeled by year: 1960*, 1970*, 1975*, 1980*, 1990*,
1995, 1996, 1997, 1998, 1999.]
Note: Based on Series 2: Moderate Fertility and Mortality Decline Population Projection
* Censual Year
A measure of central tendency is any single value that is used to identify the “center” or the
typical value of the data set. It is often referred to as the average.
1. THE MEAN
The population mean for a finite population with N elements, denoted by the Greek
letter μ, is computed as
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
The sample mean $\bar{X}$ (read as “X bar”) of n observations is computed as
$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$
The sample mean (a statistic) is an estimate of the unknown population mean (a parameter).
Examples:
1. The numbers of employees at 5 different drug stores are 10, 12, 6, 8, and 4. Find the mean
number of employees for the 5 stores.
2. Scores in the Stat 2 first exam for a sample of 10 students are as follows: 60, 55, 30, 90, 88, 79,
45, 66, 93, and 80. Find the mean.
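The two examples above can be checked with a few lines of Python. This is a minimal sketch of the sample mean formula (sum of the observations divided by their number); the function and variable names are chosen only for illustration.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their count."""
    return sum(values) / len(values)


employees = [10, 12, 6, 8, 4]                       # Example 1
scores = [60, 55, 30, 90, 88, 79, 45, 66, 93, 80]   # Example 2

print(mean(employees))  # 8.0
print(mean(scores))     # 68.6
```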
Definition. The weighted mean is a modification of the usual mean that assigns weights (or
measures of relative importance) to the observations to be averaged. If each observation $X_i$ is
assigned a weight $W_i$, where i = 1, 2, …, n, the weighted mean is given by
$$\bar{X}_W = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}$$
Examples:
Project 25%
The maximum score a student may obtain for each component is 100. Jeffry obtains marks
of 83 for assignments, 72 for the project, 42 for the midterm exam, and 47 for the final exam. Find
his mean mark for the course.
History 1.0
Humanities 1.5
Math 19 2.25
Math 53 3.0
Philosophy 1.0
Math 53 is a 5-unit course and all others are 3-unit courses. Find Alex’s GWA for the semester.
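A minimal Python sketch of the weighted mean, applied to the grades listed above. It assumes that the number of units of each course serves as its weight, which is how the example reads but is stated here as an assumption; the function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# History, Humanities, Math 19, Math 53, Philosophy
grades = [1.0, 1.5, 2.25, 3.0, 1.0]
units = [3, 3, 3, 5, 3]  # assumed weights: Math 53 has 5 units, the rest 3

gwa = weighted_mean(grades, units)
print(f"{gwa:.4f}")  # 1.8971
```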
Characteristics of the Mean
2. THE MEDIAN
The first step in calculating the median, denoted as Md, is to arrange the data in an
array, that is, in increasing or decreasing order.
If n is odd, the median position equals (n+1)/2, and the value of the (n+1)/2 th observation
in the array is taken as the median, i.e.,
$$Md = X_{\left(\frac{n+1}{2}\right)}$$
If n is even, the mean of the two middle values in the array is the median, i.e.,
$$Md = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2}$$
Examples:
1. Given the following heights (in inches): 71, 72, 75, 75, and 67. Find the median.
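The odd and even cases of the median can be sketched in Python as follows. The function name is illustrative. The heights are from the example above; the even-sized data set reuses the ten exam scores from the mean examples.

```python
def median(values):
    """Median of ungrouped data: middle value of the sorted array."""
    array = sorted(values)
    n = len(array)
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th observation (1-based position).
        return array[(n + 1) // 2 - 1]
    # Even n: the mean of the (n / 2)-th and (n / 2 + 1)-th observations.
    return (array[n // 2 - 1] + array[n // 2]) / 2


heights = [71, 72, 75, 75, 67]                      # n = 5 (odd)
scores = [60, 55, 30, 90, 88, 79, 45, 66, 93, 80]   # n = 10 (even)

print(median(heights))  # 72
print(median(scores))   # 72.5
```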
3. THE MODE
The mode is determined by counting the frequency of each value and finding the value
with the highest frequency of occurrence.
1. 2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2
2. 2, 5, 5, 2, 2, 5, 1, 3, 5, 4, 2, 5, 5, 2, 2, 5, 5, 2, 2, 1
3. blue, yellow, red, pink, blue, blue, blue, pink, red, yellow, blue, white, white, white
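Counting frequencies to find the mode can be sketched in Python with `collections.Counter`. The function name `modes` is illustrative. One assumption is made: when every distinct value occurs equally often (and there is more than one distinct value), the data set is treated as having no mode.

```python
from collections import Counter


def modes(values):
    """Return the list of values with the highest frequency (may be empty)."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # Assumption: if all distinct values are equally frequent, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


set1 = [2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2]
set2 = [2, 5, 5, 2, 2, 5, 1, 3, 5, 4, 2, 5, 5, 2, 2, 5, 5, 2, 2, 1]
set3 = ["blue", "yellow", "red", "pink", "blue", "blue", "blue", "pink",
        "red", "yellow", "blue", "white", "white", "white"]

print(modes(set1))  # [2]        (unimodal)
print(modes(set2))  # [2, 5]     (bimodal)
print(modes(set3))  # ['blue']   (the mode also works for qualitative data)
```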
Characteristics of the Mode:
1. It does not always exist; and if it does, it may not be unique. A data set is said to be unimodal
if there is only one mode, bimodal if there are two modes, trimodal if there are three
modes, and so on.
2. It is not affected by extreme values.
3. The mode can be used for qualitative as well as quantitative data.
MEASURES OF LOCATION
Measures of Location (or fractiles/quantiles) are values below which a specified fraction
or percentage of the observations in a given set must fall.
1. PERCENTILES
- are values that divide a set of observations in an array into 100 equal parts.
Thus, P1, read as first percentile, is the value below which 1% of the values fall.
P2, read as second percentile, is the value below which 2% of the values fall.
∙
∙
∙
P99, read as ninety-ninth percentile, is the value below which 99% of the values fall.
$$P_i = \text{the value of the } \left[\frac{i(n+1)}{100}\right]^{\text{th}} \text{ observation in the array} = X_{\left(\frac{i(n+1)}{100}\right)}$$
2. DECILES
- are values that divide the array into 10 equal parts. Thus, D1, read as
first decile, is the value below which 10% of the values fall. D2, read as
second decile, is the value below which 20% of the values fall.
∙
∙
∙
D9, read as ninth decile, is the value below which 90% of the values fall.
* ith Decile of Ungrouped Data
$$D_i = \text{the value of the } \left[\frac{i(n+1)}{10}\right]^{\text{th}} \text{ observation in the array} = X_{\left(\frac{i(n+1)}{10}\right)}$$
3. QUARTILES
- are values that divide the array into 4 equal parts. Thus, Q1, read as first
quartile, is the value below which 25% of the values fall. Q2, read as
second quartile, is the value below which 50% of the values fall. Q3, read
as third quartile, is the value below which 75% of the values fall.
* ith Quartile of Ungrouped Data
$$Q_i = \text{the value of the } \left[\frac{i(n+1)}{4}\right]^{\text{th}} \text{ observation in the array} = X_{\left(\frac{i(n+1)}{4}\right)}$$
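Percentiles, deciles, and quartiles of ungrouped data share one position rule, i(n+1)/k with k = 100, 10, or 4. The Python sketch below implements that rule in a single hypothetical function, `fractile`. Two assumptions are not stated in the text and are marked in the comments: when the position is not a whole number, the value is linearly interpolated between the two neighboring observations, and positions outside 1 to n are clamped to the ends of the array. The data are the ten exam scores from the mean examples.

```python
def fractile(values, i, parts):
    """Value at position i * (n + 1) / parts of the sorted array (1-based)."""
    array = sorted(values)
    n = len(array)
    position = i * (n + 1) / parts
    # Assumption: clamp positions that fall outside the array.
    position = min(max(position, 1), n)
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part of the position
    if fraction == 0:
        return array[lower - 1]
    # Assumption: interpolate linearly between adjacent observations.
    return array[lower - 1] + fraction * (array[lower] - array[lower - 1])


scores = [60, 55, 30, 90, 88, 79, 45, 66, 93, 80]

q1 = fractile(scores, 1, 4)     # first quartile
q2 = fractile(scores, 2, 4)     # second quartile
q3 = fractile(scores, 3, 4)     # third quartile
d5 = fractile(scores, 5, 10)    # fifth decile
p50 = fractile(scores, 50, 100) # fiftieth percentile

print(q1, q2, q3)  # 52.5 72.5 88.5
print(d5, p50)     # 72.5 72.5
```

Under these assumptions Q2, D5, and P50 coincide, since all three sit at position (n+1)/2.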
MEASURES OF VARIATION OR DISPERSION
- indicate the extent to which individual items in a series are scattered about an average.
- are used to determine the extent of the scatter so that steps may be taken to control the
existing variation
1. THE RANGE
Definition: The range of a set of data is the difference between the largest and the smallest values
in the set. In a frequency distribution, the range is approximated by getting the
difference between the upper class limit of the highest class interval and the lower
class limit of the lowest class interval.
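Both forms of the range can be sketched in Python. The function names are illustrative. The ungrouped data are the ten exam scores from the mean examples, and the class limits are those of the quantitative FDT example earlier in this material.

```python
def data_range(values):
    """Range of ungrouped data: largest value minus smallest value."""
    return max(values) - min(values)


def grouped_range(class_limits):
    """Approximate range from (lower, upper) class limits of an FDT."""
    lowest_lower = min(lower for lower, _ in class_limits)
    highest_upper = max(upper for _, upper in class_limits)
    return highest_upper - lowest_lower


scores = [60, 55, 30, 90, 88, 79, 45, 66, 93, 80]
class_limits = [(7, 9), (10, 12), (13, 15), (16, 18), (19, 21)]

print(data_range(scores))           # 63
print(grouped_range(class_limits))  # 14
```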
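2. THE STANDARD DEVIATION AND THE VARIANCE
For a finite population of size N, the population standard deviation is
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$
For a sample of size n, the sample variance is
$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$
And the sample standard deviation is
$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$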
Example:
1) A comparison of coffee prices at 4 randomly selected grocery stores in Imus City showed
increases from the previous month of 52, 55, 57 and 60 pesos for a 200-gram jar. Find the
standard deviation of this random sample of price increases.
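The coffee-price example can be worked through with a short Python sketch of the sample variance and sample standard deviation (divisor n − 1). The function names are illustrative.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


increases = [52, 55, 57, 60]  # price increases in pesos

# Mean = 56; squared deviations = 16, 1, 1, 16; their sum = 34.
print(round(sample_variance(increases), 2))  # 11.33
print(round(sample_std(increases), 2))       # 3.37
```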
3. If each observation of a set of data is transformed to a new set by the addition (or subtraction)
of a constant c, the standard deviation of the new data set is the same as the standard
deviation of the original data set.
4. If a set of data is transformed to a new set by multiplying (or dividing) each observation by a
constant c, the standard deviation of the new data set is equal to the standard deviation of the
original data set multiplied (or divided) by |c|.
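Properties 3 and 4 can be checked numerically. The sketch below uses `statistics.stdev` from the Python standard library (sample standard deviation) on the coffee-price increases from the example above; the constant c = 10 is an arbitrary positive value chosen for illustration.

```python
import math
import statistics

data = [52, 55, 57, 60]
c = 10  # arbitrary positive constant, chosen only for illustration

s = statistics.stdev(data)
s_shifted = statistics.stdev([x + c for x in data])  # add c to each value
s_scaled = statistics.stdev([x * c for x in data])   # multiply each value by c

# Property 3: adding a constant leaves the standard deviation unchanged.
print(math.isclose(s_shifted, s))     # True
# Property 4: multiplying by c multiplies the standard deviation by c.
print(math.isclose(s_scaled, c * s))  # True
```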
3. THE COEFFICIENT OF VARIATION
Definition: The coefficient of variation, CV, is the ratio of the standard deviation to the mean and
is usually expressed in percentage. It is computed as
$$CV = \frac{\sigma}{\mu} \times 100\%$$
Mean S.D.
Solution:
$$CV_{89-91} = \frac{1.84}{22.4} \times 100\% = 8.21\%$$
$$CV_{92-94} = \frac{1.16}{26.4} \times 100\% = 4.39\%$$
Therefore, the period 1992-1994 is more stable.
2. Two of the quality criteria in processing butter cookies are the weight and color development in
the final stage of oven browning. Individual pieces of cookies are scanned by a
spectrophotometer calibrated to reflect yellow-brown light. The readout is expressed in percent
of a standard yellow-brown reference plate and a value of 41 is considered optimal (golden
yellow). The cookies were also weighed in grams at this stage. The means and standard
deviations of 30 sample cookies are presented below.
          Mean    S.D.
Color     41.1    10
Weight    17.7    3.2
Solution:
$$CV_{color} = \frac{10}{41.1} \times 100\% = 24.33\%$$
$$CV_{weight} = \frac{3.2}{17.7} \times 100\% = 18.08\%$$
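The cookie example can be reproduced with a small Python sketch of the coefficient of variation (standard deviation divided by the mean, times 100%). The function name is illustrative; the means and standard deviations are the values that appear in the solution above.

```python
def coefficient_of_variation(std_dev, mean):
    """Coefficient of variation expressed as a percentage."""
    return std_dev / mean * 100


cv_color = coefficient_of_variation(10, 41.1)
cv_weight = coefficient_of_variation(3.2, 17.7)

print(f"{cv_color:.2f}%")   # 24.33%
print(f"{cv_weight:.2f}%")  # 18.08%
```

Because the CV has no units, color (in percent of a reference plate) and weight (in grams) can be compared directly; here color shows the larger relative variation.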
Definition. A measure of skewness shows the degree of asymmetry, or departure from symmetry
of a distribution. It indicates not only the amount of skewness but also the direction.
• most skewed curves encountered in the social sciences are skewed to the right.
2. Negatively Skewed or Skewed to the Left
• only rarely do we find curves skewed to the left, and even more rarely do we find data
characteristically skewed to the left
Coefficient of Skewness
$$Sk = \frac{3(\bar{X} - Md)}{s}$$
Where: $\bar{X}$ = mean, Md = median, s = standard deviation
Remarks:
1. Since the mode is frequently only an approximation, formula 2 is preferred.
2. Interpretation of the measures of skewness:
Sk = 0: symmetric since $\bar{X}$ = Md = Mo
Example: Suppose the following descriptive measures were obtained from the pre-final exam
scores of GNED03 students: mean = 74.1, median = 75, mode = 84, and standard dev
= 11.25. Compute for the coefficient of skewness.
Solution:
$$Sk = \frac{3(74.1 - 75)}{11.25} = -0.24$$
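The computation can be sketched in Python. It assumes the median-based form of the coefficient, Sk = 3(mean − median)/s, which matches the symbols listed under "Where" (mean, median, standard deviation); the function name is illustrative.

```python
def skewness_coefficient(mean, median, std_dev):
    """Median-based coefficient of skewness: 3 * (mean - median) / s."""
    return 3 * (mean - median) / std_dev


# Descriptive measures from the pre-final exam example.
sk = skewness_coefficient(mean=74.1, median=75, std_dev=11.25)
print(round(sk, 2))  # -0.24

# A negative coefficient indicates a distribution skewed to the left.
```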