
Data Management



LEARNING MATERIAL

Rex Saño | GNED 03


Introduction to Data Management

Data Management is the development, execution and supervision of plans, policies,
programs, and practices that control, protect, deliver, and enhance the value of data and
information assets.

The process is actually what statistics is all about. Statistics is an art and science that
deals with the collection, organization, presentation, analysis, and interpretation of data.

General Uses of Statistics


a. Statistics aids in decision making

b. Statistics summarizes data for public use

Fields of Statistics

∙ Descriptive Statistics - methods concerned with the collection, description, and analysis of a
set of data without drawing conclusions or inferences about a larger set
- the main concern is simply to describe the set of data such that
otherwise obscure information is brought out clearly
- conclusions apply only to the data on hand
∙ Inferential Statistics - methods concerned with making predictions or inferences about a
larger set of data using only the information gathered from a subset of this larger set
- the main concern is not merely to describe but to predict and make
inferences based on the information gathered
- conclusions are applicable to a larger set of data, of which the data on hand
is only a subset

Descriptive Statistics vs. Inferential Statistics


Descriptive
∙ A bowler wants to find his bowling average for the past 12 games.
∙ A housewife wants to determine the average weekly amount she spent on groceries in the past 3 months.

Inferential
∙ A bowler wants to estimate his chance of winning a game based on his current season averages and the averages of his opponents.
∙ A housewife would like to predict, based on last year’s grocery bills, the average weekly amount she will spend on groceries for this year.

Definition of Some Basic Statistical Terms
1) Population is a collection of all the elements under consideration in a statistical study.
Example: The researcher would like to determine the average age of patients infected with
dengue fever for the month of October at XYZ Hospital.
Population: the set of all patients infected with dengue fever for the month of October at XYZ Hospital.
2) Sample is a part or subset of the population from which the information is collected.
3) Variable is a characteristic of interest measurable on each and every individual in the
universe, denoted by any capital letter of the English alphabet.
TYPES OF VARIABLES
∙ Qualitative Variable consists of categories or attributes, which have non-numerical
characteristics.
∙ Quantitative Variable consists of numbers representing counts or measurements.
CLASSIFICATION OF QUANTITATIVE VARIABLE
∙ Discrete Quantitative Variable has numerical values that arise from a counting
process.
∙ Continuous Quantitative Variable produces numerical responses that arise from a
measuring process.
4) Data are the different values associated with a variable.
5) Operational definition is the description of some observable event in terms of the specific
process or manner by which it was observed or measured.
6) Parameter is a numerical measurement describing some characteristic of a population.
7) Statistic is a numerical measurement describing some characteristic of a sample.

8) Survey is often conducted to gather opinions or feedback about a variety of topics.

Census Survey, most often simply referred to as census, is conducted by gathering
information from the entire population.

Sampling Survey, most often simply referred to as survey, is conducted by gathering
information only from part of the population.

Example: Identify the population, variable of interest and type of variable in the following:

1) The dean of a certain college would like to determine the average weekly allowance of BS
Computer Science students.

2) From all students enrolled this semester, the Mathematics Department would like to know how
many students like mathematics.

Data Organization and Presentation

Data collected or obtained in whatever manner are called raw data. Data collected can
be classified according to the scale of measurement used.

LEVELS OF MEASUREMENT
Categorical or Qualitative Variables

∙ Nominal - classifies data into distinct categories in which no ranking is implied.
Examples: name, civil status
∙ Ordinal - classifies values into distinct categories in which ranking is implied.
Examples: military rank, job position

Numerical or Quantitative Variables

∙ Interval - an ordered scale in which the difference between measurements is a
meaningful quantity but does not involve a true zero point.
Examples: IQ score, temperature (in °C)
∙ Ratio - an ordered scale in which the difference between the measurements
involves a true zero point.
Examples: height, width

Methods of Data Collection

∙ Interview Method – this is a personal communication with the individual you want to
interview.

∙ Questionnaire Method – this is done by sending questionnaires to the person from whom
you would like to get the information.

A questionnaire is a list of well-planned questions written on paper, which can be
either personally administered or mailed by the researcher to the respondents.
TWO CATEGORIES OF SURVEY QUESTIONS
1) OPEN-ENDED QUESTIONS – allow a free response.
2) CLOSED QUESTIONS – allow only a fixed response. The respondent has choices
for his/her answers.

FEATURES OF A GOOD QUESTIONNAIRE:


✔ Make the questions short and clear.
✔ Avoid leading questions.
✔ Always state the precise units which you require respondents to use.
✔ Limit questions to essential information.
✔ Questions should be carefully planned and arranged in chronological order.

∙ The Registration Method – the data or information needed in the research study are properly
kept and recorded by government agencies authorized by law to be responsible for the
safekeeping of such information or data.

∙ The Observation Method – the data or information needed are obtained by using
observation procedures. It is a method of obtaining data by seeing, hearing, tasting,
touching and smelling.

∙ The Experiment Method – this method of data gathering is carried out by
performing a series of experiments on some controlled and experimental variables. This
method of data gathering is used to find cause-and-effect relationships.

METHODS OF DATA PRESENTATION

1. Textual form of presentation.

Makes use of words, sentences and paragraphs in presenting the data collected. Always
remember that in descriptive statistics, data collected are not always in
numerical form; examples are sex, religion and educational attainment. Where there are only a few
numbers in the data, these figures can be presented in textual form. They are incorporated in
paragraphs of discussion. Specifically, if numerical data need to be compared with other data, or
need to be enumerated, the textual form of data presentation may very well serve this purpose.

Example: The Board of Investments (BOI) approved 151 projects in January with a total
product cost of 10.3 billion pesos, more than 12% of the 85 billion target for the whole of 1991. The
amount does not include an additional 15 billion pesos worth of investments already in the pipeline.

2. Tabular form of presentation.

Another method of presenting data is the tabular form, a way of classifying related numerical
facts in horizontal arrays called rows and vertical arrays called columns. The space common to a
particular row and column is called a cell. A table consists of the following parts:

(a) The table heading – found above the table. It consists of the title of the table and the table
number. The table heading gives advance information on what the table is all about.

(b) The stub – describes the data found in the rows of the table. It gives the classification into
which the figures fall.

(c) The caption or box heading – gives the designation of the column, or identifies the
figures found in that column.

(d) The body – constitutes the main part of the table and contains the figures to be presented.
The most important figures to be compared are put in columns or rows. In tabulating
long columns of figures, a space should be left after every five or ten rows. Long
unbroken columns are oftentimes confusing and do not allow ease in the comparison of
two numbers or figures. Interpretive figures such as percentages and averages should
be included in the body.

Example:

Table 1: Frequency Distribution of Staff Perception of the Leadership Behavior of the Administrator

Perception of Leadership Behavior    Frequency
Strongly Favorable                   10
Favorable                            11
Slightly Favorable                   12
Slightly Unfavorable                 14
Unfavorable                          22
Strongly Unfavorable                 31
TOTAL                                100

2.1) Qualitative or Categorical FDT – a frequency distribution table where the data are grouped
according to some qualitative characteristics; data are grouped into non-numerical categories.
Example:

TABLE 2: Frequency Distribution of the Gender of Respondents of a Survey

Gender of Respondents    Number of Respondents
Male                     38
Female                   62
TOTAL                    100

2.2) Quantitative FDT – a frequency distribution table where the data are grouped according to
some numerical or quantitative characteristics.

Example:

TABLE 3: Frequency Distribution for the Weights of 50 Pieces of Luggage

WEIGHT    FREQUENCY
7-9       2
10-12     8
13-15     14
16-18     19
19-21     7
TOTAL     50
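
Both kinds of frequency distribution table amount to a tally. The sketch below is one possible way to build them in Python; the function names and the raw `responses` and `weights` lists are hypothetical and invented for illustration (they are not the data behind Tables 2 and 3). Only the class intervals are taken from Table 3.

```python
from collections import Counter

def categorical_fdt(values):
    """Tally how many times each category occurs."""
    return dict(Counter(values))

def quantitative_fdt(values, classes):
    """Count the values falling in each (lower, upper) class interval, limits inclusive."""
    table = {f"{lo}-{hi}": 0 for lo, hi in classes}
    for v in values:
        for lo, hi in classes:
            if lo <= v <= hi:
                table[f"{lo}-{hi}"] += 1
                break  # each value belongs to at most one class
    return table

# Hypothetical raw data, made up for this illustration only.
responses = ["Male", "Female", "Female", "Male", "Female"]
weights = [8, 10, 12, 13, 14, 15, 16, 17, 18, 20]

# Class intervals as used in Table 3.
classes = [(7, 9), (10, 12), (13, 15), (16, 18), (19, 21)]

gender_table = categorical_fdt(responses)
weight_table = quantitative_fdt(weights, classes)

for label, freq in weight_table.items():
    print(f"{label:<8}{freq}")
print(f"{'TOTAL':<8}{sum(weight_table.values())}")
```

The frequencies in each table should always add up to the number of observations, which is a quick check on the tally.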

3. Graphical Presentation.

Another method of data presentation is by using graphs, symbols or visual aids. Kinds of
graphs most commonly used in data presentation:

(a) Line Graph – is the simplest graph to construct. The graph makes use of the rectangular
coordinate system, where two perpendicular lines are drawn with the vertical dimension bearing a
reasonable proportion to the horizontal. The points are plotted along the horizontal axis and
against the vertical axis. The plotted points are connected by line segments. Line graphs are
usually used for the following purposes:
a.1 when data cover a long period of time
a.2 when several series are compared on the same chart
a.3 when emphasis is on the movement rather than on the actual amount
a.4 when trends of frequency distribution are presented
a.5 when a multiple-amount scale is used
a.6 when estimates, forecasts, interpolation or extrapolation are to be shown
Example: [Figure: sample line graph]

(b) Bar Graph – comparisons of numerical values of a given item over a period of time may be
made by the use of bars. Vertical bars consist of equally spaced vertical rectangles placed on a
common horizontal base line. They should all be the same width and begin at zero. The heights of
the rectangles are proportional to the magnitudes represented.

Example: The graphs show that first-year college students spend the most on electronic
equipment including computers.

(c) Pie Graph – is used to form a percentage chart or when we want to represent the component
parts of a whole. This is a circle whose area is divided into component parts or sectors.

Example: Super Bowl Snack Foods. This frequency distribution shows the number of pounds of
each snack food eaten during the Super Bowl. Construct a pie graph for the data.
(d) Pictogram or Pictograph – makes use of pictures and symbols to immediately suggest
the nature of the data being shown. It combines the attention-getting quality of
the picture and the accuracy of the bar graph. Appropriate pictures or symbols,
arranged in rows, present the quantities to be compared.

Example:

Growth Pattern of Philippine Population: 1960 – 2000

[Figure: pictograph with one row per year: 1960*, 1970*, 1975*, 1980*, 1990*, 1995, 1996, 1997, 1998, 1999, 2000]

Note: Based on Series 2: Moderate Fertility and Mortality Decline Population Projection

* Censual Year

MEASURES OF CENTRAL TENDENCY

A measure of central tendency is any single value that is used to identify the “center” or the
typical value of the data set. It is often referred to as the average.

1. THE MEAN

- the most common average
- the sum of all values of the observations divided by the number of observations
- simply referred to as the mean

Arithmetic Mean for Ungrouped Data

The population mean for a finite population with N elements, denoted by the Greek
letter $\mu$ (mu), is computed as

$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

The sample mean $\bar{X}$ (read as “X bar”) of n observations is computed as

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$$

The sample mean (a statistic) is an estimate of the unknown population mean (a parameter).

Examples:

1. The number of employees at 5 different drug stores are 10, 12, 6, 8, and 4. Find the mean
number of employees for the 5 stores.

2. Scores in the Stat 2 first exam for a sample of 10 students are as follows: 60, 55, 30, 90, 88, 79,
45, 66, 93, and 80. Find the mean.
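
As a check on the two examples, the mean can be computed directly from its definition (sum of the observations divided by their number). This is a minimal sketch; the function and variable names are illustrative only.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)

employees = [10, 12, 6, 8, 4]                       # Example 1
scores = [60, 55, 30, 90, 88, 79, 45, 66, 93, 80]   # Example 2

mean_employees = mean(employees)   # 40 / 5
mean_scores = mean(scores)         # 686 / 10

print(mean_employees)
print(mean_scores)
```

The five stores average 8 employees, and the ten exam scores average 68.6.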

Definition. The weighted mean is a modification of the usual mean that assigns weights (or
measures of relative importance) to the observations to be averaged. If each observation $X_i$ is
assigned a weight $W_i$, where i = 1, 2, …, n, the weighted mean is given by

$$\bar{X}_W = \frac{\sum_{i=1}^{n} W_i X_i}{\sum_{i=1}^{n} W_i}$$

Examples:

1. Suppose a teacher assigns the following weights to the various course requirements:

Assignment      15%
Project         25%
Midterm Exam    20%
Final Exam      40%

The maximum score a student may obtain for each component is 100. Jeffry obtains marks
of 83 for assignments, 72 for the project, 42 for the midterm exam, and 47 for the final exam. Find
his mean mark for the course.

2. Alex’s grades for the second semester AY 1996-1997 are as follows:

History       1.0
Humanities    1.5
Math 19       2.25
Math 53       3.0
Philosophy    1.0

Math 53 is a 5-unit course and all others are 3-unit courses. Find Alex’s GWA for the semester.
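
Both examples apply the same weighted-mean formula: the percentage weights play the role of $W_i$ in the first, and the course units in the second. The sketch below is one way to carry out the arithmetic; the function and variable names are illustrative only.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("each observation needs exactly one weight")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Example 1: assignment, project, midterm exam, final exam
marks = [83, 72, 42, 47]
course_weights = [15, 25, 20, 40]          # percentages, summing to 100
jeffry_mark = weighted_mean(marks, course_weights)

# Example 2: History, Humanities, Math 19, Math 53, Philosophy
grades = [1.0, 1.5, 2.25, 3.0, 1.0]
units = [3, 3, 3, 5, 3]                    # Math 53 carries 5 units
alex_gwa = weighted_mean(grades, units)

print(round(jeffry_mark, 2))
print(round(alex_gwa, 2))
```

Jeffry's mean mark is 5765/100 = 57.65, and Alex's GWA is 32.25/17, or about 1.90.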

Characteristics of the Mean

1. It is the most familiar measure used, and it employs all available information.
2. It is affected by the value of every observation. In particular, it is strongly influenced by
extreme values.
3. Since the mean is a calculated number, it may not be an actual number in the data set.
4. It possesses two mathematical properties that will prove to be important in subsequent
analyses.
i) The sum of the deviations of the values from the mean is zero.
ii) The sum of the squared deviations is minimum when the deviations are taken from
the mean.
5. a. If a constant c is added to (subtracted from) all observations, the mean of the new
observations will increase (decrease) by the same amount c.
b. If all observations are multiplied or divided by a constant, the new observations will have
a mean equal to the original mean multiplied or divided by that same constant.
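
Properties 4 and 5 can be checked numerically on a small data set. The sketch below reuses the drug-store data (10, 12, 6, 8, 4); the constants 3 and 2 and the alternative centers 7 and 9 are arbitrary choices made for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

data = [10, 12, 6, 8, 4]
m = mean(data)

# Property 4(i): deviations from the mean sum to zero.
deviation_sum = sum(x - m for x in data)

# Property 4(ii): squared deviations are smallest when taken from the mean.
def sum_squared_deviations(center):
    return sum((x - center) ** 2 for x in data)

ss_at_mean = sum_squared_deviations(m)
ss_elsewhere = [sum_squared_deviations(c) for c in (7, 9)]

# Property 5: adding a constant shifts the mean; multiplying scales it.
shifted_mean = mean([x + 3 for x in data])
scaled_mean = mean([x * 2 for x in data])

print(deviation_sum, ss_at_mean, ss_elsewhere, shifted_mean, scaled_mean)
```

Here the mean is 8, so adding 3 to every value gives a mean of 11 and doubling every value gives a mean of 16.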

2. THE MEDIAN

- the positional middle of the arrayed data
- in an array, one half of the values precede the median and one half follow it

Median for Ungrouped Data

The first step in calculating the median, denoted as Md, is to arrange the data in an
array. Let $X_{(i)}$ be the ith observation in the array, i = 1, 2, …, n.

If n is odd, the median position equals (n+1)/2, and the value of the (n+1)/2 th observation
in the array is taken as the median, i.e.,

$$Md = X_{\left(\frac{n+1}{2}\right)}$$

If n is even, the mean of the two middle values in the array is the median, i.e.,

$$Md = \frac{X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)}}{2}$$

Examples:

1. Given the following heights (in inches): 71, 72, 75, 75, and 67. Find the median height.

2. Given the following scores: 1, 7, 3, 3, 6, 5, 4, 3. Find the median score.
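
The odd and even cases of the median rule can be sketched as follows. The data must be arranged in an array (sorted) first; note that Python lists are indexed from 0, so position (n+1)/2 in the array corresponds to index n // 2. The function name is illustrative only.

```python
def median(values):
    """Positional middle of the sorted data."""
    if not values:
        raise ValueError("the median of an empty data set is undefined")
    arr = sorted(values)          # arrange the data in an array
    n = len(arr)
    mid = n // 2
    if n % 2 == 1:
        return arr[mid]           # the (n+1)/2 th observation (1-based)
    # mean of the (n/2)th and (n/2 + 1)th observations (1-based)
    return (arr[mid - 1] + arr[mid]) / 2

heights = [71, 72, 75, 75, 67]        # n = 5, odd
scores = [1, 7, 3, 3, 6, 5, 4, 3]     # n = 8, even

median_height = median(heights)
median_score = median(scores)

print(median_height, median_score)
```

The sorted heights are 67, 71, 72, 75, 75, so the median is 72 inches. The sorted scores are 1, 3, 3, 3, 4, 5, 6, 7, so the median is (3 + 4)/2 = 3.5.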


Characteristics of the Median:

1. The median is a position measure.


2. The median is affected by the position of each item in the series but not by the value of
each item. This means that extreme values affect the median less than the arithmetic
mean.

3. THE MODE

− the observed value that occurs most frequently


− locates the point where the observation values occur with the greatest density
− generally a less popular measure than the mean or the median

Mode for Ungrouped Data

The mode is determined by counting the frequency of each value and finding the value
with the highest frequency of occurrence.

Example. Determine the mode in the given data sets:

1. 2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2
2. 2, 5, 5, 2, 2, 5, 1, 3, 5, 4, 2, 5, 5, 2, 2, 5, 5, 2, 2, 1
3. blue, yellow, red, pink, blue, blue, blue, pink, red, yellow, blue, white, white, white
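
Counting frequencies by hand is error-prone for longer lists, so the three data sets can be tallied with `collections.Counter`. This is a minimal sketch; the helper name `modes` is illustrative, and it returns every value tied for the highest frequency so that bimodal data are reported in full.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

set1 = [2, 5, 2, 3, 5, 2, 1, 4, 2, 2, 2, 1, 2, 2, 2, 3, 2, 2, 2, 2]
set2 = [2, 5, 5, 2, 2, 5, 1, 3, 5, 4, 2, 5, 5, 2, 2, 5, 5, 2, 2, 1]
set3 = ["blue", "yellow", "red", "pink", "blue", "blue", "blue", "pink",
        "red", "yellow", "blue", "white", "white", "white"]

modes1 = modes(set1)
modes2 = modes(set2)
modes3 = modes(set3)

print(modes1, modes2, modes3)
```

The first set has the single mode 2. In the second set, 2 and 5 each occur eight times, so both are modes. In the third set the mode is blue, which shows the mode being used on qualitative data.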

Characteristics of the Mode:

1. It does not always exist; and if it does, it may not be unique. A data set is said to be unimodal
if there is only one mode, bimodal if there are two modes, trimodal if there are three
modes, and so on.
2. It is not affected by extreme values.
3. The mode can be used for qualitative as well as quantitative data.

MEASURES OF LOCATION
Measures of Location (or fractiles/quantiles) are values below which a specified fraction
or percentage of the observations in a given set must fall.

1. PERCENTILES

- are values that divide a set of observations in an array into 100 equal parts.

Thus, P1, read as first percentile, is the value below which 1% of the values fall.

P2, read as second percentile, is the value below which 2% of the values fall.




P99, read as ninety-ninth percentile, is the value below which 99% of the values fall.

* ith Percentile of Ungrouped Data

$$P_i = \text{the value of the } \left[\frac{i(n+1)}{100}\right]\text{th observation in the array} = X_{\left(\frac{i(n+1)}{100}\right)}$$


2. DECILES

- are values that divide the array into 10 equal parts.

Thus, D1, read as first decile, is the value below which 10% of the values fall.

D2, read as second decile, is the value below which 20% of the values fall.




D9, read as ninth decile, is the value below which 90% of the values fall.
* ith Decile of Ungrouped Data

$$D_i = \text{the value of the } \left[\frac{i(n+1)}{10}\right]\text{th observation in the array} = X_{\left(\frac{i(n+1)}{10}\right)}$$


3. QUARTILES

- are values that divide the array into 4 equal parts.

Thus, Q1, read as first quartile, is the value below which 25% of the values fall.

Q2, read as second quartile, is the value below which 50% of the values fall.

Q3, read as third quartile, is the value below which 75% of the values fall.
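
Percentiles, deciles and quartiles share one positional rule: the ith of k equal parts sits at position i(n+1)/k in the array, with k = 100, 10 or 4. The sketch below applies that rule to the ten exam scores from the mean example. The text does not say what to do when the position is not a whole number; linear interpolation between the two neighboring observations is assumed here, and the function name `fractile` is illustrative only.

```python
def fractile(values, i, parts):
    """Value at 1-based position i(n+1)/parts in the sorted data.

    Assumption: a fractional position is resolved by linear
    interpolation between the two neighboring observations.
    """
    arr = sorted(values)
    n = len(arr)
    position = i * (n + 1) / parts
    lower = int(position)              # whole part of the position
    fraction = position - lower
    if lower < 1:
        return arr[0]                  # position falls before the first value
    if lower >= n:
        return arr[-1]                 # position falls at or past the last value
    return arr[lower - 1] + fraction * (arr[lower] - arr[lower - 1])

scores = [60, 55, 30, 90, 88, 79, 45, 66, 93, 80]

q1 = fractile(scores, 1, 4)      # first quartile, position 2.75
q2 = fractile(scores, 2, 4)      # second quartile, position 5.5
q3 = fractile(scores, 3, 4)      # third quartile, position 8.25
d9 = fractile(scores, 9, 10)     # ninth decile, position 9.9
p50 = fractile(scores, 50, 100)  # fiftieth percentile, position 5.5

print(q1, q2, q3, d9, p50)
```

The second quartile and the fiftieth percentile land on the same position, 5.5, and both equal the median of the scores, (66 + 79)/2 = 72.5.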

* ith Quartile of Ungrouped Data

$$Q_i = \text{the value of the } \left[\frac{i(n+1)}{4}\right]\text{th observation in the array} = X_{\left(\frac{i(n+1)}{4}\right)}$$

MEASURES OF VARIATION OR DISPERSION

- indicate the extent to which individual items in a series are scattered about an average.

Some Uses for Measuring Dispersion:

- to determine the extent of the scatter so that steps may be taken to control the existing
variation

- used as a measure of reliability of the average value

1. THE RANGE

Definition: The range of a set of data is the difference between the largest and the smallest values
in the set. In a frequency distribution, the range is approximated by getting the
difference between the upper class limit of the highest class interval and the lower
class limit of the lowest class interval.

Characteristics of the Range


1. It uses only the extreme values. It fails to communicate any information about the clustering
or the lack of clustering of the values between the extremes.
2. A weakness of the range is that an outlier can greatly alter its value.
3. It cannot be approximated from open-ended frequency distributions.
4. It is unreliable when computed from a frequency distribution table with gaps or zero
frequencies.

2. THE STANDARD DEVIATION AND THE VARIANCE

Definition: For a finite population of size N, the population variance is

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

and the population standard deviation is

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$$

For a sample of size n, the sample variance is

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

and the sample standard deviation is

$$s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$

Example:
1) A comparison of coffee prices at 4 randomly selected grocery stores in Imus City showed
increases from the previous month of 52, 55, 57 and 60 pesos for a 200-gram jar. Find the
standard deviation of this random sample of price increases.
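
Because the four stores are a sample, the sample formulas with divisor n − 1 apply. The sketch below follows the definition step by step; the function names are illustrative only.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_std_dev(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))

increases = [52, 55, 57, 60]   # price increases in pesos

s_squared = sample_variance(increases)
s = sample_std_dev(increases)

print(round(s_squared, 2), round(s, 2))
```

The sample mean is 56, the squared deviations are 16, 1, 1 and 16, so the sample variance is 34/3, about 11.33, and the sample standard deviation is about 3.37 pesos.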

Characteristics of the Standard Deviation

1. It is affected by the value of every observation. It may be distorted by a few extreme values.

2. It cannot be computed from an open-ended distribution.

3. If each observation of a set of data is transformed to a new set by the addition (or subtraction)
of a constant c, the standard deviation of the new data set is the same as the standard
deviation of the original data set.

4. If a set of data is transformed to a new set by multiplying (or dividing) each observation by a
constant c, the standard deviation of the new data set is equal to the standard deviation of the
original data set multiplied (or divided) by c.
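
Characteristics 3 and 4 can be verified numerically. The sketch below uses `statistics.stdev` from the Python standard library (the sample standard deviation) on the coffee data; the constant c = 5 is an arbitrary choice made for illustration.

```python
import math
import statistics

data = [52, 55, 57, 60]
c = 5   # arbitrary positive constant, chosen only for this check

s_original = statistics.stdev(data)
s_shifted = statistics.stdev([x + c for x in data])   # add c to every value
s_scaled = statistics.stdev([x * c for x in data])    # multiply every value by c

# Adding a constant leaves the spread unchanged; multiplying rescales it.
print(math.isclose(s_shifted, s_original))
print(math.isclose(s_scaled, c * s_original))
```

For a negative constant the factor is |c| rather than c, since a standard deviation is never negative.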

3. THE COEFFICIENT OF VARIATION
Definition: The coefficient of variation, CV, is the ratio of the standard deviation to the mean and
is usually expressed in percentage. It is computed as

$$CV = \frac{\sigma}{\mu} \times 100\%$$

and its sample counterpart

$$CV = \frac{s}{\bar{X}} \times 100\%$$

Examples:


1. The foreign exchange rate is an indicator of the stability of the peso and is also an indicator of
the economic performance. In 1992 Bangko Sentral ng Pilipinas (BSP) put the peso on a
floating rate basis. Market forces and not government policy have determined the level of the
peso since. Government intervenes through the BSP, only when there are speculative
elements in the market. Given below are the means and standard deviations of the quarterly
$ exchange rate for the periods 1989 to 1991 and 1992 to 1994. Which of the two periods is
more stable?

Period       Mean    S.D.
1989-1991    22.4    1.84
1992-1994    26.4    1.16

Solution:

$$CV_{89-91} = \frac{1.84}{22.4} \times 100\% = 8.21\%$$

$$CV_{92-94} = \frac{1.16}{26.4} \times 100\% = 4.39\%$$

Therefore, the period 1992-1994 is more stable.

2. Two of the quality criteria in processing butter cookies are the weight and color development in
the final stage of oven browning. Individual pieces of cookies are scanned by a
spectrophotometer calibrated to reflect yellow-brown light. The readout is expressed in percent
of a standard yellow-brown reference plate and a value of 41 is considered optimal (golden
yellow). The cookies were also weighed in grams at this stage. The means and standard
deviations of 30 sample cookies are presented below.

Criterion    Mean    S.D.
Color        41.1    10
Weight       17.7    3.2

Which of the two quality criteria is more varied?

Solution:
$$CV_{color} = \frac{10}{41.1} \times 100\% = 24.33\%$$

$$CV_{weight} = \frac{3.2}{17.7} \times 100\% = 18.08\%$$

Therefore, color is more varied than weight.
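
Because the coefficient of variation is a ratio, it allows quantities on different scales (pesos per dollar, color readout, grams) to be compared. The sketch below recomputes the coefficients for both examples from the tabulated means and standard deviations; the function and variable names are illustrative only.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return std_dev / mean * 100

# Example 1: quarterly exchange rate, by period
cv_1989_1991 = coefficient_of_variation(1.84, 22.4)
cv_1992_1994 = coefficient_of_variation(1.16, 26.4)

# Example 2: butter cookie quality criteria
cv_color = coefficient_of_variation(10, 41.1)
cv_weight = coefficient_of_variation(3.2, 17.7)

# The smaller CV marks the more stable (less varied) quantity.
more_stable_period = "1992-1994" if cv_1992_1994 < cv_1989_1991 else "1989-1991"
more_varied_criterion = "color" if cv_color > cv_weight else "weight"

print(more_stable_period, more_varied_criterion)
```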

4. MEASURE OF SKEWNESS

Definition. A measure of skewness shows the degree of asymmetry, or departure from symmetry
of a distribution. It indicates not only the amount of skewness but also the direction.

TWO TYPES OF SKEWNESS

1. Positively Skewed or Skewed to the Right

• distribution tapers more to the right than to the left

• longer tail to the right

• more concentration of values below than above the mean

• most skewed curves encountered in the social sciences are skewed to the right.

2. Negatively Skewed or Skewed to the Left

• distribution tapers more to the left than to the right

• longer tail to the left

• more concentration of values above than below the mean

• only rarely do we find curves skewed to the left, and even more rarely do we find data
characteristically skewed to the left

Coefficient of Skewness

$$Sk = \frac{3(\bar{X} - Md)}{s}$$

where $\bar{X}$ = mean, Md = median, s = standard deviation

Remarks:
1. Since the mode is frequently only an approximation, the median-based formula above is
preferred to one based on the mode.
2. Interpretation of the measures of skewness:

Sk > 0: positively skewed since $\bar{X}$ > Md > Mo

Sk < 0: negatively skewed since $\bar{X}$ < Md < Mo

Sk = 0: symmetric since $\bar{X}$ = Md = Mo

Example: Suppose the following descriptive measures were obtained from the pre-final exam
scores of GNED03 students: mean = 74.1, median = 75, mode = 84, and standard deviation
= 11.25. Compute the coefficient of skewness.

Solution:

$$Sk = \frac{3(74.1 - 75)}{11.25} = -0.24$$
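
The same computation, together with the sign-based interpretation, can be sketched in a few lines. The function names are illustrative only.

```python
def coefficient_of_skewness(mean, median, std_dev):
    """Median-based coefficient: 3 (mean - median) / standard deviation."""
    return 3 * (mean - median) / std_dev

def skew_direction(sk):
    """Interpret the sign of the coefficient of skewness."""
    if sk > 0:
        return "positively skewed"
    if sk < 0:
        return "negatively skewed"
    return "symmetric"

sk = coefficient_of_skewness(74.1, 75, 11.25)
direction = skew_direction(sk)

print(round(sk, 2), direction)
```

The negative sign agrees with the ordering mean < median < mode (74.1 < 75 < 84) given in the example, so the score distribution is skewed to the left.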
