Data-Management-Lecture-Notes
DATA MANAGEMENT
Definition of Terms:
Types of Variable
Qualitative        Quantitative
Civil Status       Age
Nationality        No. of Children
Degree earned      Ounces of ice cream
Gender
Letter Grade
Job/occupation
Types of Quantitative
1. Discrete Variables
2. Continuous Variables
Nominal Level - characterized by data that consist of names, labels, or categories only. The data cannot
be arranged in an ordering scheme (such as low to high).
Example:
Survey responses as yes, no, undecided
Ordinal Level - involves data that may be arranged in some order, but differences between data values
either cannot be determined or are meaningless.
Example:
Course grades A, B+, B, C+, C, D, F or AF
Interval Level - like the ordinal level, with the additional property that the difference between any two
data values is meaningful. However, there is no natural zero starting point (where none of the quantity
is present).
Example:
IQ, Temperature
Ratio Level - possesses the characteristics of the interval level, and there exists a true zero. Differences and
ratios are meaningful.
Example:
Height, Salary, Time
Nationality
1. Direct or Interview method - this is one of the most effective methods of collecting original data. To
obtain accurate responses, well-trained interviewers should conduct the interview. The interviewers can be of
great help to the respondents in answering questions that the respondents do not understand.
2. Indirect or Questionnaire method - this is one of the easiest methods of data gathering. It takes time to
prepare because questionnaires need to be attractive. It can include illustrations, pictures and sketches.
Its contents, especially the directions, must be precise, clear and self-explanatory.
3. Registration Method - Through this method, the respondents provide information in compliance with
certain laws, policies, rules, regulations, decrees or standard practices. Examples are marriage contracts,
birth certificates, motor vehicle registrations, firearm licenses, and registrations of corporations, real estate,
voters, etc.
4. Other methods
4.1. Observation - This method is used to gather data regarding attitudes, behavior, values and cultural
patterns of the samples under investigation.
4.2. Telephone Interview - This method is employed if the questions to be asked are brief and few. An
example is the check made on listeners to certain radio programs, such as asking which program their radio is
tuned in to.
4.3. Experiments- This method is applied to collect or gather data if the investigator wants to control the
factors affecting the variable being studied. An example is when the researcher aims to determine the
different factors affecting the academic performance of the students such as methods or approaches
used in teaching, etc.
1. Textual form - the presentation is in narrative or paragraph form. The data are within the text of the
paragraph. This form may not catch the immediate interest of the reader. However, it can present a more
comprehensive picture of the data because of further written explanation of its nature.
2. Tabular form - makes use of rows and columns, as in a frequency table or frequency distribution. The data are
presented in a systematic and orderly manner, which catches one's attention and may facilitate the
comprehension and analysis of the data presented.
3. Graphical form - the numerical data provided in a frequency distribution can be made more
interesting and easier to understand when depicted in graphical form. A graph is a pictorial or
geometrical representation of a given set of data.
Organizing Data
Categorical Distribution
Twenty-five inductees were given a blood test to determine their blood type.
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
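The categorical distribution for these 25 blood types can be tallied mechanically. The following is a minimal sketch in Python; the variable names and the percent column are illustrative choices, not part of the original notes.

```python
from collections import Counter

# Blood types of the 25 inductees, read row by row from the data above.
blood_types = """
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
""".split()

counts = Counter(blood_types)
total = sum(counts.values())

# Print a simple categorical frequency table with percentages.
print(f"{'Class':<6}{'Frequency':>10}{'Percent':>10}")
for blood_type in ("A", "B", "O", "AB"):
    frequency = counts[blood_type]
    print(f"{blood_type:<6}{frequency:>10}{frequency / total:>10.0%}")
print(f"{'Total':<6}{total:>10}{total / total:>10.0%}")
```

Each class frequency is simply the number of times that category appears, and the frequencies must add up to the number of observations (25).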
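25 28 27 30 32 25 31 26 29
31 20 21 32 18 50 53 60 50
45 40 37 25 20 27 32 24 29
25 24 10 12 15 28  6 54 30

Class Limits   Class Boundaries   Tally                     Frequency
 6 - 14         5.5 - 14.5        III                        3
15 - 23        14.5 - 23.5        IIIII                      5
24 - 32        23.5 - 32.5        IIIII IIIII IIIII IIIII   20
33 - 41        32.5 - 41.5        II                         2
42 - 50        41.5 - 50.5        III                        3
51 - 59        50.5 - 59.5        II                         2
60 - 68        59.5 - 68.5        I                          1
Total                             36                        36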
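A grouped frequency distribution like the one above can be rebuilt from the raw scores. The sketch below assumes a class width of 9 and a first class that starts at the smallest score (6); both assumptions are inferred from the class limits shown in the table (33 - 41, 51 - 59, 60 - 68), not stated in the notes.

```python
# The 36 raw scores listed above.
scores = [
    25, 28, 27, 30, 32, 25, 31, 26, 29,
    31, 20, 21, 32, 18, 50, 53, 60, 50,
    45, 40, 37, 25, 20, 27, 32, 24, 29,
    25, 24, 10, 12, 15, 28, 6, 54, 30,
]

CLASS_WIDTH = 9  # assumption: inferred from the class limits in the table

rows = []
lower = min(scores)  # assumption: the first class starts at the lowest score
while lower <= max(scores):
    upper = lower + CLASS_WIDTH - 1
    frequency = sum(lower <= s <= upper for s in scores)
    # Class boundaries extend half a unit beyond the class limits.
    rows.append((lower, upper, lower - 0.5, upper + 0.5, frequency))
    lower += CLASS_WIDTH

for low, high, low_bound, high_bound, freq in rows:
    print(f"{low:>2} - {high:<2}   {low_bound:>4} - {high_bound:<4}   {freq:>2}")
print("Total", sum(freq for *_, freq in rows))
```

The frequencies must sum to the number of scores (36), which is a quick check that no value fell outside the classes.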
Data Presentation
Histogram - a bar graph in which the horizontal scale represents classes of data values and the vertical
scale represents frequencies. The heights of the bars correspond to the frequency values, and the bars
are drawn adjacent to each other (without gaps).
Figure 1: Histogram
Frequency Polygon
- uses line segments connected to points located directly above class midpoint values.
Ogive
- a line graph that depicts cumulative frequencies, just as the cumulative frequency
distribution lists them. The graph begins with the lower boundary of the first class and ends
with the upper boundary of the last class.
Pie Graph
- a circle divided into sections according to the percentage of frequencies in each category of
the distribution
Bar Graph
- represents the data by using vertical or horizontal bars whose heights or lengths represent
the frequencies of the data
Line Graph
- used to show trends and increases or decreases in sales, scores, population per year, etc.
Pareto Chart
- a type of chart that contains both bars and a line graph, where individual values are
represented in descending order by bars, and the cumulative total is represented by the line
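As a rough illustration of what a Pareto chart plots, the sketch below orders the blood-type frequencies from the Categorical Distribution example in descending order (the bars) and accumulates them (the line). It only computes the numbers; drawing the chart is left out, and the variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# Blood types of the 25 inductees from the Categorical Distribution example.
blood_types = "A B B AB O O O B AB B B B O A O A O O O AB AB A O B A".split()

# Bars: categories sorted by descending frequency.
bars = Counter(blood_types).most_common()
# Line: running (cumulative) total of the sorted frequencies.
cumulative = list(accumulate(freq for _, freq in bars))

for (category, freq), running in zip(bars, cumulative):
    share = running / len(blood_types)
    print(f"{category:<3} {freq:>2} {running:>3} {share:>5.0%}")
```

The last cumulative value always equals the total number of observations, so the line ends at 100%.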
Once data are collected, it is useful to summarize the data set by identifying a value around which data
are centered.
Mean - the numerical balancing point of the data set. It is calculated by adding all the data values and
dividing the sum by the total number of data points.
Sample Mean ($\bar{x}$)
$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Population Mean ($\mu$)
$$\mu = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N}$$
Example: 6 test scores in Statistics
12 10 15 8 5 18
$$\bar{x} = \frac{12 + 10 + 15 + 8 + 5 + 18}{6} = \frac{68}{6} = 11.333\ldots$$
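The same computation can be checked with a few lines of Python. This is only a sketch of the arithmetic above; the variable names are illustrative.

```python
# The six Statistics test scores from the example.
scores = [12, 10, 15, 8, 5, 18]

# Mean = sum of the values divided by the number of values.
mean = sum(scores) / len(scores)
print(sum(scores), len(scores), round(mean, 3))  # 68 6 11.333
```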
Median (Md)
- simply the middle number in an ordered set of data
if n is odd: Md = middle score
if n is even: Md = (sum of the two middle scores) / 2
Example:
1. Six students borrowed these numbers of books in a library:
1, 2, 3, 4, 6, 7 (3 and 4 are the middle scores)
Md = (3 + 4) / 2 = 3.5
2. Five students borrowed these numbers of books in a library:
2, 2, 3, 3, 3
Md = 3
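The odd/even rule can be written as a short function. The sketch below is one straightforward way to do it; the function name is illustrative, and Python's built-in `statistics.median` gives the same results.

```python
def median(values):
    """Return the middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]  # n is odd: single middle score
    return (ordered[middle - 1] + ordered[middle]) / 2  # n is even

print(median([1, 2, 3, 4, 6, 7]))  # 3.5
print(median([2, 2, 3, 3, 3]))     # 3
```

Sorting first matters: the rule applies to the ordered data, not to the values in the order they were collected.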
Mode (Mo)
- most frequently occurring number in a data set
Examples:
1. 2 5 8 3 2 2 3 1
(Mode is 2)
2. 1 1 3 5 7 8 3 4
(Bimodal; the modes are 1 and 3)
3. 2 4 6 8 9 3 1 5
(No mode)
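The three cases above (one mode, two modes, no mode) can be reproduced with a small helper. This is a sketch that follows the convention of these notes, where a data set in which every value occurs only once has no mode; the function name is illustrative.

```python
from collections import Counter

def modes(values):
    """Return the list of most frequent values, or [] when every value occurs once."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no value repeats: no mode (convention used in these notes)
    return [value for value, count in counts.items() if count == highest]

print(modes([2, 5, 8, 3, 2, 2, 3, 1]))  # [2]
print(modes([1, 1, 3, 5, 7, 8, 3, 4]))  # [1, 3]
print(modes([2, 4, 6, 8, 9, 3, 1, 5]))  # []
```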
Midrange (MR)
- the value midway between the highest and lowest values in the original data set
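In other words, the midrange is the average of the highest and lowest values. As a minimal sketch, the code below applies this to the six test scores used in the mean example (the notes give no separate midrange example, so reusing that data is an assumption made for illustration).

```python
# The six Statistics test scores from the mean example.
scores = [12, 10, 15, 8, 5, 18]

# Midrange = (highest value + lowest value) / 2
midrange = (max(scores) + min(scores)) / 2
print(midrange)  # 11.5
```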
Weighted Mean
- computed by multiplying each value by its corresponding weight and dividing the sum of the
products by the sum of the weights.
$$\bar{x} = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
where $w_1, w_2, w_3, \dots, w_n$ are the weights and $x_1, x_2, x_3, \dots, x_n$ are the values.
Example:
Course     Units   Grade   Grade Point
PE         2       A       4
Calculus   5       C+      2.5
English    3       B       3
$$\bar{x} = \frac{2(4) + 5(2.5) + 3(3)}{2 + 5 + 3} = \frac{29.5}{10} = 2.95$$
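The weighted mean in this example can be verified in code. The sketch below treats the number of units as the weights and the grade points as the values, which is how the computation above reads; the variable names are illustrative.

```python
# Weights (units) and values (grade points) for PE, Calculus, and English.
units = [2, 5, 3]
grade_points = [4, 2.5, 3]

# Weighted mean = sum of (weight * value) divided by the sum of the weights.
weighted_sum = sum(w * x for w, x in zip(units, grade_points))
weighted_mean = weighted_sum / sum(units)
print(weighted_sum, sum(units), weighted_mean)  # 29.5 10 2.95
```

Because Calculus carries the most units, its lower grade point pulls the weighted mean below the simple average of the three grade points.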
Measures of Dispersions
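Population standard deviation:
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
where $x$ - individual value
$\mu$ - population mean
$N$ - population size
Sample standard deviation:
$$s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
where $x$ - individual value
$\bar{x}$ - sample mean
$n$ - sample size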
Let A = 5, 5, 5, 5, 5, 5, 5, 5, 5, 5
B = 4, 4, 4, 5, 5, 5, 5, 6, 6, 6
D = 0, 5, 5, 5, 5, 5, 5, 5, 5, 10
• Find the range and the standard deviation of each data set.
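One way to check answers to this exercise is to code the two standard deviation formulas directly. The sketch below assumes the range is the highest value minus the lowest value (the usual definition) and computes both the population and the sample standard deviation, since the notes do not say which one is intended; the function names are illustrative.

```python
from math import sqrt

def value_range(values):
    """Range = highest value - lowest value."""
    return max(values) - min(values)

def population_sd(values):
    """Square root of the mean squared deviation from the mean (divide by N)."""
    mu = sum(values) / len(values)
    return sqrt(sum((x - mu) ** 2 for x in values) / len(values))

def sample_sd(values):
    """Same as above, but divide by n - 1."""
    x_bar = sum(values) / len(values)
    return sqrt(sum((x - x_bar) ** 2 for x in values) / (len(values) - 1))

data_sets = {
    "A": [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
    "B": [4, 4, 4, 5, 5, 5, 5, 6, 6, 6],
    "D": [0, 5, 5, 5, 5, 5, 5, 5, 5, 10],
}
for name, values in data_sets.items():
    print(name, value_range(values),
          round(population_sd(values), 3), round(sample_sd(values), 3))
```

All three data sets have the same mean (5), so the differences in range and standard deviation reflect only how spread out the values are.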
Measures of Positions
- the z-score for a given data value x is the number of standard deviations that x is
above or below the mean of the data
population: $$z = \frac{x - \mu}{\sigma}$$
sample: $$z = \frac{x - \bar{x}}{s}$$
Example:
A basketball player, Carl, is 78 inches tall and a volleyball player, Jane, is 76 inches tall. Carl is obviously
taller by 2 inches, but which player is relatively taller? Does Carl's height among men exceed Jane's
height among women? Men have a mean height of 69 inches and a standard deviation of 2.8 inches, while
women have a mean height of 63.6 inches and a standard deviation of 2.5 inches.
Carl: $$z = \frac{x - \bar{x}}{s} = \frac{78 - 69}{2.8} = 3.21$$
Jane: $$z = \frac{x - \bar{x}}{s} = \frac{76 - 63.6}{2.5} = 4.96$$
Carl’s height is 3.21 standard deviations above the mean, but Jane’s height is a whopping 4.96 standard
deviations above the mean.
∴ Jane’s height among women is relatively greater than Carl’s height among men.
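The comparison can be reproduced with a one-line z-score function. This is a minimal sketch; the function name is illustrative, and the men's mean is taken as 69 inches, the value used in the computation above.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

carl = z_score(78, 69, 2.8)    # Carl's height among men
jane = z_score(76, 63.6, 2.5)  # Jane's height among women
print(round(carl, 2), round(jane, 2))  # 3.21 4.96
print("Jane is relatively taller" if jane > carl else "Carl is relatively taller")
```

Converting both heights to z-scores puts them on a common scale, which is what makes a comparison across two different populations meaningful.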
Exercises
1. The average teacher’s salary in a particular city is P54,166. If the standard deviation is P10,200,
find the salaries corresponding to the following z scores.
a. 2
b. −1.6
c. 2.5
2. The mean time to download a PDF file is 12 min with a standard deviation of 4 min. Belle's download
time is 20 min. John's download time is 6 min. How does Belle's download time compare
with John's?
3. Cheryl has taken two quizzes in her history class. She scored 15 on the first quiz, for which the mean
of all scores was 12 and the standard deviation was 2.4. Her score on the second quiz, for which the
mean of all scores was 11 and the standard deviation was 2.0, was 14. In comparison to her classmates,
did Cheryl do better on the first quiz or the second quiz?
4. Roland received a score of 70 on a test for which the mean score was 65.5. Roland has learned that
the z-score for his test is 0.6. What is the standard deviation for this set of test scores?
Box-and-Whisker Plot
- shows the median, the 1st and 3rd quartiles, and the minimum and maximum values of a
data set
1. Find the five-number summary for the data values, that is, the maximum and minimum data
values, Q1 and Q3, and the median.
2. Draw a horizontal line with a scale such that it includes the maximum and minimum data values.
3. Draw a box whose vertical sides go through Q1 and Q3, and draw a vertical line through the
median.
4. Draw a line from the minimum data value to the left side of the box and a line from the
maximum data value to the right side of the box.
Example:
Construct a box-and-whisker plot for the following data:
86 77 58 45 94 96 83 76 75
65 68 72 78 85 87 92 55 61
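Step 1 (the five-number summary) can be sketched in Python as follows. The code assumes that Q1 and Q3 are the medians of the lower and upper halves of the ordered data, a common textbook convention; other quartile conventions (including those used by some software) can give slightly different values. The function names are illustrative.

```python
def median(values):
    """Middle value of the ordered data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower_half = ordered[:half]
    upper_half = ordered[half + (n % 2):]  # skip the median itself when n is odd
    return (ordered[0], median(lower_half), median(ordered),
            median(upper_half), ordered[-1])

data = [86, 77, 58, 45, 94, 96, 83, 76, 75,
        65, 68, 72, 78, 85, 87, 92, 55, 61]
print(five_number_summary(data))  # (45, 65, 76.5, 86, 96)
```

Under this convention the box would run from 65 to 86 with the median line at 76.5, and the whiskers would extend to 45 and 96.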