Group 2 - Processing and Analysis of Data PDF
WRITTEN REPORT
GROUP 2
____________________________
Practical Research I
Members:
Aboguin, Abegail E.
Ayuson, Mary Angeline R.
Cerujano, Niña Carmina D.
Reyes, Micaela V.
Tapia, Darleen A.
Rizal Technological University
Cities of Mandaluyong and Pasig
COLLEGE OF ARTS AND SCIENCES
● The data has to be processed and analyzed in accordance with the outline
laid down for the purpose at the time of developing the research plan. It is
essential for scientific study and for ensuring that we have all relevant data for
making contemplated comparisons and analysis. Technically speaking,
processing implies editing, coding, classification and tabulation of collected
data so that they are amenable to analysis. The term analysis refers to the
computation of certain measures along with searching for patterns of
relationship that exist among data-groups.
● Thus, “in the process of analysis, relationships or differences supporting or
conflicting with original or new hypotheses should be subjected to statistical
tests of significance to determine with what validity data can be said to
indicate any conclusions”. But there are persons (Selltiz, Jahoda and others)
who do not like to distinguish between processing and analysis.
● They opine that analysis of data in a general way involves a number of closely
related operations which are performed with the purpose of summarizing the
collected data and organizing these in such a manner that they answer the
research question(s). We, however, shall prefer to observe the difference
between the two terms as stated here in order to understand their implications
more clearly.
Processing Operations
1. Editing - a process of examining the collected raw data (especially in surveys)
to detect errors and omissions and to correct these when possible. As a
matter of fact, editing involves a careful scrutiny of the completed
questionnaires and/or schedules. Editing is done to assure that the data are
accurate, consistent with other facts gathered, uniformly entered, as
complete as possible and well arranged to facilitate coding and
tabulation.
- Field editing - consists in the review of the reporting forms by the
investigator for completing (translating or rewriting) what the latter has
written in abbreviated and/or in illegible form at the time of recording
the respondents’ responses. This type of editing is necessary in view of
the fact that individual writing styles often can be difficult for others to
decipher.
- Central editing - should take place when all forms or schedules have
been completed and returned to the office. This type of editing implies
that all forms should get a thorough editing by a single editor in a small
Tabulation example:
We can take up the following two problems of processing the data for analytical
purposes:
a) The problem concerning “Don’t know” (or DK) responses: While processing
the data, the researcher often comes across some responses that are difficult
to handle. One category of such responses may be ‘Don’t Know Response’ or
simply DK response.
But what about the DK responses that have already taken place?
● One way to tackle this issue is to estimate the allocation of DK answers from
other data in the questionnaire.
● The other way is to keep DK responses as a separate category in tabulation
where we can consider it as a separate reply category if DK responses
happen to be legitimate, otherwise we should let the reader make his own
decision.
● Another way is to assume that DK responses occur more or less randomly
and as such we may distribute them among the other answers in the ratio in
which the latter have occurred.
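The third approach, distributing DK responses among the other answers in the ratio in which those answers occurred, can be sketched as follows. The response counts are hypothetical and made up only to show the arithmetic; the function name `distribute_dk` is likewise an assumption for illustration.

```python
def distribute_dk(counts, dk_label="DK"):
    """Spread DK responses over the other categories in proportion
    to how often each of those categories occurred."""
    dk = counts.get(dk_label, 0)
    others = {k: v for k, v in counts.items() if k != dk_label}
    total = sum(others.values())
    if total == 0:
        raise ValueError("no substantive answers to distribute DK over")
    # Each category receives a share of the DK count equal to its
    # share of the substantive (non-DK) answers.
    return {k: v + dk * v / total for k, v in others.items()}


# Hypothetical tally of 100 questionnaires
responses = {"Yes": 48, "No": 32, "DK": 20}
adjusted = distribute_dk(responses)
print(adjusted)
```

This treatment is only reasonable when the DK responses really do occur more or less randomly; when they are legitimate answers, keeping them as a separate category is the safer choice.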
ELEMENTS/TYPES OF ANALYSIS
3) Correlation Analysis studies the joint variation of two or more variables for
determining the amount of correlation between two or more variables.
4) Causal analysis is concerned with the study of how one or more variables
affect changes in another variable. It is thus a study of functional relationships
existing between two or more variables. This analysis can be termed as
regression analysis.
In modern times, with the availability of computer facilities, there has been a rapid
development of multivariate analysis which may be defined as “all statistical methods
which simultaneously analyze more than two variables on a sample of observations”.
Usually the following analyses are involved when we refer to
multivariate analysis:
STATISTICS IN RESEARCH
The important statistical measures that are used to summarize the survey/research
data are:
● Measures of central tendency (or statistical averages) tell us the point about
which items have a tendency to cluster. Mean, median and mode are the
most popular averages. The mean is also known as the arithmetic average.
For example: Compute the mean of the given numbers: 100, 115, 125, 150, 145
For example: Find the median of the given numbers: 25, 20, 35, 40, 30
20, 25, 30, 35, 40
30 is the median
For example: Find the mode of the given numbers: 45, 50, 55, 60, 60, 65, 70
45, 50, 55, 60, 60, 65, 70
60 is the mode
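The three worked examples above can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are chosen here for illustration only.

```python
import statistics

mean_data = [100, 115, 125, 150, 145]
median_data = [25, 20, 35, 40, 30]
mode_data = [45, 50, 55, 60, 60, 65, 70]

# Mean: sum of the items divided by the number of items
mean_value = statistics.mean(mean_data)
# Median: middle item once the data are arranged in order
median_value = statistics.median(median_data)
# Mode: the item that occurs most often
mode_value = statistics.mode(mode_data)

print(mean_value, median_value, mode_value)
```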
MEASURES OF DISPERSION
● An average can represent a series only as well as a single figure can, but it
certainly cannot reveal the entire story of any phenomenon under study.
❖ A measure of dispersion describes the spread of the data. Range and
Mean Deviation are included here.
(A) Range is the simplest possible measure of dispersion and is defined as the
difference between the values of the extreme items of a series.
Formula: Highest Value - Lowest Value
For example: Find the range of the given numbers: 108, 112, 125, 118 and 113.
(B) Mean Deviation is the average of difference of the values of items from some
average of the series. Such a difference is technically described as deviation.
❖ Mean deviation is a statistical measure that computes the average
deviation from the mean value of a given data collection.
For example: Find the mean deviation of the given numbers: 108, 112, 125, 118 and
113.
Mean = (108 + 112 + 125 + 118 + 113) / 5 = 115.2
Next, Mean deviation = (|125 - 115.2| + |118 - 115.2| + |113 - 115.2| + |112 - 115.2| + |108 - 115.2|) / 5
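Both dispersion examples use the same five numbers, so they can be worked through together. The sketch below simply applies the two definitions given above (highest value minus lowest value, and the average absolute deviation from the mean); the variable names are illustrative.

```python
data = [108, 112, 125, 118, 113]

# Range: difference between the extreme items of the series
data_range = max(data) - min(data)

# Mean deviation: average of the absolute deviations from the mean
mean = sum(data) / len(data)
mean_deviation = sum(abs(x - mean) for x in data) / len(data)

print(data_range, round(mean, 2), round(mean_deviation, 2))
```

Here the range works out to 17 and the mean deviation to about 5.04 around the mean of 115.2.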
MEASURES OF ASYMMETRY (SKEWNESS)
Normal Distribution
Curve showing no skewness, in which case we have X̄ = M = Z (mean = median = mode)
SKEWNESS
● Measures of imbalance or Asymmetry from the mean of a data distribution.
● Values concentrated on one side of the data distribution
POSITIVE/RIGHT SKEWED
● Tail on the right side is longer
● Outliers are on the right side
NEGATIVE/LEFT SKEWED
● Tail on the left side is longer
● Outliers are on the left side
Such a curve is technically described as a normal curve and the relating distribution
as normal distribution. Such a curve is a perfectly bell-shaped curve in which case
the value of X̄ or M or Z is just the same and skewness is altogether absent. But if
the curve is distorted (whether on the right side or on the left side), we have an
asymmetrical distribution which indicates that there is skewness. If the curve is
distorted on the right side, we have positive skewness but when the curve is
distorted towards the left, we have negative skewness.
In case of positive skewness, we have Z < M < X̄ and in case of negative skewness
we have X̄ < M < Z.
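The ordering of mode, median and mean described above can be checked directly on data. The sketch below is only an illustration: the three data sets are hypothetical, and the helper `skew_direction` is a made-up name that compares the three averages rather than computing a formal skewness coefficient.

```python
import statistics


def skew_direction(data):
    """Classify skewness from the ordering of mode (Z), median (M) and mean."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    mode = statistics.mode(data)
    if mode < median < mean:
        return "positive"   # Z < M < mean: longer tail on the right
    if mean < median < mode:
        return "negative"   # mean < M < Z: longer tail on the left
    if mean == median == mode:
        return "none"       # the three averages coincide
    return "unclear"        # ordering does not match any of the cases


# Hypothetical data sets
right_tailed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 14]
left_tailed = [1, 6, 10, 11, 12, 12, 13, 13, 13, 14]
symmetric = [1, 2, 3, 3, 3, 4, 5]

print(skew_direction(right_tailed))
print(skew_direction(left_tailed))
print(skew_direction(symmetric))
```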
MEASURES OF RELATIONSHIP
So far we have dealt with those statistical measures that we use in the context of a
univariate population, i.e., the population consisting of measurement of only one
variable. But if we have the data on two variables, we are said to have a bivariate
population and if the data happen to be on more than two variables, the population is
known as a multivariate population. If for every measurement of a variable, X, we have
a corresponding value of a second variable, Y, the resulting pairs of values are called a
bivariate population. If, in addition, we also have a corresponding value of a third
variable, Z, or a fourth variable, W, and so on, the resulting sets of values are called
a multivariate population. In case of bivariate or multivariate populations, we often
wish to know the relation of the two or more variables in the data to one another.
We may like to know, for example, whether the number of hours students devote to
studies is somehow related to their family income, to age, to sex or to other similar
factors.
The cross tabulation approach is especially useful when the data are in nominal form.
Under it we classify each variable into two or more categories and then cross classify
the variables in these subcategories. Then we look for interactions between them
which may be symmetrical, reciprocal or asymmetrical. A symmetrical
relationship is one in which the two variables vary together, but we assume that
neither variable is due to the other. A reciprocal relationship exists when the two
variables mutually influence or reinforce each other. Asymmetrical relationship is
said to exist if one variable (the independent variable) is responsible for another
variable (the dependent variable).
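A two-way cross classification of nominal data can be built by counting how often each pair of categories occurs. The records below are hypothetical (sex against a low/high study-hours category) and serve only to show the mechanics.

```python
from collections import Counter

# Hypothetical survey records: (sex, study-hours category)
records = [
    ("Male", "Low"), ("Male", "High"), ("Female", "High"),
    ("Female", "High"), ("Male", "Low"), ("Female", "Low"),
]

# Count every (row category, column category) combination
table = Counter(records)
rows = sorted({sex for sex, _ in records})
cols = sorted({hours for _, hours in records})

# Print the two-way table with one line per row category
print("", *cols)
for r in rows:
    print(r, *[table[(r, c)] for c in cols])
```

Reading across the rows of such a table is what lets us judge whether the two variables appear to be related.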
The cross classification procedure begins with a two-way table which indicates
whether there is or there is not an interrelationship between the variables. This sort
of analysis can be further elaborated in which case a third factor is introduced into
the association through cross-classifying the three variables. By doing so we find
conditional relationship in which factor X appears to affect factor Y only when factor
Z is held constant. The correlation, if any, found through this approach is not
The calculation and significance testing of the ranking variable require the
following data assumptions to hold true:
If your data do not meet the above assumptions, then you would need Spearman’s
Coefficient. It is necessary to know what a monotonic function is to understand the
Spearman correlation coefficient. A monotonic function is one that either never
decreases or never increases as its independent variable increases. A monotonic
function can be explained using the image below:
1. Monotonically increasing: When the ‘x’ variable increases and the ‘y’
variable never decreases.
2. Monotonically decreasing: When the ‘x’ variable increases but the ‘y’
variable never increases
3. Not monotonic: When the ‘x’ variable increases and the ‘y’ variable
sometimes increases and sometimes decreases.
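The three cases above can be told apart with a simple check on the 'y' values taken in order of increasing 'x'. This is a minimal sketch; the function name `classify_monotonic` and the sample values are assumptions made for illustration.

```python
def classify_monotonic(ys):
    """Classify y values that are listed in order of increasing x."""
    steps = list(zip(ys, ys[1:]))
    # y never decreases from one point to the next
    # (a completely flat series also falls in this branch)
    if all(b >= a for a, b in steps):
        return "monotonically increasing"
    # y never increases from one point to the next
    if all(b <= a for a, b in steps):
        return "monotonically decreasing"
    return "not monotonic"


print(classify_monotonic([1, 2, 2, 5, 9]))
print(classify_monotonic([9, 7, 7, 3, 1]))
print(classify_monotonic([1, 4, 2, 6, 3]))
```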
The closer the ρ value is to 0, the weaker the association between the two ranks.
We must be able to rank the data before proceeding with the Spearman’s Rank
Coefficient of Correlation. It is also important to observe whether, as one variable
increases, the other variable follows a monotonic relation.
EXAMPLE:
ρ = 1 - (6 × 12) / (9 × (81 - 1))
= 1 - 72/720
= 1 - 0.1
= 0.9
The Spearman’s Rank Correlation for this data is 0.9; a ρ value nearing +1
indicates a nearly perfect positive association of ranks.
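The computation above follows the usual rank-difference form ρ = 1 - 6Σd² / (n(n² - 1)), with Σd² = 12 and n = 9. The sketch below applies that form to two rank lists. The original data table is not reproduced here, so the ranks are hypothetical, chosen only so that the squared differences sum to 12; the formula in this form assumes there are no tied ranks.

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation from two lists of ranks (no ties)."""
    n = len(rank_x)
    # Sum of squared differences between paired ranks
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))


# Hypothetical ranks for 9 items; squared differences sum to 12
rank_x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
rank_y = [3, 2, 1, 5, 4, 7, 6, 8, 9]

rho = spearman_rho(rank_x, rank_y)
print(round(rho, 2))
```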
Karl Pearson’s coefficient of correlation (or simple correlation) is the most widely
used method of measuring the degree of relationship between two variables. This
coefficient assumes the following:
The Pearson correlation can evaluate ONLY a linear relationship between two
continuous variables. (A relationship is linear only when a change in one variable is
associated with a proportional change in the other variable.)
Below is an example of how the Pearson correlation coefficient (r) varies with the
strength and the direction of the relationship between the two variables. Note that
when no linear relationship could be established (refer to graphs in the third column),
the Pearson coefficient yields a value of zero.
EXAMPLE:
2. When a correlation coefficient is (-1), that means for every increase in one
variable, there is a decrease of a fixed proportion in the other. For example, the
decrease in the quantity of gas in a gas tank shows a perfect (almost) inverse
correlation with speed.
3. When a correlation coefficient is (0), that means an increase in one variable is
accompanied by neither a consistent increase nor a consistent decrease in the other,
and the two variables are not linearly related.
1. The fundamental difference between the two correlation coefficients is that the
Pearson coefficient works with a linear relationship between the two variables
whereas the Spearman Coefficient works with monotonic relationships as
well.
2. One more difference is that Pearson works with raw data values of the
variables whereas Spearman works with rank-ordered variables.
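The two differences listed above show up clearly on data that are monotonic but not linear. In the sketch below, y is the cube of x (a hypothetical example): the rank-based Spearman value is exactly 1, while the Pearson value computed on the raw data stays below 1. The helper names `pearson_r` and `to_ranks` are assumptions for illustration, and `to_ranks` assumes there are no tied values.

```python
import math


def pearson_r(xs, ys):
    """Pearson's coefficient computed on the raw data values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def to_ranks(values):
    """Replace each value by its rank (1 = smallest); assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]          # monotonic but not linear

r_pearson = pearson_r(x, y)
# Spearman: the same coefficient applied to the rank-ordered variables
r_spearman = pearson_r(to_ranks(x), to_ranks(y))

print(round(r_pearson, 3), round(r_spearman, 3))
```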
where the symbol Ŷ denotes the estimated value of Y for a given value of X.
This equation is known as the regression equation of Y on X (also represents
the regression line of Y on X when drawn on a graph) which means that each
unit change in X produces a change of b in Y, which is positive for direct and
negative for inverse relationships.
● The generally used method to find the ‘best’ fit that a straight line of this kind
can give is the least-square method. To use it efficiently, we first determine
Then
These measures define a and b which will give the best possible fit
through the original X and Y points and the value of r can then be worked out
as under:
and then solving these equations for finding a and b values. Once these values are
obtained and have been put in the equation Y = a + bX, we say that we have fitted
the regression equation of Y on X to the given data. In a similar fashion, we can
develop the regression equation of X on Y viz., X = a + bY, presuming Y as an
independent variable and X as a dependent variable.
Example:
Here, we will be citing a scenario that serves as an example of the implementation of
simple regression analysis.
Let us assume the average speed when 2 highway patrols are deployed is 75 mph,
or 35 mph when 10 highway patrols are deployed. The question thus is what is the
average speed of cars on the freeway when 5 highway patrols are deployed?
Using our simple regression analysis formula, we can thus compute the
values and derive the following equation: Y = 85 + (-5)X
Therefore, the average speed of cars on the highway when there are zero highway
patrols operating (X=0) will be 85 mph. For every extra highway patrol car working,
the average speed will reduce by 5 mph.
Hence, for 5 patrol cars (X = 5), we have Y = 85 + (-5) (5) = 85 – 25 = 60 mph.
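The highway patrol figures can be reproduced with a small least-squares fit of Y = a + bX. This is a sketch under the usual least-squares definitions of a and b; the function name `fit_line` is chosen here for illustration. With only two data points the fitted line passes through both of them exactly.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b in Y = a + bX."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    # Intercept: the line passes through the point of means
    a = mean_y - b * mean_x
    return a, b


patrols = [2, 10]        # number of highway patrols deployed (X)
speeds = [75, 35]        # average speed in mph (Y)

a, b = fit_line(patrols, speeds)
predicted = a + b * 5    # average speed with 5 patrols
print(a, b, predicted)
```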
● The relationship analysis is used when there are two or more independent
variables.
● The equation describing such a relationship is called the multiple regression
equation. We here explain multiple correlation and regression taking only two
independent variables and one dependent variable (convenient computer
programs exist for dealing with a great number of variables). In this situation
the results are interpreted as shown below:
Multiple regression equation assumes the form
● With more than one independent variable, we may distinguish between
the collective effect of the two independent variables and the individual effect
of each of them taken separately. The collective effect is given by the
coefficient of multiple correlation
PARTIAL CORRELATION
Example:
Using the partial correlation, you will measure whether there is a linear relationship
between 10,000 m running performance and VO2max (a marker of aerobic fitness
which refers to the maximum amount of oxygen a person can utilize during exercise),
whilst controlling for wind speed and relative humidity.
Data given:
Formula:
and
The partial correlation coefficients are called first order coefficients when one
variable is held constant as shown above; they are known as second order
coefficients when two variables are held constant and so on.
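A first order coefficient can be computed from the three simple correlations among the variables. The formula used below is the standard textbook form, r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz²)(1 - r_yz²)); it is assumed here because the report's own formula is not reproduced in this copy. The correlation values are hypothetical and the function name `partial_corr` is illustrative.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """First order partial correlation of X and Y, holding Z constant."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator


# Hypothetical simple correlations among three variables
r_xy, r_xz, r_yz = 0.8, 0.5, 0.5

r_xy_given_z = partial_corr(r_xy, r_xz, r_yz)
print(round(r_xy_given_z, 4))
```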
Assumptions:
There are at least 5 assumptions to consider when using the partial correlation to
analyze your chosen data.
❖ Assumption #1: You have one (dependent) variable and one (independent)
variable and these are both measured on a continuous scale (i.e., they are
measured on an interval or ratio scale).
❖ Assumption #2: You have one or more control variables, also known as
covariates.
❖ Assumption #3: There needs to be a linear relationship between all three
variables.
❖ Assumption #4: There should be no significant outliers.
❖ Assumption #5: Your variables should be approximately normally distributed.
Types of Association:
These three types of association can be put symbolically as shown here in the slide:
In order to find out the degree or intensity of association between two or more sets of
attributes, we should work out the coefficient of association.
Partial Association
Illusory Association
Chi-square test
Contingency Table
● We can have manifold classification of the two attributes, in which case each
of the two attributes is first observed and then classified into two
or more subclasses, resulting in what is called a contingency table.
● But the practice of combining classes is not considered very correct and at
times it is also inconvenient. Karl Pearson has suggested a measure known
as the coefficient of mean square contingency for studying association in
contingency tables.
● The formula for the coefficient of mean square contingency that Karl Pearson has
suggested is:
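The formula itself is not reproduced in this copy. The sketch below assumes the standard form of Pearson's coefficient, C = sqrt(χ² / (χ² + N)), where χ² is the chi-square value of the table and N is the total number of observations. The 2 × 2 table is hypothetical and the function names are illustrative.

```python
import math


def chi_square(table):
    """Chi-square value and total N for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected frequency if the two attributes were independent
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


def contingency_coefficient(table):
    """Coefficient of mean square contingency, C = sqrt(chi2 / (chi2 + N))."""
    chi2, n = chi_square(table)
    return math.sqrt(chi2 / (chi2 + n))


# Hypothetical contingency table of observed frequencies
observed_table = [[10, 20],
                  [20, 10]]

c = contingency_coefficient(observed_table)
print(round(c, 4))
```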
OTHER MEASURES:
a. Index Numbers - When series are expressed in the same units, we can use
averages for the purpose of comparison, but when the units in which two or
more series are expressed happen to be different, statistical averages cannot
be used to compare them. In such situations we have to rely upon some
relative measurement which consists in reducing the figures to a common
base. One such method is to convert the series into a series of index
numbers.
- But index numbers have their own limitations with which a researcher
must always keep himself aware. For instance, index numbers are only
approximate indicators and as such give only a fair idea of changes but
cannot give an accurate idea. Chances of error also remain at one
point or the other while constructing an index number but this does not
diminish the utility of index numbers for they still can indicate the trend
of the phenomenon being measured. However, to avoid fallacious
conclusions, index numbers prepared for one purpose should not be
used for other purposes or for the same purpose at other places.
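Reducing figures to a common base can be sketched as follows: each value is expressed as a percentage of the value in a chosen base period, so that series measured in different units become comparable. Both series below are hypothetical, and taking the first period as the base (= 100) is an assumption made for illustration.

```python
def to_index(series, base_position=0):
    """Express each value as a percentage of the base-period value."""
    base = series[base_position]
    return [value * 100 / base for value in series]


# Hypothetical series in different units
price_in_pesos = [50, 55, 60, 75]          # price per unit
output_in_tons = [200, 210, 230, 260]      # quantity produced

price_index = to_index(price_in_pesos)
output_index = to_index(output_in_tons)

print(price_index)
print(output_index)
```

Once both series are on the same base, their relative changes can be compared directly, keeping in mind that the indices are only approximate indicators.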
1. Secular trend or long term trend that shows the direction of the series
in a long period of time.
2. Short time oscillations i.e., changes taking place in the short period of
time only and such changes can be the effect of the following factors:
a. Cyclical fluctuations (or C) are the fluctuations as a result of
business cycles and are generally referred to as long term
movements that represent consistently recurring rises and
declines in an activity
b. Seasonal fluctuations (or S) are of short duration occurring in a
regular sequence at specific intervals of time. Such fluctuations
are the result of changing seasons.
c. Irregular fluctuations (or I) also known as Random fluctuations,
are variations which take place in a completely unpredictable
fashion.
For analyzing time series, we usually have two models: (1) multiplicative model; and
(2) additive model. The multiplicative model assumes that the various components
interact in a multiplicative manner to produce the given values of the overall time
series and can be stated as under:
Y = T × C × S × I
where
Y = observed values of time series, T = Trend, C = Cyclical fluctuations,
S = Seasonal fluctuations, I = Irregular fluctuations.
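The multiplicative model can be illustrated with one observation. The component values below are hypothetical: the trend is in the units of the series, while the cyclical, seasonal and irregular components are treated as ratios around 1, which is an assumption of this sketch. Dividing the observed value by the trend leaves the combined effect of the short-period components.

```python
# Hypothetical components for a single period
trend = 200.0        # T, in the units of the series
cyclical = 1.05      # C, 5% above the trend because of the business cycle
seasonal = 0.90      # S, 10% below because of the season
irregular = 1.02     # I, random variation

# Multiplicative model: Y = T x C x S x I
observed = trend * cyclical * seasonal * irregular

# Eliminating the trend leaves C x S x I
detrended = observed / trend

print(round(observed, 2), round(detrended, 4))
```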
The analysis of time series is done to understand the dynamic conditions for
achieving the short-term and long-term goals of business firm(s). The past trends can
be used to evaluate the success or failure of management policy or policies
practiced hitherto. On the basis of past trends, the future patterns can be predicted
and policy or policies may accordingly be formulated. We can as well study properly
the effects of factors causing changes in the short period of time only, once we have
eliminated the effects of trend. Thus, analysis of time series is important in context of
long term as well as short term forecasting and is considered a very powerful tool in
the hands of business analysts and researchers.