
Rizal Technological University

Cities of Mandaluyong and Pasig


COLLEGE OF ARTS AND SCIENCES

WRITTEN REPORT
GROUP 2
____________________________

Processing and Analysis of Data

Practical Research I

Leader: Bederio, Ma. Cristina C.

Members:
Aboguin, Abegail E.
Ayuson, Mary Angeline R.
Cerujano, Niña Carmina D.
Reyes, Micaela V.
Tapia, Darleen A.


Table of Contents

Processing Operations
    Editing
    Coding
    Classification
    Tabulation
Problems in Processing
Elements/Types of Analysis
Statistics in Research
Measures of Central Tendency
Measures of Dispersion
Measures of Asymmetry (Skewness)
Measures of Relationship
Simple Regression Analysis
Multiple Correlation and Regression
Partial Correlation
Association in Case of Attributes
Other Measures


Processing and Analysis of Data

● The data has to be processed and analyzed in accordance with the outline
laid down for the purpose at the time of developing the research plan. It is
essential for scientific study and for ensuring that we have all relevant data for
making contemplated comparisons and analysis. Technically speaking,
processing implies editing, coding, classification and tabulation of collected
data so that they are amenable to analysis. The term analysis refers to the
computation of certain measures along with searching for patterns of
relationship that exist among data-groups.
● Thus, “in the process of analysis, relationships or differences supporting or
conflicting with original or new hypotheses should be subjected to statistical
tests of significance to determine with what validity data can be said to
indicate any conclusions”.1 But there are persons (Selltiz, Jahoda and others)
who do not like to make difference between processing and analysis.
● They opine that analysis of data in a general way involves a number of closely
related operations which are performed with the purpose of summarizing the
collected data and organizing these in such a manner that they answer the
research question(s). We, however, shall prefer to observe the difference
between the two terms as stated here in order to understand their implications
more clearly.

Processing Operations
1. Editing - a process of examining the collected raw data (specially in surveys)
to detect errors and omissions and to correct these when possible. As a
matter of fact, editing involves a careful scrutiny of the completed
questionnaires and/or schedules. Editing is done to assure that the data are
accurate, consistent with other facts gathered, uniformly entered, as
complete as possible and have been well arranged to facilitate coding and
tabulation.
- Field Editing - consists in the review of the reporting forms by the
investigator for completing (translating or rewriting) what the latter has
written in abbreviated and/or in illegible form at the time of recording
the respondents’ responses. This type of editing is necessary in view of
the fact that individual writing styles often can be difficult for others to
decipher.
- Central editing - should take place when all forms or schedules have
been completed and returned to the office. This type of editing implies
that all forms should get a thorough editing by a single editor in a small
study and by a team of editors in case of a large inquiry. Editor(s) may
correct the obvious errors such as an entry in the wrong place, entry
recorded in months when it should have been recorded in weeks, and
the like. In case of inappropriate or missing replies, the editor can
sometimes determine the proper answer by reviewing the other
information in the schedule. At times, the respondent can be contacted
for clarification. The editor must strike out the answer if the same is
inappropriate and he has no basis for determining the correct answer
or the response. In such a case an editing entry of ‘no answer’ is called
for. All the wrong replies, which are quite obvious, must be dropped
from the final results, especially in the context of mail surveys.

● Editors must keep in view the following points while performing their work:


a) They should be familiar with instructions given to the
interviewers and coders as well as with the editing instructions
supplied to them for the purpose.
b) While crossing out an original entry for one reason or another,
they should just draw a single line on it so that the same may
remain legible.
c) They must make entries (if any) on the form in some distinctive
colour and that too in a standardized form.
d) They should initial all answers which they change or supply.
e) Editor’s initials and the date of editing should be placed on each
completed form or schedule.

2. Coding - refers to the process of assigning numerals or other symbols to
answers so that responses can be put into a limited number of categories or
classes. Such classes should be appropriate to the research problem under
consideration. They must also possess the characteristic of exhaustiveness
(i.e., there must be a class for every data item) and also that of mutual
exclusivity, which means that a specific answer can be placed in one and only
one cell in a given category set. Another rule to be observed is that of
unidimensionality which means that every class is defined in terms of only one
concept.
- Coding is necessary for efficient analysis and through it the several
replies may be reduced to a small number of classes which contain the
critical information required for analysis. Coding decisions should
usually be taken at the designing stage of the questionnaire. This
makes it possible to precode the questionnaire choices, which in turn is
helpful for computer tabulation, as one can key punch straight from the
original questionnaires. But in case of hand coding, some standard method
may be used. One such standard method is to
code in the margin with a coloured pencil. The other method can be to
transcribe the data from the questionnaire to a coding sheet. Whatever
method is adopted, one should see that coding errors are altogether
eliminated or reduced to the minimum level.

3. Classification - Most research studies result in a large volume of raw data
which must be reduced into homogeneous groups if we are to get meaningful
relationships. This fact necessitates classification of data which happens to be
the process of arranging data in groups or classes on the basis of common
characteristics. Data having a common characteristic are placed in one class
and in this way the entire data get divided into a number of groups or classes.
Classification can be one of the following two types, depending upon the
nature of the phenomenon involved:
a) Classification according to attributes - data are classified on the basis
of common characteristics which can either be descriptive (such as
literacy, sex, honesty, etc.) or numerical (such as weight, height,
income, etc.). Descriptive characteristics refer to qualitative
phenomenon which cannot be measured quantitatively; only their
presence or absence in an individual item can be noticed. Data
obtained this way on the basis of certain attributes are known as
statistics of attributes and their classification is said to be classification
according to attributes.
b) Classification according to class-intervals - Unlike descriptive
characteristics, the numerical characteristics refer to quantitative
phenomenon which can be measured through some statistical units.
Data relating to income, production, age, weight, etc. come under this
category. Such data are known as statistics of variables and are
classified on the basis of class intervals.

4. Tabulation - When a mass of data has been assembled, it becomes necessary
for the researcher to arrange the same in some kind of concise and logical order.
This procedure is referred to as tabulation. Thus, tabulation is the process of
summarizing raw data and displaying the same in compact form (i.e., in the form of
statistical tables) for further analysis. In a broader sense, tabulation is an orderly
arrangement of data in columns and rows.
Tabulation is essential because of the following reasons:
a) It conserves space and reduces explanatory and descriptive
statements to a minimum.

b) It facilitates the process of comparison.
c) It facilitates the summation of items and the detection of errors and
omissions.
d) It provides a basis for various statistical computations.

● Generally accepted principles of tabulation, particularly of constructing
statistical tables, can be briefly stated as follows:
1. Every table should have a clear, concise and adequate title so as to
make the table intelligible without reference to the text and this title
should always be placed just above the body of the table.
2. Every table should be given a distinct number to facilitate easy
reference.
3. The column headings (captions) and the row headings (stubs) of the
table should be clear and brief.
4. The units of measurement under each heading or sub-heading must
always be indicated.
5. Explanatory footnotes, if any, concerning the table should be placed
directly beneath the table, along with the reference symbols used in the
table.
6. Source or sources from where the data in the table have been obtained
must be indicated just below the table.
7. Usually the columns are separated from one another by lines which
make the table more readable and attractive. Lines are always drawn
at the top and bottom of the table and below the captions.
8. There should be thick lines to separate the data under one class from
the data under another class and the lines separating the sub-divisions
of the classes should be comparatively thin lines.
9. The columns may be numbered to facilitate reference.
10. Those columns whose data are to be compared should be kept side by
side. Similarly, percentages and/or averages must also be kept close to
the data.
11. It is generally considered better to approximate figures before
tabulation as the same would reduce unnecessary details in the table
itself.
12. In order to emphasize the relative significance of certain categories,
different kinds of type, spacing and indentations may be used.
13. It is important that all column figures be properly aligned. Decimal
points and (+) or (–) signs should be in perfect alignment.
14. Abbreviations should be avoided to the extent possible and ditto marks
should not be used in the table.

15. Miscellaneous and exceptional items, if any, should usually be placed
in the last row of the table.
16. Table should be made as logical, clear, accurate and simple as
possible. If the data happen to be very large, they should not be
crowded in a single table for that would make the table unwieldy and
inconvenient.
17. The arrangement of the categories in a table may be chronological,
geographical, alphabetical or according to magnitude to facilitate
comparison. Above all, the table must suit the needs and requirements
of an investigation.

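As an illustration of the principles above, here is a minimal sketch in Python that tabulates hypothetical raw survey responses into a compact statistical table with a clear title, brief column headings and a totals row. The categories and counts are invented for illustration.

```python
# Tabulating hypothetical raw survey responses into a compact frequency table,
# following the principles above: clear title, brief headings, totals row.
from collections import Counter

responses = ["Agree", "Disagree", "Agree", "Neutral", "Agree",
             "Disagree", "Agree", "Neutral", "Agree", "Disagree"]

counts = Counter(responses)
total = sum(counts.values())

print("Table 1. Distribution of Responses (n = %d)" % total)
print(f"{'Response':<12}{'Frequency':>10}{'Percent':>10}")
for category, freq in counts.most_common():
    print(f"{category:<12}{freq:>10}{freq / total:>10.1%}")
print(f"{'Total':<12}{total:>10}{1:>10.1%}")
```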

SOME PROBLEMS IN PROCESSING

We can take up the following two problems of processing the data for analytical
purposes:

a) The problem concerning “Don’t know” (or DK) responses: While processing
the data, the researcher often comes across some responses that are difficult
to handle. One category of such responses may be ‘Don’t Know Response’ or
simply DK response.

➢ When the DK response group is small, it is of little significance. But
when it is relatively big, it becomes a matter of major concern, in which
case the question arises: Is the question which elicited DK response
useless? The answer depends on two points, viz., whether the respondent
actually does not know the answer, or whether the researcher failed in
obtaining the appropriate information.

How are DK responses to be dealt with by researchers?

● The best way is to design better type of questions.
● Good rapport of interviewers with respondents will result in minimizing DK
responses.

But what about the DK responses that have already taken place?
● One way to tackle this issue is to estimate the allocation of DK answers from
other data in the questionnaire.
● The other way is to keep DK responses as a separate category in tabulation
where we can consider it as a separate reply category if DK responses
happen to be legitimate, otherwise we should let the reader make his own
decision.
● Another way is to assume that DK responses occur more or less randomly
and as such we may distribute them among the other answers in the ratio in
which the latter have occurred.
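The last approach (distributing DK responses among the other answers in the ratio in which the latter occurred) can be sketched in a few lines of Python; the answer counts here are hypothetical:

```python
# Distributing "Don't know" replies among the substantive answers in
# proportion to how often those answers occurred (hypothetical counts).
answers = {"Yes": 60, "No": 30}   # substantive replies
dk = 9                            # DK replies to redistribute

total = sum(answers.values())
adjusted = {k: v + dk * v / total for k, v in answers.items()}

print(adjusted)   # Yes receives 6 of the 9 DKs, No receives 3 (same 2:1 ratio)
```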

b) Use of percentages: Percentages are often used in data presentation for
they simplify numbers, reducing all of them to a 0 to 100 range. Through the
use of percentages, the data are reduced in the standard form with base
equal to 100 which fact facilitates relative comparisons.

While using percentages, the following rules should be kept in view by
researchers:

● Two or more percentages must not be averaged unless each is weighted
by the group size from which it has been derived.
● Use of too large percentages should be avoided.
● Percentages hide the base from which they have been computed.
● Percentage decreases can never exceed 100 percent and as such for
calculating the percentage of decrease, the higher figure should
invariably be taken as the base.
● Percentages should generally be worked out in the direction of the
causal-factor in case of two-dimension tables and for this purpose we
must select the more significant factor out of the two given factors as
the causal factor.
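The first rule above can be illustrated with a short sketch; the group sizes and percentages are hypothetical:

```python
# Averaging percentages: each percentage must be weighted by the size of
# the group it came from (hypothetical group figures).
groups = [
    {"n": 200, "pct_yes": 50.0},   # 100 "yes" answers out of 200
    {"n": 50,  "pct_yes": 90.0},   # 45 "yes" answers out of 50
]

# Naive average ignores the group sizes and is misleading:
naive = sum(g["pct_yes"] for g in groups) / len(groups)          # 70.0

# Weighted average recovers the true overall percentage (145/250):
weighted = (sum(g["n"] * g["pct_yes"] for g in groups)
            / sum(g["n"] for g in groups))                        # 58.0

print(naive, weighted)
```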

ELEMENTS/TYPES OF ANALYSIS

By "analysis," we mean the computation of certain indices or measures along with
searching for patterns of relationships that exist among the data groups. Analysis,
particularly in the case of survey or experimental data, involves estimating the values
of unknown parameters in the population and testing hypotheses for drawing
inferences. Analysis can be divided into two types: descriptive analysis and
inferential analysis (also known as statistical analysis).

1) Descriptive Analysis is largely the study of distributions of one variable. This
study provides us with profiles of companies, work groups, persons and other
subjects on any of a multiple of characteristics such as size, composition,
efficiency, preferences, etc. This sort of analysis may be in respect of one
variable (described as unidimensional analysis), or in respect of two variables
(described as bivariate analysis) or in respect of more than two variables
(described as multivariate analysis).

2) Inferential analysis is concerned with the various tests of significance for
testing hypotheses in order to determine with what validity data can be said to
indicate some conclusion or conclusions. It is also concerned with the
estimation of population values. It is mainly on the basis of inferential analysis
that the task of interpretation is performed.

3) Correlation Analysis studies the joint variation of two or more variables for
determining the amount of correlation between two or more variables.


4) Causal analysis is concerned with the study of how one or more variables
affect changes in another variable. It is thus a study of functional relationships
existing between two or more variables. This analysis can be termed as
regression analysis.

Causal analysis is considered relatively more important in experimental
researches, whereas in most social and business researches our interest lies
in understanding and controlling relationships between variables rather than
in determining causes per se, and as such we consider correlation analysis as
relatively more important.

In modern times, with the availability of computer facilities, there has been a rapid
development of multivariate analysis which may be defined as “all statistical methods
which simultaneously analyze more than two variables on a sample of observations”.
Usually the following analyses are involved when we make a reference to
multivariate analysis:

a) Multiple regression analysis: This analysis is adopted when the researcher
has one dependent variable which is presumed to be a function of two or
more independent variables.
b) Multiple discriminant analysis: This analysis is appropriate when the
researcher has a single dependent variable that cannot be measured, but can
be classified into two or more groups on the basis of some attribute.
c) Multivariate analysis of variance (or multi-ANOVA): This analysis is an
extension of two way ANOVA, wherein the ratio of among group variance to
within group variance is worked out on a set of variables.
d) Canonical analysis: This analysis can be used in case of both measurable
and non-measurable variables for the purpose of simultaneously predicting a
set of dependent variables from their joint covariance with a set of
independent variables.
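A minimal sketch of multiple regression analysis (item a), assuming hypothetical data in which one dependent variable happens to be an exact linear function of two independent variables; the coefficients are estimated by solving the normal equations with plain Python:

```python
# Multiple regression sketch: fit y = b0 + b1*x1 + b2*x2 by least squares,
# solving the normal equations (X'X)b = X'y with Gaussian elimination.
def solve(A, b):
    # Gaussian elimination with partial pivoting for a small square system
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# Design matrix: intercept column of 1s plus two independent variables
X = [[1, 1, 2], [1, 2, 1], [1, 3, 4], [1, 4, 3], [1, 5, 6]]
y = [6, 6, 14, 14, 22]          # hypothetical: y = 2*x1 + 2*x2 exactly

XtX = [[sum(row[i] * row[j] for row in X) for j in range(3)] for i in range(3)]
Xty = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(3)]
coef = solve(XtX, Xty)
print([round(c, 6) for c in coef])   # [intercept, b1, b2]
```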

STATISTICS IN RESEARCH

The role of statistics in research is to function as a tool in designing research,
analyzing its data and drawing conclusions therefrom. Most research studies result
in a large volume of raw data which must be suitably reduced so that the same can
be read easily and can be used for further analysis.

Two major areas of statistics:
a) Descriptive statistics concern the development of certain indices from the
raw data.
b) Inferential statistics are concerned with the process of generalization, are
also known as "sampling statistics," and are mainly concerned with two types
of problems: (i) the estimation of population parameters and (ii) the testing of
statistical hypotheses.

The important statistical measures that are used to summarize the survey/research
data are:

1) measures of central tendency or statistical averages (The three most
important ones are the arithmetic average or mean, median and mode.
Geometric mean and harmonic mean are also sometimes used.)
2) measures of dispersion (From among the measures of dispersion, variance
and its square root, the standard deviation, are the most often used
measures.)
3) measures of asymmetry (We mostly use the first measure of skewness based
on mean and mode or on mean and median.)
4) measures of relationship (Karl Pearson’s coefficient of correlation, Yule’s
coefficient of association, multiple correlation coefficient, partial correlation
coefficient, regression analysis, etc.)
5) other measures (index number, time series analysis, coefficient of
contingency, etc.)


MEASURES OF CENTRAL TENDENCY

● Measures of central tendency (or statistical averages) tell us the point about
which items have a tendency to cluster. Mean, median and mode are the
most popular averages.

● Mean, also known as the arithmetic average, is the simplest measure of
central tendency and a widely used one. Its chief use consists in summarizing
the essential features of a series and in enabling data to be compared.
❖ The mean (average) of a data set is found by adding all numbers in the
data set and then dividing by the number of values in the set.

For example: Compute the mean of the given numbers: 100, 115, 125, 150, 145

(100 + 115 + 125 + 145 + 150) / 5 = 635 / 5 = 127 is the mean

● Median is the value of the middle item of a series when it is arranged in
ascending or descending order of magnitude.
❖ The median is the middle value when a data set is ordered from least
to greatest.

For example: Find the median of the given numbers: 25, 20, 35, 40, 30
20, 25, 30, 35, 40
30 is the median

● Mode is the most commonly or frequently occurring value in a series. The
mode in a distribution is that item around which there is maximum
concentration.
❖ The mode is the value that occurs the most often in a data set.

For example: Find the mode of the given numbers: 45, 50, 55, 60, 60, 65, 70
45, 50, 55, 60, 60, 65, 70
60 is the mode
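The three worked examples above can be checked with Python's statistics module:

```python
# Verifying the mean, median and mode examples from this section.
import statistics

mean = statistics.mean([100, 115, 125, 150, 145])    # (sum 635) / 5
median = statistics.median([25, 20, 35, 40, 30])     # sorts internally first
mode = statistics.mode([45, 50, 55, 60, 60, 65, 70]) # most frequent value

print(mean, median, mode)   # 127 30 60
```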


MEASURES OF DISPERSION

● An average can represent a series only as best as a single figure can, but it
certainly cannot reveal the entire story of any phenomenon under study.
❖ It describes the spread of the data. Range and Mean Deviation are
included here.

(A) Range is the simplest possible measure of dispersion and is defined as the
difference between the values of the extreme items of a series.
Formula: Highest Value - Lowest Value

For example: Find the range of the given numbers: 108, 112, 125, 118 and 113.

125 - 108 = 17 is the range

(B) Mean Deviation is the average of the differences of the values of items from some
average of the series. Such a difference is technically described as deviation.
❖ Mean deviation is a statistical measure that computes the average
deviation from the mean value of a given data collection.

For example: Find the mean deviation of the given numbers: 108, 112, 125, 118 and
113.

Mean: (108 + 112 + 125 + 118 + 113) / 5 = 576 / 5 = 115.2

Mean deviation: (|125 - 115.2| + |118 - 115.2| + |113 - 115.2| + |112 - 115.2| +
|108 - 115.2|) / 5 = 25.2 / 5 = 5.04
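Both worked examples in this section can be reproduced in a few lines of Python:

```python
# Range and mean deviation for the data set used in the examples above.
data = [108, 112, 125, 118, 113]

rng = max(data) - min(data)                                # 125 - 108 = 17
mean = sum(data) / len(data)                               # 576 / 5 = 115.2
mean_dev = sum(abs(x - mean) for x in data) / len(data)    # 25.2 / 5 ≈ 5.04

print(rng, mean, mean_dev)
```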


MEASURES OF ASYMMETRY (SKEWNESS)

Normal distribution: a curve showing no skewness, in which case we have X = M = Z.

SKEWNESS
● Measures the imbalance or asymmetry of a data distribution about its mean.
● Values are concentrated on one side of the data distribution.

POSITIVE/RIGHT SKEWED
● Tail on the right side is longer
● Outliers are on the right side

NEGATIVE/LEFT SKEWED
● Tail on the left side is longer
● Outliers are on the left side

Such a curve is technically described as a normal curve and the relating distribution
as normal distribution. Such a curve is a perfectly bell-shaped curve in which case
the value of X or M or Z is just the same and skewness is altogether absent. But if
the curve is distorted (whether on the right side or on the left side), we have an
asymmetrical distribution which indicates that there is skewness. If the curve is
distorted on the right side, we have positive skewness but when the curve is
distorted towards the left, we have negative skewness.

In case of positive skewness, we have Z < M < X, and in case of negative skewness,
we have X < M < Z (where X denotes the mean, M the median and Z the mode).
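The ordering Z < M < X for positive skewness can be verified on a small hypothetical data set with a long right tail:

```python
# A hypothetical positively skewed series: a few large values stretch the
# right tail, pulling the mean above the median, which sits above the mode.
import statistics

data = [2, 3, 3, 3, 4, 5, 9, 15]

Z = statistics.mode(data)     # mode   = 3
M = statistics.median(data)   # median = 3.5
X = statistics.mean(data)     # mean   = 5.5

print(Z, M, X)
assert Z < M < X              # the positive-skewness ordering Z < M < X
```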

Kurtosis is the measure of flat-toppedness of a curve. A bell-shaped curve or the
normal curve is Mesokurtic because it is kurtic in the centre; but if the curve is
relatively more peaked than the normal curve, it is called Leptokurtic, whereas if a
curve is flatter than the normal curve, it is called Platykurtic. In brief, Kurtosis is
the humpedness of the curve and points to the nature of distribution of items in the
middle of a series. It may be pointed out here that knowing the shape of the
distribution curve is crucial to the use of statistical methods in research analysis
since most methods make specific assumptions about the nature of the distribution
curve.

MEASURES OF RELATIONSHIP


So far we have dealt with those statistical measures that we use in context of
univariate population i.e., the population consisting of measurement of only one
variable. But if we have the data on two variables, we are said to have a bivariate
population and if the data happen to be on more than two variables, the population is
known as multivariate population. If for every measurement of a variable, X, we have
corresponding value of a second variable, Y, the resulting pairs of values are called a
bivariate population. In addition, we may also have a corresponding value of the third
variable, Z, or the fourth variable, W, and so on; the resulting sets of values are called
a multivariate population. In case of bivariate or multivariate populations, we often
wish to know the relation of the two and/or more variables in the data to one another.
We may like to know, for example, whether the number of hours students devote for
studies is somehow related to their family income, to age, to sex, or to other similar
factors.

In case of bivariate population: Correlation can be studied through (a) cross
tabulation; (b) Charles Spearman’s coefficient of correlation; (c) Karl Pearson’s
coefficient of correlation; whereas cause and effect relationship can be studied
through simple regression equations. In case of multivariate population: Correlation
can be studied through (a) coefficient of multiple correlation; (b) coefficient of
partial correlation; whereas cause and effect relationship can be studied through
multiple regression equations.

The cross tabulation approach is especially useful when the data are in nominal form.
Under it we classify each variable into two or more categories and then cross-classify
the variables in these subcategories. Then we look for interactions between them
which may be symmetrical, reciprocal or asymmetrical. A symmetrical
relationship is one in which the two variables vary together, but we assume that
neither variable is due to the other. A reciprocal relationship exists when the two
variables mutually influence or reinforce each other. Asymmetrical relationship is
said to exist if one variable (the independent variable) is responsible for another
variable (the dependent variable).

The cross classification procedure begins with a two-way table which indicates
whether there is or there is not an interrelationship between the variables. This sort
of analysis can be further elaborated in which case a third factor is introduced into
the association through cross-classifying the three variables. By doing so we find
conditional relationship in which factor X appears to affect factor Y only when factor
Z is held constant. The correlation, if any, found through this approach is not
considered a very powerful form of statistical correlation, and accordingly we use
some other methods when the data happen to be ordinal, interval or ratio data.

Charles Spearman’s coefficient of correlation (or rank correlation) is the
technique of determining the degree of correlation between two variables in case of
ordinal data where ranks are given to the different values of the variables. The main
objective of this coefficient is to determine the extent to which the two sets of ranking
are similar or dissimilar. This coefficient is determined as under:

ρ = 1 - (6 Σdi²) / (n(n² - 1))

where:
di = difference between ranks of the ith pair of the two variables;
n = number of pairs of observations.

For its calculation and significance testing, Karl Pearson’s coefficient (discussed
below) requires the following data assumptions to hold true:

● Interval or ratio level of measurement
● Linearly related variables
● Bivariate normal distribution

If your data do not meet the above assumptions, you would need Spearman’s
coefficient instead. It is necessary to know what a monotonic function is in order to
understand the Spearman correlation coefficient. A monotonic function is one that
either never decreases or never increases as its independent variable increases.

Three concepts in monotonic functions:

1. Monotonically increasing: When the ‘x’ variable increases and the ‘y’
variable never decreases.
2. Monotonically decreasing: When the ‘x’ variable increases but the ‘y’
variable never increases
3. Not monotonic: When the ‘x’ variable increases and the ‘y’ variable
sometimes increases and sometimes decreases.

The Spearman coefficient, ρ, can take a value between +1 and -1, where:

● A ρ value of +1 means a perfect association of ranks
● A ρ value of 0 means no association of ranks
● A ρ value of -1 means a perfect negative association between ranks

The closer the ρ value is to 0, the weaker the association between the two ranks.

We must be able to rank the data before proceeding with Spearman’s rank
coefficient of correlation. It is important to observe whether, on increasing one
variable, the other variable follows a monotonic relation.

EXAMPLE:


Suppose the squared rank differences of the pairs add up to 12 (Σd² = 12) for n = 9
pairs of observations. Insert these values in the formula:

ρ = 1 - (6 × 12) / (9(81 - 1))
  = 1 - 72/720
  = 1 - 0.1
  = 0.9

The Spearman’s rank correlation for this data is 0.9, and as mentioned above, a ρ
value nearing +1 indicates a perfect association of rank.
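The same calculation can be done in Python, assuming a hypothetical set of nine rank differences chosen so that Σd² = 12:

```python
# Spearman's rank correlation from rank differences (hypothetical d values
# chosen to reproduce the example's sum of squared differences, 12).
d = [1, 1, -1, 2, 0, -1, 0, 2, 0]           # rank differences, Σd² = 12
n = len(d)                                   # 9 pairs of observations

sum_d_sq = sum(x * x for x in d)
rho = 1 - (6 * sum_d_sq) / (n * (n * n - 1))

print(rho)   # 0.9
```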

Karl Pearson’s coefficient of correlation (or simple correlation) is the most widely
used method of measuring the degree of relationship between two variables. This
coefficient assumes the following:

● that there is a linear relationship between the two variables;
● that the two variables are causally related, which means that one of the
variables is independent and the other one is dependent; and

● that a large number of independent causes are operating in both variables
so as to produce a normal distribution.

Karl Pearson’s coefficient of correlation is defined as a linear correlation coefficient
that falls in the value range of -1 to +1. A value of -1 signifies strong negative
correlation while +1 indicates strong positive correlation.

The Pearson correlation coefficient, also referred to as Pearson’s r or the bivariate
correlation, is a statistic that measures the linear correlation between two variables X
and Y. It has a value between +1 and −1. A value of +1 is a total positive linear
correlation, 0 is no linear correlation, and −1 is a total negative linear correlation.

The Pearson correlation can evaluate ONLY a linear relationship between two
continuous variables (A relationship is linear only when a change in one variable is
associated with a proportional change in the other variable)

The Pearson correlation coefficient (r) varies with the strength and the direction of
the relationship between the two variables. Note that when no linear relationship can
be established, the Pearson coefficient yields a value of zero.


EXAMPLE:

1. When the correlation coefficient is +1, every increase in one variable is matched by a proportional increase in the other. For example, shoe size changes with the length of the foot, an (almost) perfect positive correlation.

2. When the correlation coefficient is −1, every increase in one variable is matched by a proportional decrease in the other. For example, the decrease in the quantity of gas in a gas tank shows an (almost) perfect inverse correlation with speed.

3. When the correlation coefficient is 0, an increase in one variable is accompanied by neither a systematic increase nor decrease in the other; the two variables are not related.

Comparison of Pearson and Spearman coefficients

1. The fundamental difference between the two correlation coefficients is that the
Pearson coefficient works with a linear relationship between the two variables
whereas the Spearman Coefficient works with monotonic relationships as
well.
2. One more difference is that Pearson works with raw data values of the
variables whereas Spearman works with rank-ordered variables.
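Both differences can be demonstrated with a short Python sketch on hypothetical data: for a monotonic but non-linear relationship such as y = x³, Spearman’s coefficient (computed on ranks) is essentially 1, while Pearson’s r (computed on the raw values) falls short of 1.

```python
import math

def pearson_r(x, y):
    """Pearson's r on raw values: covariance / (sd_x * sd_y)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of the values."""
    rank = lambda v: [sorted(v).index(e) + 1 for e in v]  # no ties assumed
    return pearson_r(rank(x), rank(y))

x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]          # monotonic, but not linear

print(spearman_rho(x, y))        # ≈ 1.0 (the ranks agree perfectly)
print(pearson_r(x, y) < 1.0)     # True (the linear fit is imperfect)
```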


SIMPLE REGRESSION ANALYSIS

● Regression is the determination of a statistical relationship between two or more variables.
● In simple regression, we have only two variables: one variable (defined as independent) is the cause of the behavior of the other (defined as the dependent variable). Regression can only interpret what exists physically, i.e., there must be a physical way in which the independent variable X can affect the dependent variable Y. The basic relationship between X and Y is given by

Ŷ = a + bX

where the symbol Ŷ denotes the estimated value of Y for a given value of X. This equation is known as the regression equation of Y on X (it also represents the regression line of Y on X when drawn on a graph), which means that each unit change in X produces a change of b in Y, which is positive for direct and negative for inverse relationships.

● The generally used method to find the ‘best’ fit that a straight line of this kind can give is the least-squares method. To use it efficiently, we first determine

b = Σ(Xᵢ − X̄)(Yᵢ − Ȳ) / Σ(Xᵢ − X̄)²

Then

a = Ȳ − bX̄

These measures define a and b, which will give the best possible fit through the original X and Y points, and the value of r can then be worked out as:

r = b × (σX / σY)

(where σX and σY are the standard deviations of X and Y).


● Thus, regression analysis is a statistical method to deal with the formulation of mathematical models depicting relationships amongst variables, which can be used for the purpose of predicting the values of the dependent variable, given the values of the independent variable.
● Alternatively, for fitting a regression equation of the type Y = a + bX to the given values of the X and Y variables, we can find the values of the two constants, viz., a and b, by using the following two normal equations:

ΣY = na + bΣX
ΣXY = aΣX + bΣX²

and then solving these equations for a and b. Once these values are obtained and have been put in the equation Y = a + bX, we say that we have fitted the regression equation of Y on X to the given data. In a similar fashion, we can develop the regression equation of X on Y, viz., X = a + bY, presuming Y as the independent variable and X as the dependent variable.

Example:
Here, we will be citing a scenario that serves as an example of the implementation of
simple regression analysis.

Let us assume the average speed when 2 highway patrols are deployed is 75 mph,
or 35 mph when 10 highway patrols are deployed. The question thus is what is the
average speed of cars on the freeway when 5 highway patrols are deployed?

Using our simple regression analysis formula, we can thus compute the
values and derive the following equation:

Y = 85 + (−5)X, where Y is the average speed of cars on the highway,

a = 85, the average speed when X = 0,

b = −5, the impact on Y of each extra patrol car deployed, and

X = the number of patrols deployed.

Therefore, the average speed of cars on the highway when there are zero highway
patrols operating (X=0) will be 85 mph. For every extra highway patrol car working,
the average speed will reduce by 5 mph.
Hence, for 5 patrol cars (X = 5), we have Y = 85 + (-5) (5) = 85 – 25 = 60 mph.
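The fit above can be checked with a least-squares sketch in Python. With only the two data points given in the scenario, the fitted line passes through both exactly, reproducing a = 85 and b = −5 (variable names are illustrative):

```python
def fit_line(xs, ys):
    """Least-squares fit of the regression equation Y = a + bX."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# (patrols deployed, average speed in mph) from the scenario
patrols = [2, 10]
speeds = [75, 35]

a, b = fit_line(patrols, speeds)
print(a, b)        # 85.0 -5.0
print(a + b * 5)   # 60.0 — predicted average speed with 5 patrols deployed
```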


MULTIPLE CORRELATION AND REGRESSION

● The relationship analysis is used when there are two or more independent variables.
● The equation describing such a relationship is known as the multiple regression equation. We here explain multiple correlation and regression taking only two independent variables and one dependent variable (convenient computer programs exist for dealing with a great number of variables). In this situation the results are interpreted as shown below:

The multiple regression equation assumes the form

Ŷ = a + b₁X₁ + b₂X₂

where X₁ and X₂ are the two independent variables and Y is the dependent variable, and the constants a, b₁ and b₂ can be found by solving the following three normal equations:

ΣY = na + b₁ΣX₁ + b₂ΣX₂
ΣX₁Y = aΣX₁ + b₁ΣX₁² + b₂ΣX₁X₂
ΣX₂Y = aΣX₂ + b₁ΣX₁X₂ + b₂ΣX₂²

● With more than one independent variable, we may distinguish between the collective effect of the two independent variables and the individual effect of each of them taken separately. The collective effect is given by the coefficient of multiple correlation, R.
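The procedure can be sketched in Python. The data below are hypothetical, constructed so that Y = 2 + 3X₁ + X₂ exactly; the three normal equations are assembled from the sums and solved by Gaussian elimination, recovering a = 2, b₁ = 3, b₂ = 1.

```python
def solve(A, v):
    """Solve a small linear system A x = v by Gaussian elimination."""
    n = len(v)
    M = [row[:] + [rhs] for row, rhs in zip(A, v)]  # augmented matrix
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))  # partial pivoting
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))) / M[i][i]
    return x

# Hypothetical data generated from Y = 2 + 3*X1 + 1*X2
x1 = [1, 2, 3, 4]
x2 = [2, 1, 4, 3]
y = [2 + 3 * a + b for a, b in zip(x1, x2)]
n = len(y)

# Sums needed by the three normal equations
s_x1, s_x2 = sum(x1), sum(x2)
s_x1x1 = sum(a * a for a in x1)
s_x2x2 = sum(b * b for b in x2)
s_x1x2 = sum(a * b for a, b in zip(x1, x2))

A = [[n,    s_x1,   s_x2],
     [s_x1, s_x1x1, s_x1x2],
     [s_x2, s_x1x2, s_x2x2]]
v = [sum(y),
     sum(a * c for a, c in zip(x1, y)),
     sum(b * c for b, c in zip(x2, y))]

a_, b1, b2 = solve(A, v)
print(round(a_, 6), round(b1, 6), round(b2, 6))  # 2.0 3.0 1.0
```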


PARTIAL CORRELATION

● Partial correlation measures separately the relationship between two variables in such a way that the effects of other related variables are eliminated.
● In partial correlation analysis, we aim at measuring the relation between a dependent variable and a particular independent variable by holding all other variables constant. Thus, each partial coefficient of correlation measures the effect of its independent variable on the dependent variable.

Example:

Using partial correlation, you can measure whether there is a linear relationship between 10,000 m running performance and VO2max (a marker of aerobic fitness that refers to the maximum amount of oxygen a person can utilize during exercise), whilst controlling for wind speed and relative humidity.

Data given:

❖ 10,000 m running performance – continuous dependent variable (measured in minutes and seconds)
❖ VO2max – continuous independent variable (measured in ml/min/kg)
❖ Wind speed (measured in mph) and relative humidity (expressed as a percentage) – two control variables

Formula:

We can work out the partial correlation coefficients thus:

r₁₂.₃ = (r₁₂ − r₁₃ r₂₃) / √[(1 − r₁₃²)(1 − r₂₃²)]

and

r₁₃.₂ = (r₁₃ − r₁₂ r₂₃) / √[(1 − r₁₂²)(1 − r₂₃²)]

The partial correlation coefficients are called first-order coefficients when one variable is held constant, as shown above; they are known as second-order coefficients when two variables are held constant, and so on.

Assumptions:

There are at least 5 assumptions to consider when using the partial correlation to
analyze your chosen data.

❖ Assumption #1: You have one (dependent) variable and one (independent)
variable and these are both measured on a continuous scale (i.e., they are
measured on an interval or ratio scale).
❖ Assumption #2: You have one or more control variables, also known as
covariates.
❖ Assumption #3: There needs to be a linear relationship between all three
variables.
❖ Assumption #4: There should be no significant outliers.
❖ Assumption #5: Your variables should be approximately normally distributed.
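A first-order partial correlation can be sketched in Python from three zero-order Pearson coefficients. The data below are hypothetical, standing in for performance (the dependent variable), VO2max (the independent variable), and a single control variable (wind speed):

```python
import math

def pearson_r(x, y):
    """Zero-order Pearson correlation between two value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

def partial_r(x, y, z):
    """First-order partial correlation of x and y, holding z constant."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical measurements for five runners
perf = [42.1, 40.3, 38.9, 37.2, 36.0]   # 10,000 m time in minutes
vo2 = [48.0, 52.0, 55.0, 58.0, 61.0]    # VO2max in ml/min/kg
wind = [5.0, 3.0, 6.0, 4.0, 5.0]        # wind speed in mph

r = partial_r(perf, vo2, wind)
print(-1.0 <= r <= 1.0)  # True: a valid correlation coefficient
print(r < 0)             # True: faster times go with higher VO2max
```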


ASSOCIATION IN CASE OF ATTRIBUTES

● When data are collected on the basis of some attribute or attributes, we have statistics commonly termed statistics of attributes. It is not necessary that the objects possess only one attribute; rather, it will often be found that the objects possess more than one attribute. In such a situation our interest may lie in knowing whether the attributes are associated with each other or not.

Types of Association:

1. Positive association - If the class frequency of AB, symbolically written as (AB), is greater than the expectation of A and B occurring together if they were independent, then we say the two attributes are positively associated;

Ex: Health and Hygiene

2. Negative association or disassociation - If the class frequency of AB is less than this expectation, the two attributes are said to be negatively associated;

Ex: Vaccination and Occurrence of disease

3. No association or independence - In case the class frequency of AB is equal to the expectation, the two attributes are considered independent, i.e., are said to have no association.

Ex: Honesty and Boldness


These three types of association can be put symbolically as follows, where N is the total number of items:

(AB) > (A)(B)/N : positive association
(AB) < (A)(B)/N : negative association
(AB) = (A)(B)/N : independence (no association)

In order to find out the degree or intensity of association between two or more sets of attributes, we should work out the coefficient of association.

Professor Yule’s coefficient of association:

Q(AB) = [(AB)(αβ) − (Aβ)(αB)] / [(AB)(αβ) + (Aβ)(αB)]

● The value of this coefficient will be somewhere between +1 and −1.

❖ If the attributes are completely associated with each other, coefficient = +1
❖ If the attributes are completely disassociated with each other, coefficient = −1
❖ If the attributes are completely independent of each other, coefficient = 0
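A sketch of Yule’s coefficient in Python, with hypothetical 2 × 2 class frequencies: (AB) and (αβ) count the cases where the two attributes agree (both present or both absent), while (Aβ) and (αB) count the cases where they disagree.

```python
def yule_q(ab, a_not_b, not_a_b, not_ab):
    """Yule's Q = [(AB)(ab) - (Ab)(aB)] / [(AB)(ab) + (Ab)(aB)]."""
    agree = ab * not_ab          # (AB) * (alpha-beta)
    disagree = a_not_b * not_a_b  # (A-beta) * (alpha-B)
    return (agree - disagree) / (agree + disagree)

# Hypothetical frequencies: 50 objects with both attributes, 10 with
# only the first, 10 with only the second, 30 with neither.
q = yule_q(ab=50, a_not_b=10, not_a_b=10, not_ab=30)
print(q)  # 0.875 — a fairly strong positive association
```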


● Sometimes the association between two attributes, A and B, may be regarded as unwarranted when we find that the observed association between A and B is due to the association of both A and B with another attribute C.

Partial Association

● This sort of association between A and B in the population of C is described as partial association, as distinguished from the total association between A and B in the overall universe.

Illusory Association

● It is an association between two attributes that does not correspond to any real relationship. Such association may also be the result of the fact that the attributes A and B might not have been properly defined or might not have been correctly recorded.
● The researcher must remain alert and must not conclude an association between A and B when in fact there is no such association in reality.

Chi-square test

● By using a chi-square test to find the value of chi-square (χ²), and referring that value to the chi-square distribution, we can judge the significance of the association between two attributes.


Contingency Table

● We can have a manifold classification of the two attributes, in which case each of the two attributes is first observed and then each one is classified into two or more subclasses, resulting in what is called a contingency table.

● Association can be studied in a contingency table through Yule’s coefficient of association as stated above, but for this purpose we have to reduce the contingency table to a 2 × 2 table by combining some classes.


● But the practice of combining classes is not considered very correct, and at times it is inconvenient as well. Karl Pearson has therefore suggested a measure known as the coefficient of mean square contingency for studying association in contingency tables.

Coefficient of mean square contingency

● The formula for the coefficient of mean square contingency that Karl Pearson has suggested is:

C = √[χ² / (χ² + N)]

where N is the total number of items.

● This is considered a satisfactory measure for studying association in contingency tables.
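Both steps can be sketched together in Python with a hypothetical 2 × 2 contingency table: χ² is computed from observed and expected frequencies, and the coefficient of mean square contingency C = √[χ² / (χ² + N)] follows from it.

```python
import math

def chi_square(table):
    """Chi-square statistic and total N for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n

# Hypothetical 2 x 2 table of class frequencies
table = [[20, 30],
         [30, 20]]

chi2, n = chi_square(table)
c = math.sqrt(chi2 / (chi2 + n))  # coefficient of mean square contingency
print(chi2)         # 4.0
print(round(c, 4))  # 0.1961
```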


OTHER MEASURES:

a. Index Numbers - When series are expressed in the same units, we can use
averages for the purpose of comparison, but when the units in which two or
more series are expressed happen to be different, statistical averages cannot
be used to compare them. In such situations we have to rely upon some
relative measurement which consists in reducing the figures to a common
base. One such method is to convert the series into a series of index numbers.
- But index numbers have their own limitations with which a researcher
must always keep himself aware. For instance, index numbers are only
approximate indicators and as such give only a fair idea of changes but
cannot give an accurate idea. Chances of error also remain at one
point or the other while constructing an index number but this does not
diminish the utility of index numbers for they still can indicate the trend
of the phenomenon being measured. However, to avoid fallacious
conclusions, index numbers prepared for one purpose should not be
used for other purposes or for the same purpose at other places.
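A minimal sketch in Python, using hypothetical prices, of reducing a series to a common base — here a simple aggregative price index, the current-period total over the base-period total, times 100:

```python
# Hypothetical prices of three commodities in the base and current periods
base_prices = [10, 20, 40]
current_prices = [12, 22, 50]

# Simple aggregative index: (sum of current prices / sum of base prices) * 100
index_number = 100 * sum(current_prices) / sum(base_prices)
print(index_number)  # 120.0 — prices rose about 20% over the base period
```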

b. Time series analysis: In the context of economic and business research, we may quite often obtain data relating to some time period concerning a given phenomenon. Such data are labeled ‘Time Series’. Such series are usually the result of the effects of one or more of the following factors:

1. Secular trend or long term trend that shows the direction of the series
in a long period of time.

2. Short time oscillations i.e., changes taking place in the short period of
time only and such changes can be the effect of the following factors:
a. Cyclical fluctuations (or C) are the fluctuations as a result of
business cycles and are generally referred to as long term
movements that represent consistently recurring rises and
declines in an activity
b. Seasonal fluctuations (or S) are of short duration occurring in a
regular sequence at specific intervals of time. Such fluctuations
are the result of changing seasons.
c. Irregular fluctuations (or I) also known as Random fluctuations,
are variations which take place in a completely unpredictable
fashion.


For analyzing time series, we usually have two models; (1) multiplicative model; and
(2) additive model. Multiplicative model assumes that the various components
interact in a multiplicative manner to produce the given values of the overall time
series and can be stated as under:

Y = T × C × S × I
where
Y = observed values of time series, T = Trend, C = Cyclical fluctuations,
S = Seasonal fluctuations, I = Irregular fluctuations.
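A toy sketch of the multiplicative model in Python: an observed series is generated as Y = T × C × S × I from hypothetical components, and dividing out the seasonal factors recovers the deseasonalized series T × C × I.

```python
# Hypothetical components for four quarters
trend = [100.0, 102.0, 104.0, 106.0]   # T
cyclical = [1.0, 1.0, 1.0, 1.0]        # C (taken as flat here)
seasonal = [1.2, 0.8, 1.2, 0.8]        # S
irregular = [1.0, 1.05, 0.95, 1.0]     # I

# Multiplicative model: Y = T x C x S x I
observed = [t * c * s * i
            for t, c, s, i in zip(trend, cyclical, seasonal, irregular)]

# Deseasonalizing: dividing Y by S leaves T x C x I
deseasonalized = [y / s for y, s in zip(observed, seasonal)]
print([round(v, 2) for v in deseasonalized])  # [100.0, 107.1, 98.8, 106.0]
```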

The analysis of time series is done to understand the dynamic conditions for achieving the short-term and long-term goals of business firm(s). The past trends can
be used to evaluate the success or failure of management policy or policies
practiced hitherto. On the basis of past trends, the future patterns can be predicted
and policy or policies may accordingly be formulated. We can as well study properly
the effects of factors causing changes in the short period of time only, once we have
eliminated the effects of trend. Thus, analysis of time series is important in context of
long term as well as short term forecasting and is considered a very powerful tool in
the hands of business analysts and researchers.
