Chapter Six Data Processing, Analysis and Interpretation
After collection, the data have to be processed and analyzed in accordance with the outline laid
down for the purpose at the time of developing the research proposal.
1) Editing: Editing is the process of examining the collected raw data to detect errors and
omissions and to correct them where possible. Editing is done to ensure that the data are accurate,
consistent with other facts gathered, uniformly entered, as complete as possible, and well
arranged to facilitate coding and tabulation. With regard to the points or stages at which editing
should be done, one can talk of field editing and central editing.
Field editing consists of the review of the forms by the investigator to complete what the
interviewer (enumerator) has written in abbreviated and/or illegible form at the time of
recording the respondents’ responses. This type of editing should be done as soon as possible
after the interview, preferably on the very day or on the next day. While doing field editing, the
investigator must restrain himself and must not correct errors of omission by simply guessing
what the informant would have said if the question had been asked.
Central editing should take place when all forms or questionnaires have been completed and
returned to the office. Editor(s) may correct the obvious errors such as an entry in the wrong
place, entry recorded in months when it should have been recorded in weeks, and the like. At
times the respondents can be asked for clarification. The editor must strike out an answer if it is
inappropriate and there is no basis for determining the correct response. In such a case, an
editing entry of ‘no answer’ is called for.
2) Coding: Coding refers to the process of assigning numerals or other symbols to answers so
that responses can be put into a limited number of categories or classes. These classes must
possess the characteristic of exhaustiveness (there must be a class for every data item) and also
that of mutual exclusivity which means that a specific answer can be placed in one and only one
cell in a given category set.
Coding is necessary for efficient analysis; through it, the many replies are reduced to a small
number of classes that contain the critical information required for analysis. Coding decisions
should usually be taken at the designing stage of the questionnaire. In the case of hand coding, it
is possible to code on the margin of the questionnaire with a colored pencil or to transcribe the
data from the questionnaire to a coding sheet.
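To make this concrete, the following minimal Python sketch (the categories, codes, and responses are hypothetical, chosen only for illustration) maps raw answers to numeric codes while keeping the category set exhaustive and mutually exclusive:

# Hypothetical coding scheme for marital status; "other" keeps the set exhaustive,
# and each response maps to exactly one code (mutual exclusivity).
CODES = {"single": 1, "married": 2, "divorced": 3, "widowed": 4, "other": 9}

responses = ["married", "Single", "MARRIED", "separated"]
coded = [CODES.get(r.strip().lower(), CODES["other"]) for r in responses]
print(coded)  # [2, 1, 2, 9]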
3) Classification: Classification is the process of arranging data in groups or classes on the basis
of common characteristics, especially for studies with large volume of raw data. Data having a
common characteristic are placed in one class and in this way the entire data get divided into a
number of groups or classes.
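As a small illustration (the ages and class limits below are assumptions made for the example), classification of a continuous variable can be sketched in Python by placing each observation into the class interval that contains it:

# Hypothetical ages and class intervals (lower and upper class limits).
ages = [23, 35, 47, 19, 52, 31, 44, 28, 61, 38]
bins = [(18, 29), (30, 41), (42, 53), (54, 65)]

# Place every observation into the class whose limits contain it.
classes = {f"{lo}-{hi}": [] for lo, hi in bins}
for age in ages:
    for lo, hi in bins:
        if lo <= age <= hi:
            classes[f"{lo}-{hi}"].append(age)
            break

for interval, members in classes.items():
    print(interval, members)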
4) Tabulation: When a mass of data has been assembled, it becomes necessary for the researcher
to arrange it in some kind of concise and logical order. This procedure is referred to as
tabulation. Thus tabulation is the process of summarizing raw data and displaying them in
compact form for further analysis. Tabulation is essential because of the following reasons:
i) It conserves space and reduces explanatory and descriptive statements to a minimum
ii) It facilitates the process of comparison
iii) It facilitates the summation of items and the detection of errors and omissions
iv) It provides a basis for various statistical computations.
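The short Python sketch below (with made-up coded responses) tabulates the data into a simple frequency table using only the standard library, illustrating how tabulation compacts raw data for comparison and further computation:

from collections import Counter

# Hypothetical coded responses for a single question.
coded = [2, 1, 2, 9, 1, 1, 2, 3]

freq = Counter(coded)                 # frequency of each code
total = sum(freq.values())
print(f"{'Code':>4} {'Count':>6} {'Percent':>8}")
for code in sorted(freq):
    print(f"{code:>4} {freq[code]:>6} {100 * freq[code] / total:>7.1f}%")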
B) Use of percentages: Percentages are often used in data presentation for they simplify
numbers, reducing all of them to a 0 to 100 range. While using percentages, the following rules
should be kept in view by researchers:
1. Two or more percentages must not be averaged unless each is weighted by the group size from
which it has been derived
2. The use of too large percentages should be avoided
3. Percentages hide the base from which they have been computed. If this is not kept in view, the
real differences may not be correctly read
4. Percentage decreases can never exceed 100 per cent and as such for calculating the percentage
of decrease, the higher figure should invariably be taken as the base
5. Percentages should generally be worked out in the direction of the causal factor in the case of
two-dimensional tables, and for this purpose we must select the more significant factor as the
causal factor.
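The first rule above can be illustrated with a small Python sketch (the group sizes and percentages are made up): averaging two percentages without weighting by group size gives a misleading overall figure.

# Two groups report 80% and 40% 'yes' answers, but the groups differ greatly in size.
sizes = [20, 180]
percents = [80.0, 40.0]

naive = sum(percents) / len(percents)                                  # 60.0 (misleading)
weighted = sum(p * n for p, n in zip(percents, sizes)) / sum(sizes)    # 44.0 (correct)
print(f"unweighted average: {naive:.1f}%, weighted average: {weighted:.1f}%")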
DATA ANALYSIS
The term analysis refers to the computation of certain measures along with searching for patterns
of relationship that exist among data-groups. Analysis involves estimating the values of
unknown parameters of the population and testing of hypotheses for drawing inferences.
Analysis can be categorized as descriptive analysis and inferential (statistical) analysis.
Descriptive analysis is largely the study of distribution of one variable. The characteristics of
location, spread, and shape describe distributions. Their definitions, applications, and formulas
fall under the heading of descriptive statistics. The common measures of location, often called
central tendency, include mean, median, and mode. The common measures of spread,
alternatively called measures of dispersion, are variance, standard deviation, and range. The
common measures of shape are skewness and kurtosis.
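As a rough sketch (with invented data), the common measures of location and spread can be computed with Python's standard statistics module; skewness and kurtosis are not in the standard library and would need a package such as scipy.stats:

import statistics

data = [4, 8, 6, 5, 3, 7, 8, 9, 5, 8]   # hypothetical sample

# Measures of location (central tendency).
print("mean:  ", statistics.mean(data))
print("median:", statistics.median(data))
print("mode:  ", statistics.mode(data))

# Measures of spread (dispersion); variance and stdev use n - 1 in the denominator.
print("variance:", statistics.variance(data))
print("std dev: ", statistics.stdev(data))
print("range:   ", max(data) - min(data))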
Inferential analysis includes two topics: estimation of population values and testing of statistical
hypotheses.
We may as well talk of correlation analysis and causal analysis. Correlation analysis studies the
joint variation of two or more variables for determining the amount of correlation between two
or more variables. Causal (regression) analysis is concerned with the study of how one or more
variables affect changes in another variable. It is thus a study of functional relationships existing
between two or more variables.
Measurement Scales
Before analyzing data, it is important to identify the measurement scales of the data type. There
are four basic measurement scales: nominal, ordinal, interval, and ratio. The most accepted basis
for scaling has three characteristics:
1. Numbers are ordered. One number is less than, greater than, or equal to another number.
2. Differences between numbers are ordered. The difference between any pair of numbers is
greater than, less than, or equal to the difference between any other pair of numbers.
3. The number series has a unique origin indicated by the number zero.
The combination of these characteristics of order, distance, and origin provides the following
widely used classification of measurement scales.
Nominal Scales: When we use a nominal scale, we partition a set into categories that are mutually
exclusive and collectively exhaustive. The counting of members is the only possible arithmetic
operation and as a result the researcher is restricted to the use of the mode as the measure of
central tendency. If we use numbers to identify categories, they are recognized as labels only and
have no quantitative value. Nominal scales are the least powerful of the four types. They suggest
no order or distance relationship and have no arithmetic origin. Examples can be respondents’
marital status, gender, students’ ID numbers, etc.
Ordinal Scales: Ordinal scales include the characteristics of the nominal scale plus an indicator
of order. The use of an ordinal scale implies a statement of ‘greater than’ or ‘less than’ (an
equality statement is also acceptable) without stating how much greater or less. Thus the real
difference between ranks 1 and 2 may be more or less than the difference between ranks 2 and 3.
The appropriate measure of central tendency for ordinal scales is the median. Examples of
ordinal scales include opinion or preference scales.
Interval Scales: The interval scale has the powers of nominal and ordinal scales plus one
additional strength: It incorporates the concept of equality of interval (the distance between 1 and
2 equals the distance between 2 and 3). When a scale is interval, you use the arithmetic mean as
the measure of central tendency. Calendar time is such a scale. For example, the elapsed time
between 4 and 6 A.M. equals the time between 5 and 7 A.M. One cannot say, however, that 6 A.M. is
twice as late as 3 A.M., because zero time is an arbitrary origin. Centigrade and Fahrenheit
temperature scales are other examples of classical interval scales.
Ratio Scales: Ratio scales incorporate all of the powers of the previous ones plus the provision
for absolute zero or origin. The ratio scale represents the actual amounts of a variable.
Multiplication and division can be used with this scale but not with the others mentioned. Money
values, population counts, distances, return rates, weight, height, and area can be examples for
ratio scales.
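The sketch below (made-up data) ties the four scales back to the measures of central tendency mentioned above: the mode for nominal data, the median for ordinal ranks, and the mean for interval or ratio data.

import statistics

# Nominal: labels only, so counting and the mode are the meaningful summaries.
marital = ["single", "married", "married", "divorced", "single", "married"]
print("mode:", statistics.mode(marital))

# Ordinal: order is meaningful but distances are not, so the median is appropriate.
preference_ranks = [1, 3, 2, 5, 4, 2, 3]
print("median:", statistics.median(preference_ranks))

# Interval/ratio: distances (and, for ratio, ratios) are meaningful, so the mean applies.
weights_kg = [61.2, 70.5, 68.0, 74.3, 66.1]
print("mean:", statistics.mean(weights_kg))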
1) CORRELATION ANALYSIS
A) Pearson’s coefficient of correlation: For paired observations on two variables X and Y,
Pearson’s coefficient of correlation is computed as

r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{(n-1)\, S_X S_Y}

where \bar{X} and \bar{Y} are the sample means and S_X and S_Y are the sample standard deviations.
Correlation coefficients reveal the magnitude and direction of relationships. Pearson’s
correlation coefficient varies over a range of +1 through 0 to -1. The sign signifies the direction
of the relationship.
There are two basic assumptions for Pearson’s correlation coefficient. The first is linearity.
When r =0, no pattern is evident that could be described with a single line. It is possible to find
coefficients of zero where the variables are highly related but in a non-linear form. The second
assumption is a bi-variate normal distribution. That is, the data are from a random sample of a
population where the two variables are normally distributed in a joint manner. If this assumption
is not met, one should select a nonparametric measure of association.
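A direct translation of the formula above into Python (the paired values are hypothetical, e.g. advertising spend and sales) might look like this:

import math

x = [2, 4, 5, 7, 9]
y = [10, 14, 15, 21, 25]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample standard deviations (n - 1 in the denominator, matching the formula).
s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

r = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / ((n - 1) * s_x * s_y)
print(f"r = {r:.3f}")   # close to +1 for these strongly related values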
B) Spearman’s coefficient of correlation (or rank correlation): When the data are not
available to use in numerical form but the information is sufficient to rank the data as
first, second, third, and so forth, we quite often use the rank correlation method. In fact,
the rank correlation coefficient is a measure of correlation that exists between two sets of
ranks.
For calculating the rank correlation coefficient, rank the observations by giving 1 to the highest
value, 2 to the next highest value, and so forth. If two or more values happen to be equal, the
average of the ranks that would have been assigned to these values had they all been different is
taken, and the same rank is given to each of the values concerned. The next step is to record the
difference between the ranks (‘d’) for each pair of observations, then square these differences
and obtain the total of the squared differences. Finally, Spearman’s rank correlation coefficient
can be worked out as:
r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}
The value of Spearman’s rank correlation coefficient will always vary between -1 and 1, where 1
indicates a perfect positive correlation and -1 indicates a perfect negative correlation.
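The procedure can be sketched in Python as follows (the two sets of scores are invented; ties receive the average of the ranks they would otherwise occupy, as described above):

def ranks(values):
    # Rank values giving 1 to the highest; tied values share the average rank.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1          # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg_rank
        i = j + 1
    return r

# Hypothetical scores given to the same five items by two judges.
a = [85, 60, 73, 40, 90]
b = [78, 70, 65, 55, 88]

ra, rb = ranks(a), ranks(b)
n = len(a)
d_squared = sum((x - y) ** 2 for x, y in zip(ra, rb))
r_s = 1 - (6 * d_squared) / (n * (n ** 2 - 1))
print(f"Spearman's rank correlation = {r_s:.2f}")   # 0.90 for this data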
2) REGRESSION ANALYSIS
The statistical tool with the help of which we are in a position to estimate (or predict) the
unknown values of one variable from known values of another variable is called regression. For
example if we know that advertising and sales are correlated, we may find out the expected
amount of sales for a given advertising expenditure or the required amount of expenditure for
attaining a given amount of sales. If we take two variables x and y, we shall have two regression
lines as under.
a. Regression equation of X on Y
b. Regression equation of Y on X
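For the simple two-variable case, the regression of Y on X can be estimated by least squares; the sketch below (hypothetical advertising and sales figures) fits the line y = a + b·x and uses it for prediction:

# Hypothetical advertising expenditure (x) and sales (y).
x = [2, 4, 5, 7, 9]
y = [10, 14, 15, 21, 25]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept for the regression of Y on X.
b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
a = mean_y - b * mean_x

print(f"y = {a:.2f} + {b:.2f} x")
print("expected sales when x = 6:", round(a + b * 6, 2))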
3) TEST OF HYPOTHESIS
A hypothesis is usually considered the principal instrument in research. Its main function is to
suggest new experiments and observations. In general, it is a mere assumption or supposition to
be proved or disproved. For the researcher, however, a hypothesis is a formal question that he
intends to resolve.
For example:
1) Students who receive counseling will show a greater increase in creativity than students
not receiving counseling.
2) Automobile A is performing as well as automobile B.
CHARACTERISTICS OF HYPOTHESIS
1) It should be clear and precise.
2) It should be capable of being tested.
3) It should state the relationship between variables.
4) It should be limited in scope and must be specific.
5) It should be stated, as far as possible, in the simplest terms, so that it is easily
understandable by all.
6) It should be consistent with most known facts.
7) It should be amenable to testing within a reasonable time.
8) It must explain the facts that gave rise to the need for explanation.
There are two types of errors in hypothesis testing. We may reject H0 when H0 is true, and we may
accept H0 when it is not true. The former is known as a Type I error and the latter as a Type II
error. In other words, a Type I error means rejecting a hypothesis that should have been accepted,
and a Type II error means accepting a hypothesis that should have been rejected. A Type I error is
denoted by α (alpha), also known as the level of significance of the test. A Type II error is
denoted by β (beta). The following table summarizes the two errors:

                     DECISION: Accept H0        DECISION: Reject H0
H0 is true           Correct decision           Type I error (α)
H0 is false          Type II error (β)          Correct decision
The probability of a Type I error is usually determined in advance and is the level of significance
for testing the hypothesis. If the Type I error is fixed at 5%, it means that there are about 5
chances in 100 that we will reject H0 when H0 is true. We can control the Type I error by fixing it
at a lower level: if we fix it at 1%, the maximum probability of committing a Type I error is 0.01.
When we try to reduce the Type I error, the probability of committing a Type II error increases.
Both types of errors cannot be reduced simultaneously, so there has to be a trade-off between the
two. Therefore, the level of significance (Type I error) is usually fixed at 5% for hypothesis
testing.
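The meaning of the 5% level of significance can be illustrated with a small simulation (a sketch with assumed population values, not an example from the text): when H0 is in fact true, a test run at α = 0.05 rejects H0, i.e. commits a Type I error, in roughly 5% of repeated samples.

import random
import statistics

random.seed(1)
Z_CRIT = 1.96                 # two-sided critical value for alpha = 0.05
MU0, SIGMA, N = 50, 10, 30    # assumed population mean, std dev, and sample size

trials, rejections = 10_000, 0
for _ in range(trials):
    sample = [random.gauss(MU0, SIGMA) for _ in range(N)]   # H0 is true here
    z = (statistics.mean(sample) - MU0) / (SIGMA / N ** 0.5)
    if abs(z) > Z_CRIT:
        rejections += 1       # rejecting a true H0: a Type I error

print(f"observed Type I error rate: {rejections / trials:.3f}")   # approximately 0.05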