Tabulation, Coding and Editing
Introduction:
Data collected in research needs to be converted into a form that can be analyzed further. The collected data may contain errors that the researcher needs to remove, as well as inconsistencies that the researcher has to resolve. Collected data tend to be in raw form and have to be coded by assigning symbols or numbers so that proper categories can be formed for further analysis. Often statistical adjustments are necessary to make the data representative of the population. The researcher then selects an appropriate data analysis strategy. This entire process is known as data processing or data preparation. Data preparation starts with checking the questionnaires for completeness of the responses, which is usually done while the fieldwork is in progress. Data are processed and analyzed to reach conclusions or to verify the hypotheses made. Processing of data includes some very important steps, namely Editing, Coding, Classification, Tabulation, making the statistical adjustments and planning the data analysis strategy. All these steps are explained hereunder:
EDITING
Data editing is the process by which collected data are examined to detect errors or omissions, which are then corrected as far as possible before proceeding further. Here the researcher aims for maximum accuracy and clarity of the collected data.
Editing can also take place at the pre-testing level, where the questionnaire is analyzed in order to uncover misinterpreted instructions, recording errors and other problems. Editing at the pre-testing level is comparatively easy because the researcher can contact the respondents again and ask for clarifications. Once pre-testing is done and the data are collected, the actual editing begins and includes checking the following:
1. COMPLETENESS OF ENTRIES. On a fully structured collection form, the absence of an entry is
ambiguous. It may mean that the respondent could not or would not provide an answer, that the
interviewer failed to ask the question, or that the interviewer failed to record collected data.
2. CONSISTENCY OF ENTRIES. Inconsistent responses raise questions concerning the validity of every
response. (If respondents indicate that they do not watch game shows, for example, but a later entry
indicates that they watched Wheel of Fortune twice during the past week, an obvious question arises as
to which entry is correct.)
3. ACCURACY OF ENTRIES. An editor should keep an eye out for any indication of inaccuracy in the data.
Of particular importance is detecting any repetitive response patterns in the reports of individual
interviews. Such patterns may well be indicative of systematic interviewer or respondent bias or
dishonesty.
Editing is of two types on the basis of the place where it is done, namely a) Field Editing and b) Central Editing.
FIELD EDITING: Field editing deals with abbreviated or illegibly written forms of the gathered data. It is more effective when done on the same day or the very next day after the interview. The investigator must not jump to conclusions while doing field editing.
CENTRAL EDITING: Central editing takes place when the entire data collection process has been completed. Here a single or common editor corrects errors such as an entry in the wrong place or in the wrong unit. The editor must be familiar with the interviewer's mindset, the objectives of the study and everything related to it.
Treatment of Unsatisfactory Responses: Unsatisfactory responses can be discarded, assigned missing values or returned to the field for completion. Questionnaires with unsatisfactory responses may be returned to the field by re-contacting the respondents. This approach works well in industrial marketing, where the sample size is small and respondents are easily identifiable.
If re-contacting the respondents is not possible, the editor may assign missing values to the unsatisfactory responses. Missing values should be assigned only when the number of respondents with unsatisfactory responses is small and the variables with unsatisfactory responses are not key variables.
Sometimes the editor simply discards the unsatisfactory responses, provided the proportion of such responses is small and the sample size is large.
Coding:
After editing the data, a particular number or symbol is assigned to each response in order to place it in a definite category or class. This process of assigning numbers or symbols is known as coding. Coding enables efficient and effective analysis, as the responses are categorized into meaningful classes. The classes of responses determined by the researcher should be appropriate and suitable to the study. Coding can be done manually or by computer.
Coding is of two types on the basis of the time at which it is done, namely pre-coding and post-coding. Pre-coding is the common practice followed by researchers nowadays. It refers to assigning codes to categories before collecting the data; here the codes form part of the answer scale that is shown to the respondents when the data are collected. Post-Coding: Post-coding is the assignment of codes to responses after the data are collected and is most often required when responses are reported in an unstructured format. Careful interpretation and good judgment are required to ensure that the meaning of the response and the meaning of the category are consistently and uniformly matched. Coding is an activity of extreme importance, as improper coding leads to poor analysis and may even constrain the types of analysis one can perform.
Coding the Questions: Structured questionnaires can be coded far more easily than unstructured questionnaires. Often the tool used in the research also specifies the codes to be used. For example, a 5-point Likert scale specifies the codes for its descriptors: strongly agree is coded as 5, agree as 4, neutral as 3, disagree as 2 and strongly disagree as 1. Nominal data are coded at the discretion of the researcher; for example, the gender of the respondent can be coded as 1 for male and 2 for female.
Normally a data file is prepared for further analysis, in which each question is coded along with the coding instructions and the necessary information about the variables in the data set. Given below is an example of a data file.
Age Group | Gender | Occupation | Importance of Healthy Life Style | Busy Schedule is a Hurdle | Adoption of Healthy Life Style is Costly
2 1 2 5 4 3
2 1 2 5 4 4
4 1 5 5 4 5
1 1 1 1 2 1
4 1 5 4 4 4
3 0 3 5 5 4
1 0 1 4 2 2
3 1 2 5 4 4
1 0 1 4 3 4
1 1 1 5 1 1
1 1 1 1 3 3
1 0 5 5 4 3
3 0 3 5 5 4
4 0 3 4 4 4
3 1 2 5 5 4
3 1 2 5 5 4
5 1 4 5 4 4
5 1 4 2 3 2
4 0 3 5 5 4
3 0 5 5 5 5
In the Age Group column of the data file, code 1 has been assigned to respondents below 18 years of age, code 2 to respondents in the 18-25 age group, code 3 to the 26-30 age group, code 4 to the 31-35 age group and code 5 to respondents above 35 years of age.
In the Gender column, 0 has been assigned to male respondents and 1 to female respondents. In the Occupation column, 1 has been assigned to students, 2 to service, 3 to business, 4 to homemakers and 5 to professionals. The three statements related to a healthy lifestyle have been measured on a scale of 1 to 5, where 1 stands for strongly disagree, 2 for disagree, 3 for neutral, 4 for agree and 5 for strongly agree.
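A coding scheme like the one above can also be applied programmatically. The following is a minimal sketch in Python using pandas; the column names and the two example respondents are illustrative assumptions, not data from the actual study.

```python
# Minimal sketch of applying the coding scheme described in the text.
import pandas as pd

age_codes = {"below 18": 1, "18-25": 2, "26-30": 3, "31-35": 4, "above 35": 5}
gender_codes = {"Male": 0, "Female": 1}
occupation_codes = {"Student": 1, "Service": 2, "Business": 3,
                    "Homemaker": 4, "Professional": 5}
likert_codes = {"Strongly disagree": 1, "Disagree": 2, "Neutral": 3,
                "Agree": 4, "Strongly agree": 5}

# Hypothetical raw responses (labels as recorded on the questionnaire).
raw = pd.DataFrame({
    "age_group": ["18-25", "31-35"],
    "gender": ["Female", "Male"],
    "occupation": ["Service", "Professional"],
    "importance_of_healthy_lifestyle": ["Strongly agree", "Agree"],
})

# Replace each label with its numeric code to build the data file.
coded = raw.copy()
coded["age_group"] = coded["age_group"].map(age_codes)
coded["gender"] = coded["gender"].map(gender_codes)
coded["occupation"] = coded["occupation"].map(occupation_codes)
coded["importance_of_healthy_lifestyle"] = (
    coded["importance_of_healthy_lifestyle"].map(likert_codes)
)
print(coded)
```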
Transcribing
Transcribing data involves transferring the coded data from the questionnaires or coding sheets onto disks or directly into computers by keypunching or other means. However, if the data have been collected through computer-assisted personal interviewing, computer-assisted telephone interviewing or internet surveys, this step is unnecessary.
Many qualitative studies collect audio or video data (e.g. recordings of interviews, focus groups or talk in consultations), and these are usually transcribed into written form for closer study. Transcription should match the analytic and methodological aims of the research. While transcription is often part of the analysis process, it also enhances the sharing and re-use potential of qualitative research data.
Data Cleaning
Data cleaning includes consistency checks and the treatment of missing values. Although consistency checks are also applied at the time of editing, here they are more thorough and extensive because they are made by computer.
A consistency check deals with identifying and removing responses that are incorrect, incomplete or irrelevant. Tabulation done without data cleaning leads to incorrect research analysis. Out-of-range data are inadmissible and must be corrected before proceeding to analysis. For example, if all the statements in a questionnaire have been measured on a scale of 1 to 5 and 9 has been assigned to missing values, then values like 0, 7 and 8 are out of range. The correct responses can be re-entered by referring to the original questionnaire. Consistency can also be checked in terms of logical consistency; for example, if a respondent reports both unfamiliarity with and usage of the product, that respondent's response should be discarded or completed by re-contacting him/her.
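A simple out-of-range check of this kind can be sketched as follows. The item names (q1, q2) and example values are hypothetical; the valid codes 1 to 5, with 9 for missing values, follow the example in the text.

```python
# Minimal sketch of a consistency (out-of-range) check.
import pandas as pd

data = pd.DataFrame({
    "resp_id": [1, 2, 3, 4],
    "q1": [5, 0, 3, 9],   # 0 is out of range
    "q2": [4, 7, 2, 5],   # 7 is out of range
})

valid = {1, 2, 3, 4, 5, 9}  # 1-5 are scale values, 9 marks a missing value
for col in ["q1", "q2"]:
    bad = data.loc[~data[col].isin(valid), ["resp_id", col]]
    if not bad.empty:
        # These cases would be corrected by referring back to the original questionnaire.
        print(f"Out-of-range values in {col}:\n{bad}")
```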
In the data cleaning process the researcher also needs to deal with missing data and outliers. Missing data occur when respondents do not provide responses to all the questions. The most common approach in research is to treat the nonresponse as missing data in the analysis. Some researchers opt to delete the responses with missing data, whereas others assign a mean value to the missing data on the basis of the responses that were given. An outlier is an observation that lies well outside the expected range of values in a study or experiment, and quite often outliers are removed from the data set. The most common source of outliers is measurement error, which may occur due to faulty working of the instrument being used. Another source is experimental error, which occurs due to wrong or faulty conditions at the time of an experiment. Outliers can also arise by chance, through human error, sampling error or intentional wrong reporting of results by the respondents. The easiest way to detect an outlier is to create a graph; outliers can be spotted using histograms, scatterplots, number lines and the interquartile range.
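As a rough illustration of the interquartile-range approach mentioned above, the following sketch flags values lying more than 1.5 times the IQR beyond the quartiles; the data values are illustrative only.

```python
# Minimal sketch of outlier detection with the interquartile range (IQR).
import numpy as np

values = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 95])  # 95 looks suspicious

q1, q3 = np.percentile(values, [25, 75])      # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # common "fence" rule

outliers = values[(values < lower) | (values > upper)]
print("IQR bounds:", lower, upper)
print("Outliers:", outliers)   # -> [95]
```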
Adjustments to the data are not always required but may enhance the quality of the data analysis. Procedures for statistically adjusting the data include weighting, variable re-specification and scale transformation.
Weighting: In weighting, each case or respondent in the study is not given equal importance; rather, each case or respondent is assigned a weight to reflect its importance relative to the other cases or respondents. The objective of weighting is to increase or decrease the influence of cases in the sample that possess certain characteristics. A value of 1 represents an unweighted case. Weighting is done to make the sample more representative of the population. For example, if a firm wants to make some modification to its product, the opinion of heavy users is the most important for the firm to know, followed by medium users and lastly by light users or non-users. So the firm might assign a weight of 3 to heavy users, 2 to medium users and 1 to light or non-users.
If sample for a study differs significantly from the population, then weights can be derived by dividing
the population percentage by the corresponding sample percentage. Categories under-represented in
the sample receive higher weights whereas categories over-represented receive lower weights.
Weighting in a study should be applied with caution as weighting destroys the self-weighting nature of a
sample. If used, the weighting procedure should be documented and made a part of the project.
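A minimal sketch of deriving such weights is given below, assuming illustrative sample and population percentages for heavy, medium and light users; the weight for each category is its population percentage divided by its sample percentage.

```python
# Minimal sketch: weights = population percentage / sample percentage.
sample_pct = {"heavy": 10, "medium": 30, "light": 60}      # % of the sample (assumed)
population_pct = {"heavy": 20, "medium": 30, "light": 50}  # % of the population (assumed)

weights = {k: population_pct[k] / sample_pct[k] for k in sample_pct}
print(weights)
# heavy users (under-represented) get weight 2.0,
# light users (over-represented) get weight ~0.83
```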
Variable Re-specification: An important variable re-specification procedure involves the use of dummy variables for re-specifying categorical variables. Dummy variables are also known as dichotomous, binary, instrumental or qualitative variables. They can take only two values, such as 0 or 1. As a rule, a categorical variable with k categories is re-specified with k-1 dummy variables, because only k-1 of the categories are independent. Given below is an example of assigning dummy variables: there are four categories, so three dummy variables have been assigned.
Product Usage Category   Original Variable Code   X1   X2   X3
Nonusers                 1                        1    0    0
Light Users              2                        0    1    0
Medium Users             3                        0    0    1
Heavy Users              4                        0    0    0
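The same re-specification can be produced with pandas, as sketched below; the category labels follow the table, and heavy users are dropped so that they become the reference category with all three dummies equal to 0.

```python
# Minimal sketch of creating k-1 dummy variables for k = 4 usage categories.
import pandas as pd

usage = pd.Series(["Nonuser", "Light", "Medium", "Heavy"], name="usage")
dummies = pd.get_dummies(usage, prefix="X")

# Drop the "Heavy" column so heavy users become the reference (baseline) category.
dummies = dummies.drop(columns=["X_Heavy"])
print(pd.concat([usage, dummies.astype(int)], axis=1))
```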
Scale Transformation: Scale transformation involves the manipulation of scale values to ensure comparability with other scales or otherwise make the data suitable for analysis. Normally data are collected using different scales, so in order to compare two variables measured on different scales we need scale transformation. Even if the same scale is used, respondents may use it differently: some may regularly give higher ratings whereas others may regularly give lower ratings. These differences can also be corrected by scale transformation.
The scale can be transformed relative to the mean response: the mean is subtracted from each individual value and a constant c is added. Another common way to transform the scale is standardization. To standardize a scale, we first subtract the mean score from each value and then divide by the standard deviation; this is equivalent to calculating a Z score.
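Both transformations can be sketched in a few lines; the ratings and the constant c below are illustrative assumptions.

```python
# Minimal sketch of mean correction and standardization (z scores).
import numpy as np

ratings = np.array([2, 3, 4, 4, 5], dtype=float)
c = 3.0  # arbitrary constant chosen by the researcher

mean_corrected = ratings - ratings.mean() + c                 # subtract mean, add c
standardized = (ratings - ratings.mean()) / ratings.std(ddof=1)  # z scores

print(mean_corrected)
print(standardized)
```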
Classification
Classification means categorizing the collected data into common groups. Data having common characteristics are placed in a common group. The entire data set is categorized into various groups or classes that convey meaning to the researcher and facilitate easy understanding of the research. In other words, heterogeneous data are divided into separate homogeneous classes according to characteristics shared by the different individuals or quantities constituting the data. Thus, classification is fundamentally dependent upon similarities and resemblances among the items in the data. Classification can be done according to attributes or according to class intervals. Classification according to attributes deals with classifying data on the basis of common descriptive characteristics like literacy, sex, honesty, marital status, usage of a product or service, etc. The attributes on the basis of which data are classified are qualitative in nature and cannot be measured quantitatively, but are considered important while making an analysis. Classification on the basis of class intervals deals with numerical features of data that can be measured quantitatively. Data related to income, production, age, weight, etc. come under this category.
Tabulation
Tabulation is the final step in the data collection process and the first step in the data analysis process. The most basic tabulation consists of counting the number of responses that occur in each of the data categories that comprise a variable. Tabulation summarizes the raw data and displays them in the form of statistical tables; it is an orderly arrangement of data in rows and columns. Cross-tabulation involves simultaneously counting the number of observations that occur in each of the data categories of two or more variables. Cross-tabulation is explained in detail in the chapter on bivariate data analysis.
OBJECTIVE OF TABULATION:
Frequency Table
The frequency of a particular data value is the number of times the data value occurs in the study. For example, if five students have a score of 90 in mathematics, then the score of 90 is said to have a frequency of 5. The frequency of a data value is often represented by f. The distribution of a variable is the pattern of frequencies of the observations. Frequency distributions are portrayed as frequency tables, histograms or polygons. Frequency distributions can show either the actual number of observations falling in each range or the percentage of observations; when the distribution shows the percentage of observations in each category, it is called a relative frequency distribution.
A frequency distribution table is constructed by arranging the collected data values in ascending order of magnitude with their corresponding frequencies. Frequency distribution tables can be used for both categorical and numeric variables; however, continuous variables should only be tabulated using class intervals. To construct a frequency table:
1. Divide the results (x) into intervals, and then count the number of results in each interval.
2. Make a table with separate columns for the interval numbers, the tallied results, and the
frequency of results in each interval.
3. Read the list of data from left to right and place a tally mark in the appropriate row. For example,
the first result is a 1, so place a tally mark in the row beside where 1 appears in the interval
column. The next result is a 2, so place a tally mark in the row beside the 2, and so on. When you
reach your fifth tally mark, draw a tally line through the preceding four marks to make the final
frequency calculations easier to read.
4. Add up the number of tally marks in each row and record them in the final column entitled
Frequency.
Example: A researcher conducted a survey in 20 homes and asked people about the number of cars
registered to their households. The results were recorded as follows:
1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0
Number of cars (x)    Frequency (f)
0                     4
1                     6
2                     5
3                     3
4                     2
By looking at this frequency distribution table quickly, we can see that out of 20 households surveyed,
4 households had no cars, 6 households had 1 car, etc.
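The same frequency table can be reproduced in a few lines of Python from the recorded responses, as a quick check on the manual tally:

```python
# Minimal sketch: frequency table for the car-ownership survey above.
from collections import Counter

cars = [1, 2, 1, 0, 3, 4, 0, 1, 1, 1, 2, 2, 3, 2, 3, 2, 1, 4, 0, 0]
freq = Counter(cars)

for value in sorted(freq):
    print(f"{value} car(s): frequency {freq[value]}")
# 0 -> 4, 1 -> 6, 2 -> 5, 3 -> 3, 4 -> 2
```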
Data Analysis Strategy
The data analysis strategy is developed by the researcher by taking into consideration various factors, namely the problem definition, the known characteristics of the data, the properties of statistical techniques and the philosophy of the researcher. The data analysis strategy should emerge from the problem definition, research plan and research design. For example, ANOVA is generally applied in causal research designs. The measurement scale used for data collection exerts a strong influence on data analysis. The table given below shows the statistics permissible with a particular scale of measurement.
Scale      Permissible Statistics
Interval   Range, Mean, Standard Deviation; Product-moment Correlation, t-test, ANOVA, Regression, Factor Analysis
It is also important to understand the properties of statistical techniques, as some are suitable for examining differences whereas others are suitable for examining associations. Some give information about the magnitude of variables whereas others are useful for predicting the value of a variable. The researcher's background and philosophy also affect the choice of statistical technique. Experienced and trained researchers employ a variety of statistical techniques to uncover the relationships among variables. Researchers also differ in the assumptions they make about the population: conservative researchers employ non-parametric techniques, which do not assume that the population follows any particular distribution.
Non-parametric tests don’t make assumptions about normality of the data. Hence, these tests are also
known as distribution free tests. One might define nonparametric statistical procedures as a class of
statistical procedures that do not rely on assumptions about the shape or form of the probability
distribution from which the data were drawn. The most common non-parametric tests include chi-
square test, Wilcoxon Rank Sum test, Wilcoxon Signed Rank test, Kruskal-Wallis test and Spearman’s
Rank Correlation Test.
If the normality assumptions are not met, the data can be analyzed with non-parametric tests. There is a non-parametric alternative to every common parametric test. The table given below lists the non-parametric test to be used when the normality condition is not met.
Parametric Test              Non-parametric Alternative
Independent-samples t-test   Mann-Whitney / Wilcoxon Rank Sum test
Paired t-test                Wilcoxon Signed Rank test
One-way ANOVA                Kruskal-Wallis test
Pearson correlation          Spearman's Rank Correlation
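As a rough sketch of how this choice might be made in practice, assuming SciPy is available, the following checks normality with the Shapiro-Wilk test and falls back to the Mann-Whitney U test (the non-parametric alternative to the independent-samples t-test) when normality fails; the two groups of values are illustrative only.

```python
# Minimal sketch: choose a parametric or non-parametric test based on normality.
from scipy import stats

group_a = [12, 14, 15, 16, 18, 21, 22, 40]   # illustrative values only
group_b = [10, 11, 13, 13, 15, 17, 19, 20]

# Shapiro-Wilk test of normality for each group.
normal = True
for g in (group_a, group_b):
    stat, p = stats.shapiro(g)
    if p <= 0.05:        # reject normality for this group
        normal = False

if normal:
    result = stats.ttest_ind(group_a, group_b)      # parametric: t-test
else:
    result = stats.mannwhitneyu(group_a, group_b)   # non-parametric alternative
print(result)
```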
Statistical techniques can also be categorized on the basis of the number of variables to be analyzed. If a study makes use of only one variable, the analysis is univariate; bivariate analysis makes use of two variables, while multivariate analysis makes use of more than two variables. All these techniques are explained in subsequent chapters.
Data Representation:
Data can be represented in many ways. The most common ways are:
Bar charts
Histograms
Pie charts
Pictograms
Line graphs/charts
Frequency polygons
Stem-leaf diagrams
Scatter diagram
Bar Charts: Bar charts are most commonly used to compare categorical or qualitative data and grouped discrete quantitative data, such as scores on a test or the amount spent by customers in a shop. Rectangles of equal width are used, and the height/length of each rectangle represents the frequency of the category. Given below is an example of a bar chart representing the frequency of visits by customers to a retail shop.
Histograms: Histograms are most commonly used for representing grouped continuous variables. They depict frequency (or count) against a continuous or nearly continuous variable. Rectangles are drawn whose widths represent class intervals and whose areas are proportional to the corresponding frequencies. The rectangles touch each other; there is no space in between as with bar charts. It is possible to judge the skewness of the data with the help of a histogram, because the X axis carries continuous data, so the researcher can see whether the observations tend to fall more at the low end or the high end of the X axis. For this reason, researchers use histograms to check the normality of the data. With bar charts such a judgment is not possible because a bar chart uses a categorical variable on the X axis. Given below is an example of a histogram.
A Frequency Histogram is a special histogram that uses vertical columns to show frequencies i.e. how
many times each score occurs.
Pie-Charts: A pie chart (or a circle chart) is a circular statistical graphic, which is divided into slices to
illustrate numerical proportion. Pie-charts are used to represent data as part of a whole, to illustrate
differences in categories provided the number of categories is limited (generally between 2 and 8).
Measure of the angle at the centre of the circle is proportional to the frequency. Given below is an
example of pie-chart for Age of the Respondents.
Line graphs/charts: A line chart is represented by a series of data points connected with straight lines. Line charts are most often used to visualize changes in a continuous variable over time. A line chart is similar to a scatter plot except that the measurement points are ordered and joined with straight line segments.
Frequency polygons: Frequency Polygons are used to compare grouped continuous variables. These are
used when two or more sets of data are to be illustrated on the same diagram such as death rates in
smokers and non-smokers, birth and death rates of a population etc. In a Frequency Polygon, a line graph
is drawn by joining all the midpoints of the top of the bars of a histogram. Frequency polygons are
similar to histograms except histograms have bars and frequency polygons have dots and lines
connecting the frequencies of each class interval thereby giving a diagram or graph with many angles i.e.
the polygon. Given below is an example of Polygon.
Stem-leaf diagrams: These are used to represent ungrouped quantitative data. These can also be used
to compare two sets of ungrouped quantitative data like male and female data on the same variable.
Stem-leaf diagrams are the only graphical representations that also display all the original data values.
For instance, suppose you have the following list of values: 12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41. You
could make a frequency distribution table showing how many tens, twenties, thirties, and forties you
have:
Interval   Frequency
10 - 19    2
20 - 29    2
30 - 39    4
40 - 49    3
The stem-and-leaf plot for the same data would look like this:
1 | 2 3
2 | 1 7
3 | 3 4 5 7
4 | 0 0 1
The "stem" is the left-hand column which contains the tens digits. The "leaves" are the lists in the right-
hand column, showing all the ones digits for each of the tens, twenties, thirties, and forties. As you can
see, the original values can still be determined; you can tell, from that bottom leaf, that the three values
in the forties were 40, 40, and 41.
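A stem-and-leaf plot can also be built directly from the raw values, as in this minimal sketch using the same data as above:

```python
# Minimal sketch: build a stem-and-leaf plot from a list of two-digit values.
values = [12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41]

stems = {}
for v in sorted(values):
    stems.setdefault(v // 10, []).append(v % 10)  # tens digit -> list of ones digits

for stem, leaves in sorted(stems.items()):
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
# 1 | 2 3
# 2 | 1 7
# 3 | 3 4 5 7
# 4 | 0 0 1
```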
Scatter diagrams: When looking at statistical data it is often observed that there are connections between sets of data. For example, the mass and height of persons are related: the taller the person, the greater his/her mass. Scatter diagrams can be used to find out whether or not two sets of data are connected. The independent variable is plotted on the X axis and the dependent variable on the Y axis, and a dot is placed at the point whose X and Y coordinates correspond to each pair of values. Given below is an example of a scatter diagram which clearly depicts that players' height and weight are correlated.
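A scatter diagram of this kind can be drawn with matplotlib, assuming it is installed; the height and weight values below are illustrative only, not data from an actual study.

```python
# Minimal sketch of a scatter diagram (height vs. weight).
import matplotlib.pyplot as plt

height_cm = [160, 165, 170, 175, 180, 185, 190]   # independent variable (X)
weight_kg = [55, 60, 66, 70, 76, 82, 88]          # dependent variable (Y)

plt.scatter(height_cm, weight_kg)   # one dot per (X, Y) pair
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.title("Scatter diagram: height vs. weight")
plt.show()
```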
Summary:
This chapter deals with data processing. Processing of data includes some very important steps, namely Editing, Coding, Classification, Tabulation, making the statistical adjustments and planning the data analysis strategy. Data editing is the process by which collected data are examined to detect errors or omissions, which are then corrected as far as possible before proceeding further; here the researcher aims for maximum accuracy and clarity of the collected data. Editing is of two types on the basis of the place where it is done, namely a) Field Editing and b) Central Editing. After editing the data, a particular number or symbol is assigned to each response in order to place it in a definite category or class. This process of assigning numbers or symbols is known as coding. Coding enables efficient and effective analysis, as the responses are categorized into meaningful classes. Data cleaning includes consistency checks and the treatment of missing values; in this process the researcher also needs to deal with missing data and outliers. Missing data occur when respondents do not provide responses to all the questions, and the most common approach in research is to treat the nonresponse as missing data in the analysis. Adjustments to the data are not always required but may enhance the quality of the data analysis; procedures for statistically adjusting the data include weighting, variable re-specification and scale transformation. All these terms were explained briefly in the chapter. The data analysis strategy was also discussed: it is developed by the researcher by taking into consideration various factors, namely the problem definition, the known characteristics of the data, the properties of statistical techniques and the philosophy of the researcher. Data representation was also discussed briefly, covering bar charts, histograms, pie charts, line charts, stem-and-leaf diagrams, scatter diagrams, etc.
Review Questions: