Topic 8 Data Processing and Analysis PDF
ANALYSIS
Topic outline
•Data processing
– Overview
– Quantitative data
– Qualitative data
•Data analysis
– Introduction
– Descriptive
– Inferential
– Qualitative data
analysis
Reading Materials
•Kothari, C.R. (2004):
Chapter 7
•Bhattacherjee, A. (2012):
Chapters 14 and 15
What is data processing?
• Data processing is the preparation of data for
analysis; it involves converting raw data into a form
that is amenable to analysis
• For quantitative data, processing involves:
– Questionnaire checking: completeness and
quality of answers in the field and after data
collection
– Editing: examining collected raw data to
detect errors and omissions, to ensure that
collected data are accurate, consistent,
uniformly entered and complete
Data processing…
• Coding: assigning numerals or other symbols to the
answers by categorizing them
• In coding, the categories or classes should be:
– Appropriate to the research question and objectives
– Exhaustive: each data item must have a category
– Mutually exclusive: each answer should be placed in
only one category
• Coding involves preparation of a codebook to guide
data entry
• SPSS is a popular statistical package for data analysis
in the social sciences
• Others are Stata, SAS, Epi Info, Excel, etc.
Data processing…
• Structure of the codebook in SPSS
– Variable name: Variable name shortened
(abbreviated)
– Variable type: numeric or string
– Variable label: Variable name written in
expanded form
– Value labels: what the numbers (e.g. 1, 2, 3) stand
for (for categorical variables)
– Other considerations: width, decimals, columns,
align
– Measure (nominal, ordinal or scale)
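The same codebook structure can be sketched outside SPSS as a plain data structure. The Python snippet below is only an illustration: the labels and codes for `sex` and `educ` are hypothetical, and `decode` is a helper invented for this example.

```python
# Hypothetical codebook: one entry per variable, mirroring the SPSS fields
# (type, variable label, value labels, measure). All codes are made up.
codebook = {
    "sex": {
        "type": "numeric",
        "label": "Sex of respondent",
        "values": {1: "Male", 2: "Female"},
        "measure": "nominal",
    },
    "educ": {
        "type": "numeric",
        "label": "Highest education level attained",
        "values": {1: "None", 2: "Primary", 3: "Secondary", 4: "Tertiary"},
        "measure": "ordinal",
    },
}


def decode(variable, code):
    """Return the value label for a code, or flag a code missing from the codebook."""
    return codebook[variable]["values"].get(code, "INVALID CODE")


raw_sex = [1, 2, 2, 1, 3]  # 3 is not a defined category for sex
decoded_sex = [decode("sex", c) for c in raw_sex]
print(decoded_sex)
```

Because the categories are exhaustive and mutually exclusive, every valid code maps to exactly one label; anything else is flagged for checking.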
Data processing…
• In SPSS, a codebook is created using the
VARIABLE VIEW window
Data processing…
• Entering the data into the computer statistical
package
• Data cleaning or verification: done after data
entry exercise is completed
• Data cleaning involves removing errors made during
data entry, such as wrong codes, illogical entries
(e.g. males recorded as having babies or miscarriages),
inconsistent answers, etc.
• Preliminary descriptive statistics can be used to
clean data
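As a rough sketch of how preliminary frequencies and simple logical checks can expose entry errors, consider the Python example below. The records, the coding scheme (1 = male, 2 = female) and the `pregnancies` field are all hypothetical.

```python
from collections import Counter

# Hypothetical records after data entry: sex coded 1 = male, 2 = female.
records = [
    {"id": 1, "sex": 1, "pregnancies": 0},
    {"id": 2, "sex": 2, "pregnancies": 3},
    {"id": 3, "sex": 9, "pregnancies": 0},  # wrong code: 9 is not defined
    {"id": 4, "sex": 1, "pregnancies": 2},  # illogical: male with pregnancies
]

valid_sex_codes = {1, 2}

# A frequency table quickly reveals codes that are not in the codebook.
sex_counts = Counter(r["sex"] for r in records)

# Cases to send back for checking against the original questionnaires.
wrong_codes = [r["id"] for r in records if r["sex"] not in valid_sex_codes]
illogical = [r["id"] for r in records if r["sex"] == 1 and r["pregnancies"] > 0]

print(sex_counts)
print("Wrong codes:", wrong_codes)
print("Illogical cases:", illogical)
```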
Data processing…
• Data entry is done using DATA VIEW window
Data Analysis: Introduction
• Data analysis is the computation of indices or
measures to show patterns or relationships
between data groups for meaningful
interpretation and discussion of findings
• Involves estimating the values of unknown
parameters of the population to understand the
information or message contained in raw data,
test hypotheses and draw inferences
• Data analysis is the most skilled task in the
research process; it requires critical
examination of the processed data
Why data analysis?
• Summarize collected raw data into understandable and
meaningful information
• Use statistics to make descriptions of a phenomenon
• Identify causal factors for a particular phenomenon
• Make reliable predictions or inferences from observed
data, e.g.
• What will be the demand of a product or service in
the next five years?
• What will be the rate of population growth or
industrial production in the next 10 years?
• To answer these questions you need knowledge useful
for predictions
Why data analysis?...
• Understand the distribution of variables, relationship or
association among variables, differences among
groups, etc
• To make proper estimations or generalizations from
sample results. Sample statistics (computed indices)
may give good estimates of particular population
parameters
• Thus, statistical inferences enable researchers to
evaluate the accuracy of the estimates made
• Hypothesis testing is useful for assessing the
significance of specific sample results/findings under
the studied population conditions
Data analysis: key considerations
• Nature of objectives, research questions or hypothesis
• Nature of variable(s) under consideration
(measurement levels)
• Appropriate method of presenting the analysed data
• Depth of analyses required: what does the researcher
want to see in the data - distribution of variables,
relationship, association or differences?: descriptive or
inferential analysis?
• Purpose of analysis (i.e. comparison, establishing
correlation or relationship among variables?)
• Number of variables you want to deal with at a time:
univariate, bivariate or multivariate?
Order and types of data analysis
• Start by analysing background characteristics such
as age, sex, education, marital status, etc.
• Then, analyse other variables as required by
each specific objective, research question or
hypothesis
• Two main types for quantitative data analysis:
descriptive statistics and inferential statistics
• Another way is to classify data analysis based on
the number of variables as univariate, bivariate
or multivariate analysis
Descriptive statistics
• Descriptive statistics are summary measures aiming at
– Describing patterns and distribution of one or more
variables involved in the study
– Showing size and shape of the distribution
– Getting quick insight on how data behave
• However, descriptive statistics do not allow us to
– Make conclusions beyond the data we have analyzed
– Reach conclusions regarding any hypotheses
• Two types of descriptive statistics
– Measures of central tendency (mean, mode, median)
– Measures of dispersion or spread (range, quartiles,
variance, standard deviation)
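The measures of dispersion listed above can be computed with Python's standard `statistics` module. This is a minimal sketch on hypothetical scores; note that `variance` and `stdev` here are the sample versions (dividing by n - 1), and quartile values can differ slightly between packages depending on the interpolation method used.

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical scores

data_range = max(scores) - min(scores)          # largest minus smallest value
quartiles = statistics.quantiles(scores, n=4)   # three cut points: Q1, Q2, Q3
variance = statistics.variance(scores)          # sample variance (n - 1)
std_dev = statistics.stdev(scores)              # sample standard deviation

print("Range:", data_range)
print("Quartiles:", quartiles)
print("Variance:", variance)
print("Standard deviation:", std_dev)
```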
Measures of Central Tendency
• Measure of central tendency (also called
statistical average) is a single value that
describes a set of data by identifying the central
position within that set of data
• It is a value around which all observations have a
tendency to cluster
• Such value is considered the most representative
figure of the entire dataset
• The MEAN, MEDIAN and MODE are all valid
measures of central tendency, but under
different conditions
Measures of central tendency…
• Arithmetic Mean: Sum of all the scores in the
data set divided by the number of observations
in the data set
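• So, if we have “n” observations in a data set and
they have the scores x1, x2, ..., xn, then the
sample mean is denoted by $\bar{x}$ and computed as
$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$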
• Appropriate when dealing with continuous data,
measured at interval or ratio scale
• The mean minimises the sum of squared deviations: no
other single value gives a smaller total squared error
from the values in the data set
Measures of central tendency…
• However, the mean is susceptible to the influence of
outliers i.e. extreme values in the dataset
• For example, consider the wages of staff at a factory
Staff    1    2    3    4    5    6    7    8    9    10
Salary   15k  18k  16k  14k  15k  15k  12k  17k  90k  95k
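Working through the salaries above shows the effect of the two extreme values. The short Python sketch below (using the standard `statistics` module, with salaries entered in thousands) gives a mean of 30.7k, which is higher than eight of the ten salaries, while the median of 15.5k stays close to what most staff earn.

```python
import statistics

# Salaries from the table above, in thousands
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

mean_salary = statistics.mean(salaries)      # pulled upward by 90k and 95k
median_salary = statistics.median(salaries)  # middle of the sorted values

print("Mean:", mean_salary)
print("Median:", median_salary)
```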
• For ordinal data, the median is the appropriate
measure of central tendency
• Procedures in SPSS (Descriptives)
Analyze → Descriptive Statistics → Descriptives →
select the variable(s) of interest and move it (them)
to the Variable(s) box → Options (select the mean and
other statistics you want) → OK
• Procedures in SPSS (Frequencies)
Analyze → Descriptive Statistics → Frequencies →
select the variable(s) of interest and move it (them)
to the Variable(s) box → OK
Activity
• Using “SPSS working file1”, compute descriptive
statistics (frequencies and percentages) for the variables
age2, sex, marital, educ.
• Note 1: The “Percent” and “Valid Percent” columns have
identical values because there are no missing cases. If
there were missing cases, the values in the two columns
would differ. Always report the VALID percent
column!
• Note 2: SPSS usually produces one table for each variable,
so the output ends up with many tables.
• It is a good practice to summarize related information in
one Table using MS Word to reduce number of tables
and for clarity
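To see why the two percentage columns diverge when there are missing cases, the Python sketch below rebuilds a small frequency table by hand. The coded responses are hypothetical and `None` stands in for a missing answer; this is not the content of “SPSS working file1”.

```python
from collections import Counter

# Hypothetical coded responses for marital status; None marks a missing case.
marital = [1, 2, 2, 1, None, 2, 3, None, 1, 2]

counts = Counter(v for v in marital if v is not None)
n_total = len(marital)          # all cases, including missing ones
n_valid = sum(counts.values())  # only cases with a recorded answer

for code, freq in sorted(counts.items()):
    percent = 100 * freq / n_total        # the "Percent" column
    valid_percent = 100 * freq / n_valid  # the "Valid Percent" column
    print(code, freq, round(percent, 1), round(valid_percent, 1))
```

With two of the ten cases missing, code 2 accounts for 40% of all cases but 50% of valid cases, which is the figure normally reported.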
Inferential statistics
• Inferential statistics are statistical procedures used to
reach conclusions about associations between variables
• As opposed to descriptive statistics, inferential statistics
are designed to
– Test hypotheses to determine whether there are
significant differences between groups or there is
significant association/ relationship among variables
– Determine whether independent variable(s) have an
effect (influence) on a dependent variable
Inferential Statistics
•P-values
•Frequencies
Interpreting the p-values
• P > 0.05 = Not significant at the 5% level (i.e. the null
hypothesis is not rejected, therefore, there is NO
significant difference/relationship) (i.e. NS)
• P ≤ 0.05 = Significant at the 5% level (the null hypothesis
is rejected at the 5% level, therefore, there is a significant
difference/relationship) (i.e. *)
• P < 0.01 = Significant at the 1% level (the null hypothesis
is rejected at the 1% level, therefore, there is a significant
difference/relationship) (i.e. **)
• P < 0.001 = Significant at the 0.1% level (the null
hypothesis is rejected at the 0.1% level, therefore, there is
a significant difference/relationship) (i.e. ***)
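The decision rule can be written as a small helper. This Python sketch is illustrative only; `significance_label` is a made-up name, and the thresholds simply follow the conventions listed above.

```python
def significance_label(p):
    """Map a p-value to the conventional marker used in results tables."""
    if p < 0.001:
        return "***"  # significant at the 0.1% level
    if p < 0.01:
        return "**"   # significant at the 1% level
    if p <= 0.05:
        return "*"    # significant at the 5% level
    return "NS"       # not significant


example_p_values = [0.20, 0.05, 0.03, 0.004, 0.0002]
labels = [significance_label(p) for p in example_p_values]
print(labels)
```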
Pearson Chi-Square Test…
• From the above output, there was a significant
association between sex of household head and
access to credit (χ² = 37.17, P < 0.001)
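For readers who want to see how a Pearson chi-square statistic is built up, here is a minimal Python sketch for a 2x2 table. The counts are hypothetical (they are not the data behind the result reported above), and the closed-form p-value used here is valid only for one degree of freedom.

```python
import math

# Hypothetical 2x2 table: rows = sex of household head,
# columns = access to credit (yes, no). Counts are made up.
observed = [
    [60, 20],  # male-headed households
    [25, 45],  # female-headed households
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 1 only, the upper-tail probability has this closed form.
p_value = math.erfc(math.sqrt(chi_square / 2))

print(round(chi_square, 2), df, p_value)
```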
• Procedure in SPSS
Analyze → Compare Means → One-Sample T Test →
select the variable under study (variable of interest)
and move it to the Test Variable(s) box → specify the
test value in the Test Value box → OK
Activity
Using data in the file “SPSS working file1”
test the null hypothesis that average annual
household income (in ‘000) before the
project (Income1) for rural household of
Dodoma is 1,000
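The calculation behind a one-sample t-test can be sketched in a few lines of Python. The incomes below are hypothetical and are not the values in “SPSS working file1”; only the test value of 1,000 comes from the activity.

```python
import math
import statistics

# Hypothetical sample of annual household incomes ('000); NOT the course data.
incomes = [850, 1200, 950, 1100, 780, 1300, 990, 1050]
test_value = 1000  # hypothesised population mean from the activity

n = len(incomes)
sample_mean = statistics.mean(incomes)
standard_error = statistics.stdev(incomes) / math.sqrt(n)

# t = (sample mean - hypothesised mean) / standard error of the mean
t_statistic = (sample_mean - test_value) / standard_error
degrees_of_freedom = n - 1

print(round(t_statistic, 3), degrees_of_freedom)
```

The p-value is then read from the t distribution with n - 1 degrees of freedom; SPSS reports it in the Sig. (2-tailed) column.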
One sample t-test…
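Simple linear regression
• The simple linear regression model can be written as
$Y = \alpha + \beta X + \varepsilon$
• Where
– $Y$ is the dependent variable
– $X$ is the independent (explanatory) variable
– $\alpha$ is the intercept on the Y axis (a regression constant)
– $\beta$ is the regression coefficient (slope)
– $\varepsilon$ is the error term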
• Example: Suppose we want to establish if there is a
relationship between annual household income before
project (INCOME1) as a dependent variable and farm
size (FSIZE) as an independent variable
Simple linear regression…
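• Regression model
$\text{INCOME1} = \alpha + \beta\,\text{FSIZE} + \varepsilon$
• Where
– INCOME1 = Annual household income before project
(‘000Tsh)
– FSIZE = Farm size (acres)
– $\alpha$ = Regression constant
– $\beta$ = Regression coefficient
– $\varepsilon$ = Error term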
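A minimal sketch of how the regression constant and regression coefficient are estimated by ordinary least squares is given below. The five observations are hypothetical and chosen only to keep the arithmetic easy to follow; `alpha` and `beta` are names picked here for the constant and the coefficient.

```python
# Hypothetical observations: farm size (acres) and annual income ('000 Tsh).
fsize = [1.0, 2.0, 3.0, 4.0, 5.0]
income = [300.0, 520.0, 690.0, 910.0, 1080.0]

n = len(fsize)
mean_x = sum(fsize) / n
mean_y = sum(income) / n

# Least-squares estimates:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(fsize, income))
s_xx = sum((x - mean_x) ** 2 for x in fsize)

beta = s_xy / s_xx                # regression coefficient (slope)
alpha = mean_y - beta * mean_x    # regression constant (intercept)

print("Constant:", alpha)
print("Coefficient:", beta)
```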
Simple linear regression…
• Procedures for simple linear regression in SPSS
Analyze → Regression → Linear → choose the dependent
(criterion) variable and move it to the Dependent box →
choose the independent variable and move it to the
Independent(s) box → OK
• SPSS outputs
Simple linear regression…
ANOVA (b)
Model 1       Sum of Squares   df    Mean Square    F         Sig.
Regression    52780721         1     52780720.96    136.516   .000 (a)
Residual      57220713         148   386626.438
Total         1.10E+08         149
a. Predictors: (Constant), farm size (acres)
b. Dependent Variable: annual income before project('000)
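The F value and R-Square can be recovered directly from the sums of squares in the ANOVA table. The Python sketch below only redoes that arithmetic, using the figures reported above.

```python
# Values taken from the ANOVA table above
ss_regression = 52780721
ss_residual = 57220713
df_regression = 1
df_residual = 148

ss_total = ss_regression + ss_residual

# Mean square = sum of squares / degrees of freedom
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_statistic = ms_regression / ms_residual  # about 136.5
r_squared = ss_regression / ss_total       # about 0.48

print(round(f_statistic, 3), round(r_squared, 2))
```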
Simple linear regression…
Statistical tests in linear regression
• ANOVA (F-test)
– Mostly useful in multiple linear regression
– Tests the significance of the model
– Divides the variation into: variation due to
regression (Xs entered into the equation) and
variation due to residuals
• R-Square (goodness of fit or robustness of the model)
– Indicates the percentage variation in Y that is
explained by variation in Xs
• t-test and regression coefficient
– Tests the significance of Xs in influencing Y
Interpreting simple linear regression results
• Positive regression coefficient (β) implies that there is
positive relationship between X and Y (i.e. increase in X
is associated with increase in Y)
• Negative regression coefficient (β) means that there is
negative relationship between X and Y (i.e. increase in X
is associated with decrease in Y)
• Results indicate farm size was a good predictor of
annual household income before project
• About 48% of the variation in income was explained by
variation in farm size (R² = 0.48)
• Farm size was significantly positively associated with
annual household income before project (t = 4.108, P <
0.001)
Multiple linear regression analysis
• Multivariate analysis with more than one independent
variable (Xs)
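• The multiple linear regression equation can be written as
$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \varepsilon$
• Where
– $Y$ = dependent variable
– $X_1, X_2, \dots, X_k$ = independent variables
– $\alpha$ = a regression constant
– $\beta_1, \beta_2, \dots, \beta_k$ = regression coefficients
– $\varepsilon$ = random error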
Multiple linear regression analysis..
• Example: Suppose we want to establish if there is relationship
between annual household income before project by a household
(INCOME1) as a dependent variable and age of respondent
(years) (AGE), education level (years in school) (EDUC2), number
of cattle owned (NCATTLE) and farm size (FSIZE) as independent
variables
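• The multiple linear regression model can be written as:
$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$
OR
$\text{INCOME1} = \alpha + \beta_1\,\text{AGE} + \beta_2\,\text{EDUC2} + \beta_3\,\text{NCATTLE} + \beta_4\,\text{FSIZE} + \varepsilon$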
Multiple linear regression analysis..
•Procedures for multiple linear regression
in SPSS
Independent variable                  B         Standard Error (SE)   t-value   Sig.
Constant                              -478.90   219.18                -2.19     0.031
Age (Years)                           -2.74     4.34                  -0.63     0.529
Education level (Years in school)     114.01    18.02                 6.33      0.000
Number of cattle owned                9.67      3.23                  3.00      0.003
Farm size (acres)                     96.61     15.69                 6.16      0.000
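Two things can be read off this coefficients table: each t-value is the coefficient B divided by its standard error, and the B column gives the fitted equation. The Python sketch below checks the first and uses the second to predict income for a hypothetical household (the household's characteristics are invented for illustration). Because B and SE are rounded in the table, recomputed t-values may differ from the reported ones in the last decimal place.

```python
# B and SE copied from the coefficients table above
coefficients = {
    "Constant": (-478.90, 219.18),
    "AGE": (-2.74, 4.34),
    "EDUC2": (114.01, 18.02),
    "NCATTLE": (9.67, 3.23),
    "FSIZE": (96.61, 15.69),
}

# Each t-value is the coefficient divided by its standard error.
t_values = {name: b / se for name, (b, se) in coefficients.items()}

# Hypothetical household: 40 years old, 7 years of schooling,
# 5 cattle and a 3-acre farm.
household = {"AGE": 40, "EDUC2": 7, "NCATTLE": 5, "FSIZE": 3}

# Predicted income ('000) = constant + sum of (coefficient * value)
predicted_income = coefficients["Constant"][0] + sum(
    coefficients[name][0] * value for name, value in household.items()
)

print({name: round(t, 2) for name, t in t_values.items()})
print("Predicted income ('000):", round(predicted_income, 2))
```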
Thank you for your attention