
DATA PROCESSING AND ANALYSIS
Topic outline
•Data processing
– Overview
– Quantitative data
– Qualitative data
•Data analysis
– Introduction
– Descriptive
– Inferential
– Qualitative data analysis
Reading Materials
•Kothari, C.R. (2004):
Chapter 7

•Bhattacherjee, A. (2012):
Chapters 14 and 15
What is data processing?
• Data processing is the preparation of data for
analysis: it involves converting raw data into a form
amenable to analysis
• For quantitative data, processing involves:
– Questionnaire checking: completeness and
quality of answers in the field and after data
collection
– Editing: examining collected raw data to
detect errors and omissions, to ensure that
collected data are accurate, consistent,
uniformly entered and complete
Data processing…
• Coding: assigning numerals or other symbols to the
answers by categorizing them
• In coding, the categories or classes should be:
– Appropriate to the research question and objectives
– Exhaustive: each data item must have a category
– Mutually exclusive: each answer should be placed in
only one category
• Coding involves preparation of a codebook to guide
data entry
• SPSS is a popular statistical package for data analysis
in the social sciences
• Others are STATA, SAS, Epi Info, Excel, etc
Data processing…
• Structure of the codebook in SPSS
– Variable name: Variable name shortened
(abbreviated)
– Variable type: numeric or string
– Variable label: Variable name written in
expanded form
– Value labels: what the numbers (e.g. 1, 2, 3)
stand for (for categorical variables)
– Other considerations: width, decimals, columns,
alignment
– Measure (nominal, ordinal or scale)
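Outside SPSS, the same codebook logic can be scripted. Below is a minimal sketch in Python/pandas; the variable names and value labels are hypothetical, not taken from the working file.

```python
import pandas as pd

# Hypothetical codebook: variable name -> value labels (categorical variables)
codebook = {
    "sex":     {1: "Male", 2: "Female"},
    "marital": {1: "Single", 2: "Married", 3: "Divorced", 4: "Widowed"},
}

df = pd.DataFrame({"sex": [1, 2, 2, 1], "marital": [2, 1, 4, 3]})

# Attach value labels while keeping the numeric codes for analysis
for var, labels in codebook.items():
    df[var + "_label"] = df[var].map(labels)

print(df)
```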
Data processing…
• In SPSS, a codebook is created using the
VARIABLE VIEW window
Data processing…
• Entering the data into the computer statistical
package
• Data cleaning or verification: done after data
entry exercise is completed
• Data cleaning involves correcting errors committed
during entry, such as wrong codes, illogical entries
(e.g. males recorded as having babies or
miscarriages), inconsistencies, etc
• Preliminary descriptive statistics can be used to
clean data
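A few of these cleaning checks can be automated. The sketch below assumes Python/pandas and hypothetical column names ("sex" coded 1 = male, "num_births"):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3],
                   "sex": [1, 2, 9],          # 9 is a wrong code
                   "num_births": [2, 1, 0]})

# Illogical errors: e.g. males (sex == 1) recorded as having given birth
illogical = df[(df["sex"] == 1) & (df["num_births"] > 0)]
print(illogical)   # review these cases against the original questionnaires

# Wrong codes: sex must be 1 or 2
bad_codes = df[~df["sex"].isin([1, 2])]
print(bad_codes)
```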
Data processing…
• Data entry is done using DATA VIEW window
Data Analysis: Introduction
• Data analysis is the computation of indices or
measures to show patterns or relationships
between data groups for meaningful
interpretation and discussion of findings
• Involves estimating the values of unknown
parameters of the population to understand the
information or message contained in raw data,
test hypotheses and draw inferences
• Data analysis is the most skilled task in the
research process; it requires critical
examination of the processed data
Why data analysis?
• Summarize collected raw data into understandable and
meaningful information
• Use statistics to make descriptions of a phenomenon
• Identify causal factors for a particular phenomenon
• Make reliable predictions or inferences from observed
data, e.g.
• What will be the demand for a product or service in
the next five years?
• What will be the rate of population growth or
industrial production in the next 10 years?
• To answer these questions you need knowledge useful
for predictions
Why data analysis?...
• Understand the distribution of variables, relationship or
association among variables, differences among
groups, etc
• To make proper estimations or generalizations from
sample results. Sample statistics (computed indices)
may give good estimates of particular population
parameters
• Thus, statistical inferences enable researchers to
evaluate the accuracy of the estimates made
• Hypotheses testing is useful for assessing the
significance of specific sample results/findings under
the studied population conditions
Data analysis: key considerations
• Nature of objectives, research questions or hypothesis
• Nature of variable(s) under consideration
(measurement levels)
• Appropriate method of presenting the analysed data
• Depth of analyses required: what does the researcher
want to see in the data - distribution of variables,
relationship, association or differences?: descriptive or
inferential analysis?
• Purpose of analysis (i.e. comparison, establishing
correlation or relationship among variables?)
• Number of variables you want to deal with at a time:
univariate, bivariate or multivariate?
Order and types of data analysis
• Start analysing background characteristics such
as age, sex, education, marital status, etc
• Then, analyse other variables as required by
each specific objective, research question or
hypothesis
• Two main types of quantitative data analysis:
descriptive statistics and inferential statistics
• Another way is to classify data analysis based on
the number of variables as univariate, bivariate
or multivariate analysis
Descriptive statistics
• Descriptive statistics are summary measures aiming at
– Describing patterns and distribution of one or more
variables involved in the study
– Showing size and shape of the distribution
– Getting quick insight on how data behave
• However, descriptive statistics do not allow us to
– Make conclusions beyond the data we have analyzed
– Reach conclusions regarding any hypotheses
• Two types of descriptive statistics
– Measures of central tendency (mean, mode, median)
– Measures of dispersion or spread (range, quartiles,
variance, standard deviation)
Measures of Central Tendency
• Measure of central tendency (also called
statistical average) is a single value that
describes a set of data by identifying the central
position within that set of data
• It is a value around which all observations have a
tendency to cluster
• Such value is considered the most representative
figure of the entire dataset
• The MEAN, MEDIAN and MODE are all valid
measures of central tendency, but under
different conditions
Measures of central tendency…
• Arithmetic Mean: Sum of all the scores in the
data set divided by the number of observations
in the data set
• So, if we have n observations in a data set with
scores x1, x2, ..., xn, then the sample mean, denoted
x̄, is: x̄ = (x1 + x2 + ... + xn) / n
• Appropriate when dealing with continuous data,
measured at interval or ratio scale
• The mean minimises error in the sense that it has
the lowest total squared deviation from all values
in the data set
Measures of central tendency…
• However, the mean is susceptible to the influence of
outliers i.e. extreme values in the dataset
• For example, consider the wages of staff at a factory
Staff    1    2    3    4    5    6    7    8    9   10
Salary  15k  18k  16k  14k  15k  15k  12k  17k  90k  95k

• The mean salary of the ten staff is $30.7k
• However, inspecting the raw data suggests that the
MEAN does not accurately reflect the typical salary
of a worker, as most workers have salaries in the
$12k to $18k range
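The effect is easy to verify in code. A small sketch using Python's statistics module on the wage data above:

```python
import statistics

wages = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]   # salaries in $'000

print(statistics.mean(wages))    # 30.7  -> pulled up by the two outliers
print(statistics.median(wages))  # 15.5  -> closer to a typical worker
```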
Measures of central tendency…
• In this situation we need a better measure of central
tendency: the MEDIAN, because it is less affected by
outliers and skewed data
• The median is the middle score for a set of data that has
been arranged in order of magnitude i.e. positional
average
• Suppose we have the data below
56 55 89 56 35 14 56 55 55 87 92
• First, we need to rearrange that data into order of
magnitude (smallest first):
14 35 55 55 55 56 56 56 87 89 92
• Our median is the middle value which is 56
Measures of central tendency…
• MODE is the most frequently occurring observation
in the data set
• The mode is applicable in categorical data where we
wish to know which is the most common category
• It represents the highest bar in a bar chart or
histogram
• You can, therefore, sometimes consider the mode as
being the most popular option
• However, the problem with the MODE is that it is not
always unique, which leaves us with problems when
two or more values share the highest frequency
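A short sketch of finding the mode(s) in Python; collections.Counter handles the multimodal case gracefully (the responses shown are made up):

```python
from collections import Counter

responses = ["maize", "rice", "maize", "beans", "maize", "rice"]

counts = Counter(responses).most_common()        # sorted by frequency
modes = [value for value, c in counts if c == counts[0][1]]
print(modes)   # ['maize'] -- the most common category
```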
When to use Mean, Median or Mode
Type of Variable / Scale       Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median
Measures of Dispersion
• Measures of dispersion or spread are used to describe
the scatter or variability of observations in a
dataset
• Usually used in conjunction with measures of central
tendency to provide an overall description of a dataset
• Measures of dispersion give a clue as to how well the
mean, for example, represents the data
• If the spread of values is large, then the mean is not a
good representative of the data, and vice versa
• Important measures of dispersion are range, mean
deviation and standard deviation
Measures of dispersion…
• Range
– Simplest measure of dispersion
– Difference between the highest and lowest values in a dataset
– Used as a rough measure of variability
– However, it is affected by fluctuations in sampling
• Quartile or interquartile: shows the spread of a data
set by breaking the data set into quarters, just like the
median breaks it in half
• Quartiles are a useful measure of spread because they are less
affected by outliers or a skewed data set than the equivalent
measures of mean and standard deviation
• Quartiles are often reported along with the median when dealing
with skewed and/or data with outliers
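These measures can be computed directly; a minimal sketch with numpy, reusing the wage data from the earlier example:

```python
import numpy as np

x = np.array([12, 14, 15, 15, 15, 16, 17, 18, 90, 95])   # wages in $'000

print(x.max() - x.min())              # range
q1, q3 = np.percentile(x, [25, 75])
print(q3 - q1)                        # interquartile range
print(x.var(ddof=1))                  # sample variance
print(x.std(ddof=1))                  # sample standard deviation
```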
Descriptive statistical analysis
• Type of variable or level of measurement is a key
determinant for the choice of statistical method to use
in data analysis
• For descriptive statistical analysis, we analyze for
– Frequencies and percentages for categorical variables
– Mean, mode, median, minimum value, maximum value,
variance and standard deviation for continuous
variables
– Continuous variables can also be analyzed for
skewness, kurtosis, inter-quartile range, and Box-
and-Whisker plots (box-plots)
Descriptive statistical analysis…
• Measures of central tendency and dispersion
– Applicable to continuous data (i.e. interval or ratio
scale)

• Procedure in SPSS
Analyze → Descriptive statistics → Descriptives →
select variable(s) of interest and put them in the
Variable(s) box → Options (select mean and other
statistics you want) → OK
• Note: N, Mean, SD, Min and Max are generated by
default
Activity
•Using “SPSS working file1”, compute
descriptive statistics for the variables age,
educ2, hhsize, income1, income2, fsize,
ncattle.
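For readers working in Python rather than SPSS, a rough equivalent of this activity is sketched below; the CSV export of the working file and its column names are assumptions:

```python
import pandas as pd

# Assumed CSV export of "SPSS working file1"
df = pd.read_csv("spss_working_file1.csv")

cols = ["age", "educ2", "hhsize", "income1", "income2", "fsize", "ncattle"]
print(df[cols].describe())                        # N, mean, SD, min, max, quartiles
print(df[cols].agg(["median", "skew", "kurtosis"]))
```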
Descriptive statistical analysis…
• Frequency distribution and percentages
– Applicable to categorical data (i.e. nominal and
ordinal) e.g. categorized or coded data

• Procedure in SPSS
Analyze → Descriptive statistics → Frequencies →
select variable(s) of interest and put them in the
Variable(s) box → OK
Activity
• Using “SPSS working file1”, compute descriptive
statistics (frequencies and percentages) for the variables
age2, sex, marital, educ.
• Note 1: The columns for “Percent” and “Valid percent” have
identical values because there are no missing cases. If
there were missing cases, the values in the two columns
would differ. Always use the column for VALID
percent!
• Note 2: SPSS usually produces one table for each variable,
so you end up with many tables in the output.
• It is good practice to summarize related information in
one table using MS Word, to reduce the number of tables
and for clarity
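The Percent vs Valid percent distinction in Note 1 can be illustrated in pandas with a made-up variable containing one missing case:

```python
import pandas as pd

sex = pd.Series([1, 2, 2, 1, 2, None])                        # None = missing case

print(sex.value_counts(dropna=False))                         # frequencies incl. missing
print(sex.value_counts(dropna=False, normalize=True) * 100)   # "Percent"
print(sex.value_counts(normalize=True) * 100)                 # "Valid percent"
```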
Inferential statistics
• Inferential statistics are statistical procedures used to
reach conclusions about associations between variables
• As opposed to descriptive statistics, inferential statistics
are designed to
– Test hypotheses to determine whether there are
significant differences between groups or there is
significant association/ relationship among variables
– Determine whether independent variable(s) have an
effect (influence) on a dependent variable

• Help to draw inferences and conclusions
Types of Inferential Statistics
• Parametric tests
• Non-parametric tests
• Multivariate statistics
Parametric tests
• Parametric tests assume certain properties of the
population from which sample was drawn
• Random sampling: scores are obtained using a random
sample from the population
• Independence of observation: violation of this
assumption is very serious
• Normal distribution: population from which samples are
taken are normally distributed
• Homogeneity of variance: samples are obtained from
populations of equal variance, i.e. the variability of
scores in each group is similar
• Level of measurement: the dependent variable is
interval or ratio
Examples of parametric tests
• T-test
– One sample t-test
– Independent samples t-test
– Paired samples t-test

• Analysis of variance (F-test)
– One-way ANOVA
– Two-way ANOVA
– Multi-way ANOVA
Non-parametric tests
• Non-parametric tests are used in situations when the
researcher cannot or does not want to make the
assumptions for parametric tests
• Non-parametric techniques are ideal for use when you
have data measured on
– Nominal (categorical)
– Ordinal (ranking order)
• They are less sensitive than parametric tests and
may therefore fail to detect differences between
groups that actually exist
• Useful when dealing with small samples and when data
do not meet the assumptions of parametric tests
Examples of non-parametric tests
• Chi-Square Test
– Test for goodness of fit
– Test for independence
• Mann Whitney U Test
– Tests the difference between two groups on a
continuous measure
– It is an alternative to the t-test for independent
samples (independent samples t-test)
• Wilcoxon Signed Rank Test
– Used with repeated measures
– It is an alternative to the repeated measures t-test
Examples of non-parametric tests
• Kruskal-Wallis Test
– Is an alternative to a one-way between-groups
ANOVA
– It allows you to compare scores on a continuous
variable for three or more groups
• Friedman Test
– Is an alternative to the one way repeated measures
ANOVA
– Applicable when you measure the same sample of
subjects or cases at three or more points in time,
or under three different conditions
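As an illustration, two of these tests in Python with scipy; the scores are made up:

```python
from scipy import stats

group_a = [3, 5, 4, 6, 7, 5]
group_b = [8, 9, 7, 6, 9, 8]
group_c = [4, 6, 5, 7, 5, 6]

# Mann-Whitney U: difference between two independent groups
u, p = stats.mannwhitneyu(group_a, group_b)
print(u, p)

# Kruskal-Wallis: three or more independent groups
h, p = stats.kruskal(group_a, group_b, group_c)
print(h, p)
```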
What is a hypothesis?
•A hypothesis is a claim, proposition or guess
concerning a population parameter; this claim
can be true or not true (subject to the results
of statistical tests)
•It is a tentative prediction of results of a
study or experiment
•A hypothesis is a predictive statement about
the outcome of a study
Hypothesis testing can involve…
• Comparison of groups with regard to their mean or
proportion
– In comparing means: hypothesis testing can be for a
single mean, difference between two means, and
difference between more than two means (t-test,
ANOVA)
– In comparing proportions: hypothesis testing can be
for a single proportion, and difference between two
proportions (t-test, z-test)
• Establishing if there is relationship or association
between variables
– Chi-square test and associated tests such as Cramér’s
V, Fisher’s exact test, etc for categorical variables
Hypothesis testing can involve…
• Pearson’s correlation analysis and Spearman’s rank
correlation analysis for continuous variables
• Regression analysis, which can be simple or multiple
depending on number of independent variables.
• Regression analysis can also be classified depending on
nature of a dependent variable as
– Linear regression for continuous dependent variable
– Binary logit or probit regression for binary dependent
variable
– Ordinal logistic regression for ordinal dependent
variable
– Multinomial logistic regression for multinomial (nominal
with more than two categories) dependent variable
Inferential statistics
• Testing association among variables
– Categorical variables: Pearson Chi-square test
– Continuous variables: Pearson and Spearman’s Rank
correlation tests
• Testing hypothesis about population mean
– One sample t-test
– Independent samples t-test
– Paired samples t-test
• Comparing more than two population means
– One way ANOVA
– Two way ANOVA
• Testing relationship among variables: Regression and
Factor Analysis
Pearson Chi-Square Test
• Non-parametric test (distribution free) designed to
analyze group differences when the dependent variable
is measured at a nominal level
• It permits evaluation of both dichotomous independent
variables, and of multiple group studies
• Chi-square test involves establishing association
between two categorical variables through bi-variate
analysis
• It compares the observed frequencies in the categories
to frequencies expected by chance
Pearson Chi-Square Test…
• The test involves entering the processed data in rows
and columns (a contingency table); one may then test
the significance of the association using the χ2 test
• Useful in deciding to accept or reject the null (Ho)
hypothesis by comparing the calculated χ2 value and
the tabulated χ2 value
• Hypotheses for Pearson Chi-square test are:
– Ho: No association between a variable in row and a
variable in column in a contingency table
– H1: There is association between variable in row and
variable in column
• Cramér’s V is used to measure the strength of the association for chi-square
Assumptions of Chi-Square Test
• Data in the cells should be frequencies or counts, not
percentages or some other transformation of the data
• Categories of the variables are mutually exclusive: a
particular subject fits into one and only one category of
each of the variables
• The study groups must be independent. Different tests
must be used if the two groups are related
• There are two variables, and both are measured as
categories, usually at the nominal or ordinal level
• Cochran’s rule says that no expected cell frequency
should be less than one, and no more than 20% should
be less than five
Pearson Chi-Square Test…
• Procedure for Chi-square test in SPSS
Analyze → Descriptive statistics → Crosstabs →
select a variable of interest and put it in the Row(s) box →
select a variable of interest and put it in the Column(s) box →
click the Cells button and specify how percentages should be
computed (within columns or within rows? Hint: compute %
within the groups you want to compare) → Continue →
click the Statistics button and choose Chi-square →
Continue → OK
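An equivalent test in Python with scipy is sketched below; the 2×2 counts are made up, not taken from the working file:

```python
from scipy.stats import chi2_contingency

# Rows: sex (male, female); columns: access to credit (yes, no)
table = [[55, 33],
         [ 9, 53]]

chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)
print(expected)   # check expected counts against Cochran's rule
```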
Activity
Using the data in the file “SPSS working
file1”, test if there is a significant association
between “sex” (independent variable)
and “access to credit” (dependent
variable)
Interpreting Chi-Square Test…
• Chi-square value: the larger the χ2 value, the
stronger the evidence of association
• Degrees of freedom: (r − 1)(c − 1) for a table
with r rows and c columns
• P-values
• Frequencies
Interpreting the p-values
• P> 0.05 = Non-significant at P >0.05 (i.e. null hypothesis
is accepted, therefore, there is NO significant
difference/relationship) (i.e. NS)
• P ≤ 0.05 = Significant at P < 0.05 or 5% (null hypothesis
is rejected at P< 0.05, therefore, there is significant
difference/relationship at P< 0.05) (i.e. *)
• P < 0.01 = Significant at P < 0.01 or 1 % (null hypothesis
is rejected at P< 0.01, therefore, there is significant
difference/relationship at P< 0.01) (i.e. **)
• P < 0.001 = Significant at P < 0.001 or 0.1% (null
hypothesis is rejected at P< 0.001, therefore, there is
significant difference/relationship at P< 0.001) (i.e. ***)
Pearson Chi-Square Test…
• From the above output, there was a significant
association between sex of household head and
access to credit (χ2 = 37.17, P < 0.001)
• Male-headed households were more likely to
access credit than female-headed households
• The majority of credit receivers were males (85.9%),
compared to non-credit receivers, most of whom
were females (62.1%)
Activity
• Using data in the file “SPSS working file1”, test
if there is a significant association between
engagement in off-farm activities (a dependent
variable) and sex, marital status and district of
residence by a respondent (independent
variables)

• Summarize your output and interpret your results
Correlation
• Correlation is a bi-variate analysis that measures the
strength of association between two variables and the
direction of the relationship
• Strength or intensity means closeness of relationship
among pairs of variables, and the value of the
correlation coefficient varies between +1 and -1.
• A value of ±1 indicates a perfect degree of association
between the two variables, and as the value goes
towards 0, the relationship between the two variables
will be weaker
• The direction of the relationship is indicated by
coefficient sign: + sign indicates a positive relationship
and - sign indicates a negative relationship
Correlation…
• In a study, variables can be related in three ways
– Positively: an increase in one variable is
associated with an increase in the other
variable
– Negatively: an increase in one variable is
associated with a decrease in the other
variable
– Not related at all: a change in one variable does
not cause any change in the other variable
• Four types of correlation tests: Pearson, Kendall
rank, Spearman’s Rank test, Point-Biserial correlation
Pearson Correlation
• Pearson correlation coefficient is a measure of the
strength of a linear association between two variables
— denoted by r.
• Assumptions of Pearson Correlation
– Both variables should be normally distributed
– Both variables should be continuous, i.e. interval or
ratio
– There should be no significant outliers
– The two variables have a linear relationship
– The observations are paired observations
Pearson’s Correlation…
• Hypotheses under this test are;
– Ho: There is no significant correlation between two
variables under study
– H1: There is significant correlation between two
variables under study
• Or alternative hypothesis could be
– H1: There is significant positive correlation between
two variables under study i.e. correlation coefficient
is significantly above zero or
– H1: There is significant negative correlation between
two variables under study i.e. correlation coefficient
is significantly below zero
Pearson’s correlation coefficient (r)
• According to Cohen and Holliday (1982)
– If r ≤ 0.19 = very low correlation
– If r = 0.20 – 0.39 = low correlation
– If r= 0.40 – 0.69 = modest correlation
– If r = 0.70 – 0.89 = high correlation
– If r=0.90 – 1 = very high correlation

• Procedure for Pearson’s correlation analysis in SPSS
Analyze → Correlate → Bivariate → select variables
of interest and put them in the Variables box →
select Pearson → OK
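The same analysis in Python with scipy, using made-up paired data:

```python
from scipy.stats import pearsonr

farm_size = [1.0, 2.5, 3.0, 4.2, 5.1, 6.0]     # acres (made up)
income    = [400, 700, 900, 1300, 1500, 1900]  # '000 (made up)

r, p = pearsonr(farm_size, income)
print(r, p)   # compare r against the Cohen and Holliday bands above
```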
Activity
Using data in the file “SPSS working file1”, test
the null hypothesis that annual household
income before project (dependent variable)
was not significantly correlated with age of
respondent, farm size, and number of cattle
owned (independent variables)
Pearson’s correlation coefficient…
• Results show that annual household income
before project was significantly correlated with
age of household head (r = 0.203, P = 0.013),
farm size ( r = 0.693, P = 0.000), and number of
cattle owned (r = 0.739, P = 0.000)
OR
• Results reveal that annual household income
before project was significantly correlated with
age of household head (r = 0.203, P <0.05), farm
size ( r = 0.693, P < 0.001), and number of cattle
owned (r = 0.739, P <0.001)
t-test
• t-test is a type of inferential statistic used to determine
if there is a significant difference between the means of
two groups
• t-test looks at the t-statistic, the t-distribution values,
and the degrees of freedom to determine the
probability of difference between two sets of data
• Assumptions about t-test
– Sample distribution is normally distributed
– Data are measured at interval level
– Data comes from large sample size
• Types of t-test
– One sample t-test, independent samples t-test
(independent means t-test) and paired samples t-test
One sample t-test
• Tests if population mean is different from a
particular specified value or standard value
• Examples
– One wants to know if the per capita income
of a particular region is the same as the
national average
– You want to know if average distance to
water sources in a district differs from
national policy standard
• Suitable for continuous data, i.e. data from which
a mean can be computed
One sample t-test…
• In one-sample t-test, hypotheses would be:
– Ho: Population mean is equal to…
– Ha: Population mean is not equal to…

• Procedure in SPSS
Analyze → Compare means → One-sample t-test →
select the variable under study (variable of
interest) and put it in the Test variable(s) box →
specify the test value in the Test value box → OK
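A Python sketch of the one-sample t-test with scipy; the income figures are made up, and the test value follows the activity below:

```python
from scipy.stats import ttest_1samp

income1 = [820, 1150, 990, 1400, 760, 1320, 1010, 1180]   # in '000 (made up)

t, p = ttest_1samp(income1, popmean=1000)   # test value = 1000
print(t, p)
```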
Activity
Using data in the file “SPSS working file1”
test the null hypothesis that average annual
household income (in ‘000) before the
project (Income1) for rural households of
Dodoma is 1,000
One sample t-test…

One-Sample Test (Test Value = 1000)

                                       t      df   Sig. (2-tailed)  Mean Difference  95% CI of the Difference
                                                                                     Lower      Upper
annual income before project ('000)   2.697  149   .008             189.2200         50.5922    327.8478
Independent samples t-test
• Compares means between two unrelated groups
on the same continuous, dependent variable
• Used to answer the question: are the means for
two groups statistically different?
• The assumption is that the two samples are from
two independent populations
Assumptions of independent
samples t-test
• Dependent variable should be measured on a
continuous scale (i.e. interval or ratio level)
• Independent variable should consist of two
categorical, independent groups
• There should be no significant outliers
• Dependent variable should be approximately
normally distributed for each group of the
independent variable
Independent samples t-test…
Analyze → Compare means → Independent samples
t-test → select a response/criterion variable (the
dependent variable) and put it in the Test variable(s)
box → select a comparison-group variable
(independent variable) and put it in the Grouping
variable box → click the Define groups button and
specify the groups to be compared → Continue → OK
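A sketch of the same test in Python with scipy; the two groups are made up:

```python
from scipy.stats import ttest_ind

male_headed   = [900, 1200, 1100, 1400, 1000, 1250]   # income, '000 (made up)
female_headed = [700,  850,  900, 1050,  800,  950]

# equal_var=False gives Welch's t-test, robust to unequal variances
t, p = ttest_ind(male_headed, female_headed, equal_var=False)
print(t, p)
```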
Independent samples t-test…
• Hypotheses in this test are as follows:
– Ho: There is no significant difference in means
for the two groups
– Ha: There is a significant difference in means
for the two groups
• Example 8
– Using data in the file “SPSS working file1” test the
null hypothesis that average annual household
income (in ‘000) before a particular project
(Income1) for male headed households is the
same as that of female headed households
Independent samples t-test…
Activity
• Using the file “SPSS working file1”, compare
the male headed and female headed
households on mean age and mean annual
income after project

• Summarize your output and interpret your results
Paired samples t-test
• Used to compare two population means where
you have two samples in which observations in
one sample can be paired with observations in the
other sample
• Examples
• Before-and-after observations on the same
subjects
• Comparison of two different methods of
measurement or two different treatments
where the measurements/treatments are
applied to the same subjects
Assumptions of Paired Samples
t-test
• The dependent variable must be continuous (i.e.
ratio or interval)
• The observations are independent of one
another
• The dependent variable should be
approximately normally distributed
• The dependent variable should not contain any
outliers
Paired samples t-test…
• Hypotheses in this test are:
– Ho: Mean (average) for period1 is NOT significantly
different from mean for period2
– Ha: Mean (average) for period1 is significantly
different from that of period2
• Suppose someone is interested in comparing average
annual household income before and after
implementation of a particular project for improving
income of smallholder farmers
• Using the file “Workshop SPSS working file1”, test the
null hypothesis that average annual household income
(in ‘000) before project (Income1) is the same as that
after project (Income2)
Paired samples t-test
• Procedure in SPSS
Analyze → Compare means → Paired samples t-test →
select the two variables for comparison (click the
first variable and then the second variable) and put
them in the Paired variables box → OK
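A Python sketch of the paired samples t-test on made-up before/after incomes for the same households:

```python
from scipy.stats import ttest_rel

income1 = [800, 950, 700, 1100, 900]     # before project, '000 (made up)
income2 = [950, 1000, 850, 1300, 980]    # after project, same households

t, p = ttest_rel(income1, income2)
print(t, p)
```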
Paired samples t-test…
Analysis of Variance (ANOVA)
• Tests concerned with testing equality of means for
more than two groups (i.e. comparing more than two
groups)
• Statistical tests under this category use F-test
• Dependent variable should be interval or ratio data type
• Independent variable should be categorical with more
than two levels/groups for comparison
• The simplest test under this class is One-way Analysis of
Variance (ANOVA I)
• Other ANOVAs include Two-way ANOVA, Latin Square
Design, Split Plot Design, Factorial Design, and ANOVA
for repeated measures
One- Way ANOVA (ANOVA I)
• Assumes response variable is influenced by ONE factor
• A factor has several levels or categories (more than
two) that entail groups for comparison
• These levels or categories are sometimes called samples
or treatments. The test statistic is F
• Procedure in SPSS
Analyze → Compare Means → One-way ANOVA →
select a response variable, i.e. the dependent
variable (criterion variable), and put it in the
Dependent list box → select a comparison-group
variable (independent variable) and put it in the
Factor box → Options → select Descriptives →
Continue → OK
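A minimal Python sketch of a one-way ANOVA across three hypothetical districts:

```python
from scipy.stats import f_oneway

district1 = [900, 1100, 1000, 950]     # income, '000 (made up)
district2 = [700, 800, 750, 820]
district3 = [1200, 1300, 1150, 1250]

f, p = f_oneway(district1, district2, district3)
print(f, p)   # if significant, follow up with post-hoc comparisons
```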
Assumptions of One-Way
ANOVA
• Dependent variable should be measured on a
continuous scale (i.e. interval or ratio level)
• Independent variable should consist of more
than two categorical, independent groups
• There should be no significant outliers
• Dependent variable should be approximately
normally distributed for each group of the
independent variable
• There needs to be homogeneity of variances
One- Way ANOVA (ANOVA I)
• Hypotheses in this test are as follows:
– Ho: There is no significant difference in averages
(means) for different groups
– Ha: At least one pair of means differ significantly

• Example: Suppose someone wants to compare


average annual household income for three districts
– Ho: There is no significant difference among three
districts on average (mean) annual household
income.
– Ha: At least one pair of means differ significantly
Activity
•Using data in ‘SPSS working file1’ test
the hypothesis that average annual
household income (in ‘000) before project
(Income1) is the same for all three districts
under study, versus the alternative hypothesis
that there are differences
One-Way ANOVA (ANOVA I)…
Activity
• Using the same file (SPSS working file1),
compare the districts on average annual
household income after project, farm size, and
number of cattle owned
• Summarize your output and interpret your
results
Two-way ANOVA (ANOVA II)…
Testing relationship among variables
• Regression analysis is used to establish relationship
among variables (using an equation)
• In regression analysis, one must have a dependent
variable (Y) and one or more independent variables (Xs)
• Regression analysis can be:
– Simple regression: one independent variable
– Multiple regression: more than one independent
variable
• Regression can also be:
– Linear when the dependent variable (Y) is continuous
(interval or ratio)
– Non-linear when a dependent variable is categorical
Simple linear regression
• Assumes one independent variable (X) influences the
dependent variable (Y)
• Simple linear regression equation or model:
Y = β0 + β1X + ε
• Where
– Y is the dependent variable
– X is the independent (explanatory) variable
– β0 is the intercept on the Y axis (a regression
constant)
– β1 is the regression coefficient and ε is the
error term
• Example: Suppose we want to establish if there is a
relationship between annual household income before
project (INCOME1) as a dependent variable and farm
size (FSIZE) as an independent variable
Simple linear regression…
• Regression model:
INCOME1 = β0 + β1·FSIZE + ε
• Where
– INCOME1 = Annual household income before project
('000 Tsh)
– FSIZE = Farm size (acres)
– β0 = Regression constant
– β1 = Regression coefficient
– ε = Error term
Simple linear regression…
• Procedure for simple linear regression in SPSS
Analyze → Regression → Linear → choose the
dependent (criterion) variable and put it in the
Dependent box → choose the independent variable
and put it in the Independent(s) box → OK
• SPSS outputs
Simple linear regression…

ANOVA(b)

Model           Sum of Squares   df    Mean Square    F         Sig.
1  Regression   52780721         1     52780720.96    136.516   .000(a)
   Residual     57220713         148   386626.438
   Total        1.10E+08         149

a. Predictors: (Constant), farm size (acres)
b. Dependent Variable: annual income before project ('000)
Statistical tests in linear regression
• ANOVA (F-test)
– Mostly useful in multiple linear regression
– Tests the significance of the model
– Divides the variation into: variation due to the
regression (the Xs entered into the equation) and
variation due to the residuals
• R-Square (goodness of fit or robustness of the model)
– Indicates the percentage variation in Y that is
explained by variation in Xs
• t-test and regression coefficient
– Tests the significance of Xs in influencing Y
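A sketch of a simple linear regression in Python (statsmodels) reporting the three statistics just described; the data are made up:

```python
import statsmodels.api as sm

fsize   = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]     # acres (made up)
income1 = [350, 500, 780, 900, 1150, 1300, 1450, 1700]  # '000 (made up)

X = sm.add_constant(fsize)            # adds the intercept b0
model = sm.OLS(income1, X).fit()

print(model.rsquared)                 # R-square: goodness of fit
print(model.fvalue, model.f_pvalue)   # ANOVA F-test: model significance
print(model.params)                   # b0 and b1
print(model.tvalues, model.pvalues)   # t-tests on the coefficients
```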
Interpreting simple linear regression results
• Positive regression coefficient (β) implies that there is
positive relationship between X and Y (i.e. increase in X
is associated with increase in Y)
• Negative regression coefficient (β) means that there is
negative relationship between X and Y (i.e. increase in X
is associated with decrease in Y)
• Results indicate farm size was a good predictor of
annual household income before project
• About 48% of variations in income were due to
variations in farm size (R2 =0.48)
• Farm size was significantly positively associated with
annual household income before project (t = 4.108, P <
0.001)
Multiple linear regression analysis
• Multivariate analysis with more than one independent
variable (Xs)
• Multiple linear regression equation can be written as
Y = β0 + β1X1 + β2X2 + ... + βkXk + ε
• Where
– Y = dependent variable
– X1, X2, ..., Xk = independent variables
– β0 = a regression constant
– β1, β2, ..., βk = regression coefficients
– ε = random error
Multiple linear regression analysis..
• Example: Suppose we want to establish if there is a
relationship between annual household income before
project (INCOME1) as a dependent variable and age of
respondent (years) (AGE), education level (years in school)
(EDUC2), number of cattle owned (NCATTLE) and farm size
(FSIZE) as independent variables
• The multiple linear regression model can be written as:
INCOME1 = β0 + β1·AGE + β2·EDUC2 + β3·NCATTLE + β4·FSIZE + ε
OR
Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε
Multiple linear regression analysis..
•Procedure for multiple linear regression
in SPSS
Analyze → Regression → Linear → choose the
dependent (criterion) variable and put it in the
Dependent box → choose the independent variables
and put them in the Independent(s) box → OK
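A rough Python equivalent of this procedure, assuming the working file has been exported to CSV with the variable names used in the example:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("spss_working_file1.csv")   # assumed CSV export

model = smf.ols("INCOME1 ~ AGE + EDUC2 + NCATTLE + FSIZE", data=df).fit()
print(model.summary())   # R-square, F-test, coefficients and t-tests
```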
Activity

• Using the data in SPSS working file1, run a
regression analysis to study the effect of age
(years) (AGE), education level (years in school)
(EDUC2), number of cattle owned (NCATTLE),
and farm size (FSIZE) of household heads on
household income before the project
(INCOME1).
Multiple linear regression analysis..

Independent variable               B        Standard    t-value   Sig.
                                            Error (SE)
Constant                           -478.90  219.18      -2.19     0.031
Age (years)                        -2.74    4.34        -0.63     0.529
Education level (years in school)  114.01   18.02       6.33      0.000
Number of cattle owned             9.67     3.23        3.00      0.003
Farm size (acres)                  96.61    15.69       6.16      0.000

R2 = 0.71; F-value = 91.07, P < 0.001

Multiple linear regression analysis..
Independent variable               B        Standard    t-value
                                            Error (SE)
Constant                           -478.90  219.18      -2.19*
Age (years)                        -2.74    4.34        -0.63 NS
Education level (years in school)  114.01   18.02       6.33***
Number of cattle owned             9.67     3.23        3.00**
Farm size (acres)                  96.61    15.69       6.16***

R2 = 0.71; F-value = 91.07, P < 0.001; NS = Non-significant;
* = Significant at P < 0.05; ** = Significant at P < 0.01;
*** = Significant at P < 0.001
Multiple linear regression analysis..
• Interpretation based on R-square, ANOVA and t-test
• Results indicate that independent variables included in
the model were good predictors of annual household
income before project
• About 71% of variations in annual household income
before project were due to variations in the independent
variables included in the model (R2 = 0.71)
• Independent variables included in the model
collectively had a significant influence on annual
household income before project (F = 91.07, P < 0.001)
Multiple linear regression analysis..
• Results for the t-test indicate annual household income
before project had a significant relationship with
– education level (t = 6.33, P < 0.001)
– number of cattle owned (t = 3.00, P < 0.01) and
– farm size (t = 6.16, P < 0.001)
• The effect of age was not significant (t = -0.63, P >
0.05)
• Increase in education level, number of cattle owned
and farm size was associated with increased income
before project
Multiple linear regression analysis:
special cases
• When we have continuous dependent variable, and a
mixture of categorical and continuous independent
variables
– Transform categorical variables into dummy
variables, e.g. 1 = Male, 0 = otherwise (i.e. 1 = Male,
0 = Female); 1 = Yes, 0 = otherwise (i.e. 1 = Yes, 0 = No)
– Run linear regression analysis as usual to facilitate
interpretation
• Interpretation
– Positive coefficient implies that the category coded as 1
is associated with an increase in Y (the dependent variable)
– Negative coefficient means that the category coded as 1
is associated with a decrease in Y
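A small pandas sketch of the dummy coding described above; the column names are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({"sex": ["Male", "Female", "Male"],
                   "income": [1200, 800, 950]})

# 1 = Male, 0 = otherwise, matching the coding scheme above
dummies = pd.get_dummies(df["sex"], drop_first=True)   # keeps a "Male" column
df = pd.concat([df, dummies], axis=1)
print(df)
```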
Multiple regression analysis: special
cases
• When we have a categorical dependent variable
– Binary logistic regression for binary
dependent variable
– Ordinal logistic regression for ordinal or
ordered dependent variable
– Multinomial logistic regression for nominal
dependent variable
• We will discuss these models in detail in the
next slides
Binary logistic regression
• In social sciences, we frequently encounter
variables with binary responses i.e. two categories
of response (BINARY VARIABLE)
• Thus, it is logical to study probability of observing a
category of a response variable (Y) given particular
levels of an independent variable
• The probability of observing a particular category
of the response variable given particular levels of the
independent variable is NON-LINEAR
• We can model this non linear relationship using
logistic equation (Logit link)
Why binary logistic regression
• To evaluate the statistical significance of the
associations between variables
• To gauge the effect of one explanatory variable
on the outcome variable when controlling for
other variables also associated with the
outcome
• To gauge the extent and significance of any
interactions between the explanatory variables
in their effects on the outcome
Binary logistic regression…
• Suppose we have a dependent variable (Y) with two
response categories; 1 if a condition is observed and 0 if
otherwise (not observed), and one independent
variable (X).
• Procedure in SPSS
Analyze → Regression → Binary Logistic → choose a
dependent variable and put it in the Dependent box →
select the independent variable and put it in the
Covariates box → click Categorical (if the independent
variable is categorical) → select the categorical
independent variable and put it in the Categorical
Covariates box → specify a reference category and
click the Change button → Continue → OK
Binary logistic regression…
Example
• Using data in the file “SPSS working file1”, establish
if there is a relationship between “access to
credit” (ACCESS) as a dependent variable and
age (years) (AGE), sex of household head (SEX),
education level (EDUC), marital status
(MARITAL), number of cattle owned (NCATTLE),
farm size (FSIZE), and district (DISTRICT) as
independent variables
Binary logistic regression…
• Statistical tests in binary logistic regression
• Goodness-of-fit tests
– Chi-square and Sig.
– Nagelkerke R2, Cox and Snell R2, McFadden R2,
etc
• Testing the significance of coefficients
– Regression coefficient (B)
– Wald statistic and Sig.
– Exp(B) – odds ratios for the predictors
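A sketch of a binary logistic regression in Python (statsmodels) reporting the statistics listed above; the data and variable names are made up, not taken from the working file:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "ACCESS": [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],   # 1 = has access to credit
    "AGE":    [34, 51, 29, 42, 60, 33, 31, 55, 58, 49],
    "FSIZE":  [3.0, 1.5, 4.0, 1.4, 3.2, 2.0, 3.5, 1.2, 2.8, 1.8],
})

model = smf.logit("ACCESS ~ AGE + FSIZE", data=df).fit()
print(model.summary())        # coefficients (B), z tests and p-values
print(np.exp(model.params))   # Exp(B): odds ratios for the predictors
```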
Activity
• Using data in the file “SPSS working file4”, establish if there
is a relationship between the variable “If ever given birth
(BIRTH)” by a female adolescent as a dependent variable
and Age (AGE), Highest education level attained
(HIHESED), Religious affiliation (RELIGION), Ethnicity
(ETHNIC), Wealth status of a family (WEALTH), Living
arrangement (LIVING), Type of marriage by parents
(TMARRIAG), Having close friends that are sexually active
(ACTIVE), and Use of alcohol (ALCOHOL) by an
adolescent as independent variables

• Summarize your output and interpret your results


Thanks for your attention
