Basic Statistical Tools For Research
Objectives
- Understand the statistical nature of research data
- Identify approaches in quantitative research planning (data collection, organization, and analysis)
- Identify appropriate statistical techniques for a given study design
Levels of Measurement
- Nominal: numbers are just category labels
- Ordinal: ranks, hierarchy, order
- Interval: equally spaced scores; no true zero, so ratios of values are not meaningful
- Ratio: the highest level of measurement; has a true zero, so ratios are meaningful
- Predicting the value of an attribute of interest
- Testing the effect of several factors on a response
Sampling

- Probability sampling: a procedure wherein every element of the population is given a known, nonzero chance of being selected in the sample.
- Nonprobability sampling: a procedure wherein not all elements in the population are given a chance of being included in the sample.
Issues

- The choice relies on: the nature of measurement, the variation in the population, and the tolerable margin of error
- Treatment of heterogeneity: stratification, clustering, multi-staging
- Formula
DESCRIPTIVE METHODS
Describing and summarizing a set of measurements:
- Presentation of tables
- Construction of graphs
- Computation of summary measures
- Averages describe central tendency. Issue: which average to use?
- Variation describes the extent of dispersion. Issue: absolute or comparative dispersion?
- Skewness describes the degree of asymmetry: where in the range of values do the data cluster?
- Percentiles identify markers or thresholds
Chi-Square Test
- The chi-square test determines the association between two (categorical) variables set in a contingency table.
- It is generally regarded as a nonparametric test with no parametric counterpart.
- The Fisher exact test is an alternative to this test for 2x2 contingency tables.
Chi-Square Test
               Low Income   Middle Income   High Income
(-) attitude       31             29             27
(+) attitude       48             93            165
Total              79            122            192

The null and alternative hypotheses are:
- Ho: Socioeconomic status and attitude are independent.
- Ha: The two variables are associated.
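The test above can be sketched in code. This is a minimal illustration using scipy; only the cell counts come from the slide.

```python
# Sketch: chi-square test of independence for the
# attitude-by-income contingency table (counts from the slide).
from scipy.stats import chi2_contingency

observed = [[31, 29, 27],    # (-) attitude
            [48, 93, 165]]   # (+) attitude

chi2, p, df, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.4g}")
# A small p-value leads us to reject Ho and conclude that
# socioeconomic status and attitude are associated.
```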
Correlation Analysis
- Correlation means the degree of linear association between two measurements.
- The most common correlation measure is the Pearson coefficient, r. An alternative is the Spearman coefficient, for rank data.
- Pearson's r ranges from -1 to +1. Values close to either -1 or +1 indicate strong correlation, while near-zero values mean minimal or no correlation.
Correlation Analysis
- Positive correlation means that as one variable increases, the other tends to increase as well; likewise, the two tend to decrease together.
- Negative correlation means that as one variable increases, the other tends to decrease, and vice versa.
Correlation Analysis
- Example: Refer to the data showing 20 nations ranked by the percentage of births attended by trained health care personnel and by maternal mortality rate. The Spearman correlation (rs) is -0.88 (p < 0.001). A significant negative correlation exists: there is a general tendency for maternal mortality to decrease as more births are attended by medical personnel.
Nation         Rank by % of Births Attended   Rank by Maternal Mortality Rate
                                              (per 100,000 live births)
Bangladesh                  1                            18
Nepal                       2                            20
Morocco                     3                            16
Pakistan                    4                            17
Nigeria                     5                            19
Kenya                       6                            14.5
Philippines                 7                            11
Iran                        8                            12.5
Ecuador                     9                            14.5
Portugal                   10                             6.5
Vietnam                    11                            12.5
Spain                      12.5                           2.5
Panama                     12.5                           9
Chile                      14                            10
Switzerland                16                             2.5
USA                        16                             5
Hungary                    16                             8
Netherlands                19                             6.5
Hong Kong                  19                             4
Belgium                    19                             1
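The Spearman coefficient for these 20 nations can be reproduced directly from the ranks. A minimal sketch using scipy, with the rank columns entered as listed:

```python
# Sketch: Spearman rank correlation for the 20 nations,
# using the two rank columns from the slide.
from scipy.stats import spearmanr

rank_births_attended = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                        11, 12.5, 12.5, 14, 16, 16, 16, 19, 19, 19]
rank_maternal_mortality = [18, 20, 16, 17, 19, 14.5, 11, 12.5, 14.5, 6.5,
                           12.5, 2.5, 9, 10, 2.5, 5, 8, 6.5, 4, 1]

rs, p = spearmanr(rank_births_attended, rank_maternal_mortality)
print(f"rs = {rs:.2f}, p = {p:.4g}")  # strong negative correlation
```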
Paired-sample tests determine whether there are significant differences in scores between related observations or matched pairs. The two common types of paired-sample tests are:
- Paired t-test (parametric)
- Wilcoxon signed-ranks test (nonparametric)
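Both paired-sample tests can be run in a few lines. The before/after scores below are hypothetical, purely for illustration:

```python
# Sketch: paired t-test and Wilcoxon signed-ranks test on
# hypothetical before/after scores (illustrative data only).
from scipy.stats import ttest_rel, wilcoxon

before = [72, 65, 80, 75, 68, 70, 77, 74, 69, 73]
after  = [78, 70, 82, 80, 71, 76, 81, 77, 74, 79]

t, p = ttest_rel(before, after)
print(f"paired t = {t:.2f}, p = {p:.4g}")

# Nonparametric counterpart on the same pairs:
w, p_w = wilcoxon(before, after)
print(f"Wilcoxon W = {w}, p = {p_w:.4g}")
```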
Independent-sample tests determine whether scores significantly differ between two disjoint (mutually exclusive) groups. The two most common types are:
- Independent-sample t-test (parametric)
- Mann-Whitney test (nonparametric)

The independent-sample t-test is used when scores are assumed to be normally distributed, i.e., following a bell-shaped histogram. The Mann-Whitney test is used when the observed measurements are markedly skewed or when the data are ordinal (ranks).
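Both independent-sample tests are equally short to run. The two groups below are hypothetical:

```python
# Sketch: independent-sample t-test and Mann-Whitney test on
# hypothetical scores for two disjoint groups (illustrative data).
from scipy.stats import ttest_ind, mannwhitneyu

group_a = [55, 60, 58, 62, 57, 59, 61, 56]
group_b = [65, 70, 68, 72, 66, 69, 71, 67]

t, p = ttest_ind(group_a, group_b)
print(f"t = {t:.2f}, p = {p:.4g}")

u, p_u = mannwhitneyu(group_a, group_b)
print(f"U = {u}, p = {p_u:.4g}")
```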
- Analysis of variance (ANOVA) extends the independent-sample t-test to the case of three or more disjoint or exclusive groups.
- When data are ordinal or skewed, the counterpart procedure is the Kruskal-Wallis test.
- When the null hypothesis of equality of means is rejected, pairwise comparisons are necessary (e.g., Duncan, Tukey, Scheffe).
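A one-way ANOVA and its nonparametric counterpart can be sketched as follows; the three groups are hypothetical:

```python
# Sketch: one-way ANOVA and Kruskal-Wallis test on hypothetical
# scores for three disjoint groups (illustrative data only).
from scipy.stats import f_oneway, kruskal

g1 = [12, 14, 11, 13, 15]
g2 = [18, 20, 17, 19, 21]
g3 = [25, 27, 24, 26, 28]

f, p = f_oneway(g1, g2, g3)
print(f"F = {f:.2f}, p = {p:.4g}")

h, p_h = kruskal(g1, g2, g3)
print(f"H = {h:.2f}, p = {p_h:.4g}")
```

If the null hypothesis is rejected here, a post-hoc procedure (Tukey, Scheffe, etc.) would follow to locate which pairs of groups differ.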
Regression Analysis
- Regression analysis is a method for analyzing a variable using information on other variables.
- The variable being explained or analyzed is called the response or dependent variable.
- The variables whose effects act on the response are called predictor, regressor, or independent variables.
Regression Analysis
- When there is only one predictor, we have a simple linear regression model.
- Response = function(one predictor)
- Ex. O2 consumption = function of running time
- The formal model is Yi = b0 + b1*Xi + ei, where ei is a random disturbance.
- O2 = intercept value + slope value x RunTime + random error
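Fitting such a model takes one line with numpy. The (running time, O2 consumption) pairs below are hypothetical, made up only to illustrate the fit:

```python
# Sketch: fitting a simple linear regression O2 = b0 + b1*RunTime
# on hypothetical (running time, O2 consumption) data.
import numpy as np

run_time = np.array([8.6, 9.2, 10.1, 11.4, 12.0, 13.1, 14.0])
o2_uptake = np.array([54.3, 52.9, 50.5, 47.2, 45.8, 43.1, 41.0])

b1, b0 = np.polyfit(run_time, o2_uptake, 1)  # slope, intercept
print(f"fitted: O2 = {b0:.2f} + {b1:.2f} * RunTime")
```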
Regression Analysis
- When there are many predictors, we have a multiple linear regression model.
- Response = function(several predictors)
- Ex. O2 = function of RunTime and Age
- The MLRM is written as Yi = b0 + b1*X1i + b2*X2i + ... + bk*Xki + ei, where:
  - Yi is the value of the response variable in the ith observation
  - b0, b1, b2, ..., bk are the parameters of the model
  - X1i, X2i, ..., Xki are the values of the predictors in the ith observation
  - ei is the error term
IDENTIFY YOUR RESPONSE VARIABLE!
- This should be quantifiable.
- Yes/No, High/Low, and similar categorical responses are not valid here.
Identify your predictors.
- Quantitative predictors must be correlated with the response.
- Make sure there is no redundancy among predictors. Check this by computing their correlations. If predictors are correlated, choose only the one with practical significance to your study. There are advanced statistical methods that treat correlated predictors.
What s next?
- You are now ready to fit the regression equation. To illustrate, consider an example.

Renar Interiors operates in medium-size business areas. In considering an expansion into other areas of similar size, it wishes to investigate how sales (Y) can be predicted from the size of the target market, i.e., the 20-39 age group (X1), and the average monthly income of households in the area (X2). Data on these variables in the most recent year for the 21 business areas where the company operates is given below.
Results
- The Coefficients column gives the estimated values of the regression parameters.
- Here, the fitted model is: Y = -3.887 + 0.146*X1 + 0.929*X2
- SALES = -3.887 + 0.146 x Market Size + 0.929 x Income
The ANOVA table partitions the total variation in the response into explained (pattern) and unexplained (error) parts.
- The explained variability is the amount of variation in the response variable that may be attributed to the predictors explicitly stated in the model.
- The unexplained variability is the amount of variation attributed to random error.
Results from the ANOVA table for the Renar Interiors data
- The first column in the table labels the sources of variation (Regression and Residual).
- The df column refers to the degrees of freedom. The df for Regression is the number of regression parameters minus one. The df for Residual is the sample size minus the number of regression parameters. The total df is the sum of these two degrees of freedom. For the Renar data: df Regression = 3 - 1 = 2, and df Residual = 21 - 3 = 18.
Results from the ANOVA table for the Renar Interiors data
- SS refers to Sum of Squares. The value 240.3407 represents the amount of variation in sales explained by the two predictors in the model. The value 21.9658 represents the unexplained variation. These two values sum to 262.3065. There is good fit when the Regression Sum of Squares is much larger than the Residual Sum of Squares.
- MS refers to Mean Squares. The values in this column are the ratio of each Sum of Squares to its respective degrees of freedom. Mean squares have no physical meaning but are instrumental in computing the F statistic.
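The whole ANOVA-table arithmetic can be reproduced from the two sums of squares quoted above:

```python
# Sketch: reproducing the ANOVA-table arithmetic for the
# Renar Interiors data from the quoted sums of squares.
ss_regression = 240.3407
ss_residual = 21.9658
df_regression = 2          # 3 parameters - 1
df_residual = 18           # 21 observations - 3 parameters

ss_total = ss_regression + ss_residual
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual
f_stat = ms_regression / ms_residual
r_squared = ss_regression / ss_total

print(f"SS total = {ss_total:.4f}")    # 262.3065
print(f"F = {f_stat:.2f}")             # about 98.5
print(f"R squared = {r_squared:.4f}")  # about 0.9163
```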
The F-test
- The F-test determines whether regression is meaningful for the data at hand. When the p-value is small (see "Significance F" in the Excel output), there is at least one significant predictor in the analysis.
- A large p-value supports the hypothesis that we do not have any significant predictor in the data; when it is small, we reject that hypothesis.
- Technically, we call the above hypothesis our null hypothesis, or Ho.
- Remember: WHEN p IS LOW, Ho MUST GO!
- Rule of thumb: the p-value is low if it is less than 0.05.
- R squared is expressed as a percentage and is interpreted as the amount of variability in the response explained by the independent variables.
- The value R squared = 0.9163 means that 91.63% of the variation in sales can be explained by size of target market and average monthly family income.
- R squared increases as the number of predictors increases, even if the added predictor(s) are not significant.
- As an alternative, we use the adjusted R squared (Ra squared).
- Ra squared penalizes R squared for the addition of regressors that do not contribute to the explanatory power of the model.
- Ra squared is never larger than R squared; it can decrease as regressors are added and, for poorly fitting models, may even be negative.
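The adjustment can be computed directly from the standard formula Ra^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1), using the Renar figures:

```python
# Sketch: adjusted R squared for the Renar Interiors fit,
# via Ra^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1).
r_squared = 0.9163   # from the slide
n = 21               # business areas
k = 2                # predictors: market size, income

adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
print(f"adjusted R squared = {adj_r_squared:.4f}")  # about 0.9070
```

As expected, the adjusted value is slightly below R squared, reflecting the penalty for the two predictors.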
The T-tests
- The t-test helps in assessing whether an individual predictor is significant.
- Interpreting the t-tests for the Renar data:
  - X Variable 1 (Target Market Size): since p = 2.05x10^-6 < 0.05, size of target market is a significant predictor of sales.
  - X Variable 2 (Average Monthly Income): since p = 0.0353 < 0.05, average monthly income is a significant predictor of sales.
  - Intercept: since p = 0.5466 > 0.05, the intercept is not significantly different from zero.