
1

Econometrics Analysis of Experimental Data

Prof. Dr. Sabrina Jeworrek
Chair for Organizational Behavior and Human Resource Management

Parametric and Non-parametric Treatment Testing
Part I
2

Basic Principles of Treatment Testing


3

Parametric or non-parametric?
• A treatment test always has a null hypothesis and an alternative
hypothesis. Usually, the null hypothesis is that there is no effect.
* If the alternative hypothesis specifies the direction of the effect, a one-tailed test
can be conducted, and the p-value is only half that of the two-tailed test (usually, not always!).

• But which test to choose? Statistical procedures may be grouped
into two major classifications: parametric and nonparametric.
* The assumptions of parametric statistics (i.e. normality and equal variances) are
more specific and stringent than those of nonparametric ones.
* But: the more rigorous the assumptions, the more trustworthy the conclusions,
because the test exploits the richness of the data.

• Example: Two independent groups reveal similar median values, but
the mean values are different.
* Because parametric tests take the variance into account, a parametric test could
yield a statistically significant result whereas this would not be the case for a
nonparametric test.

 Preferential use of parametric tests whenever the data meet the
assumptions.
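Such a constellation is easy to simulate. The following sketch (hypothetical
variable names y and group; the exact test results depend on the seed) plants a
few large outliers in one group, so that the means differ while the medians
stay close and the two tests can disagree:

* Simulated illustration: similar medians, different means
clear
set seed 12345
set obs 200
gen group = _n > 100                          // two groups of 100 observations
gen y = rnormal()                             // identical baseline distribution
replace y = y + 20 if group & runiform()<.05  // a few large outliers in group 1
tabstat y, by(group) stats(mean median)       // means differ, medians barely
ttest y, by(group)                            // parametric test uses the means
ranksum y, by(group)                          // rank-sum test is insensitive to the tail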
4

Parametric or non-parametric? An overview

Nonparametric Statistics                                   | Parametric Statistics
-----------------------------------------------------------|--------------------------------------------------------------
Continuous distribution                                    | Assumptions of normality and equal variances
Uses median as location parameter                          | Uses mean, variance, and standard deviation as location parameters
Random sample                                              | Random sample
Independence of responses                                  | Independence of responses
Uses nominal, ordinal, interval, and sometimes ratio data  | Uses interval and ratio data
Large and small data sets                                  | Large data sets (minimum of 30 or more cases)
Weaker statistical power than parametric statistics        | More powerful than nonparametric tests for rejecting the null hypothesis

Source: Kraska-Miller (2014), p. 35.


5

The importance of nonparametric methods


• Outcomes of interest are often dichotomous or on an ordinal scale.
* Example: the probability of passing an exam.
* Distributional assumptions of parametric tests can only hold if the variables of interest
are measured on a cardinal scale.

• Even if the data are cardinal in nature, researchers might transform the
data to ordinal data due to their research question, e.g. income grouped
into high, medium, and low income groups.
6

Using Parametric Statistics?


Even if data are cardinal, the assumptions may not be met. But:

1. Non-normal distributions:

* Sufficiently large sample sizes make it possible to appeal to the central limit
theorem (CLT): the standardized mean of a sample follows a normal distribution
even when the sample is drawn from a distribution that is not normal.

* If sample sizes are low, bootstrapping helps to ensure that inferences made from
parametric tests are valid regardless of the distribution.

2. Heterogeneous variances:

Some parametric tests are still robust. Additionally, for some tests there are test
statistics for unequal variances along with the test for equal variances.
7

Let's get started! The description of the experiment.


8

Data is obtained from the following game:

• There are two different types of players: a seller and a buyer.

• The seller is given 60 units, which she can invest (a binary decision!) to
generate a greater amount (i.e. 100 units).
* If the seller does not invest, she keeps the 60 units, which are paid out at the end of the
experiment.

• If the seller has invested, the buyer proposes how to split the 100
units.

• The seller chooses whether to accept this offer. In case of rejection,
both parties receive zero.

 Combination of a binary trust game and an ultimatum game.

9

The treatments
The game is played with three different treatment groups:

T1: Control group, no communication except the actions
themselves.

T2: The buyer can send a message to the seller before the seller
makes the investment decision.

T3: The seller can send a message to the buyer along with the
investment decision.

Exemplary research question: Does the amount offered by the buyer
increase due to the seller's message?

You can do the analysis on your own. Use the following data set:
"IndSamples MeanDifferences.dta". This part of the lecture is based on
Moffatt (2016), chapter 3.
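A minimal sketch for getting started (the file name is taken from above; the
variable names offer and treatment are those used throughout these slides):

* Load the data and get a first overview
use "IndSamples MeanDifferences.dta", clear
describe
tab treatment
bysort treatment: summarize offer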
10

Testing for differences in means between two independent samples

1. Testing for normality as a first step (in case of cardinal data)
2. The parametric t-test
3. The nonparametric Wilcoxon rank-sum test
4. Mood's median test (also nonparametric)
11

1.1. Testing for normality: Visual inspection

Getting a first impression using a histogram:

hist depvar, disc freq normal

(hist depvar is the minimum for executing the command; the normal option
superimposes a normal density on the histogram, with the same mean and
standard deviation as the data.)

[Figure: histogram of the buyer's offer to the seller, offers from 0 to 100 on
the horizontal axis, frequency on the vertical axis, normal density superimposed]

depvar = dependent variable = outcome of interest
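To inspect each treatment group on its own, the command can be combined with
by() and complemented with a normal quantile plot (a sketch; offer is the
outcome variable in this data set):

* One histogram panel per treatment group, each with its normal density
hist offer, disc freq normal by(treatment)
* Normal quantile plot: points off the 45-degree line indicate non-normality
qnorm offer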


12

1.2. Testing for normality: Formal tests

The skewness-kurtosis test

sktest depvar

. sktest offer

Skewness/Kurtosis tests for Normality
                                                               joint
    Variable |  Obs   Pr(Skewness)   Pr(Kurtosis)   adj chi2(2)   Prob>chi2
       offer |   51         0.0021         0.4595          8.60      0.0136

 Rejection of symmetry (Pr(Skewness) = 0.0021), but normal kurtosis
(Pr(Kurtosis) = 0.4595).

Separate estimation by treatment is important!

. sktest offer if treatment==1

Skewness/Kurtosis tests for Normality
                                                               joint
    Variable |  Obs   Pr(Skewness)   Pr(Kurtosis)   adj chi2(2)   Prob>chi2
       offer |   14         0.5120         0.0088          6.52      0.0383

. sktest offer if treatment==2

Skewness/Kurtosis tests for Normality
                                                               joint
    Variable |  Obs   Pr(Skewness)   Pr(Kurtosis)   adj chi2(2)   Prob>chi2
       offer |   16         0.0003         0.0012         16.65      0.0002
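Instead of typing the command once per group, a small loop runs the test for
all three treatments in one step (a sketch using Stata's forvalues loop):

forvalues t = 1/3 {
    display as text _n "sktest for treatment == `t'"
    sktest offer if treatment == `t'
}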


13

1.2. Testing for normality: Formal tests II

The Shapiro-Wilk test

swilk depvar

Shapiro-Wilk W test for normal data

    Variable |  Obs        W        V        z    Prob>z
       offer |   51  0.87616    5.916    3.795   0.00007

Conclusion:

Normality is strongly rejected: nonparametric tests should be used,
especially given that sample sizes are low. The validity of parametric
tests has to be doubted.
14

2.1. The parametric t-test

• Used for testing whether two samples (and, hence, their means)
belong to the same population.

• How does the test work?

1. Calculating the t-test statistic from the sample means $\bar{x}_1$ and
$\bar{x}_2$, the sample sizes $n_1$ and $n_2$, and the pooled standard
deviation $s_p$ (a weighted average of the two individual sample standard
deviations):

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} $$

2. Calculating the degrees of freedom: $df = n_1 + n_2 - 2$.

3. Comparing the test statistic with the critical value of the $t$ distribution: if the
test statistic is larger (in absolute value), the difference in means is statistically
significant at the chosen significance level.
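As a check, plugging the group statistics from the Stata output on the next
slide into these formulas reproduces the reported test statistic:

$$ s_p = \sqrt{\frac{13 \cdot 32.25^2 + 20 \cdot 19.39^2}{33}} \approx 25.25,
\qquad t = \frac{48.57 - 63.33}{25.25 \sqrt{\frac{1}{14} + \frac{1}{21}}}
\approx \frac{-14.76}{8.71} \approx -1.69 $$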
15
2.2. T-test in Stata

ttest depvar if treatment==1 | treatment==3, by(treatment)

(The if clause selects the groups you want to compare; an alternative
formulation here is: if treatment!=2.)

Two-sample t test with equal variances

   Group |  Obs      Mean   Std. Err.  Std. Dev.   [95% Conf. Interval]
       1 |   14  48.57143   8.619371   32.25073   29.95041   67.19245
       3 |   21  63.33333   4.230464   19.38642   54.50874   72.15793
combined |   35  57.42857   4.383753   25.93463   48.51971   66.33743
    diff |       -14.7619   8.711774              -32.48614   2.962333

    diff = mean(1) - mean(3)                      t = -1.6945
    Ho: diff = 0                   degrees of freedom = 33

    Ha: diff < 0          Ha: diff != 0           Ha: diff > 0
 Pr(T < t) = 0.0498   Pr(|T| > |t|) = 0.0996   Pr(T > t) = 0.9502

• In case of a p-value smaller than 0.05, you reject the null hypothesis that there is
no difference between the two groups.
• Here: using the two-sided(!) test, there is only "mild/suggestive evidence" for a
treatment effect.
16

2.3. Taking unequal variances into account

sdtest depvar if treatment==1 | treatment==3, by(treatment)

    Ha: ratio < 1         Ha: ratio != 1          Ha: ratio > 1
 Pr(F < f) = 0.9801   2*Pr(F > f) = 0.0398   Pr(F > f) = 0.0199

 The variance-ratio test suggests that the variances are not equal.

ttest depvar if treatment==1 | treatment==3, by(treatment) unequal

Two-sample t test with unequal variances

   Group |  Obs      Mean   Std. Err.  Std. Dev.   [95% Conf. Interval]
       1 |   14  48.57143   8.619371   32.25073   29.95041   67.19245
       3 |   21  63.33333   4.230464   19.38642   54.50874   72.15793
combined |   35  57.42857   4.383753   25.93463   48.51971   66.33743
    diff |       -14.7619   9.601583              -34.83782   5.314009

    diff = mean(1) - mean(3)                      t = -1.5374
    Ho: diff = 0        Satterthwaite's degrees of freedom = 19.29

    Ha: diff < 0          Ha: diff != 0           Ha: diff > 0
 Pr(T < t) = 0.0702   Pr(|T| > |t|) = 0.1404   Pr(T > t) = 0.9298
17

2.4. The bootstrap technique

• Uses the possibly rich cardinal information in the data even in case
of non-normality.

• Procedure:

1. Obtain the test statistic from the parametric test (as usual).

2. Generate a "healthy" number of bootstrap samples:
samples with the same sample size as the original sample, drawn
from the original sample with replacement.

3. For each bootstrap sample, compute the test statistic.

4. Compute the standard deviation of the bootstrap test statistics.

5. Obtain the new test statistic: the observed test statistic divided by the
bootstrap standard deviation, which is compared to the standard normal
distribution.
18

2.4. The bootstrap technique II

• Depending on the chosen significance level (1%, 5%, 10%), the
number of bootstrap samples should be chosen (99, 999, 9999).

bootstrap t=r(t), rep(999) nodrop: ttest depvar if treatment==1 |
treatment==3, by(treatment)

Bootstrap results                          Number of obs = 103
                                           Replications  = 999

command: ttest offer if treatment==1 | treatment==3, by(treatment)
      t: r(t)

          |  Observed   Bootstrap                     Normal-based
          |     Coef.   Std. Err.     z    P>|z|   [95% Conf. Interval]
        t | -1.694477    1.150798  -1.47   0.141   -3.949999   .5610444
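The reported z statistic is simply step 5 of the procedure applied to the
observed value:

$$ z = \frac{t_{\text{observed}}}{sd_{\text{bootstrap}}(t)}
= \frac{-1.6945}{1.1508} \approx -1.47 $$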


19

3.1. The nonparametric Wilcoxon rank sum test

• Also called the Mann-Whitney U test.

• Determining the test statistic: pool both samples, rank all observations
(ties receive the average of the tied ranks), and compute the "sum of ranks"
$R_1$ of the first sample. The test statistic is approximately normally
distributed:

$$ z = \frac{R_1 - E[R_1]}{\sqrt{Var(R_1)}}, \qquad
E[R_1] = \frac{n_1 (n_1 + n_2 + 1)}{2}, \qquad
Var(R_1) = \frac{n_1 n_2 (n_1 + n_2 + 1)}{12} $$

(Stata additionally applies an adjustment for ties to the variance; see the
output on the next slide.)

• An example for the sum of ranks:

offer   20    30    50    50    55    60    65    70    80
treat    1     1     2     1     2     2     1     2     2
rank     1     2   3.5   3.5     5     6     7     8     9

R1 = 1 + 2 + 3.5 + 7 = 13.5
R2 = 3.5 + 5 + 6 + 8 + 9 = 31.5
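For this small example ($n_1 = 4$, $n_2 = 5$), ignoring the tie adjustment:

$$ E[R_1] = \frac{4 \cdot 10}{2} = 20, \qquad
Var(R_1) = \frac{4 \cdot 5 \cdot 10}{12} \approx 16.67, \qquad
z = \frac{13.5 - 20}{\sqrt{16.67}} \approx -1.59 $$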


20

3.2. Rank sum test in Stata

ranksum depvar if treatment==1 | treatment==3, by(treatment)

Two-sample Wilcoxon rank-sum (Mann-Whitney) test

 treatment |  obs   rank sum   expected
         1 |   14      220.5        252
         3 |   21      409.5        378
  combined |   35        630        630

unadjusted variance    882.00
adjustment for ties    -26.56
adjusted variance      855.44

Ho: offer(treatment==1) = offer(treatment==3)
             z = -1.077
    Prob > |z| = 0.2815

 Given that nonparametric tests are more suitable for this specific data, we find no
evidence for a statistically significant difference between the treatments.

Note: ranksum does not report group means; display them separately with:
bysort treatment: sum depvar


21

4. Mood's Median Test

• Compares the medians of two independent groups.

• Similar to the rank sum test, it makes no distributional assumptions.

• The rank sum test, however, is more powerful, since it uses the rank of
each observation instead of only the relation of a score to the
median value of the distribution (above or below).

• Command: median. For further information: help median. (A sketch
follows at the end of this slide.)

• To conclude: the amount offered by the buyer does not increase
due to the seller's message. Obtained p-values:

t-test     Mann-Whitney     t-test bootstrapped
0.0996     0.2815           0.141
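The sketch for Mood's median test, using the same comparison and variable
names as above:

median offer if treatment==1 | treatment==3, by(treatment)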
22

Summary: Using the t-test or the rank sum test?

• We want to compare the means of two samples. If not, other tests
have to be used.

• In case of ordinal data, you have no choice: only the rank sum test is
appropriate!

• In case of cardinal data, check for normality first.

* If the data is normally distributed, check for equal variances next.
* If the data is not normally distributed, check the sample sizes.
You can use the bootstrap technique or switch to the rank sum test if the
sample size is low.
23

How to present results?


24

Data analysis is not enough…

• You can simply report the results in the main text of your paper
(means for each group, the corresponding p-values, and which test
was used), but:
* To help the reader find the most important results, graphical
illustrations are extremely useful.

• Simple bar charts would do the job, but experimental economists
would like to see confidence intervals in order to judge the variation
of outcomes.

• cibar depvar, over(treatment)

[Figure: bar chart of the mean of offer by treatment, with confidence
intervals; vertical axis from 30 to 80]

 Some graphical adjustments are still necessary; use the graph editor.
You can also add text, e.g. p-values…
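cibar is a user-written command, so it has to be installed once before use (a
sketch; installation from SSC assumed):

ssc install cibar
cibar offer, over(treatment)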
25

… and what about economic significance?

• Comparing means already gives you an idea about the economic
importance, but it is hardly comparable across different studies.

• Cohen's d (see introductory lecture) as one common measure.
* Hedges' g is very similar but intended for sample sizes < 20.

• esize twosample depvar if treatment!=2, by(treatment) coh hed

Effect size based on mean comparison

Obs per group:
    treatment==1 = 14
    treatment==3 = 21

  Effect size |   Estimate   [95% conf. interval]
    Cohen's d | -.5846503   -1.271109    .1102795
   Hedges's g | -.5712442   -1.241962    .1077508

 Comparably large effect size, but statistically insignificant.
So what is your conclusion?
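Cohen's d is just the mean difference scaled by the pooled standard deviation,
so it can be verified against the earlier t-test output:

$$ d = \frac{\bar{x}_1 - \bar{x}_3}{s_p} = \frac{48.57 - 63.33}{25.25}
\approx -0.585 $$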
26

Multiple hypothesis testing (MHT)

(Hard to find in older publications but is becoming increasingly important…)


27

The Problem with Multiple Hypotheses

• False discoveries drive resource allocation and future streams of
thought; private and social costs might be quite high.

• Multiple hypothesis testing (MHT) is one key reason for false
positives. Testing…

- several dependent variables that might be affected by the
treatment
- for heterogeneity
- multiple treatment groups

• If all p-values are mutually independent, the probability of at least
one true null hypothesis being rejected would equal $1 - (1 - \alpha)^N$,
with N as the number of tested hypotheses.
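For instance, at $\alpha = 0.05$ with $N = 10$ independent tests:

$$ 1 - (1 - 0.05)^{10} = 1 - 0.95^{10} \approx 0.40 $$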
28

Classical Correction Methods

• Bonferroni correction: test each individual hypothesis at a
significance level of $\alpha / N$.

 Extremely conservative, especially if there is a large number of
tests, and it reduces the statistical power.

• Bonferroni-Holm correction: stepwise algorithm.

Example:
Testing H1 to H4 with p1=0.01, p2=0.04, p3=0.03 and p4=0.005, at $\alpha = 0.05$.

1. The smallest p-value (0.005) is compared to $\alpha / N = 0.05/4$, which is 0.0125.
Hence, the null hypothesis can be rejected.
2. The second smallest p-value (0.01) is compared with $\alpha / (N-1) = 0.05/3 \approx 0.0167$.
Since 0.01 < 0.0167, the null hypothesis is again rejected.
3. Continue as above for all available hypotheses, stopping at the first
non-rejection.
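Completing the example: the third smallest p-value (0.03) is compared with
$\alpha / (N-2) = 0.05/2 = 0.025$. Since 0.03 > 0.025, H3 cannot be rejected,
and the procedure stops: the remaining hypothesis (p2 = 0.04) is not rejected
either.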
29

MHT in Experimental Economics

• A new method that incorporates information about the joint dependence
structure of the test statistics when determining which null hypotheses
to reject. Developed by List, Shaikh, and Xu (2019).

• Example: mhtexp x y z, treatment(treatment) subgroup(gender)

Output with two outcomes:

     | outcome  subgroup  treatme~1  treatme~2  diff_in~s  Remark3_1     Thm3_1  Remark3_7       Bonf       Holm
  r1 |       1         1          0          1   26.15011   .0336667   .0336667   .0336667   .0673333   .0336667
  r2 |       2         1          0          1   .6393227   .0046667   .0086667   .0086667   .0093333   .0093333

Output with three outcomes:

     | outcome  subgroup  treatme~1  treatme~2  diff_in~s  Remark3_1     Thm3_1  Remark3_7       Bonf       Holm
  r1 |       1         1          0          1   26.15011   .0336667       .065       .065       .101   .0673333
  r2 |       2         1          0          1   .6393227   .0046667       .013       .013       .014       .014
  r3 |       3         1          0          1    .021702   .8763333   .8763333   .8763333          1   .8763333

Output with three outcomes and two subgroups:

     | outcome  subgroup  treatme~1  treatme~2  diff_in~s  Remark3_1     Thm3_1  Remark3_7       Bonf       Holm
  r1 |       1         1          0          1   41.16463   .0136667   .0763333   .0763333       .082       .082
  r2 |       1         2          0          1   4.027027       .813   .9926667   .9926667          1          1
  r3 |       2         1          0          1   .6791667       .028   .1336667   .1336667       .168        .14
  r4 |       2         2          0          1   .5527463       .101   .3476667   .3476667       .606       .404
  r5 |       3         1          0          1   .0271739   .8946667   .9893333   .9893333          1          1
  r6 |       3         2          0          1   .0243243   .9043333   .9043333   .9043333          1   .9043333

• Attention: The subgroup variable should not be coded as 0 vs. 1; use 1 vs. 2 instead!
The treatment variable, however, has to contain values of 0.
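mhtexp is user-written and has to be installed once (a sketch; installation
from SSC assumed, and the outcome variable names are hypothetical):

ssc install mhtexp
mhtexp outcome1 outcome2 outcome3, treatment(treatment)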
