
EXERCISE: HOW TO DO POWER CALCULATIONS

TABLE OF CONTENTS
Introduction (5 mins)

Using the EGAP Power Calculator

Estimating Sample Size Using Results From a Similar Study (10 mins)

Estimating Sample Size Using Data From a Pilot Study (20 mins)

Limited Resources and Imperfect Compliance (10 mins)

Clustered Designs (20 mins)

Resources

KEY VOCABULARY

Significance level: The probability of committing a type I error (a false positive: concluding that the program has an effect when it actually does not). Statistical tests are typically performed at significance levels of 1%, 5%, or sometimes 10% to determine whether one group (e.g., the experimental group) is different from another group (e.g., the comparison group) on certain outcome indicators of interest (for instance, test scores in an education program). The significance level is typically denoted by alpha (α).

Standard deviation: For a particular indicator, a measure of the variation (or spread) of a sample or population. Mathematically, it is the square root of the variance.

Standardized effect size: A standardized (or normalized) measure of the [expected] magnitude of the effect of a program. Mathematically, it is the difference in means between the treatment and control groups (or between any two treatment arms) for a particular outcome, divided by the standard deviation of that outcome in the control (or comparison) group.

Type II error: A false negative: finding no evidence of an impact when a program/treatment actually has an effect. Type II error is often denoted by kappa (κ).

Power: The likelihood of avoiding a type II error; that is, the probability that your statistical test will (correctly) distinguish the program effect from zero when the program/treatment actually has an effect, given the sample size and the population the sample is drawn from. Power mirrors the significance level, α: as α increases (e.g., from 1% to 5%), the probability of rejecting the null hypothesis increases, which translates to a more powerful test.

Cluster: The unit at which a sample is randomized (e.g., the school), each of which typically contains several units of observation that are measured (e.g., students). Generally, observations within the same unit of randomization that are potentially correlated with each other should be clustered, and the required sample size should be calculated with an adjustment for clustering.

Intra-cluster correlation coefficient (ICC): A measure of the correlation between observations within the same cluster. For instance, if your experiment is clustered at the school level and the outcome of interest is test scores, the ICC would be the level of correlation in test scores for children in a given school relative to the overall distribution of test scores of students in all schools.
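As an optional aside for participants comfortable with R: the definitions above combine into a single formula. The sketch below shows, under the usual normal approximation (this is for intuition only, not the EGAP calculator's exact code), how the significance level, the standardized effect size, and the sample size jointly determine power.

    # Approximate power of a two-arm trial with n subjects per arm
    # (normal approximation; a sketch for intuition, not the EGAP app's code)
    power_two_arm <- function(n_per_arm, delta, sd, alpha = 0.05) {
      d <- delta / sd                  # standardized effect size
      z <- d * sqrt(n_per_arm / 2)     # standardized treatment-control difference
      pnorm(z - qnorm(1 - alpha / 2))  # probability of (correctly) rejecting the null
    }
    power_two_arm(n_per_arm = 252, delta = 3.8, sd = 15.2)  # ~0.80 (illustrative numbers)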

INTRODUCTION (5 MINS)

In this exercise, we will practice power calculations using estimates of effect sizes and outcome variance. The
exercise will also help explain the considerations that go into determining sample size and randomization
design when designing a randomized evaluation for an education intervention. Should the intervention be at
the school level or the student level? Should we sample every student in just a few schools? Should we sample
a few students from many schools? How many students or schools must we sample to confidently avoid
making a type II error?

We will work through these questions by determining the sample size that allows us to detect a specific effect
with at least 80% power, which is a commonly accepted level of power (such as by organizations that fund
research). Recall that power is the probability of avoiding a false negative, also known as a type II error. That
is, power is the likelihood that, when a treatment or program has an effect, you will be able to distinguish
this effect from zero in your sample. Therefore, if our sample is chosen for 80% power and we accept that
an intervention had an impact if it is statistically significant at the 5% level (a commonly accepted level of
significance), then at the given sample size, we are 80% likely to correctly reject the null hypothesis (typically,
the null hypothesis is that the program had no effect). Alternatively, if a study has not been set up to achieve
a commonly accepted level of power, it is considered to be “underpowered” and is at risk of a type II error.

Throughout this exercise, we will use the example of an education intervention that seeks to raise test scores.
We will explore how our study’s power changes with the total number of students, the number of students
in each classroom, the expected magnitude of the change in test scores, and the extent to which students
within a classroom appear more similar than students across classrooms.

We will walk through how to do simple power calculations in two ways: (1) by using results from a similar
study to estimate the effect size and standard deviation of the outcome variable for our program, and (2) by
using pilot data from our program (the Balsakhi study in the lecture) to calculate these components ourselves.1

USING THE EGAP POWER CALCULATOR

For this exercise, we will use a calculator for power calculations developed by Alexander Coppock for EGAP
(Evidence in Governance and Politics). The calculator was developed using the Shiny package in R and can
be used to conduct power calculations for individual-level randomization, clustered designs, and with binary
or continuous outcome variables. The calculator can be accessed at https://egap.shinyapps.io/power-app/.

ESTIMATING SAMPLE SIZE USING RESULTS FROM A SIMILAR STUDY (10 MINS)

Let’s work through this part together. (TA: lead the large group.) Recall that the key components needed to estimate sample size using a simple (non-clustered) design are:
• Significance level (α): Typically denoted by α, the significance level is the probability of a type I error, or falsely concluding that there is an effect when there is none (falsely rejecting the null hypothesis). The EGAP calculator default value of α=0.05 is commonly accepted.
• Power (1-κ): The probability of avoiding a type II error, where a type II error is falsely concluding there is no effect when there is one. Power is typically set at 80%, though some researchers aim for 90%.
• The effect size of the program for the outcome of interest.
• The variance (or its square root, the standard deviation) of the outcome of interest.

1 The “similar study” results shown in Table 1 are fictional and were created for the purposes of this exercise. The “pilot data” provided is a random subset of the actual Balsakhi program survey data. The full dataset for the Balsakhi program is available on the J-PAL Dataverse: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UV7ERB.

More complex designs, such as clustered designs or using covariates to improve the precision of your
estimates, require additional components to estimate sample size, which will be explained in further detail
below.

The paradox of power is that we cannot know two of the above components, namely the effect size and the
variance of the outcome of interest, until we conduct the experiment! That is, in order to conduct the
experiment, we need to decide on a sample size—but this decision is contingent on a number of outcomes
that we cannot know without conducting the experiment in the first place.

In this regard, power calculations involve making careful assumptions about certain outcomes, such as the
effect you realistically expect your program may have or the variation you expect in the outcome variable.
These assumptions are often informed by real data, such as from previous studies of similar programs, or pilot
studies in your population of interest. Making wrong assumptions will not bias the results of the study but
will affect the likelihood of a type II error, or failing to detect an effect when there is one. Regardless of the
source of the data you use to inform your power calculations, it is important to justify your assumptions,
which requires carefully thinking through the details of your program and context.

We will start by using data from a previous study looking at a similar program to inform our power
calculations. Table 1 shows the regression results from a program in Andhra Pradesh, India that sought to
increase student test scores through intensive tutoring. The table also shows the mean and standard deviation
of the pre-treatment outcome variable, pre-test scores. We will use the effect size and distribution of test
scores in this study as a benchmark to conduct power calculations for our own study.

Table 1: Effect of the Andhra Pradesh Tutoring Program on Post-test Scores

  Received tutor                           3.8***
                                           (1.12)

  Constant                                 35.9
                                           (0.83)

  Average pre-test score                   36.4

  Standard deviation of pre-test score     15.2

  N                                        694

Notes: Test scores are out of 60 possible points. Standard errors are in parentheses. ***Statistically significant at the 1% level.

A. What is β, the treatment effect size?

Answer: 3.8 points – the treatment effect is the increase in test scores for those who
received the tutoring program.

B. What is the standard deviation of the dependent variable?

Answer: 15.2 – test score is our dependent variable. Currently, we do not expect the
tutoring program to change the spread of test scores, so we use the standard deviation of
pre-test scores.

If you haven’t already, open the EGAP calculator in your web browser by going to:
https://egap.shinyapps.io/power-app/. Next, set the desired power and significance level; for this exercise
we’ll use the standard values of α=0.05 and 1-κ=0.8. Keep the maximum number of subjects at the default
and leave the “Clustered design?” and “Binary dependent variable?” boxes unchecked right now; we’ll return
to these later.

C. Plug in the values you found in questions A and B. Given these parameters and a significance level of
α=0.05, what is the sample size needed for 80% power?

Answer: 503

D. The EGAP calculator gives the total sample size. Assuming half allocated to treatment and half allocated
to control, how many do you need in your treatment and control groups?

Number in treatment: 252 (can’t have half a person)


Number in control: 252 (can’t have half a person)
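If you would like to verify this outside the app, base R's power.t.test reproduces the numbers from C and D approximately (it uses the t distribution, so its answer can differ from the EGAP calculator by a subject or two):

    # Sample size per arm for a 3.8-point effect, SD 15.2, alpha = 0.05, power = 0.8
    power.t.test(delta = 3.8, sd = 15.2, sig.level = 0.05, power = 0.80)
    # n comes out to roughly 252 per arm, in line with the app's 503 total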

ESTIMATING SAMPLE SIZE USING DATA FROM A PILOT STUDY (20 MINS)

Now it’s your turn! Suppose you have data from a pilot study your team did for this project. This data can be
found in (xls file). Here, test scores are again our outcome of interest, and the pilot study was done in the
same population from which you’ll draw your sample for the main study of the Balsakhi tutoring program.


E. What is the mean and standard deviation of the dependent variable?


Mean: 33
Standard deviation: 14.3

F. After the tutoring program, what do you expect for the mean test score in the control group? What
standard deviation do you expect?

Mean: 33 – we do not expect the tutoring program to affect the distribution of test scores
for the control group, meaning both the mean test score and the standard deviation of
the test score should remain the same.
Standard deviation: 14.3 – see above.
After deliberations with your partner organization, you’ve decided that you need an effect size of at least 10% for the program to be worth its costs, which is roughly what the previous study from parts A-D found.
G. If you observe a 10% increase in test scores as a result of the tutoring program, what is the mean test
score for the treatment group after the intervention? What do you expect for the standard deviation of
test scores in the treatment group? What is the effect size?
Mean test score in treatment group: 33 x 1.1 = 36.3
Standard deviation of test scores in treatment group: 14.3 – currently, we do not expect
the tutoring program to change the spread of test scores in the treatment group (though,
realistically, it is probably the case that students would respond differently to tutoring,
which would increase the variance of test scores. We will return to this below.)
Effect size: 36.3 - 33 = 3.3

H. Given α=0.05 and the standard deviation of the outcome variable you found above, what is the minimum
sample size you need to detect the effect size you found in G with 80% power?
Total number of participants: 590
Number in treatment: 590/2 = 295
Number in control: 590/2 = 295
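As an optional cross-check with base R (the EGAP app remains the intended tool for this exercise, and may differ by a subject or two):

    power.t.test(delta = 3.3, sd = 14.3, sig.level = 0.05, power = 0.80)
    # n is roughly 295-296 per arm, i.e., about 590 in total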

A new study has come out finding a 5% increase in test scores as a result of a tutoring program in a different
city in Gujarat (the same state where your program will take place). While the prior study of the tutoring
program in Andhra Pradesh led you to believe that a 10% increase in test scores is possible, the more recent
study suggests that a smaller increase of 5% is more reasonable.

I. What is the effect size for a 5% increase in test scores?


Answer: A 5% increase in test scores means a post-tutoring score of 33 x 1.05 = 34.65


Effect size = 34.65 – 33 = 1.65

J. Now calculate the minimum sample size that is needed to detect the effect size you found in I. Use the
standard deviation you found in part G.
Total number of students: 2359
Number in treatment: 1180 (can’t have ½ student)
Number in control: 1180 (can’t have ½ student)
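The same cross-check makes the quadratic cost of a smaller effect visible: halving the effect size roughly quadruples the required sample.

    power.t.test(delta = 1.65, sd = 14.3, sig.level = 0.05, power = 0.80)
    # n is roughly 1180 per arm (about 2360 in total),
    # versus roughly 295 per arm for delta = 3.3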

K. Explain what you found about sample sizes needed to detect a 5% increase in test scores versus a 10%
increase in test scores, given no changes in the variance of the outcome variable. Intuitively, will you
need larger or smaller samples to measure effect sizes that are smaller, relative to their spread? Why?
Answer: You will need a larger sample to detect an effect size that is smaller, relative to
the standard deviation. While we require a sample of at least 2359 students to detect an
effect size of 5%, we only require a sample of 590 students to detect a 10% increase in test
scores. Intuitively, given the same variation in test scores, for smaller effect sizes there
will be a greater chance that the true effect will be masked by this variance. For larger
sample sizes, we expect narrower sampling distributions for treatment and control
groups.
[Optional: Sketch out for participants two scenarios of overlapping bell curves
representing control and treatment sampling distributions. The control distribution
(mean, variance) should be the same in both scenarios. In one scenario, show the
treatment distribution overlapping with the control distribution more than in the other
scenario, leaving the variance of the treatment distribution the same in both scenarios.
The scenario with more overlap between the two distributions represents a smaller
effect size, relative to the variance, while the scenario with less overlap between the two
distributions represents a larger effect size, relative to the variance. You can also sketch
out a third set of overlapping bell curves, where the variance for both distributions is
smaller than in the first two scenarios (and where there is thus less overlap between the
treatment and control distributions). This represents a larger sample size. Remind
participants of the Law of Large Numbers from the previous lecture and that larger
samples are more likely to give you narrower sampling distributions, which will overlap
less. This is a graphical representation of how power decreases with the effect size,
relative to the variance, but increases with the sample size.]
For example:

[Figure: smaller effect size, relative to the variance]

[Figure: larger effect size, relative to the variance]

[Figure: smaller variance (the means are the same as in the first graphic above, with the smaller effect size, but the variance is smaller)]
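If you would rather generate these sketches than draw them by hand, a few lines of R will do. (Purely illustrative: the means and standard deviations below are made up to mimic the three scenarios.)

    # Control vs. treatment sampling distributions with different degrees of overlap
    curve(dnorm(x, mean = 0, sd = 1), from = -4, to = 7, ylab = "Density")  # control
    curve(dnorm(x, mean = 1, sd = 1), add = TRUE, lty = 2)  # small effect: heavy overlap
    curve(dnorm(x, mean = 3, sd = 1), add = TRUE, lty = 3)  # large effect: little overlap
    # Redrawing the first pair with sd = 0.5 mimics a larger sample: narrower
    # sampling distributions, hence less overlap for the same difference in means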
L. Recall that in the first part of the exercise, you used an effect size of 10%, based on results from a previous
study, to do your power calculations. With the new study suggesting that a 5% increase in test scores is
more reasonable, you are now faced with a dilemma: what sample size should you pick for your study,
and why?

Answer: If you have the funding to cover the larger sample size, you should err on the
side of caution and go with the more conservative estimate. The risk of using the smaller
sample size, based on an effect size of 10%, is that the tutoring program will increase test
scores, but the true effect size will be too small to detect based on a sample of 590, as found
above. This would lead to a type II error and waste resources.
While the new study found that test scores increased after the tutoring intervention, it also found that the
standard deviation of test scores increased, meaning that there was a larger spread of test scores across the
treatment group. This is because students respond differently to tutoring; some students’ test scores increased
dramatically after the tutoring program, while others’ increased only slightly. To account for the higher
variance in test scores, you posit that instead of the standard deviation you found in G, the standard deviation
of test scores may now be 16.5 after the tutoring program.

M. Without going through the calculations, does the minimum sample size needed to detect a 10% increase
in test scores increase, decrease, or remain the same when the standard deviation of test scores rises to
16.5?
Answer: It increases. The size of the treatment effect, relative to the standard deviation,
has decreased. Intuitively, because the underlying variation in test scores in the treatment
group has increased without the distribution shifting left or right, you are more likely to
select, by chance, a treatment group that sits in the bottom tail of the distribution (and so
looks more similar to the control group after the intervention).

N. Having gone through the intuition, now calculate the minimum sample size needed to detect a 10%
increase in test scores, given a standard deviation in test scores of 16.5.
Answer: The calculator outputs 785 in total. Since you can’t have half a person, this
amounts to 393 each in the treatment and control groups, or 786 in total.
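Cross-checking N with base R (approximate, as before):

    power.t.test(delta = 3.3, sd = 16.5, sig.level = 0.05, power = 0.80)
    # n is roughly 393-394 per arm, in line with the app's 785-786 total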

LIMITED RESOURCES AND IMPERFECT COMPLIANCE (10 MINS)

Sometimes, rather than calculate a budget based on sample size, we have a maximum budget and need to
decide whether it is worth doing the study (that is, whether we are sufficiently powered to detect a given
effect size, conditional on budgetary limitations).

O. You find out that you only have enough funds for a sample size of 2400 in total. Using the more recent
paper’s estimate of a roughly 5% increase in test scores and a standard deviation of test scores of 16.5,
what is the power of your experiment? (An approximate answer is okay; it’s hard to get exact power on
the calculator this way.) Is it worth carrying out the study on just 2400 students? How would you
determine this?


Answer: Power is roughly 70% (with an effect size of 1.65 and standard deviation of 16.5).
The danger here is that you will not be powered to detect a 5% increase and could thus
falsely fail to reject the null hypothesis, or commit a type II error. Note that if you did not
think the standard deviation changed as a result of the tutoring program, you would be
sufficiently powered (with an effect size of 1.65 and standard deviation of 14.3).
Encourage participants to think through the downsides of either scenario: increasing the
budget versus accepting a roughly 10 percentage point higher chance of failing to detect a
true effect. What is it “worth” to get this extra power?
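To verify the power figure, fix the sample size and let power.t.test solve for power instead (again an approximation to the app's answer):

    power.t.test(n = 1200, delta = 1.65, sd = 16.5, sig.level = 0.05)
    # power is roughly 0.69
    power.t.test(n = 1200, delta = 1.65, sd = 14.3, sig.level = 0.05)
    # with the smaller SD, power is roughly 0.80, i.e., sufficiently powered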

P. If you use the 10% increase in test scores as suggested by the first study, is it worth carrying out the study
on just 2400 students? What is the power of your experiment? Assume a standard deviation of test scores
of 16.5.

Answer: Yes, you are very well-powered (near 1, or 100%). Recall above that a sample size
of 785 total was needed for 80% power, given an effect size of 3.3 and standard deviation
of 16.5.

Q. Your research team has conducted some focus groups and determined that only 40% of students would
be interested in the tutoring services—that is, not every student who is offered the tutoring program will
choose to attend. How does this affect your power calculations? Will the required sample size to detect
a 10% increase in test scores increase, decrease, or remain the same? Why? Here, you should assume that
noncompliance is random—otherwise, and as discussed in case study 4, we would have to worry about
selection bias from noncompliance (for example, weaker students may be more likely to take up the
tutoring services, which would give you an underestimate of the program’s impact).

Answer: You will need to increase the sample size substantially to detect a 10% increase
in test scores. The tutoring program will still have the same effect size for compliers, but
now the treatment group will be a mix of compliers and non-compliers (for whom we
expect to see no change in test scores). So the average increase in test scores in the
treatment group will only be 3.3 x 0.40 = 1.32 points.
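A quick calculation shows how large "substantially" is, under the random-noncompliance assumption above:

    # Intention-to-treat effect with 40% take-up: 3.3 * 0.4 = 1.32 points
    power.t.test(delta = 3.3 * 0.4, sd = 16.5, sig.level = 0.05, power = 0.80)
    # n is roughly 2450+ per arm, i.e., about 4900 in total:
    # (1 / 0.4)^2 = 6.25 times the ~785 needed under full compliance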

CLUSTERED DESIGNS (20 MINS)

(Time permitting: for groups that move quickly, you may be able to get through this portion
in class. Otherwise, participants can finish this section on their own.)


Thus far we have considered a simple design where we randomize at the individual level, where students are
either assigned to the treatment (tutoring) or control (no tutoring) condition. However, spillovers could be
a major concern with such a design: if treatment and control students are in the same school, let alone the
same classroom, students receiving tutoring may affect the outcomes for students not receiving tutoring (such
as through peer learning effects) and vice versa. This would give us a biased estimate of the impact of
the tutoring program. Here, we would likely underestimate the effect of the tutoring program, because
students in the control group would benefit from it as well.

To avoid this issue, your research team decides to run a cluster randomized trial, randomizing at the school
level instead of the individual level. In this case, each school forms a “cluster,” with all of the students in a
given school assigned to either the treatment group or the comparison group. Under such a design, the only
spillovers that may show up would be across schools, a far less likely possibility than spillovers within schools.

Generally, individuals within a cluster tend to be more similar to each other than individuals from different
clusters. For example, students in the same school share the same teachers, the same peers, and may share
similarities in socioeconomic status and other factors that help determine school performance. This
correlation in behavior of individuals within a given cluster is called intra-cluster or intra-class
correlation, and we need to account for it in our power calculations using rho (ρ), the intra-cluster
correlation coefficient (ICC), for our outcome of interest. Remember, ρ is a measure of the degree of
similarity between children within a given school (see key vocabulary at the start of this exercise); it tells us
how strongly the outcomes are correlated for units within the same cluster (specifically, it is the share of the
variance between clusters relative to the overall variance). If students from the same school all scored exactly
the same on the test, then ρ would equal 1. If, on the other hand, test scores of students from the same school
were independent, with zero correlation among students within the same school, then ρ would equal 0.
Realistically, ρ will fall somewhere between these two extremes.

The ICC (ρ) of a given variable is typically determined by looking at pilot or baseline data for your population
of interest. Should you not have this data, another way of estimating ρ is to look at other studies examining
similar outcomes amongst similar populations. Given the inherent uncertainty with extrapolating across
contexts and populations, it is useful to consider a range of ρs when conducting your power calculations to
see how sensitive they are to changes in ρ, a process known as sensitivity analysis. We will look at this a little
further on; assumptions about ρ will have important implications for power calculations. While ρ can vary
widely depending on what you are looking at, values of less than 0.05 are typically considered low, values
between 0.05 and 0.20 are considered moderate, and values above 0.20 are considered fairly high.
Again, what counts as a low versus high ρ can vary dramatically by context and outcome of interest, but these
ranges serve as initial rules of thumb.
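These rules of thumb matter because, under the standard design-effect approximation (a back-of-the-envelope formula; the EGAP app performs its own, slightly different, calculation), the required sample scales with 1 + (m - 1)ρ for clusters of size m:

    # Design effect for equal-sized clusters of size m
    deff <- function(m, rho) 1 + (m - 1) * rho
    deff(m = 30, rho = 0.05)  # 2.45: even a "low" ICC more than doubles the sample
    deff(m = 30, rho = 0.20)  # 6.80: a high ICC is far more costly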

First, let’s look at how power changes with the ICC. Start by checking the “Clustered Design?” box in the
EGAP calculator. With this box checked, the blue line on the graph shows power with individual-level
randomization, while the green line shows power for a clustered design, both for a given set of parameters
(significance level, treatment effect size, standard deviation of the outcome variable, ICC, number of clusters
per arm, power target, and sample size). Keep the significance level at α=0.05 and the power target at 0.8,
and set the treatment effect size based on the 10% increase in test scores you found earlier, with a standard
deviation of 16.5. Keep the number of clusters per arm at 40, but increase the maximum number of subjects
to 5000 (which rescales the graph and lets you see more easily what’s going on).

R. The ICC default setting on this calculator is 0.5. What happens to the green line, relative to the blue line,
as you slowly increase it to 1? What happens as you slowly decrease it down to 0? Intuitively, why is the
green line moving closer to or further from the blue line as you change the ICC?

Answer: The intra-cluster correlation reflects the level of similarity between individuals
within a cluster (with respect to our outcome variable). If the ICC is 0, there is zero
correlation in outcomes for students in the same school (cluster). Here, each individual
student can be thought of as a unique cluster, and this case is statistically equivalent to
individual-level randomization (as shown by the overlap in lines on the chart as the ICC
goes to 0). On the opposite side of the spectrum, if outcomes of students within a school
were perfectly correlated (i.e., ICC=1), our 80 clusters in total would be equivalent to 80
sampling units, or would be the same as sampling just 80 students.
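The two extremes in this answer can be expressed as an "effective sample size," a standard back-of-the-envelope quantity (the 40 students per school below is purely for illustration; the exercise does not fix a cluster size here):

    n_total <- 80 * 40             # 80 clusters of 40 students each (illustrative)
    n_total / (1 + (40 - 1) * 0)   # rho = 0: all 3200 observations count
    n_total / (1 + (40 - 1) * 1)   # rho = 1: only 80 effective observations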

Based on the pilot study and earlier tutoring interventions, your research team has estimated a ρ of 0.11. You
need to calculate the total sample size to measure a 10% increase in test scores (assuming that the mean test
score at baseline is 33, with a standard deviation of 16.5). You can do this by checking the clustered design
box in the EGAP calculator and adjusting the intra-cluster correlation bar to 0.11.

S. Change the number of clusters per arm to 55. Given 55 clusters per arm (so 110 clusters in total), a 10%
increase in test scores, a standard deviation of 16.5, and an ICC of 0.11, how many subjects in total do
you need for 80% power? How many in each cluster? How does this compare to the sample size you
found in part N, using individual-level randomization? Why?

Answer: You need a sample size of at least 3,548. With 110 clusters, this means an average
of at least 32 participants per cluster (since you can’t have a fraction of a person in a
cluster). This is much more than the 785 we found in N. See answer above for the
intuition.
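As a rough cross-check on the app's output (the design-effect formula is an approximation, so it lands near, rather than exactly on, 3,548):

    785 * (1 + (32 - 1) * 0.11)  # ~3462: the unclustered n from part N times the design effect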

With an individual-level randomization, we can only manipulate the number of participants in the study. But
with a clustered design, we can manipulate the number of clusters and the number of participants in each
cluster. The two affect power in slightly different ways and have different costs—it is typically going to be
cheaper to add participants to a cluster than to add clusters, though sometimes there is a limit on the number
of participants in a cluster (e.g., a set number of students in a class). We’ll now examine how manipulating
the number of clusters versus the number of participants per cluster affects power, starting with the number
of clusters.

T. Using the same effect size, standard deviation, and ICC as in S, how many students do you need in total
if you have 60 clusters per arm, or 120 clusters in total? How many students per cluster is this? Fill in the
table below.

Answer: You need a sample size of at least 2,648. With 120 clusters, this means an average
of at least 22 participants per cluster, which works out to 120 x 22 = 2,640.

U. If you have 50 clusters per arm, or 100 clusters in total, how many students do you need per cluster and
in total? Fill in the table below.

Answer: You need a sample size of at least 5990. With 100 clusters, this means an average
of at least 60 participants per cluster (since you can’t have a fraction of a person in a
cluster).

                                 50 schools per arm      55 schools per arm      60 schools per arm
                                 (100 schools in total)  (110 schools in total)  (120 schools in total)
Number of students per school:   60                      32                      22
Total no. of students:           6,000 (5,990 also ok)   3,520 (3,548 also ok)   2,640 (2,648 also ok)

Note: The calculator spits out the numbers in parentheses as answers, while mathematically the number of students per cluster times the number of clusters (provided by the calculator) equals the numbers not in parentheses. These discrepancies are due to rounding, since it is impossible to have a fraction of a person.

V. As the number of clusters increases, does the total number of students required for your study increase
or decrease? Why do you suspect this is the case?

Answer: The total sample size decreases as the number of clusters increases, i.e., you need fewer students
in total. Each additional cluster adds independent variation, whereas additional students within an existing
cluster add information that is partly redundant (their outcomes are correlated), so a larger number of
clusters gives you a more representative sample for the same number of students.

W. You realize that you had read the pilot data wrong: it turns out that ρ is actually 0.07, not 0.11. Now
what would the number of students per cluster and the total number of students be if you had 50 schools
per arm (or 100 schools in total)? What about with 55 schools per arm, or 60 schools per arm? Fill in the
table below.


                                 50 schools per arm      55 schools per arm      60 schools per arm
                                 (100 schools in total)  (110 schools in total)  (120 schools in total)
Number of students per school:   17                      14                      12
Total no. of students:           1,696 (1,700 also ok)   1,514 (1,540 also ok)   1,390 (1,440 also ok)

Note: Either number in each cell is acceptable: one comes from the calculator, while the other is the number of students per cluster times the number of clusters. The discrepancies between the two are due to rounding, since it is impossible to have a fraction of a person.
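For the curious: this ρ = 0.07 row can be recovered by solving the design-effect approximation for the cluster size m, given k clusters. (A sketch; the app's internal formula evidently differs a little, as the ρ = 0.11 table shows.)

    # Require k * m >= n_simple * (1 + (m - 1) * rho), which rearranges to
    # m = n_simple * (1 - rho) / (k - n_simple * rho)
    m_per_cluster <- function(k, n_simple, rho) {
      ceiling(n_simple * (1 - rho) / (k - n_simple * rho))
    }
    sapply(c(100, 110, 120), m_per_cluster, n_simple = 785, rho = 0.07)
    # returns 17 14 12, matching the "students per school" row above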

X. How do your answers here compare to your answers in part U? Why?

Answer: You don’t need as many students because the ICC is smaller. A smaller ICC means
that the students within a cluster look less similar to one another in terms of test scores, and
thus each cluster will be more representative of your sample population. This means that we
can achieve the same level of power while including fewer students in our sample (and fewer
students per cluster).

Y. Given a choice between offering the tutors to more children in each school (i.e., adding more individuals
to the cluster) versus offering tutors in more schools (i.e., adding more clusters), which option is best
purely from the perspective of improving statistical power? What about from a cost perspective?

Answer: To increase power, adding clusters is clearly the better choice. Adding more
clusters brings in more independent variation, whereas adding more individuals per
cluster brings in observations that are likely to be correlated with the observations you
already have. From a cost perspective, it depends on the relative expense of the two
options: sampling more students in total (as you would have to do if you add students
rather than clusters) versus adding more clusters but sampling fewer students per cluster.
In general, adding more students will be cheaper than adding more clusters, though this
is not always the case. As noted above, you may also be unable to increase the number of
students per cluster beyond a natural limit.

RESOURCES

3ie, “Power calculation for causal inference in social science: sample size and minimum detectable effect determination” (report and calculator tool): http://www.3ieimpact.org/evidence-hub/publications/working-papers/power-calculation-causal-inference-social-science-sample

Optimal Design program for power calculations: http://hlmsoft.net/od/


Institute for Fiscal Studies, “Going beyond simple sample size calculations: a practitioner’s guide for power calculations” (includes sample Stata code): https://www.ifs.org.uk/uploads/publications/wps/WP201517_update_Sep15.pdf

World Bank Development Impact blog, “Power Calculations 101: Dealing with Incomplete Take-up”: http://blogs.worldbank.org/impactevaluations/power-calculations-101-dealing-with-incomplete-take-up

“Remedying Education: Evidence from Two Randomized Experiments in India” (background on the Balsakhi tutoring program): https://www.povertyactionlab.org/sites/default/files/publications/6%20Computer-Assisted%20Learning%20Project%20with%20Pratham%20in%20India%2007.pdf
