Statistics
Reliability is concerned with questions of stability and consistency - does the same measurement
tool yield stable and consistent results when repeated over time? Think about measurement
processes in other contexts - in construction or woodworking, a tape measure is a highly reliable
measuring instrument.
Say you have a piece of wood that is 2 1/2 feet long. You measure it once with the tape
measure - you get a measurement of 2 1/2 feet. Measure it again and you get 2 1/2 feet. Measure
it repeatedly and you consistently get a measurement of 2 1/2 feet. The tape measure yields
reliable results.
Validity refers to the extent we are measuring what we hope to measure (and what we think we
are measuring). To continue with the example of measuring the piece of wood, a tape measure
that has been created with accurate spacing for inches, feet, etc. should yield valid results as well.
Measuring this piece of wood with a "good" tape measure should produce a correct measurement
of the wood's length.
To apply these concepts to social research, we want to use measurement tools that are both
reliable and valid. We want questions that yield consistent responses when asked multiple times -
this is reliability. Similarly, we want questions that get accurate responses from respondents - this
is validity.
Reliability
Reliability refers to a condition where a measurement process yields consistent scores (given an
unchanged measured phenomenon) over repeat measurements. Perhaps the most straightforward
way to assess reliability is to check whether a measure meets the following three criteria of reliability.
Measures that are high in reliability should exhibit all three.
Test-Retest Reliability
When a researcher administers the same measurement tool multiple times - asks the same
question, follows the same research procedures, etc. - does he/she obtain consistent results,
assuming that there has been no change in whatever he/she is measuring? This is really the
simplest method for assessing reliability - when a researcher asks the same person the same
question twice ("What's your name?"), does he/she get back the same results both times? If so,
the measure has test-retest reliability. Measurement of the piece of wood talked about earlier has
high test-retest reliability.
Inter-Item Reliability
This is a dimension that applies to cases where multiple items are used to measure a
single concept. In such cases, answers to a set of questions designed to measure some single
concept (e.g., altruism) should be associated with each other.
Interobserver Reliability
Interobserver reliability concerns the extent to which different interviewers or observers using
the same measure get equivalent results. If different observers or interviewers use the same
instrument to score the same thing, their scores should match. For example, the interobserver
reliability of an observational assessment of parent-child interaction is often evaluated by
showing two observers a videotape of a parent and child at play. These observers are asked to use
an assessment tool to score the interactions between parent and child on the tape. If the
instrument has high interobserver reliability, the scores of the two observers should match.
Validity
To reiterate, validity refers to the extent we are measuring what we hope to measure (and what
we think we are measuring). How do we assess the validity of a set of measurements? A valid
measure should satisfy four criteria.
Face Validity
This criterion is an assessment of whether a measure appears, on the face of it, to measure the
concept it is intended to measure. This is a very minimal assessment - if a measure cannot
satisfy this criterion, then the other criteria are inconsequential. We can think about observational
measures of behavior that would have face validity. For example, striking out at another person
would have face validity for an indicator of aggression. Similarly, offering assistance to a
stranger would meet the criterion of face validity for helping. However, asking people about their
favorite movie to measure racial prejudice has little face validity.
Content Validity
Content validity concerns the extent to which a measure adequately represents all facets of a
concept. Consider a series of questions that serve as indicators of depression (don't feel like
eating, lost interest in things usually enjoyed, etc.). If there were other kinds of common
behaviors that mark a person as depressed that were not included in the index, then the index
would have low content validity since it did not adequately represent
all facets of the concept.
Criterion-Related Validity
Criterion-related validity applies to instruments that have been developed to be useful as
indicators of a specific trait or behavior, either now or in the future. For example, think about the
driving test as a social measurement that has pretty good predictive validity. That is to say, an
individual's performance on a driving test correlates well with his/her driving ability.
Construct Validity
But for many things we want to measure, there is not necessarily a pertinent criterion available.
In this case, turn to construct validity, which concerns the extent to which a measure is related to
other measures as specified by theory or previous research. Does a measure stack up with other
variables the way we expect it to? A good example of this form of validity comes from early self-
esteem studies - self-esteem refers to a person's sense of self-worth or self-respect. Clinical
observations in psychology had shown that people who had low self-esteem often had
depression. Therefore, to establish the construct validity of the self-esteem measure, the
researchers showed that those with higher scores on the self-esteem measure had lower
depression scores, while those with low self-esteem had higher rates of depression.
So what is the relationship between validity and reliability? The two do not necessarily go hand-
in-hand.
At best, we have a measure that has both high validity and high reliability. It yields consistent
results in repeated application and it accurately reflects what we hope to represent.
It is possible to have a measure that has high reliability but low validity - one that is consistent in
getting bad information or consistent in missing the mark. It is also possible to have one that
has low reliability and low validity - inconsistent and not on target.
Finally, it is not possible to have a measure that has low reliability and high validity - you can't
really get at what you want or what you're interested in if your measure fluctuates wildly.
Samples and Sampling
Introduction
"Statistical designs always involve compromises between the desirable and the possible."
Leslie Kish. Statistical Designs for Research. 1987. (New York: John Wiley and Sons) p. 1.
As the quote above from Leslie Kish highlights, all research designs involve some form of
compromise or adjustment. One of the dimensions on which such compromises are made relates
to the populations about which we wish to learn. There are many research questions we would
like to answer that involve populations that are too large to consider learning about every
member of the population. How have wages of European workers changed over the past ten
years? How do Americans feel about the job that the President is doing? What are the
management practices of international banking firms?
Questions such as these are important in understanding the world around us, yet it would be
impractical, if not impossible, to measure the wages of all European workers, the feelings about
the President of all Americans, and the banking practices of the world's banks. Generally, in
answering such questions, social scientists examine a fraction of the possible population of
interest, drawing statistical inferences from this fraction. The selection process used to draw such
a fraction is known as sampling, while the group contained in the fraction is known as the
sample.
It is not only statisticians or quantitative researchers who sample. Journalists who select a
particular case or particular group of people to highlight in a news story are engaging in a form
of sampling. Most of us, in our everyday lives, do some sampling, whether we realize it or not.
Although you may not have listened to all the songs of a particular band or singer, you likely
would be able to form an opinion about such songs from hearing a few of them. In making such
inferences you've relied on a subset of entities (some songs of an artist) to generalize to a larger
group (all songs by an artist). You've sampled.
Why Sample?
Sampling is done in a wide variety of research settings. Listed below are a few of the benefits of
sampling:
1. Reduced cost: It is obviously less costly to obtain data for a selected subset of a
population, rather than the entire population. Furthermore, data collected through a
carefully selected sample can provide highly accurate estimates of the larger population. Public
opinion researchers can usually draw accurate inferences for the entire population of the
United States from interviews of only 1,000 people.
2. Speed: Observations are easier to collect and summarize with a sample than with a
complete count. This consideration may be vital if the speed of the analysis is important,
such as through exit polls in elections.
3. Greater scope: Sometimes highly trained personnel or specialized equipment limited in
availability must be used to obtain the data. A complete census (enumeration) is not
practical or possible. Thus, surveys that rely on sampling have greater flexibility
regarding the type of information that can be obtained.
It is important to keep in mind that the primary point of sampling is to create a small group from
a population that is as similar to the larger population as possible. In essence, we want to have a
little group that is like the big group. With that in mind, one of the features we look for in a
sample is the degree of representativeness - how well does the sample represent the larger
population from which it was drawn? How closely do the features of the sample resemble those
of the larger population?
There are, of course, good and bad samples, and different sampling methods have different
strengths and weaknesses. Before turning to specific methods, a few specialized terms used in
sampling should be defined.
Sampling Terminology
Samples are always drawn from a population, but we have not defined the term "population." By
"population" we denote the aggregate from which the sample is drawn. The population to be
sampled (the sampled population) should coincide with the population about which information
is wanted (the target population). Sometimes, for reasons of practicality or convenience, the
sampled population is more restricted than the target population. In such cases, precautions must
be taken to ensure that the conclusions refer only to the sampled population.
Before selecting the sample, the population must be divided into parts that are called sampling
units or units. These units must cover the whole of the population and they must not overlap, in
the sense that every element in the population belongs to one and only one unit. Sometimes the
choice of the unit is obvious, as in the case of the population of Americans so often used for
opinion polling. In sampling individuals in a town, the unit might be an individual person, the
members of a family, or all persons living in the same city block. In sampling an agricultural
crop, the unit might be a field, a farm, or an area of land whose shape and dimensions are at our
disposal. The construction of this list of sampling units, called a frame, is often one of the major
practical problems.
Most of us are familiar with sampling at some level through seeing reports about levels of
popular opinion about some current topic. Newspapers and television programs are filled with
references to the current state of popular opinion.
One of the most prestigious firms in the polling business is The Gallup Organization. Their web
site has an excellent description of why sampling is common in social science research, how it is
conducted, and some other issues. For example, the page contains the following statement about
the value of sampling:
The basic principle: a randomly selected, small percent of a population of people can represent
the attitudes, opinions, or projected behavior of all of the people, if the sample is selected
correctly.
Samples and Sampling
Types of Sampling
Although there are a number of different methods that might be used to create a sample, they
generally can be grouped into one of two categories: probability samples or non-probability
samples.
Probability Samples
The idea behind this type of sampling is random selection. More specifically, each sample from the
population of interest has a known probability of selection under a given sampling scheme. There
are four categories of probability samples described below.
The most widely known type of a random sample is the simple random sample (SRS). This is
characterized by the fact that the probability of selection is the same for every case in the
population. Simple random sampling is a method of selecting n units from a population of size N
such that every possible sample of size n has an equal chance of being drawn.
An example may make this easier to understand. Imagine you want to carry out a survey of 100
voters in a small town with a population of 1,000 eligible voters. With a town this size, there are
"old-fashioned" ways to draw a sample. For example, we could write the names of all voters on a
piece of paper, put all pieces of paper into a box and draw 100 tickets at random. You shake the
box, draw a piece of paper and set it aside, shake again, draw another, set it aside, etc. until we
had 100 slips of paper. These 100 form our sample. And this sample would be drawn through a
simple random sampling procedure - at each draw, every name in the box had the same
probability of being chosen.
In real-world social research, designs that employ simple random sampling are difficult to come
by. We can imagine some situations where it might be possible - you want to interview a sample
of doctors in a hospital about work conditions. So you get a list of all the physicians that work in
the hospital, write their names on slips of paper, put those slips in a box, shake,
and draw. But in most real-world instances it is impossible to list every element on a slip of paper,
put the slips in a box, and randomly draw until the desired sample size is reached.
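As an illustration, here is a minimal sketch in Python of the "names in a box" procedure described above, using a hypothetical list of 1,000 voters and the standard library's random.sample:

```python
import random

# Hypothetical sampling frame: the 1,000 eligible voters in the small town.
voters = [f"Voter {i}" for i in range(1, 1001)]

# Draw a simple random sample of 100 voters: every possible sample of
# size 100 has the same chance of being selected.
sample = random.sample(voters, k=100)

print(sample[:5])  # inspect the first few sampled voters
```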
There are many reasons why one would choose a different type of probability sample in practice.
Example 1
Suppose you were interested in investigating the link between the family of origin and income
and your particular interest is in comparing incomes of Hispanic and Non-Hispanic respondents.
For statistical reasons, you decide that you need at least 1,000 non-Hispanics and 1,000
Hispanics. Hispanics comprise around 6 or 7% of the population. If you take a simple random
sample of all races that would be large enough to get you 1,000 Hispanics, the sample size would
be near 15,000, which would be far more expensive than a method that yields a sample of 2,000.
One strategy that would be more cost-effective would be to split the population into Hispanics
and non-Hispanics, then take a simple random sample within each portion (Hispanic and non-
Hispanic).
Example 2
Let's suppose your sampling frame is a large city's telephone book that has 2,000,000 entries. To
take a SRS, you need to associate each entry with a number and choose n= 200 numbers from
N = 2,000,000. This could be quite an ordeal. Instead, you decide to take a random start between
1 and N/n = 10,000 and then take every 10,000th name. This is an example of systematic
sampling, a technique discussed more fully below.
Example 3
Suppose you wanted to study dance club and bar employees in NYC with a sample of n = 600.
Yet there is no list of these employees from which to draw a simple random sample. Suppose you
obtained a list of all bars/clubs in NYC. One way to draw this sample would be to randomly sample 300
bars/clubs and then randomly sample 2 employees within each bar/club. This is an example of cluster
sampling. Here the unit of analysis (employee) is different from the primary sampling unit (the
bar/club).
In each of these three examples, a probability sample is drawn, yet none is an example of simple
random sampling. Each of these methods is described in greater detail below.
Although simple random sampling is the ideal for social science and most of the statistics used
are based on assumptions of SRS, in practice, SRS designs are rarely seen. They can be terribly inefficient,
and particularly difficult when large samples are needed. Other probability methods are more
common. Yet SRS remains essential, both as a foundation for other methods and as an easy-to-understand
way of selecting a sample.
To recap, simple random sampling is a sampling procedure in which every element
of the population has the same chance of being selected and every element in the sample is
selected by chance.
Stratified Random Sampling
In this form of sampling, the population is first divided into two or more mutually exclusive
segments based on some categories of variables of interest in the research. It is designed to
organize the population into homogenous subsets before sampling, then drawing a random
sample within each subset. With stratified random sampling, the population of N units is divided
into subpopulations of N1, N2, ..., Nk units respectively. These subpopulations, called strata, are non-
overlapping and together they comprise the whole of the population. When these have been
determined, a sample is drawn from each, with a separate draw for each of the different strata.
The sample sizes within the strata are denoted by n1, n2, ..., nk respectively. If an SRS is taken within each
stratum, then the whole sampling procedure is described as stratified random sampling.
The primary benefit of this method is to ensure that cases from smaller strata of the population
are included in sufficient numbers to allow comparison. An example makes it easier to
understand. Say that you're interested in how job satisfaction varies by race among a group of
employees at a firm. To explore this issue, we need to create a sample of the employees of the
firm. However, the employee population at this particular firm is predominantly white, as the
following chart illustrates:
If we were to take a simple random sample of employees, there's a good chance that we would
end up with very small numbers of Blacks, Asians, and Latinos. That could be disastrous for our
research, since we might end up with too few cases for comparison in one or more of the smaller
groups.
Rather than taking a simple random sample from the firm's population at large, in a stratified
sampling design, we ensure that appropriate numbers of elements are drawn from each racial
group in proportion to that group's share of the population as a whole. Say we want a sample of
1000 employees - we would stratify the sample by race (group of White employees, group of
African American employees, etc.), then randomly draw out 750 employees from the White
group, 90 from the African American, 100 from the Asian, and 60 from the Latino. This yields a
sample that is proportionately representative of the firm as a whole.
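A minimal sketch in Python of this stratified design, using a hypothetical employee roster and the allocation described above (the roster sizes are illustrative):

```python
import random

# Hypothetical employee roster, keyed by racial group, mirroring the example above.
roster = {
    "White": [f"W{i}" for i in range(7500)],
    "African American": [f"B{i}" for i in range(900)],
    "Asian": [f"A{i}" for i in range(1000)],
    "Latino": [f"L{i}" for i in range(600)],
}

# Number of cases to draw from each stratum (proportional allocation for n = 1,000).
allocation = {"White": 750, "African American": 90, "Asian": 100, "Latino": 60}

# Take a simple random sample within each stratum, then pool the draws.
stratified_sample = []
for group, members in roster.items():
    stratified_sample.extend(random.sample(members, k=allocation[group]))

print(len(stratified_sample))  # 1000
```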
Stratification is a common technique. There are many reasons for this, such as:
1. If data of known precision are wanted for certain subpopulations, then each of these
should be treated as a population in its own right.
2. Administrative convenience may dictate the use of stratification; for example, an
agency administering a survey may have regional offices, each of which can supervise the survey
for a part of the population.
3. Sampling problems may be inherent with certain subpopulations, such as people living in
institutions (e.g., hotels, hospitals, prisons).
4. Stratification may improve the estimates of characteristics of the whole population. It
may be possible to divide a heterogeneous population into sub-populations, each of
which is internally homogenous. If these strata are homogenous, i.e., the measurements
vary little from one unit to another, a precise estimate of any stratum mean can be
obtained from a small sample in that stratum. These estimates can then be combined into a
precise estimate for the whole population.
5. There is also a statistical advantage in the method, as a stratified random sample nearly
always results in a smaller variance for the estimated mean or other population
parameters of interest.
Systematic Sampling
This method of sampling is at first glance very different from SRS. In practice, it is a variant of
simple random sampling that involves some listing of elements - every nth element of the list is then
drawn for inclusion in the sample. Say you have a list of 10,000 people and you want a sample of
1,000.
1. Divide number of cases in the population by the desired sample size. In this example,
dividing 10,000 by 1,000 gives a value of 10.
2. Select a random number between one and the value attained in Step 1. In this example,
we choose a number between 1 and 10 - say we pick 7.
3. Starting with case number chosen in Step 2, take every tenth record (7, 17, 27, etc.).
More generally, suppose that the N units in the population are ranked 1 to N in some order (e.g.,
alphabetic). To select a sample of n units, we take a unit at random from the first k units (where
k = N/n) and take every k-th unit thereafter.
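A minimal sketch in Python of the systematic procedure just described, assuming a hypothetical list of 10,000 people and a desired sample of 1,000:

```python
import random

# Hypothetical ordered list of 10,000 people, as in the example above.
population = [f"Person {i}" for i in range(1, 10001)]

n = 1000                      # desired sample size
k = len(population) // n      # sampling interval: 10,000 / 1,000 = 10

start = random.randint(1, k)  # random start between 1 and k (e.g., 7)
# Take every k-th person thereafter: positions start, start + k, start + 2k, ...
sample = population[start - 1::k]

print(len(sample), sample[:3])
```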
The advantages of systematic sampling method over simple random sampling include:
1. It is easier to draw a sample and often easier to execute without mistakes. This is a
particular advantage when the drawing is done in the field.
2. Intuitively, you might think that systematic sampling might be more precise than SRS. In
effect it stratifies the population into n strata, consisting of the 1st k units, the 2nd k units,
and so on. Thus, we might expect the systematic sample to be as precise as a stratified
random sample with one unit per stratum. The difference is that with the systematic one
the units occur at the same relative position in the stratum whereas with the stratified, the
position in the stratum is determined separately by randomization within each stratum.
Cluster Sampling
In some instances the sampling unit consists of a group or cluster of smaller units that we call
elements or subunits (these are the units of analysis for your study). There are two main reasons
for the widespread application of cluster sampling. Although the first intention may be to use the
elements as sampling units, it is found in many surveys that no reliable list of elements in the
population is available and that it would be prohibitively expensive to construct such a list. In
many countries there are no complete and updated lists of the people, the houses or the farms in
any large geographical region.
Even when a list of individual houses is available, economic considerations may point to the
choice of a larger cluster unit. For a given size of sample, a small unit usually gives more precise
results than a large unit. For example a SRS of 600 houses covers a town more evenly than 20
city blocks containing an average of 30 houses apiece. But greater field costs are incurred in
locating 600 houses and in traveling between them than in covering 20 city blocks. When cost is
balanced against precision, the larger unit may prove superior.
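A minimal sketch in Python of a two-stage cluster design like the NYC bar/club example above (the frame of 2,000 bars with 20 employees apiece is hypothetical):

```python
import random

# Hypothetical frame of clusters: 2,000 bars, each with 20 employees.
bars = {f"Bar {i}": [f"Bar {i} employee {j}" for j in range(1, 21)]
        for i in range(1, 2001)}

# Stage 1: randomly sample 300 clusters (bars/clubs).
sampled_bars = random.sample(list(bars), k=300)

# Stage 2: randomly sample 2 employees within each sampled bar/club.
sample = []
for bar in sampled_bars:
    sample.extend(random.sample(bars[bar], k=2))

print(len(sample))  # 600 employees, the units of analysis
```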
Nonprobability Sampling
Social research is often conducted in situations where a researcher cannot select the kinds of
probability samples used in large-scale social surveys. For example, say you wanted to study
homelessness - there is no list of homeless individuals nor are you likely to create such a list.
However, you need to get some kind of a sample of respondents in order to conduct your
research. To gather such a sample, you would likely use some form of non-probability sampling.
To reiterate, the primary difference between probability methods of sampling and non-
probability methods is that in the latter you do not know the likelihood that any element of a
population will be selected for study.
Availability Sampling
Availability sampling is a method of choosing subjects who are available or easy to find. This
method is also sometimes referred to as haphazard, accidental, or convenience sampling. The
primary advantage of the method is that it is very easy to carry out, relative to other methods. A
researcher can merely stand out on his/her favorite street corner or in his/her favorite tavern and
hand out surveys. One place this used to show up often is in university courses. Years ago,
researchers often would conduct surveys of students in their large lecture courses. For example,
all students taking introductory sociology courses would have been given a survey and
compelled to fill it out. There are some advantages to this design - it is easy to do, particularly
with a captive audience, and in some schools you can attain a large number of interviews through
this method.
The primary problem with availability sampling is that you can never be certain what population
the participants in the study represent. The population is unknown, the method for selecting cases
is haphazard, and the cases studied probably don't represent any population you could come up
with.
However, there are some situations in which this kind of design has advantages - for example,
survey designers often want to have some people respond to their survey before it is given out in
the "real" research setting as a way of making certain the questions make sense to respondents.
For this purpose, availability sampling is not a bad way to get a group to take a survey, though in
this case researchers care less about the specific responses given than whether the instrument is
confusing or makes people feel bad.
Despite the known flaws with this design, it's remarkably common. Ask a provocative question,
give a telephone number and web site address ("Vote now at CNN.com"), and announce the results of the poll.
This method provides some form of statistical data on a current issue, but it is entirely unknown
what population the results of such polls represent. At best, a researcher could make some
conditional statement about people who are watching CNN at a particular point in time who
cared enough about the issue in question to log on or call in.
Quota Sampling
Quota sampling is designed to overcome the most obvious flaw of availability sampling. Rather
than taking just anyone, you set quotas to ensure that the sample you get represents certain
characteristics in proportion to their prevalence in the population. Note that for this method, you
have to know something about the characteristics of the population ahead of time. Say you want
to make sure you have a sample proportional to the population in terms of gender - you have to
know what percentage of the population is male and female, then collect your sample until it
matches those proportions. Marketing studies are particularly fond of this form of research design.
The primary problem with this form of sampling is that even when we know that a quota sample
is representative of the particular characteristics for which quotas have been set, we have no way
of knowing if the sample is representative in terms of any other characteristics. If we set quotas for
gender and age, we are likely to attain a sample with good representativeness on age and gender,
but one that may not be very representative in terms of income and education or other factors.
Moreover, because researchers can set quotas for only a small fraction of the characteristics
relevant to a study, quota sampling is really not much better than availability sampling. To
reiterate, you must know the characteristics of the entire population to set quotas; otherwise
there's not much point to setting up quotas. Finally, interviewers often introduce bias when
allowed to self-select respondents, which is usually the case in this form of research. In choosing
males 18-25, interviewers are more likely to choose those who are better dressed and seem more
approachable or less threatening. That may be understandable from a practical point of view, but
it introduces bias into research findings.
Purposive Sampling
Purposive sampling is a sampling method in which elements are chosen based on the purpose of the
study. Purposive sampling may involve studying the entire population of some limited group
(sociology faculty at Columbia) or a subset of a population (Columbia faculty who have won
Nobel Prizes). As with other non-probability sampling methods, purposive sampling does not
produce a sample that is representative of a larger population, but it can be exactly what is
needed in some cases - study of organization, community, or some other clearly defined and
relatively limited group.
Snowball Sampling
Snowball sampling is a method in which a researcher identifies one member of some population
of interest, speaks to him/her, then asks that person to identify others in the population that the
researcher might speak to. This person is then asked to refer the researcher to yet another person,
and so on.
Snowball sampling is very good for cases where members of a special population are difficult to
locate. For example, several studies of Mexican migrants in Los Angeles have used snowball
sampling to get respondents.
The method also has an interesting application to group membership - if you want to look at the
pattern of recruitment to a community organization over time, you might begin by interviewing
fairly recent recruits, asking them who introduced them to the group. Then interview the people
named, asking them who recruited them to the group.
The method creates a sample with questionable representativeness. A researcher is not sure who
is in the sample. In effect snowball sampling often leads the researcher into a realm he/she knows
little about. It can be difficult to determine how a sample compares to a larger population. Also,
there's an issue of whom respondents refer you to - friends refer to friends and are less likely to
refer to people they don't like or fear.
Hypothesis Testing
You may be familiar with examples of hypotheses and hypothesis testing from the natural
sciences, perhaps through schoolwork or participation in a school science fair. You may have
evaluated hypotheses such as:
What is a Hypothesis?
Each of these specifies a relationship that may or may not exist under particular conditions. They
are testable statements about relationships between different factors. But why bother with
forming a hypothesis as part of the research process?
For instance, consider a hypothetical experiment that is designed to evaluate whether enhancing
hospital patients' "sense of control" influences their health. In this experiment, conducted in
McGregor Hospital, ten people in the chronic care ward were sampled and given "enhanced
control" over their schedule and living conditions. They could specify when they would have
their meals, which hours they could receive visitors, and which programs they could watch on
television. To compare the benefits of this enhanced control, an additional ten patients of the
chronic care ward were chosen, though their routines were not altered.
After six weeks, the health of all subjects was measured and it was found that the mean level of
health (on a 10-point scale with higher numbers indicating better health) was 6 for the enhanced
control group and 4 for the non-enhanced group.
Perhaps the first question that should be asked is: "Can we be sure that the enhanced sense of
control is responsible for the difference between the groups, rather than chance?" It might be that
simply by chance the people who were chosen for the enhanced control group were somewhat
healthier before the experiment than those assigned to the other group. Or it might be that these
differences were due only to chance, rather than some benefit of control over living conditions.
What is needed is a way to evaluate the likelihood that relationships, such as those in the study in
the hospital described above, occurred by chance. The establishing and testing of hypotheses is
such a method.
Hypothesis Testing
The null hypothesis is often the reverse of what the experimenter actually believes; it is put
forward to allow the data to contradict it. In the study of the effect of sense of control on health,
the researchers expect that a sense of control will improve health. The null hypothesis they
would establish in this setting, then, is that enhancing sense of control will have no effect on
health. The alternative hypothesis is one that stands in contrast to the null, usually that the
condition or change will have some effect. In the sense of control example, the alternative
hypothesis is that changes in sense of control will result in a change in health.
Depending on the data, the null hypothesis either will or will not be rejected as a viable
possibility. If the data show a sufficiently large effect of the sense of control, then the null
hypothesis that sense of control has no effect can be rejected. Specific criteria used to accept or
reject the null hypothesis are discussed in the modules describing statistical tests used to evaluate
hypotheses.
With this understanding of the way that hypotheses are generated in the social sciences, you're
ready to look at specific tools that are used to test hypotheses.
The Chi-Square Test
Introduction
One of the most common and useful ways to look at information about the social world is in the
format of a table. Say, for example, we want to know whether boys or girls get into trouble more
often in school. There are many ways we might show information related to this question, but
perhaps the most frequent and easiest to comprehend method is in a table, such as the one below.

                          Boys   Girls
Got in trouble              46      37
Did not get in trouble      71      83
Total                      117     120

The above example is relatively straightforward in that we can fairly quickly tell that more boys
than girls got into trouble in school. Calculating percentages, we find that 39 percent of boys got
into trouble (46 boys got in trouble out of 117 total boys = 39%), as compared with 31 percent of
girls (37 girls got in trouble out of 120 total girls = 31%). However, to reframe the issue, what if
we wanted to test the hypothesis that boys get in trouble more often than girls in school? These
figures are a good start to examining that hypothesis; however, the figures in the table are only
descriptive. To examine the hypothesis, we need to employ a statistical test, the chi-square test.
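A minimal sketch, assuming the counts in the table above and using scipy's chi-square test for a contingency table:

```python
from scipy.stats import chi2_contingency

# Observed counts from the table above:
# rows = boys, girls; columns = got in trouble, did not get in trouble.
observed = [[46, 71],
            [37, 83]]

# Chi-square test of independence (without Yates' continuity correction,
# so the statistic matches the textbook formula sum((O - E)^2 / E)).
chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2, p, dof)   # for these counts: chi-square is about 1.87 with 1 df, p is about .17
```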
The T-Test
Introduction
How are outcomes different for different groups? It's a question of central concern to the social
sciences, for outcomes vary by a number of characteristics: where we live, how far we went in
school, what kind of job we have, and so on.
One goal of social science research is to accurately measure the social world, to document the
levels of different features of society.
So we are concerned with the measurement of phenomena and strive to specify the level of
difference in voting behavior, household income, or feelings of self-efficacy. However, we often
want to use these measurements in evaluating specific hypotheses about these differences.
In this module, you will learn to use a statistical tool to evaluate hypotheses about group-level
differences in outcomes: the t-test. Specifically, you will learn to use two different applications of
the t-test in evaluating two kinds of hypotheses:
The one-sample t-test, in which the level of outcome for a group is compared to a known
standard.
The two-sample t-test, where the outcome levels of two groups are compared to each other.
The t-test was developed by W. S. Gosset, a statistician employed at the Guinness brewery.
However, because the brewery did not allow employees to publish their research, Gosset's work
on the t-test appears under the name "Student" (and the t-test is sometimes referred to as
"Student's t-test"). Gosset was a chemist and was responsible for developing procedures for
ensuring the similarity of batches of Guinness. The t-test was developed as a way of measuring
how closely the yeast content of a particular batch of beer corresponded to the brewery's
standard.
But the t-test has applications well beyond the realm of quality beer. Applied to the social world,
the same kinds of questions addressed by the t-test in the brewery (how different is a particular
batch of beer from the desired standard?) can be useful in the social world. How different are the
SAT scores of political science undergraduates at a particular university from the average SAT
scores of the university's undergraduate population?
And the same statistical methodology that compares a particular batch of beer to a standard can
be used to compare how different any two batches are from each other. The test can be used to
compare the yeast content of two kegs of beer brewed at separate times. Extending this into the
realm of social phenomena, we can use this methodology to address questions such as whether
SAT preparation courses improve test scores or whether African Americans continue to face
discrimination in the housing market.
One of the advantages of the t-test is that it can be applied to a relatively small number of cases.
It was specifically designed to evaluate statistical differences for samples of 30 or fewer.
The T-Test
One-Sample T-Test
It is perhaps easiest to demonstrate the ideas and methods of the one-sample t-test by working
through an example. To reiterate, the one-sample t-test compares the mean score of a sample to a
known value, usually the population mean (the average for the outcome of some population of
interest). The basic idea of the test is a comparison of the average of the sample (observed
average) and the population (expected average), with an adjustment for the number of cases in
the sample and the standard deviation of the average. Working through an example can help to
highlight the issues involved and demonstrate how to conduct a t-test using actual data.
One of the best indicators of the health of a baby is his/her weight at birth. Birthweight is an
outcome that is sensitive to the conditions in which mothers experienced pregnancy, particularly
to issues of deprivation and poor diet, which are tied to lower birthweight. It is also an excellent
predictor of some difficulties that infants may experience in their first weeks of life. The
National Center for Health Statistics reports that although infants weighing 5 1/2 pounds (88
ounces) or less account for only 7% of births, they account for nearly 2/3 of infant deaths.
In the United States, mothers who live in poverty generally have babies with lower birthweight
than those who do not live in poverty. While the average birthweight for babies born in the U.S.
is approximately 3300 grams, the average birthweight for women living in poverty is 2800
grams.
Eliminating the linkage between poverty and low birthweight status has been a prominent
dimension of health policy for the past decade. Recently, a local hospital introduced an
innovative new prenatal care program to reduce the number of low birthweight babies born in the
hospital. In the first year, 25 mothers, all of whom live in poverty, participated in this program.
Data drawn from hospital records reveal that the babies born to these women had an average
birthweight of 3075 grams, with a standard deviation of 500 grams.
The question posed to you, the researcher, is whether this program has been effective at
improving the birthweights of babies born to poor women.
1. Establish Hypotheses
The first step to examining this question is to establish the specific hypotheses we wish to
examine. Recall from the unit on hypothesis testing that most social science research involves
the development (based on theory) of a null hypothesis and an alternative hypothesis - some test
statistic is then calculated to determine whether to reject the null hypothesis or not.
For this example, what is the null hypothesis? What is the alternative hypothesis?
In this case:
Null hypothesis is that the difference between the birthweights of babies born to mothers
who participated in the program and those born to other poor mothers is 0. Another way
of stating the null hypothesis is that the difference between the observed mean of
birthweight for program babies and the expected mean of birthweight for poor women is
zero.
Alternative hypothesis - the difference between the observed mean of birthweight for
program babies and the expected mean of birthweight for poor women is not zero.
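2. Calculate the Test Statistic

Using the standard one-sample formula - the difference between the observed sample mean and the expected population mean, divided by the estimated standard error of the mean - the values given above (observed mean 3075 grams, expected mean 2800 grams, s = 500, n = 25) give:

$$ t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{3075 - 2800}{500/\sqrt{25}} = \frac{275}{100} = 2.75 $$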
Having calculated the t-statistic, compare the t-value with a standard table of t-values to
determine whether the t-statistic reaches the threshold of statistical significance.
Plugging in the values of t (2.75) and n (number of cases = 25) yields a two-tailed p-value of about .011.
Generally speaking, we require p-values of .05 or less in order to reject the null hypothesis. With
a value of about .011, we can reject the null. Therefore, we conclude that the intervention
successfully improved birthweight.
Extension Exercise
Although the prenatal care program appears to have been successful in raising infants'
birthweights significantly above those of other infants born to mothers living in poverty, the question
remains whether the program eliminated the disadvantage in birthweight faced by infants born to
poorer women. The same source tells us that the birthweight of all babies born in the United States
in X was 3,339 grams.
Are the birthweights of the babies born to the participants of the prenatal care program
significantly different from the overall national average?
Summary
1. Establish the null and alternative hypotheses.
2. Calculate the t-statistic.
3. Having calculated the t-statistic, compare the t-value with a standard table of t-values to
determine whether the t-statistic reaches the threshold of statistical significance.
The T-Test
Two-Sample T-Test
We often want to know whether the means of two populations on some outcome differ. For
example, there are many questions in which we want to compare two categories of some
categorical variable (e.g., compare males and females) or two populations receiving different
treatments in the context of an experiment.
The two-sample t-test is a hypothesis test for answering questions about the mean where the data
are collected from two random samples of independent observations, each from an underlying
normal distribution.
The steps of conducting a two-sample t-test are quite similar to those of the one-sample test. And
for the sake of consistency, we will focus on another example dealing with birthweight and
prenatal care. In this example, rather than comparing the birthweight of a group of infants to some
national average, we will examine a program's effect by comparing the birthweights of babies
born to women who participated in an intervention with the birthweights of a group that did not.
A comparison of this sort is very common in medicine and social science. To evaluate the effects
of some intervention, program, or treatment, a group of subjects is divided into two groups. The
group receiving the treatment to be evaluated is referred to as the treatment group, while those
who do not are referred to as the control or comparison group. In this example, the mothers who are
part of the prenatal care program to reduce the likelihood of low birthweight form the treatment
group, with a control group comprised of women who do not take part in the program.
Returning to the two-sample t-test, the steps to conduct the test are similar to those of the one-
sample test.
Establish Hypotheses
The first step to examining this question is to establish the specific hypotheses we wish to
examine. Specifically, we want to establish a null hypothesis and an alternative hypothesis to be
evaluated with data.
In this case:
Null hypothesis is that the difference between the two groups is 0. Another way of stating
the null hypothesis is that the difference between the mean birthweight of the treatment
group (program babies) and the mean birthweight of the control group is zero.
Alternative hypothesis - the difference between the mean birthweight of the treatment group
and the mean birthweight of the control group is not zero.
From hospital records, we obtain the following values for these components:
                   Treatment   Control
Average Weight      3100 g      2750 g
SD                   420         425
n                     75          75
With these pieces of information, we calculate the following statistic, t:
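Using the standard two-sample formula - the difference between the group means divided by the estimated standard error of that difference - the values in the table above give:

$$ t = \frac{\bar{x}_T - \bar{x}_C}{\sqrt{s_T^2/n_T + s_C^2/n_C}} = \frac{3100 - 2750}{\sqrt{420^2/75 + 425^2/75}} = \frac{350}{69.0} \approx 5.07 $$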
Having calculated the t-statistic, compare the t-value with a standard table of t-values to
determine whether the t-statistic reaches the threshold of statistical significance.
With a t-score this high, the p-value is less than .001, which forms our basis to reject the null
hypothesis and conclude that the prenatal care program made a difference.
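The same comparison can be made directly from the summary statistics in Python; a minimal sketch using scipy and the treatment and control figures in the table above:

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the hospital records table above.
t_stat, p_value = ttest_ind_from_stats(
    mean1=3100, std1=420, nobs1=75,   # treatment group
    mean2=2750, std2=425, nobs2=75,   # control group
)
print(t_stat, p_value)  # t is roughly 5.07; p is well below .001
```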
ANOVA
About the ANOVA Test
Another e-lesson on the t-test demonstrated how to compare differences of means between two
groups, such as comparing outcomes between control and treatment groups in an experimental
study. The t-test is a useful tool for comparing the means of two groups; however, the t-test is not
good in situations calling for the comparison of three or more groups. It can only compare one
group's mean to a known distribution or compare the means of two groups. With three or more
groups, the t-test is not an effective statistical tool. On a practical level, using the t-test to
compare many means is a cumbersome process in terms of the calculations involved. On a
statistical level, using the t-test to compare multiple means can lead to biased results.
Yet there are many kinds of questions in which we might want to compare the means of several
different groups at once. For example, in evaluating the effects of a particular social program, we
might want to compare the mean outcomes of several different program sites. Or we might be
interested in examining the relative performance of different members of a corporate sales team
in terms of their monthly or annual sales records. Alternatively, in an organization with several
different sales managers, we might ask whether some sales managers get more out of their sales
staff than others.
With questions such as these, the preferred statistical tool is ANOVA (Analysis of Variance).
There are some similarities between the t-test and ANOVA. Like the t-test, ANOVA is used to
test hypotheses about differences in the average values of some outcome between two groups;
however, while the t-test can be used to compare two means or one mean against a known
distribution, ANOVA can be used to examine differences among the means of several different
groups at once. More generally, ANOVA is a statistical technique for assessing how nominal
independent variables influence a continuous dependent variable.
This module describes and explains the one-way ANOVA, a statistical tool that is used to
compare multiple groups of observations, all of which are independent but may have a different
mean for each group. A test of importance for many kinds of questions is whether or not all the
averages of a set of groups are equal. There is another form of ANOVA that examines how two
explanatory variables affect an outcome variable; however, this application is not discussed in
this module.
Assumptions
1. The standard deviations (SD) of the populations for all groups are equal - this is sometimes
referred to as an assumption of the homogeneity of variance. Again, we can represent this
assumption for groups 1 through n as
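
$$ \sigma_1 = \sigma_2 = \cdots = \sigma_n $$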
Karl Rousseau has recently been hired as the new chair of the statistics department at UTech. In taking
on his duties as the department head, he's interested in whether there's any variation in how well
students do in the course, based on whether they enroll in the morning, afternoon, or evening course. A
morning man himself, Prof. Rousseau has some doubt that there's much learning going on in the evening
course; however, given his position as chair, he is very interested in making sure that all three sections
are getting the same high-quality education. Moreover, he's too much of an empiricist to allow this idea
to go untested. So he proposes that at semester's end students in all three sections take the National
Assessment of Statistical Knowledge (NASK) to determine whether there are differences in student
performance.
He starts by generating a null hypothesis that all three groups will have the same mean score on the test.
In formula terms, if we use the symbol μ to represent the average score, the null hypothesis is expressed
through the following notation:
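
$$ H_0: \mu_1 = \mu_2 = \mu_3 $$

where $\mu_1$, $\mu_2$, and $\mu_3$ are the mean NASK scores of the morning, afternoon, and evening sections.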
Notice in the graph that all three groups have the same average score (all three points are on the
dashed line) and all three groups have the same SD (noted by the fact that the line around the
mean point for each group is the same size). So the null hypothesis is that all three groups will
have the same average score on the NASK.
The alternative hypothesis is that the means are not all the same. It's important to point out that this
is not the same as saying that all means are different (i.e., μ1 ≠ μ2 ≠ μ3). It is possible that some of the
means could be the same, yet if they are not all identical, we would reject the null hypothesis. Rather,
the alternative hypothesis is simply that not all means are equal.
[Table: mean NASK score and number of students for each of the three sections.]
Note: The standard deviation (SD) for each group is the same: 1.3.
With these data, we can calculate an ANOVA statistic to evaluate Prof. Rousseau's hypothesis.
This is done in multiple steps, as described below.
The first step is to calculate the variation between groups by comparing the mean of each group
(or, in this example, the mean of each of the three classes) with the mean of the overall sample
(the mean score on the test for all students in this sample). This measure of between-group
variance is referred to as "between sum of squares" or BSS. BSS is calculated by adding up, for
all groups, the squared difference between the group's mean and the overall mean, multiplied
by the number of cases in the group. In formula terms:
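
$$ BSS = \sum_{g=1}^{k} n_g\,(\bar{x}_g - \bar{x})^2 $$

where $k$ is the number of groups, $n_g$ is the number of cases in group $g$, $\bar{x}_g$ is the mean of group $g$, and $\bar{x}$ is the overall mean.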
This sum of squares has a number of degrees of freedom equal to the number of groups minus 1.
In this case, dfB = (3-1) = 2
We divide the BSS figure by the number of degrees of freedom to get our estimate of the
variation between groups, referred to as "Between Mean Squares" as:
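
$$ \text{Between Mean Squares} = \frac{BSS}{df_B} $$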
To measure the variation within groups, we find the sum of the squared deviation between scores
on the exam and the group average, calculating separate measures for each group, then summing
the group values. This is a sum referred to as the "within sum of squares" or WSS. In formula
terms, this is expressed as:
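
$$ WSS = \sum_{g=1}^{k} \sum_{i=1}^{n_g} (x_{gi} - \bar{x}_g)^2 $$

where $x_{gi}$ is the score of case $i$ in group $g$.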
As in step 1, we need to adjust the WSS to transform it into an estimate of population variance,
an adjustment that involves a value for the number of degrees of freedom within. To calculate
this, we take a value equal to the number of cases in the total sample (N), minus the number of
groups (k). In formula terms,
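
$$ df_W = N - k \qquad\text{and}\qquad \text{Within Mean Squares} = \frac{WSS}{df_W} $$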
This calculation is relatively straightforward. Simply divide the Between Mean Squares, the
value obtained in step 1, by the Within Mean Squares, the value calculated in step 2.
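
$$ F = \frac{\text{Between Mean Squares}}{\text{Within Mean Squares}} $$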
Then compare this value to a standard table with values for the F distribution to calculate the
significance level for the F value. In this case, the significance level is
less than .01. This is extremely strong evidence against the null hypothesis, indicating that
students' performance varies significantly across the three classes.
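The same calculation can be run in one step with scipy; a minimal sketch using hypothetical NASK scores for the three sections (these numbers are illustrative, not the data from the example above):

```python
from scipy.stats import f_oneway

# Hypothetical NASK scores for students in the three sections (illustrative only).
morning   = [7.1, 6.8, 7.4, 6.9, 7.2, 6.5, 7.0]
afternoon = [6.9, 7.0, 6.6, 7.3, 6.8, 7.1, 6.7]
evening   = [6.2, 5.9, 6.4, 6.0, 6.3, 5.8, 6.1]

# One-way ANOVA: the F statistic is Between Mean Squares / Within Mean Squares.
f_stat, p_value = f_oneway(morning, afternoon, evening)
print(f_stat, p_value)
```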
Recap
To calculate an ANOVA, it is often convenient to arrange the statistics needed for calculation
into a table such as the one below:
Source      Sum of Squares   Degrees of Freedom   Mean Squares
Between     BSS              k - 1                BSS / (k - 1)
Within      WSS              N - k                WSS / (N - k)
Total       BSS + WSS        N - 1
To fill in this table with the data from the problem above, we would enter the BSS and WSS values
computed in the steps above, along with their degrees of freedom and the resulting mean squares.
Multiple Regression
Multiple regression (or, more generally, "regression") is a tool that allows researchers to examine
the effect of many different factors on some outcome at the same time.
The general purpose of multiple regression is to learn more about the relationship between
several independent or predictor variables and a dependent variable. For some kinds of research
questions, regression can be used to examine how much a particular set of predictors explains
differences in some outcome. In other cases, regression is used to examine the effect of some
specific factor while accounting for other factors that influence the outcome.
In this latter use of regression analysis, the researcher uses algebraic methods to "hold constant"
a group of factors involved in some social phenomenon except one, in order to see how much of
the net result that one factor accounts for. In the example of the college program for high school
students, we could use regression to examine the effects of the program while accounting for
differences in grades, aspirations, income and parental education level that might also influence
college attendance. By mathematically holding constant all factors but one at a time, the
researcher can measure the part a particular factor played in some outcome.
Multiple Regression
One quick way to gain an initial understanding of the relationship between education and income
(or any two variables) is to plot them. Plotting these two variables produces the following graph:
The graph shows enough evidence of a linear pattern between the two variables: higher values of
X are associated with higher values of Y and vice-versa. From this graph, it seems that a linear
model does a good job of describing the relationship between annual income and number of
years of school. (There are more sophisticated methods of evaluating linearity, but examining a
graph such as this one is often sufficient).
We can express the relationship between income and education through the following formula:
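
$$ \text{income} = b_0 + b_1 \cdot \text{education} + e \qquad [1] $$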
That is, equation 1 says that income can be expressed as a linear function of the number of years
of education. Moreover, income can be expressed as a function of some multiple of education
level (b1) added to some value (b0). Regression is a tool that allows us to take income and
education data from some sample and use these data to estimate b0 and b1. These values are then
used to create predicted values of the outcome, with the observed or "true" value from the data
designated as "y" and the predicted value as .
A Note on Linearity
It is important to make certain that a linear relationship exists between the factors before running
a regression model. Regression is not an appropriate method of analysis for non-linear
relationships, such as that shown in the graph below comparing female life expectancy with the
number of doctors per million persons in the population.
In "ordinary least squares" (OLS) regression analysis, [1] is selected to minimize the sum of the
squared distances of the errors (e), where errors are defined as the difference between the
observed value and the predicted one.
where are the values of y predicted from equation [1]. It is this formula that is responsible for the name
given to regression of Ordinary Least Squares or OLS regression. The equation that generates the least
value for the sum of squared terms in equation [2] is the regression line.
In equation [1], the value b1 measures the effect of a one-unit increase in X on the value
of Y. b1 is also referred to as the regression coefficient for X, and is the average amount the
dependent variable increases when the independent variable increases by one unit and the other
independent variables are held constant. So when the independent measure increases by 1, how much
does the dependent variable increase? It increases by b1 units.
Example
Let's return to the questions laid out earlier about the relationship between education and income
in the town of Springfield. Again, we have income and education data for all 8,175 of the town's
residents. Using these data, we want to use ordinary least squares (OLS) regression to estimate a
regression equation of the form given in equation [1] above.
We interpret the estimated value of b1 as indicating that each additional year of schooling is
associated with an additional $4,000 of income.
This equation can also be used to generate a predicted value given a specified level of education.
Say we want to know the predicted income of a Springfield resident who has twelve years of
education. We can use the regression equation to do so.
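A minimal sketch in Python of how such an equation could be estimated, using statsmodels and a small set of hypothetical education and income values (not the actual Springfield data, so the estimated coefficients will differ from those discussed above):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: years of schooling and annual income in dollars.
education = np.array([8, 10, 12, 12, 14, 16, 16, 18])
income = np.array([22000, 30000, 38000, 41000, 47000, 55000, 58000, 66000])

# Fit income = b0 + b1 * education by ordinary least squares.
X = sm.add_constant(education)        # adds the intercept term b0
model = sm.OLS(income, X).fit()

b0, b1 = model.params                 # estimated intercept and slope
print(b0, b1, model.rsquared)         # coefficients and share of variance explained

# Predicted income for a resident with twelve years of education.
print(b0 + b1 * 12)
```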
Multiple Regression
Hypothesis Testing
Now, assume that our analysis shows that both education level and number of years of work have a
significant effect on yearly income. Another question we might want to ask is how good a job
these two factors do in accounting for differences in income. Although both are significantly
related, they may account for a substantial amount of the variance or only a small fraction of it. These two
factors may account for most of the differences in Springfield residents' incomes or only a bit.
The fact that both factors are significantly related to the outcome does not necessarily imply that
they explain a substantial portion of the variance in the outcome. Or, in more formal statistical
terms, it doesn't mean that our model fits the data well.
This may appear surprising, but it is a logical result of the way the coefficients are calculated. To
see how this is so, recall first that it is possible to decompose the total variance in Y (the total sum
of squares, SST) into an explained and an unexplained part. The unexplained part, the numerator in
[6], represents the amount of variance in Y left unexplained by the regression line. It is usually
called the Sum of Squared Errors (SSE). One can easily picture a situation where the observations
are so dispersed on the scatter plot that, although high values of X are associated with high values
of Y, the fitted line represents a poor model. In this case we would expect SSE to be rather high.
Intuitively, the amount of explained variance (SSR) is given by:

SSR = SST − SSE

Using these two quantities it is finally possible to construct a statistic that measures the goodness
of fit of the model:

R2 = SSR / SST

R2 has a value between 0 and 1. High values of R2 indicate that the model fits the data well: the
factors in the model explain a large share of the total variance, so SSR (and hence R2) is high,
while models that explain less will have a lower R2.
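A minimal sketch of that computation; the observed and predicted arrays below are hypothetical, standing in for whatever values an actual regression fit would supply.

import numpy as np

# Hypothetical observed values and predicted values from some fitted model
y = np.array([22000, 30000, 41000, 39000, 47000, 55000, 58000, 65000], dtype=float)
y_hat = np.array([24000, 31000, 38000, 38000, 45000, 52000, 52000, 59000], dtype=float)

sst = np.sum((y - y.mean()) ** 2)   # total sum of squares
sse = np.sum((y - y_hat) ** 2)      # unexplained (error) sum of squares
ssr = sst - sse                     # explained sum of squares
r_squared = ssr / sst

print(f"R^2 = {r_squared:.3f}")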
Multiple Regression
Assumptions
For regression analysis to work "correctly" (that is, to give unbiased and reliable results), certain
conditions need to be met. In an era when computer packages can efficiently perform a large
number of operations, the cost of running regression models is extremely low. However, the
violation of any of these implied conditions could have potentially devastating effects on your
research, as will become clear further on in this section.
The best way to check whether your regression model satisfies each of these assumptions is to plot
the error terms graphically. In particular, plot the error terms (or their standardized version)
against each predictor and also against the fitted values. If no assumptions are violated, the errors
should be randomly distributed around a mean of 0. Formal statistical tests have been developed to
check each of the previous assumptions. Although the specific ways to correct for these violations
are beyond the scope of the present discussion, learning how to detect them is the first and
arguably the most important step in employing regression analysis effectively.
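A sketch of such a diagnostic plot (matplotlib assumed available; the fitted values and residuals below are hypothetical and would come from whatever regression fit is being checked):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical fitted values and residuals from some regression fit
fitted = np.array([24000, 31000, 38000, 38000, 45000, 52000, 52000, 59000], dtype=float)
residuals = np.array([-2000, -1000, 3000, 1000, 2000, 3000, 6000, 6000], dtype=float)

# Residuals plotted against fitted values; a random scatter around 0 is the hoped-for pattern
plt.scatter(fitted, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.title("Residuals vs. fitted values")
plt.show()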
Frequency table & Chi-square test

Command: Statistics > Categorical data > Frequency table & Chi-square test

Description

The Frequency table & Chi-square test procedure can be used for the following:

To test the hypothesis that for one classification table (e.g. gender), all classification levels have the same frequency.
To test the relationship between two classification factors (e.g. gender and profession).

Required input

In the Frequency table & Chi-square test dialog box, one or two discrete variables with the
classification data must be identified. Classification data may either be numeric or alphanumeric
(string) values. If required, you can convert a continuous variable into a discrete variable using the
IF function (see elsewhere).
Results
After you have completed the dialog box, click the OK button to obtain the
frequency table with the relevant statistics.
Chi-square test
When you want to test the hypothesis that for one single classification table (e.g. gender) all
classification levels have the same frequency, identify only one discrete variable in the dialog
form. In this case the null hypothesis is that all classification levels have the same frequency. If
the calculated P-value is low (P<0.05), the null hypothesis is rejected and the alternative
hypothesis, that there is a significant difference between the frequencies of the different
classification levels, is accepted.
In a single classification table the mode of the observations is the most common
observation or category (the observation with the highest frequency). A unimodal
distribution has one mode; a bimodal distribution, two modes.
When you want to study the relationship between two classification factors (e.g.
gender and profession), then identify the two discrete variables in the dialog form. In
this case the null hypothesis is that the two factors are independent. If the calculated
P-value is low (P<0.05), then the null hypothesis is rejected and you accept the
alternative hypothesis that there is a relation between the two factors.
Note that when the degrees of freedom is equal to 1, e.g. in case of a 2x2 table,
MedCalc uses Yates' correction for continuity.
If the table has two columns and three or more rows (or two rows and three or more
columns), and the categories can be quantified, MedCalc will also perform a Chi-
square test for trend. This calculation tests whether there is a linear trend between
row (or column) number and the fraction of subjects in the left column (or top row).
The chi-square test for trend provides a more powerful test than the unordered
independence test above.
If there is no meaningful order in the row (or column) categories, then you should
ignore this calculation.
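MedCalc runs these tests from its dialog box; purely for illustration, the sketch below runs the same two chi-square tests in Python with scipy (assumed available), using made-up counts.

from scipy.stats import chisquare, chi2_contingency

# One classification factor: do all levels occur with the same frequency?
# Hypothetical gender counts
observed_counts = [48, 52]
stat, p = chisquare(observed_counts)           # null: equal frequencies
print(f"goodness of fit: chi2={stat:.3f}, P={p:.3f}")

# Two classification factors: are gender and profession independent?
# Hypothetical 2x3 contingency table (rows: gender, columns: profession)
table = [[20, 30, 50],
         [25, 25, 50]]
stat, p, dof, expected = chi2_contingency(table)  # Yates' correction applies by default only to 2x2 tables
print(f"independence: chi2={stat:.3f}, P={p:.3f}, df={dof}")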
Literature
Altman DG (1991) Practical statistics for medical research. London: Chapman and
Hall.
See also
Fisher's exact test
McNemar test
Coefficient of variation
From Wikipedia, the free encyclopedia
In probability theory and statistics, the coefficient of variation (CV) is a normalized measure of
dispersion of a probability distribution. It is defined as the ratio of the standard deviation σ to the
mean μ:

CV = σ / μ

This is only defined for a non-zero mean, and is most useful for variables that are always positive.
It is also known as unitized risk or the variation coefficient, and is often expressed as a percentage.
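A minimal sketch of the computation on a hypothetical positive-valued sample:

import numpy as np

data = np.array([12.0, 15.0, 9.0, 11.0, 14.0])  # hypothetical measurements

mean = data.mean()
std = data.std(ddof=0)        # population standard deviation
cv = std / mean               # coefficient of variation
print(f"CV = {cv:.3f} ({cv * 100:.1f}%)")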
The coefficient of variation should only be computed for data measured on a ratio scale. As an
example, if a group of temperatures are analyzed, the standard deviation does not depend on
whether the Kelvin or Celsius scale is used since an object that changes its temperature by 1 K
also changes its temperature by 1 C. However the mean temperature of the data set would differ
in each scale by an amount of 273 and thus the coefficient of variation would differ. So the
coefficient of variation does not have any meaning for data on an interval scale.
Standardized moments are similar ratios, μk/σk, which are also dimensionless and scale invariant. The
variance-to-mean ratio, σ2 / μ, is another similar ratio, but it is not dimensionless, and hence not
scale invariant.
In signal processing, particularly image processing, the reciprocal ratio μ / σ is referred to as the
signal to noise ratio.
Advantages
The coefficient of variation is useful because the standard deviation of data must always be
understood in the context of the mean of the data. The coefficient of variation is a dimensionless
number. So when comparing between data sets with different units or widely different means,
one should use the coefficient of variation for comparison instead of the standard deviation.
Disadvantages
When the mean value is close to zero, the coefficient of variation is sensitive to small changes in
the mean, limiting its usefulness.
Unlike the standard deviation, it cannot be used to construct confidence intervals for the mean.
Applications
The coefficient of variation is also common in applied probability fields such as renewal theory,
queueing theory, and reliability theory. In these fields, the exponential distribution is often more
important than the normal distribution. The standard deviation of an exponential distribution is
equal to its mean, so its coefficient of variation is equal to 1. Distributions with CV < 1 (such as
an Erlang distribution) are considered low-variance, while those with CV > 1 (such as a hyper-
exponential distribution) are considered high-variance. Some formulas in these fields are
expressed using the squared coefficient of variation, often abbreviated SCV. In modeling, a
variation of the CV is the CV(RMSD). Essentially the CV(RMSD) replaces the standard
deviation term with the Root Mean Square Deviation (RMSD).
Distribution
Under weak conditions on the sample distribution, the probability distribution of the coefficient
of variation is known. In fact, it has been determined by Hendricks and Robey. This is useful,
for instance, in the construction of hypothesis tests or confidence intervals.
Coefficient Of Variation - CV
What Does Coefficient Of Variation - CV Mean?
A statistical measure of the dispersion of data points in a data series around the mean. It is
calculated as follows:

coefficient of variation = standard deviation / expected return
The coefficient of variation represents the ratio of the standard deviation to the mean, and it is a
useful statistic for comparing the degree of variation from one data series to another, even if the
means are drastically different from each other.
Note that if the expected return in the denominator of the calculation is negative or zero, the ratio
will not make sense.
Standard deviation
(Figures: a plot of a normal distribution, or bell curve, in which each colored band has a width of
one standard deviation; the cumulative probability of a normal distribution with expected value 0
and standard deviation 1; a data set with a mean of 50 and a standard deviation of 20; and two
sample populations with the same mean of 100, one with standard deviation 10 and the other with
standard deviation 50.)
In probability theory and statistics, the standard deviation of a statistical population, a data set,
or a probability distribution is the square root of its variance. Standard deviation is a widely used
measure of variability or dispersion, being algebraically more tractable though practically less
robust than the average absolute deviation.
It shows how much variation there is from the "average" (mean, or expected/budgeted value). A
low standard deviation indicates that the data points tend to be very close to the mean, whereas
high standard deviation indicates that the data is spread out over a large range of values.
For example, the average height for adult men in the United States is about 70 inches (178 cm),
with a standard deviation of around 3 in (8 cm). This means that most men (about 68 percent,
assuming a normal distribution) have a height within 3 in (8 cm) of the mean (67–73 in/170–185
cm), one standard deviation. Almost all men (about 95%) have a height within 6 in (15
cm) of the mean (64–76 in/163–193 cm), two standard deviations. If the standard deviation were
zero, then all men would be exactly 70 in (178 cm) high. If the standard deviation were 20 in (51
cm), then men would have much more variable heights, with a typical range of about 50–90 in
(127–229 cm). Three standard deviations account for 99.7% of the sample population being
studied, assuming the distribution is normal (bell-shaped).
The term standard deviation was first used in writing by Karl Pearson in 1894, following his use
of it in lectures. This was as a replacement for earlier alternative names for the same idea: for
example, Gauss used mean error. A useful property of standard deviation is that, unlike variance,
it is expressed in the same units as the data. Note, however, that for measurements with
percentage as unit, the standard deviation will have percentage points as unit.
When only a sample of data from a population is available, the population standard deviation can
be estimated by a modified quantity called the sample standard deviation, explained below.
Basic examples
To calculate the population standard deviation, first compute the difference of each data point
from the mean, and square the result of each:
Next compute the average of these values, and take the square root:
This quantity is the population standard deviation; it is equal to the square root of the variance.
The formula is valid only if the eight values we began with form the complete population. If they
instead were a random sample, drawn from some larger, “parent” population, then we should
have used 7 instead of 8 in the denominator of the last formula, and then the quantity thus
obtained would have been called the sample standard deviation. See the section Estimation
below for more details.
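The worked numbers themselves are not reproduced above, so the sketch below uses a hypothetical set of eight values to show the population calculation (divide by 8) next to the sample calculation (divide by 7):

import numpy as np

values = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)  # hypothetical eight-value population

deviations_squared = (values - values.mean()) ** 2

population_sd = np.sqrt(deviations_squared.sum() / len(values))        # divide by N = 8
sample_sd = np.sqrt(deviations_squared.sum() / (len(values) - 1))      # divide by N - 1 = 7

print(population_sd)  # same as values.std(ddof=0)
print(sample_sd)      # same as values.std(ddof=1)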
Let X be a random variable with mean value μ. Here the operator E denotes the average or
expected value of X, so μ = E[X]. Then the standard deviation of X is the quantity

σ = √( E[(X − μ)2] )

That is, the standard deviation σ (sigma) is the square root of the average value of (X − μ)2.
The standard deviation of a (univariate) probability distribution is the same as that of a random
variable having that distribution. Not all random variables have a standard deviation, since these
expected values need not exist. For example, the standard deviation of a random variable which
follows a Cauchy distribution is undefined because its expected value μ is undefined.
In the case where X takes random values from a finite data set x1, x2, …, xN, with each value
having the same probability, the standard deviation is

σ = √( (1/N) Σ (xi − μ)2 ),  where μ = (x1 + x2 + … + xN)/N.
The standard deviation of a continuous real-valued random variable X with probability density
function p(x) is

σ = √( ∫ (x − μ)2 p(x) dx ),  where  μ = ∫ x p(x) dx,

and where the integrals are definite integrals taken for x ranging over the sample space of X.
In the case of a parametric family of distributions, the standard deviation can be expressed in
terms of the parameters. For example, in the case of the log-normal distribution with parameters
μ and σ2, the standard deviation is [(exp(σ2) − 1)exp(2μ + σ2)]1/2.
Estimation
One can find the standard deviation of an entire population in cases (such as standardized
testing) where every member of a population is sampled. In cases where that cannot be done, the
standard deviation σ is estimated by examining a random sample taken from the population.
Some estimators are given below. The uncorrected (maximum-likelihood) estimator divides by N:

σ̂ = √( (1/N) Σ (xi − x̄)2 )

This estimator has a uniformly smaller mean squared error than the sample standard deviation
(see below), and is the maximum-likelihood estimate when the population is normally
distributed. But this estimator, when applied to a small or moderately sized sample, tends to be
too low: it is a biased estimator.
The standard deviation of the sample is the same as the population standard deviation of a
discrete random variable that can assume precisely the values from the data set, where the
probability for each value is proportional to its multiplicity in the data set.
The most common estimator for σ is an adjusted version, the sample standard deviation,
denoted by s and defined as follows:

s = √( (1/(N − 1)) Σ (xi − x̄)2 )

where x1, x2, …, xN are the observed values of the sample items and x̄ is the mean value of these
observations.
This correction (the use of N − 1 instead of N) is known as Bessel's correction. The reason for
this correction is that s2 is an unbiased estimator for the variance σ2 of the underlying population,
if that variance exists and the sample values are drawn independently with replacement.
However, s is not an unbiased estimator for the standard deviation σ; it tends to underestimate the
population standard deviation.
The term standard deviation of the sample is used for the uncorrected estimator (using N) while
the term sample standard deviation is used for the corrected estimator (using N − 1). The
denominator N − 1 is the number of degrees of freedom in the vector of residuals,
(x1 − x̄, …, xN − x̄).
Other estimators
Although an unbiased estimator for σ is known when the random variable is normally
distributed, the formula is complicated and amounts to a minor correction. Moreover,
unbiasedness (in this sense of the word) is not always desirable.[citation needed]
The standard deviation is invariant to changes in location, and scales directly with the scale of
the random variable. Thus, for a constant c and random variables X and Y:

σ(c) = 0,  σ(X + c) = σ(X),  σ(cX) = |c| σ(X).

The standard deviation of the sum of two random variables can be related to their individual
standard deviations and the covariance between them:

σ(X + Y) = √( var(X) + var(Y) + 2 cov(X, Y) ).
The calculation of the sum of squared deviations can be related to moments calculated directly
from the data. In general, we have

σ(X) = √( E[X2] − (E[X])2 )

Thus, the standard deviation is equal to the square root of (the average of the squares less the
square of the average). See the computational formula for the variance for a proof of this fact, and
for an analogous result for the sample standard deviation.
A large standard deviation indicates that the data points are far from the mean and a small
standard deviation indicates that they are clustered closely around the mean.
For example, each of the three populations {0, 0, 14, 14}, {0, 6, 8, 14} and {6, 6, 8, 8} has a
mean of 7. Their standard deviations are 7, 5, and 1, respectively. The third population has a
much smaller standard deviation than the other two because its values are all close to 7. In a
loose sense, the standard deviation tells us how far from the mean the data points tend to be. It
will have the same units as the data points themselves. If, for instance, the data set {0, 6, 8, 14}
represents the ages of a population of four siblings in years, the standard deviation is 5 years.
As another example, the population {1000, 1006, 1008, 1014} may represent the distances
traveled by four athletes, measured in meters. It has a mean of 1007 meters, and a standard
deviation of 5 meters.
Standard deviation may serve as a measure of uncertainty. In physical science, for example, the
reported standard deviation of a group of repeated measurements should give the precision of
those measurements. When deciding whether measurements agree with a theoretical prediction
the standard deviation of those measurements is of crucial importance: if the mean of the
measurements is too far away from the prediction (with the distance measured in standard
deviations), then the theory being tested probably needs to be revised. This makes sense, since the
measurements fall outside the range of values that could reasonably be expected to occur if the
prediction were correct and the standard deviation appropriately quantified. See prediction interval.
Application examples
The practical value of understanding the standard deviation of a set of values is in appreciating
how much variation there is from the "average" (mean).
Climate
As a simple example, consider the average daily maximum temperatures for two cities, one
inland and one on the coast. It is helpful to understand that the range of daily maximum
temperatures for cities near the coast is smaller than for cities inland. Thus, while these two cities
may each have the same average maximum temperature, the standard deviation of the daily
maximum temperature for the coastal city will be less than that of the inland city as, on any
particular day, the actual maximum temperature is more likely to be farther from the average
maximum temperature for the inland city than for the coastal one.
Sports
Another way of seeing it is to consider sports teams. In any set of categories, there will be teams
that rate highly at some things and poorly at others. Chances are, the teams that lead in the
standings will not show such disparity, but will perform well in most categories. The lower the
standard deviation of their ratings in each category, the more balanced and consistent they will
tend to be. Whereas, teams with a higher standard deviation will be more unpredictable. For
example, a team that is consistently bad in most categories will have a low standard deviation. A
team that is consistently good in most categories will also have a low standard deviation.
However, a team with a high standard deviation might be the type of team that scores a lot
(strong offense) but also concedes a lot (weak defense), or, vice versa, that might have a poor
offense but compensates by being difficult to score on.
Trying to predict which teams will win on any given day may include looking at the standard
deviations of the various team "stats" ratings, in which anomalies can match strengths against
weaknesses, to attempt to understand which factors may prevail as stronger indicators of eventual
scoring outcomes.
In racing, a driver is timed on successive laps. A driver with a low standard deviation of lap times
is more consistent than a driver with a higher standard deviation. This information can be used to
help understand where opportunities might be found to reduce lap times.
Finance
In finance, standard deviation is a representation of the risk associated with a given security
(stocks, bonds, property, etc.), or the risk of a portfolio of securities (actively managed mutual
funds, index mutual funds, or ETFs). Risk is an important factor in determining how to
efficiently manage a portfolio of investments because it determines the variation in returns on the
asset and/or portfolio and gives investors a mathematical basis for investment decisions (known
as mean-variance optimization). The overall concept of risk is that as it increases, the expected
return on the asset will increase as a result of the risk premium earned – in other words, investors
should expect a higher return on an investment when said investment carries a higher level of
risk, or uncertainty of that return. When evaluating investments, investors should estimate both
the expected return and the uncertainty of future returns. Standard deviation provides a
quantified estimate of the uncertainty of future returns.
For example, let's assume an investor had to choose between two stocks. Stock A over the last 20
years had an average return of 10%, with a standard deviation of 20 percentage points (pp) and
Stock B, over the same period, had average returns of 12%, but a higher standard deviation of 30
pp. On the basis of risk and return, an investor may decide that Stock A is the safer choice,
because Stock B's additional two percentage points of return are not worth the additional 10 pp of
standard deviation (greater risk or uncertainty of the expected return). Stock B is likely to fall short
of the initial investment (but also to exceed the initial investment) more often than Stock A under
the same circumstances, and is estimated to return only 2% more on average. In this example, Stock
A is expected to earn about 10%, plus or minus 20 pp (a range of 30% to −10%), in about two-thirds
of future years. When considering more extreme possible returns or outcomes, an investor should
expect results of up to 10% plus or minus 60 pp, or a range from 70% to −50%, which covers three
standard deviations from the average return (about 99.7% of probable returns).
Calculating the average return (or arithmetic mean) of a security over a given period will
generate an expected return on the asset. For each period, subtracting the expected return from
the actual return gives the deviation from the expected return. Squaring the deviation in each
period shows that period's contribution to the overall risk of the asset: the larger the squared
deviation in a period, the greater the risk the security carries. Taking the average of the squared
deviations gives the variance, an overall measure of the risk associated with the asset. Taking the
square root of this variance gives the standard deviation of the investment tool in question.
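A sketch of that procedure on hypothetical yearly returns:

import numpy as np

returns = np.array([0.08, 0.12, -0.05, 0.20, 0.10])   # hypothetical yearly returns

expected_return = returns.mean()                       # average (expected) return
deviations = returns - expected_return                 # deviation in each period
variance = np.mean(deviations ** 2)                    # average of squared deviations
risk = np.sqrt(variance)                               # standard deviation as a risk measure

print(f"expected return = {expected_return:.3f}, standard deviation = {risk:.3f}")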
Population standard deviation is used to set the width of Bollinger Bands, a widely adopted
technical analysis tool. For example, the upper Bollinger Band is given as: x + nσx The most
commonly used value for n is 2; there is about 5% chance of going outside, assuming the normal
distribution is right.
Geometric interpretation
To gain some geometric insights, we will start with a population of three values, x1, x2, x3. This
defines a point P = (x1, x2, x3) in R3. Consider the line L = {(r, r, r) : r in R}. This is the "main
diagonal" going through the origin. If our three given values were all equal, then the standard
deviation would be zero and P would lie on L. So it is not unreasonable to assume that the
standard deviation is related to the distance of P to L. And that is indeed the case. To move
orthogonally from L to the point P, one begins at the point

M = (x̄, x̄, x̄)

whose coordinates are the mean of the values we started out with. A little algebra shows that the
distance between P and M (which is the same as the orthogonal distance between P and the line
L) is equal to the standard deviation of the vector x1, x2, x3, multiplied by the square root of the
number of dimensions of the vector (3 in this case).
Chebyshev's inequality
An observation is rarely more than a few standard deviations away from the mean. Chebyshev's
inequality ensures that, for all distributions for which the standard deviation is defined, the
proportion of the data within k standard deviations of the mean is at least 1 − 1/k2.[4] The
following table gives some example values of the minimum proportion of the population within a
given number of standard deviations:

Minimum population    Distance from mean
50%                   √2 standard deviations
75%                   2 standard deviations
89%                   3 standard deviations
94%                   4 standard deviations
96%                   5 standard deviations
97%                   6 standard deviations
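A quick sketch that reproduces these bounds from the 1 − 1/k2 formula:

import math

def chebyshev_lower_bound(k: float) -> float:
    """Minimum fraction of any distribution within k standard deviations of the mean."""
    return 1.0 - 1.0 / (k ** 2)

for k in (math.sqrt(2), 2, 3, 4, 5, 6):
    print(f"k = {k:.2f}: at least {chebyshev_lower_bound(k):.1%} of the data")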
Dark blue is less than one standard deviation from the mean. For the normal distribution, this accounts
for 68.27 % of the set; while two standard deviations from the mean (medium and dark blue) account for
95.45%; three standard deviations (light, medium, and dark blue) account for 99.73%; and four standard
deviations account for 99.994%. The two points of the curve which are one standard deviation from the
mean are also the inflection points.
The central limit theorem says that the distribution of a sum of many independent, identically
distributed random variables tends towards the famous bell-shaped normal distribution with a
probability density function of

f(x) = (1 / (σ √(2π))) exp( −(x − μ)2 / (2σ2) )

where μ is the arithmetic mean of the sample. The standard deviation therefore is simply a
scaling variable that adjusts how broad the curve will be, though it also appears in the normalizing
constant to keep the distribution normalized for different widths.
If a data distribution is approximately normal, then the proportion of data values within z standard
deviations of the mean is given by erf(z/√2), where erf is the error function. If a data distribution is
approximately normal then about 68% of the data values are within one standard deviation of the
mean (mathematically, μ ± σ, where μ is the arithmetic mean), about 95% are within two
standard deviations (μ ± 2σ), and about 99.7% lie within three standard deviations (μ ± 3σ). This is
known as the 68-95-99.7 rule, or the empirical rule.
For various values of z, the percentage of values expected to lie inside and outside the symmetric
confidence interval CI = (−zσ, zσ) is as follows; for example:

zσ         Percentage within CI    Percentage outside CI    Ratio outside
1.960σ     95%                     5%                       1 / 20
The mean and the standard deviation of a set of data are usually reported together. In a certain
sense, the standard deviation is a "natural" measure of statistical dispersion if the center of the
data is measured about the mean. This is because the standard deviation from the mean is smaller
than from any other point. The precise statement is the following: suppose x1, ..., xn are real
numbers and define the function

σ(r) = √( (1/n) Σ (xi − r)2 )

Using calculus, or by completing the square, it is possible to show that σ(r) has a unique
minimum at the mean, r = (x1 + … + xn)/n.
The coefficient of variation of a sample is the ratio of the standard deviation to the mean. It is a
dimensionless number that can be used to compare the amount of variation between populations
with means that are close together. The reason is that if you compare populations with the same
standard deviations but different means, the coefficient of variation will be larger for the
population with the smaller mean. Thus, in comparing the variability of data sets, the coefficient of
variation should be used with care and is sometimes better replaced with another method.
Often we want some information about the precision of the mean we obtained. We can obtain this
by determining the standard deviation of the sample mean. The standard deviation of the mean is
related to the standard deviation σ of the distribution by

σmean = σ / √N

where N is the number of observations in the sample used to estimate the mean. This follows
because the variance of a sum of N independent observations is Nσ2, so the variance of their mean
is σ2/N, and taking the square root gives the result.
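A quick numerical check of this relationship on simulated data (numpy assumed available):

import numpy as np

rng = np.random.default_rng(0)
sigma, n, trials = 10.0, 25, 20000

# Draw many samples of size n and look at the spread of their means
sample_means = rng.normal(loc=0.0, scale=sigma, size=(trials, n)).mean(axis=1)

print(sample_means.std(ddof=0))   # empirical standard deviation of the mean
print(sigma / np.sqrt(n))         # theoretical value sigma / sqrt(N) = 2.0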
Worked example
The standard deviation of a discrete random variable is the root-mean-square (RMS) deviation of
its values from the mean.
If the random variable X takes on N values (which are real numbers) with equal probability, then
its standard deviation σ can be calculated as follows:
1. Find the mean, μ, of the values.
2. For each value xi, find its deviation from the mean, xi − μ.
3. Square each of the deviations, which amplifies large deviations and makes negative values
positive.
4. Find the mean of the squared deviations. This quantity is the variance σ2.
5. Take the positive square root of the variance (converting squared units back to regular units).
This is the standard deviation σ.
If not all values have equal probability, but the probability of value xi equals pi, the standard
deviation can be computed by

σ = √( Σ pi (xi − μ)2 ),  where  μ = Σ pi xi.

Suppose we wished to find the standard deviation of the distribution placing probabilities 1⁄4, 1⁄2,
and 1⁄4 on the points 3, 7, and 19 in the sample space.
Step 1: the mean is μ = (1⁄4)(3) + (1⁄2)(7) + (1⁄4)(19) = 9.
Step 2: the deviations from the mean are 3 − 9 = −6, 7 − 9 = −2, and 19 − 9 = 10.
Step 3: square each of the deviations, which amplifies large deviations and makes negative
values positive: 36, 4, and 100.
Step 4: weight each squared deviation by its probability and sum: (1⁄4)(36) + (1⁄2)(4) + (1⁄4)(100) = 36.
This is the variance.
Step 5: take the positive square root of the quotient (converting squared units back to regular
units): √36 = 6.
So, the standard deviation of the set is 6. This example also shows that, in general, the standard
deviation is different from the mean absolute deviation (which is 5 in this example).
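A one-line check of this worked example:

import numpy as np

points = np.array([3.0, 7.0, 19.0])
probs = np.array([0.25, 0.5, 0.25])

mu = np.sum(probs * points)                       # weighted mean = 9
variance = np.sum(probs * (points - mu) ** 2)     # weighted squared deviations sum to 36
print(np.sqrt(variance))                          # standard deviation = 6.0

mad = np.sum(probs * np.abs(points - mu))         # mean absolute deviation = 5.0
print(mad)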
The following two formulas can represent a running (continuous) standard deviation. A set of
three power sums s0, s1, s2 are each computed over a set of N values of x, denoted as xk:

s0 = Σ xk0 = N,  s1 = Σ xk,  s2 = Σ xk2.

Note that s0 raises x to the zero power, and since x0 is always 1, s0 evaluates to N.
Given the results of these three running summations, the values s0, s1, s2 can be used at any time
to compute the current value of the running standard deviation:

σ = √( s0 s2 − s12 ) / s0.

This definition for sj can represent the two different phases (summation computation sj, and σ
calculation).
In a computer implementation, as the three sj sums become large, we need to consider round-off
error, arithmetic overflow, and arithmetic underflow. The method below calculates the running
sums with reduced rounding errors:

A1 = x1,  Ak = Ak−1 + (xk − Ak−1)/k
Q1 = 0,   Qk = Qk−1 + (xk − Ak−1)(xk − Ak)

where A is the mean value.
Sample variance: sn2 = Qn / (n − 1)
Standard variance: σn2 = Qn / n
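A sketch of both approaches, the raw power sums and the reduced-rounding-error incremental updates (the data list is hypothetical):

import math

def running_sd_power_sums(xs):
    # Accumulate the three power sums s0, s1, s2 and derive sigma from them.
    s0 = s1 = s2 = 0.0
    for x in xs:
        s0 += 1.0
        s1 += x
        s2 += x * x
    return math.sqrt(s0 * s2 - s1 * s1) / s0

def running_sd_incremental(xs):
    # Incremental mean/variance updates with reduced rounding error.
    a = 0.0   # running mean
    q = 0.0   # running sum of squared deviations
    for k, x in enumerate(xs, start=1):
        a_prev = a
        a = a_prev + (x - a_prev) / k
        q = q + (x - a_prev) * (x - a)
    return math.sqrt(q / len(xs))   # population ("standard") form; use len(xs) - 1 for the sample form

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(running_sd_power_sums(data), running_sd_incremental(data))  # both 2.0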
Weighted calculation
When the values xi are weighted with unequal weights wi, the power sums s0, s1, s2 are each
computed as

s0 = Σ wi,  s1 = Σ wi xi,  s2 = Σ wi xi2,

and the standard deviation equations remain unchanged. Note that s0 is now the sum of the
weights and not the number of samples N.
The incremental method with reduced rounding errors can also be applied, with some additional
complexity, where n is the total number of elements and n' is the number of elements with
non-zero weights. The above formulas become equal to the simpler formulas given above if the
weights are all taken as equal to 1.
Combining standard deviations
Population-based statistics
The populations of sets, which may overlap, can be counted simply as follows:

N(X ∪ Y) = N(X) + N(Y) − N(X ∩ Y).

For two non-overlapping sub-populations X and Y, the combined mean and standard deviation are

μ = (NX μX + NY μY) / (NX + NY),
σ = √( (NX σX2 + NY σY2) / (NX + NY) + NX NY (μX − μY)2 / (NX + NY)2 ).

For example, suppose it is known that the average American man has a mean height of 70 inches
with a standard deviation of 3 inches and that the average American woman has a mean height of
65 inches with a standard deviation of 2 inches. Also assume that the number of men, N, is equal
to the number of women. Then the mean height of American adults could be calculated as
(70 + 65)/2 = 67.5 inches, with a standard deviation of √( (32 + 22)/2 + (70 − 65)2/4 ) = √12.75 ≈ 3.57 inches.
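A sketch of that combination for the men/women example, assuming equal (and here arbitrary) group sizes:

import math

def combine(n_x, mean_x, sd_x, n_y, mean_y, sd_y):
    # Mean and standard deviation of two pooled, non-overlapping populations.
    n = n_x + n_y
    mean = (n_x * mean_x + n_y * mean_y) / n
    var = (n_x * sd_x**2 + n_y * sd_y**2) / n + n_x * n_y * (mean_x - mean_y)**2 / n**2
    return mean, math.sqrt(var)

# Equal numbers of men (70 in, SD 3) and women (65 in, SD 2)
print(combine(1000, 70, 3, 1000, 65, 2))  # about (67.5, 3.57)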
If the size (actual or relative to one another), mean, and standard deviation of two overlapping
populations are known for the populations as well as their intersection, then the standard
deviation of the overall population can still be calculated as follows:
If two or more sets of data are being added in a pairwise fashion, the standard deviation can be
calculated if the covariance between each pair of data sets is known.
For the special case where no correlation exists between any pair of data sets, the relation
reduces to the root-mean-square form:

σtotal = √( σ12 + σ22 + … ).
Sample-based statistics
If the size, mean, and standard deviation of two overlapping samples are known for the samples
as well as their intersection, then the standard deviation of the combined sample can still be
calculated in the same general way.
Standard Deviation
What Does Standard Deviation Mean?
1. A measure of the dispersion of a set of data from its mean. The more spread apart the data, the
higher the deviation. Standard deviation is calculated as the square root of variance.
What Does Analysis Of Variance (ANOVA) Mean?
This type of analysis attempts to break down the various underlying factors that determine the
price of securities as well as market behavior. For example, it could possibly show how much of
a security's rise or fall is due to changes in interest rates. A t-test and an F-test are used to analyze
the results of an analysis of variance test to determine which variables are of statistical significance.
What Does Covariance Mean?
One method of calculating covariance is by looking at return surprises (deviations from expected
return) in each scenario. Another method is to multiply the correlation between the two variables
by the standard deviation of each variable.
For example, if stock A's return is high whenever stock B's return is high and the same can be
said for low returns, then these stocks are said to have a positive covariance. If an investor wants
a portfolio whose assets have diversified earnings, he or she should pick financial assets that
have low covariance with each other.
What Does Mean Mean?
The simple mathematical average of a set of two or more numbers. The mean for a given set of
numbers can be computed in more than one way, including the arithmetic mean method, which
uses the sum of the numbers in the series, and the geometric mean method. However, all of the
primary methods for computing a simple average of a normal number series produce the same
approximate result most of the time.
In contrast, the geometric mean would be computed as the third root of the numbers' product, or
the third root of 137,700, which approximately equals $51.64. While the two numbers are not
exactly equal, most people consider arithmetic and geometric means to be equivalent for
everyday purposes.
What Does Variance Mean?
A measure of the dispersion of a set of data points around their mean value. Variance is a
mathematical expectation of the average squared deviations from the mean.
Standard Deviation
The standard deviation is one of several indices of variability that statisticians
use to characterize the dispersion among the measures in a given population.
To calculate the standard deviation of a population it is first necessary to
calculate that population's variance. Numerically, the standard deviation is the
square root of the variance. Unlike the variance, which is a somewhat abstract
measure of variability, the standard deviation can be readily conceptualized as
a distance along the scale of measurement.
In statistics, correlation and dependence are any of a broad class of statistical relationships
between two or more random variables or observed data values.
Familiar examples of dependent phenomena include the correlation between the physical
statures of parents and their offspring, and the correlation between the demand for a product and
its price. Correlations are useful because they can indicate a predictive relationship that can be
exploited in practice. For example, an electrical utility may produce less power on a mild day
based on the correlation between electricity demand and weather. Correlations can also suggest
possible causal, or mechanistic relationships; however, statistical dependence is not sufficient to
demonstrate the presence of such a relationship.
Formally, dependence refers to any situation in which random variables do not satisfy a
mathematical condition of probabilistic independence. In general statistical usage, correlation or
co-relation can refer to any departure of two or more random variables from independence, but
most commonly refers to a more specialized type of relationship between mean values. There are
several correlation coefficients, often denoted ρ or r, measuring the degree of correlation. The
most common of these is the Pearson correlation coefficient, which is sensitive only to a linear
relationship between two variables (which may exist even if one is a nonlinear function of the
other). Other correlation coefficients have been developed to be more robust than the Pearson
correlation, or more sensitive to nonlinear relationships.[1][2][3]
Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the
correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that
relationship (middle), nor many aspects of nonlinear relationships (bottom). N.B.: the figure in the center
has a slope of 0 but in that case the correlation coefficient is undefined because the variance of Y is zero.
The most familiar measure of dependence between two quantities is the Pearson product-moment
correlation coefficient, or "Pearson's correlation." It is obtained by dividing the covariance of the
two variables by the product of their standard deviations. Karl Pearson developed the coefficient
from a similar but slightly different idea by Francis Galton.[4]
The population correlation coefficient ρX,Y between two random variables X and Y with expected
values μX and μY and standard deviations σX and σY is defined as

ρX,Y = corr(X, Y) = cov(X, Y) / (σX σY) = E[(X − μX)(Y − μY)] / (σX σY)

where E is the expected value operator, cov means covariance, and corr is a widely used
alternative notation for Pearson's correlation.
The Pearson correlation is defined only if both of the standard deviations are finite and both of
them are nonzero. It is a corollary of the Cauchy–Schwarz inequality that the correlation cannot
exceed 1 in absolute value. The correlation coefficient is symmetric: corr(X,Y) = corr(Y,X).
The Pearson correlation is +1 in the case of a perfect positive (increasing) linear relationship
(correlation), −1 in the case of a perfect decreasing (negative) linear relationship
(anticorrelation) [5], and some value between −1 and 1 in all other cases, indicating the degree of
linear dependence between the variables. As it approaches zero there is less of a relationship
(closer to uncorrelated). The closer the coefficient is to either −1 or 1, the stronger the correlation
between the variables.
If the variables are independent, Pearson's correlation coefficient is 0, but the converse is not true
because the correlation coefficient detects only linear dependencies between two variables. For
example, suppose the random variable X is symmetrically distributed about zero, and Y = X2.
Then Y is completely determined by X, so that X and Y are perfectly dependent, but their
correlation is zero; they are uncorrelated. However, in the special case when X and Y are jointly
normal, uncorrelatedness is equivalent to independence.
The sample correlation coefficient can be written as

r = Σ (xi − x̄)(yi − ȳ) / ( (n − 1) sx sy )

where x̄ and ȳ are the sample means of X and Y, and sx and sy are the sample standard deviations
of X and Y.
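A quick numerical illustration of both points, the sample formula and the Y = X2 case (numpy assumed available):

import numpy as np

def pearson_r(x, y):
    # Sample correlation: covariance divided by the product of sample standard deviations
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / ((len(x) - 1) * x.std(ddof=1) * y.std(ddof=1))

# X symmetric about zero and Y = X^2: perfectly dependent, yet uncorrelated
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = x ** 2
print(pearson_r(x, y))   # 0.0 (up to rounding)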
Rank correlation coefficients, such as Spearman's rank correlation coefficient and Kendall's rank
correlation coefficient (τ), measure the extent to which, as one variable increases, the other
variable tends to increase, without requiring that increase to be represented by a linear
relationship. If, as one variable increases, the other decreases, the rank correlation coefficients
will be negative. It is common to regard these rank correlation coefficients as alternatives to
Pearson's coefficient, used either to reduce the amount of calculation or to make the coefficient
less sensitive to non-normality in distributions. However, this view has little mathematical basis,
as rank correlation coefficients measure a different type of relationship than the Pearson product-
moment correlation coefficient, and are best seen as measures of a different type of association,
rather than as alternative measures of the population correlation coefficient.
To illustrate the nature of rank correlation, and its difference from linear correlation, consider the
following four pairs of numbers (x, y):
As we go from each pair to the next pair x increases, and so does y. This relationship is perfect,
in the sense that an increase in x is always accompanied by an increase in y. This means that we
have a perfect rank correlation, and both Spearman's and Kendall's correlation coefficients are 1,
whereas in this example Pearson product-moment correlation coefficient is 0.7544, indicating
that the points are far from lying on a straight line. In the same way if y always decreases when x
increases, the rank correlation coefficients will be −1, while the Pearson product-moment
correlation coefficient may or may not be close to -1, depending on how close the points are to a
straight line. Although in the extreme cases of perfect rank correlation the two coefficients are
both equal (being both +1 or both −1) this is not in general so, and values of the two coefficients
cannot meaningfully be compared. For example, for the three pairs (1, 1) (2, 3) (3, 2) Spearman's
coefficient is 1/2, while Kendall's coefficient is 1/3.
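The (1, 1), (2, 3), (3, 2) example can be checked directly with scipy (assumed available):

from scipy.stats import spearmanr, kendalltau

x = [1, 2, 3]
y = [1, 3, 2]

rho, _ = spearmanr(x, y)      # Spearman's coefficient
tau, _ = kendalltau(x, y)     # Kendall's coefficient
print(rho, tau)               # 0.5 and about 0.333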
Other measures of dependence among random variables
The information given by a correlation coefficient is not enough to define the dependence
structure between random variables. The correlation coefficient completely defines the
dependence structure only in very particular cases, for example when the distribution is a
multivariate normal distribution. (See diagram above.) In the case of elliptic distributions it
characterizes the (hyper-)ellipses of equal density, however, it does not completely characterize
the dependence structure (for example, a multivariate t-distribution's degrees of freedom
determine the level of tail dependence).
The correlation ratio is able to detect almost any functional dependency, and the entropy-based
mutual information/total correlation is capable of detecting even more general dependencies.
The latter are sometimes referred to as multi-moment correlation measures, in comparison to
those that consider only second-moment (pairwise, or quadratic) dependence.
The polychoric correlation is another correlation applied to ordinal data that aims to estimate the
correlation between theorised latent variables.
One way to capture a more complete view of dependence structure is to consider a copula
between them.
The degree of dependence between variables X and Y should not depend on the scale on which
the variables are expressed. Therefore, most correlation measures in common use are invariant to
location and scale transformations of the marginal distributions. That is, if we are analyzing the
relationship between X and Y, most correlation measures are unaffected by transforming X to
a + bX and Y to c + dY, where a, b, c, and d are constants. This is true of most correlation
statistics as well as their population analogues. Some correlation statistics, such as the rank
correlation coefficient, are also invariant to monotone transformations of the marginal
distributions of X and/or Y.
Pearson/Spearman correlation coefficients between X and Y are shown when the two variables' ranges
are unrestricted, and when the range of X is restricted to the interval (0,1).
Most correlation measures are sensitive to the manner in which X and Y are sampled.
Dependencies tend to be stronger if viewed over a wider range of values. Thus, if we consider
the correlation coefficient between the heights of fathers and their sons over all adult males, and
compare it to the same correlation coefficient calculated when the fathers are selected to be
between 165 cm and 170 cm in height, the correlation will be weaker in the latter case.
Various correlation measures in use may be undefined for certain joint distributions of X and Y.
For example, the Pearson correlation coefficient is defined in terms of moments, and hence will
be undefined if the moments are undefined. Measures of dependence based on quantiles are
always defined. Sample-based statistics intended to estimate population measures of dependence
may or may not have desirable statistical properties such as being unbiased, or asymptotically
consistent, based on the structure of the population from which the data were sampled.
Correlation matrices
The correlation matrix of n random variables X1, ..., Xn is the n × n matrix whose i,j entry is
corr(Xi, Xj). If the measures of correlation used are product-moment coefficients, the correlation
matrix is the same as the covariance matrix of the standardized random variables Xi /σ (Xi) for i =
1, ..., n. This applies to both the matrix of population correlations (in which case "σ " is the
population standard deviation), and to the matrix of sample correlations (in which case "σ "
denotes the sample standard deviation). Consequently, each is necessarily a positive-semidefinite
matrix.
The correlation matrix is symmetric because the correlation between Xi and Xj is the same as the
correlation between Xj and Xi.
Common misconceptions
The conventional dictum that "correlation does not imply causation" means that correlation
cannot be used to infer a causal relationship between the variables.[10] This dictum should not be
taken to mean that correlations cannot indicate the potential existence of causal relations.
However, the causes underlying the correlation, if any, may be indirect and unknown, and high
correlations also overlap with identity relations, where no causal process exists. Consequently,
establishing a correlation between two variables is not a sufficient condition to establish a causal
relationship (in either direction). For example, one may observe a correlation between an
ordinary alarm clock ringing and daybreak, though there is no causal relationship between these
phenomena.
A correlation between age and height in children is fairly causally transparent, but a correlation
between mood and health in people is less so. Does improved mood lead to improved health; or
does good health lead to good mood; or both? Or does some other factor underlie both? In other
words, a correlation can be taken as evidence for a possible causal relationship, but cannot
indicate what the causal relationship, if any, might be.
The Pearson correlation coefficient indicates the strength of a linear relationship between two
variables, but its value generally does not completely characterize their relationship. In
particular, if the conditional mean of Y given X, denoted E(Y|X), is not linear in X, the correlation
coefficient will not fully determine the form of E(Y|X).
The image on the right shows scatterplots of Anscombe's quartet, a set of four different pairs of
variables created by Francis Anscombe.[11] The four y variables have the same mean (7.5),
variance (4.12), correlation (0.816) and regression line (y = 3 + 0.5x). However, as can
be seen on the plots, the distribution of the variables is very different. The first one (top left)
seems to be distributed normally, and corresponds to what one would expect when considering
two variables correlated and following the assumption of normality. The second one (top right) is
not distributed normally; while an obvious relationship between the two variables can be
observed, it is not linear. In this case the Pearson correlation coefficient does not indicate that
there is an exact functional relationship: only the extent to which that relationship can be
approximated by a linear relationship. In the third case (bottom left), the linear relationship is
perfect, except for one outlier which exerts enough influence to lower the correlation coefficient
from 1 to 0.816. Finally, the fourth example (bottom right) shows another example when one
outlier is enough to produce a high correlation coefficient, even though the relationship between
the two variables is not linear.
These examples indicate that the correlation coefficient, as a summary statistic, cannot replace
the individual examination of the data. Note that the examples are sometimes said to demonstrate
that the Pearson correlation assumes that the data follow a normal distribution, but this is not
correct.
If a pair (X, Y) of random variables follows a bivariate normal distribution, the conditional mean
E(X|Y) is a linear function of Y, and the conditional mean E(Y|X) is a linear function of X. The
correlation coefficient r between X and Y, along with the marginal means and variances of X and
Y, determines this linear relationship:

E(Y|X) = EY + r σy (X − EX) / σx

where EX and EY are the expected values of X and Y, respectively, and σx and σy are the standard
deviations of X and Y, respectively.
The Design
Suppose a statistics teacher gave an essay final to his class. He randomly divides the class in
half such that half the class writes the final with a blue-book and half with notebook computers.
In addition the students are partitioned into three groups, no typing ability, some typing ability,
and highly skilled at typing. Answers written in blue-books will be transcribed to word
processors and scoring will be done blindly. Not with a blindfold, but the instructor will not
know the method or skill level of the student when scoring the final. The dependent measure will
be the score on the essay part of the final exam.
The first factor will be called Method and will have two levels, blue-book and computer. The
second factor will be designated as Ability and will have three levels: none, some, and lots. Each
subject will be measured a single time. Any effects discovered will necessarily be between
subjects or groups and hence the designation "between groups" designs.
The Data
In the case of the example data, the Method factor has two levels while the Ability factor has
three. The X variable is the score on the final exam. The example data file appears below.
The analysis is done in SPSS/WIN by selecting "Statistics", "General Linear Model", and then
"GLM - General Factorial." In the next screen, the Dependent Variable is X and the Fixed
Factors are Ability and Method. The screen will appear as follows.
The only "Options" that will be selected in this example is the "Descriptive Statistics" option
under "Display." This will produce the table of means and standard deviations.
Interpretation of Output
The interpretation of the output from the General Linear Model command will focus on two
parts: the table of means and the ANOVA summary table. The table of means is the primary
focus of the analysis while the summary table directs attention to the interesting or statistically
significant portions of the table of means.
Often the means are organized and presented in a slightly different manner than the form of the
output from the GENERAL LINEAR MODEL command. The table of means may be rearranged
and presented as follows:
The means inside the boxes are called cell means, the means in the margins are called marginal
means, and the number on the bottom right-hand corner is called the grand mean. An analysis of
these means reveals that there is very little difference between the marginal means for the
different levels of Method across the levels of Ability (30.33 vs. 30.56). The marginal means of
Ability over levels of Method are different (27.33 vs. 33.83 vs. 30.17) with the mean for "Some"
being the highest. The cell means show an increasing pattern for levels of Ability using a blue-
book (26.67 vs. 31.00 vs. 33.33) and a different pattern for levels of Ability using a computer
(28.00 vs. 36.67 vs. 27.00).
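A sketch that reproduces the marginal means and grand mean from the cell means quoted above (equal cell sizes assumed):

import numpy as np

# Cell means from the text: rows = Method (blue-book, computer), columns = Ability (none, some, lots)
cell_means = np.array([[26.67, 31.00, 33.33],
                       [28.00, 36.67, 27.00]])

method_marginals = cell_means.mean(axis=1)   # about [30.33, 30.56]
ability_marginals = cell_means.mean(axis=0)  # about [27.33, 33.83, 30.17]
grand_mean = cell_means.mean()

print(method_marginals, ability_marginals, grand_mean)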
Graphs of Means
Graphs of means are often used to present information in a manner that is easier to comprehend
than the tables of means. One factor is selected for presentation as the X-axis and its levels are
marked on that axis. Separate lines are drawn the height of the mean for each level of the second
factor. In the following graph, the Ability, or keyboard experience, factor was selected for the X-
axis and the Method, factor was selected for the different lines.
Presenting the information in an opposite fashion would be equally correct, although some
graphs are more easily understood than others, depending upon the values for the means and the
number of levels of each factor. The second possible graph is presented below.
It is recommended that, if there is any doubt, both versions of the graph be attempted and the
one that best illustrates the data be selected for inclusion in the statistical report. In this case
it appears that the graph with Ability on the X-axis is easier to understand than the one with
Method on the X-axis.
The results of the analysis are presented in the ANOVA summary table, presented below for the
example data.
The items of primary interest in this table are the effects listed under the "Source" column and
the values under the "Sig." column. As in the previous hypothesis test, if the value of "Sig" is less
than the value of α set by the experimenter, then that effect is significant. If α = .05, then the
Ability main effect and the Ability BY Method interaction would be significant in this table.
Main Effects
Main effects are differences in means over levels of one factor collapsed over levels of the other
factor. This is actually much easier than it sounds. For example, the main effect of Method is
simply the difference between the means of final exam score for the two levels of Method,
ignoring or collapsing over experience. As seen in the second method of presenting a table of
means, the main effect of Method is whether the two marginal means associated with the
Method factor are different. In the example case these means were 30.33 and 30.56, and the
difference between them was not statistically significant.
As can be seen from the summary table, the main effect of Ability is significant. This effect
refers to the differences between the three marginal means associated with Ability. In this case
the values for these means were 27.33, 33.83, and 30.17 and the differences between them may
be attributed to a real effect.
A simple main effect is a main effect of one factor at a given level of a second factor. In the
example data it would be possible to talk about the simple main effect of Ability at Method
equal blue-book. That effect would be the difference between the three cell means at level a1
(26.67, 31.00, and 33.33). One could also talk about the simple main effect of Method at Ability
equal lots (33.33 and 27.00). Simple main effects are not directly tested in this analysis. They
are, however, necessary to understand an interaction.
Interaction Effects
An interaction effect is a change in the simple main effect of one variable over levels of the
second. An A X B or A BY B interaction is a change in the simple main effect of B over levels of
A or the change in the simple main effect of A over levels of B. In either case the cell means
cannot be modeled simply by knowing the size of the main effects. An additional set of
parameters must be used to explain the differences between the cell means. These parameters are
collectively called an interaction.
The change in the simple main effect of one variable over levels of the other is most easily seen
in the graph of the interaction. If the lines describing the simple main effects are not parallel,
then a possibility of an interaction exists. As can be seen from the graph of the example data, the
possibility of a significant interaction exists because the lines are not parallel. The presence of an
interaction was confirmed by the significant interaction in the summary table. The following
graph overlays the main effect of Ability on the graph of the interaction.
Two things can be observed from this presentation. The first is that the main effect of Ability is
possibly significant, because the means are different heights. Second, the interaction is possibly
significant because the simple main effects of Ability using blue-book and computer are
different from the main effect of Ability.
One method of understanding how main effects and interactions work is to observe a wide
variety of data and data analysis. With three effects, A, B, and A x B, which may or may not be
significant there are eight possible combinations of effects. All eight are presented on the
following pages.
No Significant Effects
Main Effect of A
Main Effect of B
A x B Interaction
No Significant Effects
Note that the means and graphs of the last two example data sets were identical. The ANOVA
table, however, provided a quite different analysis of each data set. The data in this final set was
constructed such that there was a large standard deviation within each cell. In this case the
marginal and cell means were not different enough to warrant rejecting the hypothesis of no
effects, so no significant effects were observed.
Statisticians speak of two kinds of statistical error. The context is that there is a "null
hypothesis" which corresponds to a presumed default "state of nature", e.g., that an individual is
free of disease, or that an accused person is innocent. Corresponding to the null hypothesis is an
"alternative hypothesis" which corresponds to the opposite situation, that is, that the individual
has the disease, that the accused is guilty. The goal is to determine accurately if the null
hypothesis can be discarded in favor of the alternative. A test of some sort is conducted and data
are obtained. The result of the test may be negative (that is, it does not indicate disease, guilt).
On the other hand, it may be positive (that is, it may indicate disease, guilt). If the result of the
test does not correspond with the actual state of nature, then an error has occurred, but if the
result of the test corresponds with the actual state of nature, then a correct decision has been
made. There are two kinds of error, classified as "type I error" and "type II error," depending
upon which hypothesis has incorrectly been identified as the true state of nature.
Type I error
Type I error, also known as an "error of the first kind", an α error, or a "false positive": the
error of rejecting a null hypothesis when it is actually true. Plainly speaking, it occurs when we
are observing a difference when in truth there is none, thus indicating a test of poor specificity.
An example of this would be if a test shows that a woman is pregnant when in reality she is not,
or telling a patient he is sick when in fact he is not. Type I error can be viewed as the error of
excessive credulity.
Type II error
Type II error, also known as an "error of the second kind", a β error, or a "false negative":
the error of failing to reject a null hypothesis when it is in fact not true. In other words, this is the
error of failing to observe a difference when in truth there is one, thus indicating a test of poor
sensitivity. An example of this would be if a test shows that a woman is not pregnant, when in
reality, she is. Type II error can be viewed as the error of excessive skepticism.
                                Null hypothesis is true             Null hypothesis is false
Accept null hypothesis          Right decision                      Type II error (false negative)
Reject null hypothesis          Type I error (false positive)       Right decision
When an observer makes a Type I error in evaluating a sample against its parent population, he
or she is mistakenly thinking that a statistical difference exists when in truth there is no statistical
difference (or, to put another way, the null hypothesis should not be rejected but was mistakenly
rejected). For example, imagine that a pregnancy test has produced a "positive" result (indicating
that the woman taking the test is pregnant); if the woman is actually not pregnant though, then
we say the test produced a "false positive" (assuming the null hypothesis, Ho, was that she is not
pregnant). A Type II error, or a "false negative", is the error of failing to reject a null hypothesis
when the alternative hypothesis is the true state of nature. For example, a type II error occurs if a
pregnancy test reports "negative" when the woman is, in fact, pregnant.
From the Bayesian point of view, a type I error is one in which information that should not
substantially change one's prior estimate of probability nevertheless does; a type II error is one in
which information that should change one's estimate fails to do so. (Though the null hypothesis is
not quite the same thing as one's prior estimate, it is, rather, one's pro forma prior estimate.)
In summary:
Rejecting a null hypothesis when it should not have been rejected creates a type I error.
Failing to reject a null hypothesis when it should have been rejected creates a type II error.
Decision rules (or tests of hypotheses), in order to be good, must be designed to minimize errors
of decision.
Minimizing errors of decision is not a simple issue—for any given sample size the effort to reduce
one type of error generally results in increasing the other type of error.
Based on the real-life application of the error, one type may be more serious than the other.
(In such cases, a compromise should be reached in favor of limiting the more serious type of
error.)
The only way to minimize both types of error is to increase the sample size, and this may or may
not be feasible.
Hypothesis testing is the art of testing whether a variation between two sample distributions can
be explained by chance or not. In many practical applications type I errors are treated as more
serious than type II errors. In these cases, care is usually focused on minimizing the occurrence of
this statistical error. Suppose the probability of a type I error is 1%; then there is a 1% chance that
the observed variation is not real. This is called the level of significance. While 1% might be an
acceptable level of significance for one application, a different application can require a very
different level. For example, the standard goal of six sigma is to achieve precision to 4.5 standard
deviations above or below the mean. This means that only 3.4 parts per million are allowed to be
deficient in a normally distributed process. The probability of type I error is generally denoted
with the Greek letter alpha, α.
To state it simply, a type I error can usually be interpreted as a false alarm (a consequence of
insufficient specificity). A type II error can similarly be interpreted as an oversight or a lapse in
attention (a consequence of insufficient sensitivity). The probability of type II error is generally denoted
with the Greek letter beta, β.
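The interplay between α and β can be seen directly in a small simulation. The following sketch is illustrative and not part of the original discussion: it estimates both error rates for a two-sided z-test of H0: μ = 0, assuming a sample size of 25, a known standard deviation of 1, and a true mean of 0.5 under the alternative.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n, sigma, alpha = 25, 1.0, 0.05
z_crit = norm.ppf(1 - alpha / 2)                # two-sided critical value, about 1.96

def rejection_rate(mu_true, n_sims=100_000):
    # Fraction of simulated samples for which H0: mu = 0 is rejected.
    samples = rng.normal(mu_true, sigma, size=(n_sims, n))
    z = samples.mean(axis=1) / (sigma / np.sqrt(n))
    return np.mean(np.abs(z) > z_crit)

type_I_rate = rejection_rate(mu_true=0.0)       # H0 true: should be close to alpha = 0.05
type_II_rate = 1 - rejection_rate(mu_true=0.5)  # H0 false: this is beta; power = 1 - beta
print(type_I_rate, type_II_rate)

Raising the critical value (a stricter alpha) lowers the first number but raises the second, which is exactly the trade-off described above.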
In a memorable application, the cynic (who searches every kind act for a nefarious motive) fits the
standard attitude of type I error. His exact opposite, the gullible guy (who believes everything we
say) is classically guilty of type II error. For other real-life applications, see the "usage examples"
below.
Statistical error: the difference between a computed, estimated, or measured value and the
true, specified, or theoretically correct value (see errors and residuals in statistics) that is caused
by random and inherently unpredictable fluctuations in the measurement apparatus or the
system being studied.
Systematic error: the difference between a computed, estimated, or measured value and the
true, specified, or theoretically correct value that is caused by non-random fluctuations from an
unknown source (see uncertainty), and which, once identified, can usually be eliminated.
Etymology
In 1928, Jerzy Neyman (1894–1981) and Egon Pearson (1895–1980), both eminent statisticians,
discussed the problems associated with "deciding whether or not a particular sample may be
judged as likely to have been randomly drawn from a certain population": and, as Florence
Nightingale David remarked, "it is necessary to remember the adjective ‘random’ [in the term
‘random sample’] should apply to the method of drawing the sample and not to the sample
itself".
They identified "two sources of error", namely:
(a) the error of rejecting a hypothesis that should have been accepted, and
(b) the error of accepting a hypothesis that should have been rejected.
They elaborated on these two sources of error, remarking that "...in testing hypotheses two
considerations must be kept in view, (1) we must be able to reduce the chance of rejecting a true
hypothesis to as low a value as desired; (2) the test must be so devised that it will reject the
hypothesis tested when it is likely to be false."
In 1933, they observed that these "problems are rarely presented in such a form that we can
discriminate with certainty between the true and false hypothesis" (p.187). They also noted that,
in deciding whether to accept or reject a particular hypothesis amongst a "set of alternative
hypotheses" (p.201), it was easy to make an error:
In all of the papers co-written by Neyman and Pearson, the expression H0 always signifies "the
hypothesis to be tested". In the same paper they call these two sources of error errors of type I
and errors of type II respectively.
The false positive rate is the proportion of absent events that yield positive test outcomes, i.e., the
conditional probability of a positive test result given an absent event.
The false positive rate is equal to the significance level. The specificity of the test is equal to 1
minus the false positive rate.
In statistical hypothesis testing, this fraction is given the Greek letter α, and 1 − α is defined as
the specificity of the test. Increasing the specificity of the test lowers the probability of type I
errors, but raises the probability of type II errors (false negatives that reject the alternative
hypothesis when it is true).
The false negative rate is the proportion of present events that yield negative test outcomes, i.e.,
the conditional probability of a negative test result given a present event.
In statistical hypothesis testing, this fraction is given the letter β. The "power" (or the
"sensitivity") of the test is equal to 1 minus β.
It is standard practice for statisticians to conduct tests in order to determine whether or not a
"speculative hypothesis" concerning the observed phenomena of the world (or its inhabitants)
can be supported. The results of such testing determine whether a particular set of results agrees
reasonably (or does not agree) with the speculated hypothesis.
It is always assumed, by statistical convention, that the speculated hypothesis is wrong and that
the so-called "null hypothesis", namely that the observed phenomena simply occur by chance
(and that, as a consequence, the speculated agent has no effect), is true; the test determines
whether this null hypothesis can or cannot be rejected. This is why the hypothesis under test is
often called the null hypothesis (most likely coined by Fisher (1935, p. 19)), because it is this
hypothesis that is to be either nullified or not nullified by the test. When the null hypothesis is
nullified, it is possible to conclude that the data support the "alternative hypothesis" (which is the
original speculated one).
The consistent application by statisticians of Neyman and Pearson's convention of representing
"the hypothesis to be tested" (or "the hypothesis to be nullified") with the expression H0 has led
to circumstances where many understand the term "the null hypothesis" as meaning "the nil
hypothesis" — a statement that the results in question have arisen through chance. This is not
necessarily the case — the key restriction, as per Fisher (1966), is that "the null hypothesis must
be exact, that is free from vagueness and ambiguity, because it must supply the basis of the
'problem of distribution,' of which the test of significance is the solution.” As a consequence of
this, in experimental science the null hypothesis is generally a statement that a particular
treatment has no effect; in observational science, it is that there is no difference between the value
of a particular measured variable, and that of an experimental prediction.
The extent to which the test in question shows that the "speculated hypothesis" has (or has not)
been nullified is called its significance level; and the higher the significance level, the less likely
it is that the phenomena in question could have been produced by chance alone. British
statistician Sir Ronald Aylmer Fisher (1890–1962) stressed that the "null hypothesis":
...is never proved or established, but is possibly disproved, in the course of experimentation.
Every experiment may be said to exist only in order to give the facts a chance of disproving the
null hypothesis.
Bayes' theorem
The probability that an observed positive result is a false positive (as contrasted with an observed
positive result being a true positive) may be calculated using Bayes' theorem.
The key concept of Bayes' theorem is that the true rates of false positives and false negatives are
not a function of the accuracy of the test alone, but also the actual rate or frequency of
occurrence within the test population; and, often, the more powerful issue is the actual rates of
the condition within the sample being tested.
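A short calculation illustrates the point. The helper below is a sketch with invented prevalence, sensitivity, and specificity values; it applies Bayes' theorem to find the probability that a positive result is a true positive.

def prob_true_positive(prevalence, sensitivity, specificity):
    # P(condition present | positive test result), via Bayes' theorem
    p_pos_given_present = sensitivity
    p_pos_given_absent = 1 - specificity
    p_pos = p_pos_given_present * prevalence + p_pos_given_absent * (1 - prevalence)
    return p_pos_given_present * prevalence / p_pos

# With a rare condition, even an accurate test yields mostly false positives:
print(prob_true_positive(prevalence=0.001, sensitivity=0.99, specificity=0.99))
# roughly 0.09, i.e. about nine out of ten positive results are false positives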
Since the paired notions of Type I errors (or "false positives") and Type II errors (or "false
negatives") that were introduced by Neyman and Pearson are now widely used, their choice of
terminology ("errors of the first kind" and "errors of the second kind"), has led others to
suppose that certain sorts of mistake that they have identified might be an "error of the third
kind", "fourth kind", etc.
None of these proposed categories have met with any sort of wide acceptance. The following is a
brief account of some of these proposals.
David
Florence Nightingale David (1909–1993), a sometime colleague of both Neyman and Pearson at
University College London, making a humorous aside at the end of her 1947 paper, suggested
that, in the case of her own research, perhaps Neyman and Pearson's "two sources of
error" could be extended to a third:
I have been concerned here with trying to explain what I believe to be the basic ideas [of my
"theory of the conditional power functions"], and to forestall possible criticism that I am falling
into error (of the third kind) and am choosing the test falsely to suit the significance of the
sample. (1947, p.339)
Mosteller
In 1948, Frederick Mosteller (1916–2006) argued that a "third kind of error" was required to
describe circumstances he had observed, namely:
Type III error: "correctly rejecting the null hypothesis for the wrong reason". (1948, p. 61)
Kaiser
Henry F. Kaiser (1927–1992), in his 1966 paper, extended Mosteller's classification such that an
error of the third kind entailed an incorrect decision of direction following a rejected two-tailed
test of hypothesis. In his discussion (1966, pp. 162–163), Kaiser also speaks of α errors, β
errors, and γ errors for type I, type II and type III errors respectively.
Kimball
In 1957, Allyn W. Kimball, a statistician with the Oak Ridge National Laboratory, proposed a
different kind of error to stand beside "the first and second types of error in the theory of testing
hypotheses". Kimball defined this new "error of the third kind" as being "the error committed by
giving the right answer to the wrong problem" (1957, p. 134).
Mathematician Richard Hamming (1915–1998) expressed his view that "It is better to solve the
right problem the wrong way than to solve the wrong problem the right way".
Harvard economist Howard Raiffa describes an occasion when he, too, "fell into the trap of
working on the wrong problem" (1968, pp. 264–265).
Mitroff and Featheringham
In 1974, Ian Mitroff and Tom Featheringham extended Kimball's category, arguing that "one of
the most important determinants of a problem's solution is how that problem has been
represented or formulated in the first place".
They defined type III errors as either "the error... of having solved the wrong problem... when
one should have solved the right problem" or "the error... [of] choosing the wrong problem
representation... when one should have... chosen the right problem representation" (1974),
p. 383.
In 2009, Dirty Rotten Strategies by Ian I. Mitroff and Abraham Silvers was published, addressing
type III and type IV errors and providing many examples of both developing good answers to the
wrong questions (III) and deliberately selecting the wrong questions for intensive and skilled
investigation (IV). Most of the examples have nothing to do with statistics; many are problems
of public policy or business decisions.
Raiffa
In 1969, the Harvard economist Howard Raiffa jokingly suggested "a candidate for the error of
the fourth kind: solving the right problem too late" (1968, p. 264).
Marascuilo and Levin
In 1970, L. A. Marascuilo and J. R. Levin proposed a "fourth kind of error" -- a "Type IV error"
-- which they defined in a Mosteller-like manner as being the mistake of "the incorrect
interpretation of a correctly rejected hypothesis"; which, they suggested, was the equivalent of
"a physician's correct diagnosis of an ailment followed by the prescription of a wrong medicine"
(1970, p. 398).
Usage examples
Statistical tests always involve a trade-off between:
(a) the acceptable level of false positives (in which a non-match is declared to be a match) and
(b) the acceptable level of false negatives (in which an actual match is not detected).
A threshold value can be varied to make the test more restrictive or more sensitive; with the more
restrictive tests increasing the risk of rejecting true positives, and the more sensitive tests
increasing the risk of accepting false positives.
The notions of "false positives" and "false negatives" have a wide currency in the realm of
computers and computer applications.
Computer security
Security vulnerabilities are an important consideration in the task of keeping all computer data
safe, while maintaining access to that data for appropriate users (see computer security, computer
insecurity). Moulton (1983) stresses the importance of:
avoiding the type I errors (or false positives) that classify authorized users as imposters.
avoiding the type II errors (or false negatives) that classify imposters as authorized users (1983,
p. 125).
Spam filtering
A false positive occurs when "spam filtering" or "spam blocking" techniques wrongly classify a
legitimate email message as spam and, as a result, interfere with its delivery. While most anti-
spam tactics can block or filter a high percentage of unwanted emails, doing so without creating
significant false-positive results is a much more demanding task.
A false negative occurs when a spam email is not detected as spam, but is classified as "non-
spam". A low number of false negatives is an indicator of the efficiency of "spam filtering"
methods.
Malware
The term false positive is also used when antivirus software wrongly classifies an innocuous file
as a virus. The incorrect detection may be due to heuristics or to an incorrect virus signature in a
database. Similar problems can occur with antitrojan or antispyware software.
In computer database searching, the null hypothesis is that a document is irrelevant to the search
question. Thus, false positives are documents that are retrieved by a search despite their
irrelevance to the search question, and false negatives are relevant documents that the search
fails to retrieve. False positives are common in full text searching, in which the search algorithm
examines all of the text in all of the stored documents and tries to match one or more of the
search terms that have been supplied by the user. Note the contrast with spam filtering: in
searching it is usually more costly to miss a document you want (a false negative) than to
retrieve a document you do not want (a false positive).
Most false positives can be attributed to the deficiencies of natural language, which is often
ambiguous: e.g., the term "home" may mean "a person's dwelling" or "the main or top-level page
in a Web site".
Detection algorithms of all kinds often create false positives. Optical character recognition
(OCR) software may detect an "a" where there are only some dots that appear to be an "a" to the
algorithm being used.
Security screening
False positives are routinely found every day in airport security screening, which ultimately relies on
visual inspection. The installed security alarms are intended to prevent weapons being
brought onto aircraft; yet they are often set to such high sensitivity that they alarm many times a
day for minor items, such as keys, belt buckles, loose change, mobile phones, and tacks in shoes
(see explosive detection, metal detector.)
The ratio of false positives (identifying an innocent traveller as a terrorist) to true positives
(detecting a would-be terrorist) is, therefore, very high; and because almost every alarm is a false
positive, the positive predictive value of these screening tests is very low.
The relative cost of false results determines the likelihood that test creators allow these events to
occur. As the cost of a false negative in this scenario is extremely high (not detecting a bomb
being brought onto a plane could result in hundreds of deaths) whilst the cost of a false positive
is relatively low (a reasonably simple further inspection) the most appropriate test is one with a
high statistical sensitivity but low statistical specificity (one that allows minimal false negatives
in return for a high rate of false positives).
Biometrics
Biometric matching, such as for fingerprint, facial recognition or iris recognition, is susceptible
to type I and type II errors. The null hypothesis is that the input does identify someone in the
searched list of people, so:
the probability of type I errors is called the "False Reject Rate" (FRR) or False Non-match Rate
(FNMR),
while the probability of type II errors is called the "False Accept Rate" (FAR) or False Match Rate
(FMR).
If the system is designed to rarely match suspects then the probability of type II errors can be
called the "False Alarm Rate". On the other hand, if the system is used for validation (and
acceptance is the norm) then the FAR is a measure of system security, while the FRR measures
user inconvenience level.
Medical screening
In the practice of medicine, there is a significant difference between the applications of screening
and testing:
Screening involves relatively cheap tests that are given to large populations, none of whom
manifest any clinical indication of disease (e.g., Pap smears).
Testing involves far more expensive, often invasive, procedures that are given only to those who
manifest some clinical indication of disease, and are most often applied to confirm a suspected
diagnosis.
For example, most States in the USA require newborns to be screened for phenylketonuria and
hypothyroidism, among other congenital disorders. Although they display a high rate of false
positives, the screening tests are considered valuable because they greatly increase the likelihood
of detecting these disorders at a far earlier stage.
The simple blood tests used to screen possible blood donors for HIV and hepatitis have a
significant rate of false positives; however, physicians use much more expensive and far more
precise tests to determine whether a person is actually infected with either of these viruses.
Perhaps the most widely discussed false positives in medical screening come from the breast
cancer screening procedure mammography. The US rate of false positive mammograms is up to
15%, the highest in the world. The lowest rate in the world is in the Netherlands, at 1%.
The ideal population screening test would be cheap, easy to administer, and produce zero false-
negatives, if possible. Such tests usually produce more false-positives, which can subsequently
be sorted out by more sophisticated (and expensive) testing.
Medical testing
False negatives and false positives are significant issues in medical testing.
False negatives may provide a falsely reassuring message to patients and physicians that disease
is absent, when it is actually present. This sometimes leads to inappropriate or inadequate
treatment of both the patient and their disease. A common example is relying on cardiac stress
tests to detect coronary atherosclerosis, even though cardiac stress tests are known to only detect
limitations of coronary artery blood flow due to advanced stenosis.
False negatives produce serious and counter-intuitive problems, especially when the condition
being searched for is common. If a test with a false negative rate of only 10% is used to test a
population with a true occurrence rate of 70%, many of the "negatives" detected by the test will
be false. (See Bayes' theorem)
False positives can also produce serious and counter-intuitive problems when the condition
being searched for is rare, as in screening. If a test has a false positive rate of one in ten
thousand, but only one in a million samples (or people) is a true positive, most of the "positives"
detected by that test will be false.
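Both claims can be checked with a little arithmetic. The sketch below fills in the characteristics the text leaves unstated (a specificity of 90% in the first case and a perfect sensitivity in the second), so the exact figures are only indicative.

def fraction_of_negatives_that_are_false(prevalence, fn_rate, specificity):
    false_neg = prevalence * fn_rate
    true_neg = (1 - prevalence) * specificity
    return false_neg / (false_neg + true_neg)

def fraction_of_positives_that_are_false(prevalence, fp_rate, sensitivity):
    false_pos = (1 - prevalence) * fp_rate
    true_pos = prevalence * sensitivity
    return false_pos / (false_pos + true_pos)

# Common condition: 70% prevalence, 10% false negative rate (specificity assumed 90%)
print(fraction_of_negatives_that_are_false(0.70, 0.10, 0.90))   # about 0.21
# Rare condition: one-in-a-million prevalence, 1-in-10,000 false positive rate
print(fraction_of_positives_that_are_false(1e-6, 1e-4, 1.0))    # about 0.99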
Paranormal investigation
The notion of a false positive is common in cases of paranormal or ghost phenomena seen in
images and such, when there is another plausible explanation. When observing a photograph,
recording, or some other evidence that appears to have a paranormal origin—in this usage, a
false positive is a disproven piece of media "evidence" (image, movie, audio recording, etc.) that
actually has a normal explanation.
Ever wonder how someone in America can be arrested if they really are presumed innocent, why a defendant is
found not guilty instead of innocent, or why Americans put up with a justice system which sometimes allows
criminals to go free on technicalities? These questions can be understood by examining the similarity of the
American justice system to hypothesis testing in statistics and the two types of errors it can produce. (This
discussion assumes that the reader has at least been introduced to the normal distribution and its use in hypothesis
testing. Also please note that the American justice system is used for convenience. Other systems are similar in
nature, such as the British system, which inspired the American one.)
True, the trial process does not use numerical values while hypothesis testing in statistics does, but both share at
least four common elements (other than a lot of jargon that sounds like double talk):
1. The alternative hypothesis - This is the reason a criminal is arrested. Obviously the police don't think the
arrested person is innocent or they wouldn't arrest him. In statistics the alternative hypothesis is the
hypothesis the researchers wish to evaluate.
2. The null hypothesis - In the criminal justice system this is the presumption of innocence. In both the
judicial system and statistics the null hypothesis indicates that the suspect or treatment didn't do anything.
In other words, nothing out of the ordinary happened. The null is the logical opposite of the alternative. For
example "not white" is the logical opposite of white. Colors such as red, blue and green as well as black all
qualify as "not white".
3. A standard of judgment - In the justice system and statistics there is no possibility of absolute proof and
so a standard has to be set for rejecting the null hypothesis. In the justice system the standard is "a
reasonable doubt". The null hypothesis has to be rejected beyond a reasonable doubt. In statistics the
standard is the maximum acceptable probability that the effect is due to random variability in the data
rather than the potential cause being investigated. This standard is often set at 5% which is called the alpha
level.
4. A data sample - This is the information evaluated in order to reach a conclusion. As mentioned earlier, the
data is usually in numerical form for statistical analysis while it may be in a wide diversity of forms--eye-
witness, fiber analysis, fingerprints, DNA analysis, etc.--for the justice system. However in both cases there
are standards for how the data must be collected and for what is admissible. Both statistical analysis and the
justice system operate on samples of data or in other words partial information because, let's face it, getting
the whole truth and nothing but the truth is not possible in the real world.
It only takes one good piece of evidence to send a hypothesis down in flames but an endless amount to prove it
correct. If the null is rejected then logically the alternative hypothesis is accepted. This is why both the justice
system and statistics concentrate on disproving or rejecting the null hypothesis rather than proving the
alternative. It's much easier to do. If a jury rejects the presumption of innocence, the defendant is pronounced guilty.
Type I errors: Unfortunately, neither the legal system nor statistical testing is perfect. A jury sometimes makes an
error and an innocent person goes to jail. Statisticians, being highly imaginative, call this a type I error. Civilians call
it a travesty.
In the justice system, failure to reject the presumption of innocence gives the defendant a not guilty verdict. This
means only that the standard for rejecting innocence was not met. It does not mean the person really is innocent. It
would take an endless amount of evidence to actually prove the null hypothesis of innocence.
Type II errors: Sometimes, guilty people are set free. Statisticians have given this error the highly imaginative
name, type II error.
Americans find type II errors disturbing but not as horrifying as type I errors. A type I error means that not only has
an innocent person been sent to jail but the truly guilty person has gone free. In a sense, a type I error in a trial is
twice as bad as a type II error. Needless to say, the American justice system puts a lot of emphasis on avoiding type I
errors. This emphasis on avoiding type I errors, however, is not true in all cases where statistical hypothesis testing
is done.
In statistical hypothesis testing used for quality control in manufacturing, the type II error is
considered worse than a type I. Here the null hypothesis indicates that the product satisfies the
customer's specifications. If the null hypothesis is rejected for a batch of product, it cannot be
sold to the customer. Rejecting a good batch by mistake--a type I error--is a very expensive error
but not as expensive as failing to reject a bad batch of product--a type II error--and shipping it to
a customer. This can result in losing the customer and tarnishing the company's reputation.
The justice system:
                                            Defendant is innocent       Defendant is guilty
Reject presumption of innocence             Type I error                Correct
(guilty verdict)
Fail to reject presumption of innocence     Correct                     Type II error
(not guilty verdict)

Hypothesis testing:
                                            Null hypothesis is true     Null hypothesis is false
Reject null hypothesis                      Type I error                Correct
Fail to reject null hypothesis              Correct                     Type II error
In the criminal justice system a measurement of guilt or innocence is packaged in the form of a witness, similar to a
data point in statistical analysis. Using this comparison we can talk about sample size in both trials and hypothesis
tests. In a hypothesis test a single data point would be a sample size of one and ten data points a sample size of ten.
Likewise, in the justice system one witness would be a sample size of one, ten witnesses a sample size ten, and so
forth.
Impact on a jury is going to depend on the credibility of the witness as well as the actual testimony. An articulate
pillar of the community is going to be more credible to a jury than a stuttering wino, regardless of what he or she
says.
The normal distribution shown in figure 1 represents the distribution of testimony for all possible witnesses in a trial
for a person who is innocent. Witnesses represented by the left hand tail would be highly credible people who are
convinced that the person is innocent. Those represented by the right tail would be highly credible people
wrongfully convinced that the person is guilty.
At first glance, the idea that highly credible people could not just be wrong but also adamant about their testimony
might seem absurd, but it happens. According to the Innocence Project, "eyewitness misidentifications contributed to
over 75% of the more than 220 wrongful convictions in the United States overturned by post-conviction DNA
evidence." Who could possibly be more credible than a rape victim convinced of the identity of her attacker, yet
even here mistakes have been documented.
For example, a rape victim mistakenly identified John Jerome White as her attacker even though the actual
perpetrator was in the lineup at the time of identification. Thanks to DNA evidence White was eventually
exonerated, but only after wrongfully serving 22 years in prison.
If the standard of judgment for evaluating testimony were positioned as shown in figure 2 and only one witness
testified, the accused innocent person would be judged guilty (a type I error) if the witness's testimony was in the
red area. Since the normal distribution extends to infinity, type I errors would never be zero even if the standard of
judgment were moved to the far right. The only way to prevent all type I errors would be to arrest no one.
Unfortunately this would drive the number of unpunished criminals or type II errors through the roof.
figure 2. Distribution of possible witnesses in a trial when the accused is innocent, showing the probable outcomes
with a single witness.
Figure 3 shows what happens not only to innocent suspects but also guilty ones when they are arrested and tried for
crimes. In this case, the criminals are clearly guilty and face certain punishment if arrested.
figure 3. Distribution of possible witnesses in a trial showing the probable outcomes with a single witness if the
accused is innocent or obviously guilty.
figure 4. Distribution of possible witnesses in a trial showing the probable outcomes with a single witness if the
accused is innocent or not clearly guilty.
The value of unbiased, highly trained, top quality police investigators with state of the art equipment should be
obvious. There is no possibility of having a type I error if the police never arrest the wrong person. Of course,
modern tools such as DNA testing are very important, but so are properly designed and executed police procedures
and professionalism. The famous trial of O. J. Simpson would have likely ended in a guilty verdict if the Los
Angeles Police officers investigating the crime had been beyond reproach.
Type-I Error:
The null hypothesis H0 may be true but it may be rejected. This is an error and is called a Type-I error. When H0 is
true, the test-statistic, say Z, can take any value between −∞ and +∞, but we reject H0 only when Z lies in the
rejection region. In a two-sided test (such as H0: μ = μ0 against H1: μ ≠ μ0), the hypothesis is rejected when Z is less
than −zα/2 or greater than zα/2. When H0 is true, Z can fall in the rejection region with a probability equal to the size
of the rejection region, α. Thus it is possible that H0 is rejected while H0 is true; this is called a Type I error. The
probability that H0 is accepted when H0 is true is 1 − α, and accepting a true H0 is called a correct decision.
Alpha (α):
The probability of making a Type-I error is denoted by α (alpha). When a null hypothesis is rejected, we
may be wrong in rejecting it or we may be right in rejecting it; we do not know whether H0 is true or false.
Whatever our decision is, it will have the support of probability. A true hypothesis has some probability of
rejection, and this probability is also called the size of the Type-I error; it is denoted by α.
Type-II Error:
The null hypothesis H0 may be false but it may be accepted. This is an error and is called a Type-II error. The
value of the test-statistic may fall in the acceptance region even when H0 is in fact false. Suppose the hypothesis
being tested is H0: μ = μ0, that H0 is false, and that the true value of μ is μ1. If the difference between μ1 and μ0 is
very large, then the chance is very small that the (wrong) hypothesis H0 will be accepted: the true sampling
distribution of the statistic will lie far from the sampling distribution under H0, and there will be hardly any values
of the test-statistic that fall in the acceptance region of H0. When the true distribution of the test-statistic overlaps
the acceptance region of H0, however, H0 may be accepted even though it is false. If the difference between μ1 and
μ0 is small, then there is a high chance of accepting H0. This action is an error of Type-II.
A Type I error is committed whenever a true null hypothesis is rejected. A Type II error is committed
whenever a false null hypothesis is accepted. The best way to explain this is by an example. Suppose a
company develops a new drug. The FDA has to decide whether or not the new drug is safe. The null
hypothesis here is that the new drug is not safe. A Type I error is committed when a true null hypothesis
is rejected, e.g. the FDA concludes that the new drug is safe when it is not. A Type II error occurs
whenever a false null hypothesis is accepted, e.g. the drug is declared unsafe, when in fact it is safe.
EXPERIMENTAL ERRORS:
TYPE I ERROR - TYPE II ERROR
Whilst many will not have heard of Type I error or Type II error, most people will be familiar with the
terms 'false positive' and 'false negative', mainly as medical terms.
A patient might take an HIV test, promising a 99.9% accuracy rate. This means that 1 in every
1000 tests could give a ’false positive,’ informing a patient that they have the virus, when they do
not.
Conversely, the test could also show a false negative reading, giving an HIV positive patient the
all-clear. This is why most medical tests require duplicate samples, to stack the odds up
favorably. A one in one thousand chance becomes a 1 in 1 000 000 chance, if two independent
samples are tested.
With any scientific process, there is no such ideal as total proof or total rejection, and researchers
must, by necessity, work upon probabilities. That means that, whatever level of proof was
reached, there is still the possibility that the results may be wrong.
This could take the form of a false rejection, or acceptance, of the null hypothesis.
TYPE I ERROR
A Type I error is often referred to as a ’false positive’, and is the process of incorrectly rejecting the null
hypothesis in favor of the alternative. In the case above, the null hypothesis refers to the natural state of
things, stating that the patient is not HIV positive.
The alternative hypothesis states that the patient does carry the virus. A Type I error would
indicate that the patient has the virus when they do not, a false rejection of the null.
TYPE II ERROR
A Type II error is the opposite of a Type I error and is the false acceptance of the null hypothesis. A Type
II error, also known as a false negative, would imply that the patient is free of HIV when they are not, a
dangerous diagnosis.
In most fields of science, Type II errors are not seen to be as problematic as a Type I error. With
the Type II error, a chance to reject the null hypothesis was lost, and no conclusion is inferred
from a non-rejected null. The Type I error is more serious, because you have wrongly rejected
the null hypothesis.
Medicine, however, is one exception; telling a patient that they are free of disease, when they are
not, is potentially dangerous.
REPLICATION
This is the reason why scientific experiments must be replicatable, and other scientists must be able to
follow the exact methodology.
Even if the highest level of proof, where P < 0.01 (probability is less than 1%), is reached, about
one out of every 100 such experiments can still be expected to give a false result. To a certain
extent, duplicate or triplicate samples reduce the chance of error, but may still mask chance if the
error-causing variable is present in all samples.
If however, other researchers, using the same equipment, replicate the experiment and find that
the results are the same, the chances of 5 or 10 experiments giving false results are unbelievably
small. This is how science regulates, and minimizes, the potential for Type I and Type II errors.
One area that is guilty of ignoring Type I and II errors is the lawcourt, where the jury is not told
that fingerprint and DNA tests may produce false results. There have been many documented
miscarriages of justice involving these tests. Many courts will now not accept these tests alone,
as proof of guilt, and require other evidence.
Many statisticians are now adopting a third type of error, a type III, which is where the null hypothesis
was rejected for the wrong reason.
The problem is, that there may be some relationship between the variables, but it could be for a
different reason than stated in the hypothesis. An unknown process may underlie the relationship.
CONCLUSION
Both Type I errors and Type II errors are factors that every scientist and researcher must take into
account.
Whilst replication can minimize the chances of an inaccurate result, this is one of the major
reasons why research should be replicatable.
Many scientists do not accept quasi-experiments, because they are difficult to replicate and
analyze.
Z-test
A Z-test is any statistical test for which the distribution of the test statistic under the null
hypothesis can be approximated by a normal distribution. Due to the central limit theorem, many
test statistics are approximately normally distributed for large samples. Therefore, many
statistical tests can be performed as approximate Z-tests if the sample size is not too small.
General form
The most general way to obtain a Z-test is to define a numerical test statistic that can be
calculated from a collection of data, such that the sampling distribution of the statistic is
approximately normal under the null hypothesis. Statistics that are averages (or approximate
averages) of approximately independent data values are generally well-approximated by a
normal distribution. An example of a statistic that would not be well-approximated by a normal
distribution would be an extreme value such as the sample maximum.
If T is a statistic that is approximately normally distributed under the null hypothesis, the next
step in performing a Z-test is to determine the expected value θ of T under the null hypothesis,
and then obtain an estimate s of the standard deviation of T. We then calculate the standard score
Z = (T − θ) / s, from which one-tailed and two-tailed p-values can be calculated as Φ(−|Z|) and
2Φ(−|Z|), respectively, where Φ is the standard normal cumulative distribution function.
The term Z-test is often used to refer specifically to the one-sample location test comparing the
mean of a set of measurements to a given constant. If the observed data X1, ..., Xn are (i)
uncorrelated, (ii) have a common mean μ, and (iii) have a common variance σ2, then the sample
average X has mean μ and variance σ2 / n. If our null hypothesis is that the mean value of the
population is a given number μ0, we can use X −μ0 as a test-statistic, rejecting the null hypothesis
if X −μ0 is large.
Other location tests that can be performed as Z-tests are the two-sample location test and the
paired difference test.
Conditions
Nuisance parameters should be known, or estimated with high accuracy (an example of a
nuisance parameter would be the standard deviation in a one-sample location test). Z-tests focus
on a single parameter, and treat all other unknown parameters as being fixed at their true
values. In practice, due to Slutsky's theorem, "plugging in" consistent estimates of nuisance
parameters can be justified. However if the sample size is not large enough for these estimates
to be reasonably accurate, the Z-test may not perform well.
The test statistic should follow a normal distribution. Generally, one appeals to the central limit
theorem to justify assuming that a test statistic varies normally. There is a great deal of statistical
research on the question of when a test statistic varies approximately normally. If the variation
of the test statistic is strongly non-normal, a Z-test should not be used.
In some situations, it is possible to devise a test that properly accounts for the variation in plug-in
estimates of nuisance parameters. In the case of one and two sample location problems, a t-test
does this.
Example
Suppose that in a particular geographic region, the mean and standard deviation of scores on a
reading test are 100 points, and 12 points, respectively. Our interest is in the scores of 55 students
in a particular school who received a mean score of 96. We can ask whether this mean score is
significantly lower than the regional mean — that is, are the students in this school comparable
to a simple random sample of 55 students from the region as a whole, or are their scores
surprisingly low?
In this example, we treat the population mean and variance as known, which would be
appropriate either if all students in the region were tested, or if a large random sample were used
to estimate the population mean and variance with minimal estimation error.
The classroom mean score is 96, which is −2.47 standard error units from the population mean of
100. Looking up the z-score in a table of the standard normal distribution, we find that the
probability of observing a standard normal value below -2.47 is approximately 0.5 - 0.4932 =
0.0068. This is the one-sided p-value for the null hypothesis that the 55 students are comparable
to a simple random sample from the population of all test-takers. The two-sided p-value is
approximately 0.014 (twice the one-sided p-value).
Another way of stating things is that with probability 1 − 0.014 = 0.986, a simple random sample
of 55 students would have a mean test score within 4 units of the population mean. We could also
say that with 98.6% confidence we reject the null hypothesis that the 55 test takers are comparable
to a simple random sample from the population of test-takers.
The Z-test tells us that the 55 students of interest have an unusually low mean test score
compared to most simple random samples of similar size from the population of test-takers. A
deficiency of this analysis is that it does not consider whether the effect size of 4 points is
meaningful. If instead of a classroom, we considered a subregion containing 900 students whose
mean score was 99, nearly the same z-score and p-value would be observed. This shows that if
the sample size is large enough, very small differences from the null value can be highly
statistically significant. See statistical hypothesis testing for further discussion of this issue.
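For readers who want to verify the arithmetic in software, the short sketch below (assuming the scipy library is available) reproduces the z-score and the one- and two-sided p-values for this example.

from math import sqrt
from scipy.stats import norm

mu0, sigma, n, xbar = 100, 12, 55, 96
z = (xbar - mu0) / (sigma / sqrt(n))     # about -2.47
p_one_sided = norm.cdf(z)                # about 0.0068
p_two_sided = 2 * norm.cdf(-abs(z))      # about 0.014
print(z, p_one_sided, p_two_sided)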
Location tests are the most familiar Z-tests. Another class of Z-tests arises in maximum likelihood
estimation of the parameters in a parametric statistical model. Maximum likelihood estimates are
approximately normal under certain conditions, and their asymptotic variance can be calculated
in terms of the Fisher information. The maximum likelihood estimate divided by its standard
error can be used as a test statistic for the null hypothesis that the population value of the
parameter equals zero. More generally, if θ̂ is the maximum likelihood estimate of a parameter θ,
and θ0 is the value of θ under the null hypothesis, then the test statistic is Z = (θ̂ − θ0) / SE(θ̂),
where SE(θ̂) is the standard error of the maximum likelihood estimate.
Z-tests are employed whenever it can be argued that a test statistic follows a normal distribution
under the null hypothesis of interest. Many non-parametric test statistics, such as U statistics, are
approximately normal for large enough sample sizes, and hence are often performed as Z-tests.
Z-test
Description
The Z-test compares sample and population means to determine if there is a significant
difference.
It requires a simple random sample from a population with a Normal distribution and whose
mean and standard deviation are known.
Calculation
z = (x̄ − μ) / SE
where x̄ is the sample mean to be standardized, μ (mu) is the population mean and SE is the
standard error of the mean:
SE = σ / SQRT(n)
where σ (sigma) is the population standard deviation and n is the sample size.
The z value is then looked up in a z-table. A negative z value means it is below the population
mean (the sign is ignored in the lookup table).
Discussion
The Z-test is typically used with standardized tests, checking whether the scores from a particular
sample are within or outside the standard test performance.
The z value indicates the number of standard error units the sample mean lies from the population
mean.
Note that the z-test is not the same as the z-score, although they are closely related.
Z-TEST
Z-test is a statistical test where normal distribution is applied and is basically used for dealing with
problems relating to large samples when n ≥ 30.
n = sample size
For example suppose a person wants to test if both tea & coffee are equally popular in a
particular town. Then he can take a sample of size say 500 from the town out of which suppose
280 are tea drinkers. To test the hypothesis, he can use Z-test.
1. z test for single proportion is used to test a hypothesis on a specific value of the population
proportion.
Statistically speaking, we test the null hypothesis H0: p = p0 against the alternative
hypothesis H1: p >< p0 where p is the population proportion and p0 is a specific value of
the population proportion we would like to test for acceptance.
The example on tea drinkers explained above requires this test. In that example, p0 = 0.5.
Notice that in this particular example, proportion refers to the proportion of tea drinkers. (A
worked sketch of this calculation appears after this list.)
2. z test for difference of proportions is used to test the hypothesis that two populations have the
same proportion.
For example suppose one is interested to test if there is any significant difference in the
habit of tea drinking between male and female citizens of a town. In such a situation, Z-
test for difference of proportions can be applied.
One would have to obtain two independent samples from the town- one from males and
the other from females and determine the proportion of tea drinkers in each sample in
order to perform this test.
3. z -test for single mean is used to test a hypothesis on a specific value of the population mean.
Statistically speaking, we test the null hypothesis H0: μ = μ0 against the alternative
hypothesis H1: μ >< μ0 where μ is the population mean and μ0 is a specific value of the
population that we would like to test for acceptance.
Unlike the t-test for single mean, this test is used if n ≥ 30 and population standard
deviation is known.
4. z test for single variance is used to test a hypothesis on a specific value of the population
variance.
Statistically speaking, we test the null hypothesis H0: σ = σ0 against H1: σ >< σ0, where σ
is the population standard deviation and σ0 is the specific value of it that we would
like to test for acceptance.
In other words, this test enables us to test if the given sample has been drawn from a
population with specific variance σ0. Unlike the chi square test for single variance, this
test is used if n ≥ 30.
5. Z-test for testing equality of variance is used to test the hypothesis of equality of two population
variances when the sample size of each sample is 30 or larger.
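As a worked sketch of the first case, the tea-drinkers example (280 tea drinkers in a sample of 500, tested against p0 = 0.5) can be computed as follows. The two-sided alternative and the use of the scipy library are assumptions made here for illustration.

from math import sqrt
from scipy.stats import norm

n, x, p0 = 500, 280, 0.5                       # H0: p = 0.5 (tea and coffee equally popular)
p_hat = x / n
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)     # about 2.68
p_value = 2 * norm.sf(abs(z))                  # two-sided p-value, about 0.007
print(z, p_value)                              # H0 is rejected at the 5% level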
ASSUMPTION
Irrespective of the type of Z-test used, it is assumed that the populations from which the samples are
drawn are normal. For the equal-variance t-test it is additionally assumed that the variances of the
samples are the same (this can be checked with an F-test).
It is well publicised that female students are currently doing better than male
students! It could be speculated that this is due to brain size differences? To
assess differences between a set of male students' brains and female
students' brains a z or t-test could be used. This is an important issue (as I'm
sure you'll realise lads) and we should use substantial numbers of
measurements. Several universities and colleges are visited and a set of male
brain volumes and a set of female brain volumes are gathered (I leave it to
your imagination how the brain sizes are obtained!).
Hypotheses: the null hypothesis (HO) is that there is no difference between the mean male and female
brain volumes; the alternative hypothesis (HA) is that there is a difference.
Data arrangement
Excel can apply the z or t-tests to data arranged in rows or in columns, but the
statistical packages nearly always use columns, which must be placed side by side.
For the z-test degrees of freedom are not required since z-scores
of 1.96 and 2.58 are used for 5% and 1% respectively.
The output from the z and t-tests is always similar and there are several
values you need to look for:
You can check that the program has used the right data by making sure that
the means (1.81 and 1.66 for the t-test), number of observations (32, 32) and
degrees of freedom (62) are correct. The information you then need to use in
order to reject or accept your HO, are the bottom five values. The t Stat value
is the calculated value relating to your data. This must be compared with the
two t Critical values depending on whether you have decided on a one or two-
tail test (do not confuse these terms with the one or two-way ANOVA). If the
calculated value exceeds the critical values the HO must be rejected at the
level of confidence you selected before the test was executed. Both the one
and two-tailed results confirm that the HO must be rejected and the HA
accepted.
We can also use the P(T<=t) values to ascertain the precise probability rather
than the one specified beforehand. For the results of the t-test above, the
probability of the differences occurring by chance for the one-tail test is
2.3E-11 (about 2.3x10-9 when expressed as a percentage). All the above
P-values denote highly significant differences.
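The same comparison can be run outside Excel. The sketch below uses the means and sample sizes reported above (1.81 and 1.66, with 32 observations in each group) but invents the standard deviations, so its t statistic and P-value are illustrative rather than a reproduction of the worked output.

from scipy.stats import ttest_ind_from_stats

result = ttest_ind_from_stats(mean1=1.81, std1=0.12, nobs1=32,
                              mean2=1.66, std2=0.13, nobs2=32,
                              equal_var=True)   # pooled (equal-variance) two-sample t-test
print(result.statistic, result.pvalue)          # t on 62 degrees of freedom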
F-test
An F-test is any statistical test in which the test statistic has an F-distribution under the null
hypothesis. It is most often used when comparing statistical models that have been fit to a data
set, in order to identify the model that best fits the population from which the data were sampled.
Exact F-tests mainly arise when the models have been fit to the data using least squares. The
name was coined by George W. Snedecor, in honour of Sir Ronald A. Fisher. Fisher initially
developed the statistic as the variance ratio in the 1920s.
Examples of F-tests include:
The hypothesis that the means of several normally distributed populations, all having the same
standard deviation, are equal. This is perhaps the best-known F-test, and plays an important role
in the analysis of variance (ANOVA).
The hypothesis that a proposed regression model fits the data well. See Lack-of-fit sum of
squares.
The hypothesis that a data set in a regression analysis follows the simpler of two proposed linear
models that are nested within each other.
The F-test in one-way analysis of variance is used to assess whether the expected values of a
quantitative variable within several pre-defined groups differ from each other. For example,
suppose that a medical trial compares four treatments. The ANOVA F-test can be used to assess
whether any of the treatments is on average superior, or inferior, to the others versus the null
hypothesis that all four treatments yield the same mean response. This is an example of an
"omnibus" test, meaning that a single test is performed to detect any of several possible
differences. Alternatively, we could carry out pairwise tests among the treatments (for instance,
in the medical trial example with four treatments we could carry out six tests among pairs of
treatments). The advantage of the ANOVA F-test is that we do not need to pre-specify which
treatments are to be compared, and we do not need to adjust for making multiple comparisons.
The disadvantage of the ANOVA F-test is that if we reject the null hypothesis, we do not know
which treatments can be said to be significantly different from the others — if the F-test is
performed at level α we cannot state that the treatment pair with the greatest mean difference is
significantly different at level α.
The F-statistic is the ratio of the between-group variability to the within-group variability:
F = [ Σi ni (Ȳi − Ȳ)² / (K − 1) ] / [ Σij (Yij − Ȳi)² / (N − K) ]
where Yij is the jth observation in the ith out of K groups, Ȳi and ni are the sample mean and size of the
ith group, Ȳ is the overall mean, and N is the overall sample size. This F-statistic follows the
F-distribution with K − 1, N − K degrees of freedom under the null hypothesis. The statistic will
be large if the between-group variability is large relative to the
within-group variability, which is unlikely to happen if the population means of the groups all
have the same value.
Note that when there are only two groups for the one-way ANOVA F-test, F = t2 where t is the
Student's t statistic.
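This relationship is easy to confirm numerically. The sketch below, using two small simulated samples, checks that the one-way ANOVA F statistic equals the square of the equal-variance two-sample t statistic and that the two p-values agree.

import numpy as np
from scipy.stats import f_oneway, ttest_ind

rng = np.random.default_rng(1)
g1 = rng.normal(0.0, 1.0, size=20)
g2 = rng.normal(0.5, 1.0, size=20)

F, p_f = f_oneway(g1, g2)
t, p_t = ttest_ind(g1, g2)                          # equal_var=True by default
print(np.isclose(F, t**2), np.isclose(p_f, p_t))    # both True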
Regression problems
Consider two models, 1 and 2, where model 1 is 'nested' within model 2. That is, model 1 has p1
parameters, and model 2 has p2 parameters, where p2 > p1, and for any choice of parameters in
model 1, the same regression curve can be achieved by some choice of the parameters of model
2. (We use the convention that any constant parameter in a model is included when counting the
parameters. For instance, the simple linear model y = mx + b has p = 2 under this convention.)
The model with more parameters will always be able to fit the data at least as well as the model
with fewer parameters. Thus typically model 2 will give a better (i.e. lower error) fit to the data
than model 1. But one often wants to determine whether model 2 gives a significantly better fit to
the data. One approach to this problem is to use an F test.
If there are n data points to estimate parameters of both models from, then one can calculate the
F statistic, given by
F = [ (RSS1 − RSS2) / (p2 − p1) ] / [ RSS2 / (n − p2) ]
where RSSi is the residual sum of squares of model i. If your regression model has been
calculated with weights, then replace RSSi with χ2, the weighted sum of squared residuals. Under
the null hypothesis that model 2 does not provide a significantly better fit than model 1, F will
have an F distribution, with (p2 − p1, n − p2) degrees of freedom. The null hypothesis is rejected
if the F calculated from the data is greater than the critical value of the F distribution for some
desired false-rejection probability (e.g. 0.05). The test is a likelihood ratio test.
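A minimal sketch of this model-comparison F-test, using simulated data and an intercept-only model (p1 = 1) nested inside a straight-line model (p2 = 2), might look as follows; the data and the model choices are purely illustrative.

import numpy as np
from scipy.stats import f

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 1.5 + 0.4 * x + rng.normal(0, 1, size=x.size)
n = y.size

rss1 = np.sum((y - y.mean()) ** 2)                  # residual SS of the intercept-only model
slope, intercept = np.polyfit(x, y, 1)              # least-squares straight line
rss2 = np.sum((y - (slope * x + intercept)) ** 2)   # residual SS of the straight-line model

p1, p2 = 1, 2
F = ((rss1 - rss2) / (p2 - p1)) / (rss2 / (n - p2))
p_value = f.sf(F, p2 - p1, n - p2)
print(F, p_value)    # a small p-value indicates model 2 fits significantly better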
Consider an experiment to study the effect of three different levels of some factor on a response
(e.g. three types of fertilizer on plant growth). If we had 6 observations for each level, we could
write the outcome of the experiment in a table like this, where a1, a2, and a3 are the three levels
of the factor being studied.
a1 a2 a3
6 8 13
8 12 9
4 9 11
5 11 8
3 6 7
4 8 12
The null hypothesis, denoted H0, for the overall F-test for this experiment would be that all three
levels of the factor produce the same response, on average. To calculate the F-ratio:
Step 1: Calculate the mean within each group. The means of groups a1, a2, and a3 are
(6 + 8 + 4 + 5 + 3 + 4) / 6 = 5, (8 + 12 + 9 + 11 + 6 + 8) / 6 = 9, and (13 + 9 + 11 + 8 + 7 + 12) / 6 = 10.
Step 2: Calculate the overall mean of all the data: (5 + 9 + 10) / 3 = 8.
Step 3: Calculate the "between-group" sum of squared differences:
SSB = 6(5 − 8)2 + 6(9 − 8)2 + 6(10 − 8)2 = 54 + 6 + 24 = 84
The between-group degrees of freedom is one less than the number of groups,
dfb = 3 − 1 = 2
so the between-group mean square value is
MSB = 84 / 2 = 42
Step 4: Calculate the "within-group" sum of squares. Begin by centering the data in each group
a1 a2 a3
6 − 5 = 1 8 − 9 = -1 13 − 10 = 3
8 − 5 = 3 12 − 9 = 3 9 − 10 = -1
4 − 5 = -1 9 − 9 = 0 11 − 10 = 1
5 − 5 = 0 11 − 9 = 2 8 − 10 = -2
3 − 5 = -2 6 − 9 = -3 7 − 10 = -3
4 − 5 = -1 8 − 9 = -1 12 − 10 = 2
The within-group sum of squares is the sum of squares of all 18 values in this table
SSW = 1 + 9 + 1 + 0 + 4 + 1 + 1 + 9 + 0 + 4 + 9 + 1 + 9 + 1 + 1 + 4 + 9 + 4 = 68
The within-group degrees of freedom is the total number of observations minus the number of groups,
dfw = 18 − 3 = 15
so the within-group mean square value is MSW = 68 / 15 ≈ 4.5, and the F-ratio is
F = MSB / MSW = 42 / 4.5 ≈ 9.3
The critical value is the number that the test statistic must exceed to reject the null hypothesis.
In this case, Fcrit(2,15) = 3.68 at α = 0.05. Since F = 9.3 > 3.68, the results are significant at the 5%
significance level. One would reject the null hypothesis, concluding that there is strong evidence
that the expected values in the three groups differ. The p-value for this test is 0.002.
After performing the F-test, it is common to carry out some "post-hoc" analysis of the group
means. In this case, the first two group means differ by 4 units, the first and third group means
differ by 5 units, and the second and third group means differ by only 1 unit. The standard error
of each of these differences is √(4.5/6 + 4.5/6) ≈ 1.2. Thus the first group is strongly different from the other groups,
as the mean difference is more than three times the standard error, so we can be highly confident that the
population mean of the first group differs from the population means of the other groups.
However there is no evidence that the second and third groups have different population means
from each other, as their mean difference of one unit is comparable to the standard error.
Note F(x, y) denotes an F-distribution with x degrees of freedom in the numerator and y degrees
of freedom in the denominator.
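For readers who want to check the arithmetic, here is a small Python/scipy sketch (not part of the
original text) that reproduces the worked example above, both with the built-in one-way ANOVA
routine and step by step.

import numpy as np
from scipy import stats

a1 = [6, 8, 4, 5, 3, 4]
a2 = [8, 12, 9, 11, 6, 8]
a3 = [13, 9, 11, 8, 7, 12]

F, p = stats.f_oneway(a1, a2, a3)
print(F, p)   # F is roughly 9.3 and p is roughly 0.002, as stated above

# the same F-ratio by hand, following the steps in the text
groups = [np.array(a1), np.array(a2), np.array(a3)]
grand = np.mean(np.concatenate(groups))
ssb = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)  # 84
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)       # 68
msb, msw = ssb / 2, ssw / 15
print(msb / msw)                 # about 9.26
print(stats.f.ppf(0.95, 2, 15))  # critical value, about 3.68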
ANOVA's robustness with respect to Type I errors for departures from
population normality
The one-way ANOVA can be generalized to the factorial and multivariate layouts, as well as to
the analysis of covariance. None of these F-tests, however, are robust when there are severe
violations of the assumption that each population follows the normal distribution, particularly for
small alpha levels and unbalanced layouts. Furthermore, if the underlying assumption of
homoscedasticity is violated, the Type I error properties degenerate much more severely. For
nonparametric alternatives in the factorial layout, see Sawilowsky.
Stats: F-Test
There are two independent degrees of freedom, one for the numerator, and one for the
denominator.
There are many different F distributions, one for each pair of degrees of freedom.
F-Test
The F-test is designed to test if two population variances are equal. It does this by comparing the
ratio of two variances. So, if the variances are equal, the ratio of the variances will be 1.
All hypothesis testing is done under the assumption that the null hypothesis is true. If the null
hypothesis is true, then the F test statistic given above can be simplified dramatically: the ratio
of the two sample variances becomes the test statistic used. If the null hypothesis is false, then
we will reject the null hypothesis that the ratio is equal to 1, along with our assumption that the
variances are equal.
There are several different F-tables. Each one has a different level of significance. So, find the
correct level of significance first, and then look up the numerator degrees of freedom and the
denominator degrees of freedom to find the critical value.
You will notice that all of the tables only give the level of significance for right tail tests. Because
the F distribution is not symmetric, and there are no negative values, you may not simply take
the opposite of the right critical value to find the left critical value. The way to find a left critical
value is to reverse the degrees of freedom, look up the right critical value, and then take the
reciprocal of this value. For example, the critical value with 0.05 on the left with 12 numerator
and 15 denominator degrees of freedom is found by taking the reciprocal of the critical value
with 0.05 on the right with 15 numerator and 12 denominator degrees of freedom.
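A quick numerical check of this reciprocal rule, written as a Python/scipy sketch (scipy is not
part of the original notes), is shown below.

from scipy import stats

right = stats.f.ppf(0.95, 15, 12)        # critical value with 0.05 in the right tail, 15 and 12 df
left_direct = stats.f.ppf(0.05, 12, 15)  # critical value with 0.05 in the left tail, 12 and 15 df
left_via_reciprocal = 1.0 / right        # reciprocal rule described above

print(left_direct, left_via_reciprocal)  # the two values agree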
Since the left critical values are a pain to calculate, they are often avoided altogether. This is the
procedure followed in the textbook. You can force the F test into a right tail test by placing the
sample with the large variance in the numerator and the smaller variance in the denominator. It
does not matter which sample has the larger sample size, only which sample has the larger
variance.
The numerator degrees of freedom will be the degrees of freedom for whichever sample has the
larger variance (since it is in the numerator) and the denominator degrees of freedom will be the
degrees of freedom for whichever sample has the smaller variance (since it is in the
denominator).
If a two-tail test is being conducted, you still have to divide alpha by 2, but you only look up and
compare the right critical value.
Assumptions / Notes
Divide alpha by 2 for a two tail test and then find the right critical value
When the degrees of freedom aren't given in the table, go with the value with the larger critical
value (this happens to be the smaller degrees of freedom). This is so that you are less likely to
reject in error (type I error)
The populations from which the samples were obtained must be normal.
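The procedure described above (larger sample variance in the numerator, alpha halved for a two-tail
test) can be sketched in a few lines of Python; the helper function and the two small samples below
are hypothetical and only meant to illustrate the steps.

import numpy as np
from scipy import stats

def variance_ratio_test(x, y, alpha=0.05):
    # force a right-tail test by putting the larger sample variance in the numerator
    vx, vy = np.var(x, ddof=1), np.var(y, ddof=1)
    if vx >= vy:
        f, dfn, dfd = vx / vy, len(x) - 1, len(y) - 1
    else:
        f, dfn, dfd = vy / vx, len(y) - 1, len(x) - 1
    crit = stats.f.ppf(1 - alpha / 2, dfn, dfd)          # alpha divided by 2 for a two-tail test
    p_two_sided = min(1.0, 2 * stats.f.sf(f, dfn, dfd))
    return f, crit, p_two_sided

x = [12.1, 10.8, 13.5, 11.9, 12.7, 10.2]        # hypothetical samples from normal populations
y = [11.0, 11.4, 10.9, 11.2, 11.1, 11.3, 10.8]
print(variance_ratio_test(x, y))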
H0: σ1² = σ2²
Ha: σ1² ≠ σ2²
Test statistic: F = s1² / s2², the ratio of the two sample variances.
The null hypothesis is rejected if
F > F(α/2; N1 − 1, N2 − 1) or F < F(1 − α/2; N1 − 1, N2 − 1),
where F(α/2; N1 − 1, N2 − 1) is the critical value of the F distribution with N1 − 1 and N2 − 1 degrees
of freedom and a significance level of α.
In the above formulas for the critical regions, the Handbook follows the convention
that is the upper critical value from the F distribution and is the lower critical value
from the F distribution. Note that this is the opposite of the designation used by some
texts and software programs. In particular, Dataplot uses the opposite convention.
Sample Output
Dataplot generated the following output for an F-test from the JAHANMI2.DAT data set:
F TEST
NULL HYPOTHESIS UNDER TEST--SIGMA1 = SIGMA2
ALTERNATIVE HYPOTHESIS UNDER TEST--SIGMA1 NOT EQUAL SIGMA2
SAMPLE 1:
NUMBER OF OBSERVATIONS = 240
MEAN = 688.9987
STANDARD DEVIATION = 65.54909
SAMPLE 2:
NUMBER OF OBSERVATIONS = 240
MEAN = 611.1559
STANDARD DEVIATION = 61.85425
TEST:
STANDARD DEV. (NUMERATOR) = 65.54909
STANDARD DEV. (DENOMINATOR) = 61.85425
F TEST STATISTIC VALUE = 1.123037
DEG. OF FREEDOM (NUMER.) = 239.0000
DEG. OF FREEDOM (DENOM.) = 239.0000
F TEST STATISTIC CDF VALUE = 0.814808
1. The first section prints the sample statistics for sample one used in the computation of the F-
test.
2. The second section prints the sample statistics for sample two used in the computation of the F-
test.
3. The third section prints the numerator and denominator standard deviations, the F-test statistic
value, the degrees of freedom, and the cumulative distribution function (cdf) value of the F-test
statistic. The F-test statistic cdf value is an alternative way of expressing the critical value. This
cdf value is compared to the acceptance interval printed in section four. The acceptance interval
for the upper one-tailed test reported here is (0, 1 − α).
4. The fourth section prints the conclusions for a 95% test since this is the most common case.
Results are printed for an upper one-tailed test. The acceptance interval column is stated in
terms of the cdf value printed in section three. The last column specifies whether the null
hypothesis is accepted or rejected. For a different significance level, the appropriate conclusion
can be drawn from the F-test statistic cdf value printed in section four. For example, for a
significance level of 0.10, the corresponding acceptance interval becomes (0.000, 0.9000).
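The cdf value reported by Dataplot can be reproduced from the printed summary statistics. The
following Python/scipy sketch (an independent check, not Dataplot itself) recomputes the F statistic
and its cdf value and compares it with the acceptance interval used above.

from scipy import stats

s1, s2, n1, n2 = 65.54909, 61.85425, 240, 240   # values from the output above

F = (s1 / s2) ** 2                      # about 1.123
cdf = stats.f.cdf(F, n1 - 1, n2 - 1)    # about 0.815

alpha = 0.05
lower, upper = 0.0, 1.0 - alpha         # acceptance interval (0, 1 - alpha)
print(F, cdf, lower <= cdf <= upper)    # cdf falls inside the interval: do not reject H0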
Output from other statistical software may look somewhat different from the above output.
Questions
The F-test can be used to answer the following questions:
1. Do two samples come from populations with equal standard deviations?
2. Does a new process, treatment, or test reduce the variability of the current process?
Description
Performs an F-test to compare the known standard deviations of two independent samples. This test
is not performed on data in the spreadsheet, but on statistics you enter in a dialog box.
Required input
Comparison of two standard deviations is performed by means of the F-test, in which the ratio of
the two variances is calculated. If the two variances are not significantly different, their ratio
will be close to 1. If you want to compare two known variances, first calculate the standard
deviations by taking the square root, and then compare the two standard deviations.
In the dialog box, enter the two standard deviations that you want to compare, and the
corresponding number of cases. Next, click the Test button (or press the Enter key) to perform the
F-test or variance ratio test; the standard deviations are squared to obtain the corresponding
variances. When the calculated P value is less than 0.05 (P < 0.05), the conclusion is that the two
standard deviations are statistically significantly different.
In the example, the standard deviation was 25.6 and the sample size was 60 for the first sample;
for the second sample the standard deviation was 23.2 with a sample size of 80. The resulting
F-statistic was 1.2176 and the associated P-value was 0.412. Since P was not less than 0.05, you
can conclude that there is no significant difference between the two standard deviations.
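The numbers in this example are easy to reproduce from the summary statistics alone. The short
Python/scipy sketch below (not part of the original documentation) computes the variance ratio and
a two-sided P value; the result should be close to the F of 1.2176 and P of 0.412 quoted above.

from scipy import stats

sd1, n1 = 25.6, 60
sd2, n2 = 23.2, 80

F = sd1 ** 2 / sd2 ** 2                                    # about 1.2176
p_two_sided = min(1.0, 2 * stats.f.sf(F, n1 - 1, n2 - 1))
print(F, p_two_sided)                                      # about 0.41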
Note: For F-Test (ANOVA) (as well as for F-Test (MCR)), you can choose
whether you want to perform power analyses for global (i.e., omnibus) tests
or for special tests. Global test is the default option. This test refers to the H0
that all means in the design are equal (ANOVA) or that all regression
coefficients (next to the additive constant) are zero (MCR).
Random effects ANOVAs and mixed effects ANOVAs are not considered. We
may add them at a later time, however. A discussion of how to do power
analyses for repeated measures ANOVAs and MANOVAs can be found in the
Other F-Tests section.
For the ANOVA designs, we will use the effect size index f (Cohen, 1977).
The relation of f to the noncentrality parameter lambda is given by
lambda = f2 * N.
The null and alternative hypotheses of the global test are

$$H_0: \sum_{i=1}^{k} (m_i - m)^2 = 0, \qquad H_1: \sum_{i=1}^{k} (m_i - m)^2 = c, \quad c > 0,$$
where k is the number of conditions, mi is the mean in condition i, and m is
the grand mean.
You can compute the effect size index from the group means and the standard deviation s, which is
assumed to be constant across groups, by clicking on "Calc 'f'". We will spare
you the formula behind that. (Just that much: You can save a lot of time when
you use the "Calc 'f'" option.)
Example
We compare 10 groups, and we have reason to expect a "medium" effect size
(f = .25).
Lambda: 24.1237
Thus, we need 39 subjects in each of the 10 groups. What if we had only 200
subjects available? Assuming that both alpha and beta are equally serious
(i.e., the ratio q := beta/alpha = 1) which probably is the default in basic
research, we can compute the following compromise power analysis:
Beta/alpha ratio: 1
Critical F: 1.4762
Lambda: 12.50000
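The relation lambda = f2 * N makes it straightforward to sketch this kind of power calculation with
any package that provides the noncentral F distribution. The Python/scipy sketch below is not
G*Power, and its results may differ slightly from the values quoted above (for instance, the a
priori solution of roughly 38.6 subjects per group, judging from the lambda of 24.1237, was rounded
up to 39, so the lambda printed here is a little larger); it also uses the ordinary alpha = .05
critical value rather than the compromise critical F.

from scipy import stats

def anova_power(k, n_per_group, f_effect, alpha=0.05):
    # post-hoc power of the global one-way ANOVA F-test via the noncentral F distribution
    N = k * n_per_group
    df1, df2 = k - 1, N - k
    lam = f_effect ** 2 * N                       # noncentrality parameter lambda
    f_crit = stats.f.ppf(1 - alpha, df1, df2)     # critical F at the chosen alpha
    power = stats.ncf.sf(f_crit, df1, df2, lam)   # P(noncentral F > critical F)
    return lam, f_crit, power

print(anova_power(k=10, n_per_group=39, f_effect=0.25))  # the a priori example above
print(anova_power(k=10, n_per_group=20, f_effect=0.25))  # 200 subjects in total, lambda = 12.5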
For main effects, the H0, the interpretation of the effect size index f, and the
procedure are basically the same as for single-factor designs. The major
difference is that the numerator df (df = degrees of freedom) are reduced
relative to a single-factor design because other factors have to be taken into
account.
Thus, the only new part is that you need to specify, as the "Groups", all cells
of your multi-factor design, and as numerator df (the new item for this type
of power analysis) you enter i - 1, where i represents the levels of the specific
factor to be tested.
Note that there may be considerable differences between the power analysis
values as determined by G*Power and those determined according to the
"approximations" suggested by Cohen (1977, p. 365). G*Power is correct,
while Cohen's approximations systematically underestimate the power.
Lambda: 16.8750
Example 2
Assume that your H0 states that there is no interaction between A and B.
How do you perform a power analysis for this case?
To extend this further, assume that you have a 3 x 4 x 6 design with factors
A, B, and C. You test the main and interaction effects of this design using the
following values (assuming alpha = .05, effect size f = .25, and a total sample
size of 288):
Example 3
So far, we have limited the discussion to post-hoc power analyses. However,
in planning a multi-factor design, we want to know how many participants
we need to recruit for our experiment. How do we proceed in that case?
Step 1:
We compute a priori power analyses for the statistical tests of the effects of
all factors and interactions that are interesting from a theoretical point of
view. We ignore all other factors and interactions.
Step 2:
Case 1:
One factor or interaction (henceforth our critical factor or interaction) is more
important for our research question than all other factors or interactions. Two
alternatives are possible:
1. Our critical factor or interaction is the one associated with the largest
sample size as determined in Step 1. We use that sample size. As a rule, we
will be on the safe side with all other relevant factors and interactions.
2. Our critical factor or interaction is not the one associated with the
largest sample size as determined in Step 1. We need to do some more
work:
We take the sample size as suggested for the critical factor or interaction
and perform post-hoc power analyses for all other factors or interactions
that are theoretically relevant. If we can live with the error probabilities
associated with the statistical tests of the effects of these factors, then we
are done.
If we are not happy with the error probabilities, we try to increase the
sample size up to the level at which we find both the error probabilities and
the resource demands acceptable.
Case 2:
All factors and interactions are equally important. We use the largest sample
size as determined in Step 1. As a rule, we will be on the safe side with all
other relevant factors and interactions. (Note that Case 2 and Case 1.1 lead to
the same result.)
Let us return to our 3 x 5 design in which Factor A has 3 levels and Factor B
has 5 levels. For simplicity, we assume that we want to detect effects of size f
= .40 for the two main effects and the interaction given alpha = beta = .05.
The relevant a priori power analyses suggest the following sample sizes:
* Note that the total sample size values produced by G*Power are somewhat
smaller. However, we use the next largest number that can be divided by 15
because our design has 15 cells and we wish to assure that the n's in all cells are
equal.
In a Case 2 situation, we would need a total sample size of 165. Given that
our assumptions about alpha and the effect size remained unchanged, a total
sample size of 165 would imply power values > .99 for tests of the effects of
Factors A and B. This result is a dream come true!
Given that our assumptions about alpha and the effect size remained
unchanged, a total sample size of 135 would imply power values of .9890
and .9195 for tests of the effects of Factors A and the A x B interaction,
respectively. This result is certainly acceptable and we may decide to use
135 as the total sample size.
With planned comparisons, the H0 is that the contrasts among the means do
not explain, in the dependent variable, any variance which has not already
been accounted for by other sources of the effect. The effect size f is defined
as
$$f = \sqrt{\frac{R_p^2}{1 - R_p^2}},$$
where R2p is the partial multiple correlation between the dependent variable
and the variable(s) coding the contrast among the means. In G*Power, click
"Calc F" after selecting "F-Test (ANOVA), Special" to calculate f from the
partial multiple correlation (referred to as partial eta-square in G*Power).
For a power analysis, it does not matter whether the contrasts are orthogonal
or not. However, note that f does not only depend on the population means
but also on the correlations among the contrast variables.
Example
Assume you have a Factor A with 4 levels. We want to determine whether the
effect of A on our dependent variable Y is linear, but not quadratic or cubic.
You can code A into 4-1=3 orthogonal contrast variables as follows.
x1, linear -3 -1 1 3
x2, quadratic 1 -1 -1 1
x3, cubic -1 3 -3 1
Assume that your H1 specifies that R2p = .20 for the linear contrast (x1).
Thus, f = .1667.
Beta/alpha ratio: 1
Lambda: 4.8975
In the analysis of covariance, the dependent variable Y is adjusted for one or more covariates Xi,
each with a regression weight bi that is assumed to be constant across groups. In other words,
covariate Xi differs in each of the populations we look at, but its relation to Y and, hence, its
regression weight bi is the same in all of those populations.
denominator df = N - groups.
If the correlation between Y and the covariates is substantial, then the power
of your statistical test is increased. This is so because the within-population
standard deviation sigmaY' in the denominator of the F ratio is smaller than
sigmaY.
$$\sigma_{Y'} = \sigma_Y \sqrt{1 - r^2}$$
Example
Assume a 2 x 3 design. A covariate X has been partialled out of a dependent
variable Y'. We want to detect 'large' effects (f = .40) according to Cohen's
effect size conventions for Factor B which has 3 levels. We had 60 subjects,
and we decide that alpha = .05. What is the power of the F-test in this
situation?
Lambda: 9.6000
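Following the same recipe as in the earlier sketch, the power for this example can be approximated
with the noncentral F distribution. This is a rough Python/scipy check, not G*Power output, and it
applies the denominator-df rule quoted above (N minus the number of cells) without subtracting a
degree of freedom for the covariate, so G*Power's exact answer may differ slightly.

from scipy import stats

levels_B, cells, f_effect, N, alpha = 3, 6, 0.40, 60, 0.05

df1 = levels_B - 1          # numerator df for the Factor B main effect
df2 = N - cells             # denominator df = N - groups, as stated above
lam = f_effect ** 2 * N     # noncentrality parameter, 9.6

f_crit = stats.f.ppf(1 - alpha, df1, df2)
power = stats.ncf.sf(f_crit, df1, df2, lam)
print(lam, f_crit, power)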
Learn how to interpret the results of statistical tests and about our programs GraphPad InStat and
GraphPad Prism.
All tests are described in this book and are performed by InStat, except for tests marked with
asterisks. Tests labeled with a single asterisk are briefly mentioned in this book, and tests labeled
with two asterisks are not mentioned at all.
Type of Data
Goal | Measurement (from Gaussian population) | Rank, score, or measurement (from non-Gaussian population) | Binomial (two possible outcomes) | Survival time
Describe one group | Mean, SD | Median, interquartile range | Proportion | Kaplan-Meier survival curve
Compare one group to a hypothetical value | One-sample t test | Wilcoxon test | Chi-square or binomial test** |
Compare two unpaired groups | Unpaired t test | Mann-Whitney test | Fisher's test (chi-square for large samples) | Log-rank test or Mantel-Haenszel*
Compare two paired groups | Paired t test | Wilcoxon test | McNemar's test | Conditional proportional hazards regression*
Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test | Chi-square test | Cox proportional hazard regression**
Compare three or more matched groups | Repeated-measures ANOVA | Friedman test | Cochrane Q** | Conditional proportional hazards regression**
Quantify association between two variables | Pearson correlation | Spearman correlation | Contingency coefficients** |
Predict value from another measured variable | Simple linear regression or nonlinear regression | Nonparametric regression** | Simple logistic regression* | Cox proportional hazard regression*
Predict value from several measured or binomial variables | Multiple linear regression* or multiple nonlinear regression** | | Multiple logistic regression* | Cox proportional hazard regression*
Choosing the right test to compare measurements is a bit tricky, as you must choose between two
families of tests: parametric and nonparametric. Many statistical tests are based upon the
assumption that the data are sampled from a Gaussian distribution. These tests are referred to as
parametric tests. Commonly used parametric tests are listed in the first column of the table and
include the t test and analysis of variance.
Tests that do not make assumptions about the population distribution are referred to as
nonparametric tests. You've already learned a bit about nonparametric tests in previous chapters.
All commonly used nonparametric tests rank the outcome variable from low to high and then
analyze the ranks. These tests are listed in the second column of the table and include the
Wilcoxon, Mann-Whitney, and Kruskal-Wallis tests. These tests are also called distribution-
free tests.
Choosing between parametric and nonparametric tests is sometimes easy. You should definitely
choose a parametric test if you are sure that your data are sampled from a population that follows
a Gaussian distribution (at least approximately). You should definitely select a nonparametric test
in three situations:
• The outcome is a rank or a score and the population is clearly not Gaussian. Examples
include class ranking of students, the Apgar score for the health of newborn babies
(measured on a scale of 0 to 10, where all scores are integers), the visual analogue
score for pain (measured on a continuous scale where 0 is no pain and 10 is unbearable
pain), and the star scale commonly used by movie and restaurant critics (* is OK, *****
is fantastic).
• Some values are "off the scale," that is, too high or too low to measure. Even if the
population is Gaussian, it is impossible to analyze such data with a parametric test since
you don't know all of the values. Using a nonparametric test with these data is simple.
Assign values too low to measure an arbitrary very low value and assign values too high
to measure an arbitrary very high value. Then perform a nonparametric test. Since the
nonparametric test only knows about the relative ranks of the values, it won't matter that
you didn't know all the values exactly.
• The data are measurements, and you are sure that the population is not distributed in a
Gaussian manner. If the data are not sampled from a Gaussian distribution, consider
whether you can transform the values to make the distribution Gaussian. For
example, you might take the logarithm or reciprocal of all values. There are often
biological or chemical reasons (as well as statistical ones) for performing a particular
transform.
It is not always easy to decide whether a sample comes from a Gaussian population. Consider
these points:
• If you collect many data points (over a hundred or so), you can look at the distribution
of data and it will be fairly obvious whether the distribution is approximately bell shaped.
A formal statistical test (Kolmogorov-Smirnov test, not explained in this book) can be
used to test whether the distribution of the data differs significantly from a Gaussian
distribution. With few data points, it is difficult to tell whether the data are Gaussian by
inspection, and the formal test has little power to discriminate between Gaussian and non-
Gaussian distributions.
• You should look at previous data as well. Remember, what matters is the distribution of
the overall population, not the distribution of your sample. In deciding whether a
population is Gaussian, look at all available data, not just data in the current experiment.
• Consider the source of scatter. When the scatter comes from the sum of numerous
sources (with no one source contributing most of the scatter), you expect to find a
roughly Gaussian distribution.
When in doubt, some people choose a parametric test (because they aren't sure the
Gaussian assumption is violated), and others choose a nonparametric test (because they
aren't sure the Gaussian assumption is met).
Does it matter whether you choose a parametric or nonparametric test? The answer depends on
sample size. There are four cases to think about:
• Large sample. What happens when you use a parametric test with data from a
nongaussian population? The central limit theorem (discussed in Chapter 5) ensures that
parametric tests work well with large samples even if the population is non-Gaussian. In
other words, parametric tests are robust to deviations from Gaussian distributions, so long
as the samples are large. The snag is that it is impossible to say how large is large enough,
as it depends on the nature of the particular non-Gaussian distribution. Unless the
population distribution is really weird, you are probably safe choosing a parametric test
when there are at least two dozen data points in each group.
• Large sample. What happens when you use a nonparametric test with data from a
Gaussian population? Nonparametric tests work well with large samples from Gaussian
populations. The P values tend to be a bit too large, but the discrepancy is small. In other
words, nonparametric tests are only slightly less powerful than parametric tests with large
samples.
• Small samples. What happens when you use a parametric test with data from
nongaussian populations? You can't rely on the central limit theorem, so the P value may
be inaccurate.
• Small samples. When you use a nonparametric test with data from a Gaussian
population, the P values tend to be too high. The nonparametric tests lack statistical
power with small samples.
Thus, large data sets present no problems. It is usually easy to tell if the data come from a
Gaussian population, but it doesn't really matter because the nonparametric tests are so powerful
and the parametric tests are so robust. Small data sets present a dilemma. It is difficult to tell if
the data come from a Gaussian population, but it matters a lot. The nonparametric tests are not
powerful and the parametric tests are not robust.
With many tests, you must choose whether you wish to calculate a one- or two-sided P value
(same as one- or two-tailed P value). The difference between one- and two-sided P values was
discussed in Chapter 10. Let's review the difference in the context of a t test. The P value is
calculated for the null hypothesis that the two population means are equal, and any discrepancy
between the two sample means is due to chance. If this null hypothesis is true, the one-sided P
value is the probability that two sample means would differ as much as was observed (or further)
in the direction specified by the hypothesis just by chance, even though the means of the overall
populations are actually equal. The two-sided P value also includes the probability that the
sample means would differ that much in the opposite direction (i.e., the other group has the
larger mean). The two-sided P value is twice the one-sided P value.
A one-sided P value is appropriate when you can state with certainty (and before collecting any
data) that there either will be no difference between the means or that the difference will go in a
direction you can specify in advance (i.e., you have specified which group will have the larger
mean). If you cannot specify the direction of any difference before collecting data, then a two-
sided P value is more appropriate. If in doubt, select a two-sided P value.
If you select a one-sided test, you should do so before collecting any data and you need to state
the direction of your experimental hypothesis. If the data go the other way, you must be willing
to attribute that difference (or association or correlation) to chance, no matter how striking the
data. If you would be intrigued, even a little, by data that goes in the "wrong" direction, then you
should use a two-sided P value. For reasons discussed in Chapter 10, I recommend that you
always calculate a two-sided P value.
Use an unpaired test to compare groups when the individual values are not paired or matched
with one another. Select a paired or repeated-measures test when values represent repeated
measurements on one subject (before and after an intervention) or measurements on matched
subjects. The paired or repeated-measures tests are also appropriate for repeated laboratory
experiments run at different times, each with its own control.
You should select a paired test when values in one group are more closely correlated with a
specific value in the other group than with random values in the other group. It is only
appropriate to select a paired test when the subjects were matched or paired before the data were
collected. You cannot base the pairing on the data you are analyzing.
When analyzing contingency tables with two rows and two columns, you can use either Fisher's
exact test or the chi-square test. The Fisher's test is the best choice as it always gives the exact P
value. The chi-square test is simpler to calculate but yields only an approximate P value. If a
computer is doing the calculations, you should choose Fisher's test unless you prefer the
familiarity of the chi-square test. You should definitely avoid the chi-square test when the
numbers in the contingency table are very small (any number less than about six). When the
numbers are larger, the P values reported by the chi-square and Fisher's tests will be very similar.
The chi-square test calculates approximate P values, and the Yates' continuity correction is
designed to make the approximation better. Without the Yates' correction, the P values are too
low. However, the correction goes too far, and the resulting P value is too high. Statisticians give
different recommendations regarding Yates' correction. With large sample sizes, the Yates'
correction makes little difference. If you select Fisher's test, the P value is exact and Yates'
correction is not needed and is not available.
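To see these differences concretely, here is a small Python/scipy sketch (not part of the original
text) that analyzes one hypothetical 2 x 2 table with Fisher's exact test, the ordinary chi-square
test, and the Yates-corrected chi-square test.

from scipy import stats

table = [[8, 2],    # hypothetical 2 x 2 table with small counts
         [3, 9]]

odds_ratio, p_fisher = stats.fisher_exact(table)                         # exact P value
chi2_y, p_yates, _, _ = stats.chi2_contingency(table)                    # Yates' correction (scipy default for 2 x 2)
chi2_u, p_plain, _, _ = stats.chi2_contingency(table, correction=False)  # uncorrected chi-square

print(p_fisher, p_yates, p_plain)
# with counts this small, the uncorrected chi-square P value is noticeably smaller than
# the exact Fisher P value, while the Yates-corrected value is close to (and slightly
# above) the exact value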
REGRESSION OR CORRELATION?
Linear regression and correlation are similar and easily confused. In some situations it makes
sense to perform both calculations. Calculate linear correlation if you measured both X and Y in
each subject and wish to quantify how well they are associated. Select the Pearson (parametric)
correlation coefficient if you can assume that both X and Y are sampled from Gaussian
populations. Otherwise choose the Spearman nonparametric correlation coefficient. Don't
calculate the correlation coefficient (or its confidence interval) if you manipulated the X variable.
Calculate linear regressions only if one of the variables (X) is likely to precede or cause the other
variable (Y). Definitely choose linear regression if you manipulated the X variable. It makes a
big difference which variable is called X and which is called Y, as linear regression calculations
are not symmetrical with respect to X and Y. If you swap the two variables, you will obtain a
different regression line. In contrast, linear correlation calculations are symmetrical with respect
to X and Y. If you swap the labels X and Y, you will still get the same correlation coefficient.
Chi-Square Test
The chi-square test (χ2) is the most commonly used method for comparing frequencies or
proportions. It is a statistical test used to determine if observed data deviate from those expected
under a particular hypothesis. The chi-square test is also referred to as a test of the "goodness
of fit" between observed and expected data. Typically, the hypothesis tested is whether or not two samples are
different enough in a particular characteristic to be considered members of different populations.
Chi-square analysis belongs to the family of univariate analysis, i.e., those tests that evaluate the
possible effect of one variable (often called the independent variable) upon an outcome (often
called the dependent variable).
The chi-square analysis is used to test the null hypothesis (H0), which is the hypothesis that states
there is no significant difference between expected and observed data. Investigators either accept
or reject H0, after comparing the value of chi-square to a probability distribution. Chi-square
values with low probability lead to the rejection of H0 and it is assumed that a factor other than
chance creates a large deviation between expected and observed results. As with all non-
parametric tests (that do not require normal distribution curves), chi-square tests only evaluate a
single variable, thus they do not take into account the interaction among more than one variable
upon the outcome.
A chi-square analysis is best illustrated using an example in which data from a population is
categorized with respect to two qualitative variables. Table 1 shows a sample of patients
categorized with respect to two qualitative variables, namely, congenital heart defect (CHD;
present or absent) and karyotype (trisomy 21, also called Down syndrome, or trisomy 13, also
called Patau syndrome). The classification table used in a chi-square analysis is called a
contingency table and this is its simplest form (2 x 2). The data in a contingency table are often
defined as row (r) and column (c) variables.
In general, a chi-square analysis evaluates whether or not variables within a contingency table
are independent, or that there is no association between them. In this example, independence
would mean that the proportion of individuals affected by CHD is not dependent on karyotype;
thus, the proportion of patients with CHD would be similar for both Down and Patau syndrome
patients. Dependence, or association, would mean that the proportion of individuals affected by
CHD is dependent on karyotype, so that CHD would be more commonly found in patients with
one of the two karyotypes examined.
Table 1 shows a 2 x 2 contingency table for a chi-square test—CHD (congenital heart defects)
found in patients with Down and Patau syndromes
Figure 1. Chi-square distributions for 1, 3, and 5 degrees of freedom. The shaded region in
each of the distributions indicates the upper 5% of the distribution.
Chi-square is the sum of the squared difference between observed and expected data, divided by
the expected data in all possible categories:
Χ2 = (O11 - E11)2/E11 + (O12 - E12)2/E12 + (O21 - E21)2/E21 + (O22 - E22)2/E22, where O11 represents the
observed number of subjects in column 1, row 1, and so on. A summary is shown in Table 2.
The observed frequency is simply the actual number of observations in a cell. In other words, O11
for CHD in the Down-syndrome-affected individuals is 24. Likewise, the observed frequency of
CHD in the Patau-syndrome-affected patients is 20 (O12).

TABLE 1.
                                      Karyotype
                           Down syndrome   Patau syndrome   Total
Congenital     CHD present       24              20           44
Heart Defects  CHD absent        36               5           41
               Total             60              25           85

TABLE 2.
                                      Karyotype
                           Down syndrome   Patau syndrome   Total
Congenital     CHD present      O11             O12           r1
Heart Defects  CHD absent       O21             O22           r2
               Total            c1              c2            N

TABLE 3.
Observed (o)   Expected (e)    o-e     (o-e)2    (o-e)2/e
     24            31.1        -7.1     50.41      1.62
     20            12.9         7.1     50.41      3.91
     36            28.9         7.1     50.41      1.74
      5            12.1        -7.1     50.41      4.17
     85            85.0                          Χ2 = 11.44

Because the null hypothesis assumes that the two variables are independent of each other, expected
frequencies are calculated using the multiplication rule of probability.
The multiplication rule says that the probability of the occurrence of two independent events X
and Y is the product of the individual probabilities of X and Y. In this case, the expected
probability that a patient has both Down syndrome and CHD is the product of the probability that
a patient has Down syndrome (60/85 = 0.706) and the probability that a patient has CHD (44/85
= 0.518), or 0.706 x 0.518 = 0.366. The expected frequency of patients with both Down
syndrome and CHD is the product of the expected probability and the total population studied, or
0.366 x 85 = 31.1.
Table 3 presents observed and expected frequencies and Χ2 for data in Table 1.
Before the chi-square value can be evaluated, the degrees of freedom for the data set must be
determined. Degrees of freedom are the number of independent variables in the data set. In a
contingency table, the degrees of freedom are calculated as the product of the number of rows
minus 1 and the number of columns minus 1, or (r-1)(c-1). In this example, (2-1)(2-1) = 1; thus,
there is just one degree of freedom.
Once the degrees of freedom are determined, the value of Χ2 is compared with the appropriate
chi-square distribution, which can be found in tables in most statistical analyses texts. A relative
standard serves as the basis for accepting or rejecting the hypothesis. In biological research, the
relative standard is usually p = 0.05, where p is the probability that the deviation of the observed
frequencies from the expected frequencies is due to chance alone. If p is less than or equal to
0.05, then the null hypothesis is rejected and the data are not independent of each other. For one
degree of freedom, the critical value associated with p = 0.05 for Χ2 is 3.84. Chi-square values
higher than this critical value are associated with a statistically low probability that H0 is true.
Because the chi-square value is 11.44, much greater than 3.84, the hypothesis that the proportion
of trisomy-13-affected patients with CHD does not differ significantly from the corresponding
proportion for trisomy-21-affected patients is rejected. Instead, it is very likely that there is a
dependence of CHD on karyotype.
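The whole calculation can be reproduced in a couple of lines. The Python/scipy sketch below (an
independent check, not part of the original article) applies the chi-square test of independence to
Table 1 without a continuity correction, matching the hand computation above.

import numpy as np
from scipy import stats

observed = np.array([[24, 20],    # CHD present: Down, Patau
                     [36, 5]])    # CHD absent:  Down, Patau

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(chi2, dof, p)   # chi-square about 11.44 with 1 degree of freedom, p well below 0.05
print(expected)       # expected counts 31.1, 12.9, 28.9, 12.1, as in Table 3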
Figure 1 shows chi-square distributions for 1, 3, and 5 degrees of freedom. The shaded region in
each of the distributions indicates the upper 5% of the distribution. The critical value associated
with p = 0.05 is indicated. Notice that as the degrees of freedom increases, the chi-square value
required to reject the null hypothesis increases.
Because a chi-square test is a univariate test, it does not consider relationships among multiple
variables at the same time. Therefore, dependencies detected by chi-square analyses may be
unrealistic or non-causal. There may be other unseen factors that make the variables appear to be
associated. However, if properly used, the test is a very useful tool for the evaluation of
associations and can be used as a preliminary analysis of more complex statistical evaluations.
Introduction
This page shows how to perform a number of statistical tests using Stata. Each section gives a
brief description of the aim of the statistical test, when it is used, an example showing the Stata
commands and Stata output with a brief interpretation of the output. You can see the page
Choosing the Correct Statistical Test for a table that shows an overview of when each test is
appropriate to use. In deciding which test is appropriate to use, it is important to consider the
type of variables that you have (i.e., whether your variables are categorical, ordinal or interval
and whether they are normally distributed), see What is the difference between categorical,
ordinal and interval variables? for more information on this.
Most of the examples in this page will use a data file called hsb2, high school and beyond. This
data file contains 200 observations from a sample of high school students with demographic
information about the students, such as their gender (female), socio-economic status (ses) and
ethnic background (race). It also contains a number of scores on standardized tests, including
tests of reading (read), writing (write), mathematics (math) and social studies (socst). You can
get the hsb2 data file from within Stata by typing:
use http://www.ats.ucla.edu/stat/stata/notes/hsb2
A one sample t-test allows us to test whether a sample mean (of a normally distributed interval
variable) significantly differs from a hypothesized value. For example, using the hsb2 data file,
say we wish to test whether the average writing score (write) differs significantly from 50. We
can do this as shown below.
ttest write=50
One-sample t test
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
write | 200 52.775 .6702372 9.478586 51.45332 54.09668
------------------------------------------------------------------------------
Degrees of freedom: 199
Ho: mean(write) = 50
The mean of the variable write for this particular sample of students is 52.775, which is
statistically significantly different from the test value of 50. We would conclude that this group
of students has a significantly higher mean on the writing test than 50.
A one sample median test allows us to test whether a sample median differs significantly from a
hypothesized value. We will use the same variable, write, as we did in the one sample t-test
example above, but we do not need to assume that it is interval and normally distributed (we only
need to assume that write is an ordinal variable). We will test whether the median writing score
(write) differs significantly from 50.
signrank write=50
Wilcoxon signed-rank test
Ho: write = 50
z = 4.130
Prob > |z| = 0.0000
The results indicate that the median of the variable write for this group is statistically
significantly different from 50.
Binomial test
A one sample binomial test allows us to test whether the proportion of successes on a two-level
categorical dependent variable significantly differs from a hypothesized value. For example,
using the hsb2 data file, say we wish to test whether the proportion of females (female) differs
significantly from 50%, i.e., from .5. We can do this as shown below.
bitest female=.5
Variable | N Observed k Expected k Assumed p Observed p
-------------+------------------------------------------------------------
female | 200 109 100 0.50000 0.54500
The results indicate that there is no statistically significant difference (p = .2292). In other
words, the proportion of females does not significantly differ from the hypothesized value of
50%.
A chi-square goodness of fit test allows us to test whether the observed proportions for a
categorical variable differ from hypothesized proportions. For example, let's suppose that we
believe that the general population consists of 10% Hispanic, 10% Asian, 10% African American
and 70% White folks. We want to test whether the observed proportions from our sample differ
significantly from these hypothesized proportions. To conduct the chi-square goodness of fit test,
you need to first download the csgof program that performs this test. You can download csgof
from within Stata by typing findit csgof (see How can I use the findit command to search for
programs and get additional help? for more information about using findit).
These results show that racial composition in our sample does not differ significantly from the
hypothesized values that we supplied (chi-square with three degrees of freedom = 5.03, p = .1697).
An independent samples t-test is used when you want to compare the means of a normally
distributed interval dependent variable for two independent groups. For example, using the hsb2
data file, say we wish to test whether the mean for write is the same for males and females.
------------------------------------------------------------------------------
Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
male | 91 50.12088 1.080274 10.30516 47.97473 52.26703
female | 109 54.99083 .7790686 8.133715 53.44658 56.53507
---------+--------------------------------------------------------------------
combined | 200 52.775 .6702372 9.478586 51.45332 54.09668
---------+--------------------------------------------------------------------
diff | -4.869947 1.304191 -7.441835 -2.298059
------------------------------------------------------------------------------
Degrees of freedom: 198
The results indicate that there is a statistically significant difference between the mean writing
score for males and females (t = -3.7341, p = .0002). In other words, females have a statistically
significantly higher mean score on writing (54.99) than males (50.12).
Wilcoxon-Mann-Whitney test
The results suggest that there is a statistically significant difference between the underlying
distributions of the write scores of males and the write scores of females (z = -3.329, p =
0.0009). You can determine which group has the higher rank by looking at the how the actual
rank sums compare to the expected rank sums under the null hypothesis. The sum of the female
ranks was higher while the sum of the male ranks was lower. Thus the female group had higher
rank.
Chi-square test
A chi-square test is used when you want to see if there is a relationship between two categorical
variables. In Stata, the chi2 option is used with the tabulate command to obtain the test statistic
and its associated p-value. Using the hsb2 data file, let's see if there is a relationship between the
type of school attended (schtyp) and students' gender (female). Remember that the chi-square
test assumes the expected value of each cell is five or higher. This assumption is easily met in
the examples below. However, if this assumption is not met in your data, please see the section
on Fisher's exact test below.
type of | female
school | male female | Total
-----------+----------------------+----------
public | 77 91 | 168
private | 14 18 | 32
-----------+----------------------+----------
Total | 91 109 | 200
These results indicate that there is no statistically significant relationship between the type of
school attended and gender (chi-square with one degree of freedom = 0.0470, p = 0.828).
Let's look at another example, this time looking at the relationship between gender (female) and
socio-economic status (ses). The point of this example is that one (or both) variables may have
more than two levels, and that the variables do not have to have the same number of levels. In
this example, female has two levels (male and female) and ses has three levels (low, medium and
high).
| ses
female | low middle high | Total
-----------+---------------------------------+----------
male | 15 47 29 | 91
female | 32 48 29 | 109
-----------+---------------------------------+----------
Total | 47 95 58 | 200
Again we find that there is no statistically significant relationship between the variables (chi-
square with two degrees of freedom = 4.5765, p = 0.101).
type of | race
school | hispanic asian african-a white | Total
-----------+--------------------------------------------+----------
public | 22 10 18 118 | 168
private | 2 1 2 27 | 32
-----------+--------------------------------------------+----------
Total | 24 11 20 145 | 200
These results suggest that there is not a statistically significant relationship between race and
type of school (p = 0.597). Note that the Fisher's exact test does not have a "test statistic", but
computes the p-value directly.
One-way ANOVA
A one-way analysis of variance (ANOVA) is used when you have a categorical independent
variable (with two or more categories) and a normally distributed interval dependent variable and
you wish to test for differences in the means of the dependent variable broken down by the levels
of the independent variable. For example, using the hsb2 data file, say we wish to test whether
the mean of write differs between the three program types (prog). The command for this test
would be:
oneway write prog, tabulate
From this we can see that the students in the academic program have the highest mean writing
score, while students in the vocational program have the lowest.
The Kruskal Wallis test is used when you have one independent variable with two or more levels
and an ordinal dependent variable. In other words, it is the non-parametric version of ANOVA
and a generalized form of the Mann-Whitney test, since it permits two or more groups. We
will use the same data file as the one way ANOVA example above (the hsb2 data file) and the
same variables as in the example above, but we will not assume that write is a normally
distributed interval variable.
If some of the scores receive tied ranks, then a correction factor is used, yielding a slightly
different value of chi-squared. With or without ties, the results indicate that there is a statistically
significant difference among the three type of programs.
Paired t-test
A paired (samples) t-test is used when you have two related observations (i.e. two observations
per subject) and you want to see if the means on these two normally distributed interval variables
differ from one another. For example, using the hsb2 data file we will test whether the mean of
read is equal to the mean of write.
------------------------------------------------------------------------------
Variable | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]
---------+--------------------------------------------------------------------
read | 200 52.23 .7249921 10.25294 50.80035 53.65965
write | 200 52.775 .6702372 9.478586 51.45332 54.09668
---------+--------------------------------------------------------------------
diff | 200 -.545 .6283822 8.886666 -1.784142 .6941424
------------------------------------------------------------------------------
These results indicate that the mean of read is not statistically significantly different from the
mean of write (t = -0.8673, p = 0.3868).
The Wilcoxon signed rank sum test is the non-parametric version of a paired samples t-test. You
use the Wilcoxon signed rank sum test when you do not wish to assume that the difference
between the two variables is interval and normally distributed (but you do assume the difference
is ordinal). We will use the same example as above, but we will not assume that the difference
between read and write is interval and normally distributed.
The results suggest that there is not a statistically significant difference between read and write.
If you believe the differences between read and write were not ordinal but could merely be
classified as positive and negative, then you may want to consider a sign test in lieu of sign rank
test. Again, we will use the same variables in this example and assume that this difference is not
ordinal.
One-sided tests:
Ho: median of read - write = 0 vs.
Ha: median of read - write > 0
Pr(#positive >= 88) =
Binomial(n = 185, x >= 88, p = 0.5) = 0.7688
Two-sided test:
Ho: median of read - write = 0 vs.
Ha: median of read - write ~= 0
Pr(#positive >= 97 or #negative >= 97) =
min(1, 2*Binomial(n = 185, x >= 97, p = 0.5)) = 0.5565
This output gives both of the one-sided tests as well as the two-sided test. Assuming that we
were looking for any difference, we would use the two-sided test and conclude that no
statistically significant difference was found (p=.5565).
McNemar test
You would perform McNemar's test if you were interested in the marginal frequencies of two
binary outcomes. These binary outcomes may be the same outcome variable on matched pairs
(like a case-control study) or two outcome variables from a single group. For example, let us
consider two questions, Q1 and Q2, from a test taken by 200 students. Suppose 172 students
answered both questions correctly, 15 students answered both questions incorrectly, 7 answered
Q1 correctly and Q2 incorrectly, and 6 answered Q2 correctly and Q1 incorrectly. These counts
can be considered in a two-way contingency table. The null hypothesis is that the two questions
are answered correctly or incorrectly at the same rate (or that the contingency table is
symmetric). We can enter these counts into Stata using mcci, a command from Stata's
epidemiology tables. The outcome is labeled according to case-control study conventions.
mcci 172 6 7 15
| Controls |
Cases | Exposed Unexposed | Total
-----------------+------------------------+------------
Exposed | 172 6 | 178
Unexposed | 7 15 | 22
-----------------+------------------------+------------
Total | 179 21 | 200
You would perform a one-way repeated measures analysis of variance if you had one categorical
independent variable and a normally distributed interval dependent variable that was repeated at
least twice for each subject. This is the equivalent of the paired samples t-test, but allows for two
or more levels of the categorical variable. This tests whether the mean of the dependent variable
differs by the categorical variable. We have an example data set called rb4, which is used in
Kirk's book Experimental Design. In this data set, y is the dependent variable, a is the repeated
measure and s is the variable that indicates the subject number.
use http://www.ats.ucla.edu/stat/stata/examples/kirk/rb4
anova y a s, repeated(a)
Number of obs = 32 R-squared = 0.7318
Root MSE = 1.18523 Adj R-squared = 0.6041
Repeated variable: a
Huynh-Feldt epsilon = 0.8343
Greenhouse-Geisser epsilon = 0.6195
Box's conservative epsilon = 0.3333
You will notice that this output gives four different p-values. The "regular" (0.0001) is the p-
value that you would get if you assumed compound symmetry in the variance-covariance
matrix. Because that assumption is often not valid, the three other p-values offer various
corrections (the Huynh-Feldt, H-F, Greenhouse-Geisser, G-G and Box's conservative, Box). No
matter which p-value you use, our results indicate that we have a statistically significant effect of
a at the .05 level.
If you have a binary outcome measured repeatedly for each subject and you wish to run a logistic
regression that accounts for the effect of these multiple measures from each subject, you can
perform a repeated measures logistic regression. In Stata, this can be done using the xtgee
command and indicating binomial as the probability distribution and logit as the link function to
be used in the model. The exercise data file contains 3 pulse measurements of 30 people assigned
to 2 different diet regimens and 3 different exercise regimens. If we define a "high" pulse as
being over 100, we can then predict the probability of a high pulse using diet regiment.
First, we use xtset to define which variable defines the repetitions. In this dataset, there are three
measurements taken for each id, so we will use id as our panel variable. Then we can use xi:
before our xtgee command so that we can create indicator variables as needed within the model
statement.
------------------------------------------------------------------------------
highpulse | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Idiet_2 | .7537718 .6088196 1.24 0.216 -.4394927 1.947036
_cons | -1.252763 .4621704 -2.71 0.007 -2.1586 -.3469257
------------------------------------------------------------------------------
These results indicate that diet is not statistically significant (Z = 1.24, p = 0.216).
Factorial ANOVA
A factorial ANOVA has two or more categorical independent variables (either with or without the
interactions) and a single normally distributed interval dependent variable. For example, using
the hsb2 data file we will look at writing scores (write) as the dependent variable and gender
(female) and socio-economic status (ses) as independent variables, and we will include an
interaction of female by ses. Note that in Stata, you do not need to have the interaction term(s)
in your data set. Rather, you can have Stata create it/them temporarily by placing an asterisk
between the variables that will make up the interaction term(s).
These results indicate that the overall model is statistically significant (F = 5.67, p = 0.001). The
variables female and ses are also statistically significant (F = 16.59, p = 0.0001 and F = 6.61, p =
0.0017, respectively). However, the interaction between female and ses is not statistically
significant (F = 0.13, p = 0.8753).
Friedman test
You perform a Friedman test when you have one within-subjects independent variable with two
or more levels and a dependent variable that is not interval and normally distributed (but at least
ordinal). We will use this test to determine if there is a difference in the reading, writing and
math scores. The null hypothesis in this test is that the distribution of the ranks of each type of
score (i.e., reading, writing and math) are the same. To conduct the Friedman test in Stata, you
need to first download the friedman program that performs this test. You can download
friedman from within Stata by typing findit friedman (see How can I use the findit command
to search for programs and get additional help? for more information about using findit). Also,
your data will need to be transposed such that subjects are the columns and the variables are the
rows. We will use the xpose command to arrange our data this way.
use http://www.ats.ucla.edu/stat/stata/notes/hsb2
keep read write math
xpose, clear
friedman v1-v200
Friedman = 0.6175
Kendall = 0.0015
P-value = 0.7344
Friedman's chi-square has a value of 0.6175 and a p-value of 0.7344 and is not statistically
significant. Hence, there is no evidence that the distributions of the three types of scores are
different.
A factorial logistic regression is used when you have two or more categorical independent
variables but a dichotomous dependent variable. For example, using the hsb2 data file we will
use female as our dependent variable, because it is the only dichotomous (0/1) variable in our
data set; certainly not because it is common practice to use gender as an outcome variable. We will
use type of program (prog) and school type (schtyp) as our predictor variables. Because prog is
a categorical variable (it has three levels), we need to create dummy codes for it. The use of xi:
and i.prog does this. You can use the logit command if you want to see the regression
coefficients or the logistic command if you want to see the odds ratios.
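The command itself is not reproduced below; judging from the indicator names in the output, it would be of the form:
xi: logit female i.prog*schtyp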
------------------------------------------------------------------------------
female | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iprog_2 | 2.258611 1.406529 1.61 0.108 -.4981346 5.015357
_Iprog_3 | 2.046133 1.986474 1.03 0.303 -1.847285 5.93955
schtyp | 1.660731 1.14128 1.46 0.146 -.5761361 3.897598
_IproXscht~2 | -1.934025 1.232679 -1.57 0.117 -4.35003 .4819813
_IproXscht~3 | -1.827785 1.840227 -0.99 0.321 -5.434564 1.778993
_cons | -1.712024 1.269021 -1.35 0.177 -4.19926 .7752109
------------------------------------------------------------------------------
The results indicate that the overall model is not statistically significant (LR chi2 = 3.15, p =
0.6774). Furthermore, none of the coefficients are statistically significant either. We can use the
test command to get the test of the overall effect of prog as shown below. This shows that the
overall effect of prog is not statistically significant.
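Using the indicator names xi created for prog, the command is:
test _Iprog_2 _Iprog_3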
( 1) _Iprog_2 = 0.0
( 2) _Iprog_3 = 0.0
chi2( 2) = 2.59
Prob > chi2 = 0.2732
Likewise, we can use the test command to get the test of the overall effect of the prog by schtyp
interaction, as shown below. This shows that the overall effect of this interaction is not
statistically significant.
( 1) _IproXschty_2 = 0.0
( 2) _IproXschty_3 = 0.0
chi2( 2) = 2.47
Prob > chi2 = 0.2902
If you prefer, you could use the logistic command to see the results as odds ratios, as shown
below.
------------------------------------------------------------------------------
female | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iprog_2 | 9.569789 13.46018 1.61 0.108 .6076632 150.7099
_Iprog_3 | 7.737918 15.37117 1.03 0.303 .1576647 379.764
schtyp | 5.263158 6.006736 1.46 0.146 .562066 49.28395
_IproXscht~2 | .1445652 .1782025 -1.57 0.117 .0129064 1.61928
_IproXscht~3 | .1607692 .2958519 -0.99 0.321 .0043631 5.923891
------------------------------------------------------------------------------
Correlation
A correlation is useful when you want to see the linear relationship between two (or more)
normally distributed interval variables. For example, using the hsb2 data file we can run a
correlation between two continuous variables, read and write.
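The matrix below is what a command such as correlate produces:
correlate read write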
| read write
-------------+------------------
read | 1.0000
write | 0.5968 1.0000
In the second example, we will run a correlation between a dichotomous variable, female, and a
continuous variable, write. Although it is assumed that the variables are interval and normally
distributed, we can include dummy variables when performing correlations.
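Again using correlate:
correlate female write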
| female write
-------------+------------------
female | 1.0000
write | 0.2565 1.0000
In the first example above, we see that the correlation between read and write is 0.5968. By
squaring the correlation and then multiplying by 100, you can determine what percentage of the
variability is shared. Rounding 0.5968 to 0.6 and squaring it gives .36, which multiplied by 100
is 36%. Hence read shares about 36% of its variability with write. In the output
for the second example, we can see the correlation between write and female is 0.2565.
Squaring this number yields .06579225, meaning that female shares approximately 6.5% of its
variability with write.
Simple linear regression allows us to look at the linear relationship between one normally
distributed interval predictor and one normally distributed interval outcome variable. For
example, using the hsb2 data file, say we wish to look at the relationship between writing scores
(write) and reading scores (read); in other words, predicting write from read.
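The command producing output like that shown below is regress, with write as the outcome and read as the predictor:
regress write read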
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
read | .5517051 .0527178 10.47 0.000 .4477446 .6556656
_cons | 23.95944 2.805744 8.54 0.000 18.42647 29.49242
------------------------------------------------------------------------------
We see that the relationship between write and read is positive (.5517051) and based on the t-
value (10.47) and p-value (0.000), we would conclude this relationship is statistically
significant. Hence, we would say there is a statistically significant positive linear relationship
between reading and writing.
Non-parametric correlation
A Spearman correlation is used when one or both of the variables are not assumed to be normally
distributed and interval (but are assumed to be ordinal). The values of the variables are converted
into ranks and then correlated. In our example, we will look for a relationship between read and
write. We will not assume that these variables are normal or interval.
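The command is:
spearman read write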
The results suggest that the relationship between read and write (rho = 0.6167, p = 0.000) is
statistically significant.
Logistic regression assumes that the outcome variable is binary (i.e., coded as 0 and 1). We have
only one variable in the hsb2 data file that is coded 0 and 1, and that is female. We understand
that female is a silly outcome variable (it would make more sense to use it as a predictor
variable), but we can use female as the outcome variable to illustrate how the code for this
command is structured and how to interpret the output. The first variable listed after the logistic
(or logit) command is the outcome (or dependent) variable, and all of the rest of the variables are
predictor (or independent) variables. You can use the logit command if you want to see the
regression coefficients or the logistic command if you want to see the odds ratios. In our
example, female will be the outcome variable, and read will be the predictor variable. As with
OLS regression, the predictor variables must be either dichotomous or continuous; they cannot
be categorical.
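The two sets of output below correspond to the two commands:
logistic female read
logit female read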
------------------------------------------------------------------------------
female | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
read | .9896176 .0137732 -0.75 0.453 .9629875 1.016984
------------------------------------------------------------------------------
------------------------------------------------------------------------------
female | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
read | -.0104367 .0139177 -0.75 0.453 -.0377148 .0168415
_cons | .7260875 .7419612 0.98 0.328 -.7281297 2.180305
------------------------------------------------------------------------------
The results indicate that reading score (read) is not a statistically significant predictor of gender
(i.e., being female), z = -0.75, p = 0.453. Likewise, the test of the overall model is not
statistically significant, LR chi-squared = 0.56, p = 0.4527.
Multiple regression
Multiple regression is very similar to simple regression, except that in multiple regression you
have more than one predictor variable in the equation. For example, using the hsb2 data file we
will predict writing score from gender (female), reading, math, science and social studies (socst)
scores.
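The command producing output like that shown below is:
regress write female read math science socst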
------------------------------------------------------------------------------
write | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
female | 5.492502 .8754227 6.27 0.000 3.765935 7.21907
read | .1254123 .0649598 1.93 0.055 -.0027059 .2535304
math | .2380748 .0671266 3.55 0.000 .1056832 .3704665
science | .2419382 .0606997 3.99 0.000 .1222221 .3616542
socst | .2292644 .0528361 4.34 0.000 .1250575 .3334713
_cons | 6.138759 2.808423 2.19 0.030 .599798 11.67772
------------------------------------------------------------------------------
The results indicate that the overall model is statistically significant (F = 58.60, p = 0.0000).
Furthermore, all of the predictor variables are statistically significant except for read.
Analysis of covariance
Analysis of covariance is like ANOVA, except in addition to the categorical predictors you also
have continuous predictors as well. For example, the one way ANOVA example used write as
the dependent variable and prog as the independent variable. Let's add read as a continuous
variable to this model, as shown below.
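The exact syntax depends on your version of Stata; in current versions the continuous covariate is marked with the c. prefix (older versions used the continuous() option, e.g., anova write prog read, continuous(read)):
anova write prog c.read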
The results indicate that even after adjusting for reading score (read), writing scores still
significantly differ by program type (prog) F = 5.87, p = 0.0034.
Multiple logistic regression is like simple logistic regression, except that there are two or more
predictors. The predictors can be interval variables or dummy variables, but cannot be
categorical variables. If you have categorical predictors, they should be coded into one or more
dummy variables. We have only one variable in our data set that is coded 0 and 1, and that is
female. We understand that female is a silly outcome variable (it would make more sense to use
it as a predictor variable), but we can use female as the outcome variable to illustrate how the
code for this command is structured and how to interpret the output. The first variable listed
after the logistic (or logit) command is the outcome (or dependent) variable, and all of the rest of
the variables are predictor (or independent) variables. You can use the logit command if you
want to see the regression coefficients or the logistic command if you want to see the odds
ratios. In our example, female will be the outcome variable, and read and write will be the
predictor variables.
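The command producing output like that shown below is:
logistic female read write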
------------------------------------------------------------------------------
female | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
read | .9314488 .0182578 -3.62 0.000 .8963428 .9679298
write | 1.112231 .0246282 4.80 0.000 1.064993 1.161564
------------------------------------------------------------------------------
These results show that both read and write are significant predictors of female.
Discriminant analysis
Discriminant analysis is used when you have one or more normally distributed interval
independent variables and a categorical dependent variable. It is a multivariate technique that
considers the latent dimensions in the independent variables for predicting group membership in
the categorical dependent variable. For example, using the hsb2 data file, say we wish to use
read, write and math scores to predict the type of program a student belongs to (prog). For this
analysis, you need to first download the daoneway program that performs this test. You can
download daoneway from within Stata by typing findit daoneway (see How can I use the
findit command to search for programs and get additional help? for more information about
using findit).
You can then perform the discriminant function analysis like this.
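The command itself is not reproduced below; assuming daoneway follows the usual varlist plus by() syntax of similar user-written commands, it would be something like:
daoneway read write math, by(prog)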
Observations = 200
Variables = 3
Groups = 3
func1 func2
read 0.0292 -0.0439
write 0.0383 0.1370
math 0.0703 -0.0793
_cons -7.2509 -0.7635
func1 func2
read 0.2729 -0.4098
write 0.3311 1.1834
math 0.5816 -0.6557
func1 func2
read 0.7785 -0.1841
write 0.7753 0.6303
math 0.9129 -0.2725
Group means on canonical discriminant functions
func1 func2
prog-1 -0.3120 0.1190
prog-2 0.5359 -0.0197
prog-3 -0.8445 -0.0658
Clearly, the Stata output for this procedure is lengthy, and it is beyond the scope of this page to
explain all of it. However, the main point is that two canonical variables are identified by the
analysis, the first of which seems to be more related to program type than the second. For more
information, see this page on discriminant function analysis.
One-way MANOVA
MANOVA (multivariate analysis of variance) is like ANOVA, except that there are two or more
dependent variables. In a one-way MANOVA, there is one categorical independent variable and
two or more dependent variables. For example, using the hsb2 data file, say we wish to examine
the differences in read, write and math broken down by program type (prog). For this analysis,
you can use the manova command and then perform the analysis like this.
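In recent versions of Stata, the command would be (the dependent variables go to the left of the equals sign and the categorical predictor to the right):
manova read write math = prog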
This command produces three different test statistics that are used to evaluate the statistical
significance of the relationship between the independent variable and the outcome variables.
According to all three criteria, the students in the different programs differ in their joint
distribution of read, write and math.
Multivariate multiple regression is used when you have two or more variables that are to be
predicted from two or more predictor variables. In our example, we will predict write and read
from female, math, science and social studies (socst) scores.
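This is done with the mvreg command, which takes the outcome variables to the left of the equals sign and the predictors to the right:
mvreg write read = female math science socst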
------------------------------------------------------------------------------
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
write |
female | 5.428215 .8808853 6.16 0.000 3.69093 7.165501
math | .2801611 .0639308 4.38 0.000 .1540766 .4062456
science | .2786543 .0580452 4.80 0.000 .1641773 .3931313
socst | .2681117 .049195 5.45 0.000 .1710892 .3651343
_cons | 6.568924 2.819079 2.33 0.021 1.009124 12.12872
-------------+----------------------------------------------------------------
read |
female | -.512606 .9643644 -0.53 0.596 -2.414529 1.389317
math | .3355829 .0699893 4.79 0.000 .1975497 .4736161
science | .2927632 .063546 4.61 0.000 .1674376 .4180889
socst | .3097572 .0538571 5.75 0.000 .2035401 .4159744
_cons | 3.430005 3.086236 1.11 0.268 -2.656682 9.516691
------------------------------------------------------------------------------
Many researchers familiar with traditional multivariate analysis may not recognize the tests
above, because the output does not show Wilks' Lambda, Pillai's Trace or the Hotelling-Lawley
Trace, the statistics with which they are familiar. It is possible to obtain these statistics using the
mvtest command written by David E. Moore of the University of Cincinnati. UCLA updated
this command to work with Stata 6 and above. You can download mvtest from within Stata by
typing findit mvtest (see How can I use the findit command to search for programs and get
additional help? for more information about using findit).
Now that we have downloaded it, we can use the command shown below.
mvtest female
MULTIVARIATE TESTS OF SIGNIFICANCE
These results show that female has a significant relationship with the joint distribution of write
and read. The mvtest command could then be repeated for each of the other predictor variables.
Canonical correlation
Canonical correlation is a multivariate technique used to examine the relationship between two
groups of variables. For each set of variables, it creates latent variables and looks at the
relationships among the latent variables. It assumes that all variables in the model are interval
and normally distributed. Stata requires that each of the two groups of variables be enclosed in
parentheses. There need not be an equal number of variables in the two groups.
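For example, relating the reading and writing scores to the math and science scores (this particular grouping of variables is an assumption for illustration, since the command is not reproduced here):
canon (read write) (math science)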
The output above shows the linear combinations corresponding to the first canonical correlation.
At the bottom of the output are the two canonical correlations. These results indicate that the
first canonical correlation is .7728. You will note that Stata is brief and may not provide you
with all of the information that you may want. Several programs have been developed to provide
more information regarding the analysis. You can download this family of programs by typing
findit cancor (see How can I use the findit command to search for programs and get additional
help? for more information about using findit).
Because the output from the cancor command is lengthy, we will use the cantest command to
obtain the eigenvalues, F-tests and associated p-values that we want. Note that you do not have
to specify a model with either the cancor or the cantest commands if they are issued after the
canon command.
cantest
Canon     Can Corr    Likelihood    Approx
Corr      Squared     Ratio         F           df1    df2        Pr > F
.7728     .59728      0.4025        56.4706       4    392.000    0.0000
.0235     .00055      0.9994         0.1087       1    197.000    0.7420
Factor analysis
Factor analysis is a form of exploratory multivariate analysis that is used to either reduce the
number of variables in a model or to detect relationships among variables. All variables involved
in the factor analysis need to be continuous and are assumed to be normally distributed. The
goal of the analysis is to try to identify factors which underlie the variables. There may be fewer
factors than variables, but there may not be more factors than variables. For our example, let's
suppose that we think that there are some common factors underlying the various test scores. We
will first use the principal components method of extraction (by using the pc option) and then the
principal components factor method of extraction (by using the pcf option). This parallels the
output produced by SAS and SPSS.
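In older versions of Stata the principal components extraction is requested directly on the factor command with the pc option; newer versions provide a separate pca command instead:
factor read write math science socst, pc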
Eigenvectors
Variable | 1 2 3 4 5
-------------+------------------------------------------------------
read | 0.46642 -0.02728 -0.53127 -0.02058 -0.70642
write | 0.44839 0.20755 0.80642 0.05575 -0.32007
math | 0.45878 -0.26090 -0.00060 -0.78004 0.33615
science | 0.43558 -0.61089 -0.00695 0.58948 0.29924
socst | 0.42567 0.71758 -0.25958 0.20132 0.44269
Now let's rerun the factor analysis with a principal component factors extraction method and
retain factors with eigenvalues of .5 or greater. Then we will use a varimax rotation on the
solution.
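The commands for this second run would be:
factor read write math science socst, pcf mineigen(.5)
rotate, varimax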
(varimax rotation)
Rotated Factor Loadings
Variable | 1 2 Uniqueness
-------------+--------------------------------
read | 0.64808 0.56204 0.26410
write | 0.50558 0.66942 0.29627
math | 0.75506 0.42357 0.25048
science | 0.89934 0.20159 0.15054
socst | 0.21844 0.92297 0.10041
Note that by default, Stata will retain all factors with positive eigenvalues; hence the use of the
mineigen option or the factors(#) option. The factors(#) option does not specify the exact number of
factors to retain, but rather the maximum number of factors to retain. From the table of factor
loadings, we can see that all five of the test scores load onto the first factor, while all five tend to
load not so heavily on the second factor. Uniqueness (which is the opposite of communality) is
the proportion of variance of the variable (i.e., read) that is not accounted for by all of the factors
taken together, and a very high uniqueness can indicate that a variable may not belong with any
of the factors. Factor loadings are often rotated in an attempt to make them more interpretable.
Stata performs both varimax and promax rotations.
rotate, varimax
(varimax rotation)
Rotated Factor Loadings
Variable | 1 2 Uniqueness
-------------+--------------------------------
read | 0.62238 0.51992 0.34233
write | 0.53933 0.54228 0.41505
math | 0.65110 0.45408 0.36988
science | 0.64835 0.37324 0.44033
socst | 0.44265 0.58091 0.46660
The purpose of rotating the factors is to get the variables to load either very high or very low on
each factor. In this example, because all of the variables loaded onto factor 1 and not on factor 2,
the rotation did not aid in the interpretation. Instead, it made the results even more difficult to
interpret.
To obtain a scree plot of the eigenvalues, you can use the greigen command. We have included a
reference line on the y-axis at one to aid in determining how many factors should be retained.
greigen, yline(1)
What is a z-test and when is it used?
A z-test is appropriate mainly (1) when the population you are sampling from is approximately
normal (with N > 30 you will generally be fine) and (2) when the population value you are testing
against (e.g., the population mean) is known.
For example, suppose you want to know whether exactly half of the people in your town prefer
Coke and the other half prefer Pepsi. You take a survey and find that 280 out of 500 respondents
prefer Coke. You would then use a z-test to test the hypothesis that 50% of the people in your
town prefer Coke and the other half prefer Pepsi.
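As a rough sketch in Stata, this one-sample test of a proportion (which reports a z statistic) can be run from summary figures with prtesti, giving the sample size, the observed proportion (280/500 = 0.56), and the hypothesized proportion (0.50):
prtesti 500 .56 .5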
STATISTICAL TESTS:
Student's t-test for comparing the means of two samples
Paired-samples test (like a t-test, but used when data can be paired)
Analysis of variance for comparing means of three or more samples
Linear regression
Logarithmic and sigmoid curves
STATISTICAL TABLES:
t (Student's t-test)
F, p = 0.05 (Analysis of Variance)
F, p = 0.01 (Analysis of Variance)
F, p = 0.001 (Analysis of Variance)
χ2 (chi-squared)
r (correlation coefficient)
Q (Multiple Range test)
Fmax (test for homogeneity of variance)
ANOVA stands for ANalysis Of VAriance, and it is used to test for differences among group
means.
That is, it compares the amount of variability between the means of the groups with the amount of
variability among the individual scores within each group: variance between groups versus
variance within groups.
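As a minimal illustration using the hsb2 data file, a one-way ANOVA displays exactly these pieces: the between-groups and within-groups sums of squares and mean squares, and the F ratio formed from their ratio:
oneway write prog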