Testing by Betting: A Strategy for Statistical and Scientific Communication
Glenn Shafer
Rutgers University, Newark, New Jersey, USA
E-mail: gshafer@business.rutgers.edu
[To be read before The Royal Statistical Society at the Society’s 2020 annual con-
ference held online on Wednesday, September 9th, 2020, the President, Professor
D. Ashby, in the Chair]
Summary. The most widely used concept of statistical inference — the p-value —
is too complicated for effective communication to a wide audience. This paper in-
troduces a simpler way of reporting statistical evidence: report the outcome of a bet
against the null hypothesis. This leads to a new role for likelihood, to alternatives
to power and confidence, and to a framework for meta-analysis that accommodates
both planned and opportunistic testing of statistical hypotheses and probabilistic fore-
casts. This framework builds on the foundation for mathematical probability devel-
oped in previous work by Vladimir Vovk and myself.
Keywords: betting score, game-theoretic probability, likelihood ratio, p-value, statisti-
cal communication, warranty
1. Introduction
The most widely used concept of statistical inference — the p-value — is too com-
plicated for effective communication to a wide audience (McShane and Gal, 2017;
Gigerenzer, 2018). This paper introduces a simpler way of reporting statistical ev-
idence: report the outcome of a bet against the null hypothesis. This leads to a
new role for likelihood, to alternatives to power and confidence, and to a frame-
work for meta-analysis that accommodates both planned and opportunistic testing
of statistical hypotheses and probabilistic forecasts.
Testing a hypothesized probability distribution by betting is straightforward.
We select a nonnegative payoff and buy it for its hypothesized expected value. If
this bet multiplies the money it risks by a large factor, we have evidence against
the hypothesis, and the factor measures the strength of this evidence. Multiplying
our money by 5 might merit attention; multiplying it by 100 or by 1000 might be
considered conclusive.
The factor by which we multiply the money we risk — we may call it the betting
score — is conceptually simpler than a p-value, because it reports the result of a
single bet, whereas a p-value is based on a family of tests. As explained in Section 2,
betting scores also have a number of other advantages:
(a) Whereas the certainty provided by a p-value is sometimes exaggerated, the
uncertainty remaining when a large betting score is obtained is less easily
minimized. Whether or not you have been schooled in mathematical statistics,
you will not forget that a long shot can succeed by sheer luck.
(b) A bet (a payoff selected and bought) determines an implied alternative hy-
pothesis, and the betting score is the likelihood ratio with respect to this
alternative. So the evidential meaning of betting scores is aligned with our
intuitions about likelihood ratios.
(c) Along with its implied alternative hypothesis, a bet determines an implied
target: a value for the betting score that can be hoped for under the alterna-
tive hypothesis. Implied targets can be more useful than power calculations,
because an implied target along with an actual betting score tells a coherent
story. The notion of power, because it requires a fixed significance level, does
not similarly cohere with the notion of a p-value.
(d) Testing by betting permits opportunistic searches for significance, because the
persuasiveness of having multiplied one’s money by successive bets does not
depend on having followed a complete betting strategy laid out in advance.
A shift from reporting p-values to reporting outcomes of bets cannot happen
overnight, and the notion of calculating a p-value will always be on the table when
statisticians look at the standard or probable error of the estimate of a difference;
this was already true in the 1830s (Shafer, 2019). We will want, therefore, to
relate the scale for measuring evidence provided by a p-value to the scale provided
by a betting score. Any rule for translating from the one scale to the other will
be arbitrary, but it may nevertheless be useful to establish some such rule as a
standard. This issue is discussed in Section 3.
Section 4 considers statistical modeling and estimation. A statistical model en-
codes partial knowledge of a probability distribution. In the corresponding betting
story, the statistician has partial information about what is happening in a bet-
ting game. We see outcomes, but we do not see what bets have been offered on
them and which of these bets have been taken up. We can nevertheless equate the
model’s validity with the futility of betting against it. A strategy for a hypothetical
bettor inside the game, together with the outcomes we see, then translates into
warranties about the validity of the bets that were offered. The strategy tells the
bettor what bets to make as a function of those offered, and if the game involves
the bettor’s being offered any payoff at the price given by a probability distribution
— the distribution remaining unknown to us, because we are not inside the game
— then assertions about the validity of this unknown probability distribution are
warrantied. Instead of (1 − α)-confidence in an assertion about the distribution, we
obtain a (1/α)-warranty. Either the warrantied assertion holds or the hypothetical
bettor has multiplied the money he risked by 1/α.
A statement of (1−α)-confidence can be interpreted as a (1/α)-warranty, one re-
sulting from all-or-nothing bets. But the more general concept of warranty obtained
by allowing bets that are not all-or-nothing has several advantages:
(a) Like individual betting scores, it gives colour to residual uncertainty by evoking
our knowledge of gambling and its dangers.
(b) Observations together with a strategy for the bettor produce more than one
warranty set. They produce a (1/α)-warranty set for every α, and these war-
ranty sets are nested.
(c) Because it is always legitimate to continue betting with whatever capital re-
mains, the hypothetical bettor can continue betting on additional outcomes,
and we can update our warranty sets accordingly without being accused of
“sampling to a foregone conclusion”. The same principles authorize us to
combine warranty sets based on successive studies.
2. Testing by betting
You claim that a probability distribution P describes a certain phenomenon Y . How
can you give content to your claim, and how can I challenge it?
Assuming that we will later see Y ’s actual value y, a natural way to proceed is to
interpret your claim as a collection of betting offers. You offer to sell me any payoff
S(Y ) for its expected value, EP (S). I choose a nonnegative payoff S, so that EP (S)
is all I risk. Let us call S my bet, and let us call the factor by which I multiply the
money I risk,
S(y)/E_P(S),
my betting score. This score does not change when S is multiplied by a positive
constant. I will usually assume, for simplicity, that EP (S) = 1 and hence that the
score is simply S(y).
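To make these definitions concrete, here is a minimal numerical sketch; the fair six-sided die and the particular bet are merely illustrative choices of mine, not an example taken from any study.

```python
# A hypothetical claim P: Y is the result of a fair six-sided die.
# My bet S pays 3 if the die shows 5 or 6 and nothing otherwise.
from fractions import Fraction

P = {y: Fraction(1, 6) for y in range(1, 7)}   # the hypothesized distribution for Y
S = {y: Fraction(3) if y >= 5 else Fraction(0) for y in range(1, 7)}   # my nonnegative bet

price = sum(S[y] * P[y] for y in P)            # E_P(S), the amount I pay and risk
assert price == 1                              # so the betting score is simply S(y)

y_observed = 6                                 # the outcome we later see
print(S[y_observed] / price)                   # 3: the factor by which I multiplied my money
```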
A large betting score can count as evidence against P . What better evidence
can I have? I have bet against P and won. On the other hand, the possibility that
I was merely lucky remains stubbornly in everyone’s view. By using the language
of betting, I have accepted the uncertainty involved in my test and made sure that
everyone else is aware of it as well.
I need not risk a lot of money. I can risk as little as I like — so little that I am
indifferent to losing it and to winning any amount the bet might yield. So this use
of the language of betting is not a chapter in decision theory. It involves neither the
evaluation of utilities nor any Bayesian reasoning. I am betting merely to make a
point. But whether I use real money or play money, I must declare my bet before
the outcome y is revealed, in the situation in which you asserted P .
This section explains how testing by betting can bring greater flexibility and
clarity into statistical testing. Section 2.1 explains how betting can be more op-
portunistic than conventional significance testing. Section 2.2 explains that a bet
implies an alternative hypothesis, and that the betting score is the likelihood ratio
with respect to this alternative. Section 2.3 explains how the alternative hypothesis
in turn implies a target for the bet. Finally, Section 2.4 uses three simple but rep-
resentative examples to show how the concepts of betting score and implied target
provide a clear and consistent message about the result of a test, in contrast to the
confusion that can arise when we use the concepts of p-value and power.
In standard statistical testing, we select a significance level α and an event E to which the null hypothesis P assigns probability α, and we reject P if E happens. This test corresponds to the all-or-nothing bet

S := (1/α)1_E.   (1)

If E happens, I have multiplied the $1 I risked by 1/α. This makes standard testing a special case of testing by betting, the special case where the bet is all-or-nothing. In return for $1, I get either $(1/α) or $0.
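For concreteness, a minimal sketch of this all-or-nothing special case; the uniform distribution for Y and the level α = 0.05 are merely illustrative assumptions.

```python
# All-or-nothing bet (1): P (hypothetically) says Y is uniform on [0, 1],
# E is the event Y <= alpha, and the bet pays 1/alpha on E and 0 otherwise.
alpha = 0.05

def all_or_nothing_bet(y):
    return 1.0 / alpha if y <= alpha else 0.0  # S := (1/alpha) 1_E

# E_P(S) = (1/alpha) * P(Y <= alpha) = 1, so S(y) is itself the betting score.
for y in (0.20, 0.03):
    print(y, all_or_nothing_bet(y))            # 0 (bet lost) or 20 (= 1/alpha)
```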
Although statisticians are accustomed to all-or-nothing bets, there are two good
reasons for generalizing beyond them. First, the betting score S(y) from a more
general bet is a graduated appraisal of the strength of the evidence against P .
Second, when we allow more general bets, testing can be opportunistic.
A conventional significance test of P is specified by a test statistic T and a critical value t_α chosen so that

P(T ≥ t_α) = α.   (2)

Because E_P(S) = 1, the values S(y)P(y) sum to 1, and because S(y) and P(y) are nonnegative for all y, the product SP is a probability distribution. Write Q for SP, and call Q the alternative implied by the bet S. If we suppose further that P(y) > 0 for all y, then S = Q/P, and

S(y) = Q(y)/P(y).   (4)
A betting score is a likelihood ratio.
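Continuing the illustrative die example, the following sketch checks that the implied alternative Q := SP is a probability distribution and that the betting score equals the likelihood ratio Q/P.

```python
from fractions import Fraction

P = {y: Fraction(1, 6) for y in range(1, 7)}                            # null hypothesis
S = {y: Fraction(3) if y >= 5 else Fraction(0) for y in range(1, 7)}    # bet with E_P(S) = 1

Q = {y: S[y] * P[y] for y in P}        # the implied alternative Q := SP
assert sum(Q.values()) == 1            # Q is indeed a probability distribution

y = 6
print(S[y], Q[y] / P[y])               # both 3: the betting score is a likelihood ratio
```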
Conversely, a likelihood ratio is a betting score. Indeed, if Q is a probability
distribution for Y , then Q/P is a bet by our definition, because Q/P ≥ 0 and
Σ_y (Q(y)/P(y)) P(y) = Σ_y Q(y) = 1.
According to the probability distribution Q, the expected gain from the bet S
is nonnegative. In other words, E_Q(S) is at least 1, S's price. In fact, as a
referee has pointed out, E_Q(S) = E_P(S²), and hence E_Q(S) = 1 + Var_P(S) ≥ 1.
When I have a hunch that Q is better. . . We began with your claiming that P
describes the phenomenon Y and my making a bet S satisfying S ≥ 0 and, for
simplicity, EP (S) = 1. There are no other constraints on my choice of S. The
choice may be guided by some hunch about what might work, or I may act on a
whim. I may not have any alternative distribution Q in mind. Perhaps I do not
even believe that there is an alternative distribution that is valid as a description
of Y .
Suppose, however, that I do have an alternative Q in mind. I have a hunch
that Q is a valid description of Y . In this case, should I use Q/P as my bet? The
thought that I should is supported by Gibbs’s inequality, which says that
E_Q(ln(Q/P)) ≥ E_Q(ln(R/P))   (5)
for any probability distribution R for Y . Because any bet S is of the form R/P
for some such R, (5) tells us that EQ (ln S) is maximized over S by setting S :=
Q/P . Many readers will recognize EQ (ln(Q/P )) as the Kullback-Leibler divergence
between Q and P . In the terminology of Kullback’s 1959 book (Kullback, 1959, p. 5),
it is the mean information for discrimination in favor of Q against P per observation
from Q.
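The following sketch checks Gibbs's inequality (5) numerically for a few randomly generated distributions on six outcomes; the distributions themselves are arbitrary and chosen only for illustration.

```python
# Check that E_Q ln(Q/P) >= E_Q ln(R/P): among bets of the form R/P, the choice
# R = Q maximizes the expected log betting score under Q.
import numpy as np

rng = np.random.default_rng(0)

def random_dist(k):
    w = rng.random(k) + 1e-9
    return w / w.sum()

for _ in range(5):
    P, Q, R = [random_dist(6) for _ in range(3)]
    lhs = np.sum(Q * np.log(Q / P))    # E_Q ln(Q/P), the Kullback-Leibler divergence
    rhs = np.sum(Q * np.log(R / P))    # E_Q ln(R/P) for some other bet R/P
    assert lhs >= rhs - 1e-12
print("Gibbs's inequality held in every trial")
```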
Why should I choose S to maximize EQ (ln S)? Why not maximize EQ (S)? Or
perhaps Q(S ≥ 20) or Q(S ≥ 1/α) for some other significance level α?
Maximizing E(ln S) makes sense in a scientific context where we combine suc-
cessive betting scores by multiplication. When S is the product of many successive
factors, maximizing E(ln S) maximizes S’s rate of growth. This point was made
famously and succinctly by John L. Kelly, Jr. (Kelly Jr., 1956, p. 926): “it is the
logarithm which is additive in repeated bets and to which the law of large numbers
applies.” The idea has been used extensively in gambling theory (Breiman, 1961),
information theory (Cover and Thomas, 1991), finance theory (Luenberger, 2014),
and machine learning (Cesa-Bianchi and Lugosi, 2006). I am proposing that we
put it to greater use in statistical testing. It provides a crucial link in this paper’s
argument.
We can use Kelly’s insight even when betting is opportunistic and hence does
not define alternative joint probabilities for successive outcomes. Even if the null
hypothesis P does provide joint probabilities for a phenomenon (Y1 , Y2 , . . .), succes-
sive opportunistic bets S1 , S2 , . . . against P will not determine a joint alternative
Q. Each bet Si will determine only an alternative Qi for Yi in light of the actual
outcomes y1, . . . , yi−1. A game-theoretic law of large numbers nevertheless holds
with respect to the sequence Q1 , Q2 , . . .: if they are valid in the betting sense (an
opponent will not multiply their capital by a large factor betting against them),
then the average of the ln Si will approximate the average of the expected values
assigned them by the Qi (Shafer and Vovk, 2019, Chapter 2).
Should we ever choose S to maximize EQ (S)? Kelly devises a rather artificial
story about gambling where maximizing EQ (S) makes sense:
. . . suppose the gambler’s wife allowed him to bet one dollar each week
but not to reinvest his winnings. He should then maximize his expec-
tation (expected value of capital) on each bet. He would bet all his
available capital (one dollar) on the event yielding the highest expecta-
tion. With probability one he would get ahead of anyone dividing his
money differently.
But when our purpose is to test P against Q, it seldom makes sense to choose the S
by maximizing EQ (S). As Kelly tells us, the event yielding the highest expectation
under Q is the value of y0 for which Q/P is greatest. Is a bet that risks everything on
this single possible outcome a sensible test? If Q(y0 )/P (y0 ) is huge, much greater
than we would need to refute Q, and yet Q(y0 ) is very small, then we would be
buying a tiny chance of an unnecessarily huge betting score at the price of very
likely getting a zero betting score even when the evidence against P in favor of Q
is substantial.
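A small simulation makes the contrast concrete. The numbers are my own illustrative choices: the null P says a binary outcome comes up heads with probability 1/2, the outcomes are in fact generated with heads probability 0.6, and we compare the bet Q/P, which maximizes E_Q(ln S), with the bet that maximizes E_Q(S) by staking everything on heads each round.

```python
# The bet Q/P multiplies capital by 1.2 on heads and 0.8 on tails; the bet that
# maximizes E_Q(S) pays 2 on heads and 0 on tails, so a single tail ruins it.
import numpy as np

rng = np.random.default_rng(1)
p, q, n_rounds = 0.5, 0.6, 1000
heads = rng.random(n_rounds) < q                       # outcomes drawn from Q

kelly_log = 0.0                                        # log capital of the Q/P bettor
all_in = 1.0                                           # capital of the E_Q(S) maximizer
for h in heads:
    kelly_log += np.log((q if h else 1 - q) / p)       # multiply by Q(y)/P(y)
    all_in *= (1 / p) if h else 0.0                    # multiply by 2 on heads, 0 on tails

print("log10 capital of the Q/P bettor:", kelly_log / np.log(10))  # about +8.7 on average
print("capital of the all-in bettor:", all_in)                     # 0.0 unless every flip was heads
```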
Choosing S to maximize Q(S ≥ 1/α) is appropriate when the hypothesis being
tested will not be tested again. It leads us to the Neyman-Pearson theory, to which
we now turn.
The Neyman-Pearson test with significance level α rejects P when an event E with P(E) = α happens; it corresponds to the all-or-nothing bet (1) with this choice of E, and Neyman and Pearson teach us to choose E to maximize Q(E), which we call the power of the test with respect to Q. In fact, S_E := (1/α)1_E with this choice of E maximizes Q(S(Y) ≥ 1/α) over all bets S, not merely over all-or-nothing bets. But the success of the Neyman-Pearson bet may nevertheless be unconvincing; see Examples 1 and 2 in Section 2.4.
R. A. Fisher famously criticized Neyman and Pearson for confusing the scien-
tific enterprise with the problem of “making decisions in an acceptance procedure”
(Fisher, 1956, Chapter 4). Going beyond all-or-nothing tests to general testing by
betting is a way of taking this criticism seriously. The choice to “reject” or “accept”
is imposed when we are testing a widget that is to be put on sale or returned to
the factory for rework, never in either case to be tested again. But in many cases
scientists are testing a hypothesis that may be tested again many times in many
ways.
When the bet loses money. . . In the second paragraph of the introduction, I
suggested that a betting score of 5 casts enough doubt on the hypothesis being
tested to merit attention. We can elaborate on this by noting that a value of 5 or
more for S(y) means, according to (4), that the outcome y was at least 5 times as
likely under the alternative hypothesis Q as under the null hypothesis P.
Suppose we obtain an equally extreme result in the opposite direction: S(y)
comes out less than 1/5. Does this provide enough evidence in favor of P to merit
attention? Maybe and maybe not. A low value of S(y) does suggest that P describes
the phenomenon better than Q. But Q may or may not be the only plausible
alternative. It is the alternative for which the bet S is optimal in a certain sense.
Table 1. Elements of a study that tests a probability distribution by betting. The proposed study may be considered meritorious and perhaps even publishable regardless of its outcome when the implied target is reasonably large and both the null hypothesis P and the implied alternative Q are initially plausible. A large betting score then discredits the null hypothesis.

Proposed study
  initially unknown outcome: phenomenon Y
  probability distribution for Y: null hypothesis P
  nonnegative function of Y with expected value 1 under P: bet S
  SP: implied alternative Q
  exp(E_Q(ln S)): implied target S*

Results
  actual value of Y: outcome y
  factor by which money risked has been multiplied: betting score S(y)

But as I have emphasized, we may have chosen S blindly or on a whim, without any real opinion or clue as to what alternative we should consider. In this case, the
message of a low betting score is not that P is supported by the evidence but that
we should try a rather different bet the next time we test P . This understanding
of the matter accords with Fisher’s contention that testing usually precedes the
formulation of alternative hypotheses in science (Bennett, 1990, p. 246; Senn, 2011, p. 57).
To see how betting scores can help forestall misuses of p-values and power, it suffices to consider
elementary examples. Here I will consider examples where the null and alternative
distributions of the test statistic are normal with the same variance.
Example 1. Suppose P says that Y is normal with mean 0 and standard deviation
10, Q says that Y is normal with mean 1 and standard deviation 10, and we observe
y = 30.
She bets S := q/p, where p and q are the probability densities of P and Q, for which

E_Q(ln S) = E_Q((2Y − 1)/200) = 1/200,
so that the implied target is exp(1/200) ≈ 1.005. She does a little better than
this very low target; she multiplies the money she risked by exp(59/200) ≈
1.34.
The power and the implied target both told us in advance that the study was a
waste of time. The betting score of 1.34 confirms that little was accomplished,
while the low p-value and the Neyman-Pearson rejection of P give a misleading
verdict in favor of Q.
Example 2. Now the case of high power and a borderline outcome: P says that
Y is normal with mean 0 and standard deviation 10, Q says that Y is normal with
mean 37 and standard deviation 10, and we observe y = 16.5.
Assuming that Q is indeed a plausible alternative, the high power and high im-
plied target suggest that the study is meritorious. But the low p-value and the
Neyman-Pearson rejection of P are misleading. The betting score points in the
other direction, albeit not enough to merit attention.
In Example 3, the power and the implied target both suggested that the study was
marginal. The Neyman-Pearson conclusion was “no evidence”. The bet S provides
the same conclusion; the score S(y) favors P relative to Q but too weakly to merit
attention.
The underlying problem in the first two examples is the mismatch between the
concept of a p-value on the one hand and the concepts of a fixed significance level
and power on the other. This mismatch and the confusion it engenders disappear
when we replace p-values with betting scores and power with implied target. The
bet, implied target, and betting score always tell a coherent story. In Example 1,
the implied target close to 1 told us that the bet would not accomplish much, and
the betting score close to 1 only confirmed this. In Example 2, the high implied
target told us that we had a good test of P relative to Q, and P ’s passing this test
strongly suggests that Q is not better than P .
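As a numerical companion to Examples 1 and 2, the following sketch recomputes the quantities discussed above. The one-sided p-value and the 5% level (critical value 1.645σ) are my assumptions about the conventional test being referred to, since the examples do not spell them out.

```python
from math import erf, exp, sqrt

def Phi(x):                                      # standard normal cumulative distribution
    return 0.5 * (1 + erf(x / sqrt(2)))

def summarize(mu_q, y, sigma=10.0):
    # P = N(0, sigma^2), Q = N(mu_q, sigma^2); the bet S = q/p has
    # ln S(y) = (2*mu_q*y - mu_q**2) / (2*sigma**2).
    betting_score = exp((2 * mu_q * y - mu_q ** 2) / (2 * sigma ** 2))
    implied_target = exp(mu_q ** 2 / (2 * sigma ** 2))      # exp(E_Q ln S)
    p_value = 1 - Phi(y / sigma)                            # one-sided p-value
    power = 1 - Phi((1.645 * sigma - mu_q) / sigma)         # Q(Y >= critical value)
    return betting_score, implied_target, p_value, power

print(summarize(mu_q=1, y=30))     # Example 1: score ~1.34, target ~1.005, p ~0.0013, power ~0.06
print(summarize(mu_q=37, y=16.5))  # Example 2: score ~0.48, target ~940,  p ~0.05,   power ~0.98
```

The first line reproduces the betting score 1.34 and implied target 1.005 quoted in Example 1.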
The problem in Example 3 is the meagerness of the interpretation available for a
middling to high p-value. The theoretical statistician correctly tells us that such a
p-value should be taken as “no evidence”. But a scientist who has put great effort
into a study will want to believe that its result signifies something. In this case,
the merit of the betting score is that it blocks any erroneous claim with a concrete
message: it tells us the direction the result points and how strongly.
As the three examples illustrate, the betting language does not change substan-
tively the conclusions that an expert mathematical statistician would draw from
given evidence. But it can sometimes provide a simpler and clearer way to explain
these conclusions to a wider audience.
3. Comparing scales
The notion of a p-value retains a whiff of betting. In a passage I will quote shortly,
Fisher used the word “odds” when comparing two p-values. But obtaining a p-
value p(y) cannot be interpreted as multiplying money risked by 1/p(y). The logic
of betting requires that a bet be laid before its outcome is observed, and we cannot
make the bet (1) with α = p(y) unless we already know y. Pretending that we
had made the bet would be cheating, and some penalty for this cheating — some
sort of shrinking — is needed to make 1/p(y) comparable to a betting score.
The inadmissibility of 1/p(y) as a betting score is confirmed by its infinite ex-
pected value under P . Shrinking it to make it comparable to a betting score means
shrinking it to a payoff with expected value 1. In the ideal case, p(y) is uniformly
distributed between 0 and 1 under P , and there are infinitely many ways of shrinking
1/p(y) to a payoff with expected value 1. (In the general case, p(y) is stochastically dominated under P by the uniform distribution, so the payoff will have expected value 1 or less.) No one has made a convincing case for any particular choice from this infinitude; the choice is fundamentally arbitrary (Shafer and Vovk, 2019, Section 11.5). But it would be useful to make some such choice, because the use of p-values will never completely disappear, and if we also use betting scores, we will find ourselves wanting to compare the two scales.
It seems reasonable to shrink p-values in a way that is monotonic, smooth, and unbounded, and the exact way of doing this will sometimes be unimportant. My favorite, only because it is easy to remember and calculate, is

S(y) := 1/√(p(y)) − 1.   (6)

Table 2. Making a p-value into a betting score

p-value     1/p-value     1/√(p-value) − 1
0.10        10            2.2
0.05        20            3.5
0.01        100           9.0
0.005       200           13.1
0.001       1,000         30.6
0.000001    1,000,000     999

Table 2 applies this rule to some commonly used significance levels. If we retain
the conventional 5% threshold for saying that a p-value merits attention, then this
table accords with the suggestion, made in the introduction to this paper, that
multiplying our money by 5 merits attention. Multiplying our money by 2 or 3, or
by 1/2 or 1/3 as in Examples 2 and 3 of Section 2.4, does not meet this threshold.
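A few lines of code reproduce Table 2 from rule (6); this is only a check of the arithmetic.

```python
# Rule (6): shrink a p-value p to the betting score 1/sqrt(p) - 1.  Its expected
# value is 1 when p is uniform on [0, 1] under P, since the integral of
# u**(-1/2) - 1 over [0, 1] equals 2 - 1 = 1.
from math import sqrt

def shrink(p):
    return 1 / sqrt(p) - 1

for p in (0.10, 0.05, 0.01, 0.005, 0.001, 1e-6):
    print(f"p = {p:<9g}  1/p = {1/p:>9,.0f}  betting score = {shrink(p):,.1f}")
```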
If we adopt a standard rule for shrinking p-values, we will have a fuller picture
of what we are doing when we use a conventional test that is proposed without
any alternative hypothesis being specified. Since it determines a bet, the rule for
shrinking implies an alternative hypothesis.
Example 4. Consider Fisher’s analysis of Weldon’s dice data in the first edition
of his Statistical Methods for Research Workers (Fisher, 1925, pp. 66–69). Weldon
threw 12 dice together 26,306 times and recorded, for each throw, how many dice
came up 5 or 6. Using this data, Fisher tested the bias of the dice in two different
ways.
(a) First, he performed a χ2 goodness-of-fit test. On none of the 26,306 throws did
all 12 dice come up 5 or 6, so he pooled the outcomes 11 and 12 and performed
the test with 12 categories and 11 degrees of freedom. The χ2 statistic came
out 40.748, and he noted that “the actual chance in this case of χ2 exceeding
40·75 if the dice had been true is ·00003.”
(b) Then he noted that in the 12 × 26,306 = 315,672 throws of a die there were
altogether 106,602 5s and 6s. The expected number is 315,672/3 = 105,224
with standard error 264.9, so that the observed number exceeded expectation
by 5.20 times its standard error, and “a normal deviation only exceeds 5·2
times its standard error once in 5 million times.”
Why is the one p-value so much less than the other? Fisher explained:
The reason why this last test gives so much higher odds than the test
for goodness of fit, is that the latter is testing for discrepancies of any
kind, such, for example, as copying errors would introduce. The actual
discrepancy is almost wholly due to a single item, namely, the value of p,
and when that point is tested separately its significance is more clearly
brought out.
Under the normal approximation for Fisher's second test, the proportion Y of the 315,672 throws in which the die came up 5 or 6 is normally distributed under the null hypothesis P, with mean 1/3 and standard
deviation 0.00084. The observed value y is 106,602/315,672 ≈ 0.3377. As Fisher
noted, the deviation from 1/3, 0.0044, is 5.2 times the standard deviation. The
function p(y) for Fisher’s test is
p(y) = 2(1 − Φ(|y − 1/3| / 0.00084)),
where Φ is the cumulative distribution function for the standard normal distribution.
The density q for the alternative Q, obtained by multiplying P ’s normal density p
by (6), is symmetric around 1/3, just as p is. It has the same value at 1/3 as p does,
but much heavier tails. The probability of a deviation of 0.0044 or more under Q
is still very small, but only about 1 in a thousand instead of 1 in 5 million.
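The following sketch checks the numbers in this example under the normal approximation stated above; the closed form 2*sqrt(p0) - p0 for Q's tail probability uses the fact that the p-value is approximately uniform under P, together with rule (6).

```python
from math import erf, sqrt

def Phi(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

sigma = 0.00084
deviation = 106602 / 315672 - 1 / 3                 # observed proportion minus 1/3
p0 = 2 * (1 - Phi(deviation / sigma))               # two-sided p-value, about 1 in 5 million
score = 1 / sqrt(p0) - 1                            # rule (6) applied to this p-value

# Q(p(Y) <= p0) = E_P[(p(Y)**-0.5 - 1) * 1{p(Y) <= p0}]; with p(Y) uniform under P
# this is the integral of u**(-1/2) - 1 over [0, p0], i.e. 2*sqrt(p0) - p0.
q_tail = 2 * sqrt(p0) - p0

print(f"p-value ~ 1/{1/p0:,.0f}")                                # about 1 in 5 million
print(f"betting score from rule (6) ~ {score:,.0f}")
print(f"Q(deviation at least this large) ~ 1/{1/q_tail:,.0f}")   # about 1 in a thousand
```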
A different rule for shrinking the p-value to a betting score will of course produce
a different alternative hypothesis Q. But a wide range of rules will give roughly the
same picture.
We can obtain an alternative hypothesis in the same way for the χ2 test. Whereas
the distribution of the χ2 statistic is approximately normal under the null hypoth-
esis, the alternative will again have much heavier tails. Even if we consider this
alternative vaguely defined, its existence supports Joseph Berkson’s classic argu-
ment for discretion when using the test (Berkson, 1938).
4. Betting games as statistical models
Like all testing protocols considered in this paper, Protocol 1 is a perfect information
protocol; the players move sequentially and each sees the other’s move as it is
made.
Because y can be multi-dimensional, Protocol 1 can be used to test a probabil-
ity distribution P for a stochastic process Y = (Y1 , . . . , YN ). See Shafer and Vovk
(2019) for expositions that emphasize processes that continue indefinitely instead of
stopping at a non-random time N . Often, however, the probability distribution for
a stochastic process represents the hypothesis that no additional information we ob-
tain as the process unfolds can provide further help predicting it — more precisely,
that no information available at the point when we have observed y1 , . . . , yn−1 can
enable us to improve on P ’s conditional probabilities given y1 , . . . , yn−1 for predict-
ing yn , . . . , yN . To test this hypothesis, we may use a perfect-information protocol
in which Sceptic observes the yn step by step (Protocol 2).
The condition of perfect information requires only that each player sees the oth-
ers’ moves as they are made. Some or all of the players may receive additional
information as play proceeds.
Sceptic can make any bet against P in Protocol 2 that he can make in Protocol 1.
Indeed, for any payoff S : Y N → [0, ∞) such that EP (S) = 1, Sceptic can play
so that KN = S(y1 , . . . , yN ); on the nth round, he makes the bet Sn given by
Sn (y) := EP (S(y1 , . . . , yn−1 , y, Yn+1 , . . . , YN )|y1 , . . . , yn−1 , y). He can also make
bets taking additional information into account.
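Here is a minimal sketch of this round-by-round construction for a hypothetical P under which Y1, . . . , YN are independent fair coin flips; the particular global bet S, which pays 8 if all three flips come up heads, is merely illustrative.

```python
# The global bet S is replayed by betting its conditional expectation at each step;
# the running value E_P(S | y1,...,yn) starts at E_P(S) = 1 and ends at S(y1,...,yN).
import itertools
from statistics import mean

N = 3

def S(ys):                                   # a global bet with E_P(S) = 1
    return 8.0 if all(ys) else 0.0

def cond_exp(past):                          # E_P(S | y1,...,yn) under independent fair flips
    rest = N - len(past)
    return mean(S(tuple(past) + tail) for tail in itertools.product((0, 1), repeat=rest))

path = (1, 1, 1)                             # one possible sequence of outcomes (1 = heads)
print([cond_exp(path[:n]) for n in range(N + 1)])   # [1.0, 2.0, 4.0, 8.0]; last value is S(path)
```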
Many or most statistical models also use additional information (also known as independent variables) to make probability predictions about a sequence y1, . . . , yN.
We can bring this option into the sequential betting picture by having Reality an-
nounce a signal xn at the beginning of each round and by supplying the protocol with a probability distribution P_{x1,y1,...,xn−1,yn−1,xn} for each round n and each possible sequence of signals and outcomes x1, y1, . . . , xn−1, yn−1, xn that might precede Sceptic's move on that round.
A simpler protocol is obtained when we drop the assumption that probability dis-
tributions are specified at the outset and introduce instead a player, say Forecaster,
who decides on them as play proceeds.
This protocol allows us to test forecasters who give probabilities for sequences of
events without using probability distributions or statistical models fixed at the out-
set. This includes both weather forecasters who use physical models and forecasters
of sporting and electoral outcomes who invent and tinker with models as they go
along. Although there is no comprehensive probability distribution or statistical
model to test in these cases, we can still rely on the intuition that the forecaster is
discredited if Sceptic manages to multiply the capital he risks by a large factor. This
intuition is supported by the theory developed by Shafer and Vovk (2019), where it
is shown that Sceptic can multiply the capital he risks by a large factor if the prob-
ability forecasts actually made do not agree with outcomes in ways that standard
probability theory predicts. The reliance on forecasts actually made, without any
attention to other aspects of any purported comprehensive probability distribution,
makes this approach prequential in the sense developed by Dawid (1984).
Now suppose Reality begins the game by choosing a parameter θ from a set Θ, thereby fixing the probability distributions used to price Sceptic's bets; θ is announced to Sceptic but not to the statistician. The statistician sees neither Reality's move θ nor Sceptic's moves S1, . . . , SN. She sees only the outcomes y1, . . . , yN.
Because θ is announced to Sceptic at the outset, a strategy S for Sceptic that uses
only the information provided by Reality’s moves can be thought of as a collection
of strategies, one for each θ ∈ Θ. The strategy for θ, say S θ , specifies Sceptic’s move
Sn as a function of y1 , . . . , yn−1 . This makes Sceptic’s final capital a function of θ
and the observations y1 , . . . , yN . Let us write KS (θ) for this final capital, leaving
the dependence on y1 , . . . , yN implicit.
Sceptic is a creation of the statistician’s imagination and therefore subject to
the statistician’s direction. Suppose the statistician directs Sceptic to play a par-
ticular strategy that uses only Reality’s moves. Then, after observing y1 , . . . , yN ,
the statistician can calculate Sceptic's final capital as a function of θ, say K(θ). Let
us call K(θ) the statistician’s betting score against the hypothesis θ. We interpret
it just as we interpreted betting scores in Section 2. The statistician doubts that
Sceptic has multiplied his initial unit capital by a large factor, and so he thinks that
the hypothesis θ has been discredited if K(θ) is large. This way of thinking also
leads us to betting scores against composite hypotheses and to a notion of warranty
analogous to the established notion of confidence.
Given the betting scores K(θ), define the (1/α)-warranty set by W1/α := {θ ∈ Θ : K(θ) < 1/α}: either the true θ is in W1/α or Sceptic has multiplied the capital he risked by at least 1/α. The notion of a warranty was already developed in Vovk (1993, Section 7). The intuition can be traced back at least to Schnorr (1971) and Levin (1976).
For small α, the statistician will tend to believe that the true θ is in W1/α .
For example, she will not expect Sceptic to have multiplied his capital by 1000
and hence will believe that θ is in W1000 . But this belief is not irrefutable. If
she obtains strong enough evidence that θ is not in W1000 , she may conclude that
Sceptic actually did multiply his capital by 1000 using S. See Fraser et al. (2018)
for examples of outcomes that cast doubt on confidence statements and would also
cast doubt on warranties.
Every (1 − α)-confidence set has a (1/α)-warranty. This is because a (1 − α)-
confidence set is specified by testing each θ at level α; the (1 − α)-confidence set
consists of the θ not rejected. When S makes the all-or-nothing bet against θ
corresponding to the test used to form the confidence set, K(θ) < 1/α if and only
if θ was not rejected, and hence W1/α is equal to the confidence set.
Warranty sets are nested: W1/α ⊆ W1/α′ whenever 1/α ≤ 1/α′. Standard statistical
theory also allows nesting; sets with different levels of confidence can be nested.
But the different confidence sets will be based on different tests (Cox, 1958; Xie
and Singh, 2013). The (1/α)-warranty sets for different α all come from the same
strategy for Sceptic.
Instead of stopping the protocol after some fixed number of rounds, the statis-
tician may stop it when she pleases and adopt the (1/α)-warranty sets obtained at
that point. As we learned in Section 2, the intuition underlying betting scores sup-
ports such optional continuation; multiplying the money you risk by a large factor
discredits a probabilistic hypothesis or a probability forecaster no matter how you
decide to bet and no matter how long you persist in betting. The only caveat is
that we cannot pretend to have stopped before we actually did (Shafer and Vovk,
2019, Ch. 11). This contrasts with confidence intervals; if we continually calculate
test results and the corresponding confidence intervals as we make more and more
observations, the multiple testing vitiates the confidence coefficients and so may
be called “sampling to a foregone conclusion” (Cornfield, 1966; Shafer et al., 2011).
The most important exceptions are the “confidence sequences” that can be obtained
from the sequential probability ratio test (Lai, 2009). Because they are derived from
products of successive likelihood ratios that can be interpreted as betting scores,
these confidence sequences can be understood as sequences of warranty sets.
How should the statistician choose the strategy for Sceptic? An obvious goal is
to obtain small warranty sets. But a strategy that produces the smallest warranty
set for one N and one warranty level 1/α will not generally do so for other values
of these parameters. So any choice will be a balancing act. How to perform this
balancing act is an important topic for further research (Grünwald et al., 2019).
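For concreteness, here is a toy sketch of one possible strategy and the warranty sets it produces. The strategy (betting, for each θ, the likelihood ratio of a normal distribution centred at the running sample mean to the normal distribution with mean θ) is a simple choice of my own for illustration, not one recommended in this paper; the model N(θ, 1), the simulated data and the grid of hypotheses are likewise hypothetical.

```python
# Each round-n factor phi(y_n - m_{n-1}) / phi(y_n - theta) has expected value 1
# under N(theta, 1) because m_{n-1} depends only on earlier outcomes, so the product
# K(theta) is a betting score against theta and W_{1/alpha} = {theta : K(theta) < 1/alpha}.
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=0.5, scale=1.0, size=50)        # simulated observations; true mean 0.5

thetas = np.linspace(-1.0, 2.0, 301)               # grid of hypotheses theta
log_K = np.zeros_like(thetas)                      # log betting score against each theta
running_mean = 0.0                                 # plug-in alternative before any data
for n, obs in enumerate(y):
    log_K += -0.5 * (obs - running_mean) ** 2 + 0.5 * (obs - thetas) ** 2
    running_mean = (running_mean * n + obs) / (n + 1)

for alpha in (0.1, 0.01, 0.001):
    inside = thetas[log_K < np.log(1 / alpha)]     # the (1/alpha)-warranty set on the grid
    if inside.size:
        print(f"1/alpha = {1/alpha:g}: warranty set roughly [{inside.min():.2f}, {inside.max():.2f}]")
    else:
        print(f"1/alpha = {1/alpha:g}: warranty set empty on this grid")
```

Because the three sets are sublevel sets of the same function K(θ), they come out nested, illustrating point (b) of the introduction.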
The probability calculus began as a theory about betting, and its logic remains the
logic of betting, even when it serves to describe phenomena. But in their quest for
the appearance of objectivity, mathematicians have created a language (likelihood,
significance, power, p-value, confidence) that pushes betting into the background.
This deceptively objective statistical language can encourage overconfidence in
the results of statistical testing and neglect of relevant information about how the
results are obtained. In recent decades this problem has become increasingly salient,
especially in medicine and the social sciences, as numerous influential statistical
studies in these fields have turned out to be misleading.
In 2016, the American Statistical Association issued a statement listing common
misunderstandings of p-values and urging full reporting of searches that produce
p-values (Wasserstein and Lazar, 2016). Many statisticians fear, however, that
the situation will not improve. Most dispiriting are studies showing that both
teachers of statistics and scientists who use statistics are apt to answer questions
about the meaning of p-values incorrectly (McShane and Gal, 2017; Gigerenzer,
2018). Andrew Gelman and John Carlin argue persuasively that the most frequently
proposed solutions (better exposition, confidence intervals instead of tests, practical
instead of statistical significance, Bayesian interpretation of one-sided p-values, and
Bayes factors) will not work (Gelman and Carlin, 2017). The only solution, they
contend, is “to move toward a greater acceptance of uncertainty and embracing of
variation” (p. 901).
In this context, the language of betting emerges as an important tool of com-
munication. When statistical tests and conclusions are framed as bets, everyone
understands their limitations. Great success in betting against probabilities may be
the best evidence we can have that the probabilities are wrong, but everyone un-
derstands that such success may be mere luck. Moreover, candor about the betting
aspects of scientific exploration can communicate truths about the games scien-
tists must and do play — honest games that are essential to the advancement of
knowledge.
This paper has developed new ways of expressing statistical results with betting
language. The basic concepts are bet (not necessarily all-or-nothing), betting score
(equivalent to likelihood ratio when the bets offered define a probability distribu-
tion), implied target (an alternative to power), and (1/α)-warranty (a generalization
of (1 − α)-confidence). Substantial research is needed to apply these concepts to
complex models, but their greatest utility may be in communicating the uncertainty
of simple tests and estimates.
Are the probabilities tested subjective or objective? The probabilities may represent
someone’s opinion, but the hypothesis that they say something true about the world
is inherent in the project of testing them.
Why is the proposal to test by betting better than other proposals for remedying
the misuse of p-values? Many authors have proposed remedying the misuse of p-
values by supplementing them with additional information (Wasserstein et al., 2019;
Mayo, 2018). Sometimes the additional information involves Bayesian calculations
(Bayarri et al., 2016; Matthews, 2018). Sometimes it involves likelihood ratios
(Colquhoun, 2019). Sometimes it involves attained power (Mayo and Spanos, 2006).
I find nearly all these proposals persuasive as ways of correcting the misunder-
standings and misinterpretations to which p-values are susceptible. Each of them
might be used by a highly trained mathematical statistician to explain what has
gone wrong to another highly trained mathematical statistician. But adding more
complexity to the already overly complex idea of a p-value may not help those who
are not specialists in mathematical statistics. We need strategies for communicating
with millions of people. Worldwide, we teach p-values to millions every year, and
hundreds of thousands of them may eventually use statistical tests in one way or
another.
The strongest argument for betting scores as a replacement for p-values is their
simplicity. I do not know any other proposal that is equally simple.
References
Aalen, O. O., P. K. Andersen, Ø. Borgan, R. D. Gill, and N. Keiding (2009). History
of applications of martingales in survival analysis. Electronic Journal for History
of Probability and Statistics 5 (1).
Amrhein, V., S. Greenland, and B. McShane et al. (2019). Retire statistical signif-
icance. Nature 567, 305–307.
Bienvenu, L., G. Shafer, and A. Shen (2009). On the history of martingales in the
study of randomness. Electronic Journal for History of Probability and Statis-
tics 5 (1).
Breiman, L. (1961). Optimal gambling systems for favorable games. In J. Neyman
(Ed.), Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics
and Probability, Volume 1 (Contributions to the Theory of Statistics), Berkeley,
CA, pp. 65–78. University of California Press.
Cready, W. M., J. He, W. Lin, C. Shao, D. Wang, and Y. Zhang (2019). Is there a
confidence interval for that? A critical examination of null outcome reporting in
accounting research. Available at SSRN: https://ssrn.com/abstract=3131251
or http://dx.doi.org/10.2139/ssrn.3131251.
Dempster, A. P. (1997). The direct use of likelihood for significance testing. Statis-
tics and Computing 7 (4), 247–252. This article is followed on pages 253–272 by
a related article by Murray Aitkin and further discussion by Dempster, Aitkin,
and Mervyn Stone. It originally appeared on pages 335–354 of ?) along with
discussion by George Barnard and David Cox.