spectacularly beautiful formulas when the underlying dynamics is correctly specified, it offers little guidance in choosing or validating a model. There is always the danger that misspecification of a model leads to erroneous valuation and hedging strategies. Hence, there are genuine needs for flexible stochastic modeling. Nonparametric methods offer a unified and elegant treatment for such a purpose.

Nonparametric approaches have recently been introduced to estimate return, volatility, transition densities and state price densities of stock prices and bond yields (interest rates). They are also useful for examining the extent to which the dynamics of stock prices and bond yields vary over time. They have immediate applications to the valuation of bond price and stock options and management of market risks. They can also be employed to test economic theory such as the capital asset pricing model and stochastic discount model [28], and to answer questions such as whether the geometric Brownian motion fits certain stock indices, whether the Cox-Ingersoll-Ross model fits yields of bonds, and whether interest rate dynamics evolve with time. Furthermore, based on empirical data, one can also fit directly the observed option prices with their associated characteristics such as strike price, time to maturity, risk-free interest rate and dividend yield, and see if the option prices are consistent with the theoretical ones. Needless to say, nonparametric techniques will play an increasingly important role in financial econometrics, thanks to the availability of modern computing power and the development of financial econometrics.

The paper is organized as follows. We first introduce in Section 2 some useful stochastic models for modeling stock prices and bond yields and then briefly outline some probabilistic aspects of the models. In Section 3 we review nonparametric techniques used for estimating the drift and diffusion functions, based on either discretely or continuously observed data. In Section 4 we outline techniques for estimating state price densities and transition densities. Their applications in asset pricing and testing for parametric diffusion models are also introduced. Section 5 makes some concluding remarks.

2. STOCHASTIC DIFFUSION MODELS

Much of financial econometrics is concerned with asset pricing, portfolio choice and risk management. Stochastic diffusion models have been widely used for describing the dynamics of underlying economic variables and asset prices. They form the basis of many spectacularly beautiful formulas for pricing contingent claims. For an introduction to financial derivatives, see Hull [78].

2.1 One-Factor Diffusion Models

Let S_t denote the stock price observed at time t. The time unit can be hourly, daily, weekly, among others. Presented in Figure 1(a) are the daily log-returns, defined as

log(S_t) − log(S_{t−1}) ≈ (S_t − S_{t−1})/S_{t−1},

of the Standard and Poor's 500 index, a value-weighted index based on the prices of the 500 stocks that account for approximately 70% of the total U.S. equity (stock) market capitalization. The stylized features of the returns include that the volatility tends to cluster and that the (marginal) mean and variance of the returns tend to be constant. One simplified model to capture the second feature is that

log(S_t) − log(S_{t−1}) = µ_0 + σ_0 ε_t,

where {ε_t} is a sequence of independent normal random variables. This is basically a random walk hypothesis, regarding the stock price movement as an independent random walk. When the sampling time unit Δ gets small, the above random walk can be regarded as a random sample from the continuous-time process:

(1)  d log(S_t) = µ_0 dt + σ_1 dW_t,

where {W_t} is a standard one-dimensional Brownian motion and σ_1 = σ_0/√Δ. The process (1) is called geometric Brownian motion, as S_t is an exponential of the Brownian motion W_t. It was used by Osborne [92] to model the stock price dynamics and by Black and Scholes [23] to derive their celebrated option pricing formula.

Interest rates are fundamental to financial markets, consumer spending, corporate earnings, asset pricing, inflation and the economy. The bond market is even bigger than the equity market. Presented in Figure 1(c) are the interest rates {r_t} of the two-year U.S. Treasury notes at a weekly frequency. As the interest rates get higher, so do the volatilities. To appreciate this, Figure 1(d) plots the pairs {(r_{t−1}, r_t − r_{t−1})}. Its dynamic is very different from that of the equity market. The interest rates should be nonnegative. They possess heteroscedasticity in addition to the mean-reversion property: as the interest rates rise above the mean level α, there is a negative drift that pulls the rates down, while when the interest rates fall below α, there is a positive force that drives the rates up. To capture these two
A SELECTIVE OVERVIEW 319
FIG. 1. (a) Daily log-returns of the Standard and Poor's 500 index from October 21, 1980 to July 29, 2004. (b) Scatterplot of the returns against the logarithm of the index (price level). (c) Interest rates of two-year U.S. Treasury notes from June 4, 1976 to March 7, 2003 sampled at weekly frequency. (d) Scatterplot of the difference of yields versus the yields.
main features, Cox, Ingersoll and Ross [37] derived the following model for the interest rate dynamic:

(2)  dr_t = κ(α − r_t) dt + σ r_t^{1/2} dW_t.

For simplicity, we will refer to it as the CIR model. It is an amelioration of the Vasicek model [106],

(3)  dr_t = κ(α − r_t) dt + σ dW_t,

which ignores the heteroscedasticity and is also referred to as the Ornstein-Uhlenbeck process. While this is an unrealistic model for interest rates, the process is Gaussian with explicit transition density. In fact, the time series sampled from (3) at spacing Δ follows the autoregressive model of order 1,

(4)  Y_t = (1 − ρ)α + ρ Y_{t−1} + ε_t,

where Y_t = r_{tΔ}, ε_t ∼ N(0, σ²(1 − ρ²)/(2κ)) and ρ = exp(−κΔ). Hence, the process is well understood and usually serves as a test case for proposed statistical methods.
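As a quick illustration of (4) (a minimal sketch, not part of the original text; the parameter values are arbitrary), the Vasicek model can be simulated exactly through this AR(1) representation:

```python
import numpy as np

# Exact simulation of the Vasicek model (3) via its AR(1) form (4).
kappa, alpha, sigma, delta = 0.5, 0.06, 0.02, 1 / 52   # weekly sampling (assumed)
rho = np.exp(-kappa * delta)                            # AR(1) coefficient
eps_sd = sigma * np.sqrt((1 - rho**2) / (2 * kappa))    # sd of eps_t in (4)

rng = np.random.default_rng(0)
n = 1000
y = np.empty(n)
y[0] = alpha                       # start at the mean level
for t in range(1, n):
    y[t] = (1 - rho) * alpha + rho * y[t - 1] + eps_sd * rng.standard_normal()
# {y[t]} has the same law as the Vasicek process observed at spacing delta
```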
There are many stochastic models that have been introduced to model the dynamics of stocks and bonds. Let X_t be an observed economic variable at time t. This can be the price of a stock or a stock index, or the yield of a bond. A simple and frequently used stochastic model is

(5)  dX_t = µ(X_t) dt + σ(X_t) dW_t.

The function µ(·) is often called a drift or instantaneous return function and σ(·) is referred to as a diffusion or volatility function, since

µ(X_t) = lim_{Δ→0} Δ^{−1} E(X_{t+Δ} − X_t | X_t),

σ²(X_t) = lim_{Δ→0} Δ^{−1} var(X_{t+Δ} | X_t).
The time-homogeneous model (5) contains many famous one-factor models in financial econometrics. In an effort to improve the flexibility of modeling interest rate dynamics, Chan et al. [29] extend the CIR model (2) to the CKLS model,

(6)  dX_t = κ(α − X_t) dt + σ X_t^γ dW_t.

Aït-Sahalia [3] introduces a nonlinear mean reversion: while interest rates remain in the middle part of their domain, there is little mean reversion, and at the end of the domain, a strong nonlinear mean reversion emerges. He imposes a nonlinear drift of the form (α_0 X_t^{−1} + α_1 + α_2 X_t + α_3 X_t²). See also Ahn and Gao [1], which models the interest rates by Y_t = X_t^{−1}, in which the X_t follows the CIR model.

Economic conditions vary over time. Thus, it is reasonable to expect that the instantaneous return and volatility depend on both time and price level for a given state variable such as stock prices and bond yields. This leads to a further generalization of model (5) to allow the coefficients to depend on time t:

(7)  dX_t = µ(X_t, t) dt + σ(X_t, t) dW_t.

Since only a trajectory of the process is observed [see Figure 1(c)], there is not sufficient information to estimate the bivariate functions in (7) without further restrictions. [To consistently estimate the bivariate volatility function σ(x, t), we need to have data that eventually fill up a neighborhood of the point (t, x).] A useful specification of model (7) is

(8)  dX_t = {α_0(t) + α_1(t) X_t} dt + β_0(t) X_t^{β_1(t)} dW_t.

This is an extension of the CKLS model (6) by allowing the coefficients to depend on time and was introduced and studied by Fan et al. [48]. Model (8) includes many commonly used time-varying models for the yields of bonds, introduced by Ho and Lee [75], Hull and White [79], Black, Derman and Toy [21] and Black and Karasinski [22], among others. The experience in [48] and other studies of the varying coefficient models [26, 31, 74, 76] shows that the coefficient functions in (8) cannot be estimated reliably due to the collinearity effect in local estimation: localizing in the time domain, the process {X_t} is nearly constant and hence α_0(t) and α_1(t), and β_0(t) and β_1(t), cannot easily be differentiated. This leads Fan et al. [48] to introduce the semiparametric model

(9)  dX_t = {α_0(t) + α_1 X_t} dt + β_0(t) X_t^{β_1} dW_t

to avoid the collinearity.

2.2 Some Probabilistic Aspects

The question when there exists a solution to the stochastic differential equation (SDE) (7) arises naturally. Such a program was first carried out by Itô [80, 81]. For SDE (7), there are two different meanings of solution: strong solution and weak solution. See Sections 5.2 and 5.3 of [84]. Basically, for a given initial condition ξ, a strong solution requires that X_t is determined completely by the information up to time t. Under Lipschitz and linear growth conditions on the drift and diffusion functions, for every ξ that is independent of {W_s}, there exists a strong solution of equation (7). Such a solution is unique. See Theorem 2.9 of [84].

For the one-dimensional time-homogeneous diffusion process (5), weaker conditions can be obtained for the so-called weak solution. By an application of the Itô formula to an appropriate transform of the process, one can make the transformed process have zero drift. Thus, we can consider without loss of generality that the drift in (5) is zero. For such a model, Engelbert and Schmidt [45] give a necessary and sufficient condition for the existence of the solution. The continuity of σ suffices for the existence of the weak solution. See Theorem 5.5.4 of [84], page 333, and Theorem 23.1 of [83].

We will use the Itô formula several times. For the process X_t in (7), for a sufficiently regular function f ([84], page 153),

(10)  df(X_t, t) = {∂f(X_t, t)/∂t + (1/2) σ²(X_t, t) ∂²f(X_t, t)/∂x²} dt + {∂f(X_t, t)/∂x} dX_t.

The formula can be understood as the second-order Taylor expansion of f(X_{t+Δ}, t + Δ) − f(X_t, t) by noticing that (X_{t+Δ} − X_t)² is approximately σ²(X_t, t)Δ.
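As a quick worked instance of (10) (added here for concreteness), take f(x, t) = log x and the geometric Brownian motion dX_t = µX_t dt + σX_t dW_t, so that ∂f/∂t = 0, ∂f/∂x = 1/x and ∂²f/∂x² = −1/x²:

d log(X_t) = {(1/2) σ²X_t² · (−1/X_t²)} dt + (1/X_t) dX_t = (µ − σ²/2) dt + σ dW_t.

With µ replaced by r, this is exactly the computation behind (17) below.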
The Markovian property plays an important role in statistical inference. According to Theorem 5.4.20 of [84], the solution X_t to equation (5) is Markovian, provided that the coefficient functions µ and σ are bounded on compact subsets. Let p_Δ(y|x) be the transition density, the conditional density of X_{t+Δ} = y given X_t = x. The transition density must satisfy the forward and backward Kolmogorov equations ([84], page 282).

Under the linear growth and Lipschitz conditions, and additional conditions on the boundary behavior of
the functions µ and σ, the solution to equation (5) is positive and ergodic. The invariant density is given by

(11)  f(x) = 2C_0 σ^{−2}(x) exp{∫_·^x 2µ(y) σ^{−2}(y) dy},

where C_0 is a normalizing constant and the lower limit of the integral does not matter. If the initial distribution is taken from the invariant density, then the process {X_t} is stationary with the marginal density f and transition density p_Δ.

Stationarity plays an important role in time series analysis and forecasting [50]. The structural invariability allows us to forecast the future based on the historical data. For example, the structural relation (e.g., the conditional distribution, conditional moments) between X_t and X_{t+Δ} remains the same over time t. This makes it possible to use historical data to estimate the invariant quantities. Associated with stationarity is the concept of mixing, which says that data that are far apart in time are nearly independent. We now describe the conditions under which the solution to the SDE (5) is geometrically mixing.

Let H_t be the operator defined by

(12)  (H_t g)(x) = E{g(X_t) | X_0 = x},  x ∈ R,

where g is a Borel measurable bounded function on R. A stationary process X_t is said to satisfy the condition G2(s, α) of Rosenblatt [95] if there exists an s such that

‖H_s‖²_2 = sup_{f: Ef(X)=0} E(H_s f)²(X) / Ef²(X) ≤ α² < 1,

namely, the operator is contractive. As a consequence of the semigroup (H_{s+t} = H_s H_t) and contraction properties, the condition G2 implies [16, 17] that for any t ∈ [0, ∞), ‖H_t‖_2 ≤ α^{t/s−1}. The latter implies, by the Cauchy-Schwarz inequality, that

(13)  ρ(t) = sup_{g_1, g_2} corr{g_1(X_0), g_2(X_t)} ≤ α^{t/s−1},

that is, the ρ-mixing coefficient decays exponentially fast. Banon and Nguyen [18] show further that for a stationary Markov process, ρ(t) → 0 is equivalent to (13), namely, ρ-mixing and geometric ρ-mixing are equivalent.

2.3 Valuation of Contingent Claims

An important application of SDE is the pricing of financial derivatives such as options and bonds. It forms a beautiful modern asset pricing theory and provides useful guidance in practice. Steele [105], Duffie [42] and Hull [78] offer very nice introductions to the field.

The simplest financial derivative is the European call option. A call option is the right to buy an asset at a certain price K (strike price) before or at expiration time T. A put option gives the right to sell an asset at a certain price K (strike price) before or at expiration. European options allow option holders to exercise only at maturity, while American options can be exercised at any time before expiration. Most stock options are American, while options on stock indices are European.

The payoff for a European call option is (X_T − K)_+, where X_T is the price of the stock at expiration T. When the stock rises above the strike price K, one can exercise the right and make a profit of X_T − K. However, when the stock falls below K, one renders one's right and makes no profit. Similarly, a European put option has payoff (K − X_T)_+. See Figure 2. By creating a portfolio with different maturities and different strike prices, one can obtain all kinds of payoff functions. As an example, suppose that a portfolio of options consists of contracts of the S&P 500 index maturing in six months: one call option with strike price $1,200, one put option with strike price $1,050 and $40 cash, but with short positions (borrowing, or −1 contract) on a call option with strike price $1,150 and on a put option with strike price $1,100. Figure 2(c) shows the payoff function of such a portfolio of options at the expiration T. Clearly, such an investor bets that the S&P 500 index will be around $1,125 in six months and limits the risk exposure on the investment (losing at most $10 if his/her bet is wrong). Thus, the European call and put options are fundamental options as far as the payoff function at time T is concerned. There are many other exotic options such as Asian options, look-back options and barrier options, which have different payoff functions, and the payoffs can be path dependent. See Chapter 18 of [78].

Suppose that the asset price follows the SDE (7) and there is a riskless investment alternative such as a bond which earns compounding rate of interest r_t. Suppose that the underlying asset pays no dividend. Let β_t be the value of the riskless bond at time t. Then, with an initial investment β_0,

β_t = β_0 exp(∫_0^t r_s ds),
FIG. 2. (a) Payoff of a call option. (b) Payoff of a put option. (c) Payoff of a portfolio of four options with different strike prices and different (long and short) positions.
thanks to the compounding of interest. Suppose that a probability measure Q is equivalent to the original probability measure P, namely P(A) = 0 if and only if Q(A) = 0. The measure Q is called an equivalent martingale measure for deflated price processes of given securities if these processes are martingales with respect to Q. An equivalent martingale measure is also referred to as a "risk-neutral" measure if the deflater is the bond price process. See Chapter 6 of [42].

When the markets are dynamically complete, the price of the European option with payoff Ψ(X_T) and initial price X_0 = x_0 is

(14)  P_0 = exp(−∫_0^T r_s ds) E^Q{Ψ(X_T) | X_0 = x_0},

where Q is the equivalent martingale measure for the deflated price process X_t/β_t. Namely, it is the discounted value of the expected payoff in the risk-neutral world. The formula is derived by using the so-called relative pricing approach, which values the price of the option from given prices of a portfolio consisting of a risk-free bond and a stock with the identical payoff as the option at the expiration.

As an illustrative example, suppose that the price of a stock follows the geometric Brownian motion dX_t = µX_t dt + σX_t dW_t and that the risk-free rate r is constant. Then the deflated price process Y_t = exp(−rt)X_t follows the SDE dY_t = (µ − r)Y_t dt + σY_t dW_t, which is not a martingale under P unless µ = r. One therefore needs an equivalent measure that makes the drift zero. To achieve this, we appeal to the Girsanov theorem, which changes the drift of a diffusion process without altering the diffusion via a change of probability measure. Under the "risk-neutral" probability measure Q, the process Y_t satisfies dY_t = σY_t dW_t, a martingale. Hence, the price process X_t = exp(rt)Y_t under Q follows

(15)  dX_t = rX_t dt + σX_t dW_t.

Using exactly the same derivation, one can easily generalize the result to the price process (5). Under the risk-neutral measure, the price process (5) follows

(16)  dX_t = rX_t dt + σ(X_t) dW_t.

The intuitive explanation of this is clear: all stocks under the "risk-neutral" world are expected to earn the same rate as the risk-free bond.

For the geometric Brownian motion, by an application of the Itô formula (10) to (15), we have under the "risk-neutral" measure

(17)  log X_t − log X_0 = (r − σ²/2)t + σW_t.

Note that given the initial price X_0, the price follows a log-normal distribution. Evaluating the expectation in (14) for the European call option with payoff Ψ(X_T) = (X_T − K)_+, one obtains the Black-Scholes [23] option pricing formula

(18)  P_0 = x_0 Φ(d_1) − K exp(−rT) Φ(d_2),

where Φ is the standard normal distribution function, d_1 = {log(x_0/K) + (r + σ²/2)T}/(σ√T) and d_2 = d_1 − σ√T.
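To make (14), (17) and (18) concrete, here is a minimal sketch (not part of the original; the inputs are arbitrary) that evaluates (14) by Monte Carlo under the exact risk-neutral solution (17) and compares it with (18):

```python
import numpy as np
from scipy.stats import norm

# Assumed inputs: spot x0, strike K, maturity T (years), rate r, volatility sigma
x0, K, T, r, sigma = 100.0, 105.0, 0.5, 0.05, 0.2

rng = np.random.default_rng(1)
WT = np.sqrt(T) * rng.standard_normal(1_000_000)          # W_T ~ N(0, T)
XT = x0 * np.exp((r - 0.5 * sigma**2) * T + sigma * WT)   # exact solution (17)
mc_price = np.exp(-r * T) * np.mean(np.maximum(XT - K, 0.0))  # formula (14)

d1 = (np.log(x0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
d2 = d1 - sigma * np.sqrt(T)
bs_price = x0 * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)  # formula (18)

print(mc_price, bs_price)  # the two values agree up to Monte Carlo error
```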
2.4 Simulation of Stochastic Models

Simulation methods provide useful tools for the valuation of financial derivatives and other financial instruments when the analytical formula (14) is hard to obtain. For example, if the price under the "risk-neutral" measure follows (16), the formula for pricing derivatives is usually not analytically tractable, and simulation methods offer viable alternatives (together with variance reduction techniques) to evaluate it. They also provide useful tools for assessing the performance of statistical methods and statistical inferences.

The simplest method is perhaps the Euler scheme. The SDE (7) is approximated as

(19)  X_{t+Δ} = X_t + µ(t, X_t)Δ + σ(t, X_t)Δ^{1/2} ε_t,

where {ε_t} is a sequence of independent random variables with the standard normal distribution. The time unit is usually a year. Thus, monthly, weekly and daily data correspond, respectively, to Δ = 1/12, 1/52 and 1/252 (there are approximately 252 trading days per year). Given an initial value, one can recursively apply (19) to obtain a sequence of simulated data {X_{jΔ}, j = 1, 2, ...}. The approximation error can be reduced if one uses a smaller step size Δ/M for a given integer M to first obtain a more detailed sequence {X_{jΔ/M}, j = 1, 2, ...}, and then takes the subsequence {X_{jΔ}, j = 1, 2, ...}. For example, to simulate daily prices of a stock, one can simulate hourly data first and then take the daily closing prices. Since the step size Δ/M is smaller, the approximation (19) is more accurate. However, the computational cost is about a factor of M higher.
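A minimal sketch of the Euler scheme (19) with the refinement just described (the drift, diffusion and parameter values are assumptions for illustration):

```python
import numpy as np

def euler_path(mu, sigma, x0, n, delta, M=1, seed=0):
    """Simulate n steps of dX = mu(t,X) dt + sigma(t,X) dW at spacing delta,
    internally stepping at delta/M and keeping every M-th point."""
    rng = np.random.default_rng(seed)
    d = delta / M
    x, path = x0, [x0]
    for j in range(n * M):
        t = j * d
        x = x + mu(t, x) * d + sigma(t, x) * np.sqrt(d) * rng.standard_normal()
        if (j + 1) % M == 0:              # keep the subsequence {X_{j delta}}
            path.append(x)
    return np.array(path)

# Example: CIR model (2); max(x, 0) guards the square root between steps
kappa, alpha, sig = 0.21459, 0.08571, 0.07830
path = euler_path(lambda t, x: kappa * (alpha - x),
                  lambda t, x: sig * np.sqrt(max(x, 0.0)),
                  x0=alpha, n=1000, delta=1 / 12, M=10)
```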
The Euler scheme has convergence rate Δ^{1/2}, which is called a strong order 0.5 approximation by Kloeden et al. [87]. Higher-order approximations can be obtained by the Itô-Taylor expansion (see [100], page 242). In particular, a strong order-one approximation is given by

(20)  X_{t+Δ} = X_t + µ(t, X_t)Δ + σ(t, X_t)Δ^{1/2} ε_t + (1/2) σ(t, X_t) σ_x(t, X_t)Δ{ε_t² − 1},

where σ_x(t, x) is the partial derivative with respect to x. This method can be combined with the smaller-step-size method in the last paragraph. For the time-homogeneous model (5), an alternative form, without evaluating the derivative function, is given in (3.14) of [87].
The exact simulation method is available if one can simulate the data from the transition density. Given the current value X_t = x_0, one draws X_{t+Δ} from the transition density p_Δ(·|x_0). The initial condition can either be fixed at a given value or be generated from the invariant density (11). In the latter case, the generated sequence is stationary.

There are only a few processes where exact simulation is possible. For GBM, one can generate the sequence from the explicit solution (17), where the Brownian motion can be simulated from independent Gaussian increments. The conditional density of Vasicek's model (3) is Gaussian with mean α + (x_0 − α)ρ and variance σ_Δ² = σ²(1 − ρ²)/(2κ), as indicated by (4). Generate X_0 from the invariant density N(α, σ²/(2κ)). With X_0, generate X_Δ from the normal distribution with mean α + (X_0 − α) exp(−κΔ) and variance σ_Δ². With X_Δ, we generate X_{2Δ} from the normal distribution with mean α + (X_Δ − α) exp(−κΔ) and variance σ_Δ². Repeat this process until we obtain the desired length of the process.

For the CIR model (2), provided that q = 2κα/σ² − 1 ≥ 0 (a sufficient condition for X_t ≥ 0), the transition density is determined by the fact that given X_t = x_0, 2cX_{t+Δ} has a noncentral χ² distribution with degrees of freedom 2q + 2 and noncentrality parameter 2u, where c = 2κ/{σ²(1 − exp(−κΔ))} and u = cx_0 exp(−κΔ). The invariant density is the Gamma distribution with shape parameter q + 1 and scale parameter σ²/(2κ).
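A sketch of this exact CIR transition sampling (the parameter values are those of the illustration below; the noncentral chi-square representation is the one just stated):

```python
import numpy as np

kappa, alpha, sigma, delta = 0.21459, 0.08571, 0.07830, 1 / 12
q = 2 * kappa * alpha / sigma**2 - 1                      # requires q >= 0
c = 2 * kappa / (sigma**2 * (1 - np.exp(-kappa * delta)))

rng = np.random.default_rng(2)
n = 1000
x = np.empty(n + 1)
x[0] = rng.gamma(shape=q + 1, scale=sigma**2 / (2 * kappa))  # invariant Gamma law
for i in range(n):
    u = c * x[i] * np.exp(-kappa * delta)                 # noncentrality is 2u
    x[i + 1] = rng.noncentral_chisquare(df=2 * q + 2, nonc=2 * u) / (2 * c)
# {x[i]} is an exact, stationary sample path of the CIR model (2)
```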
As an illustration, we consider the CIR model (2) with parameters κ = 0.21459, α = 0.08571, σ = 0.07830 and Δ = 1/12. The model parameters are taken from [30]. We simulated 1000 monthly data values using both the Euler scheme (19) and the strong order-one approximation (20) with the same random shocks. Figure 3 depicts one of their trajectories. The difference is negligible. This is in line with the observations made by Stanton [104] that as long as data are sampled monthly or more frequently, the errors introduced by using the Euler approximation are very small for stochastic dynamics that are similar to the CIR model.

3. ESTIMATION OF RETURN AND VOLATILITY FUNCTIONS

There is a large literature on the estimation of the return and volatility functions. Early references include [93] and [94]. Some studies are based on continuously observed data while others are based on discretely observed data. For the latter, some regard Δ as tending to zero while others regard Δ as fixed. We briefly introduce some of the ideas.
FIG. 3. Simulated trajectories (multiplied by 100) using the Euler approximation and the strong order-one approximation for a CIR model. Top panel: the solid curve corresponds to the Euler approximation and the dashed curve is based on the order-one approximation. Bottom panel: the difference between the order-one scheme and the Euler scheme.
3.1 Methods of Estimation

We first outline several methods of estimation for parametric models. The ideas can be extended to nonparametric models. Suppose that we have a sample {X_{iΔ}, i = 0, ..., n} from model (5). Then the likelihood function, under the stationary condition, is

(21)  log f(X_0) + Σ_{i=1}^n log p_Δ(X_{iΔ} | X_{(i−1)Δ}).

If the functions µ and σ are parameterized and the explicit form of the transition density is available, one can apply the maximum likelihood method. However, the explicit form of the transition density is not available for many simple models such as the CKLS model (6). Even for the CIR model (2), its maximum likelihood estimator is very difficult to find, as the transition density involves the modified Bessel function of the first kind.

One simple technique is to rely on the Euler approximation scheme (19) and then proceed as if the data come from the Gaussian location and scale model. This method works well when Δ is small, but can create some biases when Δ is large.
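A minimal sketch of this crude Euler-likelihood estimator for the Vasicek model (3) (the data-generating parameters are assumptions for illustration):

```python
import numpy as np
from scipy.optimize import minimize

def euler_negloglik(theta, x, delta):
    """Negative Gaussian log-likelihood implied by the Euler scheme (19)."""
    kappa, alpha, sigma = theta
    if kappa <= 0 or sigma <= 0:
        return np.inf
    mean = x[:-1] + kappa * (alpha - x[:-1]) * delta   # conditional mean
    var = sigma**2 * delta                             # conditional variance
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (x[1:] - mean)**2 / var)

# Simulate a Vasicek path exactly (as in Section 2.4) and fit it
kappa0, alpha0, sigma0, delta = 0.5, 0.06, 0.02, 1 / 52
rho = np.exp(-kappa0 * delta)
rng = np.random.default_rng(3)
x = np.empty(2000); x[0] = alpha0
for t in range(1, 2000):
    x[t] = (1 - rho) * alpha0 + rho * x[t - 1] + \
        sigma0 * np.sqrt((1 - rho**2) / (2 * kappa0)) * rng.standard_normal()

fit = minimize(euler_negloglik, x0=(1.0, 0.05, 0.01), args=(x, delta),
               method="Nelder-Mead")
print(fit.x)  # pseudo-MLE of (kappa, alpha, sigma); biased when delta is large
```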
However, the bias can be reduced by the following calibration idea, called indirect inference by Gouriéroux et al. [61]. The idea works as follows. Suppose that the functions µ and σ have been parameterized with unknown parameters θ. Use the Euler approximation (19) and the maximum likelihood method to obtain an estimate θ̂_0. For each given parameter θ around θ̂_0, simulate data from (5) and apply the crude method to obtain an estimate θ̂_1(θ) which depends on θ. Since we simulated the data with the true parameter θ, the function θ̂_1(θ) tells us how to calibrate the estimate. See Figure 4. Calibrating the estimate via θ̂_1^{−1}(θ̂_0) reduces the bias of the estimate. One drawback of this method is that it is computationally intensive and the calibration cannot easily be done when the dimensionality of the parameter θ is high.

Another method for bias reduction is to approximate the transition density in (21) by a higher-order approximation and then to maximize the approximated likelihood function. Such a scheme has been introduced by Aït-Sahalia [4, 5], who derives the expansion of the transition density around a normal density function using Hermite polynomials. The intuition behind such an expansion is that the increment X_{t+Δ} − X_t of the diffusion process (5) can be regarded as a sum of many independent increments with a very small step size, and hence an Edgeworth expansion can be obtained for the distribution of X_{t+Δ} − X_t given X_t. See also [43].

An "exact" approach is to use the method of moments. If the process X_t is stationary, as in the interest-rate models, the moment conditions can easily be derived by observing

E[lim_{Δ→0} Δ^{−1} E{g(X_{t+Δ}) − g(X_t) | X_t}] = lim_{Δ→0} Δ^{−1} E[g(X_{t+Δ}) − g(X_t)] = 0

for any function g satisfying the regularity condition that the limit and the expectation are exchangeable. The right-hand side is the expectation of dg(X_t). By Itô's formula (10), the above equation reduces to

(22)  E[g′(X_t)µ(X_t) + g″(X_t)σ²(X_t)/2] = 0.

For example, if g(x) = exp(−ax) for some given a > 0, then

E[exp(−aX_t){µ(X_t) − aσ²(X_t)/2}] = 0.

This can produce an arbitrary number of equations by choosing different a's. If the functions µ and σ are parameterized, the number of moment conditions can be more than the number of parameters. One way to use these conditions efficiently is the generalized method of moments introduced by Hansen [65], minimizing a quadratic form of the discrepancies between the empirical and the theoretical moments, a generalization of the classical method of moments, which solves the moment equations. The weighting matrix in the quadratic form can be chosen to optimize the performance of the resulting estimator. To improve the efficiency of the estimate, a large system of moments is needed. Thus, the generalized method of moments needs a large system of nonlinear equations, which can be expensive in computation. Further, the moment equations (22) use only the marginal information of the process. Hence, the procedure is not efficient. For example, in the CKLS model (6), σ and κ are estimable via (22) only through σ²/κ.
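A quick numerical check of the moment condition (22) with g(x) = exp(−ax) (a sketch, not from the paper; for the Vasicek model the stationary law is N(α, σ²/(2κ)), so we can sample it directly — the parameter values are assumptions):

```python
import numpy as np

kappa, alpha, sigma = 0.5, 0.06, 0.02
rng = np.random.default_rng(4)
# Draw directly from the Vasicek invariant density N(alpha, sigma^2/(2 kappa))
x = rng.normal(alpha, sigma / np.sqrt(2 * kappa), size=1_000_000)

for a in (1.0, 5.0, 10.0):
    # E[exp(-aX){mu(X) - a sigma^2(X)/2}] with mu(x) = kappa (alpha - x)
    m = np.mean(np.exp(-a * x) * (kappa * (alpha - x) - a * sigma**2 / 2))
    print(a, m)   # each sample average is ~0, up to Monte Carlo error
```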
FIG. 4. The idea of indirect inference. For each given true θ, one obtains an estimate using the Euler approximation and the simulated data. This gives a calibration curve as shown. Now for a given estimate θ̂_0 = 3 based on the Euler approximation and real data, one finds the calibrated estimate θ̂_1^{−1}(3) = 2.080.
3.2 Time-Homogeneous Model

The Euler approximation can easily be used to estimate the drift and diffusion nonparametrically. Let Y_{iΔ} = Δ^{−1}(X_{(i+1)Δ} − X_{iΔ}) and Z_{iΔ} = Δ^{−1}(X_{(i+1)Δ} − X_{iΔ})². Then

E(Y_{iΔ} | X_{iΔ}) = µ(X_{iΔ}) + O(Δ)

and

E(Z_{iΔ} | X_{iΔ}) = σ²(X_{iΔ}) + O(Δ).

Thus, µ(·) and σ²(·) can approximately be regarded as the regression functions of Y_{iΔ} and Z_{iΔ} on X_{iΔ}, respectively. Stanton [104] applies kernel regression [102, 107] to estimate the return and volatility functions. Let K(·) be a kernel function and h be a bandwidth. Stanton's estimators are given by

µ̂(x) = Σ_{i=0}^{n−1} Y_{iΔ} K_h(X_{iΔ} − x) / Σ_{i=0}^{n−1} K_h(X_{iΔ} − x)

and

σ̂²(x) = Σ_{i=0}^{n−1} Z_{iΔ} K_h(X_{iΔ} − x) / Σ_{i=0}^{n−1} K_h(X_{iΔ} − x),

where K_h(u) = h^{−1}K(u/h) is a rescaled kernel. The consistency and asymptotic normality of the estimators are studied in [15]. Fan and Yao [49] apply the local linear technique (Section 6.3 in [50]) to estimate the return and volatility functions, under a slightly different setup. The local linear estimator [46] is given by

(23)  µ̂(x) = Σ_{i=0}^{n−1} K_n(X_{iΔ} − x, x) Y_{iΔ},

where

(24)  K_n(u, x) = K_h(u) {S_{n,2}(x) − u S_{n,1}(x)} / {S_{n,2}(x) S_{n,0}(x) − S_{n,1}(x)²},

with S_{n,j}(x) = Σ_{i=0}^{n−1} K_h(X_{iΔ} − x)(X_{iΔ} − x)^j, is the equivalent kernel induced by the local linear fit. In contrast to the kernel method, the local linear weights depend on both X_{iΔ} and x. In particular, they satisfy

Σ_{i=0}^{n−1} K_n(X_{iΔ} − x, x) = 1

and

Σ_{i=0}^{n−1} K_n(X_{iΔ} − x, x)(X_{iΔ} − x) = 0.

These are the key properties for the bias reduction of the local linear method, as demonstrated in [46]. Further, Fan and Yao [49] use the squared residuals

Δ^{−1}{X_{(i+1)Δ} − X_{iΔ} − µ̂(X_{iΔ})Δ}²

rather than Z_{iΔ} to estimate the volatility function. This further reduces the approximation errors in the volatility estimation. They show further that the conditional variance function can be estimated as well as if the conditional mean function were known in advance.
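A sketch of these regressions in Stanton's kernel form, with a Gaussian kernel (the series x and the tuning constants are assumptions):

```python
import numpy as np

def kernel_drift_diffusion(x, delta, grid, h):
    """Nadaraya-Watson estimates of mu and sigma^2 from a path x at spacing delta."""
    Y = (x[1:] - x[:-1]) / delta           # responses for the drift
    Z = (x[1:] - x[:-1])**2 / delta        # responses for sigma^2
    X = x[:-1]
    mu_hat = np.empty(len(grid)); sig2_hat = np.empty(len(grid))
    for k, x0 in enumerate(grid):
        w = np.exp(-0.5 * ((X - x0) / h)**2)   # Gaussian kernel weights
        w /= w.sum()
        mu_hat[k] = np.sum(w * Y)
        sig2_hat[k] = np.sum(w * Z)
    return mu_hat, sig2_hat

# e.g., with the exact CIR path x from Section 2.4:
# grid = np.linspace(x.min(), x.max(), 50)
# mu_hat, sig2_hat = kernel_drift_diffusion(x, 1/12, grid, h=0.01)
```

The local linear version (23) replaces the Nadaraya-Watson weights by the equivalent kernel (24); the Fan-Yao variant replaces Z_{iΔ} by the squared residuals above.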
Stanton [104] derives higher-order approximation schemes up to order three in an effort to reduce biases. He suggests that higher-order approximations must outperform lower-order approximations. To verify such a claim, Fan and Zhang [53] derived the following order-k approximation scheme:

(25)  E(Y*_{iΔ} | X_{iΔ}) = µ(X_{iΔ}) + O(Δ^k),  E(Z*_{iΔ} | X_{iΔ}) = σ²(X_{iΔ}) + O(Δ^k),

where

Y*_{iΔ} = Δ^{−1} Σ_{j=1}^k a_{k,j} {X_{(i+j)Δ} − X_{iΔ}}

and

Z*_{iΔ} = Δ^{−1} Σ_{j=1}^k a_{k,j} {X_{(i+j)Δ} − X_{iΔ}}²,

and the coefficients a_{k,j} = (−1)^{j+1} (k choose j)/j are chosen to make the approximation errors in (25) of order Δ^k. For example, the second-order approximation is

1.5(X_{t+Δ} − X_t) − 0.5(X_{t+2Δ} − X_{t+Δ}).

By using the independent increments of Brownian motion, its variance is 1.5² + 0.5² = 2.5 times as large as that of the first-order difference. Indeed, Fan and Zhang [53] show that while higher-order approximations give better approximation errors, we have to pay a huge premium for variance inflation,

var(Y*_{iΔ} | X_{iΔ}) = σ²(X_{iΔ}) V_1(k) Δ^{−1} {1 + O(Δ)},

var(Z*_{iΔ} | X_{iΔ}) = 2σ⁴(X_{iΔ}) V_2(k) {1 + O(Δ)},

where the variance inflation factors V_1(k) and V_2(k) are explicitly given by Fan and Zhang [53]. Table 1 shows some of the numerical results for the variance inflation factors.

The above theoretical results have also been verified via empirical simulations in [53]. The problem is no monopoly for nonparametric fitting—it is shared by the parametric methods.
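A small sketch computing the weights a_{k,j} and the resulting drift variance inflation (the sum of squares of the weights carried by the successive increments; it reproduces the 1.5² + 0.5² = 2.5 quoted above for k = 2, and the mapping to V_1(k) is our reading of the construction):

```python
from math import comb

def weights(k):
    # a_{k,j} = (-1)^{j+1} C(k, j) / j, for j = 1, ..., k
    return [(-1) ** (j + 1) * comb(k, j) / j for j in range(1, k + 1)]

for k in (1, 2, 3):
    a = weights(k)
    # rewrite sum_j a_j (X_{t+jD} - X_t) on successive differences:
    # the difference X_{t+mD} - X_{t+(m-1)D} carries weight b_m = sum_{j>=m} a_j
    b = [sum(a[m:]) for m in range(k)]
    print(k, a, b, sum(w * w for w in b))   # last value: drift variance factor
```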
FIG. 5. Nonparametric estimates of volatility based on order one and order two differences. The bars represent two standard deviations above and below the estimated volatility. Top panel: order one fit. Bottom panel: order two fit.
Arapis and Gao [11] investigate the mean integrated squared error of several methods for estimating the drift and diffusion and compare their performances. Aït-Sahalia and Mykland [9, 10] study the effects of random and discrete sampling when estimating continuous-time diffusions. Bandi and Nguyen [14] investigate the small sample behavior of nonparametric diffusion estimators. A thorough study of nonparametric estimation of conditional variance functions can be found in [62, 69, 91, 99]. In particular, Section 8.7 of [50] gives various methods for estimating the conditional variance function. Wang [108] studies the relationship between diffusion and GARCH models.

3.3 Model Validation

Stanton [104] applies his kernel estimator to a Treasury bill data set and observes a nonlinear return function in his nonparametric estimate, particularly in the region where the interest rate is high (over 14%, say). This leads him to postulate the hypothesis that the return functions of short-term rates are nonlinear. Chapman and Pearson [30] study the finite sample properties of Stanton's estimator. By applying his procedure to the CIR model, they find that Stanton's procedure produces spurious nonlinearity, due to the boundary effect and the mean reversion.

Can we apply a formal statistical test to Stanton's hypothesis? The null hypothesis can simply be formulated: the drift is of a linear form as in model (6). What is the alternative hypothesis? For such a problem our alternative model is usually vague. Hence, it is natural to assume that the drift is a nonlinear smooth function. This becomes a testing problem with a parametric null hypothesis versus a nonparametric alternative hypothesis. There is a large body of literature on this. The basic idea is to compute a discrepancy measure between the parametric and nonparametric estimates and to reject the parametric hypothesis when the discrepancy is large. See, for example, the book by Hart [73].

In an effort to derive a generally applicable principle, Fan et al. [54] propose the generalized likelihood ratio (GLR) tests for parametric-versus-nonparametric or nonparametric-versus-nonparametric hypotheses. The basic idea is to replace the maximum likelihood under a nonparametric hypothesis (which usually does not exist) by the likelihood under good nonparametric estimates. Section 9.3 of [50] gives details on the implementation of the GLR tests, including estimating P-values, bias reduction and bandwidth selection. The method has been successfully employed by Fan and Zhang [53] for checking whether the return and volatility functions possess certain parametric forms.

Another viable approach to model validation is to base it on the transition density. One can check whether the nonparametrically estimated transition density is significantly different from the parametrically estimated one. Section 4.3 provides some additional details. Another approach, proposed by Hong and Li [77], uses the fact that under the null hypothesis the random variables {Z_i} are a sequence of i.i.d. uniform random variables, where Z_i = P(X_{iΔ} | X_{(i−1)Δ}, θ) and P(y|x, θ) is the transition distribution function. They propose to detect departures from the null hypothesis by comparing the kernel-estimated bivariate density of {(Z_i, Z_{i+1})} with that of the uniform distribution on the unit square. The transition-density-based approaches appear more elegant as they check simultaneously the forms of the drift and diffusion. However, the transition density often does not admit an analytic form and the tests can be computationally intensive.

3.4 Fixed Sampling Interval

For practical analysis of financial data, it is hard to determine whether the sampling interval Δ tends to zero. The key determination is whether the approximation errors for small "Δ" are negligible. It is ideal when a method is applicable whether or not "Δ" is small. This kind of method is possible, as demonstrated below.

The simplest problem to illustrate the idea is the kernel density estimation of the invariant density of the stationary process {X_t}. For the given sample {X_{iΔ}}, the kernel density estimate of the invariant density is

(26)  f̂(x) = n^{−1} Σ_{i=1}^n K_h(X_{iΔ} − x),

based on the discrete data {X_{iΔ}, i = 1, ..., n}. This method is valid for all Δ. It gives a consistent estimate of f as long as the time horizon is long: nΔ → ∞. We will refer to this kind of nonparametric method as state-domain smoothing, as the procedure localizes in the state variable X_t. Various properties, including consistency and asymptotic normality, of the kernel estimator (26) are studied by Bandi [13] and Bandi and Phillips [15]. Bandi [13] also uses the estimator (26), which is the same as the local time of the process spent at a point x except for a scaling constant, as a descriptive tool for potentially nonstationary diffusion processes.
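A sketch of the estimator (26) with a Gaussian kernel (the series x, the grid and the bandwidth are assumptions):

```python
import numpy as np

def invariant_density(x, grid, h):
    """Kernel density estimate (26): f_hat on a grid, with K Gaussian and K_h(u) = K(u/h)/h."""
    u = (grid[:, None] - x[None, :]) / h
    return np.exp(-0.5 * u**2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))

# e.g., grid = np.linspace(x.min(), x.max(), 100); f_hat = invariant_density(x, grid, 0.005)
```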
Why can the state-domain smoothing methods be employed as if the data were independent? This is due
to the fact that localizing in the state domain weakens the correlation structure and that nonparametric estimates use essentially only local data. Hence many results on nonparametric estimators for independent data continue to hold for dependent data as long as their mixing coefficients decay sufficiently fast. As mentioned at the end of Section 2.2, geometric mixing and mixing are equivalent for time-homogeneous diffusion processes. Hence, the mixing coefficients usually decay sufficiently fast for theoretical investigation.

The localizing and whitening can be understood graphically in Figure 6. Figure 6(a) shows that there is very strong serial correlation in the yields of the two-year Treasury notes. However, this correlation is significantly weakened for the local data in the neighborhood 8% ± 0.2%. In fact, as detailed in Figure 6(b), the indices of the data that fall in the local window are quite far apart. This in turn implies weak dependence for the data in the local window, that is, "whitening by windowing." See Section 5.4 of [50] and Hart [72] for further details. The effect of the dependence structure on kernel density estimation was thoroughly studied by Claeskens and Hall [35].

The diffusion function can also be consistently estimated when Δ is fixed. In pricing the derivatives of interest rates, Aït-Sahalia [2] assumes µ(x) = κ(α − x). Using the kernel density estimator f̂, and κ and α estimated by a least-squares method, he applied (11) to estimate σ(·):

σ̂²(x) = 2 ∫_0^x µ̂(u) f̂(u) du / f̂(x).

He further established the asymptotic normality of such an estimator.
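A sketch of this fixed-Δ construction by numerical integration (grid, f_hat, kappa and alpha are assumed inputs, e.g., from the kernel density sketch in Section 3.4 and a least-squares fit of the linear drift):

```python
import numpy as np

def sigma2_from_density(grid, f_hat, kappa, alpha):
    """sigma2_hat(x) = 2 * int_0^x mu_hat(u) f_hat(u) du / f_hat(x),
    with mu_hat(u) = kappa (alpha - u). The integral is taken from the left
    end of the grid, which should sit near the boundary 0 of the state space."""
    integrand = kappa * (alpha - grid) * f_hat
    integral = np.concatenate(([0.0],
        np.cumsum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(grid))))
    return 2 * integral / f_hat
```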
Gao and King [56] propose tests of diffusion models based on the discrepancy between the parametric and nonparametric estimates of the invariant density.
FIG. 6. (a) Lag 1 scatterplot of the two-year Treasury note data. (b) Lag 1 scatterplot of those data falling in the neighborhood 8% ± 0.2%—the points are represented by the times of the observed data. The numbers in the scatterplot show the indices of the data falling in the neighborhood. (c) Kernel density estimate of the invariant density.
The Aït-Sahalia method [2] readily illustrates that the volatility function can be consistently estimated for fixed Δ. However, we do not expect it to be efficient, since it uses only the marginal information of the data. As shown in (21), almost all of the information is contained in the transition density p_Δ(·|·). The transition density can be estimated as in Section 4.2 below, whether Δ is small or large. Since the transition density and the drift and volatility are in one-to-one correspondence for the diffusion process (5), the drift and diffusion functions can be consistently estimated by inverting the relationship between the transition density and the drift and diffusion functions.

There is no simple formula for expressing the drift and diffusion in terms of the transition density. The inversion is frequently carried out via a spectral analysis of the operator H_Δ = exp(ΔL), where the infinitesimal operator L is defined as

L g(x) = (σ²(x)/2) g″(x) + µ(x) g′(x).

It has the property

L g(x) = lim_{Δ→0} Δ^{−1} [E{g(X_{t+Δ}) | X_t = x} − g(x)]

by Itô's formula (10). The operator H_Δ is the transition operator in that [see also (12)]

H_Δ g(x) = E{g(X_Δ) | X_0 = x}.

The works of Hansen and Scheinkman [66], Hansen, Scheinkman and Touzi [67] and Kessler and Sørensen [86] consist of the following idea. The first step is to estimate the transition operator H_Δ from the data. From the transition operator, one can identify the infinitesimal operator L and hence the functions µ(·) and σ(·). More precisely, let λ_1 be the largest negative eigenvalue of the operator L with eigenfunction ξ_1(x). Then L ξ_1 = λ_1 ξ_1, or equivalently, σ²ξ_1″ + 2µξ_1′ = 2λ_1 ξ_1. This gives one equation involving µ and σ. Another equation can be obtained via (11): (σ²f)′ − 2µf = 0. Solving these two equations, we obtain

σ²(x) = 2λ_1 ∫_0^x ξ_1(y) f(y) dy / [f(x) ξ_1′(x)]

and another explicit expression for µ(x). Using semigroup theory ([44], Theorem IV.3.7), ξ_1 is also an eigenfunction of H_Δ with eigenvalue exp(λ_1Δ). Hence, the proposal is to estimate the invariant density f and the transition density p_Δ(y|x), which implies the values of λ_1 and ξ_1. Gobet [58] derives the optimal rate of convergence for such a scheme, using a wavelet basis. In particular, [58] shows that for fixed Δ, the optimal rates of convergence for µ and σ are of orders O(n^{−s/(2s+5)}) and O(n^{−s/(2s+3)}), respectively, where s is the degree of smoothness of µ and σ.

3.5 Time-Dependent Model

The time-dependent model (8) was introduced to accommodate the possibility of economic changes over time. The coefficient functions in (8) are assumed to be slowly time-varying and smooth. Nonparametric techniques can be applied to estimate these coefficient functions. The basic idea is to localize in time, resulting in a time-domain smoothing.

We first estimate the coefficient functions α_0(t) and α_1(t). For each given time t_0, approximate the coefficient functions locally by constants, α_0(t) ≈ a and α_1(t) ≈ b for t in a neighborhood of t_0. Using the Euler approximation (19), we run a local regression: minimize

(27)  Σ_{i=0}^{n−1} (Y_{iΔ} − a − bX_{iΔ})² K_h(iΔ − t_0)

with respect to a and b. This results in estimates α̂_0(t_0) = â and α̂_1(t_0) = b̂, where â and b̂ are the minimizers of the local regression (27). Fan et al. [48] suggest using a one-sided kernel such as K(u) = (1 − u²)I(−1 < u < 0), so that only the historical data in the time interval (t_0 − h, t_0) are used in the above local regression. This facilitates forecasting and bandwidth selection. Our experience shows that there are no significant differences between nonparametric fitting with one-sided and two-sided kernels. We opt for local constant approximations instead of local linear approximations in (27), since the local linear fit can create artificial albeit insignificant linear trends when the underlying functions α_0(t) and α_1(t) are indeed time-independent. To appreciate this, note that for constant functions α_0 and α_1 a large bandwidth will be chosen to reduce the variance in the estimation; this is in essence fitting a global linear regression by (27). If local linear approximations were used, since no variable selection procedures have been incorporated in the local fitting (27), the slopes of the local linear approximations would not be estimated as zero and hence artificial linear trends would be created for the estimated coefficients.
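A sketch of the local regression (27) with the one-sided kernel just mentioned (the path x, spacing delta and bandwidth h are assumed inputs):

```python
import numpy as np

def local_coefficients(x, delta, t0, h):
    """Local constant fit (27): weighted LS of Y_{i Delta} on (1, X_{i Delta})
    with one-sided weights K(u) = (1 - u^2) I(-1 < u < 0), u = (i Delta - t0)/h."""
    n = len(x) - 1
    Y = (x[1:] - x[:-1]) / delta                 # Euler responses
    t = delta * np.arange(n)                     # observation times
    u = (t - t0) / h
    w = np.where((u > -1) & (u < 0), 1 - u**2, 0.0)
    A = np.column_stack([np.ones(n), x[:-1]])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], Y * sw, rcond=None)
    return beta                                  # (alpha0_hat(t0), alpha1_hat(t0))
```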
The coefficient functions in the volatility can be estimated by the local approximated likelihood method. Let

Ê_t = Δ^{−1/2} [X_{t+Δ} − X_t − {α̂_0(t) + α̂_1(t)X_t}Δ]
be the normalized residuals. Then

(28)  Ê_t ≈ β_0(t) X_t^{β_1(t)} ε_t.

The conditional log-likelihood of Ê_t given X_t can easily be obtained from the approximation (28). Using local constant approximations and incorporating the kernel weight, we obtain the local approximated likelihood at each time point, and estimates of the functions β_0(·) and β_1(·) at that time point. This type of local approximated-likelihood method is related to the generalized method of moments of Hansen [65] and the ideas of Florens-Zmirou [55] and Genon-Catalot and Jacod [57].

Since the coefficient functions in both the return and volatility functions are estimated using only historical data, their bandwidths can be selected based on a form of the average prediction error. See Fan et al. [48] for details. The local least-squares regression could also be applied to estimate the coefficient functions β_0(t) and β_1(t) via the transformed model [see (28)]

log(Ê_t²) ≈ 2 log β_0(t) + β_1(t) log(X_t²) + log(ε_t²),

but we do not continue in this direction, since the local least-squares estimate is known to be inefficient in the likelihood context and the exponentiation of an estimated coefficient function of log β_0(t) is unstable.

The question arises naturally whether the coefficients in model (8) are really time-varying. This amounts, for example, to testing H_0: β_0(t) = β_0 and β_1(t) = β_1. Based on the GLR technique, Fan et al. [48] proposed a formal test for this kind of problem.

The coefficient functions in the semiparametric model (9) can also be estimated by using the profile approximated-likelihood method. For each given β_1, one can easily estimate β_0(·) via the approximation (28), resulting in an estimate β̂_0(·; β_1). Regarding the nonparametric function β_0(·) as being parameterized by β̂_0(·; β_1), model (28) with β_1(t) ≡ β_1 becomes a "synthesized" parametric model with unknown β_1. The parameter β_1 can then be estimated by the maximum (approximated) likelihood method. Note that β_1 is estimated by using all the data points, while β̂_0(t) = β̂_0(t; β̂_1) is obtained by using only the local data points. See [48] for details.

For other nonparametric methods of estimating volatility in time-inhomogeneous models, see Härdle, Herwartz and Spokoiny [68] and Mercurio and Spokoiny [89]. Their methods are based on model (8) with α_1(t) = β_1(t) = 0.

3.6 State-Domain Versus Time-Domain Smoothing

So far, we have introduced both state- and time-domain smoothing. The former relies on the structural invariability implied by the stationarity assumption and depends predominantly on the (remote) historical data. The latter uses the continuity of underlying parameters and concentrates basically on the recent data. This is illustrated in Figure 7 using the yields of the three-month Treasury bills from January 8, 1954 to July 16,
FIG. 7. Illustration of time- and state-domain smoothing using the yields of three-month Treasury bills. The state-domain smoothing is localized in the horizontal bars, while the time-domain smoothing is concentrated in the vertical bars.
2004 sampled at weekly frequency. On December 28, 1990, the interest rate was about 6.48%. To estimate the drift and diffusion around x = 6.48, the state-domain smoothing focuses on the dynamics where interest rates are around 6.48%, the horizontal bar with interest rates falling in 6.48% ± 0.25%. The estimated volatility is basically the sample standard deviation of the differences {X_{iΔ} − X_{(i−1)Δ}} within this horizontal bar. On the other hand, the time-domain smoothing focuses predominantly on the recent history, say one year, as illustrated in the figure. The time-domain estimate of volatility is basically a sample standard deviation within the vertical bar.

For a given time series, it is hard to say which estimate is better. This depends on the underlying stochastic process and also on the time when the forecast is made. If the underlying process is continuous and stationary, such as model (5), both methods are applicable. For example, standing at December 28, 1990, one can forecast the volatility by using the sample standard deviation in either the horizontal bar or the vertical bar. However, the estimation precision depends on the local data. Since the sample variance is basically linear in the squared differences {Z_{iΔ}}, the standard errors of both estimates can be assessed and used to guide the forecasting.

For stationary diffusion processes, it is possible to integrate both the time-domain and state-domain estimates. Note that the historical data (with interest rates in 6.48% ± 0.25%) are far apart in time from the data used in the time-domain smoothing (the vertical bar), except the last segment, which can be ignored in the state-domain fitting. The next-to-last segment with interest rates in 6.48% ± 0.25% is May 11 to July 20, 1988, 123 weeks prior to the last segment. Hence, these two estimates are nearly independent. The integrated estimate is a linear combination of these two nearly independent estimates. The weights can easily be chosen to minimize the variance of the integrated estimator, by using the assessed standard errors of the state- and time-domain estimators. The optimal weights are inversely proportional to the variances of the two estimators, which depend on time t. This forms a dynamically integrated predictor for volatility estimation, as the optimal weights change over time.
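A sketch of this dynamic integration (the interface is an assumption; est and se denote an estimate and its assessed standard error):

```python
def integrate_estimates(est_time, se_time, est_state, se_state):
    """Minimum-variance combination of two nearly independent estimates:
    the weight on each estimate is inversely proportional to its variance."""
    w = se_state**2 / (se_time**2 + se_state**2)   # weight on the time-domain estimate
    return w * est_time + (1 - w) * est_state

# e.g., integrate_estimates(0.012, 0.003, 0.010, 0.002) leans toward 0.010
```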
3.7 Continuously Observed Data

At the theoretical level, one may also examine the problem of estimating the drift and diffusion functions assuming the whole process is observable up to time T. Let us assume again that the observed process {X_t} follows the SDE (5). In this case σ²(X_t) is the derivative of the quadratic variation process of X_t and hence is known up to time T. By (11), estimating the drift function µ(x) is equivalent to estimating the invariant density f. In fact,

(29)  µ(x) = [σ²(x) f(x)]′ / [2f(x)].

The invariant density f can easily be estimated by kernel density estimation. When Δ → 0, the summation in (26) converges to

(30)  f̂(x) = T^{−1} ∫_0^T K_h(X_t − x) dt.

This forms a kernel density estimate of the invariant density based on the continuously observed data. Thus, an estimator for µ(x) can be obtained by substituting f̂(x) into (29). Such an approach has been employed by Kutoyants [88] and Dalalyan and Kutoyants [40, 41]. They established the sharp asymptotic minimax risk for estimating the invariant density f and its derivative, as well as the drift function µ. In particular, the functions f, f′ and µ can be estimated with rates T^{−1/2}, T^{−2s/(2s+1)} and T^{−2s/(2s+1)}, respectively, where s is the degree of smoothness of µ. These are the optimal rates of convergence.

An alternative approach is to estimate the drift function directly from (23). By letting Δ → 0, one can easily obtain a local linear regression estimator for continuously observed data, which admits a similar form to (23) and (30). This is the approach that Spokoiny [103] used. He showed that this estimator attains the optimal rate of convergence and further established a data-driven bandwidth such that the local linear estimator attains adaptive minimax rates.

4. ESTIMATION OF STATE PRICE DENSITIES AND TRANSITION DENSITIES

The state price density (SPD) is the probability density of the value of an asset under the risk-neutral world (14) (see [38]) or equivalent martingale measure [71]. It is directly related to the pricing of financial derivatives. It is the transition density of X_T given X_0 under the equivalent martingale measure Q. The SPD does not depend on the payoff function and hence it can be used to evaluate other illiquid derivatives, once it is estimated from more liquid derivatives. On the other hand, the transition density characterizes the probability law of a Markovian process and hence is useful for validating Markovian properties and parametric models.
4.1 Estimation of the State Price Density

For some specific models, the state price density can be formed explicitly. For example, for the GBM (1) with a constant risk-free rate r, according to (17), the SPD is log-normal with mean log x_0 + (r − σ²/2)T and variance σ²T.

Assume that the SPD f* exists. Then the European call option can be expressed as

C = exp(−∫_0^T r_s ds) ∫_K^∞ (x − K) f*(x) dx.

See (14) (we have changed the notation from P_0 to C to emphasize the price of the European call option). Hence,

(31)  f*(K) = exp(∫_0^T r_s ds) ∂²C/∂K².

This was observed by Breeden and Litzenberger [25]. Thus, the state price density can be estimated from the European call options with different strike prices. With the estimated state price density, one can price new or less liquid securities such as over-the-counter derivatives or nontraded options using formula (14).
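A sketch of (31) on synthetic data (not from the paper): call prices are generated from the Black-Scholes model, so the second difference in the strike should recover the log-normal SPD implied by (17):

```python
import numpy as np
from scipy.stats import norm

x0, T, r, sigma = 100.0, 0.5, 0.05, 0.2   # assumed market inputs

def bs_call(K):
    d1 = (np.log(x0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return x0 * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

K = np.linspace(60.0, 160.0, 501)
dK = K[1] - K[0]
spd = np.exp(r * T) * np.gradient(np.gradient(bs_call(K), dK), dK)   # formula (31)

# Log-normal SPD: log X_T ~ N(log x0 + (r - sigma^2/2) T, sigma^2 T)
m, s = np.log(x0) + (r - 0.5 * sigma**2) * T, sigma * np.sqrt(T)
true_spd = np.exp(-(np.log(K) - m)**2 / (2 * s**2)) / (K * s * np.sqrt(2 * np.pi))
print(np.max(np.abs(spd[2:-2] - true_spd[2:-2])))   # close to zero
```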
In general, the price of a European call option depends on the current stock price S, the strike price K, the time to maturity T, the risk-free interest rate r and the dividend yield rate δ. It can be written as C(S, K, T, r, δ). The exact form of C, in general, is hard to determine unless we assume the Black-Scholes model. Based on historical data {(C_i, S_i, K_i, T_i, r_i, δ_i), i = 1, ..., n}, where C_i is the ith traded-option price with associated characteristics (S_i, K_i, T_i, r_i, δ_i), Aït-Sahalia and Lo [7] fit the nonparametric regression

C_i = C(S_i, K_i, T_i, r_i, δ_i) + ε_i

to obtain an estimate of the function C and hence the SPD f*.

Due to the curse of dimensionality, the five-dimensional nonparametric function cannot be estimated well with practical ranges of sample sizes. Aït-Sahalia and Lo [7] realized that and proposed a few dimensionality reduction methods. First, by assuming that the option price depends only on the futures price F = S exp((r − δ)T), namely,

C(S, K, T, r, δ) = C(F, K, T, r)

(the Black-Scholes formula satisfies such an assumption), they reduced the dimensionality from five to four. By assuming further that the option-pricing function is homogeneous of degree one in F and K, namely,

C(S, K, T, r, δ) = K C(F/K, T, r),

they reduced the dimensionality to three. Aït-Sahalia and Lo [7] imposed a semiparametric form on the pricing formula,

C(S, K, T, r, δ) = C_BS(F, K, T, r, σ(F, K, T)),

where C_BS(F, K, T, r, σ) is the Black-Scholes pricing formula given in (18) and σ(F, K, T) is the implied volatility, computed by inverting the Black-Scholes formula. Thus, the problem becomes one of nonparametrically estimating the implied volatility function σ(F, K, T). This is estimated by using a nonparametric regression technique from historical data, namely,

σ_i = σ(F_i, K_i, T_i) + ε_i,

where σ_i is the implied volatility of C_i, obtained by inverting the Black-Scholes formula. By assuming further that σ(F, K, T) = σ(F/K, T), the dimensionality is reduced to two. This is one of the options in [4].

The state price density f* is nonnegative and hence the function C should be convex in the strike price K. Aït-Sahalia and Duarte [6] propose to estimate the option price under the convexity constraint using a local linear estimator. See also [70] for a related approach.

4.2 Estimation of Transition Densities

The transition density of a Markov process characterizes the law of the process, except for the initial distribution. It provides useful tools for checking whether or not such a process follows a certain SDE and for statistical estimation and inference. It is the state price density of the price process under the risk-neutral world. If such a process were observable, the state price density could be estimated using the methods to be introduced.

Assume that we have a sample {X_{iΔ}, i = 0, ..., n} from model (5). The "double-kernel" method of Fan, Yao and Tong [51] is based on observing that

(32)  E{W_{h_2}(X_{iΔ} − y) | X_{(i−1)Δ} = x} ≈ p_Δ(y|x)  as h_2 → 0,

for a kernel function W. Thus, the transition density p_Δ(y|x) can be regarded approximately as the nonparametric regression function of the response variable W_{h_2}(X_{iΔ} − y) on X_{(i−1)Δ}. An application of the local linear estimator (23) yields

(33)  p̂_Δ(y|x) = Σ_{i=1}^n K_n(X_{(i−1)Δ} − x, x) W_{h_2}(X_{iΔ} − y),
where the equivalent kernel K_n(u, x) was defined in (24). Fan, Yao and Tong [51] establish the asymptotic normality of such an estimator under stationarity and ρ-mixing conditions [necessarily decaying at a geometric rate for SDE (5)], which gives explicitly the asymptotic bias and variance of the estimator. See also Section 6.5 of [50]. The cross-validation idea of Rudemo [98] and Bowman [24] can be extended to select bandwidths for estimating conditional densities. See [52, 63].
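A sketch of the double-kernel idea (32)-(33); for brevity it uses Nadaraya-Watson weights in x in place of the local linear equivalent kernel K_n of (24), which has the same form with different weights (the series x and the bandwidths are assumptions):

```python
import numpy as np

def transition_density(x, y_grid, x0, h1, h2):
    """Estimate p_Delta(y | x0) on y_grid from the pairs
    (X_{(i-1)Delta}, X_{i Delta}), using Gaussian kernels."""
    X, Y = x[:-1], x[1:]
    w = np.exp(-0.5 * ((X - x0) / h1)**2)       # weights in the x-direction
    w /= w.sum()
    u = (Y[None, :] - y_grid[:, None]) / h2
    W = np.exp(-0.5 * u**2) / (h2 * np.sqrt(2 * np.pi))   # W_{h2}(X_{i Delta} - y)
    return W @ w    # regression of W_{h2}(X_{i Delta} - y) on X_{(i-1)Delta}
```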
The transition distribution can be estimated by in- ample, if a process {Xi } is Markovian, then
tegrating the estimator (33) over y. By letting h2 → 0, +∞
the estimator is the regression of the indicator I (Xi < p2 (y|x) = p (y|z)p (z|x) dz.
−∞
y) on X(i−1) . Alternative estimators can be obtained
by an application of the local logistic regression and
+∞ one can use the distance between p̂2 (y|x) and
Thus,
adjusted Nadaraya–Watson method of Hall et al. [64]. −∞ p̂ (y|z)p̂ (z|x) dz as a test statistic.
Early references on the estimation of the transition The transition density can also be used for parameter
distributions and densities include [96, 97] and [95]. estimation. One possible approach is to find the para-
meter which minimizes the distance P̂ − P,θ . In
4.3 Inferences Based on Transition Densities this case, the bandwidth should be chosen to optimize
With the estimated transition density, one can now the performance for estimating θ . The approach is ap-
verify whether parametric models such as (1)–(3), (6) plicable whether or not → 0.
are consistent with the observed data. Let p,θ (y|x)
be the transition density under a parametric diffusion 5. CONCLUDING REMARKS
model. For example, for the CIR model (2), the pa- Enormous efforts in financial econometrics have
rameter θ = (κ, α, σ ). As in (21), ignoring the initial been made in modeling the dynamics of stock prices
value X0 , the parameter θ can be estimated by maxi- and bond yields. There are directly related to pricing
mizing derivative securities, proprietary trading and portfo-
n
lio management. Various parametric models have been
(p,θ ) = log p,θ Xi |X(i−1) . proposed to facilitate mathematical derivations. They
i=1 have risks that misspecifications of models lead to er-
roneous pricing and hedging strategies. Nonparamet-
Let θ̂ be the maximum likelihood estimator. By the
ric models provide a powerful and flexible treatment.
spirit of the GLR of Fan et al. [54], the GLR test for
They aim at reducing modeling biases by increasing
the null hypothesis H0 : p (y|x) = p,θ (y|x) is
somewhat the variances of resulting estimators. They
GLR = (p̂ ) − (p,θ̂ ), provide an elegant method for validating or suggesting
a family of parametric models.
where p̂ is a nonparametric estimate of the transi- The versatility of nonparametric techniques in fi-
tion density. Since the transition density cannot be es- nancial econometrics has been demonstrated in this
timated well over the region where data are sparse paper. They are applicable to various aspects of dif-
(usually at boundaries of the process), we need to fusion models: drift, diffusion, transition densities and
truncate the nonparametric (and simultaneously para- even state price densities. They allow us to examine
metric) evaluation of the likelihood at appropriate in- whether the stochastic dynamics for stocks and bonds
tervals. are time varying and whether famous parametric mod-
In addition to employing the GLR test, one can also els are consistent with empirical financial data. They
compare directly the difference between the paramet- permit us to price illiquid or nontraded derivatives from
ric and nonparametric fits, resulting in test statistics liquid derivatives.
such as p̂ − p,θ̂ 2 and P̂ − P,θ̂ 2 for an ap- The applications of nonparametric techniques in fi-
propriate norm · , where P̂ and P,θ̂ are the esti- nancial econometrics are far wider than what has been
mates of the cumulative transition distributions under presented. There are several areas where nonparamet-
respectively the parametric and nonparametric models. ric methods have played a pivotal role. One example
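The Chapman–Kolmogorov check above can be turned into a statistic directly. The sketch below is again our own illustration; the grid, bandwidths and the L2-type discrepancy are assumptions, and a formal test would calibrate the resulting distance, for example by simulation under the null.

```python
# Sketch of the Markov test: compare the estimated two-step density with
# the Chapman-Kolmogorov convolution of two one-step density estimates.
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def cond_density(X, x, y, h1, h2):
    # Double-kernel estimate of p(y | x), as in the previous sketch.
    lag, lead = X[:-1], X[1:]
    w = gauss((lag - x) / h1)
    return np.sum(w * gauss((lead - y) / h2) / h2) / np.sum(w)

def ck_distance(X, grid, h1, h2):
    """L2-type discrepancy between p_hat_{2Delta} and the convolution
    of two p_hat_Delta estimates on a grid."""
    one = np.array([[cond_density(X, x, y, h1, h2) for y in grid] for x in grid])
    two = np.array([[cond_density(X[::2], x, y, h1, h2) for y in grid] for x in grid])
    dz = grid[1] - grid[0]
    conv = (one @ one) * dz          # approximates int p(y|z) p(z|x) dz
    return np.mean((two - conv) ** 2)

# Usage: ck_distance(X, np.linspace(0.02, 0.14, 25), h1=0.01, h2=0.01),
# with X an observed (or simulated) series as in the previous sketch.
```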
5. CONCLUDING REMARKS

Enormous efforts in financial econometrics have been made in modeling the dynamics of stock prices and bond yields. These are directly related to pricing derivative securities, proprietary trading and portfolio management. Various parametric models have been proposed to facilitate mathematical derivations. They carry the risk that misspecification of a model leads to erroneous pricing and hedging strategies. Nonparametric models provide a powerful and flexible treatment. They aim at reducing modeling biases by increasing somewhat the variances of the resulting estimators. They provide an elegant method for validating or suggesting a family of parametric models.

The versatility of nonparametric techniques in financial econometrics has been demonstrated in this paper. They are applicable to various aspects of diffusion models: drift, diffusion, transition densities and even state price densities. They allow us to examine whether the stochastic dynamics for stocks and bonds are time varying and whether famous parametric models are consistent with empirical financial data. They permit us to price illiquid or nontraded derivatives from liquid derivatives.

The applications of nonparametric techniques in financial econometrics are far wider than what has been presented. There are several areas where nonparametric methods have played a pivotal role. One example is to test various versions of capital asset pricing models (CAPM) and their related stochastic discount models [36]. See, for example, the research manuscript by Chen and Ludvigson [34] in this direction. Another important class of models are stochastic volatility models [19, 101], where nonparametric methods can also be applied. Nonparametric techniques have been prominently featured in the RiskMetrics of J. P. Morgan. They can be employed to forecast the risks of portfolios. See, for example, [8, 32, 33, 47, 82] for related nonparametric techniques on risk management.

ACKNOWLEDGMENTS

The author gratefully acknowledges various discussions with Professors Yacine Aït-Sahalia and Jia-an Yan and helpful comments of the editors and reviewers that led to significant improvement of the presentation of this paper. This research was supported in part by NSF Grant DMS-03-55179 and a direct allocation RGC grant of the Chinese University of Hong Kong.

REFERENCES

[1] Ahn, D. H. and Gao, B. (1999). A parametric nonlinear model of term structure dynamics. Review of Financial Studies 12 721–762.
[2] Aït-Sahalia, Y. (1996). Nonparametric pricing of interest rate derivative securities. Econometrica 64 527–560.
[3] Aït-Sahalia, Y. (1996). Testing continuous-time models of the spot interest rate. Review of Financial Studies 9 385–426.
[4] Aït-Sahalia, Y. (1999). Transition densities for interest rate and other nonlinear diffusions. J. Finance 54 1361–1395.
[5] Aït-Sahalia, Y. (2002). Maximum likelihood estimation of discretely sampled diffusions: A closed-form approximation approach. Econometrica 70 223–262.
[6] Aït-Sahalia, Y. and Duarte, J. (2003). Nonparametric option pricing under shape restrictions. J. Econometrics 116 9–47.
[7] Aït-Sahalia, Y. and Lo, A. W. (1998). Nonparametric estimation of state-price densities implicit in financial asset prices. J. Finance 53 499–547.
[8] Aït-Sahalia, Y. and Lo, A. W. (2000). Nonparametric risk management and implied risk aversion. J. Econometrics 94 9–51.
[9] Aït-Sahalia, Y. and Mykland, P. (2003). The effects of random and discrete sampling when estimating continuous-time diffusions. Econometrica 71 483–549.
[10] Aït-Sahalia, Y. and Mykland, P. (2004). Estimators of diffusions with randomly spaced discrete observations: A general theory. Ann. Statist. 32 2186–2222.
[11] Arapis, M. and Gao, J. (2004). Nonparametric kernel estimation and testing in continuous-time financial econometrics. Unpublished manuscript.
[12] Arfi, M. (1998). Non-parametric variance estimation from ergodic samples. Scand. J. Statist. 25 225–234.
[13] Bandi, F. (2002). Short-term interest rate dynamics: A spatial approach. J. Financial Economics 65 73–110.
[14] Bandi, F. and Nguyen, T. (1999). Fully nonparametric estimators for diffusions: A small sample analysis. Unpublished manuscript.
[15] Bandi, F. and Phillips, P. C. B. (2003). Fully nonparametric estimation of scalar diffusion models. Econometrica 71 241–283.
[16] Banon, G. (1977). Estimation non paramétrique de densité de probabilité pour les processus de Markov. Thèse, Univ. Paul Sabatier de Toulouse, France.
[17] Banon, G. (1978). Nonparametric identification for diffusion processes. SIAM J. Control Optim. 16 380–395.
[18] Banon, G. and Nguyen, H. T. (1981). Recursive estimation in diffusion models. SIAM J. Control Optim. 19 676–685.
[19] Barndorff-Nielsen, O. E. and Shephard, N. (2001). Non-Gaussian Ornstein–Uhlenbeck-based models and some of their uses in financial economics (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 63 167–241.
[20] Bingham, N. H. and Kiesel, R. (1998). Risk-Neutral Valuation: Pricing and Hedging of Financial Derivatives. Springer, New York.
[21] Black, F., Derman, E. and Toy, W. (1990). A one-factor model of interest rates and its application to Treasury bond options. Financial Analysts Journal 46(1) 33–39.
[22] Black, F. and Karasinski, P. (1991). Bond and option pricing when short rates are lognormal. Financial Analysts Journal 47(4) 52–59.
[23] Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. J. Political Economy 81 637–654.
[24] Bowman, A. W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika 71 353–360.
[25] Breeden, D. and Litzenberger, R. H. (1978). Prices of state-contingent claims implicit in option prices. J. Business 51 621–651.
[26] Cai, Z., Fan, J. and Yao, Q. (2000). Functional-coefficient regression models for nonlinear time series. J. Amer. Statist. Assoc. 95 941–956.
[27] Cai, Z. and Hong, Y. (2003). Nonparametric methods in continuous-time finance: A selective review. In Recent Advances and Trends in Nonparametric Statistics (M. G. Akritas and D. N. Politis, eds.) 283–302. North-Holland, Amsterdam.
[28] Campbell, J. Y., Lo, A. W. and MacKinlay, A. C. (1997). The Econometrics of Financial Markets. Princeton Univ. Press.
[29] Chan, K. C., Karolyi, G. A., Longstaff, F. A. and Sanders, A. B. (1992). An empirical comparison of alternative models of the short-term interest rate. J. Finance 47 1209–1227.
[30] Chapman, D. A. and Pearson, N. D. (2000). Is the short rate drift actually nonlinear? J. Finance 55 355–388.
[31] Chen, R. and Tsay, R. S. (1993). Functional-coefficient autoregressive models. J. Amer. Statist. Assoc. 88 298–308.
[32] Chen, S. X. (2005). Nonparametric estimation of expected shortfall. Econometric Theory. To appear.
[33] Chen, S. X. and Tang, C. Y. (2005). Nonparametric inference of value-at-risk for dependent financial returns. J. Financial Econometrics 3 227–255.
[34] Chen, X. and Ludvigson, S. (2003). Land of Addicts? An empirical investigation of habit-based asset pricing model. Unpublished manuscript.
[35] Claeskens, G. and Hall, P. (2002). Effect of dependence on stochastic measures of accuracy of density estimators. Ann. Statist. 30 431–454.
[36] Cochrane, J. H. (2001). Asset Pricing. Princeton Univ. Press.
[37] Cox, J. C., Ingersoll, J. E. and Ross, S. A. (1985). A theory of the term structure of interest rates. Econometrica 53 385–407.
[38] Cox, J. C. and Ross, S. (1976). The valuation of options for alternative stochastic processes. J. Financial Economics 3 145–166.
[39] Dacunha-Castelle, D. and Florens, D. (1986). Estimation of the coefficients of a diffusion from discrete observations. Stochastics 19 263–284.
[40] Dalalyan, A. S. and Kutoyants, Y. A. (2002). Asymptotically efficient trend coefficient estimation for ergodic diffusion. Math. Methods Statist. 11 402–427.
[41] Dalalyan, A. S. and Kutoyants, Y. A. (2003). Asymptotically efficient estimation of the derivative of the invariant density. Stat. Inference Stoch. Process. 6 89–107.
[42] Duffie, D. (2001). Dynamic Asset Pricing Theory, 3rd ed. Princeton Univ. Press.
[43] Egorov, A. V., Li, H. and Xu, Y. (2003). Maximum likelihood estimation of time-inhomogeneous diffusions. J. Econometrics 114 107–139.
[44] Engel, K.-J. and Nagel, R. (2000). One-Parameter Semigroups for Linear Evolution Equations. Springer, Berlin.
[45] Engelbert, H. J. and Schmidt, W. (1984). On one-dimensional stochastic differential equations with generalized drift. Stochastic Differential Systems. Lecture Notes in Control and Inform. Sci. 69 143–155. Springer, Berlin.
[46] Fan, J. (1992). Design-adaptive nonparametric regression. J. Amer. Statist. Assoc. 87 998–1004.
[47] Fan, J. and Gu, J. (2003). Semiparametric estimation of value-at-risk. Econom. J. 6 261–290.
[48] Fan, J., Jiang, J., Zhang, C. and Zhou, Z. (2003). Time-dependent diffusion models for term structure dynamics. Statist. Sinica 13 965–992.
[49] Fan, J. and Yao, Q. (1998). Efficient estimation of conditional variance functions in stochastic regression. Biometrika 85 645–660.
[50] Fan, J. and Yao, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, New York.
[51] Fan, J., Yao, Q. and Tong, H. (1996). Estimation of conditional densities and sensitivity measures in nonlinear dynamical systems. Biometrika 83 189–206.
[52] Fan, J. and Yim, T. H. (2004). A crossvalidation method for estimating conditional densities. Biometrika 91 819–834.
[53] Fan, J. and Zhang, C. (2003). A re-examination of diffusion estimators with applications to financial model validation. J. Amer. Statist. Assoc. 98 118–134.
[54] Fan, J., Zhang, C. and Zhang, J. (2001). Generalized likelihood ratio statistics and Wilks phenomenon. Ann. Statist. 29 153–193.
[55] Florens-Zmirou, D. (1993). On estimating the diffusion coefficient from discrete observations. J. Appl. Probab. 30 790–804.
[56] Gao, J. and King, M. (2004). Adaptive testing in continuous-time diffusion models. Econometric Theory 20 844–882.
[57] Genon-Catalot, V. and Jacod, J. (1993). On the estimation of the diffusion coefficient for multi-dimensional diffusion processes. Ann. Inst. H. Poincaré Probab. Statist. 29 119–151.
[58] Gobet, E. (2002). LAN property for ergodic diffusions with discrete observations. Ann. Inst. H. Poincaré Probab. Statist. 38 711–737.
[59] Gobet, E., Hoffmann, M. and Reiss, M. (2004). Nonparametric estimation of scalar diffusions based on low frequency data. Ann. Statist. 32 2223–2253.
[60] Gouriéroux, C. and Jasiak, J. (2001). Financial Econometrics: Problems, Models, and Methods. Princeton Univ. Press.
[61] Gouriéroux, C., Monfort, A. and Renault, E. (1993). Indirect inference. J. Appl. Econometrics 8(suppl.) S85–S118.
[62] Hall, P. and Carroll, R. J. (1989). Variance function estimation in regression: The effect of estimating the mean. J. Roy. Statist. Soc. Ser. B 51 3–14.
[63] Hall, P., Racine, J. and Li, Q. (2004). Cross-validation and the estimation of conditional probability densities. J. Amer. Statist. Assoc. 99 1015–1026.
[64] Hall, P., Wolff, R. C. L. and Yao, Q. (1999). Methods for estimating a conditional distribution function. J. Amer. Statist. Assoc. 94 154–163.
[65] Hansen, L. P. (1982). Large sample properties of generalized method of moments estimators. Econometrica 50 1029–1054.
[66] Hansen, L. P. and Scheinkman, J. A. (1995). Back to the future: Generating moment implications for continuous-time Markov processes. Econometrica 63 767–804.
[67] Hansen, L. P., Scheinkman, J. A. and Touzi, N. (1998). Spectral methods for identifying scalar diffusions. J. Econometrics 86 1–32.
[68] Härdle, W., Herwartz, H. and Spokoiny, V. (2003). Time inhomogeneous multiple volatility modelling. J. Financial Econometrics 1 55–95.
[69] Härdle, W. and Tsybakov, A. B. (1997). Local polynomial estimators of the volatility function in nonparametric autoregression. J. Econometrics 81 223–242.
[70] Härdle, W. and Yatchew, A. (2002). Dynamic nonparametric state price density estimation using constrained least-squares and the bootstrap. Discussion paper 16, Quantification and Simulation of Economics Processes, Humboldt-Universität zu Berlin.
[71] Harrison, J. M. and Kreps, D. (1979). Martingales and arbitrage in multiperiod securities markets. J. Econom. Theory 20 381–408.
[72] Hart, J. D. (1996). Some automated methods of smoothing time-dependent data. Nonparametr. Statist. 6 115–142.
[73] H ART, J. D. (1997). Nonparametric Smoothing and Lack- [91] M ÜLLER , H.-G. and S TADTMÜLLER , U. (1987). Estima-
of-Fit Tests. Springer, New York. tion of heteroscedasticity in regression analysis. Ann. Statist.
[74] H ASTIE , T. J. and T IBSHIRANI , R. J. (1993). Varying- 15 610–625.
coefficient models (with discussion). J. Roy. Statist. Soc. Ser. [92] O SBORNE , M. F. M. (1959). Brownian motion in the stock
B. 55 757–796. market. Operations Res. 7 145–173.
[75] H O , T. S. Y. and L EE , S.-B. (1986). Term structure move- [93] P HAM , D. T. (1981). Nonparametric estimation of the drift
ments and pricing interest rate contingent claims. J. Finance coefficient in the diffusion equation. Math. Operations-
41 1011–1029. forsch. Statist. Ser. Statist. 12 61–73.
[76] H ONG , Y. and L EE , T.-H. (2003). Inference on predictabil- [94] P RAKASA R AO , B. L. S. (1985). Estimation of the drift for
ity of foreign exchange rates via generalized spectrum and diffusion process. Statistics 16 263–275.
nonlinear time series models. Review of Economics and Sta- [95] ROSENBLATT, M. (1970). Density estimates and Markov
tistics 85 1048–1062. sequences. In Nonparametric Techniques in Statistical In-
[77] H ONG , Y. and L I , H. (2005). Nonparametric specification ference (M. L. Puri, ed.) 199–213. Cambridge Univ. Press.
testing for continuous-time models with applications to term [96] ROUSSAS , G. G. (1969). Nonparametric estimation in
structure of interest rates. Review of Financial Studies 18 Markov processes. Ann. Inst. Statist. Math. 21 73–87.
37–84. [97] ROUSSAS , G. G. (1969). Nonparametric estimation of the
[78] H ULL , J. (2003). Options, Futures, and Other Derivatives, transition distribution function of a Markov process. Ann.
5th ed. Prentice Hall, Upper Saddle River, NJ. Math. Statist. 40 1386–1400.
[79] H ULL , J. and W HITE , A. (1990). Pricing interest-rate-
[98] RUDEMO , M. (1982). Empirical choice of histograms and
derivative securities. Review of Financial Studies 3 573–592.
kernel density estimators. Scand. J. Statist. 9 65–78.
[80] I TÔ , K. (1942). Differential equations determining Markov
[99] RUPPERT, D., WAND , M. P., H OLST, U. and H ÖSSJER ,
processes. Zenkoku Shijo Sugaku Danwakai 244 1352–
O. (1997). Local polynomial variance function estimation.
1400. (In Japanese.)
Technometrics 39 262–273.
[81] I TÔ , K. (1946). On a stochastic integral equation. Proc.
[100] S CHURZ , H. (2000). Numerical analysis of stochastic dif-
Japan Acad. 22 32–35.
ferential equations without tears. In Handbook of Stochastic
[82] J ORION , P. (2000). Value at Risk: The New Benchmark for
Analysis and Applications (D. Kannan and V. Lakshmikan-
Managing Financial Risk, 2nd ed. McGraw–Hill, New York.
tham, eds.) 237–359. Dekker, New York.
[83] K ALLENBERG , O. (2002). Foundations of Modern Proba-
bility, 2nd ed. Springer, New York. [101] S HEPHARD , N., ed. (2005). Stochastic Volatility: Selected
[84] K ARATZAS , I. and S HREVE , S. E. (1991). Brownian Mo- Readings. Oxford Univ. Press.
tion and Stochastic Calculus, 2nd ed. Springer, New York. [102] S IMONOFF , J. S. (1996). Smoothing Methods in Statistics.
[85] K ESSLER , M. (1997). Estimation of an ergodic diffusion Springer, New York.
from discrete observations. Scand. J. Statist. 24 211–229. [103] S POKOINY, V. (2000). Adaptive drift estimation for non-
[86] K ESSLER , M. and S ØRENSEN , M. (1999). Estimating equa- parametric diffusion model. Ann. Statist. 28 815–836.
tions based on eigenfunctions for a discretely observed dif- [104] S TANTON , R. (1997). A nonparametric model of term struc-
fusion process. Bernoulli 5 299–314. ture dynamics and the market price of interest rate risk. J.
[87] K LOEDEN , P. E., P LATEN , E., S CHURZ , H. and Finance 52 1973–2002.
S ØRENSEN , M. (1996). On effects of discretization on es- [105] S TEELE , J. M. (2001). Stochastic Calculus and Financial
timators of drift parameters for diffusion processes. J. Appl. Applications. Springer, New York.
Probab. 33 1061–1076. [106] VASICEK , O. A. (1977). An equilibrium characterization of
[88] K UTOYANTS , Y. A. (1998). Efficient density estimation for the term structure. J. Financial Economics 5 177–188.
ergodic diffusion processes. Stat. Inference Stoch. Process. [107] WAND , M. P. and J ONES , M. C. (1995). Kernel Smoothing.
1 131–155. Chapman and Hall, London.
[89] M ERCURIO , D. and S POKOINY, V. (2004). Statistical infer- [108] WANG , Y. (2002). Asymptotic nonequivalence of GARCH
ence for time-inhomogeneous volatility models. Ann. Statist. models and diffusions. Ann. Statist. 30 754–783.
32 577–602. [109] YOSHIDA , N. (1992). Estimation for diffusion processes
[90] M ERTON , R. (1973). Theory of rational option pricing. Bell from discrete observations. J. Multivariate Anal. 41
J. Econom. Management Sci. 4 141–183. 220–242.
Statistical Science 2005, Vol. 20, No. 4, 338–343. DOI 10.1214/088342305000000430. © Institute of Mathematical Statistics, 2005

Comment: Peter C. B. Phillips and Jun Yu
problems of estimating dynamic models that are well known in discrete time series, such as the bias in ML estimation, also manifest in the estimation of continuous time systems and affect subsequent use of these estimates, for instance in derivative pricing. In consequence, a relevant concern is the relative importance of the estimation and discretization biases. As we will show below, the former often dominates the latter even when the sample size is large (at least 500 monthly observations, say). Moreover, it turns out that correction for the finite sample estimation bias continues to be more important when the diffusion component of the model is itself misspecified. Such corrections appear to be particularly important in models that are nonstationary or nearly nonstationary.

The second issue we discuss deals with a very different nonparametric technique, which is not discussed by Fan, but which has recently attracted much attention in financial econometrics and empirical applications. This method involves the use of quadratic variation measures of realized volatility using ultra high frequency financial data. Like other nonparametric methods, empirical quadratic variation techniques also have to deal with statistical bias, which in the present case arises from the presence of microstructure noise. The field of research on this topic in econometrics is now very active.

2. FINITE SAMPLE EFFECTS

In his overview of diffusion equation estimation, Fan discusses two sources of bias, one arising from the discretization process and the second from misspecification. We review these two bias effects and then discuss the bias that comes from finite sample estimation effects.

The attractions of Itô calculus have made it particularly easy to work with stochastic differential equations driven by Brownian motion. Diffusion processes in particular have been used widely in finance to model asset prices, including stock prices, interest rates and exchange rates. Despite their mathematical attractiveness, diffusion processes present some formidable challenges for econometric estimation. The primary reason for the difficulty is that sample data, even very high-frequency data, are always discrete, and for many popular nonlinear diffusion models the transition density of the discrete sample does not have a closed form expression, as noted by Fan. The problem is specific to nonlinear diffusions, as consistent methods for estimating exact discrete models corresponding to linear systems of diffusions have been available since Phillips [32]. A simple approach discussed in the paper is to use the Euler approximation scheme to discretize the model, a process which naturally creates some discretization bias. This discretization bias can lead to erroneous financial pricing and investment decisions. In consequence, the issue of discretization has attracted a lot of attention in the literature and many methods have been proposed to reduce the bias that it causes. Examples are Pedersen [30], Kessler [26], Durham and Gallant [18], Aït-Sahalia [2, 3] and Elerian, Chib and Shephard [19], among many others.

Next, many diffusion models in practical use are specified in a way that makes them mathematically convenient. These specifications are typically not derived from any underlying economic theory and are therefore likely to be misspecified. Potential misspecifications, like discretization, can lead to erroneous financial decisions. Accordingly, specification bias has attracted a great deal of attention in the literature and has helped to motivate the use of functional estimation techniques that treat the drift and diffusion coefficients nonparametrically. Important contributions include Aït-Sahalia [1], Stanton [36], Bandi and Phillips [5] and Hong and Li [21].

While we agree that both discretization and specification bias are important issues, finite sample estimation bias can be of equal or even greater importance for financial decision making, as noted by Phillips and Yu [33] in the context of pricing bonds and bond options. The strong effect of the finite sample estimation bias in this context can be explained as follows. In continuous time specifications, the prices of bonds and bond options depend crucially on the mean reversion parameter in the associated interest rate diffusion equation. This parameter is well known to be subject to estimation bias when standard methods like ML are used. The bias is comparable to, but generally has larger magnitude than, the usual bias that appears in time series autoregression. As the parameter is often very close to zero in empirical applications (corresponding to near martingale behavior and an autoregressive root near unity in discrete time), the estimation bias can be substantial even in very large samples.

To reduce the finite sample estimation bias in parameter estimation, as well as the consequential bias that arises in asset pricing, Phillips and Yu [33] proposed the use of jackknife techniques. Suppose a sample of n observations is available and that this sample is decomposed into m consecutive sub-samples, each with ℓ observations (n = m × ℓ).
The jackknife estimator of a parameter θ in the model is defined by

(2.1) θ̂_jack = [m/(m − 1)] θ̂_n − Σ_{i=1}^{m} θ̂_{ℓi} / (m² − m),

where θ̂_n and θ̂_{ℓi} are the extreme estimates of θ based on the entire sample and the ith sub-sample, respectively. The parameter θ can be a coefficient in the diffusion process, such as the mean reversion parameter, or a much more complex function of the parameters of the diffusion process and the data, such as an asset price or derivative price. Typically, the full sample extreme estimator has bias of order O(n⁻¹), whereas under mild conditions the bias in the jackknife estimate is of order O(n⁻²).

The following simulation illustrates these various bias effects and compares their magnitudes. In the experiment, the true generating process is assumed to be the following commonly used model (CIR hereafter) of short term interest rates, due to Cox, Ingersoll and Ross [17]:

(2.2) dr(t) = κ(µ − r(t)) dt + σ r^{1/2}(t) dB(t).

The transition density of the CIR model is known to be c e^{−u−v} (v/u)^{q/2} I_q(2(uv)^{1/2}) and the marginal density is w₁^{w₂} r^{w₂−1} e^{−w₁ r} / Γ(w₂), where c = 2κ/(σ²(1 − e^{−κΔ})), u = c r(t) e^{−κΔ}, v = c r(t + Δ), q = 2κµ/σ² − 1, w₁ = 2κ/σ², w₂ = 2κµ/σ², Δ is the sampling frequency, and I_q(·) is the modified Bessel function of the first kind of order q. The transition density together with the marginal density can be used for simulation purposes as well as to obtain the exact ML estimator of θ (= (κ, µ, σ)′). In the simulation, we use this model to price a discount bond, which is a three-year bond with a face value of $1 and initial interest rate of 5%, and a one-year European call option on a three-year discount bond which has a face value of $100 and a strike price of $87. The reader is referred to [33] for further details.

In addition to exact ML estimation, we may discretize the CIR model via the Euler method and estimate the discretized model using (quasi-) ML. The Euler scheme leads to the discretization

(2.3) r(t + Δ) = κµΔ + (1 − κΔ) r(t) + σ N(0, Δ r(t)).

One thousand samples, each with 600 monthly observations (i.e., Δ = 1/12), are simulated from the true model (2.2) with (κ, µ, σ)′ set at (0.1, 0.08, 0.02)′, which are settings that are realistic in many financial applications. To investigate the effects of discretization bias, we estimate model (2.3) by the (quasi-) ML approach. To investigate the finite sample estimation bias effects, we estimate model (2.2) based on the true transition density. To examine the effects of bias reduction in estimation, we apply the jackknife method (with m = 3) to the mean reversion parameter κ, the bond price and the bond option price.

To examine the effects of specification bias, we fit each simulated sequence from the true model to the misspecified Vasicek model [37] to obtain the exact ML estimates of κ, the bond price and the option price from this misspecified model. The Vasicek model is given by the simple linear diffusion

(2.4) dr(t) = κ(µ − r(t)) dt + σ dB(t).

We use this model to price the same bond and bond option. Vasicek [37] derived the expression for bond prices and Jamshidian [23] gave the corresponding formula for bond option prices. The transition density for the Vasicek model is

(2.5) r(t + Δ) | r(t) ∼ N(µ(1 − e^{−κΔ}) + e^{−κΔ} r(t), σ²(1 − e^{−2κΔ})/(2κ)).

This transition density is utilized to obtain the exact ML estimates of κ, the bond price and the bond option price, all under the mistaken presumption that the misspecified model (2.4) is correctly specified.

Table 1 reports the means and root mean square errors (RMSEs) for all these cases.

TABLE 1
Finite sample properties of ML and jackknife estimates of κ, bond price and option price for the (true) CIR model using a (correctly specified) fitted CIR model and a (misspecified) fitted Vasicek model (sample size n = 600)

                                         κ        Bond price   Option price
True value                               0.1      0.8503       2.3920
Exact ML of CIR              Mean        0.1845   0.8438       1.8085
                             RMSE        0.1319   0.0103       0.9052
Euler ML of CIR              Mean        0.1905   0.8433       1.7693
                             RMSE        0.1397   0.0111       0.9668
Jackknife (m = 3) of CIR     Mean        0.0911   0.8488       2.1473
                             RMSE        0.1205   0.0094       0.8704
ML of Vasicek (missp.)       Mean        0.1746   0.8444       1.8837
                             RMSE        0.1175   0.0088       0.7637
Jackknife (m = 2) of         Mean        0.0977   0.8488       2.2483
  Vasicek (missp.)           RMSE        0.1628   0.0120       1.0289
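As a concrete illustration of the combination in (2.1), the following sketch (our own illustration, not the authors' code) simulates a Vasicek path exactly via the AR(1) form of (2.5), estimates κ by conditional ML (least squares) and applies the jackknife with m = 3, using the experiment's settings κ = 0.1, µ = 0.08, σ = 0.02 and monthly sampling.

```python
# Sketch of the jackknife bias correction (2.1) for the mean reversion
# parameter of a Vasicek model, simulated exactly via (2.5).
import numpy as np

def simulate_vasicek(kappa, mu, sigma, r0, n, delta, rng):
    b = np.exp(-kappa * delta)                       # exact AR(1) coefficient
    sd = sigma * np.sqrt((1.0 - b**2) / (2.0 * kappa))  # exact innovation s.d.
    r = np.empty(n + 1)
    r[0] = r0
    for i in range(n):
        r[i + 1] = mu * (1.0 - b) + b * r[i] + sd * rng.standard_normal()
    return r

def ml_kappa(r, delta):
    # Conditional ML of the AR(1) slope b is least squares; kappa = -log(b)/delta.
    # (A production version would guard against b >= 1 in small samples.)
    b = np.polyfit(r[:-1], r[1:], 1)[0]
    return -np.log(b) / delta

def jackknife_kappa(r, delta, m=3):
    # The combination (2.1) of full-sample and sub-sample estimates.
    n = len(r) - 1
    ell = n // m
    full = ml_kappa(r, delta)
    subs = [ml_kappa(r[i * ell:(i + 1) * ell + 1], delta) for i in range(m)]
    return m / (m - 1.0) * full - sum(subs) / (m**2 - m)

rng = np.random.default_rng(1)
r = simulate_vasicek(0.1, 0.08, 0.02, r0=0.05, n=600, delta=1/12, rng=rng)
print("ML:", ml_kappa(r, 1/12), "  jackknife:", jackknife_kappa(r, 1/12))
```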
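The closed-form CIR transition density quoted above is what makes exact ML and exact simulation feasible, and it is straightforward to evaluate numerically. A minimal sketch follows (our own illustration; the exponentially scaled Bessel function is used to avoid overflow, and the parameter values are those of the experiment):

```python
# Sketch evaluating the CIR transition density
#   p(r_{t+Delta} | r_t) = c e^{-u-v} (v/u)^{q/2} I_q(2 sqrt(uv)),
# with the constants c, u, v, q defined as in the text.
import numpy as np
from scipy.special import ive   # exponentially scaled modified Bessel I

def cir_transition(r_new, r_old, kappa, mu, sigma, delta):
    c = 2.0 * kappa / (sigma**2 * (1.0 - np.exp(-kappa * delta)))
    u = c * r_old * np.exp(-kappa * delta)
    v = c * r_new
    q = 2.0 * kappa * mu / sigma**2 - 1.0
    z = 2.0 * np.sqrt(u * v)
    # ive(q, z) = iv(q, z) * exp(-z), so multiply back exp(z - u - v).
    return c * np.exp(z - u - v) * (v / u)**(q / 2.0) * ive(q, z)

# Sanity check: the density should integrate to roughly one over r_new.
grid = np.linspace(1e-6, 0.4, 4000)
p = cir_transition(grid, 0.05, kappa=0.1, mu=0.08, sigma=0.02, delta=1/12)
print(np.trapz(p, grid))   # should be close to 1
```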
It is clear that the finite sample estimation bias is more substantial than the discretization bias and the specification bias for all three quantities, at least in this experiment. In particular, κ is estimated by the exact ML method with 84.5% upward bias, which contributes toward the −0.76% bias in the bond price and the −24.39% bias in the option price. Relative to the finite sample bias, the bias in κ due to the discretization is almost negligible, since the total bias in κ changes from 84.5% to 90.5%. (The increase in the total bias indicates that the discretization bias effect is in the same direction as that of the estimation bias.) The total bias changes from −0.76% to −0.82% in the bond price and from −24.39% to −26.03% in the option price. These changes are marginal. Similarly, relative to the finite sample bias, the bias in κ due to misspecification of the diffusion function is almost negligible, since the total bias changes from 84.5% to 74.6%. (The decrease in the total bias indicates that the misspecification bias effect is in the opposite direction to that of the estimation bias.) The total bias changes from −0.76% to −0.69% in the bond price and from −24.39% to −21.25% in the option price. Once again, these changes are marginal. When the jackknife method is applied to the correctly specified model, the estimation bias is greatly reduced in all cases (from 84.5% to −8.9% for κ; from −0.76% to −0.18% for the bond price; and from −24.39% to −10.23% for the option price).

Even more remarkably, when the jackknife method is applied to the incorrectly specified model (see the final row of Table 1), the estimation bias is also greatly reduced in all cases (from 84.5% to −2.3% for κ; from −0.76% to −0.18% for the bond price; and from −24.39% to −6.01% for the option price). These figures reveal that dealing with estimation bias can be much more important than ensuring correct specification in diffusion equation estimation, suggesting that general econometric treatment of the diffusion through nonparametric methods may not address the major source of bias effects on financial decision making.

Although the estimation bias is not completely removed by the jackknife method, the bias reduction is clearly substantial and the RMSE of the jackknife estimate is smaller in all cases than that of exact ML. In sum, it is apparent from Table 1 that the finite sample estimation bias is larger in magnitude than either of the biases due to discretization and misspecification, and correcting this bias is therefore a matter of importance in empirical work on which financial decisions depend.

Although this demonstration of the relative importance of finite sample estimation bias in relation to discretization bias and specification bias is conducted in a parametric context, similar results can be expected for some nonparametric models. For example, in the semiparametric model examined in [1], the diffusion function is nonparametrically specified and the drift function is linear, so that the mean reversion parameter is estimated parametrically as in the above example. In such cases, we can expect substantial finite sample estimation bias to persist and to have important practical implications in financial pricing applications.

3. REALIZED VOLATILITY

As noted in Fan's overview, many models used in financial econometrics for modeling asset prices and interest rates have the fully functional scalar differential form

(3.1) dX_t = µ(X_t) dt + σ(X_t) dB_t,

where both the drift and diffusion functions are nonparametric and where the equation is driven by Brownian motion increments dB_t. For models such as (3.1), we have (dX_t)² = σ²(X_t) dt a.s. and hence the quadratic variation of X_t is

(3.2) [X]_T = ∫₀ᵀ (dX_t)² = ∫₀ᵀ σ²(X_t) dt,

where ∫₀ᵀ σ²(X_t) dt is the accumulated or integrated volatility of X. Were X_t observed continuously, [X]_T and, hence, the integrated volatility would also be observed. For discretely recorded data, estimation of (3.2) is an important practical problem. This can be accomplished by direct nonparametric methods using an empirical estimate of the quadratic variation that is called realized volatility. The idea has been discussed for some time, an early reference being Maheswaran and Sims [28], and it has recently attracted a good deal of attention in the econometric literature now that very high frequency data have become available for empirical use. Recent contributions to the subject are reviewed in [4] and [8].

Suppose X_t is recorded discretely at equispaced points (Δ, 2Δ, . . . , nΔ (≡ T)) over the time interval [0, T]. Then [X]_T can be consistently estimated by the realized volatility of X_t, defined by

(3.3) [X_Δ]_T = Σ_{i=2}^{n} (X_{iΔ} − X_{(i−1)Δ})²,

as Δ → 0, as is well known. In fact, any construction of realized volatility based on an empirical grid of observations where the maximum grid size tends to zero will produce a consistent estimate.
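A small simulation makes the consistency claim behind (3.3) concrete; the drift, diffusion function and grid below are our own illustrative choices, not taken from the comment.

```python
# Sketch: realized volatility (3.3) approximates integrated volatility (3.2).
# Simulate dX = mu(X) dt + sigma(X) dB on a fine grid via Euler steps, then
# compare the sum of squared increments with a Riemann sum of sigma^2(X_t).
import numpy as np

rng = np.random.default_rng(2)
T, n = 1.0, 20_000
dt = T / n
X = np.empty(n + 1)
X[0] = 1.0
sigma2_path = np.empty(n)
for i in range(n):
    sig = 0.3 * np.sqrt(abs(X[i]))          # CIR-type diffusion function
    sigma2_path[i] = sig**2
    X[i + 1] = X[i] + 0.05 * (1.2 - X[i]) * dt \
               + sig * np.sqrt(dt) * rng.standard_normal()

realized = np.sum(np.diff(X) ** 2)          # realized volatility, cf. (3.3)
integrated = np.sum(sigma2_path) * dt       # Riemann sum for (3.2)
print(realized, integrated)                  # close for large n
```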
It follows that the integrated volatility can be consistently estimated by this nonparametric approach, regardless of the form of µ(X_t) and σ(X_t). The approach has received a great deal of attention in the recent volatility literature and serves as a powerful alternative to the methods discussed by Fan, especially when ultra-high frequency data are available.

While this approach is seemingly straightforward, it is not without difficulties. First, in order for the approach to be useful in empirical research, it is necessary to estimate the precision of the realized volatility estimates. Important contributions on the central limit theory of these empirical quadratic variation estimates by Jacod [22] and Barndorff-Nielsen and Shephard [10, 11] have facilitated the construction of suitable methods of inference. Second, in practical applications, realized volatility measures such as (3.3) are usually contaminated by microstructure noise bias, especially at ultra high frequencies and with tick-by-tick data. Noise sources arise from various market frictions and discontinuities in trading behavior that prevent the full operation of efficient financial markets. Recent work on this subject (e.g., [8, 9, 21, 38]) has developed various methods, including nonparametric kernel techniques, for reducing the effects of microstructure noise bias.

4. ADDITIONAL ISSUES

Given efficient market theory, there is good reason to expect that diffusion models like (3.1) may have nonstationary characteristics. Similar comments apply to term structure models and yield curves. In such cases, nonparametric estimation methods lead to the estimation of the local time (or sojourn time) of the corresponding stochastic process and functionals of this quantity, rather than a stationary probability density. Moreover, rates of convergence in such cases become path dependent, and the limit theory for nonparametric estimates of the drift and diffusion functions in (3.1) is mixed normal. Asymptotics of this type require an enlarging time span of data as well as increasing in-fill within each discrete interval as n → ∞. An overview of this literature and its implications for financial data applications is given in [6]. Nonparametric estimates of yield curves in multifactor term structure models are studied in [25].

Not all models in finance are driven by Brownian motion. In some cases, one can expect noise to have some memory and, accordingly, models such as (3.1) have now been extended to accommodate fractional Brownian motion increments. The stochastic calculus of fractional Brownian motion, which is not a semimartingale, is not as friendly as that of Brownian motion and requires new constructs, involving Wick products and versions of the Stratonovich integral. Moreover, certain quantities, such as quadratic variation, that have proved useful in the recent empirical literature may no longer exist and must be replaced by different forms of variation, although the idea of volatility is still present. Developing a statistical theory of inference to address these issues in financial econometric models is presenting new challenges.

ACKNOWLEDGMENTS

Peter C. B. Phillips gratefully acknowledges visiting support from the School of Economics and Social Science at Singapore Management University. Support was provided by NSF Grant SES-04-142254. Jun Yu gratefully acknowledges financial support from the Wharton-SMU Research Center at Singapore Management University.

REFERENCES

[1] Aït-Sahalia, Y. (1996). Nonparametric pricing of interest rate derivative securities. Econometrica 64 527–560.
[2] Aït-Sahalia, Y. (1999). Transition densities for interest rate and other nonlinear diffusions. J. Finance 54 1361–1395.
[3] Aït-Sahalia, Y. (2002). Maximum likelihood estimation of discretely sampled diffusions: A closed-form approximation approach. Econometrica 70 223–262.
[4] Andersen, T. G., Bollerslev, T. and Diebold, F. X. (2005). Parametric and nonparametric volatility measurement. In Handbook of Financial Econometrics (Y. Aït-Sahalia and L. P. Hansen, eds.). North-Holland, Amsterdam. To appear.
[5] Bandi, F. M. and Phillips, P. C. B. (2003). Fully nonparametric estimation of scalar diffusion models. Econometrica 71 241–283.
[6] Bandi, F. M. and Phillips, P. C. B. (2005). Nonstationary continuous-time processes. In Handbook of Financial Econometrics (Y. Aït-Sahalia and L. P. Hansen, eds.). North-Holland, Amsterdam. To appear.
[7] Bandi, F. M. and Russell, J. (2005). Volatility. In Handbook of Financial Engineering (J. R. Birge and V. Linetsky, eds.). To appear.
[8] Bandi, F. M. and Russell, J. (2005). Microstructure noise, realized volatility and optimal sampling. Working paper, Graduate School of Business, Univ. Chicago.
[9] Barndorff-Nielsen, O., Hansen, P., Lunde, A. and Shephard, N. (2005). Regular and modified kernel-based estimators of integrated variance: The case with independent noise. Working paper, Nuffield College.
[10] Barndorff-Nielsen, O. and Shephard, N. (2002). Econometric analysis of realized volatility and its use in estimating stochastic volatility models. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 253–280.
[11] Barndorff-Nielsen, O. and Shephard, N. (2004). Econometric analysis of realized covariation: High frequency based covariance, regression, and correlation in financial economics. Econometrica 72 885–925.
[12] Bartlett, M. S. (1955). An Introduction to Stochastic Processes. Cambridge Univ. Press.
[13] Bartlett, M. S. and Rajalakshman, D. V. (1953). Goodness of fit tests for simultaneous autoregressive series. J. Roy. Statist. Soc. Ser. B 15 107–124.
[14] Bergstrom, A. (1966). Nonrecursive models as discrete approximations to systems of stochastic differential equations. Econometrica 34 173–182.
[15] Bergstrom, A. (1988). The history of continuous-time econometric models. Econometric Theory 4 365–383.
[16] Black, F. and Scholes, M. (1973). The pricing of options and corporate liabilities. J. Political Economy 81 637–654.
[17] Cox, J., Ingersoll, J. and Ross, S. (1985). A theory of the term structure of interest rates. Econometrica 53 385–407.
[18] Durham, G. and Gallant, A. R. (2002). Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes (with discussion). J. Bus. Econom. Statist. 20 297–338.
[19] Elerian, O., Chib, S. and Shephard, N. (2001). Likelihood inference for discretely observed non-linear diffusions. Econometrica 69 959–993.
[20] Hansen, P. and Lunde, A. (2006). An unbiased measure of realized volatility. J. Bus. Econom. Statist. To appear.
[21] Hong, Y. and Li, H. (2005). Nonparametric specification testing for continuous-time models with applications to term structure of interest rates. Review of Financial Studies 18 37–84.
[22] Jacod, J. (1994). Limit of random measures associated with the increments of a Brownian semimartingale. Working paper, Univ. P. and M. Curie, Paris.
[23] Jamshidian, F. (1989). An exact bond option formula. J. Finance 44 205–209.
[24] Jarrow, R. and Protter, P. (2004). A short history of stochastic integration and mathematical finance: The early years, 1880–1970. In A Festschrift for Herman Rubin (A. DasGupta, ed.) 75–91. IMS, Beachwood, OH.
[25] Jeffrey, A., Kristensen, D., Linton, O., Nguyen, T. and Phillips, P. C. B. (2004). Nonparametric estimation of a multifactor Heath–Jarrow–Morton model: An integrated approach. J. Financial Econometrics 2 251–289.
[26] Kessler, M. (1997). Estimation of an ergodic diffusion from discrete observations. Scand. J. Statist. 24 211–229.
[27] Koopmans, T., ed. (1950). Statistical Inference in Dynamic Economic Models. Wiley, New York.
[28] Maheswaran, S. and Sims, C. A. (1993). Empirical implications of arbitrage-free asset markets. In Models, Methods and Applications of Econometrics (P. C. B. Phillips, ed.) 301–316. Blackwell, Cambridge, MA.
[29] Merton, R. (1973). Theory of rational option pricing. Bell J. Econom. and Management Sci. 4 141–183.
[30] Pedersen, A. (1995). A new approach to maximum likelihood estimation for stochastic differential equations based on discrete observations. Scand. J. Statist. 22 55–71.
[31] Phillips, A. W. (1959). The estimation of parameters in systems of stochastic differential equations. Biometrika 46 67–76.
[32] Phillips, P. C. B. (1972). The structural estimation of a stochastic differential equation system. Econometrica 40 1021–1041.
[33] Phillips, P. C. B. and Yu, J. (2005). Jackknifing bond option prices. Review of Financial Studies 18 707–742.
[34] Sargan, J. D. (1974). Some discrete approximations to continuous time stochastic models. J. Roy. Statist. Soc. Ser. B 36 74–90.
[35] Sims, C. (1971). Discrete approximations to continuous time distributed lags in econometrics. Econometrica 39 545–563.
[36] Stanton, R. (1997). A nonparametric model of term structure dynamics and the market price of interest rate risk. J. Finance 52 1973–2002.
[37] Vasicek, O. (1977). An equilibrium characterization of the term structure. J. Financial Economics 5 177–188.
[38] Zhang, L., Mykland, P. and Aït-Sahalia, Y. (2005). A tale of two time scales: Determining integrated volatility with noisy high-frequency data. J. Amer. Statist. Assoc. 100 1394–1411.
Statistical Science 2005, Vol. 20, No. 4, 344–346. DOI 10.1214/088342305000000449. © Institute of Mathematical Statistics, 2005

Comment: Michael Sørensen
Michael Sørensen is Professor and Head, Department of Applied Mathematics and Statistics, University of Copenhagen, Universitetsparken 5, DK-2100 Copenhagen Ø, Denmark (e-mail: michael@math.ku.dk).

More generally, martingale estimating functions provide a simple and versatile technique for estimation in discretely sampled parametric stochastic differential equation models that works whether or not Δ is small.
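As a minimal illustration of the idea (our own sketch, not from the comment), consider the Vasicek model, whose conditional mean F(x; θ) = µ + (x − µ)e^{−κΔ} is known exactly; with weights a(x) = (1, x)′ the equations Σᵢ a(X_{(i−1)Δ})(X_{iΔ} − F(X_{(i−1)Δ}; θ)) = 0 are martingale estimating functions for θ = (κ, µ) at any sampling interval Δ.

```python
# Sketch of a linear martingale estimating function for the Vasicek model.
# The summands are martingale differences under the model, for any Delta.
import numpy as np
from scipy.optimize import fsolve

def estimating_equations(theta, X, delta):
    kappa, mu = theta
    x, y = X[:-1], X[1:]
    resid = y - (mu + (x - mu) * np.exp(-kappa * delta))
    return [np.sum(resid), np.sum(x * resid)]   # two equations, two unknowns

# Illustrative data: an exactly simulated Vasicek-type AR(1) path.
rng = np.random.default_rng(3)
delta, b = 1 / 12, np.exp(-0.5 / 12)            # kappa = 0.5, monthly sampling
X = np.empty(1000)
X[0] = 0.08
for i in range(1, X.size):
    X[i] = 0.08 * (1 - b) + b * X[i - 1] + 0.005 * rng.standard_normal()

kappa_hat, mu_hat = fsolve(estimating_equations, x0=[0.3, 0.07], args=(X, delta))
print(kappa_hat, mu_hat)
```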
In a series of important papers, Marc Hoffmann has studied optimal rates of convergence of nonparametric estimators of the drift and diffusion coefficient under the three asymptotic scenarios usually considered for diffusion models, including optimal estimators; see [8, 11, 12]. Other estimators of the diffusion coefficient were proposed by Soulier [23] and Jacod [15].

REFERENCES

[1] Bibby, B. M., Jacobsen, M. and Sørensen, M. (2005). Estimating functions for discretely sampled diffusion-type models. In Handbook of Financial Econometrics (Y. Aït-Sahalia and L. P. Hansen, eds.). North-Holland, Amsterdam. To appear.
[2] Bibby, B. M., Skovgaard, I. M. and Sørensen, M. (2005). Diffusion-type models with given marginal distribution and autocorrelation function. Bernoulli 11 191–220.
[3] Bibby, B. M. and Sørensen, M. (1995). Martingale estimation functions for discretely observed diffusion processes. Bernoulli 1 17–39.
[4] Ditlevsen, S. and Sørensen, M. (2004). Inference for observations of integrated diffusion processes. Scand. J. Statist. 31 417–429.
[5] Florens-Zmirou, D. (1993). On estimating the diffusion coefficient from discrete observations. J. Appl. Probab. 30 790–804.
[6] Genon-Catalot, V., Jeantheau, T. and Larédo, C. (2000). Stochastic volatility models as hidden Markov models and statistical applications. Bernoulli 6 1051–1079.
[7] Genon-Catalot, V., Larédo, C. and Picard, D. (1992). Nonparametric estimation of the diffusion coefficient by wavelet methods. Scand. J. Statist. 19 317–335.
[8] Gobet, E., Hoffmann, M. and Reiss, M. (2004). Nonparametric estimation of scalar diffusions based on low frequency data. Ann. Statist. 32 2223–2253.
[9] Godambe, V. P. and Heyde, C. C. (1987). Quasi-likelihood and optimal estimation. Internat. Statist. Rev. 55 231–244.
[10] Heyde, C. C. (1997). Quasi-Likelihood and Its Application. Springer, New York.
[11] Hoffmann, M. (1999). Adaptive estimation in diffusion processes. Stochastic Process. Appl. 79 135–163.
[12] Hoffmann, M. (1999). Lp estimation of the diffusion coefficient. Bernoulli 5 447–481.
[13] Jacobsen, M. (2001). Discretely observed diffusions: Classes of estimating functions and small Δ-optimality. Scand. J. Statist. 28 123–149.
[14] Jacobsen, M. (2002). Optimality and small Δ-optimality of martingale estimating functions. Bernoulli 8 643–668.
[15] Jacod, J. (2000). Nonparametric kernel estimation of the coefficient of a diffusion. Scand. J. Statist. 27 83–96.
[16] Kessler, M. (1997). Estimation of an ergodic diffusion from discrete observations. Scand. J. Statist. 24 211–229.
[17] Kessler, M. and Sørensen, M. (1999). Estimating equations based on eigenfunctions for a discretely observed diffusion process. Bernoulli 5 299–314.
[18] Larsen, K. S. and Sørensen, M. (2005). A diffusion model for exchange rates in a target zone. Math. Finance. To appear.
[19] Sørensen, H. (2004). Parametric inference for diffusion processes observed at discrete points in time: A survey. Internat. Statist. Rev. 72 337–354.
[20] Sørensen, M. (1997). Estimating functions for discretely observed diffusions: A review. In Selected Proceedings of the Symposium on Estimating Functions (I. V. Basawa, V. P. Godambe and R. L. Taylor, eds.) 305–325. IMS, Hayward, CA.
[21] Sørensen, M. (2000). Prediction-based estimating functions. Econom. J. 3 123–147.
[22] Sørensen, M. (2005). Efficient martingale estimating functions for discretely sampled ergodic diffusions. Preprint, Dept. Appl. Math. and Statistics, Univ. Copenhagen.
[23] Soulier, P. (1998). Nonparametric estimation of the diffusion coefficient of a diffusion process. Stochastic Anal. Appl. 16 185–200.
Statistical Science 2005, Vol. 20, No. 4, 347–350. DOI 10.1214/088342305000000458. © Institute of Mathematical Statistics, 2005

Comment: Per A. Mykland and Lan Zhang
Per A. Mykland is Professor, Department of Statistics, The University of Chicago, Chicago, Illinois 60637, USA (e-mail: mykland@galton.uchicago.edu). Lan Zhang is Assistant Professor, Department of Finance, University of Illinois at Chicago, Chicago, Illinois 60607, and Department of Statistics, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA (e-mail: lzhang@stat.cmu.edu).

We would like to congratulate Jianqing Fan for an excellent and well-written survey of some of the literature in this area. We will here focus on some of the issues which are at the research frontiers in financial econometrics but are not covered in the survey. Most importantly, we consider the estimation of actual volatility. Related to this is the realization that financial data is actually observed with error (typically called market microstructure), and that one needs to consider a hidden semimartingale model. This has implications for the Markov models discussed above.

For reasons of space, we have not included references to all the relevant work by the authors that are cited, but we have tried to include at least one reference to each of the main contributors to the realized volatility area.

1. THE ESTIMATION OF ACTUAL VOLATILITY: THE IDEAL CASE

The paper discusses the estimation of Markovian systems, models where the drift and volatility coefficients are functions of time t or state x. There is, however, scope for considering more complicated systems. An important tool in this respect is the direct estimation of volatility based on high-frequency data. One considers a system of, say, log securities prices, which follows

(1) dX_t = µ_t dt + σ_t dB_t,

where B_t is a standard Brownian motion. Typically, µ_t, the drift coefficient, and σ_t², the instantaneous variance (or volatility) of the returns process X_t, will be stochastic processes, but these processes can depend on the past in ways that need not be specified and can be substantially more complex than a Markov model. This is known as an Itô process.

A main quantity of econometric interest is to obtain

Θ_i = ∫_{T_{i−}}^{T_{i+}} σ_t² dt, i = 1, 2, . . . .

Here T_{i−} and T_{i+} can, for example, be the beginning and the end of day number i. Θ_i is variously known as the integrated variance (or volatility) or quadratic variation of the process X. The reason why one can hope to obtain this series is as follows. If T_{i−} = t₀ < t₁ < · · · < t_n = T_{i+} spans day number i, define the realized volatility by

(2) Θ̂_i = Σ_{j=0}^{n−1} (X_{t_{j+1}} − X_{t_j})².

Then stochastic calculus tells us that

(3) Θ_i = lim_{max |t_{j+1} − t_j| → 0} Θ̂_i.

In the presence of high frequency financial data, in many cases with transactions as often as every few seconds, one can, therefore, hope to almost observe Θ_i. One can then either fit a model to the series of Θ̂_i, or one can use it directly for portfolio management (as in [12]), options hedging (as in [29]), or to test goodness of fit [31].

There are too many references to the relationship (3) to name them all, but some excellent treatments can be found in [27], Section 1.5; [26], Theorem I.4.47 on page 52; and [33], Theorem II-22 on page 66. An early econometric discussion of this relationship can be found in [2].

To make it even more intriguing, recent work both from the probabilistic and econometric sides gives the mixed normal distribution of the error in the approximation in (3). References include [6, 25, 31]. The random variance of the normal error is 2((T_{i+} − T_{i−})/n) ∫_{T_{i−}}^{T_{i+}} σ_t⁴ dH(t), where H is the quadratic variation of time; H(t) = t in the case where the t_j are equidistant.
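For equidistant observations this random variance can be estimated by a standard quarticity plug-in: since Σ_j (ΔX_j)⁴ ≈ 3Δ ∫ σ_t⁴ dt, the quantity (2/3) Σ_j (ΔX_j)⁴ estimates the variance expression above. The following sketch is our own illustration; the constant-volatility simulation is an assumption chosen for clarity.

```python
# Sketch: one day's realized volatility, cf. (2), plus an asymptotic
# standard error from the quarticity plug-in (2/3) * sum of fourth powers.
import numpy as np

def realized_vol_with_se(X):
    dX = np.diff(X)
    rv = np.sum(dX ** 2)                      # realized volatility
    var_hat = (2.0 / 3.0) * np.sum(dX ** 4)   # estimates 2*(T/n)*int sigma^4
    return rv, np.sqrt(var_hat)

# Illustrative data: constant-volatility log-price path over one day (T = 1).
rng = np.random.default_rng(4)
n, sigma = 2000, 0.2
X = np.cumsum(sigma * np.sqrt(1.0 / n) * rng.standard_normal(n))
rv, se = realized_vol_with_se(X)
print(rv, "+/-", 1.96 * se, "target:", sigma**2)
```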
Further econometric literature includes, in particular, [3, 4, 8, 9, 14, 18, 32]. Problems that are attached to the estimation of covariations between two processes are discussed in [22]. Estimating σ_t² at each point t goes back to [13]; see also [30], but this has not caught on quite as much in econometric applications.

2. THE PRESENCE OF MEASUREMENT ERROR

The theory described above runs into a problem with real data. For illustration, consider how the realized volatility depends on sampling frequency for the stock (and day) considered in Figure 1. The estimator does not converge as the observation points t_i become dense in the interval of this one day, but rather seems to take off to infinity. This phenomenon was originally documented in [2]. For transaction data, this picture is repeated for most liquid securities [19, 37].

In other words, the model (1) is wrong. What can one do about this? A lot of people immediately think that the problem is due to jumps, but that is not the case. The limit in (3) exists even when there are jumps. The requirement for (3) to exist is that the process X be a semimartingale (we again cite Theorem I.4.47 of [26]), which includes both Itô processes and jumps.

The inconsistency between the empirical results, where the realized volatility diverges with finer sampling, and the semimartingale theory, which dictates the convergence of the realized volatility, poses a problem, since financial processes are usually assumed to be semimartingales. Otherwise, somewhat loosely speaking, there would be arbitrage opportunities in the financial markets. For rigorous statements, see, in particular, [11]. The semimartingaleness of financial processes, therefore, is almost a matter of theology in most of finance, and yet, because of Figure 1 and similar graphs for other stocks, we have to abandon it.

Our alternative model is that there is measurement error in the observation. At transaction number i, instead of seeing X_{t_i} from model (1) or, more generally, from a semimartingale, one observes

(4) Y_{t_i} = X_{t_i} + ε_i.

FIG. 1. [Figure omitted.] Plot of realized volatility for Alcoa Aluminum for January 4, 2001. The data is from the TAQ database. There are 2011 transactions on that day, on average one every 13.365 seconds. The most frequently sampled volatility uses all the data, and this is denoted as “frequency = 1.” “Frequency = 2” corresponds to taking every second sampling point. Because this gives rise to two estimators of volatility, we have averaged the two. And so on for “frequency = k” up to 20. The plot corresponds to the average realized volatility discussed in [37]. Volatilities are given on an annualized and square root scale.
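The construction described in the caption is easy to reproduce. The sketch below (simulation settings are our own illustrative choices) generates noisy observations as in (4) and computes the average realized volatility at several sampling frequencies k, showing the qualitative divergence at high frequencies.

```python
# Sketch of a volatility "signature plot": average realized volatility at
# sampling frequency k, averaged over the k offset subgrids, as in the
# Figure 1 caption. Noise as in (4) inflates the estimate as k -> 1.
import numpy as np

def avg_rv(Y, k):
    # Average the k realized volatilities from grids offset by 0, ..., k-1.
    return np.mean([np.sum(np.diff(Y[j::k]) ** 2) for j in range(k)])

rng = np.random.default_rng(5)
n, sigma, noise_sd = 2000, 0.2, 0.002
X = np.cumsum(sigma * np.sqrt(1.0 / n) * rng.standard_normal(n))
Y = X + noise_sd * rng.standard_normal(n)    # observed prices, model (4)

for k in (1, 2, 5, 10, 20):
    print(k, avg_rv(Y, k))   # decreases toward sigma^2 = 0.04 as k grows
```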
REFERENCES

[17] Gloter, A. and Jacod, J. (2000). Diffusions with measurement errors: I—Local asymptotic normality and II—Optimal estimators. Technical report, Univ. de Paris VI.
[18] Goncalves, S. and Meddahi, N. (2005). Bootstrapping realized volatility. Technical report, Univ. de Montréal.
[19] Hansen, P. R. and Lunde, A. (2006). Realized variance and market microstructure noise. J. Bus. Econom. Statist. To appear.
[20] Harris, L. (1990). Statistical properties of the Roll serial covariance bid/ask spread estimator. J. Finance 45 579–590.
[21] Hasbrouck, J. (1993). Assessing the quality of a security market: A new approach to transaction-cost measurement. Review of Financial Studies 6 191–212.
[22] Hayashi, T. and Yoshida, N. (2005). On covariance estimation of non-synchronously observed diffusion processes. Bernoulli 11 359–379.
[23] Hoffmann, M. (1999). Lp estimation of the diffusion coefficient. Bernoulli 5 447–481.
[24] Jacod, J. (2000). Nonparametric kernel estimation of the coefficient of a diffusion. Scand. J. Statist. 27 83–96.
[25] Jacod, J. and Protter, P. (1998). Asymptotic error distributions for the Euler method for stochastic differential equations. Ann. Probab. 26 267–307.
[26] Jacod, J. and Shiryaev, A. N. (2003). Limit Theorems for Stochastic Processes, 2nd ed. Springer, New York.
[27] Karatzas, I. and Shreve, S. E. (1991). Brownian Motion and Stochastic Calculus, 2nd ed. Springer, New York.
[28] Kolassa, J. and McCullagh, P. (1990). Edgeworth series for lattice distributions. Ann. Statist. 18 981–985.
[29] Mykland, P. A. (2003). Financial options and statistical prediction intervals. Ann. Statist. 31 1413–1438.
[30] Mykland, P. A. and Zhang, L. (2001). Inference for volatility type objects and implications for hedging. Technical report, Dept. Statistics, Carnegie Mellon Univ.
[31] Mykland, P. A. and Zhang, L. (2002). ANOVA for diffusions. Technical report, Dept. Statistics, Univ. Chicago.
[32] Oomen, R. (2004). Properties of realized variance for a pure jump process: Calendar time sampling versus business time sampling. Technical report, Warwick Business School, Univ. Warwick.
[33] Protter, P. (2004). Stochastic Integration and Differential Equations: A New Approach, 2nd ed. Springer, New York.
[34] Roll, R. (1984). A simple implicit measure of the effective bid-ask spread in an efficient market. J. Finance 39 1127–1139.
[35] Zeng, Y. (2003). A partially-observed model for micro-movement of asset process with Bayes estimation via filtering. Math. Finance 13 411–444.
[36] Zhang, L. (2004). Efficient estimation of stochastic volatility using noisy observations: A multi-scale approach. Technical report, Dept. Statistics, Carnegie Mellon Univ.
[37] Zhang, L., Mykland, P. A. and Aït-Sahalia, Y. (2005). A tale of two time scales: Determining integrated volatility with noisy high-frequency data. J. Amer. Statist. Assoc. 100 1394–1411.
[38] Zhou, B. (1996). High-frequency data and volatility in foreign-exchange rates. J. Bus. Econom. Statist. 14 45–52.
Statistical Science
2005, Vol. 20, No. 4, 351–357
DOI 10.1214/088342305000000421
© Institute of Mathematical Statistics, 2005
I am very grateful to the Executive Editor, Edward George, for organizing this stimulating discussion. I would like to take this opportunity to thank Professors Peter Phillips, Jun Yu, Michael Sørensen, Per Mykland and Lan Zhang for their insightful and stimulating comments, touching both practical, methodological and theoretical aspects of financial econometrics and their applications in asset pricing, portfolio allocation and risk management. They have made valuable contributions to the understanding of various financial econometric problems.
The last two decades have witnessed an explosion of developments of data-analytic techniques in statistical modeling and analysis of complex systems. At the same time, statistical techniques have been widely employed to confront various complex problems arising from financial and economic activities. While the discipline has grown rapidly over the last two decades and has rich and challenging statistical problems, the number of statisticians involved in studying financial econometric problems is still limited. In comparison with statisticians working on problems in biological sciences and medicine, the group working on financial and econometric problems is dismally small. It is my hope that this article will provide statisticians with quick access to some important and interesting problems in financial econometrics and to catalyze the romance between statistics and finance. A similar effort was made by Cai and Hong [12], where various aspects of nonparametric methods in continuous-time finance are reviewed. It is my intention to connect financial econometric problems as closely to statistical problems as possible so that familiar statistical tools can be employed. With this in mind, I sometimes oversimplify the problems and techniques so that key features can be highlighted.
I am fully aware that financial econometrics has grown into a vast discipline itself and that it is impossible for me to provide an overview within a reasonable length. Therefore, I greatly appreciate what all discussants have done to expand the scope of discussion and provide additional references. They have also posed open statistical problems for handling nonstationary and/or non-Markovian data with or without market noise. In addition, statistical issues on various versions of capital asset pricing models and their related stochastic discount models [15, 19], the efficient market hypothesis [44] and risk management [17, 45] have barely been discussed. These reflect the vibrant intersection of the interfaces between statistics and finance. I will make some further efforts in outlining econometric problems where statistics plays an important role after a brief response to the issues raised by the discussants.
Jianqing Fan is Professor, Bendheim Center for Finance and Department of Operations Research and Financial Engineering, Princeton University, Princeton, New Jersey 08544, USA (e-mail: jqfan@princeton.edu).
1. BIASES IN STATISTICAL ESTIMATION
The contributions by Professors Phillips, Yu and Sørensen address the bias issues on the estimation of parameters in diffusion processes. Professors Phillips and Yu further translate the bias of diffusion parameter estimation into those of pricing errors of bonds and bond derivatives. Their results are very illuminating and illustrate the importance of estimation bias in financial asset pricing. Their results can be understood as follows. Suppose that the price of a financial asset depends on certain parameters θ (the speed of reversion κ in their illustrative example). Let us denote it by p(θ), which can be in one case the price of a bond and in another case the prices of derivatives of a bond. The value of the asset is now estimated by p(θ̂) with θ̂ being estimated from empirical data. When θ̂ is underestimated (say), which shifts the whole distribution of θ̂ to the left, the distribution of p(θ̂) will also be shifted, depending on the sensitivity of p to θ. The sensitivity is much larger for bond derivatives when κ is close to zero (see Figure 2 of [46]), and hence the pricing errors are much larger. On the other hand, as the distribution
of κ is shifted to the left, from Figure 2 of [46], both prices of bonds and their derivatives get smaller and so does the variance of pricing errors. Simulation studies in [46] suggest that these two effects cancel each other out in terms of mean square error.
I agree with Phillips and Yu's observation that discretization is not the main source of biases for many reasonable financial applications. Finite-sample estimation bias can be more severe. This partially answers the question raised by Professor Sørensen. On the other hand, his comments give theoretical insights into the bias due to discretization. For financial applications (such as modeling short-term rates) when the data are collected at monthly frequency, the bias {1 − exp(−κΔ)}/Δ − κ = −0.0019 and −0.00042, respectively, for κ = 0.21459 used in Figure 3 of [34] and for κ = 0.1 used in the discussion by Phillips and Yu. For weekly data, using the parameter κ = 0.0446 cited in [14], the discretization bias is merely 9.2 × 10−5.
For other types of applications, such as climatology, Professor Sørensen is right that the bias due to discretization can sometimes be substantial. It is both theoretically elegant and practically viable to have methods that work well for all situations. The quasi-maximum likelihood methods and their modifications discussed by Professor Sørensen are attractive alternatives. As he pointed out, analytical solutions are rare and computation algorithms are required. This increases the chance of numerical instability in practical implementations. The problem can be attenuated with the estimates based on the Euler approximation as an initial value. The martingale method is a generalization of his quasi-maximum likelihood estimator, which aims at improving efficiency by suitable choice of weighting functions aj. However, unless the conditional density has multiplicative score functions, the estimating equations will not be efficient. This explains the observation made by Professor Sørensen that the methods based on martingale estimating functions are usually not efficient for low frequency data. The above discussion tends to suggest that when the Euler approximation is reasonable, the resulting estimates tend to have smaller variances.
In addition to the discretization bias and finite sample estimation bias, there is model specification bias. This can be serious in many applications. In the example given by Professors Phillips and Yu, the modeling errors do not have any serious adverse effects on pricing bonds and their derivatives. However, we should be wary of generalizing this statement. Indeed, for the model parameters given in the discussion by Phillips and Yu, the transition density of the CIR model has a noncentral χ²-distribution with 80 degrees of freedom, which is close to the normal transition density given by the Vasicek model. Therefore, the model is not very seriously misspecified.
Nonparametric methods reduce model specification errors by either global modeling such as spline methods or local approximations. This reduces significantly the possibility of specification errors. Since nonparametric methods are somewhat crude and often used as model diagnostic and exploration tools, simple and quick methods serve many practical purposes. For example, in time domain smoothing, the bandwidth h is always an order of magnitude larger than the sampling frequency Δ. Therefore, the approximation errors due to discretization are really negligible. Similarly, for many realistic problems, the function approximation errors can be an order of magnitude larger than discretization errors. Hence, discretization errors are often not a main source of errors in nonparametric inference.
2. HIGH-FREQUENCY DATA
Professors Mykland, Zhang, Phillips and Yu address statistical issues for high-frequency data. I greatly appreciate their insightful comments and their elaborations on the importance and applications of the subject. Thanks to the advances in modern trading technology, the availability of high-frequency data over the last decade has significantly increased. Research in this area has advanced very rapidly lately. I would like to thank Professors Mykland and Zhang for their comprehensive overview on this active research area.
With high-frequency data, discretization errors have significantly been reduced. Nonparametric methods become even more important for this type of large sample problem. The connections between the realized volatility and the time-inhomogeneous model can simply be made as follows. Consider a subfamily of models of (8) in [34],
dXt = αt dt + σt dWt.
For high-frequency data the sampling interval is very small. For the sampling frequency of a minute, Δ = 1/(252 ∗ 24 ∗ 60) ≈ 2.756 × 10−6. Hence, standardized residuals in Section 2.5 of [34] become Et = Δ^{−1/2}(Xt+Δ − Xt) and the local constant estimate of the spot volatility reduces to
σ̂²_j = Σ_{i=−∞}^{j−1} w_{j−i} E²_i,
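A minimal implementation of this spot-volatility estimate may help fix ideas. The exponentially decaying weights w_i = (1 − λ)λ^{i−1} used below are one concrete choice, an assumption made for this sketch in the spirit of RiskMetrics-type smoothing; they turn the one-sided weighted sum into a simple recursion.

```python
import numpy as np

def spot_vol(X, delta, lam=0.94):
    """Spot volatility from the one-sided weighted sum of squared
    standardized residuals E_t = delta**(-1/2) * (X_{t+delta} - X_t),
    with assumed weights w_i = (1 - lam) * lam**(i - 1)."""
    E = np.diff(X) / np.sqrt(delta)
    sig2 = np.empty(len(E))
    sig2[0] = E[0] ** 2                     # crude initialization
    for j in range(1, len(E)):
        # Recursive form of sigma^2_j = sum_{i < j} w_{j-i} * E_i^2:
        sig2[j] = (1 - lam) * E[j - 1] ** 2 + lam * sig2[j - 1]
    return np.sqrt(sig2)

# One-minute sampling as in the text; constant sigma_t = 0.2 for the check.
delta = 1.0 / (252 * 24 * 60)
rng = np.random.default_rng(1)
X = np.cumsum(rng.normal(0.0, 0.2 * np.sqrt(delta), 5000))
print(spot_vol(X, delta)[-5:])              # values hovering near 0.2
```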
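The monthly discretization-bias figures quoted in Section 1 can be checked directly. The snippet below assumes Δ = 1/12 and the bias formula {1 − exp(−κΔ)}/Δ − κ as reconstructed above.

```python
import math

Delta = 1.0 / 12                           # monthly sampling
for kappa in (0.21459, 0.1):
    bias = (1 - math.exp(-kappa * Delta)) / Delta - kappa
    print(f"kappa = {kappa}: bias = {bias:.5f}")
# prints approximately -0.00191 and -0.00042, matching the quoted values
```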
available sample size is usually in the order of hundreds or a few thousand. Longer time series (larger n) will increase modeling biases. Without imposing structures on the covariance matrices, they are hard to estimate. Thanks to the multi-factor models (see Chapter 6 of [13]), if a few factors can capture completely the cross-sectional risks, the number of parameters can be significantly reduced. For example, using the Fama–French three-factor models [32, 33], there are 4p instead of p(p + 1)/2 parameters. Natural questions arise with this structured estimate of the covariance matrix: how large can p be such that the estimation error in the covariance matrix is negligible in asset allocation and risk management? The problems of this kind are interesting and remain open.
Another possible approach to the estimation of covariance matrices is to use a model selection approach. First of all, according to Chapter 3 of [39], the Cholesky decomposition admits a nice autoregressive interpretation. We may reasonably assume that the elements in the Cholesky decomposition of the covariance matrix are sparse. Hence, the penalized likelihood method [3, 35, 42] can be employed to select and estimate nonsparse elements. The sampling property of such a method remains unknown. Its impact on portfolio allocation and risk management needs to be studied.
4. STATISTICS IN DERIVATIVE PRICING
Over the last three decades, option pricing has witnessed an explosion of new models that extend the original work of Black and Scholes [9]. Empirically pricing financial derivatives is innately related to statistical regression problems. This is well documented in papers such as [6, 7, 15, 16, 25, 41]. See also a brief review given by Cai and Hong [12]. For a given stochastic model with given structural parameters under the risk-neutral measure, the prices of European options can be determined, which are simply the discounted expected payoffs under the risk-neutral measure. Bakshi, Cao and Chen [6] give the analytic formulas of option prices for five commonly used stochastic models, including the stochastic-volatility random-jump model. They then estimate the risk-neutral parameters by minimizing the discrepancies between the observed prices and the theoretical ones. With estimated risk-neutral parameters, option prices with different characteristics can be evaluated. They conduct a comprehensive study of the relative merits of competing option pricing models by computing pricing errors for new options. Dumas, Fleming and Whaley [25] model the implied volatility function by a quadratic function of the strike price and time to maturity and determine these parameters by minimizing pricing errors. Based on the analytic formula of Bakshi, Cao and Chen [6] for option price under the stochastic volatility models, Chernov and Ghysels [16] estimate the risk-neutral parameters by integrating information from both historical data and risk-neutral data implied by observed option prices. Instead of using continuous-time diffusion models, Heston and Nandi [41] assume that the stock prices under the risk-neutral world follow a GARCH model and derive a closed form for European options. They determine the structural parameters by minimizing the discrepancy between the empirical and theoretical option prices. Barone-Adesi, Engle and Mancini [7] estimate risk-neutral parameters by integrating the information from both historical data and option prices. Christoffersen and Jacobs [18] expand the flexibility of the model by introducing long- and short-run volatility components.
The above approaches can be summarized as follows. Using the notation in Section 4.1 of [34], the theoretical option price with option characteristics (Si, Ki, Ti, ri, δi) is governed by a parametric form C(Si, Ki, Ti, ri, δi, θ), where θ is a vector of structural parameters of the stock price dynamics under the risk-neutral measure. The form depends on the underlying parameters of the stochastic model. This can be in one case a stochastic volatility model and in another case a GARCH model. The parameters are then determined by minimizing
Σ_{i=1}^{n} {Ci − C(Si, Ki, Ti, ri, δi, θ)}²
or similar discrepancy measures. The success of a method depends critically on the correctness of model assumptions under the risk-neutral measure. Since these assumptions are not on the physical measure, they are hard to verify. This is why so many parametric models have been introduced. Their efforts can be regarded as searching for an appropriate parametric form C(·; θ) to better fit the option data. Nonparametric methods in Section 4.1 provide a viable alternative for this purpose. They can be combined with parametric approaches to improve the accuracy of pricing.
As an illustration, let us consider the options with fixed (Si, Ti, ri, δi) so that their prices are only a function of K or equivalently a function of the moneyness m = K/S,
C = exp(−rT) ∫_K^∞ (x − K) f*(x) dx.
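Purely as an illustration of the least-squares calibration criterion displayed above, the following sketch fits a one-parameter family to option prices. As an assumption (not the five models of [6]), the parametric form C(·; θ) is taken to be the Black–Scholes call price with θ = σ, and the "observed" prices are synthetic.

```python
import numpy as np
from math import log, sqrt, exp
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def bs_call(S, K, T, r, d, sigma):
    """Black-Scholes European call with dividend yield d."""
    d1 = (log(S / K) + (r - d + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * exp(-d * T) * norm.cdf(d1) - K * exp(-r * T) * norm.cdf(d2)

# Synthetic observed prices C_i with common characteristics (S, T, r, delta).
S, T, r, dy = 100.0, 0.25, 0.03, 0.0
Ks = np.arange(80.0, 121.0, 5.0)
rng = np.random.default_rng(2)
C_obs = np.array([bs_call(S, K, T, r, dy, 0.22) for K in Ks]) \
        + rng.normal(0.0, 0.05, len(Ks))

# theta minimizes sum_i {C_i - C(S_i, K_i, T_i, r_i, delta_i, theta)}^2.
loss = lambda sig: sum((c - bs_call(S, K, T, r, dy, sig)) ** 2
                       for c, K in zip(C_obs, Ks))
fit = minimize_scalar(loss, bounds=(0.01, 2.0), method="bounded")
print(f"calibrated sigma = {fit.x:.4f}")   # near the 0.22 that generated the data
```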
Denoting D = exp(rT)C/S and letting F̄*(x) = 1 − F*(x) = ∫_x^∞ f*(y) dy be the survival function, then by integration by parts,
D = −S^{−1} ∫_K^∞ (x − K) dF̄*(x) = S^{−1} ∫_K^∞ F̄*(x) dx.
By a change of variable, we have
D = ∫_m^∞ F̄(u) du,
where F(u) = F*(Su) is the state price distribution in the normalized scale (the stock price is normalized to $1). Let us write explicitly D(m) to stress the dependence of the discounted option price on the moneyness m. Then
{D(m1) − D(m2)}/(m2 − m1) = (m2 − m1)^{−1} ∫_{m1}^{m2} F̄(u) du ≈ F̄((m1 + m2)/2),
and estimating the state price distribution becomes a familiar nonparametric regression problem,
yi ≈ F̄(xi) + εi.
In the above equation, the dependence on t is suppressed. Figure 1(a) shows the scatterplot of the pairs (xi, yi) based on the closing call option prices (average of bid-ask prices) of the Standard and Poor's 500 index with maturity of Tt = 75 − t days on the week of July 7 to July 11, 2003 (t = 0, . . . , 4). The implied volatility curve is given in Figure 1(b). It is not a constant and provides stark evidence against the Black–Scholes formula.
FIG. 1. (a) Scatterplot of the response variable computed based on option prices with consecutive strike price against the moneyness. (b) The implied volatilities of the options during the period July 7–11, 2003.
The waterfall shape of the regression curve is very clear. The naive applications of nonparametric techniques will incur large approximation biases resulting in systematic pricing errors. One possible improve-
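The construction of the regression pairs can be written out explicitly. In the sketch below, the definitions y_i = exp(rT){C_i − C_{i+1}}/{S(m_{i+1} − m_i)} and x_i = (m_i + m_{i+1})/2 are inferred from the difference quotient above; they should be read as assumptions about details suppressed in the text, and the call prices used here are Black–Scholes-generated placeholders rather than the S&P 500 data of Figure 1.

```python
import numpy as np
from scipy.stats import norm

def state_price_pairs(C, K, S, r, T):
    """Regression pairs (x_i, y_i) with y_i ~ Fbar(x_i): difference quotients
    of the discounted, normalized price D(m) = exp(rT) * C / S."""
    m = K / S                               # moneyness of each strike
    D = np.exp(r * T) * C / S
    y = -np.diff(D) / np.diff(m)            # ~ survival function at the midpoint
    x = (m[:-1] + m[1:]) / 2
    return x, y

# Placeholder prices from the Black-Scholes formula (lognormal state prices).
S, r, T, sigma = 1.0, 0.03, 75 / 365, 0.2
K = np.linspace(0.8, 1.2, 21)
d1 = (np.log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
d2 = d1 - sigma * np.sqrt(T)
C = S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

x, y = state_price_pairs(C, K, S, r, T)
print(np.round(y[:3], 3), np.round(y[-3:], 3))
# y decreases from near 1 (deep in the money) toward 0 (far out of the money),
# the "waterfall" shape discussed in the text.
```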
[39] FAN, J. and YAO, Q. (2003). Nonlinear Time Series: Nonparametric and Parametric Methods. Springer, New York.
[40] GOLDFARB, D. and IYENGAR, G. (2003). Robust portfolio selection problems. Math. Oper. Res. 28 1–37.
[41] HESTON, S. L. and NANDI, S. (2000). A closed-form GARCH option valuation model. Review of Financial Studies 13 585–625.
[42] HUANG, J. Z., LIU, N. and POURAHMADI, M. (2005). Covariance selection and estimation via penalized normal likelihood. Unpublished manuscript.
[43] JOHNSTONE, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29 295–327.
[44] LO, A. (2000). Finance: A selective survey. J. Amer. Statist. Assoc. 95 629–635.
[45] MOFFATT, H. K. (2002). Risk Management: Value at Risk and Beyond. Cambridge Univ. Press.
[46] PHILLIPS, P. C. B. and YU, J. (2005). Jackknifing bond option prices. Review of Financial Studies 18 707–742.
[47] TYLER, D. E. (1981). Asymptotic inference for eigenvectors. Ann. Statist. 9 725–736.
[48] VAN DER WEIDE, R. (2002). GO-GARCH: A multivariate generalized orthogonal GARCH model. J. Appl. Econometrics 17 549–564.
[49] WIEAND, K. (2002). Eigenvalue distributions of random unitary matrices. Probab. Theory Related Fields 123 202–224.
[50] ZHANG, L., MYKLAND, P. and AÏT-SAHALIA, Y. (2005). A tale of two time scales: Determining integrated volatility with noisy high-frequency data. J. Amer. Statist. Assoc. 100 1394–1411.
Statistical Science
2005, Vol. 20, No. 3, 205–209
DOI 10.1214/088342305000000205
© Institute of Mathematical Statistics, 2005
Abstract. Over the last fifty years, How to Lie with Statistics has sold more
copies than any other statistical text. This note explores the factors that con-
tributed to its success and provides biographical sketches of its creators: au-
thor Darrell Huff and illustrator Irving Geis.
Key words and phrases: Darrell Huff, Irving Geis, How to Lie with Statis-
tics, numeracy, graphs, crescive cow.
J. Michael Steele is C.F. Koo Professor of Statistics and Operations and Information Management, Department of Statistics, The Wharton School, University of Pennsylvania, Philadelphia, Pennsylvania 19104-6340, USA (e-mail: steele@wharton.upenn.edu).
1. TOUCHING A MILLION LIVES
In 1954 former Better Homes and Gardens editor and active freelance writer Darrell Huff published a slim (142 page) volume which over time would become the most widely read statistics book in the history of the world. In its English language edition, more than one-half million copies of How to Lie with Statistics have been sold. Editions in other languages have been available for many years, and new translations continue to appear. In 2003 the first Chinese edition was published by the Department of Economics of Shanghai University.
There is some irony to the world's most famous statistics book having been written by a person with no formal training in statistics, but there is also some logic to how this came to be. Huff had a thorough training for excellence in communication, and he had an exceptional commitment to doing things for himself.
2. DARRELL HUFF AND THE PATH TO A FREELANCER'S LIFE
Darrell Huff was born on July 15, 1913 in Gowrie, Iowa, a small farming community fifty miles from Ames, Iowa. Huff received his early education in Gowrie, had a lively curiosity about almost everything and eventually evolved an interest in journalism. He studied both sociology and journalism at the University of Iowa, but even before completing his bachelor's degree in 1938, Huff had worked as a reporter for The Herald of Clinton, Iowa and as a feature writer for The Gazette of Cedar Rapids. In 1937 he had also married Frances Marie Nelson, who would become his co-author, mother of his four daughters and wife of sixty-four years.
In 1939, when Huff finished his work at the University of Iowa with a master's degree, he made the move from newspapers to magazines. This was a golden age for the magazine industry and in some ways Iowa was at the heart of it. Through 1940, Huff served as an associate editor at Look magazine in Des Moines, and he then took a brief detour as editor-in-chief of D. C. Cook publishing in Elgin, Illinois. After two years in Elgin, he returned to Des Moines to become managing editor of the very influential Better Homes and Gardens [6].
Huff's position at Better Homes and Gardens put him at the top of his profession, but these were turbulent times. In 1944 Huff was offered the executive editorship of Liberty magazine, and he and his family made the hard decision to move to New York. Almost forgotten now, Liberty was at the time a magazine of great national importance. It competed vigorously with the famous The Saturday Evening Post with which it shared many contributors, readers and advertisers. Even today, Liberty competes with The Saturday Evening Post for the attention and affection of collectors of cover illustrations and nostalgic advertising art.
With the end of the Second World War, both New York and the editorship of Liberty lost some of their appeal to Huff. What had begun as an exciting adventure started to feel like a "rat race." As Huff would write sixteen years later in an article [14], "I suppose the whole
thing got started because I developed a habit of grinding my teeth in my sleep." Huff then went on to explain that one night when the teeth grinding was particularly ominous, his wife woke him saying, "This has got to stop." Huff thought she meant the teeth grinding and he agreed, but it turned out that Fran had other things in mind. "Let's get away from all this," she said. "You can go back to freelancing."
3. A DO-IT-YOURSELFER'S CALIFORNIA IDYLL
Freelance writing and photography had supported Huff through college, but he had no illusions about the challenge he would face trying to support a family without a steady job. Nevertheless, his wife Fran was undaunted, and they decided to move to California, build their own home and make a living through freelance writing. In 1946 they headed west with their life savings of $4000 and a secondhand trailer towed behind a 1941 sedan. Today, such an action would be as sensible as looking for gold in the rivers, but those were different times. For $1500 the Huffs picked up ten acres in Valley of the Moon, five miles from Sonoma. Post-war rationing and a limited budget forced the Huffs to be creative, but after a few months of working on their house while living in their trailer, the Huffs moved into their first home of their own design. It was not finished, but it had a huge fireplace and an inspiring wall of glass facing the California foothills.
The Sonoma home was the first of several homes that Huff would build with his own hands, including one in Carmel on the Monterey peninsula. It has 3000 square feet, four bedrooms, a study, three baths and a hexagonal living room that looks directly over the Pacific. It has been featured in numerous articles and has been honored by the National Building Museum. Darrell Huff lived in his Carmel home until he died on June 27, 2001. Frances Marie Huff lives there today, where she is often visited by her daughters.
4. OF HUFF'S MANY BOOKS
At his core, Darrell Huff was a magazine man, and almost all of his writing was directed at the daily concerns of the common person. Popular writing often goes undocumented into history, and today it is not completely certain how many books Huff wrote. Cataloguers of the Library of Congress list fourteen volumes to his credit, but they missed at least two ([12] and [20]). Of Huff's several hundred feature articles, only one [14] is available on the Internet.
Huff's first book, Pictures by Pete [7], appeared in 1944 with the subtitle A Career Story of a Young Commercial Photographer; in 1945 Huff broadened his theme with Twenty Careers of Tomorrow [8]. Perhaps the young Huff was thinking out loud as he contemplated his own future. In any event, Huff took an entirely new direction in 1946 with his third book, The Dog that Came True [9].
Doing a dog book seemed to put Huff off book writing for a while, and for the next nine years he devoted himself to article writing and to do-it-yourself projects, such as his remarkable California homes. Still, when Huff took up his next book project, he bought himself a piece of posterity. The book was How to Lie with Statistics [10], and it is the reason why we celebrate Darrell Huff in this volume.
While the longevity of How to Lie with Statistics must have come as a surprise to Huff, the seeds for its initial success were carefully planted. Remarkably, it was reviewed twice in The New York Times, a feat that would be almost impossible today. First, on January 4, 1954, Burton Crane gave it a column-and-a-half [4] in "The Business Bookshelf"; then on January 16, it got another half column [30] in Charles Poore's "Books of The Times." Both reviews were highly positive, though not especially insightful. In August of 1954, Huff got to follow up in the Times with a more informative feature article [11] of his own, "How to Spot Statistical Jokers." At the bottom of the first column, Huff added a footnote that says "Darrell Huff is a specialist in reading between statistical lines. He is the author of the book How to Lie with Statistics." This does indeed seem to be how the world most remembers him today.
Darrell Huff was not one to argue with success, and the thoughts and skills that led to How to Lie with Statistics were to provide the basis for six further books that would deal with what we now call quantitative literacy. The first of these was the natural How to Take a Chance [12], where he again partnered with the brilliant illustrator Irving Geis. Even though the sales of this book were small in comparison with How to Lie with Statistics, it was still a solid success.
Huff's other efforts in quantitative literacy were also commercially successful, but less dramatically so. They include Score: The Strategy of Taking Tests [13] and Cycles in Your Life [15], two volumes which set a pattern that others would carry to even greater commercial success. In How to Figure the Odds on Everything [17], Huff polished and modernized his treatment of the themes that he first addressed in How to Take a Chance [12].
Darrell Huff’s last book, The Complete How to Fig- South Carolina and studied architecture at Georgia
ure It [19], was written with the design and illustration Tech before going to the University of Pennsylvania,
assistance of two of his daughters, Carolyn and Kristy. where he obtained a Bachelor of Fine Arts degree in
It was published in 1996 when Huff was eighty-three 1929. He found success as a magazine illustrator, and
years old, and it is an unusual book that is in many the first pictures that many of us ever saw of Sputniks
ways strikingly original. Its modular structure suggests orbiting, continents drifting or double helixes dividing,
that it may have been brewing for many years. The re- were the work of Irving Geis.
mainder of Huff’s books ([16, 18, 20–23]) deal more Nevertheless, it was through his illustrations of com-
directly with household projects than with quantitative plex molecules that Geis assured his own place in his-
literacy, but in each of these one finds considerable cut- tory. His involvement began with a 1961 Scientific
ting and measuring—practical challenges that put to American article by John Kendrew where Geis illus-
test even the most quantitatively literate. trated the first protein crystal structure to be discov-
ered, that of sperm whale myoglobin. Large molecules
5. FOUR SOURCES OF SUCCESS provided Geis with a perfect venue for his talents, and
When one asks what may have led to the remarkable he continued to illustrate them for the rest of his life.
success of How to Lie with Statistics, it is natural to He died on July 22, 1997, and, in his funeral oration,
consider four likely sources: the title, the illustrations, Richard Dickerson [5] called Geis the “Leonardo da
the style and the intellectual content. In each case, from Vinci of protein structure.”
the perspective of fifty years, one finds more than might The Style
have originally met the eye.
In a word, Huff’s style was—breezy. A statistically
The Title trained reader may even find it to be breezy to a fault,
Many statisticians are uncomfortable with Huff’s ti- but such a person never was part of Huff’s intended
tle. We spend much of our lives trying to persuade audience. My copy of The Complete Idiot’s Guide to
others of the importance and integrity of statistical the Roman Empire [27] is also breezy, but, since I am
analysis, and we are naturally uncomfortable with the not a historian, that is just the way I like it.
suggestion that statistics can be used to craft an inten- We all know now (history has taught us!) that many
tional lie. Nevertheless, the suggestion is valid. People subjects can be made more accessible when they are
do lie with statistics every day, and it is to Huff’s credit lightened up with cartoons and an over-the-top casu-
that he takes the media (and others) to task for having alness. Fifteen years of Idiot, Dummy and Cartoon
stretched, torn or mutilated the truth. guides have shown that this formula works—not every
Irving Geis, the illustrator of How to Lie with Sta- time—but often enough to guarantee that such guides
tistics, also said [5] what many have thought: “Huff will be bookstore staples for many years to come. It
could have well titled it An Introduction to Statistics would require careful bibliographical research to deter-
and it would have sold a few hundred copies for a year mine how much credit Huff deserves for this formula,
or two. But with that title, it’s been selling steadily but even a quick look will show that many of the el-
since 1954.” Indeed, the title has done more than sim- ements of the modern formula are already present in
ply stand the test of time; it has been honored through How to Lie with Statistics. In the publishing field, this
the years by other authors who pursued their own vari- is what one means by pioneering, original work.
ations, including How to Lie with . . . charts [24], maps The Content
[26], methodology [31] and even “Your Mac” [2].
A great title, great illustrations and chatty quips will
The Illustrations—and the Illustrator
quickly run out of steam unless they are used in support
Although it is now commonplace to see cartoons in of genuine intellectual content. The first four chapters
serious books, it was not always so. Huff and his il- (The Sample with the Built-in Bias, The Well-Chosen
lustrator, Irving Geis, helped to pave the way. As an Average, The Little Figures That Are Not There and
experienced magazine editor, Huff may—or may not— Much Ado about Practically Nothing) deal with mater-
have foreseen this development, but what Huff surely ial that is covered in any introductory statistics class,
saw was the brilliance of his partner. so to statisticians there is not much that is original
Irving Geis was born in New York in 1908 as Irving here—or is there? These chapters take only forty-nine
Geisberg [1], but he attended high school in Anderson, (breezy, cartoon-filled) pages, yet I suspect that many
of us would be content to know that our students could be certain to have a mastery of these chapters a year after having completed one of our introductory classes.
The next three chapters (The Gee-Whiz Graph, The One-Dimensional Picture and The Semiattached Figure) deal with graphics, and to me these are the most original in the book, which, incidentally, I first read as a high school student in Lubbock, Texas. What struck me then—and still strikes me now—was the utter devilishness of the "crescive cow" illustration, which uses a 1936 cow that is 25/8 times as tall as an 1860 cow to demonstrate the growth of the US cow population to 25 from 8 million between 1860 and 1932. Since we intuitively judge a cow more by her volume than by her height, this illustration is massively, yet slyly, deceptive. Today the graduate text for such insights would be Tufte's beautiful book The Visual Display of Quantitative Information [32].
Huff's next two chapters are more heterogeneous. Chapter 8, Post Hoc Rides Again, gives just a so-so discussion of cause and effect, but Chapter 9 makes up for the lull. In How to Statisticulate, Huff takes on an issue that statisticians seldom discuss, even though they should. When we find that someone seems to be "lying with statistics," is he really lying or is he just exhibiting an unfortunate incompetence? Huff argues that it is often simple, rock-bottom, conniving dishonesty, and I believe that Huff is right.
Huff's last chapter is How to Talk Back to a Statistic, and it gets directly to what these days we cover in courses on critical thinking. He boils his technique down to just five questions: Who says so? How does he know? What's missing? Did someone change the subject? Does it make sense? Today anyone can test the effectiveness of these questions simply by checking how well they deal with problems such as those collected by Best [3], Levitt and Dubner [25] and Paulos [28, 29].
6. ABOUT THE SPECIAL ISSUE
This special section of Statistical Science collects seven further articles that address issues with which Huff and Geis would surely have had a natural rapport:
• Joel Best: Lies, Calculations, and Constructions: Beyond How to Lie with Statistics
• Mark Monmonier: Lying with Maps
• Walter Krämer and Gerd Gigerenzer: How to Confuse with Statistics or: The Use and Misuse of Conditional Probabilities
• Richard De Veaux and David Hand: How to Lie with Bad Data
• Charles Murray: How to Accuse the Other Guy of Lying with Statistics
• Sally Morton: Ephedra
• Stephen E. Fienberg and Paul C. Stern: In Search of the Magic Lasso: The Truth About the Polygraph.
The first four of these articles explore a remarkable variety of newly discovered pathways by which statistical lies continue to flow into our collective consciousness. The fifth piece, by Charles Murray, then amusingly explores the Swiftian proposal that young social scientists may be somehow secretly coached in subtle techniques for suggesting that someone else might be lying. Finally, in the last two pieces, we see how statistics is used (or not used) for better or worse in the public policy domain. Morton reprises the role of statistics in the regulatory background of the controversial herb ephedra, and Fienberg and Stern take on the technology of lie detection, where great expectations and empirical evidence face irreconcilable differences.
ACKNOWLEDGMENTS
I am especially pleased to thank Frances Marie Huff for her willingness to discuss the life and career of Darrell Huff. I also wish to thank Ed George, Dean Foster and Milo Schield for their advice and encouragement at the beginning of this project, Paul Shaman and Adi Wyner for their comments on earlier drafts, and George Casella for giving me the original green light.
REFERENCES
[1] ANONYMOUS (1998). Obituaries, '20s. The Pennsylvania Gazette, University of Pennsylvania Alumni Magazine, April.
[2] BEAMER, S. (1994). How to Lie With Your Mac. Hayden, Indianapolis.
[3] BEST, J. (2001). Damned Lies and Statistics. Untangling Numbers from the Media, Politicians, and Activists. Univ. California Press, Berkeley.
[4] CRANE, B. (1954). The Business Bookshelf. The New York Times, January 4, p. 41.
[5] DICKERSON, R. E. (1997). Irving Geis, Molecular artist, 1908–1997. Protein Science 6 2483–2484.
[6] ETHRIDGE, J. M. and KOPALA, B., eds. (1967). Contemporary Authors. A Biobibliographical Guide to Current Authors and their Works. Gale Research Company, Detroit.
[7] HUFF, D. (1944). Pictures by Pete. A Career Story of a Young Commercial Photographer. Dodd, Mead, New York.
[8] HUFF, D. (1945). Twenty Careers of Tomorrow. Whittlesey House, McGraw–Hill, New York.
[9] HUFF, D. (1946). The Dog that Came True (illust. C. Moran and D. Thorne). Whittlesey House, McGraw–Hill, New York.
[10] HUFF, D. (1954). How to Lie with Statistics (illust. I. Geis). Norton, New York.
[11] HUFF, D. (1954). How to Spot Statistical Jokers. The New York Times, August 22, p. SM13.
[12] HUFF, D. (1959). How to Take a Chance (illust. I. Geis). Norton, New York.
[13] HUFF, D. (1961). Score. The Strategy of Taking Tests (illust. C. Huff). Appleton–Century Crofts, New York.
[14] HUFF, D. (1962). Living high on $6500 a year. The Saturday Evening Post 235 60–62. [Reprinted in Mother Earth News, January 1970.] Available online at www.motherearthnews.com/mothers_library/1970.
[15] HUFF, D. (1964). Cycles in Your Life—The Rhythms of War, Wealth, Nature, and Human Behavior. Or Patterns in War, Wealth, Weather, Women, Men, and Nature (illust. A. Kovarsky). Norton, New York.
[16] HUFF, D. (1968). How to Work With Concrete and Masonry (illust. C. and G. Kinsey). Popular Science Publishing, New York.
[17] HUFF, D. (1972). How to Figure the Odds on Everything (illust. J. Huehnergarth). Dreyfus, New York.
[18] HUFF, D. (1972). How to Save on the Home You Want (with F. Huff and the editors of Dreyfus Publications; illust. R. Doty). Dreyfus, New York.
[19] HUFF, D. (1996). The Complete How to Figure It. Using Math in Everyday Life (illust. C. Kinsey; design K. M. Huff). Norton, New York.
[20] HUFF, D. and COREY, P. (1957). Home Workshop Furniture Projects. Fawcett, New York.
[21] HUFF, D. and HUFF, F. (1963). How to Lower Your Food Bills. Your Guide to the Battle of the Supermarket. Macfadden–Bartell, New York.
[22] HUFF, D. and HUFF, F. (1970). Complete Book of Home Improvement (illust. G. and C. Kinsey and Bray–Schaible Design, Inc.). Popular Science Publishing, New York.
[23] HUFF, F. M. (1973). Family Vacations. More Fun For Less Money (with D. Huff and the editors of Dreyfus Publications; illust. J. Huehnergarth). Dreyfus, New York.
[24] JONES, G. E. (2000). How to Lie with Charts. iUniverse, Lincoln, NE.
[25] LEVITT, S. D. and DUBNER, S. J. (2005). Freakonomics. A Rogue Economist Explores the Hidden Side of Everything. Morrow, New York.
[26] MONMONIER, M. (1996). How to Lie with Maps, 2nd ed. Univ. Chicago Press, Chicago.
[27] NELSON, E. (2002). The Complete Idiot's Guide to the Roman Empire. Alpha Books, Indianapolis.
[28] PAULOS, J. A. (1988). Innumeracy. Mathematical Illiteracy and Its Consequences. Hill and Wang, New York.
[29] PAULOS, J. A. (1996). A Mathematician Reads the Newspaper. Anchor Books, New York.
[30] POORE, C. (1954). Books of The Times. The New York Times, January 16, p. 13.
[31] SCHARNBERG, M. (1984). The Myth of Paradigm-Shift, or How to Lie with Methodology. Almqvist and Wiksell, Uppsala.
[32] TUFTE, E. (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT.
Statistical Science
2005, Vol. 20, No. 3, 210–214
DOI 10.1214/088342305000000232
© Institute of Mathematical Statistics, 2005
Abstract. Darrell Huff’s How to Lie with Statistics remains the best-known,
nontechnical call for critical thinking about statistics. However, drawing a
distinction between statistics and lying ignores the process by which statistics
are socially constructed. For instance, bad statistics often are disseminated by
sincere, albeit innumerate advocates (e.g., inflated estimates for the number
of anorexia deaths) or through research findings selectively highlighted to at-
tract media coverage (e.g., a recent study on the extent of bullying). Further,
the spread of computers has made the production and dissemination of dubi-
ous statistics easier. While critics may agree on the desirability of increasing
statistical literacy, it is unclear who might accept this responsibility.
Key words and phrases: Darrell Huff, social construction, statistical liter-
acy.
Joel Best is Professor and Chair, Department of Sociology & Criminal Justice, University of Delaware, Newark, Delaware 19716-2580, USA (e-mail: joelbest@udel.edu).
In the spring of 1965, I was a freshman taking Sociology 25, the introductory course in social statistics at the University of Minnesota. One day the TA in charge of our lab mentioned that this stuff could actually be interesting. There was, he said, a pretty good book called How to Lie with Statistics. I perked up; any book with that title promised to be fun. As a high-school debater, I'd had a favorite opening for rebuttals: "Disraeli¹ said, 'There are lies, damned lies, and statistics.' While I certainly don't want to accuse our opponents of lying, they have presented a lot of statistics. . . ." I checked Darrell Huff's little book out of the library and I'd have to say it made as big an impression on me as anything else I read during my freshman year.
¹ This aphorism also gets attributed to Mark Twain. So far as I know, no one has been able to locate it in Disraeli's writings, but it does appear in Twain's autobiography, where Twain ascribes it to Disraeli. Given that Twain was not unwilling to take credit for a funny line, I had come to assume that he at least believed that it originated with Disraeli. However, Peter M. Lee of the University of York's Department of Mathematics has traced the aphorism to Courtney's (1895) reference to ". . . the words of the Wise Statesman, 'Lies–damned lies–and statistics' . . . " (for a full discussion, see Lee's Web page: www.york.ac.uk/depts/maths/histstat/lies.htm).
I recommended the book to friends and, once I began teaching sociology myself, to countless students. I don't think I read it again until the early 1990s. By that time, I'd encountered other, more sophisticated books on related topics, such as John Allen Paulos' Innumeracy (1988), Edward Tufte's The Visual Display of Quantitative Information (1983) and Mark Monmonier's How to Lie with Maps (1996). How to Lie with Statistics remained a wonderful primer but, as a sociologist, I now realized that there was much more to say.
In particular, I'd become interested in the role statistics play in drawing attention to social problems. During the early 1980s, the campaign to call attention to the problem of missing children used a simple, familiar recipe to mobilize public concern: (1) present terrifying examples (e.g., the most notorious case involved a six-year-old boy who wandered away from his mother in the local mall and disappeared until, weeks later, the authorities recovered part of his body); (2) explain that this example is but one instance of a larger problem and name that problem (e.g., that boy was a missing child); and (3) give shocking statistics about the problem's extent (e.g., each year, activists claimed, there are nearly two million cases of missing children, including 50,000 abducted by strangers). It was years
before reporters began to challenge these widely circulated numbers, in spite of their obvious implausibility. (At that time, there were roughly 60 million Americans under age 18. Was it really possible that one in thirty—think of a child from every schoolroom in the nation—went missing each year?)
Once I'd noticed the three-part (atrocity tale/problem name/inflated statistic) recipe for problem building, I began to appreciate just how often it was used. To be sure, the bad guys—that is, those with whom I disagreed—regularly adopted this combination of claims to arouse public opinion. But then, so did advocates for positions I personally supported. And, while undoubtedly some claims featuring bad statistics were disingenuous—Huffian lies, as it were—others seemed to be sincere—albeit innumerate—claims. People trying to draw attention to some social problem tend to be convinced that they've identified a big, serious problem. When they come upon a big numeric estimate for the problem's size, they figure it must be about right, so they decide to repeat it. Since everybody in this process—the advocates making the claims, the reporters covering the story, and the audience for this media coverage—is likely to be more-or-less innumerate, it is easy for bad numbers—especially bad big numbers—to spread. And, of course, in today's world the Internet guarantees a figure's continued circulation. Ridiculous statistics live on, long after they've been thoroughly debunked; they are harder to kill than vampires.
THE TROUBLE WITH LYING
In at least one way, Huff's book may have made things worse. His title, while clever and—at least to former debaters—appealing, suggests that the problem is lying. Presumably lying with statistics involves knowingly spreading false numbers, or at least deceptive figures. Others have followed Huff's lead. A surprisingly large number of book titles draw a distinction between statistics and lies. In addition to How to Lie with Statistics [also, How to Lie with Charts (Jones, 1995), How to Lie with Maps (Monmonier, 1996), etc.], we have How to Tell the Liars from the Statisticians (Hooke, 1983), The Honest Truth about Lying with Statistics (Homes, 1990), How Numbers Lie (Runyon, 1981), Thicker than Blood: How Racial Statistics Lie (Zuberi, 2001), and (ahem) my own Damned Lies and Statistics (Best, 2001) and More Damned Lies and Statistics (Best, 2004). Other books have chapters on the theme: "Statistics and Damned Lies" (Dewdney, 1993), "Lying with Statistics" (Gelman and Nolan, 2002), and so on. Folk wisdom draws on the same theme: "Figures may not lie, but liars figure"; "You can prove anything with statistics." You get the idea: there are good statistics, and then there are bad lies. Let's call this the statistic-or-lie distinction.
Of course, this is an appealing interpretation, particularly when the numbers bear on some controversy. I have statistical evidence. My opponent (the weasel) has lies. It has been my experience that almost everyone enjoys criticizing the other guy's bad statistics. I have appeared on conservative radio talk shows where the hosts focused on dubious figures promoted by liberals, and I have been on shows with liberal hosts (they do exist!) who pointed to the bad numbers favored by conservatives. Our critical faculties come into play when we confront a statistic that challenges what we believe; we become analytic tigers pouncing on problems of improper sampling, dubious measurements, and so on. On the other hand, we tend to be more forgiving when we encounter numbers that seem to support what we'd like to think. Oh, maybe our figures aren't perfect, but they're certainly suggestive, so let's avoid quibbling over minor matters. . . .
It is my impression that the statistic-or-lie distinction is often implicitly endorsed in statistics instruction. Statistics courses naturally gravitate toward matters of calculation; after mastering each statistic, the class moves on to the next, more complicated one. If "lies" are mentioned, it tends to be in terms of "bias." That is, students are warned that there are biased people who may deliberately choose to calculate statistics that will lend support to the position they favor. This reduces lying to a variant of the larger problem of bias—simply another flaw to be avoided in producing sound calculations.
As a sociologist, I am not sure that the statistic-or-lie distinction is all that helpful. It makes an implicit claim that, if statistics are not lies, they must be true—that is, really true in some objective sense. The image is that statistics are real, much as rocks are real, and that people can gather statistics in the way that rock collectors pick up stones. After all, we think, a statistic is a number, and numbers seem solid, factual, proof that somebody must have actually counted something. But that's the point: people count. For every number we encounter, somebody had to do the counting. Instead of imagining that statistics are like rocks, we'd do better to think of them as being like jewels. Gemstones
may be found in nature, but people have to create jewels. Jewels must be selected, cut, polished and placed in settings so that they can be viewed from particular angles. In much the same way, people create statistics: They choose what to count, how to go about counting, which of the resulting numbers they will share with others, and which words they will use to describe and interpret those figures. Numbers do not exist independently of people; understanding numbers requires knowing who counted what, why they bothered counting and how they went about it.
SOCIAL CONSTRUCTION AND STATISTICS
This is what sociologists mean when they speak of social construction. I know this term has gotten a bad rap. After being introduced by the sociologists Peter Berger and Thomas Luckmann in their 1966 book, The Social Construction of Reality, the notion of social construction was hijacked and put to all sorts of uses—some of them rather silly—by an assortment of literary critics and postmodernist thinkers. Ignore all that. Berger and Luckmann's key observation is very simple: Without doubting that the real world (rocks and such) exists, it remains true that we understand that world through language, and we learn words and their meanings from other people, so our every thought is shaped by our culture's system for categorizing the world. This means that everything we know is necessarily a social construction. Calling something a social construction doesn't mean that it is false or arbitrary or wrong. When I think, "This rock is hard," my notions of rockness and hardness derive from my culture, they are social constructions. But this does not imply that the thought is false or illusionary, that other members of my culture won't agree that it's a hard rock, or that if I whack my head with the rock, it won't hurt. Much of what we know—of our social constructions—provides essential help in getting through the world.
In my view, it helps to think about statistics in terms of construction, as well as calculation. Understand: I am not suggesting we replace the statistic-or-lie distinction with a calculation-or-construction distinction. Rather, my point is that every number is inevitably both calculated and constructed, because counting is one of the ways we construct the world. Anyone who has done any sort of research is painfully aware that this is true. All research involves choosing what to study and how to study it. This is why scientists include methods sections in their papers. When we say that science is a social construction, this does not mean that science is fanciful or arbitrary; instead, it means that scientific knowledge is the result of people's work.
So, what do we gain when we think about statistics as socially constructed? For one thing, we can get past the statistic-or-lie distinction. Talking about lies leads us to concentrate on whether people knowingly, deliberately say things they know aren't true. Thus: "Those tobacco executives knew full well that smoking was harmful; we can prove this because we have uncovered internal memoranda that make it clear they knew this; therefore they were lying when they said smoking was harmless." Well, yes. But few bad statistics involve this sort of egregious bad faith. Misplaced enthusiasm is probably responsible for more dubious figures than conscious lying.
Consider the case of anorexia deaths. Someone active in the struggle against eating disorders estimated that perhaps 150,000 Americans suffer from anorexia nervosa, and noted that this disease can be fatal (Sommers, 1994). Someone else—probably inadvertently—garbled this claim and announced that anorexia kills 150,000 each year. This dramatic number was repeated in best-selling books, in news stories and—here I speak from experience—countless term papers. It was a patently ridiculous number: most anorexics are young women; the annual death toll from all causes for women aged 15–44 was about 55,000; so what were the odds that 150,000 of those 55,000 were anorexic? Yet, were the various advocates, authors and journalists who repeated this very dubious number lying? I presume most of them thought it was true. After all, they believed that anorexia is a big problem, and 150,000 is a big number; moreover, other people said that was the right number, so why not repeat it? Does it help to call the sincere, albeit credulous, dissemination of a bad number a lie?
Or what about a recent, widely publicized report that 30% of students in sixth through tenth grades have moderate or frequent involvement in bullying? This was the highlighted finding from an article in the Journal of the American Medical Association (Nansel et al., 2001), mentioned in the article's abstract, in JAMA's news release about the article, and in the extensive media coverage that resulted (Best, 2004). This article survived the peer review process in one of medicine's premier journals; the study, conducted by researchers in the federal government, surveyed more than 10,000 students. But of course the researchers had to make choices when analyzing their data. Respondents were asked whether they had been bullied or had themselves bullied others and, if so, how often. Bullying that was
reported occurring "sometimes" was designated "moderate," while bullying at least once a week was labeled "frequent." This produced a pie of data that could be sliced in various ways. The researchers carved the data to show that 30% of the students reported moderate or frequent involvement in bullying. But consider other possible slices: "involvement" meant either as a bully or a bullying victim; only 17% reported being victims of moderate or frequent bullying; and only 8% reported being victims of frequent bullying. All of this information is included in the text of the article.

In other words, the claim that the study found 30% of students with moderate or frequent involvement in bullying was no lie. But it would have been equally true to state that 8% were frequent victims of bullying. The former statement was featured in the abstract and the press release; the latter was buried in the article. We can imagine that everyone involved in disseminating the results of this study—the newspaper editors trying to decide whether to run a story about this research, the wire-service reporter trying to write a story that would seem newsworthy, JAMA's editors preparing news releases about that week's articles, the authors hoping that their paper would be accepted by a top-tier journal and that their research would attract attention, even the funders who wanted to feel that their money had been well spent—found a statistic that implicated 30% of students in bullying more compelling than one that suggested 8% were frequent targets of bullies. If there is publication bias against studies with negative findings, so, too, is there a publicity bias favoring studies with dramatic results. But drawing a distinction between statistics and lies ignores this pattern in disseminating research results.
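A few lines of code make the slicing concrete. The counts below are hypothetical, chosen only to reproduce the percentages quoted above (the actual joint distribution is in Nansel et al., 2001); the point is that all three headlines come from the same table:

    n = 10_000                 # respondents (the study surveyed over 10,000)
    victims_frequent = 800     # bullied at least once a week
    victims_moderate = 900     # bullied "sometimes"
    bullies_only = 1_300       # bullied others, but were not victims themselves

    victims = victims_frequent + victims_moderate      # 1,700
    involved = victims + bullies_only                  # 3,000 (bully or victim)

    print(f"{involved / n:.0%} involved in bullying")          # 30%
    print(f"{victims / n:.0%} victims, moderate or frequent")  # 17%
    print(f"{victims_frequent / n:.0%} frequent victims")      # 8%

Each statement is equally true; which one becomes the headline is a choice about construction, not a failure of calculation.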
TOWARD STATISTICAL LITERACY?

While many of Huff's critiques remain perfectly applicable to contemporary statistics, there have been important developments during the intervening 50 years. In particular, personal computers have transformed the production and dissemination of statistics. The PC's effects—including inexpensive software for generating sophisticated statistical analyses, bundled spreadsheet programs that allow users to create an extraordinary array of graphs and charts and professional designers able to create eye-catching graphics—have democratized the means of statistical production. Philosophers speak of the Law of the Instrument (originally stated, in an era less concerned with sexism, as: "If you give a small boy a hammer, he'll find things to pound."). Tens of millions of people have been given statistical and spreadsheet software. We can hardly be surprised that we find ourselves surrounded by statistical presentations.

Interpreting these numbers, however, requires two distinct sets of statistical skills. The first set concerns matters of calculation—the sort of lessons taught in statistics classes. But in order to assess, to criticize those numbers, we also need to appreciate issues of construction. That is, we need to worry about how statistics were brought into being. Who did the counting? What did they decide to count, and why? How did they go about it?

There is a great deal of discussion these days about the desirability of increasing numeracy, quantitative literacy and particularly statistical literacy. Who can disagree? Certainly, part of the problem is that many people aren't particularly adept at calculation. But, I would argue, genuine statistical literacy requires that people also become more alert to what I've called matters of construction.

Anyone who reads the newspaper can find examples of stat wars, debates over social issues in which opponents lob competing numbers at each other. Statistical literacy ought to help people assess such competing claims, but that requires more than teaching them how to calculate and warning them to watch out for liars. It would help to also understand something about the place of statistics in contemporary policy rhetoric, about the processes by which numbers get produced and circulated and so on. But who's going to teach these lessons?

Here I think we might pause to consider the lessons from the "critical thinking" movement that became fashionable in academia in the late 1980s and early 1990s. It is no wonder that the cause of critical thinking gained widespread support. After all, virtually all academics consider themselves critical thinkers, and they would agree that their students need to become better critical thinkers. Yet, if you track the numbers of articles about critical thinking in the education literature, you discover a steep rise in the late 1980s, but then a peak, followed by a decline. This is a pattern familiar to sociologists: these dynamics characterize—in fact define—the fad. The celebration of critical thinking turned out to be just one more academic fad, a short-lived enthusiasm.

Why, if everyone agreed that critical thinking was very important, did interest in the topic fade? I think the answer is that no one assumed ownership of the critical thinking problem. Sociologists interested in how
particular social issues gain and then lose public attention argue that an issue's survival depends on someone assuming ownership of the problem, so that there are continuing, active efforts to keep it in the public eye. In the case of critical thinking, no discipline stepped up and took responsibility for teaching critical thinking. Rather, teaching critical thinking was seen as everybody's responsibility, and that meant, in effect, that nobody was especially responsible for it. Without an intellectual owner to promote it, critical thinking slipped quietly from view, to be replaced by the next new thing.

So—even if we agree that statistical literacy is important, and that we need to teach these skills, we still need to figure out who is going to do that teaching. I speak as an outsider, but I doubt that it will be statisticians. The statistics curriculum is based on mastering ever more complex matters of calculation. It may be desirable for students to learn, say, the principles for making good pie charts, but few Ph.D.s in statistics will be eager to teach those lessons. Statisticians are likely to consider teaching courses in statistical literacy beneath their talents, just as professors of English literature tend to avoid teaching freshman composition. Even though I am a sociologist who believes that the idea of social construction has much to contribute to the cause of statistical literacy, I also doubt that sociologists will claim ownership of statistical literacy. After all, statistical literacy is only tangentially related to sociologists' core concerns. Similar reactions can be expected from psychologists, political scientists, and people in other disciplines.

In other words, its advocates are likely to wind up agreeing that statistical literacy is important, so important that it needs to be taught throughout the curriculum. Once we reach that agreement, we will be well along the faddish trajectory taken by critical thinking. We all know statistical literacy is an important problem, but we're not going to be able to agree on its place in the curriculum.

Which means that How to Lie with Statistics is going to continue to be needed in the years ahead.

REFERENCES

BERGER, P. L. and LUCKMANN, T. (1966). The Social Construction of Reality. Doubleday, Garden City, NY.
BEST, J. (2001). Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and Activists. Univ. California Press, Berkeley.
BEST, J. (2004). More Damned Lies and Statistics: How Numbers Confuse Public Issues. Univ. California Press, Berkeley.
COURTNEY, L. H. (1895). To my fellow-disciples at Saratoga Springs. The National Review (London) 26 21–26.
DEWDNEY, A. K. (1993). 200% of Nothing. Wiley, New York.
GELMAN, A. and NOLAN, D. (2002). Teaching Statistics: A Bag of Tricks. Oxford Univ. Press.
HOLMES, C. B. (1990). The Honest Truth about Lying with Statistics. Charles C. Thomas, Springfield, IL.
HOOKE, R. (1983). How to Tell the Liars from the Statisticians. Dekker, New York.
HUFF, D. (1954). How to Lie with Statistics. Norton, New York.
JONES, G. E. (1995). How to Lie with Charts. Sybex, San Francisco.
MONMONIER, M. (1996). How to Lie with Maps, 2nd ed. Univ. Chicago Press, Chicago.
NANSEL, T. R. et al. (2001). Bullying behaviors among U.S. youth: Prevalence and association with psychosocial adjustment. J. American Medical Assoc. 285 2094–2100.
PAULOS, J. A. (1988). Innumeracy: Mathematical Illiteracy and Its Consequences. Hill and Wang, New York.
RUNYON, R. P. (1981). How Numbers Lie. Lewis, Lexington, MA.
SOMMERS, C. H. (1994). Who Stole Feminism? Simon and Schuster, New York.
TUFTE, E. (1983). The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT.
ZUBERI, T. (2001). Thicker than Blood: How Racial Statistics Lie. Univ. Minnesota Press, Minneapolis.
Statistical Science 2005, Vol. 20, No. 3, 215–222. DOI 10.1214/088342305000000241. © Institute of Mathematical Statistics, 2005

LYING WITH MAPS

M. Monmonier
Abstract. Darrell Huff’s How to Lie with Statistics was the inspiration for
How to Lie with Maps, in which the author showed that geometric distortion
and graphic generalization of data are unavoidable elements of cartographic
representation. New examples of how ill-conceived or deliberately contrived
statistical maps can greatly distort geographic reality demonstrate that lying
with maps is a special case of lying with statistics. Issues addressed include
the effects of map scale on geometry and feature selection, the importance
of using a symbolization metaphor appropriate to the data and the power
of data classification to either reveal meaningful spatial trends or promote
misleading interpretations.
Key words and phrases: Classification, deception, generalization, maps,
statistical graphics.
… area is large and the phenomenon at least moderately complex. Map users understand this and trust the mapmaker to select relevant facts and highlight what's important, even if the map must grossly distort the earth's geometry as well as lump together dissimilar features. When combined with the public's naive acceptance of maps as objective representations, cartographic generalization becomes an open invitation to both deliberate and unintentional prevarication.

At the risk of stretching the notion of lying, I'm convinced that inadvertent fabrication is far more common these days than intentional deceit. Moreover, because most maps now are customized, one-of-a-kind graphics that never make it into print or onto the Internet, prevaricating mapmakers often lie more to themselves than to an audience. Blame technology—a conspiracy between user-friendly mapping software (or not-so-user-friendly geographic information systems) and high-resolution laser printers that can render crisp type and convincing symbols with little effort or thought. There's a warning here I'm sure Darrell Huff would applaud: watch out for the well-intended mapmaker who doesn't understand cartographic principles yet blindly trusts the equally naive software developer determined to give the buyer an immediate success experience—default settings are some of the worst offenders. Because lying with maps is so easy in our information-rich world, infrequent mapmakers need to understand the pitfalls of map generalization and map readers need to become informed skeptics.

As this essay suggests, maps can lie in diverse ways. Among the topics discussed here are the effects of map scale on geometry and feature selection, the importance of using a symbolization metaphor appropriate to the data and the power of data classification to reveal meaningful spatial trends or promote misleading interpretations.

2. SELECTIVE TRUTH

An understanding of how maps distort reality requires an appreciation of scale, defined simply as the ratio of map distance to ground distance. For example, a map at 1:24,000, the scale of the U.S. Geological Survey's most detailed topographic maps, uses a one-inch line to represent a road or stream 24,000 inches (or 2,000 feet) long. Ratio scales are often reported as fractions, which account for distinctions between "large-scale" and "small-scale." Thus a quadrangle map showing a small portion of a county at 1/24,000 is very much a large-scale map when compared, for instance, to an atlas map showing the whole world at 1/75,000,000—a markedly smaller fraction. (Planners and engineers sometimes confuse scale and geographic scope, the size of the area represented. It might seem counterintuitive that small-scale maps can cover vast regions while large-scale maps are much more narrowly focused, but when the issue is scale, not scope, "large" means comparatively detailed whereas "small" means highly generalized.)

Mapmakers can report a map's scale as a ratio or fraction, state it verbally using specific distance units—"one inch represents two miles" is more user friendly than 1:126,720—or provide a scale bar illustrating one or more representative distances. Bar scales, also called graphic scales, are ideal for large-scale maps because they promote direct estimates of distance, without requiring the user to locate or envision a ruler. What's more, a graphic scale remains true when you use a photocopier to compress a larger map onto letter-size paper. Not so with ratio or verbal scales.

However helpful they might be on large-scale maps, bar scales should never appear on maps of the world, a continent, or a large country, all of which are drastically distorted in some fashion when coastlines and other features are transferred from a spherical earth to a flat map. Because of the stretching and compression involved in flattening the globe, the distance represented by a one-inch line can vary enormously across a world map, and scale can fluctuate significantly along, say, a six-inch line. Because map scale varies not only from point to point but also with direction, a bar scale on a small-scale map invites grossly inaccurate estimates. Fortunately for hikers and city planners, earth curvature is not problematic for the small areas shown on large-scale maps; use an appropriate map projection, and scale distortion is negligible.

What's not negligible on most large-scale maps is the generalization required when map symbols with a finite width represent political boundaries, streams, streets and railroads. Legibility requires line symbols not much thinner than 0.02 inch. At 1:24,000, for instance, a 1/50-inch line represents a corridor 40 feet wide, appreciably broader than the average residential street, rural road or single-track railway but usually not troublesome if the mapmaker foregoes a detailed treatment of driveways, property lines, rivulets and rail yards. At 1:100,000 and 1:250,000, which cartographers typically consider "intermediate" scales, symbolic corridors 166.7 and 416.7 feet wide, respectively, make graphic congestion ever more likely unless the mapmaker weeds out less significant features, simplifies complex curves and displaces otherwise overlapping symbols.
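The arithmetic behind these corridor widths is worth making explicit. Here is a minimal sketch (the function name is mine, not the essay's):

    def ground_width_feet(line_width_inches: float, scale_denominator: int) -> float:
        # A map line of the given width covers width * denominator inches
        # on the ground; divide by 12 to convert inches to feet.
        return line_width_inches * scale_denominator / 12.0

    for denominator in (24_000, 100_000, 250_000):
        width = ground_width_feet(0.02, denominator)   # 0.02 in = 1/50 inch
        print(f"1:{denominator:,} -> {width:,.1f}-foot corridor")
    # 1:24,000 -> 40.0-foot corridor
    # 1:100,000 -> 166.7-foot corridor
    # 1:250,000 -> 416.7-foot corridor

The same conversion verifies the verbal scale quoted earlier: one inch at 1:126,720 spans 126,720 inches on the ground, which is exactly two miles at 63,360 inches per mile.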
FIG. 5. Crude birth rates, 2000, by state, categorized to suggest dangerously low rates overall.

FIG. 6. Crude birth rates, 2000, by state, categorized to suggest dangerously high rates overall.

FIG. 7. The darker-is-more-intense metaphor of choropleth maps offers a potentially misleading view of numbers of births.
… pale shadow of population—states with more people, not surprisingly, register more births, whereas those with the smallest populations are in the lowest category. If you want to explore geographic differences in fertility, it's far more sensible to look at birth rates as well as the total fertility index and other more sensitive fertility measures used in demography (Srinivasan, 1998). A map focusing on number of births, rather than a rate, has little meaning outside an education or marketing campaign pitched at obstetricians, new mothers or toy manufacturers.

Whenever a map of count data makes sense, perhaps to place a map of rates in perspective, graphic theory condemns using a choropleth map because its ink (or toner) metaphor is misleading. Graytone area symbols, whereby darker suggests "denser" or "more intense" while lighter implies "more dispersed" or "less intense," are wholly inappropriate for count data, which are much better served by symbols that vary in size to portray differences in magnitude (Bertin, 1983). In other words, while rate data mesh nicely with the choropleth map's darker-means-more rule, count data require bigger-means-more coding.

Although college courses on map design emphasize this fundamental distinction between intensity data and count (magnitude) data, developers of geographic information systems and other mapping software show little interest in preventing misuse of their products. No warning pops up when a user asks for a choropleth map of count data, training manuals invoke choropleth maps of count data to illustrate commands and settings, and alternative symbols like squares or circles that vary with magnitude are either absent or awkwardly implemented. One developer—I won't name names—not only requires users to digitize center points of states but also scales the graduated symbols by height rather than area, a fallacious strategy famously ridiculed by Huff's pair of caricatured blast furnaces, scaled by height to compare steel capacity added during the 1930s and 1940s (Huff, 1954, page 71). Map viewers see these differences in height, but differences in area are more prominent if not overwhelming.
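Huff's blast-furnace fallacy is easy to quantify. When a two-dimensional symbol is scaled by height (with its width following to preserve the shape), its area grows with the square of the encoded ratio; honest graduated symbols grow in area instead. A minimal sketch, with names of my own choosing:

    import math

    def apparent_ratio_when_height_scaled(value_ratio: float) -> float:
        # Scaling both dimensions by the value ratio multiplies the
        # symbol's area, which is what the eye compares, by its square.
        return value_ratio ** 2

    def side_for_honest_symbol(value_ratio: float, base_side: float = 1.0) -> float:
        # Correct magnitude symbols scale AREA linearly with the value,
        # so each side grows only with the square root of the ratio.
        return base_side * math.sqrt(value_ratio)

    print(apparent_ratio_when_height_scaled(2.0))   # 4.0: a doubled value looks quadrupled
    print(side_for_honest_symbol(2.0))              # 1.414...: side of an honest "2x" symbol

A capacity that merely doubled therefore appears to have quadrupled, precisely the exaggeration Huff ridiculed.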
Several remedies are indicated: improved software manuals, more savvy users, metadata (data about data) that can alert the software to incompatible symbols …
FIG. 9. The two lower maps are different representations of the same data. An optimization algorithm found cut-points intended to yield displays that look very similar (lower left) and very dissimilar (lower right) to the map at the top. Cut-points for the upper map include 0.0, which separates gains from losses, and 13.3, the national rate.
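The caption's point, that class breaks alone can make identical data tell different stories, can be demonstrated in a few lines. The rates and cut-points below are invented for illustration and are not the values mapped in Figure 9:

    import bisect

    def classify(rates, cut_points):
        # Assign each rate to class 0, 1, 2, ... according to the breaks.
        return [bisect.bisect_right(cut_points, r) for r in rates]

    rates = [-2.1, 0.4, 3.7, 8.2, 12.9, 13.6, 19.5, 24.0]   # hypothetical rates

    meaningful = [0.0, 13.3]   # anchored at loss/gain and the national rate
    alarmist = [-1.0, 2.0]     # chosen to push most areas into the top class

    print(classify(rates, meaningful))   # [0, 1, 1, 1, 1, 2, 2, 2]
    print(classify(rates, alarmist))     # [0, 1, 2, 2, 2, 2, 2, 2]

Same data, different cut-points, very different-looking maps.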
REFERENCES

JENKS, G. F. and CASPALL, F. C. (1971). Error on choroplethic maps: Definition, measurement, reduction. Annals of the Association of American Geographers 61 217–244.
KRUSE, S. (1992). Review of How to Lie with Maps. Whole Earth Review 74 109.
MACEACHREN, A. M., BREWER, C. A. and PICKLE, L. W. (1998). Visualizing georeferenced data: Representing reliability of health statistics. Environment and Planning A 30 1547–1561.
MONMONIER, M. (1976). Modifying objective functions and constraints for maximizing visual correspondence of choroplethic maps. Canadian Cartographer 13 21–34.
MONMONIER, M. (1977). Maps, Distortion, and Meaning. Commission on College Geography Resource Paper 75-4. Association of American Geographers, Washington.
MONMONIER, M. (1991). How to Lie with Maps. Univ. Chicago Press, Chicago.
MONMONIER, M. (1996). How to Lie with Maps, 2nd ed. Univ. Chicago Press, Chicago.
SRINIVASAN, K. (1998). Basic Demographic Techniques and Applications. Sage, Thousand Oaks, CA.
SWAN, J. (1992). Review of How to Lie with Maps. J. Information Ethics 1 86–89.
Statistical Science 2005, Vol. 20, No. 3, 223–230. DOI 10.1214/088342305000000296. © Institute of Mathematical Statistics, 2005

HOW TO CONFUSE WITH STATISTICS OR: THE USE AND MISUSE OF CONDITIONAL PROBABILITIES

W. Krämer and G. Gigerenzer
… which will display red). The example therefore boils down to an incorrect enumeration of simple events in a Laplace experiment in the subpopulation composed of the remaining possibilities. As such, it has famous antecedents: the erroneous assignment by d'Alembert (1779, entry "Croix ou pile") of a probability of 1/3 for heads–heads when twice throwing a coin, or the equally erroneous assertion by Leibniz (in a letter to L. Bourguet from March 2, 1714, reprinted in Leibniz, 1887, pages 569–570) that, when throwing two dice, a sum of 11 is as likely as a sum of 12. A sum of 11, so he argued, can be obtained by adding 5 and 6, and a sum of 12 by adding 6 and 6. It did not occur to him that there are two equally probable ways of adding 5 and 6, but only one way to obtain 6 and 6.
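Leibniz's slip can be checked by brute force, since the 36 ordered outcomes of two dice are equally likely:

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))        # 36 ordered pairs
    elevens = sum(1 for a, b in outcomes if a + b == 11)   # (5,6) and (6,5)
    twelves = sum(1 for a, b in outcomes if a + b == 12)   # (6,6) only
    print(elevens, twelves)   # 2 1, so P(11) = 2/36 and P(12) = 1/36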
Given illustrious precedents such as these, it comes as no surprise that wrongly inferred conditional and unconditional probabilities are lurking everywhere. Prominent textbook examples are the paradox of the second ace or the problem of the second boy (see, e.g., Bar-Hillel and Falk, 1982), not to mention the famous car-and-goat puzzle, also called the Monty Hall problem, which has engendered an enormous literature of its own. These puzzles are mainly of interest as mathematical curiosities and they are rarely used for statistical manipulation. We shall not dwell on them in detail here, but they serve to point out what many consumers of statistical information are ill-prepared to master.

3. CONFUSING CONDITIONAL AND CONDITIONING EVENTS

German medical doctors with an average of 14 years of professional experience were asked to imagine using a certain test to screen for colorectal cancer. The prevalence of this type of cancer was 0.3%, the sensitivity of the test (the conditional probability of detecting cancer when there is one) was 50% and the false positive rate was 3% (Gigerenzer, 2002; Gigerenzer and Edwards, 2003). The doctors were asked: "What is the probability that someone who tests positive actually has colorectal cancer?" The correct answer is about 5%. However, the doctors' answers ranged from 1% to 99%, with about half of them estimating this probability as 50% (the sensitivity) or 47% (the sensitivity minus the false positive rate).
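The 5% figure follows from Bayes' rule; the computation below is mine, using exactly the numbers quoted in the text, and it also shows the equivalent natural-frequency count:

    prevalence = 0.003    # P(cancer)
    sensitivity = 0.50    # P(positive | cancer)
    false_pos = 0.03      # P(positive | no cancer)

    # Bayes' rule for P(cancer | positive):
    ppv = (prevalence * sensitivity) / (
        prevalence * sensitivity + (1 - prevalence) * false_pos
    )
    print(f"{ppv:.3f}")   # 0.048, i.e., about 5%

    # Natural frequencies: of 10,000 people screened, 30 have cancer
    # and 15 of them test positive; of the 9,970 without cancer,
    # about 299 test positive. So:
    print(15 / (15 + 299))   # 0.0477..., about 5% again

Neither 50% nor 47% is anywhere close.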
The most common fault was to confuse the conditional probability of cancer, given that the test is positive, with the conditional probability that the test is positive, given that the individual has cancer. An analogous error also occurs when people are asked to interpret the result of a statistical test of significance, and sometimes there are disastrous consequences. In the fall of 1973 in the German city of Wuppertal, a local workman was accused of having murdered another local workman's wife. A forensic expert (correctly) computed a probability of only 0.027 that blood found on the defendant's clothes and on the scene of the crime by chance matched the victim's and defendant's blood groups, respectively. From this figure the expert then derived a probability of 97.3% for the defendant's guilt, and later, this probability came close to 100% by adding evidence from textile fibers. Only a perfect alibi saved the workman from an otherwise certain conviction (see the account in Ziegler, 1974).

Episodes such as this have undoubtedly happened in many courtrooms all over the world (Gigerenzer, 2002). On a formal level, a probability of 2.7% for the observed data, given innocence, was confused with a probability of 2.7% for innocence, given the observed data. Even in a Bayesian setting with certain a priori probabilities for guilt and innocence, one finds that a probability of 2.7% for the observed data given innocence does not necessarily translate into a probability of 97.3% that the defendant is guilty. And from the frequentist perspective, which is more common in forensic science, it is nonsense to assign a probability to either the null or to the alternative hypothesis.

Still, students and, remarkably, teachers of statistics often misread the meaning of a statistical test of significance. Haller and Krauss (2002) asked 30 statistics instructors, 44 statistics students and 39 scientific psychologists from six psychology departments in Germany about the meaning of a significant two-sample t-test (significance level = 1%). The test was supposed to detect a possible treatment effect based on a control group and a treatment group. The subjects were asked to comment upon the following six statements (all of which are false). They were told in advance that several or perhaps none of the statements were correct.

(1) You have absolutely disproved the null hypothesis that there is no difference between the population means. ☐ true/false ☐
(2) You have found the probability of the null hypothesis being true. ☐ true/false ☐
(3) You have absolutely proved your experimental hypothesis that there is a difference between the population means. ☐ true/false ☐
(4) You can deduce the probability of the experimental hypothesis being true. ☐ true/false ☐
(5) You know, if you decide to reject the null hypothesis, the probability that you are making the wrong decision. ☐ true/false ☐
(6) You have a reliable experimental finding in the sense that if, hypothetically, the experiment were repeated a great number of times, you would obtain a significant result on 99% of occasions. ☐ true/false ☐

All of the statistics students marked at least one of the above faulty statements as correct. And, quite disconcertingly, 90% of the academic psychologists and 80% of the methodology instructors did as well! In particular, one third of both the instructors and the academic psychologists and 59% of the statistics students marked item 4 as correct; that is, they believe that, given a rejection of the null at level 1%, they can deduce a probability of 99% that the alternative is correct.

Ironically, one finds that this misconception is perpetuated in many textbooks. Examples from the American market include Guilford (1942, and later editions), which was probably the most widely read textbook in the 1940s and 1950s, Miller and Buckhout (1973, statistical appendix by Brown, page 523) or Nunnally (1975, pages 194–196). Additional examples are collected in Gigerenzer (2000, Chapter 13) and Nickerson (2000). On the German market, there is Wyss (1991, page 547) or Schuchard-Fischer et al. (1982), who on page 83 of their best-selling textbook explicitly advise their readers that a rejection of the null at 5% implies a probability of 95% that the alternative is correct.

In one sense, this error can be seen as a probabilistic variant of a classic rule of logic (modus tollens): (1) "All human beings will eventually die" and (2) "Socrates is a human being" implies (3) "Socrates will die." Now, what if (1) is not necessarily true, only highly probable [in the sense that the statement "If A (= human being) then B (= eventual death)" holds not always, only most of the time]? Does this imply that its logical equivalent "If not B then not A" has the same large probability attached to it? This question has led to a lively exchange of letters in Nature (see Beck-Bornholdt and Dubben, 1996, 1997; or Edwards, 1996), which culminated in the scientific proof that the Pope is an alien: (1) A randomly selected human being is most probably not the Pope (the probability of selecting the Pope is 1 : 6 billion = 0.000 000 000 17). (2) John Paul II is the Pope. (3) Therefore, John Paul II is most probably not a human being.

Setting aside the fact that John Paul II has not been randomly selected from among all human beings, one finds that this argument again reflects the confusion that results from interchanging conditioning and conditional events. It is based on taking as equal the conditional probabilities P(not Pope | human) and P(not human | Pope). Since

P(Ā | B) = P(B̄ | A) ⇐⇒ P(A | B) = P(B | A),

this is equivalent to taking as equal, in a universe comprised of humans and aliens, the conditional probabilities P(Pope | human) and P(human | Pope), which is nonsense. Or in terms of rules of logic: if the statement "If human then not Pope" holds most of the time, one cannot infer, but sometimes does, that its logical equivalent "If Pope then not human" likewise holds most of the time.

Strange as it may seem, this form of reasoning has even made its way into the pages of respectable journals. For instance, it was used by Leslie (1992) to prove that doom is near (the "doomsday argument"; see also Schrage, 1993). In this case the argument went: (1) If mankind is going to survive for a long time, then all human beings born so far, including myself, are only a small proportion of all human beings that will ever be born (i.e., the probability that I observe myself is negligible). (2) I observe myself. (3) Therefore, the end is near.

This argument is likewise based on interchanging conditioning and conditional events. While it is perfectly true that the conditional probability that a randomly selected human being (from among all human beings that have ever been and will ever be born) happens to be me, given that doom is near, is much larger than the conditional probability of the same event, given that doom is far away, one cannot infer from this inequality that the conditional probability that doom is near, given my existence, is likewise much larger than the conditional probability that doom is far away, given my existence. More formally: while the inequality in the following expression is correct, the equality signs are not:

P(doom is near | me) = P(me | doom is near) > P(me | doom far away) = P(doom far away | me).

4. CONDITIONAL PROBABILITIES AND FAVORABLE EVENTS

The tendency to confuse conditioning and conditional events can also lead to other incorrect conclusions. The most popular one is to infer from a conditional probability P(A | B) that is seen as "large"
that the conditional event A is "favorable" to the conditioning event B. This term was suggested by Chung (1942) and means that

P(B | A) > P(B).

This confusion occurs in various contexts and is possibly the most frequent logical error that is found in the interpretation of statistical information. Here are some examples from the German press (with the headlines translated into English):

• "Beware of German tourists" (according to Der Spiegel magazine, most foreign skiers involved in accidents in a Swiss skiing resort came from Germany).
• "Boys more at risk on bicycles" (the newspaper Hannoversche Allgemeine Zeitung reported that among children involved in bicycle accidents the majority were boys).
• "Soccer most dangerous sport" (the weekly magazine Stern commenting on a survey of accidents in sports).
• "Private homes as danger spots" (the newspaper Die Welt musing about the fact that a third of all fatal accidents in Germany occur in private homes).
• "German shepherd most dangerous dog around" (the newspaper Ruhr-Nachrichten on a statistic according to which German shepherds account for a record 31% of all reported attacks by dogs).
• "Women more disoriented drivers" (the newspaper Bild commenting on the fact that among cars that were found entering a one-way street in the wrong direction, most were driven by women).

These examples can easily be extended. Most of them result from unintentionally misreading the statistical evidence. When there are cherished stereotypes to conserve, such as the German tourist bullying his fellow vacationers, or women somehow lost in space, perhaps some intentional neglect of logic may have played a role as well. Also, not all of the above statements are necessarily false. It might, for instance, well be true that when 1000 men and 1000 women drivers are given a chance to enter a one-way street the wrong way, more women than men will actually do so, but the survey by Bild simply counted wrongly entering cars and this is certainly no proof of their claim. For example, what if there were no men on the street at that time of the day? And in the case of the Swiss skiing resort, where almost all foreign tourists came from Germany, the attribution of abnormally dangerous behavior to this class of visitors is clearly wrong.

In terms of favorable events, Der Spiegel, on observing that, among foreigners, P(German tourist | skiing accident) was "large," concluded that the reverse conditional probability was also large, in particular, that being a German tourist increases the chances of being involved in a skiing accident:

P(skiing accident | German tourist) > P(skiing accident).

Similarly, Hannoversche Allgemeine Zeitung concluded from P(boy | bicycle accident) = large that

P(bicycle accident | boy) > P(bicycle accident)

and so on. In all these examples, the point of departure always was a large value of P(A | B), which then led to the—possibly unwarranted—conclusion that P(B | A) > P(B). From the symmetry

P(B | A) > P(B) ⇐⇒ P(A | B) > P(A)

it is however clear that one cannot infer anything on A's favorableness for B from P(A | B) alone, and that one needs information on P(A) as well.

The British Home Office nevertheless once did so in its call for more attention to domestic violence (Cowdry, 1990). Among 1221 female murder victims between 1984 and 1988, 44% were killed by their husbands or lovers, 18% by other relatives, and another 18% by friends or acquaintances. Only 14% were killed by strangers. Does this prove that

P(murder | encounter with husband) > P(murder | encounter with a stranger),

that is, that marriage is favorable to murder? Evidently not. While it is perfectly fine to investigate the causes and mechanics of domestic violence, there is no evidence that the private home is a particularly dangerous environment (even though, as The Times mourns, "assaults . . . often happen when families are together").

5. FAVORABLENESS AND SIMPSON'S PARADOX

Another avenue through which the attribute of favorableness can be incorrectly attached to certain events is Simpson's paradox (Blyth, 1973), which in our context asserts that it is possible that B is favorable to A when C holds, B is favorable to A when C does not hold, yet overall, B is unfavorable to A. Formally, one has

P(A | B ∩ C) > P(A | C)

and

P(A | B ∩ C̄) > P(A | C̄),

yet P(A | B) < P(A).
… In fact, a glance at any statistical almanac shows that quite the opposite is true.

Here is a more recent example from the U.S., where likewise P(A | B) is confused with P(A | B ∩ D). This time the confusion is spread by Alan Dershowitz, a renowned Harvard Law professor who advised the O. J. Simpson defense team. The prosecution had argued that Simpson's history of spousal abuse reflected a motive to kill, advancing the premise that "a slap is a prelude to homicide" (see Gigerenzer, 2002, pages 142–145). Dershowitz, however, called this argument "a show of weakness" and said: "We knew that we could prove, if we had to, that an infinitesimal percentage—certainly fewer than 1 of 2,500—of men who slap or beat their domestic partners go on to murder them." Thus, he argued that the probability of the event K that a husband killed his wife if he battered her was small,

P(K | battered) = 1/2,500.

The relevant probability, however, is not this one, as Dershowitz would have us believe. Instead, the relevant probability is that of a man murdering his partner given that he battered her and that she was murdered,

P(K | battered and murdered).

This probability is about 8/9 (Good, 1996). It must of course not be confused with the probability that O. J. Simpson is guilty; a jury must take into account much more evidence than battering. But it shows that battering is a fairly good predictor of guilt for murder, contrary to Dershowitz's assertions.
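Good's 8/9 can be reproduced with a back-of-the-envelope Bayes computation. The 1/20,000 figure below, roughly the annual risk that a woman is murdered by someone other than her partner, is an assumption I insert for illustration; Good (1996) gives the careful version:

    p_killed_by_batterer = 1 / 2_500   # Dershowitz's own figure
    p_killed_by_other = 1 / 20_000     # assumed background rate (illustrative)

    # Given that a battered woman HAS been murdered, the probability
    # that her batterer was the killer:
    p_K = p_killed_by_batterer / (p_killed_by_batterer + p_killed_by_other)
    print(p_K)   # 0.888..., about 8/9

Conditioning on the right event, battered and murdered, turns Dershowitz's reassuring 1/2,500 into an incriminating 8/9.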
6. HOW TO MAKE THE SOURCES OF CONFUSION DISAPPEAR

Fallacies can sometimes be attributed to the unwarranted application of what we have elsewhere called "fast and frugal heuristics" (Gigerenzer, 2004). Heuristics are simple rules that exploit evolved mental capacities, as well as structures of environments. When applied in an environment for which they were designed, heuristics often work well, often outperforming more complicated optimizing models. Nevertheless, when applied in an unsuitable environment, they can easily mislead.

When a heuristic misleads, it is not always the heuristic that is to blame. More often than not, it is the structure of the environment that does not fit (Hoffrage et al., 2000). The examples we have seen here amount to what has elsewhere been called a shift of base or the base-rate fallacy (Borgida and Brekke, 1981). In fact, this environmental change underlies most of the misleading arguments with conditional probabilities.

Consider for instance the question: "What is the probability that a woman with a positive mammography result actually has breast cancer?" There are two ways to represent the relevant statistical information: in terms of conditional probabilities, or in terms of natural frequencies.

Conditional probabilities: The probability that a woman has breast cancer is 0.8%. If she has breast cancer, the probability that a mammogram will show a positive result is 90%. If a woman does not have breast cancer, the probability of a positive result is 7%. Take, for example, a woman who has a positive result. What is the probability that she actually has breast cancer?

Natural frequencies: Our data tell us that eight out of every 1000 women have breast cancer. Of these eight women with breast cancer, seven will have a positive result on mammography. Of the 992 women who do not have breast cancer, some 70 will still have a positive mammogram. Take, for example, a sample of women who have positive mammograms. How many of these women actually have breast cancer?

Apart from rounding, the information is the same in both of these summaries, but with natural frequencies the message comes through much more clearly. We see quickly that only seven of the 77 women who test positive actually have breast cancer, which is one in 11 (9%).

Natural frequencies correspond to the way humans have encountered statistical information during most of their history. They are called "natural" because, unlike conditional probabilities or relative frequencies, on each occurrence the numerical quantities in our summary refer to the same class of observations. For instance, the natural frequencies "seven women" (with a positive mammogram and cancer) and "70 women" (with a positive mammogram and no breast cancer) both refer to the same class of 1000 women. In contrast, the conditional probability 90% (the sensitivity) refers to the class of eight women with breast cancer, but the conditional probability 7% (the false positive rate) refers to a different class of 992 women without breast cancer. This switch of reference class easily confuses the minds of both doctors and patients.

To judge the extent of the confusion, consider Figure 1, which shows the responses of 48 experienced doctors who were given the information cited above, except that in this case the statistics were a base rate
of cancer of 1%, a sensitivity of 80%, and a false positive rate of 10%. Half the doctors received the information in conditional probabilities and half received the data as expressed by natural frequencies. When asked to estimate the probability that a woman with a positive screening mammogram actually has breast cancer, doctors who received conditional probabilities gave answers that ranged from 1% to 90%; very few of them gave the correct answer of about 8%. In contrast, most of the doctors who were given natural frequencies gave the correct answer or were close to it. Simply converting the information into natural frequencies was enough to turn much of the doctors' innumeracy into insight. Presenting information in natural frequencies is therefore a simple and effective mind tool to reduce the confusion resulting from conditional probabilities.

FIG. 1. Doctors' estimates of the probability of breast cancer in women with a positive result on mammography (Gigerenzer, 2002).
the confusion resulting from conditional probabilities. E DWARDS, A. W. F. (1996). Is the Pope an alien? Nature 382 202.
F ELLER, W. (1968). An Introduction to Probability Theory and Its
Applications 1, 3rd ed. Wiley, New York.
ACKNOWLEDGMENTS G IGERENZER, G. (2000). Adaptive Thinking—Rationality in the
Research supported by Deutsche Forschungsge- Real World. Oxford Univ. Press, New York.
G IGERENZER, G. (2002). Calculated Risks: How to Know When
meinschaft. We are grateful to Robert Ineichen for Numbers Deceive You. Simon and Schuster, New York. [British
helping us to track down a reference to Leibniz, and edition (2002). Reckoning with Risk. Penguin, London.]
to George Casella, Gregor Schöner, Michael Steele, G IGERENZER, G. (2004). Fast and frugal heuristics: The tools of
Andy Tremayne and Rona Unrau for helpful criticism bounded rationality. In Handbook of Judgement and Decision
Making (D. Koehler and N. Harvey, eds.) 62–88. Blackwell,
and comments.
Oxford.
G IGERENZER, G. and E DWARDS, A. (2003). Simple tools for un-
derstanding risks: From innumeracy to insight. British Med-
REFERENCES ical J. 327 741–744.
BAR -H ILLEL, M. and FALK, R. (1982). Some teasers concerning G OOD, I. J. (1996). When batterer becomes murderer. Nature 381
conditional probabilities. Cognition 11 109–122. 481.
GUILFORD, J. P. (1942). Fundamental Statistics in Psychology and Education. McGraw–Hill, New York.
HALLER, H. and KRAUSS, S. (2002). Misinterpretations of significance: A problem students share with their teachers? Methods of Psychological Research 7 1–20.
HOFFRAGE, U., LINDSAY, S., HERTWIG, R. and GIGERENZER, G. (2000). Communicating statistical information. Science 290 2261–2262.
KAIGH, W. D. (1989). A category representation paradox. Amer. Statist. 43 92–97.
KRÄMER, W. (2002). Denkste—Trugschlüsse aus der Welt des Zufalls und der Zahlen, 3rd paperback ed. Piper, München.
KRÄMER, W. (2004). So lügt man mit Statistik, 5th paperback ed. Piper, München.
LESLIE, J. (1992). The doomsday argument. Math. Intelligencer 14 48–51.
VON LEIBNIZ, G. W. (1887). Die philosophischen Schriften (C. I. Gerhardt, ed.) 3. Weidmann, Berlin.
MILLER, G. A. and BUCKHOUT, R. (1973). Psychology: The Science of Mental Life, 2nd ed. Harper and Row, New York.
NICKERSON, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods 5 241–301.
NUNNALLY, J. C. (1975). Introduction to Statistics for Psychology and Education. McGraw–Hill, New York.
SCHRAGE, G. (1993). Letter to the editor. Math. Intelligencer 15 3–4.
SCHUCANY, W. R. (1989). Comment on "A category representation paradox," by W. D. Kaigh. Amer. Statist. 43 94–95.
SCHUCHARD-FISCHER, C., BACKHAUS, K., HUMMEL, H., LOHRBERG, W., PLINKE, W. and SCHREINER, W. (1982). Multivariate Analysemethoden—Eine anwendungsorientierte Einführung, 2nd ed. Springer, Berlin.
SWOBODA, H. (1971). Knaurs Buch der modernen Statistik. Droemer Knaur, München.
WAGNER, C. H. (1982). Simpson's paradox in real life. Amer. Statist. 36 46–48.
WYSS, W. (1991). Marktforschung von A–Z. Demoscope, Luzern.
ZIEGLER, H. (1974). Das Alibi des Schornsteinfegers—Unwahrscheinliche Wahrscheinlichkeitsrechnung in einem Mordprozeß. Rheinischer Merkur 39.
Statistical Science 2005, Vol. 20, No. 3, 231–238. DOI 10.1214/088342305000000269. © Institute of Mathematical Statistics, 2005

HOW TO LIE WITH BAD DATA

R. D. De Veaux and D. J. Hand
Abstract. As Huff’s landmark book made clear, lying with statistics can be
accomplished in many ways. Distorting graphics, manipulating data or using
biased samples are just a few of the tried and true methods. Failing to use the
correct statistical procedure or failing to check the conditions for when the
selected method is appropriate can distort results as well, whether the motives
of the analyst are honorable or not. Even when the statistical procedure and
motives are correct, bad data can produce results that have no validity at all.
This article provides some examples of how bad data can arise, what kinds
of bad data exist, how to detect and measure bad data, and how to improve
the quality of data that have already been collected.
Key words and phrases: Data quality, data profiling, data rectification, data
consistency, accuracy, distortion, missing values, record linkage, data ware-
housing, data mining.
… 1981), Kruskal devoted much of his time to "inconsistent or clearly wrong data, especially in large data sets." As just one example, he cited a 1960 census study that showed 62 women, aged 15 to 19, with 12 or more children. Coale and Stephan (1962) pointed out similar anomalies when they found a large number of 14-year-old widows. In a classic study by Wolins (1962), a researcher attempted to obtain raw data from 37 authors of articles appearing in American Psychological Association journals. Of the seven data sets that were actually obtained, three contained gross data errors.

A 1986 study by the U.S. Census Bureau estimated that between 3 and 5% of all census enumerators engaged in some form of fabrication of questionnaire responses without actually visiting the residence. This practice was widespread enough to warrant its own term: curbstoning, which is the "enumerator jargon for sitting on the curbstone filling out the forms with made-up information" (Wainer, 2004). While curbstoning does not imply bad data per se, at the very least, such practices imply that the data set you are analyzing does not describe the underlying mechanism you think you are describing.

What exactly are bad data? The quality of data is relative both to the context and to the question one is trying to answer. If data are wrong, then they are obviously bad, but context can make the distinction more subtle. In a regression analysis, errors in the predictor variables may bias the estimates of the regression coefficients, and this will matter if the aim hinges on interpreting these values, but it will not matter if the aim is predicting response values for new cases drawn from the same distribution. Likewise, whether data are "good" also depends on the aims: precise, accurate measurements are useless if one is measuring the wrong thing. Increasingly in the modern world, especially in data mining, we are confronted with secondary data analysis: the analysis of data that have been collected for some other purpose (e.g., analyzing billing data for transaction patterns). The data may have been perfect for the original aim, but could have serious deficiencies for the new analysis.

For this paper, we will take a rather narrow view of data quality. In particular, we are concerned with data accuracy, so that, for us, "poor quality data are defined as erroneous values assigned to attributes of some entity," as in Pierce (1997). A broader perspective might also take account of relevance, timeliness, existence, coherence, completeness, accessibility, security and other data attributes. For many problems, for example, data gradually become less and less relevant—a phenomenon sometimes termed data decay or population drift (Hand, 2004a). Thus the characteristics collected on mortgage applicants 25 years ago would probably not be of much use for developing a predictive risk model for new applicants, no matter how accurately they were measured at the time. In some environments, the time scale that renders a model useless can become frighteningly short. A model of customer behavior on a web site may quickly become out of date. Sometimes different aspects of this broader interpretation of data quality work in opposition. Timeliness and accuracy provide an obvious example (and, indeed, one which is often seen when economic time series are revised as more accurate information becomes available).

From the perspective of the statistical analyst, there are three phases in data evolution: collection, preliminary analysis and modeling. Of course, the easiest way to deal with bad data is to prevent poor data from being collected in the first place. Much of sample survey methodology and experimental design is devoted to this subject, and many famous stories of analysis gone wrong are based on faulty survey designs or experiments. The Literary Digest poll proclaiming Landon's win over Roosevelt in 1936 that starred in Chapter 1 of Huff (1954) is just one of the more famous examples. At the other end of the process, we have resistant and robust statistical procedures explicitly designed to perform adequately even when a percentage of the data do not conform or are inaccurate, or when the assumptions of the underlying model are violated.

In this article we will concentrate on the "middle" phase of bad data evolution—that is, on its discovery and correction. Of course, no analysis proceeds linearly through the process of initial collection to final report. The discoveries in one phase can impact the entire analysis. Our purpose will be to discuss how to recognize and discover these bad data using a variety of examples, and to discuss their impact on subsequent statistical analysis. In the next section we discuss the causes of bad data. Section 3 discusses the ways in which data can be bad. In Section 4 we turn to the problem of detecting bad data and in Section 5 we provide some guidelines for improving data quality. We summarize and present our conclusions in Section 6.

2. WHAT ARE THE CAUSES OF BAD DATA?

There is an infinite variety to the ways in which data can go bad, and the specifics depend on the underlying
process that generates the data. Data may be distorted from the outset during the initial collection phase, or they may be distorted when the data are transcribed, transferred, merged or copied. Finally, they may deteriorate, change definition or otherwise go through transformations that render them less representative of the original underlying process they were designed to measure.

The breakdown in the collection phase can occur whether the data are collected by instrument or directly recorded by human beings. Examples of breakdowns at the instrument level include instrument drift, initial miscalibration, or a large random or otherwise unpredictable variation in measurement. As an example of instrument-level data collection, consider the measurement of the concentration of a particular chemical compound by gas chromatography, as used in routine drug testing. When reading the results of such a test, it is easy to think that a machine measures the amount of the compound in an automatic and straightforward way, and thus that the resulting data are measuring some quantity directly. It turns out to be a bit more complicated. At the outset, a sample of the material of interest is injected into a stream of carrier gas where it travels down a silica column heated by an oven. The column then separates the mixture of compounds according to their relative attraction to a material called the adsorbent. This stream of different compounds travels "far enough" (via choices of column length and gas flow rates) so that by the time they pass by the detector, they are well separated (at least in theory). At this point, both the arrival time and the concentration of the compound are recorded by an electromechanical device (depending on the type of detector used). The drifts inherent in the oven temperature, gas flow, detector sensitivity and a myriad of other environmental conditions can affect the recorded numbers. To determine actual amounts of material present, a known quantity must be tested at about the same time and the machine must be calibrated. Thus the number reported as a simple percentage of compound present has not only been subjected to many potential sources of error in its raw form, but is actually the output of a calibration model.
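As a concrete, and much simplified, sketch of that last step, here is the kind of linear calibration such instruments rely on; the numbers are invented for illustration:

    import numpy as np

    # Detector responses measured for standards of known concentration.
    known_conc = np.array([0.0, 5.0, 10.0, 20.0])    # e.g., ng/mL
    response = np.array([0.1, 51.2, 98.7, 201.4])    # peak areas (made up)

    # Fit the calibration model: response = a * concentration + b.
    a, b = np.polyfit(known_conc, response, deg=1)

    # Invert it to report a concentration for an unknown sample.
    new_response = 75.3
    estimated_conc = (new_response - b) / a
    print(round(estimated_conc, 2))   # about 7.5 under this fit

Any drift in the standards, the detector or the fit propagates directly into the "measured" concentration.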
Examples of data distortion at the human level include misreading of a scale, incorrect copying of values from an instrument, transposition of digits and misplaced decimal points. Of course, such mistakes are not always easy to detect. Even if every data value is checked for plausibility, it often takes expert knowledge to know if a data value is reasonable or absurd. Consider the report in The Times of London that some surviving examples of the greater mouse-eared bat, previously thought to be extinct, had been discovered hibernating in West Sussex. It went on to assert that "they can weigh up to 30 kg" (see Hand, 2004b, Chapter 4). A considerable amount of entertaining correspondence resulted from the fact that they had misstated the weight by three decimal places.

Sometimes data are distorted from the source itself, either knowingly or not. Examples occur in survey work and tax returns, just to name two. It is well known to researchers of sexual behavior that men tend to report more lifetime sexual partners than women, a situation that is highly unlikely sociologically (National Statistics website: www.statistics.gov.uk). Some data are deliberately distorted to prevent disclosure of confidential information collected by governments in, for example, censuses (e.g., Willenborg and de Waal, 2001) and health care data.

Even if the data are initially recorded accurately, data can be compromised by data integration, data warehousing and record linkage. Often a wide range of sources of different types are involved (e.g., in the pharmaceutical sector, data from clinical trials, animal trials, manufacturers, marketing, insurance claims and postmarketing surveillance might be merged). At a more mundane level, records that describe different individuals might be inappropriately merged because they are described by the same key. When going through his medical records for insurance purposes, one of the authors discovered that he was recorded as having had his tonsils removed as a child. A subsequent search revealed the fact that the records of someone else with the same name (but a different address) had been mixed in with his. More generally, what is good quality for (the limited demands made of) an operational data base may not be good quality for (potentially unlimited demands made of) a data warehouse.

In a data warehouse, the definitions, sources and other information for the variables are contained in a dictionary, often referred to as metadata. In a large corporation it is often the IT (information technology) group that has responsibility for maintaining both the data warehouse and metadata. Merging sources and checking for consistent definitions form a large part of their duties.

A recent example in bioinformatics shows that data problems are not limited to business and economics. In a recent issue of The Lancet, Petricoin et al. (2002) reported an ability to distinguish between serum samples from healthy women, those with ovarian cancers and women with a benign ovarian disease. It was so
exciting that it prompted the "U.S. Congress to pass a resolution urging continued funding to drive a new diagnostic test toward the clinic" (Check, 2004). The researchers trained an algorithm on 50 cancer spectra and 50 normals, and then predicted 116 new spectra. The results were impressive, with the algorithm correctly identifying all 50 of the cancers, 47 out of 50 normals, and classifying the 16 benign disease spectra as "other." Statisticians Baggerly, Morris and Coombes (2004) attempted to reproduce the Petricoin et al. results, but were unable to do so. Finally, they concluded that the three types of spectra had been preprocessed differently, so that the algorithm correctly identified differences in the data, much of which had nothing to do with the underlying biology of cancer.
ing to do with the underlying biology of cancer. sort of preprocessing before it was given to the sta-
A more subtle source of data distortion is a change tistician. Some were processed actual direct measure-
in the measurement or collection procedure. When the ments, some were processed through models and some,
cause of the change is explicit and recognized, this can like the wind, were produced solely from models. Of
be adjusted for, at least to some extent. Common exam- course, this (as with curbstoning) does not necessarily
ples include a change in the structure of the Dow Jones imply that the resulting data are bad, but it should at
Industrial Average or the recent U.K. change from the least serve to warn the analyst that the data may not be
Retail Price Index to the European Union standard Har- what they were thought to be.
monized Index of Consumer Prices. In other cases, Each of these different mechanisms for data distor-
one might not be aware of the change. Some of the tions has its own set of detection and correction chal-
changes can be subtle. In looking at historical records lenges. Ensuring good data collection through survey
to assess long-term temperature changes, Jones and and/or experimental design is certainly an important
Wigley (1990) noted “changing landscapes affect tem- first step. A bad design that results in data that are not
perature readings in ways that may produce spurious representative of the phenomenon being studied can
temperature trends.” In particular, the location of the render even the best analysis worthless. At the next
weather station assigned to a city may have changed. step, detecting errors can be attempted in a variety of
During the 19th century, most cities and towns were ways, a topic to which we will return in Section 4.
too small to impact temperature readings. As urbaniza-
tion increased, urban heat islands directly affected tem- 3. IN HOW MANY WAYS?
perature readings, creating bias in the regional trends.
While global warming may be a contributor, the dom- Data can be bad in an infinite variety of ways, and
inant factor is the placement of the weather station, some authors have attempted to construct taxonomies
which moved several times. As it became more and of data distortion (e.g., Kim et al., 2003). An important
more surrounded by the city, the temperature increased, simple categorization is into missing data and distorted
mainly because the environment itself had changed. values.
A problem related to changes in the collection pro- 3.1 Missing Data
cedure is not knowing the true source of the data.
In scientific analysis, data are often preprocessed by Data can be missing at two levels: entire records
technicians and scientists before being analyzed. The might be absent, or one or more individual fields may
statistician may be unaware (or uninterested) in the be missing. If entire records are missing, any analysis
details of the processing. To create accurate models, may well be describing or making inferences about a
however, it can be important to know the source and population different from that intended. The possibility
therefore the accuracy of the measurements. Consider that entire records may be missing is particularly prob-
a study of the effect of ocean bottom topography on lematic, since there will often be no way of knowing
sea ice formation in the southern oceans (De Veaux, this. Individual fields can be missing for a huge vari-
Gordon, Comiso and Bacherer, 1993). After learning ety of reasons, and the mechanism by which they are
that wind can have a strong effect on sea ice formation, missing is likely to influence their distribution over the
If the missingness of a particular value is unrelated either to the response or predictor variables (missing completely at random—Little and Rubin, 1987, give technical definitions), then case deletion can be employed. However, even ignoring the potential bias problems, complete case deletion can severely reduce the effective sample size. In many data mining situations with a large number of variables, even though each field has only a relatively small proportion of missing values, all of the records may have some values missing, so that the case deletion strategy leaves one with no data at all.
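To see the scale of the problem, suppose a data set has 500 fields and each field is missing, independently, for just 1% of records. A record is then complete with probability 0.99^500 ≈ 0.0066, so well under 1% of the records survive case deletion. A minimal simulation sketch in Python (the dimensions are hypothetical, chosen only for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n_records, n_fields, p_miss = 10_000, 500, 0.01  # hypothetical sizes

    # Each cell is independently missing with probability 1%.
    missing = rng.random((n_records, n_fields)) < p_miss
    complete = ~missing.any(axis=1)  # records with no missing fields

    print(complete.mean())           # close to the theoretical value
    print((1 - p_miss) ** n_fields)  # 0.99**500, about 0.0066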
Complications arise when the pattern of missing data does depend on the values that would have been recorded. If, for example, there are no records for patients who experience severe pain, inferences to the entire pain distribution will be impossible (at least, without making some pretty strong distributional assumptions). Likewise, poor choice of a missing value code (e.g., 0 or 99 for age) or accidental inclusion of a missing value code in the analysis (e.g., 99,999 for age) has been known to lead to mistaken conclusions.

Sometimes missingness arises because of the nature of the problem, and presents real theoretical and practical issues. For example, in personal banking, banks accept those loan applicants whom they expect to repay the loans. For such people, the bank eventually discovers the true outcome (repay, do not repay), but for those rejected for a loan, the true outcome is unknown: it is a missing value. This poses difficulties when the bank wants to construct new predictive models (Hand and Henley, 1993; Hand, 2001). If a loan application asks for household income, replacing a missing value by a mean or even by a model-based imputation may lead to a highly optimistic assessment of risk.

When the missingness in a predictor is related directly to the response, it may be useful for exploratory and prediction purposes to create indicator variables for each predictor, where the variable is a binary indicator of whether the predictor is missing or not. For categorical predictor variables, missing values can be treated simply as a new category. In a study of dropout rates from a clinical trial for a depression drug, it was found that the single most important indicator of ultimately dropping out from the study was not the depression score on the second week’s test, as indicated from complete case analysis, but simply the indicator of whether the patient showed up to take it (De Veaux, Donahue and Small, 2002).
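As a concrete sketch of both devices—the missingness indicator and the “missing” category—in Python with pandas (the column names and values here are invented for illustration):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "week2_score": [18.0, np.nan, 22.0, np.nan],  # hypothetical trial data
        "clinic": ["A", "B", None, "A"],
    })

    # Binary indicator: was the week-2 score missing? The indicator itself
    # may carry more predictive information than the score.
    df["week2_score_missing"] = df["week2_score"].isna().astype(int)

    # For a categorical predictor, treat missing simply as a new category.
    df["clinic"] = df["clinic"].fillna("missing")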
3.2 Distorted Data

Although there are an unlimited number of possible causes of distortion, a first split can be made into those attributable to instrumentation and those attributable to human agency. Floor and ceiling effects are examples of the first kind (instruments here can be mechanical or electronic, but also questionnaires), although in this case it is sometimes possible to foresee that such things might occur and take account of this in the statistical modeling. Human distortions can arise from misreading instruments or misrecording values at any level. Brunskill (1990) gave an illustration from public records of birth weights, where ounces are commonly confused with pounds, the number 1 is confused with 11 and errors in decimal placement produce order-of-magnitude errors. In such cases, using ancillary information such as gestation times or newborn heights can help to spot gross errors. Some data collection procedures, in an attempt to avoid missing data, actually introduce distortions. A data set we analyzed had a striking number of doctors born on November 11, 1911. It turned out that most doctors (or their secretaries) wanted to avoid typing in age information, but because the program insisted on a value and the choice of 00/00/00 was invalid, the easiest way to bypass the system was simply to type 11/11/11. Such errors might not seem of much consequence, but they can be crucial. Confusion between English and metric units was responsible for the loss of the $125 million Mars Climate Orbiter space probe (The New York Times, October 1, 1999). Jet Propulsion Laboratory engineers mistook thrust impulse data expressed in the English unit of pound-seconds for the metric unit of newton-seconds. In 1985, in a precedent-setting case, the Supreme Court ruled that Dun & Bradstreet had to pay $350,000 in libel damages to a small Vermont construction company. A part-time student worker had apparently entered the wrong data into the Dun & Bradstreet data base. As a result, Dun & Bradstreet issued a credit report that mistakenly identified the construction company as bankrupt (Percy, 1986).

4. HOW TO DETECT DATA ERRORS

While it may be obvious that a value is missing from a record, it is often less obvious that a value is in error. The presence of errors can (sometimes) be proven, but the absence of errors cannot. There is no guarantee that a data set that looks perfect will not contain mistakes. Some of these mistakes may be intrinsically undetectable: they might be values that are well within the range of the data and could easily have occurred. Moreover, since errors can occur in an unlimited number of ways, there is no end to the list of possible tests for detecting errors. On the other hand, strategic choice of tests can help to pinpoint the root causes that lead to errors and, hence, to the identification of changes in the data collection process that will lead to the greatest improvement in data quality.

When the data collection can be repeated, the results of the duplicate measurements, recordings or transcriptions (e.g., the double entry system used in clinical trials) can be compared by automatic methods. In this “duplicate performance method,” a machine checks for any differences between the two data records. All discrepancies are noted, and the only remaining errors are those cases where both collectors made the same mistake. Strayhorn (1990) and West and Winkler (1991) provided statistical methods for estimating that proportion. In another quality control method, known errors are added to a data set whose integrity is then assessed by an external observer. This “known errors” method uses the success of the observer in discovering the planted errors to estimate how many genuine errors remain (Strayhorn, 1990; West and Winkler, 1991). Taking this further, one can build models (similar to those developed for software reliability) that estimate how many errors are likely to remain in a data set, based on extrapolation from the rate of discovery of errors. At some point one decides that the impact of any remaining errors on the conclusions is likely to be sufficiently small that one can ignore them.
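The arithmetic behind the seeded-errors idea can be sketched very simply (a deliberately crude version, not the actual estimators of Strayhorn or West and Winkler): if the observer finds k of the K planted errors, then k/K estimates the detection rate, and the count of genuine errors found can be scaled up by it.

    def estimate_remaining_errors(n_seeded, n_seeded_found, n_real_found):
        # Assumes planted and genuine errors are equally easy to detect.
        detection_rate = n_seeded_found / n_seeded    # e.g., 40/50 = 0.8
        est_total_real = n_real_found / detection_rate
        return est_total_real - n_real_found          # estimated errors still hidden

    # Hypothetical review: 50 errors planted, 40 found, 120 genuine errors found.
    print(estimate_remaining_errors(50, 40, 120))     # -> 30.0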
Automatic methods of data collection use metadata information to check for consistency across multiple records or variables, integrity (e.g., correct data type), plausibility (within the possible range of the data) and coherence between related variables (e.g., number of sons plus number of daughters equals number of children). Sometimes redundant data can be collected with such checks in mind. However, one cannot rely on software to protect one from mistakes. Even when such automatic methods are in place, the analyst should spend some time looking for errors in the data prior to any modeling effort.
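A minimal sketch of such rule-based checks in Python with pandas (the field names, ranges and records are invented):

    import pandas as pd

    df = pd.DataFrame({"age": [34, 99999, 27],
                       "sons": [1, 2, 0],
                       "daughters": [0, 1, 1],
                       "children": [1, 3, 2]})

    problems = pd.DataFrame({
        # Plausibility: values within a credible range.
        "age_implausible": ~df["age"].between(0, 120),
        # Coherence between related variables.
        "children_mismatch": df["sons"] + df["daughters"] != df["children"],
    })

    print(df[problems.any(axis=1)])  # rows flagged for manual review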
Data profiling is the use of exploratory and data mining tools aimed at identifying errors, rather than at the substantive questions of interest. When the number of predictor variables is manageable, simple plots such as bar charts, histograms, scatterplots and time series plots can be invaluable. The human eye has evolved to detect anomalies, and this should be taken advantage of by presenting the data in a form in which those abilities can be exploited. Such plots have become prevalent in statistical packages for examining missing data patterns. Hand, Blunt, Kelly and Adams (2000) gave the illustration of a plot showing a point for each missing value in a rectangular array of 1012 potential sufferers from osteoporosis measured on 45 variables. It is immediately clear which cases and which variables account for most of the problems.

Unfortunately, as we face larger and larger data sets, so we are also faced with increasing difficulty in data profiling. The missing value plot described above works for a thousand cases, but would probably not be so effective for 10 million. Even in this case, however, a Pareto chart of percent missing for each variable may be useful for deciding where to spend data preparation effort. Knowing that a variable is 96% missing makes one think pretty hard about including it in a model. On the other hand, separate manual examination of each of 30,000 gene expression variables is not to be recommended.
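Such a summary takes only a couple of lines. A sketch in Python with pandas (the data set here is simulated purely to have something to profile):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(1)
    # Hypothetical wide data set: 1,000 cases, 45 variables, patchy missingness.
    df = pd.DataFrame(rng.normal(size=(1000, 45)),
                      columns=[f"v{i}" for i in range(45)])
    df = df.mask(rng.random(df.shape) < rng.uniform(0, 0.3, size=45))

    # Percent missing per variable, worst first: a guide to where data
    # preparation effort (or outright variable removal) will pay off most.
    pct_missing = df.isna().mean().mul(100).sort_values(ascending=False)
    print(pct_missing.head(10).round(1))
    pct_missing.plot.bar()  # Pareto-style chart; requires matplotlib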
When even simple summaries of all the variables in a data base are not feasible, some methods for reducing the number of potential predictors in the models might be warranted. We see an important role for data mining tools here. It may be wise to reverse the usual paradigm of exploring the data first and then modeling: instead, exploratory models of the data can be useful as a first step and can serve two purposes (De Veaux, 2002). First, models such as tree models and clustering can highlight groups of anomalous cases. Second, the models can be used to reduce the number of potential predictor variables and enable the analyst to examine the remaining predictors in more detail. The resulting process is a circular one, with more examination possible at each subsequent modeling phase. Simply checking whether 500 numerical predictor variables are categorical or quantitative without the aid of metadata is a daunting (and tedious) task. In one analysis, we were asked to develop a fraud detection model for a large credit card bank. In the data set was one potential predictor variable that ranged from around 2000 to 9000, roughly symmetric and unimodal, which was selected as a highly significant predictor of fraud in a stepwise logistic regression model. It turned out that this predictor was a categorical variable (SIC code) used to specify the industry from which the product purchases in the transaction came. Useless as a predictor in a logistic regression model, it had escaped detection as a categorical variable among the several hundred potential candidates. Once the preliminary model whittled the candidate predictors down to a few dozen, it was easy to use standard data analysis techniques to detect which were appropriate for the final model.
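A crude screen would have caught the SIC code: flag numeric columns that are integer-valued with few distinct values relative to their length. A sketch (the threshold and the example frame are arbitrary, and no such screen replaces real metadata):

    import pandas as pd

    def flag_suspect_categoricals(df, max_distinct=50):
        # Numeric columns that look like codes: integer-valued, few levels.
        suspects = []
        for col in df.select_dtypes("number").columns:
            vals = df[col].dropna()
            if vals.nunique() <= max_distinct and (vals % 1 == 0).all():
                suspects.append(col)
        return suspects

    # Hypothetical frame: 'sic' is a disguised categorical, 'amount' is not.
    df = pd.DataFrame({"sic": [5812, 5411, 5812, 7011] * 250,
                       "amount": [x * 1.37 for x in range(1000)]})
    print(flag_suspect_categoricals(df))  # -> ['sic']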
5. IMPROVING DATA QUALITY

The best way to improve the quality of data is to improve things in the data collection phase. The ideal would be to prevent errors from arising in the first place. Prevention and detection have a reciprocal role to play here. Once one has detected data errors, one can investigate why they occurred and prevent them from happening in the future. Once it has been recognized (detected) that the question “How many miles do you commute to work each day?” permits more than one interpretation, mistakes can be prevented by rewording. Progress toward direct keyboard or other electronic data entry systems means that error detection tools can be applied in real time at data entry—when there is still an opportunity to correct the data. At the data base phase, metadata can be used to ensure that the data conform to expected forms, and relationships between variables can be used to cross-check entries. If the data can be collected more than once, the rate of discovery of errors can be used as the basis for a statistical model to reveal how many undetected errors are likely to remain in the data base.

Various other principles also come into play when considering how to improve data quality. For example, a Pareto principle often applies: most of the errors are attributable to just a few variables. This may happen simply because some variables are intrinsically less reliable (and less important) than others. Sometimes it is possible to improve the overall level of quality significantly by removing just a few of these low quality variables. This has a complementary corollary: a law of diminishing returns suggests that successive attempts to improve the quality of the data are likely to yield progressively less improvement. If one has a particular analytic aim in mind, then one might reasonably assert that data errors that do not affect the conclusions do not matter. Moreover, for those that do matter, perhaps the ease with which they can be corrected should have some bearing on the effort that goes into detecting them—although the overriding criterion should be the loss consequent on the error being made. This is allied with the point that the base rate of errors should be taken into account: if one expects to find many errors, then it is worth attempting to find them, since the likely rewards, in terms of an improved data base, are likely to be large. In a well-understood environment, it might even be possible to devise useful error detection and correction resource allocation strategies.

Sometimes an entirely different approach to improving data quality can be used. This is simply to hide the poor quality by coarsening or aggregating the data. In fact, a simple example of this implicitly occurs all the time: rather than reporting uncertain and error-prone final digits of measured variables, researchers round to the nearest digit.

6. CONCLUSIONS AND FURTHER DISCUSSION

This article has been about data quality from the perspective of an analyst called upon to extract some meaning from data. We have already remarked that there are also other aspects to data quality, and these are of equal importance when action is to be taken or decisions made on the basis of the data. These include such aspects as timeliness (the most sophisticated analysis applied to out-of-date data will be of limited value), completeness and, of central importance, fitness for purpose. Data quality, in the abstract, is all very well, but what may be perfectly fine for one use may be woefully inadequate for another. Thus ISO 8402 defines quality as “the totality of characteristics of an entity that bear on its ability to satisfy stated and implied needs.”

It is also important to maintain a sense of proportion in assessing and deciding how to cope with data distortions. In one large quality control problem in polymer viscosity, each 1% improvement was worth about $1,000,000 a year, but viscosity itself could be measured only to a standard deviation of around 8%. Before bothering about the accuracy of the predictor variables, it was first necessary to find improved ways to measure the response. In an entirely different context, much work in the personal banking sector concentrates on improved models for predicting risk—where, again, a slight improvement translates into millions of dollars of increased profit. In general, however, these models are based on retrospective data—data drawn from distributions that are unlikely still to apply. We need to be sure that the inaccuracies induced by this population drift do not swamp the apparent improvements we have made.

Data quality is a key issue throughout science, commerce and industry, and entire disciplines have grown up to address particular aspects of the problem. In manufacturing and, to a lesser extent, the service industries, we have schools of quality control and total quality management (Six Sigma, Kaizen, etc.). In large part, these are concerned with reducing random variation. In official statistics, strict data collection protocols are typically used.

Of course, ensuring high quality data does not come without a cost. The bottom line is that one must weigh the potential gains to be made from capturing and recording better quality data against the costs of ensuring that quality. No matter how much money one spends, and how much resource one consumes, in attempting to detect and prevent bad data, the unfortunate fact is that bad data will always be with us.
REFERENCES

BAGGERLY, K. A., MORRIS, J. S. and COOMBES, K. R. (2004). Reproducibility of SELDI-TOF protein patterns in serum: Comparing datasets from different experiments. Bioinformatics 20 777–785.
BRUNSKILL, A. J. (1990). Some sources of error in the coding of birth weight. American J. Public Health 80 72–73.
CHECK, E. (2004). Proteomics and cancer: Running before we can walk? Nature 429 496–497.
COALE, A. J. and STEPHAN, F. F. (1962). The case of the Indians and the teen-age widows. J. Amer. Statist. Assoc. 57 338–347.
DE VEAUX, R. D. (2002). Data mining: A view from down in the pit. Stats (34) 3–9.
DE VEAUX, R. D., DONAHUE, R. and SMALL, R. D. (2002). Using data mining techniques to harvest information in clinical trials. Presentation at Joint Statistical Meetings, New York.
DE VEAUX, R. D., GORDON, A., COMISO, J. and BACHERER, N. E. (1993). Modeling of topographic effects on Antarctic sea-ice using multivariate adaptive regression splines. J. Geophysical Research—Oceans 98 20,307–20,320.
HAND, D. J. (2001). Reject inference in credit operations. In Handbook of Credit Scoring (E. Mays, ed.) 225–240. Glenlake Publishing, Chicago.
HAND, D. J. (2004a). Academic obsessions and classification realities: Ignoring practicalities in supervised classification. In Classification, Clustering and Data Mining Applications (D. Banks, L. House, F. R. McMorris, P. Arabie and W. Gaul, eds.) 209–232. Springer, Berlin.
HAND, D. J. (2004b). Measurement Theory and Practice: The World Through Quantification. Arnold, London.
HAND, D. J., BLUNT, G., KELLY, M. G. and ADAMS, N. M. (2000). Data mining for fun and profit (with discussion). Statist. Sci. 15 111–131.
HAND, D. J. and HENLEY, W. E. (1993). Can reject inference ever work? IMA J. of Mathematics Applied in Business and Industry 5(4) 45–55.
HUFF, D. (1954). How to Lie with Statistics. Norton, New York.
JONES, P. D. and WIGLEY, T. M. L. (1990). Global warming trends. Scientific American 263(2) 84–91.
KIM, W., CHOI, B.-J., HONG, E.-K., KIM, S.-K. and LEE, D. (2003). A taxonomy of dirty data. Data Mining and Knowledge Discovery 7 81–99.
KLEIN, B. D. (1998). Data quality in the practice of consumer product management: Evidence from the field. Data Quality 4(1).
KRUSKAL, W. (1981). Statistics in society: Problems unsolved and unformulated. J. Amer. Statist. Assoc. 76 505–515.
LAUDON, K. C. (1986). Data quality and due process in large interorganizational record systems. Communications of the ACM 29 4–11.
LITTLE, R. J. A. and RUBIN, D. B. (1987). Statistical Analysis with Missing Data. Wiley, New York.
LOSHIN, D. (2001). Enterprise Knowledge Management: The Data Quality Approach. Morgan Kaufmann, San Francisco.
MADNICK, S. E. and WANG, R. Y. (1992). Introduction to the TDQM research program. Working Paper 92-01, Total Data Quality Management Research Program.
MOREY, R. C. (1982). Estimating and improving the quality of information in a MIS. Communications of the ACM 25 337–342.
PERCY, T. (1986). My data, right or wrong. Datamation 32(11) 123–124.
PETRICOIN, E. F., III, ARDEKANI, A. M., HITT, B. A., LEVINE, P. J., FUSARO, V. A., STEINBERG, S. M., MILLS, G. B., SIMONE, C., FISHMAN, D. A., KOHN, E. C. and LIOTTA, L. A. (2002). Use of proteomic patterns in serum to identify ovarian cancer. The Lancet 359 572–577.
PIERCE, E. (1997). Modeling database error rates. Data Quality 3(1). Available at www.dataquality.com/dqsep97.htm.
PRICEWATERHOUSECOOPERS (2004). The Tech Spotlight 22. Available at www.pwc.com/extweb/manissue.nsf/docid/2D6E2F57E06E022F85256B8F006F389A.
REDMAN, T. C. (1992). Data Quality: Management and Technology. Bantam, New York.
STRAYHORN, J. M. (1990). Estimating the errors remaining in a data set: Techniques for quality control. Amer. Statist. 44 14–18.
WAINER, H. (2004). Curbstoning IQ and the 2000 presidential election. Chance 17(4) 43–46.
WEST, M. and WINKLER, R. L. (1991). Data base error trapping and prediction. J. Amer. Statist. Assoc. 86 987–996.
WILLENBORG, L. and DE WAAL, T. (2001). Elements of Statistical Disclosure Control. Springer, New York.
WOLINS, L. (1962). Responsibility for raw data. American Psychologist 17 657–658.
Statistical Science
2005, Vol. 20, No. 3, 239–241
DOI 10.1214/088342305000000250
© Institute of Mathematical Statistics, 2005

How to Accuse the Other Guy of Lying with Statistics
Charles Murray
Abstract. We’ve known how to lie with statistics for 50 years now. What we really need are theory and praxis for accusing someone else of lying with statistics. The author’s experience with the response to The Bell Curve has led him to suspect that such a formulation already exists, probably imparted during a secret initiation for professors in the social sciences. This article represents his best attempt to reconstruct what must be in it.

Key words and phrases: Public policy, regression analysis, lying with statistics.
Charles Murray is the Bradley Fellow at the American Enterprise Institute for Public Policy Research, 1150 Seventeenth Street, N.W., Washington, D.C. 20036, USA (e-mail: charlesmurray@adelphia.net).

In 1994, the late Richard J. Herrnstein and I published The Bell Curve (Herrnstein and Murray, 1994) and set off an avalanche of editorials, news stories, articles and entire books in response. That avalanche included valuable technical contributions that have moved the debate forward. But much of the reaction that went under the cover of scholarly critique baffled me because it seemed transparently meretricious. These people were too smart and well trained to believe their own arguments, I said to myself, and I spent many hours imagining how they rationalized lying (in my not-disinterested view) about the book’s arguments and evidence. But The Bell Curve wasn’t a unique case. For books on certain high-profile policy issues—Bjorn Lomborg’s The Skeptical Environmentalist (Lomborg, 1998) is another prominent example—the ordinary rules constraining scholarly debate seem to go out the window. In my more paranoid moments, I envision a secret initiation for newly appointed assistant professors in the social sciences that goes something like this:

*

Over the last few decades, a number of books on public policy aimed at a lay readership have advanced conclusions that no socially responsible person can abide, written so cleverly that they have misled many gullible people. Unfortunately, the people who write such books often call upon data that have some validity, which confronts us with a dilemma. Such books must be discredited, but if we remain strictly within the rules of scholarly discourse, they won’t be. What to do? Recall Lawrence Kohlberg’s theory of moral development: at the sixth and highest level of morality, it is permissible to violate ordinary ethical conventions to serve a higher good (Kohlberg, 1981). Such is the situation forced upon us by these books. Let me offer six strategies that you may adapt to the specific situation you face.

As you consider these strategies, always keep in mind the cardinal principle when attacking the target book: Hardly anyone in your audience will have read it. If you can convince the great majority who never open the book, it doesn’t matter that the tiny minority who have read it will know what you are doing.

#1. THE WHOLE THING IS A MESS

This is a form of softening up, “preparing the battlefield” as the military would put it. The goal is to generate maximum smoke. The specific criticisms need not be central to the target book’s argument. They need not even be relevant. All you need to do is to create an impression of many errors, concluding with, “If a sophomore submitted this as a paper in my introductory [insert name of course], I would flunk it.”
Samples offer a rich source of smoke. Something is wrong with every sample. Start with that assumption, which has the advantage of being true, seek out that something, and then announce that the data are uninterpretable. If the sample is representative, argue that the data are outdated. If the sample is recent, argue that it is unrepresentative. If it is both recent and representative, you may be able to get some mileage out of missing data. If the author drops cases with missing data, argue that doing so biases the sample. If instead the author uses imputed values, accuse him of making up data.

Another excellent way to create smoke is to focus on the target book’s R²’s, which are almost always going to be smaller than 0.5 and often will be around 0.1 or 0.2. The general form of the accusation in this case is, “[The independent variable] that the author claims is so important explains only [x] percent of the variance in [the dependent variable]. That means that [100 − x] percent is the result of other causes. The role of [the author’s independent variable] is trivial.” Do not let slip that your own published work is based on similarly low R²’s.

A third generic way to create smoke is to accuse the author of choosing the wrong analytical model. The author chose a linear model when it should have been nonlinear. He chose a tobit model instead of a negative binomial model. He used a fixed-effects model instead of a random-effects model. Here the general form of your position is, “Even a first-year graduate student would know better than to use [the target’s choice of model] instead of [the preferred model].” Do not be deterred if the results are robust across alternative models. Remember the cardinal rule: Hardly anyone will have read the book, so hardly anyone will know.

#2. KEEP ADDING INDEPENDENT VARIABLES

Now you are ready to demonstrate that the author is not only incompetent, but wrong. If you have access to data for replicating the target book’s analysis, one statistical tool is so uniformly devastating that no critic should be without it: Keep adding independent variables. Don’t worry if the new variables are not causally antecedent to the author’s independent variables. You can achieve the same result by adding correlated independent variables that are causally posterior. The regression coefficients for the key variables in the target book’s analyses will be attenuated and sometimes become statistically insignificant. Technical note: Combine the old and new variables into a single-equation model, not into a multi-equation model. You don’t want to give your reader a chance to realize that you’re saying that the sun rises because it gets light.
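Skeptics can reassure themselves about the tool on simulated data. A sketch in Python (the variable names and effect sizes are invented; m plays the part of the causally posterior variable):

    import numpy as np

    rng = np.random.default_rng(42)
    n = 5_000
    x = rng.normal(size=n)                 # the target book's key variable
    y = 1.0 * x + rng.normal(size=n)       # x genuinely drives the outcome
    m = y + rng.normal(scale=0.5, size=n)  # correlated, causally posterior

    def ols_coefs(y, *cols):
        X = np.column_stack([np.ones_like(y), *cols])
        return np.linalg.lstsq(X, y, rcond=None)[0]

    print(ols_coefs(y, x))     # coefficient on x: about 1.0
    print(ols_coefs(y, x, m))  # coefficient on x: about 0.2

The coefficient on x collapses merely because m is a noisy copy of the outcome. The author’s effect is real; your regression says otherwise.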
So far, I have given you some tools for fighting statistics with statistics. But remember Frederick Mosteller’s dictum that while it is easy to lie with statistics, it is even easier to lie without them. Let me turn now to refutations of statistical evidence that exploit this profound truth.
#3. ANY ALTERNATIVE EXPLANATION THAT CAN BE IMAGINED IS TRUE

The first of these ways to fight evidence without evidence calls on the power of the alternative explanatory hypothesis. As every poker player knows, it is not necessary actually to have good cards if you play the hand as if you had good cards. Similarly, you can advance competing hypotheses as if they are known to be true, as in this form: “The author fails to acknowledge that [some other cause] can have [the observed effect], invalidating the author’s explanation.” Technical note: Don’t make the beginner’s mistake of using “could” instead of “can” in this formulation—a careful reader might notice the implication that the alternative has no evidence to back it up.

#4. NOTHING IS INNOCENT

If you can persuade your audience that the author of the target book is slanting the data, you cast a cloud of suspicion over everything the author says. Thus the rationale for strategy #4, again happily requiring no evidence: Treat any inconsistency or complication in the target book’s interpretation of the data as deliberately duplicitous. Some useful phrases are that the author “tries to obscure. . . ,” “conspicuously fails to mention. . . ” or “pretends not to be aware that. . . .” Here, remember that the more detailed the book’s technical presentation, the more ammunition you have: any time the author introduces a caveat or an alternative interpretation in an endnote or appendix, it has been deliberately hidden.

#5. SOMEONE SOMEWHERE SOMETIME HAS SAID WHAT YOU PREFER TO BE TRUE

Sometimes the target book will use evidence based on a review of the extant technical literature. Such evidence is as easy to attack as the quantitative evidence if you remember “The Rule of One,” which is as follows: In a literature in which a large number of studies find X but even one study finds not-X, and the finding X is pernicious, you may ignore the many and focus exclusively on the one. Ideally, the target book will not have cited the anomalous study, allowing you to charge that the author deliberately ignored it (see strategy #4). But even if the target book includes the anomalous study in its literature review, you can still treat the one as definitive. Don’t mention the many.

A related principle is the “Preferential Option for the Most Favorable Finding,” applied to panel studies and/or disaggregated results for subsamples. If the author of the target book has mentioned the overall results of such a study, find the results for one of the panels or one of the subsamples that are inconsistent with the overall finding, and focus on them. As you gain experience, you will eventually be able to attack the target book using one subsample from an early panel and another subsample from a later panel without anyone noticing.

#6. THE JUDICIOUS USE OF THE BIG LIE

Finally, let us turn from strategies based on half-truths and misdirection to a more ambitious approach: to borrow from Goebbels, the Big Lie.

The necessary and sufficient condition for a successful Big Lie is that the target book has at some point discussed a politically sensitive issue involving gender, race, class or the environment, and has treated this issue as a scientifically legitimate subject of investigation (note that the discussion need not be a long one, nor is it required that the target book take a strong position, nor need the topic be relevant to the book’s main argument). Once this condition is met, you can restate the book’s position on this topic in a way that most people will find repugnant (e.g., women are inferior to men, blacks are inferior to whites, we don’t need to worry about the environment), and then claim that this repugnant position is what the book is about.

What makes the Big Lie so powerful is the multiplier effect you can get from the media. A television news show or a syndicated columnist is unlikely to repeat a technical criticism of the book, but a nicely framed Big Lie can be newsworthy. And remember: It’s not just the public who won’t read the target book. Hardly anybody in the media will read it either. If you can get your accusation into one important outlet, you can start a chain reaction. Others will repeat your accusation, soon it will become the conventional wisdom, and no one will remember who started it. Done right, the Big Lie can forever after define the target book in the public mind.

*

So there you have it: six tough but effective strategies for making people think that the target book is an irredeemable mess, the findings are meaningless, the author is incompetent and devious, and the book’s thesis is something it isn’t. Good luck and good hunting.

REFERENCES

HERRNSTEIN, R. J. and MURRAY, C. (1994). The Bell Curve: Intelligence and Class Structure in American Life. Free Press, New York.
KOHLBERG, L. (1981). The Philosophy of Moral Development: Moral Stages and the Idea of Justice. Harper and Row, San Francisco.
LOMBORG, B. (1998). The Skeptical Environmentalist: Measuring the Real State of the World. Cambridge Univ. Press.