Gwirtz Et Al - GJI - 2021

Geophys. J. Int. (2021) 225, 277–297 doi: 10.1093/gji/ggaa542
Advance Access publication 2020 November 13
GJI Geomagnetism, Rock Magnetism and Palaeomagnetism
Can one use Earth’s magnetic axial dipole field intensity to predict
reversals?
K. Gwirtz,1 M. Morzfeld,1 A. Fournier2 and G. Hulot2
1 Institute of Geophysics and Planetary Physics, Scripps Institution of Oceanography, University of California, San Diego, La Jolla, CA 92037, USA. E-mail: kgwirtz@ucsd.edu
2 Université de Paris, Institut de Physique du Globe de Paris, CNRS, F-75005 Paris, France
Accepted 2020 November 10. Received 2020 November 2; in original form 2020 May 29

SUMMARY
We study predictions of reversals of Earth’s axial magnetic dipole field that are based solely on
the dipole’s intensity. The prediction strategy is, roughly, that once the dipole intensity drops
below a threshold, then the field will continue to decrease and a reversal (or a major excursion)
will occur. We first present a rigorous definition of an intensity threshold-based prediction
strategy and then describe a mathematical and numerical framework to investigate its validity
and robustness in view of the data being limited. We apply threshold-based predictions to
a hierarchy of numerical models, ranging from simple scalar models to 3-D geodynamos.
We find that the skill of threshold-based predictions varies across the model hierarchy. The
differences in skill can be explained by differences in how reversals occur: if the field decreases
towards a reversal slowly (in a sense made precise in this paper), the skill is high, and if the field
decreases quickly, the skill is low. Such a property could be used as an additional criterion to
identify which models qualify as Earth-like. Applying threshold-based predictions to Virtual
Axial Dipole Moment palaeomagnetic reconstructions (PADM2M and Sint-2000) covering
the last two million years reveals a moderate skill of threshold-based predictions for Earth’s
dynamo. Despite all of their limitations, threshold-based predictions suggest that no reversal
is to be expected within the next 10 kyr. Most importantly, however, we show that considering
an intensity threshold for identifying upcoming reversals is intrinsically limited by the dynamic
behaviour of Earth’s magnetic field.
Key words: Dynamo: theories and simulations; Magnetic field variations through time;
Palaeointensity; Reversals: process, timescale, magnetostratigraphy; Time-series analysis.
1 INTRODUCTION
Earth possesses a time-varying magnetic field which is generated and sustained by turbulent flow of liquid metal alloy in the core. The field varies over a wide range of spatial and temporal scales, but this paper focuses on the dynamics of the axial dipole component over millions of years (Myr henceforth), which are relevant for the investigation of dipole reversals. When a reversal occurs, the intensity of the dipole collapses and then builds up in reversed polarity, with the magnetic north pole becoming the south pole and vice versa. Occurrence of dipole reversals is well-documented over the past 150 Myr (Cande & Kent 1995; Lowrie & Kent 2004; Ogg 2012). We thus know that the last reversal occurred about 780 kilo years (kyr henceforth) ago and that the average reversal rate over the past 5–10 Myr is about 4 reversals per Myr (see, e.g. Morzfeld & Buffett 2019). Given these numbers, we wonder if we can reliably predict if a reversal can be expected to occur any time soon.
At first sight, the task seems hopeless because simulations of Earth’s magnetic field suggest that the geomagnetic field is not predictable beyond a century (Hulot et al. 2010b; Lhuillier et al. 2011a). The typical time elapsed between two reversals is much larger (often hundreds of millennia) which implies that the exact timing of a reversal cannot be predicted until a reversal is just about to happen (see also Hulot & Le Mouël 1994; Lhuillier et al. 2011b). This predictability limit, however, concerns the field in its full detail, and one may be able to identify macroscopic conditions that occur over long timescales that are largely independent of the detailed morphology of the field. For the remainder of this paper, we assume that the predictability limit for reversals, viewed as macroscopic features, is larger than the predictability limit of the field’s details. This is motivated by the rich low-frequency dynamics of the long-term dipole field (Constable & Johnson 2005), and by the recent study of Morzfeld et al. (2017) that suggested this could possibly be the case.

© The Author(s) 2020. Published by Oxford University Press on behalf of The Royal Astronomical Society. All rights reserved. For
permissions, please e-mail: journals.permissions@oup.com
In fact, many researchers have implicitly relied on this assumption and studied precursors of dipole reversals. Examples include careful investigations of the characteristics of past reversals (see, e.g. Valet & Fournier 2016, for a recent review), studying the field structure during reversals and excursions (Brown et al. 2018), studying the cause of the present fast decrease of the dipole field, which could be a precursor for a reversal (see, e.g. Hulot et al. 2002; Finlay et al. 2016), and computational modelling (see, e.g. Olson et al. 2009). Despite these efforts, no consensus has been reached as to what a reliable precursor for a reversal is (see, e.g. Constable & Korte 2006; Laj & Kissel 2015). This is caused, at least in part, by the fact that simulations and the palaeomagnetic record indicate that the details of dipole reversals vary greatly (see, e.g. Hulot et al. 2010a; Glatzmaier & Coe 2015), even if their directional behaviour, as recorded by lava flows, shows some degree of similarity from one reversal to the next (Valet et al. 2012).
Here, we revisit these issues and specifically test the often suggested possibility that a small value of the dipole’s strength could be used as a natural indicator of an upcoming dipole reversal. Our predictions do not distinguish between reversals and major excursions that lead to a near or total collapse of the axial dipole field, but end up with the axial dipole rebuilding with the same polarity (see below for why). In this study, we therefore collectively refer to reversals or such excursions as ‘low-dipole events’ (see Section 3 for a precise definition of a low-dipole event).
The idea is as follows. During a low-dipole event, the dipole intensity drops to a very low value. Since the intensity is a continuous function, it must have approached this low intensity level continuously. One may thus ask: can one identify a threshold with the property that if this threshold is passed, the intensity will continue to decay and a low-dipole event will occur during a specified time interval, called the prediction horizon? The prediction horizon is critical to the usefulness of the prediction strategy. A prediction horizon of several million years, for example, is not useful, because a low-dipole event is likely to occur over these timescales. Similarly, a prediction horizon of a few hundred years is not useful because the low-dipole event may be already in full swing when we catch it. Given that it takes several kyr for a dipole reversal to take place, a useful prediction horizon should be at least several kyr long.
We can now state the question we want to address more precisely: Can we identify a threshold that is useful for predicting low-dipole events? We study this question via a hierarchy of models, ranging from simplified, low-dimensional models, to 3-D simulations of Earth’s magnetic field. We identify, for each model, a threshold by maximizing a skill score that quantifies the skill of the prediction. When identifying a threshold one should keep in mind that the event ‘a low-dipole event will occur during the prediction horizon’ is rare in comparison to the event ‘no low-dipole event will occur during the prediction horizon’ (at least for useful prediction horizons); this is addressed by using well-established skill scores that are robust to imbalances of the occurrence of one event over another. We carefully discuss the numerical robustness of our approach and also study robustness with respect to the duration of the training data that are used to identify a threshold. We then apply the same methodology to palaeomagnetic reconstructions and discuss the geophysical implications of our study.
Overall, we introduce a new prediction strategy, apply it to four models and two palaeomagnetic reconstructions (PADM2M and Sint-2000), and test it with a variety of skill scores. This causes us to use a large number of acronyms, most of which are listed in Table A1.

2 BACKGROUND: MODEL HIERARCHY AND SKILL SCORES
We briefly describe the geomagnetic models we use and then outline how to assess prediction strategies via skill scores and receiver operating characteristic (ROC) curves. Readers who are familiar with the models we use or with skill scores and ROC curves may skip this part of the paper.

2.1 Numerical modelling of the geomagnetic field
A realistic model for the Earth’s magnetic field is a 3-D magnetohydrodynamic (MHD) model. Today’s MHD models are realistic representations of Earth’s magnetic field over a large range of spatial and temporal scales (Schaeffer et al. 2017; Aubert 2019; Wicht & Sanchez 2019), but the simulation of dipole reversals remains a computational challenge and the number of MHD simulations that exhibit reversals remains limited (Lhuillier et al. 2013; Olson et al. 2013). The reason is that Earth-like, high-resolution simulations of the field are difficult to do, even with today’s supercomputers. As a result, simulations that exhibit reversals often require that they be pushed away from the Earth-like regime. For example, the Ekman number is a control parameter which expresses the ratio of the rotation timescale to the viscous timescale. Increasing the Ekman number amounts to increasing the kinematic diffusivity of the fluid and thereby the laminar character of the simulated flow. This in turn decreases the required resolution and the time-to-solution. For this reason, many reversing simulations are characterized by an Ekman number that is much larger than the Ekman number of the Earth’s dynamo.
An alternative to 3-D simulations are low-dimensional models. The terminology is perhaps confusing here because the word ‘dimensional’ does not refer to the spatial dimension, but the number of variables within the model. In this terminology, a 3-D model is high-dimensional because it contains a large number of variables that describe the 3-D structure of the fluid flow and its interactions with the magnetic field. The 3-D model we consider below has more than three million variables and, hence, its dimension is $O(10^6)$. Low-dimensional models aim to represent selected aspects of the geodynamo (in our case the axial dipole over Myr timescales) with only a small number of variables. The models we consider have one or three variables and, hence, dimension one or three, which is six orders of magnitude less than the 3-D model. Examples of low-dimensional models include scalar stochastic differential equations (SDE) that model the time evolution of the axial dipole as a particle in a double well potential (Hoyng et al. 2001; Schmitt et al. 2001; Buffett et al. 2013, 2014; Buffett & Matsui 2015; Buffett 2015; Meduri & Wicht 2016; Morzfeld & Buffett 2019), scalar SDEs that are inspired by MHD (Pétrélis et al. 2009), and systems of chaotic differential equations that model the interaction of the dipole with the non-dipole (quadrupole) field, coupled and perturbed by a velocity variable (Gissinger 2012).

2.2 The model hierarchy
We consider three low-dimensional models and one 3-D simulation. We give a concise description of all four models we use and refer to the original works for further information. The 3-D model we use is unpublished and, for that reason, we provide more information about the 3-D model than the simpler models.
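As a rough illustration of how small a low-dimensional model is in practice, the sketch below integrates the three-variable G12 system of Section 2.2.1 (eq. 1), with the parameter values quoted there; the constant forcing term in the equation for V, written Γ in Gissinger (2012), is 0.9. It uses a fixed-step classical fourth-order Runge–Kutta scheme, which is a simplification and not the adaptive Matlab ode45 solver named in the text. The step size, the initial state and all function names are assumptions made for illustration only.

```python
import numpy as np

# Parameter values quoted in Section 2.2.1 (Gissinger 2012).
MU, NU, GAMMA = 0.119, 0.1, 0.9
KYR_PER_TIME_UNIT = 4.0  # 1 dimensionless time unit = 4 kyr (Section 2.2.1)
DT = 0.05                # step in dimensionless time units (assumed, not from the paper)


def g12_rhs(state):
    """Right-hand side of eq. (1) for state = (Q, D, V)."""
    q, d, v = state
    return np.array([MU * q - v * d, -NU * d + v * q, GAMMA - v + q * d])


def simulate_g12(n_steps, dt=DT, initial_state=(1.0, 1.0, 0.0)):
    """Integrate G12 with fixed-step classical RK4.

    The initial state is an arbitrary illustrative choice.
    Returns an array of shape (n_steps + 1, 3) with columns Q, D, V.
    """
    states = np.empty((n_steps + 1, 3))
    states[0] = initial_state
    for i in range(n_steps):
        y = states[i]
        k1 = g12_rhs(y)
        k2 = g12_rhs(y + 0.5 * dt * k1)
        k3 = g12_rhs(y + 0.5 * dt * k2)
        k4 = g12_rhs(y + dt * k3)
        states[i + 1] = y + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return states


states = simulate_g12(n_steps=2000)
time_kyr = np.arange(states.shape[0]) * DT * KYR_PER_TIME_UNIT
dipole = states[:, 1]
# A sign change of D between consecutive samples marks a change of polarity.
n_sign_changes = int(np.sum(np.sign(dipole[1:]) != np.sign(dipole[:-1])))
print(f"simulated {time_kyr[-1]:.0f} kyr, {n_sign_changes} sign changes of D")
```

Longer runs (larger `n_steps`) would be needed to collect reversal statistics; the short run here only demonstrates the mechanics.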
Figure 1. Signed dipole as a function of time (Myr) for the four models considered in this study. (a) G12, (b) P09, (c) DW and (d) 3-D. In each case, the amplitude is scaled so that the average absolute value of the time series is one. Some reversals and excursions are highlighted in light red and light blue.
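The two scalar stochastic models shown in panels (b) and (c), described in Sections 2.2.2 and 2.2.3, and the scaling used in Fig. 1 can be sketched as follows. This is a minimal sketch under stated assumptions: both models are stepped with a plain forward Euler–Maruyama scheme (the text uses a Runge–Kutta step for the deterministic part of the DW model), the noise amplitude of P09 is taken as $\sqrt{2q} = 0.2\sqrt{|\alpha_1|}$, the DW state is expressed in units of $10^{22}$ Am$^2$, and the initial states, run length, random seed and function names are arbitrary choices that do not come from the paper.

```python
import numpy as np


def euler_maruyama(drift, q, x_init, n_steps, dt, rng):
    """Integrate dx = drift(x) dt + sqrt(2 q) dW with forward Euler-Maruyama."""
    x = np.empty(n_steps + 1)
    x[0] = x_init
    noise = np.sqrt(2.0 * q * dt) * rng.standard_normal(n_steps)
    for i in range(n_steps):
        x[i + 1] = x[i] + drift(x[i]) * dt + noise[i]
    return x


def scale_to_unit_mean_abs(series):
    """Scale a series so that its average absolute value is one (as in Fig. 1)."""
    return series / np.mean(np.abs(series))


# P09 (Section 2.2.2), time in kyr: alpha_1 = -185 / Myr = -0.185 / kyr.
ALPHA1 = -0.185
ALPHA0 = -0.9 * ALPHA1
Q_P09 = 0.5 * (0.2 * np.sqrt(abs(ALPHA1))) ** 2  # assumes sqrt(2q) = 0.2 sqrt(|alpha_1|)


def p09_drift(x):
    return ALPHA0 + ALPHA1 * np.sin(2.0 * x)


# DW (Section 2.2.3), x in units of 1e22 A m^2, time in kyr.
GAMMA_DW, X_BAR, Q_DW = 0.1, 5.23, 0.34


def dw_drift(x):
    return GAMMA_DW * (x / X_BAR) * ((X_BAR - x) if x >= 0 else (x + X_BAR))


rng = np.random.default_rng(0)  # the seed is an arbitrary choice
phase = euler_maruyama(p09_drift, Q_P09, x_init=0.0, n_steps=5000, dt=1.0, rng=rng)
p09_dipole = 1.3 * np.cos(phase + 0.3)  # D = R cos(x + x0) with R = 1.3, x0 = 0.3
dw_dipole = euler_maruyama(dw_drift, Q_DW, x_init=X_BAR, n_steps=5000, dt=1.0, rng=rng)

p09_scaled = scale_to_unit_mean_abs(p09_dipole)
dw_scaled = scale_to_unit_mean_abs(dw_dipole)
```

With a 1 kyr time step, 5000 steps correspond to 5 Myr of model time, comparable to the windows shown in the figure.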
2.2.1 The deterministic G12 model
Following Gissinger (2012), we consider the ordinary differential equations

$$\frac{dQ}{dt} = \mu Q - V D, \qquad \frac{dD}{dt} = -\nu D + V Q, \qquad \frac{dV}{dt} = \Gamma - V + Q D, \tag{1}$$

where $\mu = 0.119$, $\nu = 0.1$ and $\Gamma = 0.9$. Here, D is the dipole and the variable Q represents the quadrupole or, more generally, the non-dipole field; V is a velocity variable that couples D and Q. A change in the sign of D corresponds to a dipole reversal. We refer to this model as the G12 model. A typical simulation with G12 is shown in Fig. 1. Here, model time t is scaled to represent the G12 millennium timescale (1 dimensionless time unit = 4 kyr), see Morzfeld et al. (2017). The simulation is done by discretizing the differential equation by a fourth-order Runge–Kutta scheme (Matlab’s ode45).

2.2.2 The stochastic P09 model
Pétrélis et al. (2009) derived a model for dipole reversals by considering the interaction of two modes. Using the symmetry of the equations of magnetohydrodynamics B → −B in an amplitude equation, and by assuming that the amplitude has a shorter timescale than a phase, a stochastic differential equation (SDE) of the form

$$dx = f(x)\,dt + \sqrt{2q}\,dW, \tag{2}$$

is derived for the phase, x, where f(x) and q are defined below. In eq. (2), W is Brownian motion, a stochastic process with the following properties: (i) $W(0) = 0$; (ii) $W(t) - W(t + \Delta t) \sim \mathcal{N}(0, \Delta t)$ and (iii) W(t) is almost surely continuous for all t ≥ 0 (see, e.g. Chorin & Hald 2013). Here and below, $\mathcal{N}(m, \sigma^2)$ denotes a Gaussian random variable with mean m, standard deviation σ and variance $\sigma^2$.
More specifically, the SDE for the phase is defined by

$$f(x) = \alpha_0 + \alpha_1 \sin(2x), \qquad \sqrt{2q} = 0.2\sqrt{|\alpha_1|}. \tag{3}$$

We use the same parameters as in Pétrélis et al. (2009), $\alpha_1 = -185$ Myr$^{-1}$, $\alpha_0/\alpha_1 = -0.9$. The dipole, D, can be calculated from the phase by $D = R\cos(x + x_0)$. Following Morzfeld et al. (2017), we set $x_0 = 0.3$ and $R = 1.3$ [the latter scales the dipole variable D to have approximately the same time average as the relative palaeointensity reported by the reconstruction of Sint-2000 (Valet et al. 2005)]. For the remainder of this paper, we refer to this model as the P09 model.
Note that the parameters define the model’s timescale. The parameters are chosen so that the P09 model exhibits reversals and excursions, and so that its reversal rate is comparable to that of Earth’s dipole. This is illustrated in Fig. 1, where a typical simulation result with this model is shown. For a simulation we discretize the differential equation using a forward Euler–Maruyama method (Kloeden & Platen 1999). The time step is 1 kyr.

2.2.3 The double well model
A simple model for reversals of a quantity (not necessarily Earth’s dipole field) is a particle in a double well potential. Such a model is defined by an SDE as in eq. (2), with an f(x) that is equal to the negative gradient of a double well potential. Variations of this model for geomagnetic dipole reversals have been considered by many researchers (Hoyng et al. 2001; Schmitt et al. 2001; Buffett et al. 2013, 2014; Buffett & Matsui 2015; Buffett 2015; Meduri & Wicht 2016; Morzfeld & Buffett 2019). The basic idea is that the state, x, of the SDE is within one of the two wells of the double well potential and is pushed around by noise (the Brownian motion $\sqrt{2q}\,dW$). When the noise builds up towards one side of the well, the state may cross over to the other potential well. One can identify a transition from one well to the other as a reversal of Earth’s dipole. We use a recent version of this model, called the Myr model in Morzfeld & Buffett (2019), for which

$$f(x) = \gamma\,\frac{x}{\bar{x}} \cdot \begin{cases} (\bar{x} - x), & \text{if } x \ge 0 \\ (x + \bar{x}), & \text{if } x < 0 \end{cases}, \tag{4}$$

where $\gamma = 0.1$ kyr$^{-1}$, $\bar{x} = 5.23 \times 10^{22}$ Am$^2$ and $q = 0.34 \times 10^{44}$ A$^2$ m$^4$ kyr$^{-1}$. These parameters define the model’s natural timescale and the values we chose are based on configuration (a) in Morzfeld & Buffett (2019), which implies that the model’s reversal rate is
comparable with Earth’s reversal rate. For the remainder of this paper, we refer to this model as the DW model. A typical simulation of this model is shown in Fig. 1. For a simulation we discretize the equation using a fourth-order Runge–Kutta for the deterministic part and a forward Euler–Maruyama for the stochastic component (Kloeden & Platen 1999). The time step is 1 kyr.

2.2.4 The 3-D model
We consider a 3-D, convection-driven, dynamo simulation which exhibits polarity reversals and dipole excursions. The simulation we consider has not yet been published and is part of an ensemble of reversing simulations run by N. Schaeffer (ISTerre, CNRS, Université Grenoble Alpes), A. Fournier and T. Gastine (both affiliated with Université de Paris, Institut de Physique du Globe de Paris). We refer to this simulation simply as the 3-D model for the rest of this paper.
The 3-D model uses a pseudo-spectral approximation to solve the set of equations governing rotating dynamo action in a spherical shell geometry (see, e.g. Christensen & Wicht 2015, for details). The scales chosen to non-dimensionalize the set of equations are the same as those used by, for example, Schaeffer et al. (2017). The radius ratio of the inner-core boundary to the core–mantle boundary is set to its present-day value. The 3-D model has no-slip boundary conditions on the inner-core and core–mantle boundaries, and it assumes that the inner core is conducting. The four non-dimensional control parameters, as defined, for example, in Schaeffer et al. (2017), are as follows. The Ekman number is $10^{-4}$, the Prandtl number is 1, the magnetic Prandtl number is 3 and the Rayleigh number is 15 000. These choices result in an average hydrodynamic Reynolds number of 216 (recall that the hydrodynamic Reynolds number is defined as the product of the root-mean-squared velocity and the shell thickness divided by the kinematic viscosity). The open-source, freely available xshells code¹ is used to numerically solve the equations. This code combines the finite difference method in the radial direction with a spherical harmonic representation of field variables in the horizontal direction, using the dedicated SHTns library (Schaeffer 2013). To ensure numerical convergence, a hyperdiffusivity is applied beyond spherical harmonic degree 55.
¹ https://nschaeff.bitbucket.io/xshells/
The resolution of the 3-D model is defined by the triplet $(N_r, \ell_{\max}, m_{\max})$, giving the number of points used in the radial direction together with the maximum degree and order used in the horizontal approximation of field variables with spherical harmonics. Since five scalar fields are discretized, the total number of degrees of freedom of a simulation is $O(5 N_r \ell_{\max} m_{\max})$. Namely, the triplet defining the resolution is (144, 79, 63), which results in about $3.6 \times 10^6$ variables; recall that G12 has three variables, and P09 and DW have one variable each.
As discussed above, time in the 3-D model is non-dimensional. To scale to geophysical time, one first computes the non-dimensional secular-variation timescale of the non-dipole field up to spherical harmonic degree 13, based on the average power spectra of the magnetic field and its secular variation (see Lhuillier et al. 2011b). The rescaling of the time axis is then performed under the assumption that the dynamo simulations and the Earth share the same secular-variation timescale, equal to 415 yr. With this scaling, the simulation time of the 3-D model is 147 Myr; the time step is 43.09 yr. The number of reversals that occur during this time frame is 109.
In contrast to the other models described above, its 3-D nature makes this dynamo model amenable to quantitative comparisons against more observed properties of the Earth’s magnetic field than just the axial dipole. From a morphological standpoint, the 3-D model produces a magnetic field whose large-scale properties at the core–mantle boundary are in ‘good’ agreement with well-established observations, according to the four criteria introduced by Christensen et al. (2010):
(i) the axial dipole to non-axial dipole energetic ratio;
(ii) the equatorially symmetric to antisymmetric non-dipole energetic ratio;
(iii) the zonal to non-zonal energetic ratio;
(iv) the flux concentration factor.
The terrestrial reference values for these four quantities are, respectively, (1.4, 1.0, 0.15, 1.5) (Christensen et al. 2010). For the 3-D model, we compute average values of these quantities of, respectively, (0.72, 1.36, 0.18, 2.45). This leads to an average misfit $\chi^2$ of 1.94, while the median value of $\chi^2$ over the course of the numerical integration is 2.89. We refer to Christensen et al. (2010) for further details.
From a palaeomagnetic perspective, it is worth noting that Sprain et al. (2019) recently introduced a method to assess the degree of spatial and temporal agreement of a simulated dynamo field with the long-term (∼10 Myr) palaeomagnetic field. The agreement is defined on the basis of five properties of the palaeomagnetic field, namely the inclination anomaly, the virtual geomagnetic pole dispersion at the equator, the latitudinal variation in virtual geomagnetic pole dispersion, the normalized width of virtual dipole moment (VDM) distribution, and the dipole field reversals (in terms of the relative time spent by the dipole at transitional latitudes lower than 45°). This quantity, termed $Q_{PM}$, is the sum of five misfits, one for each criterion. For the 3-D model, we find the following values of the misfit for each criterion:
$Q_{PM}$ (inclination anomaly) = 0.84,
$Q_{PM}$ (equatorial dispersion) = 0.39,
$Q_{PM}$ (latitudinal dispersion) = 0.95,
$Q_{PM}$ (VDM distribution) = 0.81,
$Q_{PM}$ (reversals) = 1.45,
for a total $Q_{PM}$ = 4.43. This value is good, according to Sprain et al. (2019), who argue that individual misfits lower than unity indicate an adequate similarity with the palaeomagnetic field. For the study of interest here, we note that the width of the VDM distribution is adequately captured by the 3-D model. On the other hand, the simulated dipole spends a fraction of time at transitional latitudes (1.2 per cent of the model integration time) smaller than what is inferred for Earth over the last 10 Myr, which is expected to lie somewhere between 3.75 and 15.0 per cent (see Sprain et al. 2019, for details). In summary, based on a series of metrics that have come to the fore, this 3-D model compares favourably against the recent and more ancient geomagnetic field.

Figure 2. Scaled histograms of the dipole intensities of the four models and two palaeomagnetic reconstructions (PADM2M and Sint-2000). (a) G12, (b) P09, (c) PADM2M, (d) DW, (e) 3-D and (f) Sint-2000. The y-axes of all histograms are scaled so that the area under the graph is equal to one and the x-axes are scaled so that one corresponds to the average intensity. Also shown is the skewness which indicates the degree of asymmetry in the distribution (G12: 0.92; P09: −0.94; PADM2M: −0.30; DW: −0.17; 3-D: −0.46; Sint-2000: 0.07). A thicker tail near zero suggests that a model, or reconstruction, lingers in a state of low intensity.

2.3 Similarities and differences across the model hierarchy
Fig. 1 shows the axial dipole as a function of time for each model, scaled so that the average absolute value of the time series is one, and with the sign indicating polarity: a negative sign indicates today’s polarity, a positive sign corresponds to a reversed polarity. The figure shows the evolution of the models’ dipoles on their natural timescales described above. Each model exhibits reversals
and excursions as observed in palaeomagnetic reconstructions, for example PADM2M and Sint-2000 (Valet et al. 2005; Ziegler et al. 2011), but these events occur over different timescales and, for some of the models, the events also occur over timescales that are different from what is observed in Earth’s axial dipole field. The 3-D model, for example, has a reversal rate of about 0.7 reversals per Myr, but the reversal rates of the DW and P09 models are about 5 reversals per Myr, which is comparable to the average reversal rate of Earth over the last 25 Myr (Ogg 2012). In addition, the way a reversal occurs in each model can be different. Reversals of the G12 model are characterized by a continuous decay in intensity, immediately followed by a rapid increase in intensity. The DW model lingers at low intensity longer than the other models (see also Fig. 2). We also observe that the DW and 3-D models, and to a lesser extent the P09 model, exhibit multiple, rapid fluctuations in sign during a reversal, see Fig. 4. This behaviour is not observed in the G12 model.
Differences between the various models can be illustrated further by comparing histograms of their intensities, shown in Fig. 2, which also provides Pearson’s moment coefficient of skewness [third standardized moment, Kenney & Keeping (1966)] for the four models. It is clear that all models, except G12, are characterized by a negative skew. Thus, the P09, DW and 3-D models spend more time at a lower than average intensity than at a higher than average intensity. The G12 model tends to spend more time at a higher than average intensity than at a lower than average intensity. The DW model has the smallest skew (in amplitude) and a thicker left tail than the other three models, which indicates that it spends a considerable amount of time in low-intensity states (but other formulations of double-well models, with different parameters or even different parametrizations of the potential, may behave differently). Also shown in Fig. 2 are histograms of the intensities of two palaeomagnetic reconstructions, PADM2M and Sint-2000 (Valet et al. 2005; Ziegler et al. 2011), which document the time evolution of the virtual axial dipole moment (VADM) over the past 2 Myr at a frequency of 1 per kyr. PADM2M and Sint-2000 thus contain 2000 points and the histograms are not as well resolved as those of the four models (for which we used substantially longer simulations). This is also evident from the difference in the skewness, which is positive for Sint-2000, but negative for PADM2M. These values should be used with caution, because the estimates of skewness are contaminated by large sampling error (based on only 2000 intensities) and by the fact that low intensities (below 10 per cent) are not present in these reconstructions. In view of the large uncertainty in the reconstructions, all four models are reasonable, at least qualitatively in view of Figs 1 and 2, although all four models are constructed from drastically different assumptions and with very different modelling goals in mind.

2.4 Predictions, skill scores and ROC curves
We want to predict whether a low-dipole event will occur within an a priori specified time interval, called the prediction horizon. We thus consider only two outcomes of an experiment. Outcome 1: yes, the event occurred during the prediction horizon; outcome 2: no, the event did not occur during the prediction horizon. As is common, we denote the outcomes of an experiment by ‘positives’ and ‘negatives’:
Positive (P): the event occurred.
Negative (N): the event did not occur.
With two possible outcomes of an experiment, a prediction can result in one of four possibilities:
True positive (TP): predict that an event will occur and the event
occurs. Good strategy
False positive (FP): predict that an event will occur, but the event 1 A y
strateg ategy
does not occur.
B Worse Bad str
True negative (TN): predict that an event will not occur and the
True positive rate (TPR)

event does not occur.
False negative (FN): predict that an event will not occur, but the
event occurs.
The concepts and ideas described here have been used in many
e
lin
areas. We make an effort to be consistent in the terminology and
ce
an
to bring up only the definitions we need, sticking to commonly
Ch
used names (Fawcett 2006; Joliffe 2016; Chicco & Jurman 2020).
For a more thorough review of the such predictions in the context
of (medical) imaging, see Barrett & Myers (2003), Chapter 13,
where, a prediction strategy of the type discussed here is called a

binary decision. In the language of machine learning, the problem
of predicting ‘an event will occur within the horizon’ or ‘no event
will occur within the horizon’ is called a classification problem, 0 False positive rate (FPR) 1
similar to distinguishing dogs from cats (Goodfellow et al. 2016).
It is clear that a good prediction strategy should be characterized Figure 3. Three examples of ROC curves. Two threshold levels, labeled A
and B, are identified on one of the curves (see text for the definition of the
by a large number of true positives and true negatives, but a small
chance line).
number of false positives or false negatives. A skill score is a quan-
titative means for describing the quality of a prediction strategy.
There is a large number of skill scores and these usually require that one applies the prediction strategy, say n times, followed by counting the number P of positives that occurred, the number N of negatives that occurred, and the true/false positives and true/false negatives. Which of the many skill scores is most appropriate depends on the problem one wishes to solve. For example, one may define the accuracy by

$$\mathrm{ACC} = \frac{\mathrm{TP} + \mathrm{TN}}{P + N}. \qquad (5)$$

A good prediction strategy should be characterized by a high accuracy, but a bad prediction strategy may also be characterized by a high accuracy. For example, the event ‘a low-dipole event occurs within the prediction horizon’ is rare compared to the event ‘no low-dipole event occurs within the prediction horizon’ (unless the prediction horizon is large). This means that the prediction strategy ‘predict that no low-dipole event occurs within the prediction horizon’ is characterized by a high accuracy, but this strategy is useless because it can never achieve a true positive (the event ‘a low-dipole event occurs within the horizon’ is never predicted). Other skill scores, for example, the F1 score

$$F_1 = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad (6)$$

the critical success index (CSI)

$$\mathrm{CSI} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad (7)$$

or the Matthews correlation coefficient (MCC)

$$\mathrm{MCC} = \frac{\mathrm{TP} \cdot \mathrm{TN} - \mathrm{FP} \cdot \mathrm{FN}}{\sqrt{(\mathrm{TP} + \mathrm{FP})(\mathrm{TP} + \mathrm{FN})(\mathrm{TN} + \mathrm{FP})(\mathrm{TN} + \mathrm{FN})}}, \qquad (8)$$

are designed to alleviate these issues and are applicable in problems where the occurrence of the event is rare.
One can also compute the true-positive-rate (TPR) and false-positive-rate (FPR), defined by:

$$\mathrm{TPR} = \frac{\mathrm{TP}}{P}, \qquad \mathrm{FPR} = \frac{\mathrm{FP}}{N}. \qquad (9)$$

TPR and FPR have the desirable property that they are independent of the frequency of the event and a good prediction strategy should have a high TPR and low FPR, independently of how often an event occurs. Ideally, TPR = 1, so that all events are predicted correctly, and FPR = 0, so that no false positives occur.
The bulk of this paper is concerned with predictions based on whether the dipole is below a specified threshold. Naturally, one may investigate how the skill of the predictions depends on the threshold. For example, one can compute skill scores as functions of a varying threshold and determine an optimal threshold as the one that maximizes the skill score. One can also compute the TPR and FPR as functions of the threshold. The line that a varying threshold traces out in TPR-FPR space is called the receiver operating characteristic (ROC) curve. Three examples of ROC curves are illustrated in Fig. 3. The figure also shows the chance line, defined by a straight line at a 45° angle, on which the (FPR, TPR) points should lie when randomly guessing occurrence of events, and varying the probability with which one makes this guess.
The ROC curve of a good prediction strategy should be above the chance line and should quickly transition from the origin towards (0,1), thus being characterized by a high TPR and a small FPR. One can use ROC curves as qualitative tools to assess different prediction strategies. We have included labels for the three ROC curves in Fig. 3 that identify which strategies are good, worse or bad.
We note that, despite an impressively large body of work across many disciplines, it remains difficult to unambiguously argue that a prediction strategy is good or bad, or if one prediction strategy is better than another. As a simple example, consider the green ROC curve in Fig. 3 with threshold levels labeled by A and B. It is not easy to say which threshold level one should choose. Threshold A leads to the smallest FPR among the thresholds that achieve TPR = 1, but threshold B achieves a smaller FPR than threshold A, at the cost of a slightly smaller TPR. The difficulties arise because many issues, such as how dangerous false positives are compared to false negatives, are problem dependent and remain subjective.
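As a minimal sketch of eqs (5)-(9), the scores can be computed directly from the four confusion counts. The function name `skill_scores`, the convention of returning 0 for an undefined ratio, and the example counts are assumptions made for illustration only; they are not taken from the paper.

```python
import math

def skill_scores(tp, fp, tn, fn):
    """Skill scores of eqs (5)-(9) from confusion counts."""
    p = tp + fn   # number of positives that occurred
    n = tn + fp   # number of negatives that occurred

    def ratio(num, den):
        # Assumed convention: an undefined ratio (zero denominator) is reported as 0.
        return num / den if den else 0.0

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": ratio(tp + tn, p + n),
        "F1": ratio(2 * tp, 2 * tp + fp + fn),
        "CSI": ratio(tp, tp + fp + fn),
        "MCC": ratio(tp * tn - fp * fn, mcc_den),
        "TPR": ratio(tp, p),
        "FPR": ratio(fp, n),
    }

# Hypothetical counts: 1000 predictions, of which only 20 are positives.
never_warn = skill_scores(tp=0, fp=0, tn=980, fn=20)   # 'no event' is always predicted
useful = skill_scores(tp=15, fp=30, tn=950, fn=5)      # most events are caught

for name, scores in (("never warn", never_warn), ("useful", useful)):
    print(name, {key: round(value, 3) for key, value in scores.items()})
```

With these hypothetical counts, the strategy that never issues a warning has the higher accuracy but F1, CSI, MCC and TPR all equal to zero, which is the imbalance problem described above.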
[Figure 4 graphic: signed dipole of the 3-D model versus time (Myr), from 32 to 33 Myr, with four events labeled A, B, C and D.]
Figure 4. Excerpt of the 3-D simulation showing the signed dipole as a function of time (same scaling as in Fig. 1). Four events are labeled A − D.

3 FINDING THRESHOLDS FOR THE PREDICTION OF LOW-DIPOLE EVENTS
In simple terms, our prediction strategy is

If the dipole intensity drops below a threshold, then the field will continue to decay and a low-dipole event will occur within the prediction horizon.

The situation, however, is more delicate because the models exhibit complex behaviour while undergoing reversals or major excursions. The subtleties can be illustrated by considering the excerpt of the 3-D simulation shown in Fig. 4, where we highlight reversals and major excursions during which the axial dipole field temporarily changed sign. One wishes to define a reversal event as the transition of a strong dipole field in one polarity, to a strong dipole field in the opposite polarity, rather than just a short-term temporary change of polarity. In fact, the field may quickly change polarity several times while the field is weak when undergoing a reversal. This occurs in the 3-D simulation during the events labeled B and D in Fig. 4. Similarly, one wishes to interpret event A (or C) as a single major excursion, rather than a sequence of reversals. We thus revise the simple prediction strategy above to ensure that each reversal or major excursion, labeled by A to D in Fig. 4, is considered as a single low-dipole event. A careful definition of low-dipole events is provided below.

3.1 Precise formulation of threshold-based predictions
We introduce definitions that allow us to clearly specify the events and predictions whose skill we want to study. We start with the definition of the event we want to predict.

Definition: Low-dipole event. A low-dipole event starts when the intensity drops below a specified value, called the start-of-event threshold (ST), or if the dipole changes its sign,2 and ends when the intensity exceeds a second specified value, called the end-of-event threshold (ET). The event duration is the time interval from start to end of the event.

2 The addition of the ‘or-statement’ is relevant only in the context of palaeomagnetic reconstructions, see Section 5.

This definition ensures that a low-dipole event describes reversals and major excursions, because the field may drop below the ST, but can build back up above the ET without changing polarity. Events A and C in Fig. 4 are examples of this situation. We also emphasize that the event duration is defined implicitly by the start and end of an event and may vary considerably across several low-dipole events. In Fig. 4, for example, the low-dipole event A has a much larger event duration than event C, but both events are major excursions.
To define a strategy for predicting low-dipole events, we introduce the prediction horizon (PH), which is the time window during which we predict that a low-dipole event will start to occur. Note that we do not make any prediction as to when precisely the low-dipole event starts; we merely predict that a low-dipole event will start (or not) at some point during the PH. We further make no predictions as to when the low-dipole event will end. With the above definitions, the prediction strategy can be stated precisely.

Definition: Threshold-based predictions for low-dipole events. We predict that a low-dipole event will start to occur within the prediction horizon if the intensity drops below a warning threshold (WT); we predict that no low-dipole event will start during the prediction horizon if the intensity is above the WT; we stop making predictions from the time the low-dipole event started (intensity below ST) until the event ends (intensity above ET).

We make no predictions while the event is observed, because a prediction made while an event is happening is of limited use. Our prediction strategy is illustrated in Fig. 5. In this illustration, ST and ET are chosen such that Event A is a single event; fast oscillations in polarity occur while the field is weak. The prediction strategy leads to TNs followed by FNs and TPs for Events A and B. The false negatives occur because, given the prediction horizon and average intensity, the WT is small, so that the events tend to be predicted a little too late. The figure also illustrates FPs, which occur when the field drops below the WT, but no low-dipole event occurs because the field does not continue to drop below the ST. True negatives occur often, because reversals and low-dipole events are rare. In the figure, TNs occur whenever the ‘Truth’ (bottom) and the prediction (centre) are both at zero.
Finally, note that with our definitions, one may require that

$$\mathrm{ST} < \mathrm{WT} < \mathrm{ET}, \qquad (10)$$

because other choices for WT lead to rather strange prediction strategies. If ET ≤ WT, for example, then a low-dipole event would be predicted immediately after an event just ended.

3.2 Scaling thresholds and rescaling time
For each model, we define all three thresholds (warning, start-of-event, and end-of-event thresholds) as a fraction of the average intensity of the model. Such a scaling of the thresholds makes comparisons across the hierarchy of models easier to understand. For the rest of this paper we fix the start-of-event and end-of-event
[Figure 5 graphic. Top panel: dipole versus time, with the end-of-event threshold (ET), warning threshold (WT) and start-of-event threshold (ST), Events A and B, their event durations and the prediction horizon (PH). Centre panel: prediction versus time (TN, FN, TP, FP and periods of no prediction). Bottom panel: truth versus time (P, N).]
Figure 5. Illustration of the prediction strategy. Top graph: dipole (solid blue) as a function of time. The thin blue, green and red horizontal lines represent
the start-of-event, the end-of-event and the warning thresholds. Two low-dipole events are labeled A (reversal) and B (excursion), and we indicate their event
durations. Highlighted in red is a period of low intensity, which is not a low-dipole event, but where the low intensity causes false positives (FP). Towards
the right, we illustrate a prediction over a given prediction horizon, which will lead to true negatives (TN). The prediction horizon also defines the true labels
(see bottom panel). Centre graph: prediction as a function of time. The red line at zero corresponds to the prediction ‘no low-dipole event occurs during
the prediction horizon’, and the red line at one corresponds to the prediction ‘a low-dipole event occurs during the prediction horizon’. The thick black line
segments correspond to periods during which no prediction is made. For events A and B, we first observe TNs, followed by false negatives (FN), caused by the
warning threshold being small; then we observe TPs followed by a period during which no prediction is made. Bottom graph: true occurrences of low-dipole
events within the prediction horizon. The orange line at zero corresponds to negatives (N), that is ‘no low-dipole event occurs during the prediction horizon’.
The orange line at one corresponds to positives (P), that is ‘a low-dipole event occurs during the prediction horizon’.
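The strategy illustrated in Fig. 5 can be sketched in a few lines of code. The sketch below is illustrative only and rests on several assumptions that are not from the paper: the function name `threshold_predictions`, a uniformly sampled unsigned intensity already scaled by its average, a prediction horizon given as an integer number of time steps, and the omission of the sign-change criterion for the start of an event. The final `ph` time steps are skipped because their true labels are not yet known.

```python
import numpy as np

def threshold_predictions(intensity, wt, st, et, ph):
    """Confusion counts for threshold-based predictions of low-dipole events.

    intensity  : 1-D array, unsigned dipole intensity scaled by its average
    wt, st, et : warning, start-of-event and end-of-event thresholds
    ph         : prediction horizon in time steps (assumed integer >= 1)
    """
    intensity = np.asarray(intensity, dtype=float)
    n = intensity.size
    in_event = np.zeros(n, dtype=bool)   # True while a low-dipole event is under way
    starts = np.zeros(n, dtype=bool)     # True at the first time step of each event
    active = False
    for t in range(n):
        if not active and intensity[t] < st:
            active = True                # event starts: intensity below ST
            starts[t] = True
        elif active and intensity[t] > et:
            active = False               # event ends: intensity above ET
        in_event[t] = active

    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for t in range(n - ph):              # the tail is skipped (truth unknown)
        if in_event[t]:
            continue                     # no prediction while an event is observed
        truth = bool(starts[t + 1:t + ph + 1].any())   # does an event start within PH?
        warn = bool(intensity[t] < wt)                  # is the intensity below WT?
        key = ("T" if warn == truth else "F") + ("P" if warn else "N")
        counts[key] += 1
    return counts, starts

# Hypothetical series: one event (below ST = 0.1) and one later dip below WT only.
series = [1.0, 0.9, 0.6, 0.4, 0.05, 0.02, 0.3, 0.9, 1.0, 0.45, 0.7, 1.0, 1.0, 1.0]
counts, starts = threshold_predictions(series, wt=0.5, st=0.1, et=0.8, ph=2)
print(counts)
```

In this toy series the warning comes one step late, which produces a false negative followed by a true positive, and the later dip below the WT without an event produces a false positive, mirroring the sequence sketched in Fig. 5.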
thresholds as: ST = 10 per cent and ET = 80 per cent. With these choices, we focus on events that start when the intensity is very low and which end when the field has nearly fully recovered (see Fig. 4). The choice of ST = 10 per cent is guided by the consideration that we want to focus on events that correspond to reversals and major excursions. During a reversal, the signed dipole can reach an arbitrarily low value, before switching sign. During a major excursion, the dipole amplitude is very low, but we do not necessarily observe a switch in the sign. Moreover, palaeomagnetic reconstructions, such as PADM2M and Sint-2000 (Valet et al. 2005; Ziegler et al. 2011), have difficulties with resolving small dipole values. The palaeomagnetic reconstructions we consider below consist of signed Virtual Axial Dipole Moments (VADM), which are proxies for the true axial dipole magnitude. The weakest VADMs recorded are about 10–20 per cent of the present axial dipole field intensity (see, e.g. Constable & Korte 2006; Hulot et al. 2010a). This is caused by (i) VADM reconstructions sensing the non-dipole field during a low-dipole event; (ii) VADM reconstructions being temporally filtered by sediment recording processes and (iii) additional smoothing being introduced by modelling choices and stacking of the relative palaeointensity (RPI) records (some of the individual records may have a higher resolution and features that are not aligned in time are smoothed out). As we will see, by choosing ST = 10 per cent, we ensure that only events that experienced at least one temporary change of sign in the axial dipole are considered as events of interest within PADM2M and Sint-2000 (see Section 5).
Nonetheless, the precise values of ST and ET are not critical because our overall approach is robust with respect to these choices. This is evident from a limited number of numerical experiments we performed with different choices of ET and ST. Specifically, we tried the combinations ST = 10 per cent and ET = 50 per cent, ST = 20 per cent and ET = 50 per cent, and ST = 20 per cent and ET = 80 per cent and obtained qualitatively and quantitatively similar results.
We rescale time in each model so that the prediction results are comparable across the hierarchy of models. A natural choice for this timescale is the average event duration (AED). That is, we compute the average event duration given the natural timescale of each model, and then rescale time so that one time unit corresponds to one average event duration. The average event duration for each of the models is listed in Table 1. For the simplified models (G12, P09 and DW), the statistics of the event duration are computed from simulations that include about 550 events. For the 3-D model, we use the entire duration of the simulation to compute the statistics of the event duration.
The prediction horizon is defined as a fraction of the average event duration. We focus on the prediction horizon PH = 1 × AED, that is we focus on short-term predictions of low-dipole events,
Table 1. This table summarizes key results obtained throughout the paper. We list all information in this one table to make it easier to make connections between
the various quantities listed. Description of each column. First column: the model or palaeomagnetic reconstruction considered. Second column: number of
low-dipole events in the verification portion of a simulation/palaeomagnetic reconstruction. Values in brackets are the number of events in the training data.
Third column: maximum MCC (prediction skill) achieved for optimal WT (see Section 2.4 for the definition of MCC). Values in brackets are for training data.
Verification and training data are explained in Section 4.2 for the models and in Section 5.2 for the palaeomagnetic reconstructions. Fourth column: optimal
WT that maximizes MCC over the training data (see Section 3.3). Fifth column: average duration of a low-dipole event (AED). Values in brackets are standard
deviations. Sixth column: average decay time (ADT) with standard deviations in brackets. See sections 4.1.2 (models) and 5.1 (palaeomagnetic reconstructions)
for definitions of average event duration and decay time and their computation. Seventh column: ratio of average decay time to average event duration. All
results listed here correspond to a prediction horizon PH = 1, a start-of-event threshold ST = 10 per cent, and an end-of-event threshold ET = 80 per cent.
Model       # of events   MCC           Optimal WT   Event duration (AED)   Decay time (ADT)       ρ = ADT/AED
G12         554 (5)       0.96 (0.97)   30.75%       3.2 kyr (0.1 kyr)      26.9 kyr (2.0 kyr)     8.49
P09         551 (5)       0.57 (0.56)   54.50%       6.0 kyr (3.9 kyr)      10.8 kyr (5.0 kyr)     1.81
DW          551 (5)       0.31 (0.30)   69.25%       16.1 kyr (12.2 kyr)    8.5 kyr (4.7 kyr)      0.53
3-D         368 (5)       0.12 (0.14)   17.50%       16.4 kyr (11.6 kyr)    5.7 kyr (4.2 kyr)      0.35
PADM2M      2 (4)         0.62 (0.73)   50.75%       11.7 kyr (8.1 kyr)     25.5 kyr (10.9 kyr)    2.19
Sint-2000   2 (4)         0.44 (0.77)   36.75%       10.2 kyr (8.7 kyr)     32.0 kyr (15.1 kyr)    3.1
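The last column of Table 1 is the ratio of the two preceding columns, and it can be recomputed as a quick consistency check. The snippet below is only a sketch: the dictionary `table1` copies the rounded AED and ADT entries of the table, so the recomputed ratios may differ slightly from the printed ones (for G12, 26.9/3.2 ≈ 8.41 against the tabulated 8.49), presumably because the table was built from unrounded values.

```python
# (AED, ADT) in kyr, copied from the rounded entries of Table 1.
table1 = {
    "G12": (3.2, 26.9),
    "P09": (6.0, 10.8),
    "DW": (16.1, 8.5),
    "3-D": (16.4, 5.7),
    "PADM2M": (11.7, 25.5),
    "Sint-2000": (10.2, 32.0),
}

# rho = ADT / AED for every row of the table.
rho = {name: adt / aed for name, (aed, adt) in table1.items()}

# Rank the four numerical models from slow decay (large rho) to fast decay (small rho).
models = ["G12", "P09", "DW", "3-D"]
ranking = sorted(models, key=lambda name: rho[name], reverse=True)

for name, value in rho.items():
    print(f"{name:10s} rho = {value:.2f}")
print("ranking by rho:", ranking)
```

The ordering obtained from the ratio (G12, P09, DW, 3-D) is the same as the ordering of the MCC values listed in the table.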

attempting to predict whether a low-dipole event will start with a lead time comparable to the event’s duration. We also consider prediction horizons of 0.5 × AED or 1.5 × AED, to show how the prediction skill degrades with longer prediction horizons, but also to demonstrate the robustness of our approach (which is not sensitive to minor variations of the various parameters).

3.3 Finding thresholds via maximization of skill scores
For a fixed prediction horizon, we compute an optimal WT as follows. For a given dipole time-series, we compute a skill score for varying WTs, using a regular grid with spacing of 0.25 per cent. The WT that leads to the largest skill score is selected as the optimal WT: $\widehat{\mathrm{WT}} = \arg\max_{\mathrm{WT}} \mathrm{Skill}(\mathrm{WT})$.
This approach can be implemented with a variety of skill scores, for example MCC, CSI or F1. For the short prediction horizons we consider, we did not notice any significant differences in the optimal WTs one finds regardless of which skill score is used, with the exception of the ACC score, which is not robust with respect to imbalances in the data (one event occurring more often than the other). Below we present results obtained by using MCC, because it recently has been reported to be more appropriate than the F1 score for binary classification (Chicco & Jurman 2020), but other skill scores (not ACC) may be used to obtain similar results.
To prevent overfitting, it is necessary to validate a prediction strategy by applying it to an independent data set. An optimal WT is determined by using a given dipole time-series, which we call the training data set. The optimal WT is then applied to an independent time-series, which we call the verification data set, and the skill score is computed for the verification data. For the simplified models (G12, P09, DW), the verification data are independent simulations (using different initial conditions in the case of the deterministic G12 model and different initial conditions and random forcing in the case of the stochastic P09 and DW models). For the 3-D model, we compute the optimal WT by using only a portion of the simulation as training data, and then use the remainder of the simulation as verification data.

4 APPLICATION TO A HIERARCHY OF MODELS
We apply threshold-based predictions to the models in the hierarchy. For each model, we predict a low-dipole event about as far ahead of time as one expects the event will last. In our terminology, this means that the prediction horizon is equal to one average event duration (PH = 1 × AED), but we also consider slightly longer (1.5×) and slightly shorter (0.5×) PHs. We also test if useful threshold-based predictions can be made if the training period is short and, therefore, contains only a small number of low-dipole events.

4.1 Skill of threshold-based predictions

4.1.1 Qualitative comparison and illustration
We first qualitatively assess threshold-based predictions by inspection of ROC curves. The ROC curves shown in Fig. 6 are computed using the entire model run in the case of the 3-D model, and long simulations with around 550 events for the G12, P09 and DW models, see Section 3.2. We note that, for all four models, the ROC curves get closer to the chance line (higher false positive rate, lower true positive rate) as the prediction horizon increases. This implies that the predictions get worse, by any measure, as the prediction horizon increases. This means, perhaps not so surprisingly, that predictions via a threshold-based strategy are more difficult to do when the prediction horizon is large. More interestingly, we note that for any fixed PH, the ROC curves of the G12 or P09 models are further from the chance line than the ROC curves of the DW and 3-D models. This suggests a ‘ranking’ of the models in terms of how skillful threshold-based predictions are.3 We investigate this ranking quantitatively via MCC skill scores below.

3 But a higher ranking in predictive skill does not imply that the model is ‘better’, that is, more similar to Earth’s axial dipole, see Section 5.

For each model, we illustrate threshold-based predictions for which an optimal WT is found by maximizing the MCC skill score as a function of the WT. The data sets used for finding the optimal WTs are the entire model run in the case of the 3-D model, and long simulations with around 550 events for the G12, P09 and DW models, see Section 3.2 (no distinction between training and verification data). This results in optimal WTs of $\widehat{\mathrm{WT}}_{\mathrm{G12}}$ = 31.25 per cent, $\widehat{\mathrm{WT}}_{\mathrm{P09}}$ = 43.00 per cent, $\widehat{\mathrm{WT}}_{\mathrm{DW}}$ = 60.25 per cent and $\widehat{\mathrm{WT}}_{\mathrm{3D}}$ = 45.50 per cent for, respectively, the G12, P09, DW and 3-D models. Results for a prediction horizon PH = 1 are shown in Fig. 7; results for PH = 0.5 or PH = 1.5 are similar. We plot excerpts of the dipole time-series of the four models, along with two graphs that illustrate the predictions and their validity. Each model is represented by one subfigure which contains three panels. The top panel shows an excerpt of the dipole time-series. We show
[Figure 6 graphic: four panels, (a) G12 with a zoom near the (0,1) point, (b) P09, (c) DW and (d) 3D, each showing TPR versus FPR on axes from 0 to 1 for PH = 0.5, PH = 1.0 and PH = 1.5.]
Figure 6. ROC curves for the four models and three prediction horizons, with PH = 0.5 in green, PH = 1 in purple and PH = 1.5 in orange. (a) G12, (b) P09,
(c) DW (d) 3-D. A ROC curve is the collection of TPR/FPR pairs one obtains when varying the warning threshold. The thicker line corresponds to TPR/FPR
pairs for which ST < WT < ET. The thin lines continue the ROC curves for WT ≥ ET. The figure-in-figure in (a) (ROC curves for G12) shows a zoom near
the (0,1) point to illustrate that the three ROC curves, corresponding to different PHs, do not overlap. The ROC curves are computed using long simulations,
containing a large number of low-dipole events (see text for details).
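As the caption notes, a ROC curve collects the TPR/FPR pairs obtained by varying the warning threshold. A minimal sketch of that sweep is given below. It assumes that the true labels (whether a low-dipole event starts within the prediction horizon) are already available as a boolean array containing both classes, and it uses a small hypothetical data set; the function name `roc_points` and the area estimate are illustrative choices, not part of the paper.

```python
import numpy as np

def roc_points(intensity, truth, thresholds):
    """TPR/FPR pairs obtained by sweeping the warning threshold (WT)."""
    intensity = np.asarray(intensity, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    p = truth.sum()        # number of positives that occurred
    n = (~truth).sum()     # number of negatives that occurred
    fpr, tpr = [], []
    for wt in thresholds:
        warn = intensity < wt                      # warning issued below the WT
        tpr.append((warn & truth).sum() / p)
        fpr.append((warn & ~truth).sum() / n)
    return np.array(fpr), np.array(tpr)

# Hypothetical scaled intensities and labels (True: an event starts within the PH).
intensity = [0.2, 0.3, 0.4, 0.6, 0.7, 0.9, 1.0, 1.1]
truth = [True, True, False, True, False, False, False, False]

thresholds = np.linspace(0.0, 1.2, 49)             # grid of warning thresholds
fpr, tpr = roc_points(intensity, truth, thresholds)

# Area under the curve by the trapezoidal rule; the chance line has area 0.5.
area = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))
print(f"area under ROC curve: {area:.3f}")
```

The curve starts at (0,0) for a WT of zero, where no warning is ever issued, and ends at (1,1) once the WT exceeds every intensity value.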
a time interval of 50 non-dimensional time units for each model and each model exhibits two events during this time interval (recall that time is scaled by the average event duration, AED, see Table 1). The orange lines in the bottom of each subfigure are zero if no low-dipole event starts within the prediction horizon, and they are one if a low-dipole event starts during the prediction horizon. Because we show the same time-interval in non-dimensional units, the intervals during which the orange line is at one are of equal width across all four subfigures. The red lines in the centre panels are zero if no low-dipole event is predicted to start during the prediction horizon; the lines are one if a low-dipole event is predicted to start during the prediction horizon. Thus, the overlap of the red and orange lines defines TPs, FPs, TNs and FNs, and a large overlap corresponds to a skillful prediction. For example, a FP corresponds to a situation where the orange line is at zero while the red line is at one; a FN corresponds to a situation where the orange line is at one while the red line is at zero.
We note that predictions for the G12 model lead to a small number of false positives or false negatives. In the excerpt shown for the G12 model in Fig. 7, there is only one false positive, caused by the prediction starting one time step too early (during the first of the two events shown). For the DW and 3-D models, on the other hand, we note a large number of false positives and false negatives, which renders threshold-based predictions unreliable for these models. Comparisons of the graphs for G12 and the DW and 3-D models suggest that threshold-based predictions for the DW or 3-D model are indeed worse, by any measure, than those for the G12 model.
In the case of P09, we note a larger number of false positives and false negatives than in the case of G12, but false positives or false negatives occur less frequently than for the DW or 3-D models. Consistent with what was suggested by Fig. 6, the skill of threshold-based predictions for the P09 model thus seems to fall in between the skills of threshold-based predictions for the G12 (very high skill) and DW/3-D models (very low skills).

4.1.2 Quantitative comparison and ranking
We compute MCC skill scores to quantitatively compare the skill of threshold-based predictions for the various models. To avoid overfitting we now compute skill scores on verification data, that is data that are not used for computing the optimal WT, as described in Section 3.3.
We generate training and verification data as follows. For the G12, P09 and DW models, the training data are the long simulations that were also used in Section 3.2. The verification data are ten independent simulations, each of length 10^4. For the 3-D model, we create training and verification data by ‘chopping up’ the overall simulation as follows. We split the simulation into two parts of equal length and use one for training and the other for verification. We then repeat the procedure, but split the simulation into three equally long portions, using one for training and two for verification. Finally, we split the simulation into four equally long portions, and use one for training and three for verification. This

Figure 7. Illustration of threshold-based predictions for the four models. (a) G12, (b) P09, (c) DW and (d) 3-D. The plots show predictions over a time period
of 50 dimensionless time units and for a prediction horizon of PH = 1. The corresponding (optimal) warning thresholds (expressed in per cent of the average
intensity) are $\widehat{\mathrm{WT}}_{\mathrm{G12}}$ = 31.25 per cent, $\widehat{\mathrm{WT}}_{\mathrm{P09}}$ = 43.00 per cent, $\widehat{\mathrm{WT}}_{\mathrm{DW}}$ = 60.25 per cent and $\widehat{\mathrm{WT}}_{\mathrm{3D}}$ = 45.50 per cent for, respectively, the G12, P09, DW and
3-D models. These optimal WTs are computed using time-series of the four models that contain a large number of events (see text for details). For each model,
a dimensionless time is defined via a scaling of time with the average event duration (see text for details). Each sub-figure contains three panels. Top panel.
Blue: dipole time-series. Blue/green/red horizontal lines: start-of-event/end-of-event/warning thresholds. Centre panel. Graphs are zero if the threshold-based
prediction is ‘no low-dipole event will start during the prediction horizon;’ graphs are one if the threshold-based prediction is ‘a low-dipole event will start
during the prediction horizon’. Bottom panel. The graphs are zero if no low-dipole event starts within the prediction horizon; the graphs are one if a low-dipole
event starts during the prediction horizon. Black lines in the centre and bottom graphs denote times when no predictions are being made.
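The optimal warning thresholds quoted in the caption come from the grid search of Section 3.3: a skill score is evaluated on a regular grid of WTs with spacing 0.25 per cent and the maximizer is kept. The sketch below illustrates that search with MCC and then scores the selected WT on independent verification data. It is a simplified illustration under stated assumptions: the true labels are taken as given, the synthetic data generator `synthetic` is hypothetical and has nothing to do with the dynamo models, and the function names are not from the paper.

```python
import numpy as np

def mcc(warn, truth):
    """Matthews correlation coefficient of eq. (8); 0 if the denominator vanishes."""
    tp = float(np.sum(warn & truth))
    tn = float(np.sum(~warn & ~truth))
    fp = float(np.sum(warn & ~truth))
    fn = float(np.sum(~warn & truth))
    den = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / den if den > 0 else 0.0

def optimal_wt(intensity, truth, st=0.10, et=0.80, step=0.0025):
    """Grid search for the WT that maximizes MCC, restricted to ST < WT < ET."""
    grid = np.arange(st + step, et, step)
    grid = grid[grid < et]                         # guard against round-off at the end
    scores = np.array([mcc(intensity < wt, truth) for wt in grid])
    best = int(np.argmax(scores))
    return float(grid[best]), float(scores[best])

rng = np.random.default_rng(1)

def synthetic(n):
    # Hypothetical data: rare positives that tend to have a low scaled intensity.
    truth = rng.random(n) < 0.1
    intensity = np.where(truth, rng.uniform(0.1, 0.5, n), rng.uniform(0.3, 1.5, n))
    return intensity, truth

train_x, train_y = synthetic(2000)                 # training data set
verif_x, verif_y = synthetic(2000)                 # independent verification data set

wt_hat, train_mcc = optimal_wt(train_x, train_y)
verif_mcc = mcc(verif_x < wt_hat, verif_y)         # skill of the trained WT on new data
print(f"optimal WT = {100 * wt_hat:.2f} per cent")
print(f"MCC (training) = {train_mcc:.2f}, MCC (verification) = {verif_mcc:.2f}")
```

Keeping the training and verification sets separate, as done here, is what allows the verification MCC to be read as a guard against overfitting the WT.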
procedure leads to six MCC scores over verification data. Generating multiple verification data sets in this way allows us to estimate the variability in the skill of threshold-based predictions for all four models.
Results are shown in Fig. 8, where we plot MCC scores for the four models for threshold-based predictions with prediction horizon PH = 1. We only show the results for a prediction horizon PH = 1, but one obtains qualitatively the same results with PH = 0.5 or PH = 1.5. We note that the variation in skill over the different verification data sets is small. This suggests that the verification data sets are ‘large enough’ so that variation in the verification data does not affect the scores. Moreover, our results confirm the ranking of the skill of threshold-based predictions that we anticipated from inspection of ROC curves. Specifically, we rank the models (skill from high to low) in terms of their predictability via intensity thresholding as: G12, P09, DW and 3-D. Indeed, we found that this result is independent of the choice of skill score: one obtains qualitatively and, to a large extent, quantitatively the same results using, for example, the F1 or CSI skill scores.
This ranking and, more generally, the skill of threshold-based predictions appear to be determined by an interplay of:

(i) The extent of variation in the dipole intensity: a high potential for false positives results if the intensity dips to low values regularly, but no low-dipole event follows.
(ii) The decay rate prior to a low-dipole event: a quick decay results in a high potential for false negatives.

For (i), we recall the intensity histograms of Fig. 2, which show that the P09, DW and 3-D models (low skill) spend more time at low intensity values than the G12 model (high skill). For (ii), we compute the average decay time (ADT), which measures how quickly the dipole intensity decays prior to a low-dipole event. We define the decay time as the absolute value of the time difference between the start of the event and the last previous instance at which
the field exceeded the end-of-event threshold (ET, 80 per cent of its average value). We list the ADT for the four models in Table 1, with standard deviations. These ADTs should be compared to the average event durations (AED), also listed in Table 1. Recall that the event duration is defined by the time interval that starts when the dipole intensity drops below a given start-of-event threshold (10 per cent of the average value) and ends when the dipole intensity exceeds a given end-of-event threshold (80 per cent of the average value). Thus, the average event duration describes how quickly on average the dipole recovers to a large value after it dropped to a low value. If the average decay time is larger than the average event duration (ADT > AED), then low-dipole events occur slowly; if the average decay time is smaller than the average event duration (ADT < AED), then low-dipole events occur quickly. The two behaviours are illustrated by the G12 and 3-D models in Figs 9(b) and (c).

[Figure 8 graphic: MCC (vertical axis, 0 to 1) for the G12, P09, DW and 3D models.]
Figure 8. MCC skill scores (verification) of the four models for prediction horizon PH = 1. For each model, several MCC scores are shown. The MCC scores are computed over multiple sets of verification data. The training data are long simulations containing many low-dipole events (see text for details).

In Fig. 9(a), we plot the ratio of the average decay time to the average event duration for the four models (see Table 1). For brevity, we introduce an abbreviation for this ratio:

$$\rho = \frac{\mathrm{ADT}}{\mathrm{AED}}. \qquad (11)$$

We note that ρ follows a similar trend as the MCC score. In particular, the ranking of the models it leads to is identical to the ranking inferred from the MCC skill score. This suggests that the skill of threshold-based predictions is influenced by how quickly low-dipole events occur with respect to their duration. If they occur slowly (ADT > AED, ρ > 1), then threshold-based predictions have a high skill. If they occur quickly (ADT < AED, ρ < 1), then threshold-based predictions may have a low skill.

4.2 Robustness of skill to a short training period
Motivated by the fact that the observational record is short (the PADM2M and Sint-2000 reconstructions that we investigate in Section 5 extend over 2 Myr and contain only six low-dipole events), we investigate the robustness of the optimal WT and corresponding skill with respect to the duration of the training data. For each model we compute an optimal WT for several training data sets of different durations, and, hence, containing a different number of events. Results for a prediction horizon PH = 1 are shown in Fig. 10(a). For G12, we note that the optimal WT is nearly independent of the duration of the training data set. This means that, for this model, one can find a useful WT from a rather short training period. For all other models, we observe a variation of the optimal WT as we vary the duration of the training period. The variations are most significant for the DW model, for which the optimal WT varies from about 25 per cent to nearly 80 per cent (which is the maximum allowed value). For P09, the optimal WT varies between 35 and 55 per cent, but there seems to be a plateau of nearly constant WT for training data sets that contain 20–35 events. For the 3-D model, we observe a variation of the optimal WT between 20 and 40 per cent. Again, we note a plateau of nearly constant WT for training data with 10–40 events.
Variations in the optimal WT, however, do not necessarily imply variations in the resulting MCC skill score. This is shown in Fig. 10(b). Here, we use the optimal WTs obtained from the same various training periods (and shown in Fig. 10a), but compute the MCC over the verification data. For the simple models (G12, P09 and DW), the verification data are the long simulations (with about 550 events, see Section 3.2). For the 3-D model, the verification data are the portion of the simulation that was not used during training.
We observe that the MCC skill score of threshold-based predictions is nearly independent of the duration of the training data. This is consistent across the hierarchy of models and suggests that the shortness of the observational record may not be the critical limiting factor for determining a useful WT. Our numerical results indeed suggest that a useful WT can be found even if the training period is short and comparable with the observational record.
The reason why the skill is independent of the duration of the training data varies across the hierarchy. This can be understood by considering how MCC depends on WT, which we compute and show in Fig. 11. If the MCC versus WT graph is sharply peaked around an optimal value, and if the peak is nearly independent of the duration of the training period, then a good WT can be found even with a limited amount of training data. This is the case for the G12 model. If the graph of MCC skill score plateaus for large values of WT, then rather different WT values can produce similar skill scores. This explains why drastic variations in the optimal WT cause nearly no variations in the resulting optimal MCC in the case of the DW model in Fig. 10.

4.3 Impact of data filtering
Threshold-based predictions for the 3-D and DW models have a low skill compared to P09 or G12. This could be due to the quick changes in polarity that we observe in the 3-D and DW models, and that occur on short timescales (recall Fig. 1). These are absent from the P09 or G12 models. Palaeomagnetic reconstructions such as PADM2M and Sint-2000, which inherently smooth the field they record through the slowly depositing sedimentary process, also fail to show such a behaviour. One may thus wonder if the 3-D or DW models could become more amenable to threshold-based predictions if the dipole is smoothed in an analogous way.
To test this possibility, we first consider the 3-D model, and rely on the secular variation timescale τ = 415 yr, which we already used to scale time for this simulation. The idea is to test a filtering that mimics the sedimentary process and makes physical sense from the point of view of a 3-D dynamo. For 3-D dynamos, and for Earth’s dynamo, the secular variation timescale defines the main timescale with which the non-dipole field is behaving (Lhuillier et al. 2011b). It provides a natural separation between the timescales of the long-term behaviour of the dipole field, which is the one we are
[Figure 9 graphic. Panel (a): ADT/AED (ρ) for G12, P09, DW, 3D, PADM2M and Sint-2000. Panel (b), G12, and panel (c), 3D: Dipole versus Time, with the decay time and event duration marked.]
Figure 9. (a) Ratio ρ of the average decay time (ADT) to the average event duration (AED) for the four models and two palaeomagnetic reconstructions
(PADM2M and Sint-2000, see Section 5). Also shown is the ρ = 1 line (dashed). (b) Illustration of the decay time and event duration of an event for G12. (c)
Illustration of the decay time and event duration of an event for the 3-D model. The beginning of the decay is marked in green, the start of an event is marked in
orange and the end of an event is marked in red. The decay time is the time interval between the start of the decay and the start of an event. The event duration
is the time interval between the start and end of an event.
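
The definitions in the caption of Fig. 9 can be turned into a short computation. The Python sketch below is an illustration only and is not the authors' released code. It assumes a signed dipole series `x` that has already been scaled by its time-averaged intensity, uses ST = 10 per cent and ET = 80 per cent as in the text, and takes the start of the decay to be the most recent sample at or above ET (a discrete stand-in for the time at which the intensity equals ET). The function names `find_events` and `decay_to_duration_ratio` are hypothetical.

```python
import numpy as np

def find_events(x, st=0.10, et=0.80):
    """Return a list of (decay_start, event_start, event_end) sample indices.

    Assumption: x is a signed dipole series scaled so that its mean
    intensity is 1. An event starts when |x| drops below st or x changes
    sign, and ends when |x| recovers to et or above.
    """
    a = np.abs(x)
    n = len(x)
    events = []
    last_above = None  # most recent index with |x| >= et
    t = 0
    while t < n:
        sign_flip = t > 0 and x[t] * x[t - 1] < 0
        if last_above is not None and (a[t] < st or sign_flip):
            end = t
            while end < n and a[end] < et:
                end += 1
            if end == n:
                break  # event not completed within the record
            events.append((last_above, t, end))
            last_above, t = end, end + 1
            continue
        if a[t] >= et:
            last_above = t
        t += 1
    return events

def decay_to_duration_ratio(events, dt=1.0):
    """Ratio of the average decay time to the average event duration."""
    decay = np.array([(start - d0) * dt for d0, start, _ in events])
    duration = np.array([(end - start) * dt for _, start, end in events])
    return decay.mean() / duration.mean()

# Toy series: slow decay from 1, a reversal, then recovery to -1.
x = np.array([1.0] * 5 + [0.7, 0.5, 0.3, 0.05, -0.3, -0.6, -0.9] + [-1.0] * 5)
events = find_events(x)
rho = decay_to_duration_ratio(events)
print(events, rho)
```

In this toy series the decay takes four samples and the event lasts three, so the ratio exceeds one, which corresponds to the "slow" case in the text.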
[Figure 10 graphic. Panel (a): Warning threshold (%) versus Number of events. Panel (b): MCC versus Number of events. Curves for G12, P09, DW and 3D.]
Figure 10. (a) Optimal warning threshold as a function of the number of events contained in the training data. (b) MCC computed over verification data as a
function of the number of events contained in the training data. The prediction horizon is PH = 1.
most interested in here, and its short timescales. Smoothing over a time period of 4τ typically removes such short timescales (Hulot & Le Mouël 1994). This corresponds to about 2 kyr. This is the value we tested, as it is also roughly consistent with the smoothing due to the sedimentary process in palaeomagnetic reconstructions such as PADM2M and Sint-2000. For example, the regularization used to obtain the PADM2M reconstruction suppresses energy at timescales of 5–10 kyr (Ziegler et al. 2011). Finally, it is short enough compared to the decay time and event durations we identified for the field produced by the 3-D (and DW) model (see Table 1). For consistency, we then also used the same time filtering to filter the time-series produced by the DW model. In both cases, we used a moving average filter. Results are provided in Table 2, which lists the optimal MCC of threshold-based predictions for the DW and 3-D models with and without smoothing for three prediction horizons.

We found that the skill of threshold-based predictions only slightly increases for the 3-D model, but hardly at all (to two digits) for the DW model. Thus, skills associated with the DW and 3-D models are nearly unchanged by the smoothing process, and remain smaller than the skills associated with the P09 and G12 models.

4.4 Summary of results from the hierarchy of models

The hierarchy of models is consistent in that threshold-based predictions become more difficult, or, equivalently, less skillful, when the prediction horizon increases. This suggests that threshold-based predictions are at best useful for predicting low-dipole events with a lead time that is comparable to the average duration of the event (about 10 kyr on Earth’s timescales). Moreover, the machinery of identifying thresholds by maximizing a skill score is robust in the
[Figure 11 graphic. Panels (a) G12, (b) P09, (c) DW and (d) 3D: MCC versus Warning threshold (%).]
Figure 11. MCC skill score as a function of WT for the four models. (a) G12, (b) P09, (c) DW and (d) 3-D. The various graphs shown for each model differ
in the number of events contained in the training data (see text for details). The thin lines continue the curves for WT ≥ ET.
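
The curves in Fig. 11 come from evaluating a skill score for each candidate warning threshold. The Python sketch below illustrates that scan under stated assumptions and is not the authors' implementation: a prediction is taken to be positive whenever the intensity is below WT, the truth is positive when an event starts within the next `horizon` samples, and samples inside events are assumed to have been removed beforehand. The MCC is computed from its standard confusion-matrix formula. The function names and the toy data are hypothetical.

```python
import numpy as np

def mcc(pred, truth):
    """Matthews correlation coefficient from boolean predictions and truth."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = float(np.sum(pred & truth))
    tn = float(np.sum(~pred & ~truth))
    fp = float(np.sum(pred & ~truth))
    fn = float(np.sum(~pred & truth))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

def truth_labels(event_starts, n, horizon):
    """truth[t] is True if an event starts within the next `horizon` samples."""
    truth = np.zeros(n, dtype=bool)
    for s in event_starts:
        truth[max(0, s - horizon):s] = True
    return truth

def scan_warning_thresholds(intensity, truth, thresholds):
    """MCC of the rule 'intensity < WT' for every candidate WT."""
    intensity = np.asarray(intensity)
    return np.array([mcc(intensity < wt, truth) for wt in thresholds])

# Toy data (assumption): six samples before an event that starts at index 6.
intensity = np.array([1.0, 1.0, 0.9, 0.6, 0.4, 0.2])
truth = truth_labels([6], n=len(intensity), horizon=3)
thresholds = np.array([0.3, 0.5, 0.7, 0.95])
scores = scan_warning_thresholds(intensity, truth, thresholds)
best_wt = thresholds[np.argmax(scores)]
print(scores, best_wt)
```

A flat stretch in `scores` around the maximum would correspond to the plateau discussed in the text, where rather different WT values give similar skill.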
Table 2. Maximum MCC of threshold-based predictions for the DW and 3-D models with and without smoothing (smoothing window is 4τ ≈ 2 kyr) for three different prediction horizons. Optimal warning thresholds and MCC scores are computed over the entire run (no verification).

Model   Smoothing          PH = 0.5   PH = 1   PH = 1.5
DW      No smoothing       0.39       0.32     0.27
DW      2 kyr smoothing    0.39       0.32     0.27
3-D     No smoothing       0.28       0.22     0.18
3-D     2 kyr smoothing    0.32       0.24     0.19

sense that the skill during training is comparable to the skill during verification. Our overall approach is also robust with respect to the precise choices of start-of-event and end-of-event thresholds, and with respect to the choice of skill score (MCC, F1 or CSI).

We observe strong differences in the skills of threshold-based predictions across the various models. The DW and 3-D models exhibit complex behaviour during reversals or excursions, with many polarity changes during the low-dipole event, and the decay time is short compared to the event duration (fast reversals). The G12 model behaves differently: we do not observe quick polarity changes during a G12 reversal, no major excursions occur, and the decay time is larger than the event duration (slow reversals). The G12 model is more amenable to threshold-based predictions than the DW or 3-D models, because of its simpler reversing behaviour and because reversals are approached slowly. The P09 model falls in between the DW and 3-D models and the G12 model.

Our numerical experiments with short training data sets suggest that the main difficulty for threshold-based predictions may not be the shortness of the observational record. The hierarchy of models is surprisingly consistent in that one may be able to determine useful WTs, even if the training data are limited. The reasons why this stability occurs, however, vary across the hierarchy of models. For the G12 model, low-dipole events are indeed easy to predict by a threshold and this threshold can be found by optimizing skill scores over short data sets. For the other models, the skill score is a nearly flat function of the threshold, that is, different thresholds can lead to similar skill scores (recall Fig. 11). More importantly, the overall skill of threshold-based predictions is low for the DW and 3-D models, even when introducing some smoothing. Thus, threshold-based predictions may be of limited use for the DW and 3-D models, because false positives and false negatives occur frequently. Again, the P09 model falls in between the G12 and DW/3-D models.

We summarize our main results about threshold-based predictions for dipole models as follows.

(i) Across the hierarchy of models, the skill of threshold-based predictions degrades with the prediction horizon.
(ii) Across the hierarchy of models, threshold-based predictions are robust to minor variations of numerical details, such as choice of skill score (MCC or F1 or CSI), or choices of start-of-event and end-of-event thresholds.
(iii) Across the hierarchy of models, useful WTs can be found even if the duration of the training period is short and comparable to the observational record. This suggests that the shortness of the observational record is not the main issue that makes computing WTs difficult. The reasons why this is the case, however, differ across the hierarchy of models.
(iv) The G12 model is more amenable (highest skill) to threshold-based predictions than the DW or 3-D models (lowest skill). The skill
of threshold-based predictions for the P09 model falls in between the skills for G12 and DW/3-D. Furthermore, we found that skills strongly correlate with the ratio of the average decay time to the average event duration.

5 APPLICATION TO PALAEOMAGNETIC RECONSTRUCTIONS

We now take advantage of the lessons learned from the hierarchy of models and apply threshold-based predictions to the PADM2M and Sint-2000 palaeomagnetic reconstructions, which provide proxies of the Earth’s axial dipole intensity over the past 2 Myr (Valet et al. 2005; Ziegler et al. 2011). More specifically, PADM2M and Sint-2000 report the virtual axial dipole moment (VADM) in increments of 1 kyr for the past 2 Myr. We scale each reconstruction so that one unit of relative palaeointensity corresponds to its time average ($5.32 \times 10^{22}$ Am$^2$ for PADM2M, $5.81 \times 10^{22}$ Am$^2$ for Sint-2000). The timing of reversals is based on the geomagnetic polarity timescale of Cande & Kent (1995), with a slight modification for the Cobb mountain sub-chron in the case of PADM2M (Morzfeld et al. 2017).

We note that PADM2M and Sint-2000 are ‘data’ of the same process, namely Earth’s dipole intensity over the past 2 Myr. Nonetheless, there are differences between PADM2M and Sint-2000, which are due to variations in the processing and interpretation of raw data, and also the raw data that goes into the two reconstructions. This means that differences between PADM2M and Sint-2000 indicate the level of uncertainty that is caused by difficulties with observing Earth’s dipole over millions of years (see also Morzfeld & Buffett (2019)). Moreover, the fact that the observational record is short (2 Myr sampled in 1 kyr increments), implies that it is difficult to determine if any differences are (statistically) significant. It is important to keep this ‘minimum level of uncertainty’ in mind when evaluating threshold-based predictions for the palaeomagnetic reconstructions (note that we essentially treat the palaeomagnetic reconstructions as ‘data’, but we are aware that these reconstructions are themselves ‘models’).

5.1 Event durations and decay times

Based on our definitions above, we compute the average and standard deviation of the event duration and decay times for the six events of PADM2M and Sint-2000. Results using ST = 10 per cent and ET = 80 per cent as before, are listed in Table 1 and these values should be compared with the corresponding values for the four models. In this context, it is important to realize that PADM2M and Sint-2000 never exhibit intensity values below 10 per cent of their time average, which is why the definition of the low-dipole event in Section 3.1 contains the ‘or-statement’: a low-dipole event starts when the intensity drops below the ST or if the dipole changes its sign.

We first note that PADM2M and Sint-2000 lead to results consistent with each other (e.g. average event duration and average decay times agree with each other within the corresponding standard deviations). We also note that both average event durations and average decay times fall within the range of values covered by the hierarchy of models. Hardly any model, however, leads to values satisfyingly matching those of PADM2M and Sint-2000 for both quantities. This is best seen in Fig. 12, which shows the average decay time (ADT) plotted as a function of the average event duration (AED) for the four models and the palaeomagnetic reconstructions.

Figure 12. Average decay time (ADT, in kyr) plotted as a function of the average event duration (AED, in kyr) for the four models (G12, P09, DW and 3D) and the palaeomagnetic reconstructions (PADM2M and Sint-2000). Also shown are the error bars based on one standard deviation. In the case of the G12 model, the standard deviation of the event duration is too small to be visible as an error bar. Also shown is a 45° line that separates models or data for which ADT > AED from models for which ADT < AED.

The average event duration of PADM2M or Sint-2000 is longer than that of the G12 (shortest) and P09 models, and shorter than that of the DW and 3-D (longest) models. We note, however, that associated standard deviations may reconcile the average event durations of the palaeomagnetic reconstructions with those of the various models, but only marginally so for G12, which intrinsically displays little variation in the event duration. Moreover, the standard deviations for the event durations of the palaeomagnetic reconstructions are quite comparable to those of the DW and 3-D models, but larger than those of the P09 model, and are much larger than those of the G12 model. Overall, the average event duration of the palaeomagnetic reconstructions lies in between the average event durations of the G12/P09 and DW/3-D models. We keep in mind that standard deviations for the palaeomagnetic reconstructions may be corrupted by insufficient statistics, since the data document only six events.

The situation, however, is different when considering average decay times. The average decay times of the palaeomagnetic reconstructions are much larger than those of the 3-D (shortest), DW and P09 models, but comparable to that of the G12 model (longest). The standard deviations are much larger than that of the G12 model, and substantially larger than those of the P09, 3-D and DW models (P09, DW and 3-D models are comparable). This could be due to insufficient statistics or to data uncertainties, as suggested by the disagreement between the different values obtained with the PADM2M and Sint-2000 data sets. From the perspective of average decay times, it thus appears that the data are consistent with the G12 model.

Finally, we compute the ratio ρ of the average decay time to the average event duration for both palaeomagnetic reconstructions. This leads to values of about two for PADM2M and three for Sint-2000 (see Table 1), which is consistent with the already known fact that intensity tends to decrease more slowly before a reversal than it recovers after it (Valet et al. 2005). The ratio ρ of PADM2M and Sint-2000 can also be compared to the corresponding ratios of the four models in Fig. 9. We note that the ratios of the palaeomagnetic reconstructions are much larger than the corresponding ratios associated with the 3-D (smallest) and DW models; they are comparable to the corresponding ratio of the P09 model, and much smaller than the corresponding ratio of the G12 model.
5.2 Threshold-based predictions and their skills

We now apply threshold-based predictions to PADM2M and Sint-2000 using the same techniques as above and, as before, consider prediction horizons PH = 0.5, PH = 1 and PH = 1.5. Note that these PHs correspond to about 6, 11 and 17 kyr in geophysical time. The ROC curves of threshold-based predictions for PADM2M and Sint-2000 are shown in Figs 13(a) and (b). These curves are computed over the entire 2 Myr time window covered by the palaeomagnetic reconstructions. Inspecting the ROC curves qualitatively, we see that the skill of threshold-based predictions decreases with the prediction horizon. We observed this also for all four models. Comparing the ROC curves of the palaeomagnetic reconstructions in Figs 13(a) and (b) with the ROC curves of the models in Fig. 6, the ROC curves of the palaeomagnetic reconstructions resemble those of the P09 model. Fig. 13(c) shows the curve traced out by the MCC when varying the WT for PADM2M and Sint-2000 (MCC is computed over the entire 2 Myr time window). Again, we note that the curves corresponding to the palaeomagnetic reconstructions are qualitatively similar to the corresponding curve of the P09 model (see Fig. 11).

We also compute optimal MCC scores for PADM2M and Sint-2000 via training and verification. We use the first 0.95 Myr, containing four events, for training (finding an optimal WT) and use the remaining 1.05 Myr, containing two events, for verification. Table 1 lists these MCCs for PADM2M and Sint-2000, together with the MCCs of the four models, when computed with training data containing a comparable number of events (four events during training for the palaeomagnetic data and five events during training for the models, see Section 4.2). Fig. 14 shows the (verification) MCC scores for PADM2M and Sint-2000 along with those of the models.

We first note from Table 1 that the MCC skill score drops from training to verification and that the verification skills for PADM2M and Sint-2000 are quite different. This is caused by the verification periods being extremely short, with only two events during verification. Thus, while the WT we find from a limited observational record may be quite accurate, it remains difficult to evaluate the skill of threshold-based predictions. These difficulties are due to the shortness of the observational record: we only have 2 Myr, with six events, to base our training and validation on. Nevertheless, we again find that palaeomagnetic reconstructions tend to produce verification MCC scores quite consistent with what could be anticipated based on our analysis of the ratio of the average decay time to the average event duration. The MCC associated with the palaeomagnetic reconstructions, indicative of the skill of intensity threshold-based prediction, is indeed larger than the MCC recovered for the 3-D (smallest) and DW models, comparable to that of the P09 model, and much smaller than that of the G12 model.

We illustrate threshold-based predictions for the palaeomagnetic reconstructions and PH = 1 in Fig. 15. This figure first confirms that the choice of the start-of-event (ST) and end-of-event (ET) thresholds properly identifies the six events of interest. There are five reversals and one major excursion, which corresponds to what is known as the Cobb mountain subchron at 1.19 Myr, and is indeed an event during which the field temporarily changed its polarity at low intensity. This figure also illustrates the limitations of threshold-based predictions when using PADM2M or Sint-2000. In the case of PADM2M, with an optimal WT of $\widehat{\mathrm{WT}}_{\text{PADM2M}} = 50.75$ per cent (corresponding to $2.70 \times 10^{22}$ Am$^2$), we note the occurrence of two instances of false positives, where no low-dipole event is observed, but a low-dipole event is predicted (near the –1.5 Myr and –0.25 Myr marks). One of these instances of false positives occurs during training, the other during verification. Such false positives do not occur in the case of Sint-2000, which also has a lower optimal WT of $\widehat{\mathrm{WT}}_{\text{Sint-2000}} = 36.75$ per cent (corresponding to $2.14 \times 10^{22}$ Am$^2$). One may thus intuitively expect that the predictions will have a lower skill when applied to PADM2M than to Sint-2000, but in fact this is not the case: the skill during verification is higher for PADM2M than for Sint-2000, but the skill for training is higher for Sint-2000 than for PADM2M. This is perhaps counterintuitive because one is tempted to think of false positives that occur ‘far’ from a reversal as more severe than false positives or false negatives that occur ‘close’ to a reversal. The MCC score, however, does not apply special meaning to the categories of ‘positive’ and ‘negative’, so that predictions of the timing of the two reversals, for example during verification, are more accurate for PADM2M than for Sint-2000.

The ROC curves and the MCC skill scores for the palaeomagnetic reconstructions and models suggest that the predictive skill of threshold-based predictions of the palaeomagnetic reconstructions may be comparable to the skill of these predictions for the P09 model. Because it is difficult to verify threshold-based predictions using the observational record only, we may use the P09 model to investigate the skill of threshold-based predictions, applied to the palaeomagnetic reconstructions. Threshold-based predictions (PH = 1) for the P09 model are illustrated in Fig. 7 (note that the predictions in Fig. 7 make use of a large training data set). Indeed, when training threshold-based predictions for P09 with training data that contains five low-dipole events (comparable to palaeomagnetic reconstructions), the optimal WT of the P09 model of 54.5 per cent is quite comparable to that obtained for PADM2M (50.75 per cent) and slightly more than that obtained for Sint-2000 (36.75 per cent), all of which are consistent with the range of values found in Figs 10 and 11. We also observe that the predictions for P09 are similar to the predictions for the palaeomagnetic reconstructions. We encounter a large number of true negatives, several false negatives, for which the threshold-based predictions trigger a little too late, and occasionally encounter false positives that occur during periods when no low-dipole event occurs.

In summary, we conclude that threshold-based predictions are feasible for the palaeomagnetic reconstructions, but lead to moderate success. They share similar characteristics with threshold-based predictions for the P09 model, and suffer from similar caveats:

(i) Low-dipole events can be predicted only a relatively short time ahead, that is, the prediction horizon should be about one average event duration or less. On Earth’s timescale, this means the prediction horizon should be about 10 kyr or less.
(ii) Low-dipole events may be predicted a few kyr too late (false negatives), which is significant in view of the relatively short prediction horizon.
(iii) One must be prepared for false positives to occur even when no low-dipole event is about to happen.

The above conclusions are supported by two palaeomagnetic reconstructions, PADM2M and Sint-2000, but threshold-based predictions show some sensitivity to which reconstruction we use. This is perhaps best illustrated by the predictions in Fig. 15, but it is also clear from the skill scores in Table 1. As indicated above, differences between results stemming from PADM2M or Sint-2000 establish an uncertainty that cannot be resolved, because this uncertainty is caused by our limited ability to observe Earth’s dipole over millions of years. In this context, we wish to point out that we did not use other global models, for example PISO-1500 (Channell et al.
[Figure 13 graphic. Panels (a) PADM2M and (b) Sint-2000: TPR versus FPR for PH = 0.5, PH = 1.0 and PH = 1.5. Panel (c): MCC versus Warning threshold (%) for PADM2M and Sint-2000.]
Figure 13. Panels (a) and (b): ROC curves for two palaeomagnetic reconstructions and three prediction horizons, with PH = 0.5 (about 6 kyr) in green, PH = 1 (about 11 kyr) in purple, and PH = 1.5 (about 17 kyr) in orange. (a) PADM2M, (b) Sint-2000. An ROC curve is the collection of TPR/FPR pairs one obtains
when varying the warning threshold. The thicker line corresponds to TPR/FPR pairs for which ST < WT < ET. The thin lines continue the ROC curves for
ET ≤ WT. Panel (c): MCC as a function of the warning threshold (prediction horizon is PH = 1). The ROC curves and MCC scores are computed over the
entire 2 Myr covered by the palaeomagnetic reconstructions.
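
As the caption of Fig. 13 states, an ROC curve is the collection of TPR/FPR pairs obtained when varying the warning threshold. The Python sketch below shows one way such pairs could be computed. It is illustrative only: it assumes the same simple rule as a threshold test on the intensity (positive prediction when the intensity is below WT), it assumes that both positive and negative truth labels are present so that the rates are defined, and the function name `roc_points` and the toy data are hypothetical.

```python
import numpy as np

def roc_points(intensity, truth, thresholds):
    """Return (FPR, TPR) arrays, one pair per warning threshold."""
    intensity = np.asarray(intensity)
    truth = np.asarray(truth, dtype=bool)
    fpr, tpr = [], []
    for wt in thresholds:
        pred = intensity < wt
        tp = np.sum(pred & truth)
        fn = np.sum(~pred & truth)
        fp = np.sum(pred & ~truth)
        tn = np.sum(~pred & ~truth)
        tpr.append(tp / (tp + fn))  # true positive rate
        fpr.append(fp / (fp + tn))  # false positive rate
    return np.array(fpr), np.array(tpr)

# Toy data (assumption): an event starts right after the last sample, and
# the last three samples fall within the prediction horizon.
intensity = np.array([1.0, 1.0, 0.9, 0.6, 0.4, 0.2])
truth = np.array([False, False, False, True, True, True])
thresholds = np.array([0.3, 0.5, 0.7, 0.95])
fpr, tpr = roc_points(intensity, truth, thresholds)
print(list(zip(fpr, tpr)))
```

Raising the threshold moves along the curve from few warnings (low TPR, low FPR) towards many warnings (high TPR, higher FPR).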
[Figure 14 graphic: MCC (0 to 1) for G12, P09, DW, 3D, PADM2M and Sint-2000.]

Figure 14. MCC of four models and two palaeomagnetic reconstructions (PADM2M and Sint-2000). The optimal WT is computed using training data containing five events in the case of the models, and four events in the case of the palaeomagnetic reconstructions (see text for details).

2009), because it is biased towards the North Atlantic region due to the fact that only stacks with a high sedimentation rate are used (see, e.g. fig. 5 of Panovska et al. 2019). Indeed, PISO-1500 is less representative of the (global) axial dipole field than PADM2M and Sint-2000 (see also Ziegler et al. 2011). Exploring the consequences of such differences is beyond the scope of our work.

Finally, we want to bring a few details to the reader’s attention. In particular, we want to emphasize that the average dipole intensity and the average event duration for threshold-based predictions for PADM2M or Sint-2000 are computed using the entire 2 Myr record. One could also envision computing the average intensity based on training data only. We decided not to do so for the following reasons. The average intensity defines the start and end of an event, because start-of-event and end-of-event thresholds are defined in terms of the average intensity. The average event duration, and even the number of events, are implicitly defined by the start-of-event and end-of-event thresholds and, thus, also depend on the average intensity. The average event duration, in turn, is used in the definition of the prediction horizon. In summary, the average intensity directly affects (i) the number of events; (ii) the average event duration and (iii) the prediction horizon. By computing the average intensity over the 2 Myr reconstructions, we have assumed these difficulties away and fix the average intensity a priori. We find that this is more practically relevant because the average intensity may be determined by using additional information. Nonetheless, we also made threshold-based predictions for which we compute the average event duration based on training data and the results are nearly identical to the results we show above.

6 CONCLUDING COMMENTS

The main purpose of this study is to test the possibility that a low value of the axial dipole intensity could be used as a natural indicator of an upcoming dipole reversal. To answer this question, we analysed a hierarchy of numerical models, and Earth’s axial dipole field as documented by the PADM2M and Sint-2000 palaeomagnetic VADM reconstructions (Valet et al. 2005; Ziegler et al. 2011). More specifically, we test the possibility of relying on an intensity threshold-based strategy, whereby once the axial dipole intensity drops below a WT, it is predicted that the intensity will drop further and lead to a low-dipole event (either a reversal or a major excursion) within some specified time, called the prediction horizon. Although the principle of such a strategy appears to be fairly intuitive, implementing it in a robust way led us to introduce a dedicated methodology.

Our method requires that we define a WT, a start-of-event threshold (ST), an end-of-event threshold (ET) and a prediction horizon (PH). Both ST and ET appear to be most conveniently defined in terms of the average intensity of the axial dipole (in practice ST = 10 per cent and ET = 80 per cent). ST and ET also define an average event duration (AED, average time elapsed between when the intensity passes below the ST and when it recovers back to above the ET). The prediction horizon is defined in terms of the average event duration and we consider predictions with PHs of about one AED. Having chosen the ST, ET and PH, we identify the WT by maximizing a skill score. Several skill scores have been tested, and all adequate choices led to similar conclusions. Similarly, we showed that the exact choices of the ST and ET percentages are not critical, provided these properly bracket the events of interest. The code we use to implement the prediction is available on GitHub (https://github.com/kjg136/Threshold). We archived the code used to generate all figures in (https://doi.org/10.5281/zenodo.4267116).

Figure 15. Illustration of threshold-based predictions for the PADM2M [(a) and (c)] and Sint-2000 [(b) and (d)] reconstructions. The prediction horizon
is PH = 1 (about 11 kyr) and the optimal warning threshold is computed over 0.95 Myr of training data, containing four events. The corresponding warning
thresholds (expressed in per cent of the average intensity) are $\widehat{\mathrm{WT}}_{\text{PADM2M}} = 50.75$ per cent ($2.70 \times 10^{22}$ Am$^2$) and $\widehat{\mathrm{WT}}_{\text{Sint-2000}} = 36.75$ per cent ($2.14 \times 10^{22}$ Am$^2$) for PADM2M and Sint-2000, respectively. Panels (a) and (b) contain three subfigures. Top panel. Blue: dipole time-series. Blue/green/red horizontal
lines: start-of-event/end-of-event/warning thresholds. Centre panel. Graphs are zero if the threshold-based prediction is ‘no low-dipole event will start during
the prediction horizon’; graphs are one if the threshold-based prediction is ‘a low-dipole event will start during the prediction horizon’. Bottom panel. Graphs
are zero if no low-dipole event starts within the prediction horizon; graphs are one if a low-dipole event starts during the prediction horizon. Panels (c) and (d)
show magnifications during a time interval that includes the two reversals that occur during verification.
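
The warning thresholds quoted in the caption of Fig. 15 are percentages of each reconstruction's time-averaged VADM, so the absolute values follow from a simple multiplication. The short Python check below reproduces that arithmetic using only numbers stated in this paper (the time averages, the optimal WTs and the present-day value of about 7.8 × 10²² Am² attributed to Constable & Korte 2006). The variable names are hypothetical.

```python
# Time-averaged VADM of each reconstruction, in Am^2 (values from the text).
mean_vadm = {"PADM2M": 5.32e22, "Sint-2000": 5.81e22}
# Optimal warning thresholds in per cent of the time average (from the text).
wt_percent = {"PADM2M": 50.75, "Sint-2000": 36.75}
# Present-day axial dipole magnitude quoted in the text, in Am^2.
present_day = 7.8e22

# Convert each warning threshold from per cent to Am^2.
wt_absolute = {k: wt_percent[k] / 100.0 * mean_vadm[k] for k in mean_vadm}
# Express the present-day value in per cent of each time average.
present_relative = {k: 100.0 * present_day / mean_vadm[k] for k in mean_vadm}

for name in mean_vadm:
    print(
        f"{name}: WT = {wt_absolute[name] / 1e22:.2f}e22 Am^2, "
        f"present day = {present_relative[name]:.0f} per cent of the average"
    )
```

With these inputs the present-day value lies well above both thresholds, in line with the statement that the threshold rule does not currently issue a warning.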
A first major conclusion is that the skills of intensity threshold-based predictions vary surprisingly widely within the hierarchy of numerical models we investigated (G12, P09, DW and 3-D models). The only model that leads to a high skill (implying that the intensity threshold-based predictions are reliable) is the G12 model. This result is in line with the results obtained by Morzfeld et al. (2017), who identified a high skill of intensity threshold-based predictions for this model, using a simpler strategy and a less robust analysis. All other models lead to lower skills, implying that the intensity threshold-based predictions are less reliable. This is, again, consistent with Morzfeld et al. (2017), who investigated the P09 model and a model (B13, Buffett et al. 2013) similar to the DW model, but did not investigate the 3-D model. In this study, we were able to rank these skills more accurately and identify one key property that may play a major role in defining the skills of threshold-based predictions in the context of numerical dynamos and VADM reconstructions (PADM2M and Sint-2000).

This key property is that skills of intensity threshold-based predictions correlate with the ratio of the average decay time (defined as the time between the start of the event and the most recent time instance at which the intensity is equal to the end-of-event threshold) to the average event duration. The larger this ratio, the better the skill. The models and the PADM2M and Sint-2000 reconstructions are consistent with this rule. As already noted, this asymmetry between the way the field decreases towards a reversal and the way it recovers its strength after the reversal is a well-known property of the field (Valet et al. 2005), which in fact may be related to the more general tendency of the Earth’s magnetic field to spend more time decreasing than increasing at any time (see, e.g. Ziegler & Constable 2011; Avery et al. 2017). What this study thus suggests is that this slight asymmetry is what defines the skill of intensity threshold-based predictions when applied to Earth’s magnetic field. Unfortunately, because this ratio is about two to three, the skill of threshold-based predictions is limited. As our study further shows, this, more than the relatively short duration of the Sint-2000 and PADM2M reconstructions, is what likely makes intensity threshold-based predictions using these data only modestly reliable.

Despite the limitations we identified for intensity threshold-based predictions, it is worth pointing out that today’s axial dipole field, with a magnitude of about $7.8 \times 10^{22}$ Am$^2$ (Constable & Korte 2006), is much larger than the WTs we identified by using either Sint-2000 ($\widehat{\mathrm{WT}}_{\text{Sint-2000}} = 36.75$ per cent of the average $5.81 \times 10^{22}$ Am$^2$, which amounts to $2.14 \times 10^{22}$ Am$^2$) or PADM2M ($\widehat{\mathrm{WT}}_{\text{PADM2M}} = 50.75$ per cent of the average $5.32 \times 10^{22}$ Am$^2$, amounting to $2.70 \times 10^{22}$ Am$^2$). Intensity threshold-based predictions thus suggest that no low-dipole event will occur within the next 10 kyr. This is in line with many other recent predictions (see, e.g. Constable & Korte 2006; Morzfeld et al. 2017; Brown et al. 2018).

As an interesting additional outcome of this study, we note that testing the skills of threshold-based predictions on numerical dynamos is a fairly discriminating way of testing the Earth-like nature
of the axial dipole field behaviour of the models. This skill is distinct from the ability of numerical simulations to reproduce the frequency with which reversals occur. This is evident from the fact that threshold-based predictions have different skills for the DW and P09 models, whereas both models are characterized by reversal frequencies comparable to that of the Earth over the last 25 Myr (about 5 reversals per Myr). As this skill appears to be correlated with the ratio of the average decay time to the average event duration (a measure of the asymmetry with which the field evolves towards a reversal and next recovers its full strength), it also appears to be distinct from other criteria often used to characterize the Earth’s dipole field behaviour, such as its frequency content (Constable & Johnson 2005), or the relative time spent in transitional periods (based on dipole latitudes being less than 45°), as recently suggested by Sprain et al. (2019). Furthermore, in spite of its favourable ratings according to the criteria defined by Christensen et al. (2010) for the recent field, and Sprain et al. (2019) for the palaeomagnetic field (recall Section 2.2.4), the field produced by the 3-D model appears not to match the Earth’s field (as described by PADM2M and Sint-2000) in terms of intensity threshold-based prediction skill (and ratio of the average decay time to the average event duration). In agreement with the suggestions of Ziegler & Constable (2011) and Avery et al. (2017), and since it appears to play a significant role in the way reversals occur, we strongly encourage the community to also consider predictive skills and asymmetric temporal behaviour as additional criteria to identify Earth-like dynamo simulations.

This study shows that intensity threshold-based predictions of reversals appear to be of only limited value, but we emphasize that we investigated these limitations for only one specific threshold-based prediction, namely predicting whether a reversal or major excursion occurs during a specified time window. Other types of predictions might deal instead with the probability of a reversal or major excursion during a specified time window. In this case, a large number of reversals would be needed to test these predictions.

It is also worthwhile to comment on other routes to more robust and reliable predictions. Taking advantage of machine learning and deep learning could be a possibility (Goodfellow et al. 2016). In this context, however, one should be careful to check that the shortness of the palaeomagnetic reconstructions is not a limiting factor, as deep learning is known to work best when data availability is vast, and only poorly when data are limited. Another approach is to rely on merging the observations with a numerical model in a process called data assimilation (DA, see, e.g. Carrassi et al. 2018). This strategy has been successful in numerical weather prediction because the atmospheric model is of high quality, and because observations of the atmospheric state are plentiful (Bauer et al. 2015). It is currently being developed in the field of geomagnetism (Fournier et al. 2010). Using DA for predicting dipole reversals, however, is difficult due to the lack of a suitable 3-D model that can be run fast enough and the fact that the observations are limited to the virtual axial dipole moment over 2 Myr. Here, the main difficulty lies in identifying, or creating, useful models that are simple enough to allow for DA but complex enough to represent all relevant timescales. Nevertheless, Morzfeld et al. (2017) recently showed that using such an approach with the G12 model, and assimilating either PADM2M or Sint-2000, could lead to some success. No similar success could be reached with the P09 model, which was also tested. In that approach, indeed, the key to success appears to be the dynamical way the axial dipole produced by the model approaches reversals. It appears that the way the G12 model approaches reversals is more similar to how Earth’s axial dipole field approaches reversals than the way the P09 model does. This leads to the interesting possibility of finding a better suited low-dimensional model with properties intermediate between the G12 model (whose decay-time properties make it well suited for DA) and P09 (with intensity threshold-based prediction properties closest to those of the palaeomagnetic reconstructions), leading to better predictions of reversals several kyr ahead.

ACKNOWLEDGEMENTS

KG acknowledges that this work was supported by NASA Headquarters under the NASA Earth and Space Science Fellowship Program (Grant ‘80NSSC18K1351’). This work was supported in part by the French Agence Nationale de la Recherche under grant ANR-19-CE31-0019 (revEarth). All authors would like to thank Nathanael Schaeffer (ISTerre, CNRS, Université Grenoble Alpes) and Thomas Gastine (Université de Paris, Institut de Physique du Globe de Paris) for allowing us to use the dipole time-series of the 3-D model. We acknowledge GENCI for access to the Irene resource (TGCC) under grants ‘Grand Challenge’ GCH0315 and A0060407382. We thank Maggie S. Avery (UC Berkeley) and an anonymous reviewer for helping us improve this paper. We thank Cathy Constable (Scripps Institution of Oceanography, University of California, San Diego) and Bruce Buffett (UC Berkeley) for meaningful discussions. AF thanks Richard Bono and Courtney Sprain for their assistance in the calculation of Q_PM. All authors contributed to the ideas behind the approach taken in the manuscript and all authors contributed to writing the paper; KG wrote the code.

REFERENCES

Aubert, J., 2019. Approaching Earth’s core conditions in high-resolution geodynamo simulations, Geophys. J. Int., 219, S137–S151.
Avery, M.S., Gee, J.S. & Constable, C.G., 2017. Asymmetry in growth and decay of the geomagnetic dipole revealed in seafloor magnetization, Earth planet. Sci. Lett., 467, 79–88.
Barrett, H. & Myers, K., 2003. Foundations of Image Science, Wiley.
Bauer, P., Thorpe, A. & Brunet, G., 2015. The quiet revolution of numerical weather prediction, Nature, 525(7567), 47.
Brown, M., Korte, M., Holme, R., Wardinski, I. & Gunnarson, S., 2018. Earth’s magnetic field is probably not reversing, Proc. Natl. Acad. Sci., 115(20), 5111–5116.
Buffett, B., 2015. Dipole fluctuations and the duration of geomagnetic polarity transitions, Geophys. Res. Lett., 42, 7444–7451.
Buffett, B. & Matsui, H., 2015. A power spectrum for the geomagnetic dipole moment, Earth planet. Sci. Lett., 411, 20–26.
Buffett, B., Ziegler, L. & Constable, C., 2013. A stochastic model for paleomagnetic field variations, Geophys. J. Int., 195(1), 86–97.
Buffett, B.A., King, E.M. & Matsui, H., 2014. A physical interpretation of stochastic models for fluctuations in the Earth’s dipole field, Geophys. J. Int., 198(1), 597–608.
Cande, S. & Kent, D., 1995. Revised calibration of the geomagnetic polarity timescale for the Late Cretaceous and Cenozoic, J. geophys. Res., 100, 6093–6095.
Carrassi, A., Bocquet, M., Bertino, L. & Evensen, G., 2018. Data assimilation in the geosciences: an overview of methods, issues, and perspectives, WIREs: Clim. Change, 9(5), e535, doi:10.1002/wcc.535.
Channell, J., Xuan, C. & Hodell, D., 2009. Stacking paleointensity and oxygen isotope data for the last 1.5 Myr (PISO-1500), Earth planet. Sci. Lett., 283(1), 14–23.
Chicco, D. & Jurman, G., 2020. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation, BMC Genomics, 21(6), doi:10.1186/s12864-019-6413-7.
Chorin, A. & Hald, O., 2013. Stochastic Tools in Mathematics and Science, 3rd edn, Springer.
Christensen, U.R. & Wicht, J., 2015. Numerical dynamo simulations, in Core Dynamics, Vol. 8: Treatise on Geophysics, Chapter 8, 2nd edn, pp. 245–277, eds Olson, P. & Schubert, G., Elsevier.
Christensen, U.R., Aubert, J. & Hulot, G., 2010. Conditions for Earth-like geodynamo models, Earth planet. Sci. Lett., 296(3–4), 487–496.
Constable, C. & Johnson, C., 2005. A paleomagnetic power spectrum, Phys. Earth planet. Inter., 153, 61–73.
Constable, C. & Korte, M., 2006. Is Earth’s magnetic field reversing?, Earth planet. Sci. Lett., 246, 1–16.
Fawcett, T., 2006. An introduction to ROC analysis, Pattern Recognit. Lett., 27(8), 861–874.
Finlay, C.C., Aubert, J. & Gillet, N., 2016. Gyre-driven decay of the Earth’s magnetic dipole, Nat. Commun., 7, 10422.
Fournier, A., et al., 2010. An introduction to data assimilation and predictability in geomagnetism, Space Sci. Rev., 155, 247–291.
Gissinger, C., 2012. A new deterministic model for chaotic reversals, Eur. Phys. J. B., 85, 137.
Glatzmaier, G. & Coe, R., 2015. Magnetic polarity reversals in the core, in Core Dynamics, Vol. 8: Treatise on Geophysics, Chapter 9, 2nd edn, pp. 279–295, eds Olson, P. & Schubert, G., Elsevier.
Goodfellow, I., Bengio, Y. & Courville, A., 2016. Deep Learning, MIT Press, http://www.deeplearningbook.org.
Hoyng, P., Ossendrijver, M. & Schmitt, D., 2001. The geodynamo as a bistable oscillator, Geophys. Astrophys. Fluid Dyn., 94, 263–314.
Hulot, G. & Le Mouël, J.-L., 1994. A statistical approach to the Earth’s main magnetic field, Phys. Earth planet. Inter., 82, 167–183.
Hulot, G., Eymin, C., Langlais, B., Mandea, M. & Olsen, N., 2002. Small-scale structure of the geodynamo inferred from Oersted and Magsat satellite data, Nature, 416, 620–623.
Hulot, G., Finlay, C.C., Constable, C.G., Olsen, N. & Mandea, M., 2010a. The magnetic field of planet Earth, Space Sci. Rev., 152, 159–222.
Hulot, G., Lhuillier, F. & Aubert, J., 2010b. Earth’s dynamo limit of predictability, Geophys. Res. Lett., 37, L06305.
Joliffe, I., 2016. The dice co-efficient: a neglected verification performance measure for deterministic forecasts of binary events, Meteorol. Appl., 23, 89–90.
Kenney, J. & Keeping, E., 1966. Mathematics of Statistics, Pt. 1, 3rd edn, Van Nostrand Company.
Kloeden, P.E. & Platen, E., 1999. Numerical Solution of Stochastic Differential Equations, Springer.
Laj, C. & Kissel, C., 2015. An impending geomagnetic transition? Hints from the past, Front. Earth Sci., 3, 61.
Lhuillier, F., Aubert, J. & Hulot, G., 2011a. Earth’s dynamo limit of predictability controlled by magnetic dissipation, Geophys. J. Int., 186, 492–508.
Lhuillier, F., Fournier, A., Hulot, G. & Aubert, J., 2011b. The geomagnetic secular-variation timescale in observations and numerical dynamo models, Geophys. Res. Lett., 38, L09306.
Lhuillier, F., Hulot, G. & Gallet, Y., 2013. Statistical properties of reversals and chrons in numerical dynamos and implications for the geodynamo, Phys. Earth planet. Inter., 220, 19–36.
Lowrie, W. & Kent, D., 2004. Geomagnetic polarity time scale and reversal frequency regimes, Timescal. Paleomag. Field, 145, 117–129.
Meduri, D. & Wicht, J., 2016. A simple stochastic model for dipole moment fluctuations in numerical dynamo simulations, Front. Earth Sci., 4, 38.
Morzfeld, M. & Buffett, B.A., 2019. A comprehensive model for the kyr and Myr timescales of Earth’s axial magnetic dipole field, Nonlin. Proc. Geophys., 26(3), 123–142.
Morzfeld, M., Fournier, A. & Hulot, G., 2017. Coarse predictions of dipole reversals by low-dimensional modeling and data assimilation, Phys. Earth planet. Inter., 262, 8–27.
Ogg, J., 2012. Geomagnetic polarity time scale, in The Geologic Time Scale 2012, Chapter 5, pp. 85–113, eds Gradstein, F., Ogg, J., Schmitz, M. & Ogg, G., Elsevier Science.
Olson, P., Driscoll, P. & Amit, H., 2009. Dipole collapse and reversal precursors in a numerical dynamo, Phys. Earth planet. Inter., 173(1), 121–140.
Olson, P., Deguen, R., Hinnov, L.A. & Zhong, S., 2013. Controls on geomagnetic reversals and core evolution by mantle convection in the Phanerozoic, Phys. Earth planet. Inter., 214, 87–103.
Panovska, S., Korte, M. & Constable, C., 2019. One hundred thousand years of geomagnetic field evolution, Rev. Geophys., 57(4), 1289–1337.
Pétrélis, F., Fauve, S., Dormy, E. & Valet, J.-P., 2009. Simple mechanism for reversals of Earth’s magnetic field, Phys. Rev. Lett., 102, 144503.
Schaeffer, N., 2013. Efficient spherical harmonic transforms aimed at pseudospectral numerical simulations, Geochem., Geophys., Geosyst., 14(3), 751–758.
Schaeffer, N., Jault, D., Nataf, H.-C. & Fournier, A., 2017. Turbulent geodynamo simulations: a leap towards Earth’s core, Geophys. J. Int., 211(1), 1–29.
Schmitt, D., Ossendrijver, M. & Hoyng, P., 2001. Magnetic field reversals and secular variation in a bistable geodynamo model, Phys. Earth planet. Inter., 125, 119–124.
Sprain, C.J., Biggin, A.J., Davies, C.J., Bono, R.K. & Meduri, D.G., 2019. An assessment of long duration geodynamo simulations using new paleomagnetic modeling criteria (Q_PM), Earth planet. Sci. Lett., 526, 115758.
Valet, J.-P. & Fournier, A., 2016. Deciphering records of geomagnetic reversals, Rev. Geophys., 54(2), 410–446, 2015RG000506.
Valet, J.-P., Meynadier, L. & Guyodo, Y., 2005. Geomagnetic field strength and reversal rate over the past 2 million years, Nature, 435, 802–805.
Valet, J.-P., Fournier, A., Courtillot, V. & Herrero-Bervera, E., 2012. Dynamical similarity of geomagnetic field reversals, Nature, 490, 89–93.
Wicht, J. & Sanchez, S., 2019. Advances in geodynamo modelling, Geophys. Astrophys. Fluid Dyn., 113(1–2), 2–50.
Ziegler, L. & Constable, C., 2011. Asymmetry in growth and decay of the geomagnetic dipole, Earth planet. Sci. Lett., 312(3), 300–304.
Ziegler, L.B., Constable, C.G., Johnson, C.L. & Tauxe, L., 2011. PADM2M: a penalized maximum likelihood model of the 0–2 Ma paleomagnetic axial dipole model, Geophys. J. Int., 184(3), 1069–1089.

APPENDIX

Table A1. Acronyms used in this paper.

Type                               Acronym     Explanation
Outcomes of events                 P           Number of positives
                                   N           Number of negatives
Outcomes of predictions            TP          True positive
                                   FP          False positive
                                   TN          True negative
                                   FN          False negative
Receiver operator characteristics  TPR         True positive rate (eq. 9)
                                   FPR         False positive rate (eq. 9)
                                   ROC         Receiver operator characteristic
Skill scores                       ACC         Accuracy (eq. 5)
                                   F1          F1 skill score (eq. 6)
                                   CSI         Critical success index (eq. 7)
                                   MCC         Matthews correlation coefficient (eq. 8)
Threshold-based predictions        ST          Start-of-event threshold
                                   ET          End-of-event threshold
                                   WT          Warning threshold
                                   PH          Prediction horizon
                                   AED         Average event duration
                                   ADT         Average decay time
                                   ρ           Ratio of ADT and AED, ρ = ADT/AED
Models                             G12         Differential equation model (Gissinger 2012)
                                   P09         Stochastic model (Pétrélis et al. 2009)
                                   DW          Stochastic double well model (Morzfeld & Buffett 2019)
                                   3-D model   3-D dynamo simulation (unpublished)
                                   SDE         Stochastic differential equation
                                   MHD         Magneto-hydrodynamic
                                   DA          Data assimilation
Data                               VADM        Virtual axial dipole moment
                                   PADM2M      VADM reconstruction (Ziegler et al. 2011)
                                   Sint-2000   VADM reconstruction (Valet et al. 2005)
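
As a reading aid for the skill-score acronyms in Table A1, the following is a minimal sketch of how such scores are commonly computed from the four prediction outcomes. It assumes the standard textbook definitions of ACC, F1, CSI, MCC, TPR and FPR; eqs (5)–(9) are not reproduced in this section, so agreement with the paper's exact formulas is an assumption. The function name `skill_scores` and the example counts are hypothetical and are not taken from the authors' code.

```python
import math


def skill_scores(tp, fp, tn, fn):
    """Standard binary-prediction skill scores from outcome counts.

    tp, fp, tn, fn are the numbers of true positives, false positives,
    true negatives and false negatives (hypothetical inputs).
    """
    p = tp + fn  # P: number of positives (events that occurred)
    n = tn + fp  # N: number of negatives (no event occurred)

    def ratio(num, den):
        # Return NaN instead of raising when a score is undefined.
        return num / den if den > 0 else float("nan")

    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "ACC": ratio(tp + tn, p + n),            # accuracy
        "F1": ratio(2 * tp, 2 * tp + fp + fn),   # F1 skill score
        "CSI": ratio(tp, tp + fp + fn),          # critical success index
        "MCC": ratio(tp * tn - fp * fn, mcc_den),  # Matthews correlation
        "TPR": ratio(tp, p),                     # true positive rate
        "FPR": ratio(fp, n),                     # false positive rate
    }


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    for name, value in skill_scores(tp=8, fp=2, tn=85, fn=5).items():
        print(f"{name}: {value:.3f}")
```

Undefined ratios are returned as NaN here rather than raising an error, a design choice of this sketch that keeps a scan over many warning thresholds from aborting when one threshold produces no positive predictions.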
