Gwirtz Et Al - GJI - 2021
1093/gji/ggaa542
Advance Access publication 2020 November 13
GJI Geomagnetism, Rock Magnetism and Palaeomagnetism
Can one use Earth’s magnetic axial dipole field intensity to predict
reversals?
Accepted 2020 November 10. Received 2020 November 2; in original form 2020 May 29
© The Author(s) 2020. Published by Oxford University Press on behalf of The Royal Astronomical Society. All rights reserved. For
permissions, please e-mail: journals.permissions@oup.com
278 K. Gwirtz et al.
[Figure 1: four panels, (a) G12, (b) P09, (c) DW and (d) 3D, each plotting the signed dipole (y-axis: Dipole, −2 to 2) against time (x-axis: Time, Myr).]
Figure 1. Signed dipole as a function of time for the four models considered in this study. (a) G12, (b) P09, (c) DW and (d) 3-D. In each case, the amplitude is
scaled so that the average absolute value of the time series is one. Some reversals and excursions are highlighted in light red and light blue.
2.2.1 The deterministic G12 model

Following Gissinger (2012), we consider the ordinary differential equations

\[
\begin{aligned}
\frac{dQ}{dt} &= \mu Q - V D, \\
\frac{dD}{dt} &= -\nu D + V Q, \\
\frac{dV}{dt} &= \Gamma - V + Q D,
\end{aligned}
\tag{1}
\]

where $\mu = 0.119$, $\nu = 0.1$ and $\Gamma = 0.9$. Here, D is the dipole and the variable Q represents the quadrupole or, more generally, the non-dipole field; V is a velocity variable that couples D and Q. A change in the sign of D corresponds to a dipole reversal. We refer to this model as the G12 model. A typical simulation with G12 is shown in Fig. 1. Here, model time t is scaled to represent the G12 millennium timescale (1 dimensionless time unit = 4 kyr), see Morzfeld et al. (2017). The simulation is done by discretizing the differential equation by a fourth-order Runge–Kutta scheme (Matlab’s ode45).

2.2.2 The stochastic P09 model

Pétrélis et al. (2009) derived a model for dipole reversals by considering the interaction of two modes. Using the symmetry of the equations of magnetohydrodynamics B → −B in an amplitude equation, and by assuming that the amplitude has a shorter timescale than a phase, a stochastic differential equation (SDE) of the form

\[
dx = f(x)\,dt + \sqrt{2q}\,dW, \tag{2}
\]

is derived for the phase, x, where f(x) and q are defined below. In eq. (2), W is Brownian motion, a stochastic process with the following properties: (i) $W(0) = 0$; (ii) $W(t) - W(t + \Delta t) \sim \mathcal{N}(0, \Delta t)$ and (iii) W(t) is almost surely continuous for all t ≥ 0 (see, e.g. Chorin & Hald 2013). Here and below, $\mathcal{N}(m, \sigma^2)$ denotes a Gaussian random variable with mean m, standard deviation σ and variance $\sigma^2$.

More specifically, the SDE for the phase is defined by

\[
f(x) = \alpha_0 + \alpha_1 \sin(2x), \qquad \sqrt{2q} = 0.2\,\sqrt{|\alpha_1|}. \tag{3}
\]

We use the same parameters as in Pétrélis et al. (2009), $\alpha_1 = -185\ \mathrm{Myr}^{-1}$, $\alpha_0/\alpha_1 = -0.9$. The dipole, D, can be calculated from the phase by $D = R\cos(x + x_0)$. Following Morzfeld et al. (2017), we set $x_0 = 0.3$ and $R = 1.3$ [the latter scales the dipole variable D to have approximately the same time average as the relative palaeointensity reported by the reconstruction of Sint-2000 (Valet et al. 2005)]. For the remainder of this paper, we refer to this model as the P09 model.

Note that the parameters define the model’s timescale. The parameters are chosen so that the P09 model exhibits reversals and excursions, and so that its reversal rate is comparable to that of Earth’s dipole. This is illustrated in Fig. 1, where a typical simulation result with this model is shown. For a simulation we discretize the differential equation using a forward Euler–Maruyama method (Kloeden & Platen 1999). The time step is 1 kyr.

2.2.3 The double well model

A simple model for reversals of a quantity (not necessarily Earth’s dipole field) is a particle in a double well potential. Such a model is defined by an SDE as in eq. (2), with an f(x) that is equal to the negative gradient of a double well potential. Variations of this model for geomagnetic dipole reversals have been considered by many researchers (Hoyng et al. 2001; Schmitt et al. 2001; Buffett et al. 2013, 2014; Buffett & Matsui 2015; Buffett 2015; Meduri & Wicht 2016; Morzfeld & Buffett 2019). The basic idea is that the state, x, of the SDE is within one of the two wells of the double well potential and is pushed around by noise (the Brownian motion $\sqrt{2q}\,dW$). When the noise builds up towards one side of the well, the state may cross over to the other potential well. One can identify a transition from one well to the other as a reversal of Earth’s dipole. We use a recent version of this model, called the Myr model in Morzfeld & Buffett (2019), for which

\[
f(x) = \gamma\,\frac{x}{\bar{x}} \cdot
\begin{cases}
(\bar{x} - x), & \text{if } x \ge 0, \\
(x + \bar{x}), & \text{if } x < 0,
\end{cases}
\tag{4}
\]

where $\gamma = 0.1\ \mathrm{kyr}^{-1}$, $\bar{x} = 5.23 \times 10^{22}\ \mathrm{A\,m^2}$ and $q = 0.34 \times 10^{44}\ \mathrm{A^2\,m^4\,kyr^{-1}}$. These parameters define the model’s natural timescale and the values we chose are based on configuration (a) in Morzfeld & Buffett (2019), which implies that the model’s reversal rate is comparable with Earth’s reversal rate. For the remainder of this paper, we refer to this model as the DW model. A typical simulation of this model is shown in Fig. 1. For a simulation we discretize the equation using a fourth-order Runge–Kutta for the deterministic part and a forward Euler–Maruyama for the stochastic component (Kloeden & Platen 1999). The time step is 1 kyr.

2.2.4 The 3-D model

We consider a 3-D, convection-driven, dynamo simulation which exhibits polarity reversals and dipole excursions. The simulation we consider has not yet been published and is part of an ensemble of reversing simulations run by N. Schaeffer (ISTerre, CNRS, Université Grenoble Alpes), A. Fournier and T. Gastine (both affiliated with Université de Paris, Institut de Physique du Globe de Paris).

In contrast to the other models described above, its 3-D nature makes this dynamo model amenable to quantitative comparisons against more observed properties of the Earth’s magnetic field than just the axial dipole. From a morphological standpoint, the 3-D model produces a magnetic field whose large-scale properties at the core–mantle boundary are in ‘good’ agreement with well-established observations, according to the four criteria introduced by Christensen et al. (2010):

(i) the axial dipole to non-axial dipole energetic ratio;
(ii) the equatorially symmetric to antisymmetric non-dipole energetic ratio;
(iii) the zonal to non-zonal energetic ratio;
(iv) the flux concentration factor.

The terrestrial reference values for these four quantities are respectively (1.4, 1.0, 0.15, 1.5, Christensen et al. 2010). For the 3-D
[Figure 2, lower row: histograms of the dipole intensity (x-axis: Dipole, 0 to 3) for (d) DW, (e) 3D and (f) Sint-2000. Skewness: −0.17 (DW), −0.46 (3D), 0.07 (Sint-2000).]
Figure 2. Scaled histograms of the dipole intensities of the four models and two palaeomagnetic reconstructions (PADM2M and Sint-2000). (a) G12, (b) P09,
(c) PADM2M, (d) DW, (e) 3-D and (f) Sint-2000. The y-axes of all histograms are scaled so that the area under the graph is equal to one and the x-axes are
scaled so that one corresponds to the average intensity. Also shown is the skewness which indicates the degree of asymmetry in the distribution. A thicker tail
near zero suggests that a model, or reconstruction, lingers in a state of low intensity.
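The scaling and the skewness described in this caption can be sketched in a few lines of standard-library Python. This is an illustration only, not the authors' code: the function names, the number of bins and the synthetic Gaussian test series are assumptions made here. The sketch rescales intensities so that one corresponds to the average intensity, builds a histogram whose area is one, and evaluates the skewness as the third standardized moment.

```python
import random


def scale_to_unit_mean(dipole):
    """Return |dipole| rescaled so that the average intensity equals one."""
    values = [abs(v) for v in dipole]
    mean = sum(values) / len(values)
    return [v / mean for v in values]


def density_histogram(values, n_bins=30):
    """Histogram normalized so the area under the graph is one.

    Assumes the values are not all identical (non-zero bin width).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        k = min(int((v - lo) / width), n_bins - 1)  # put v == hi in the last bin
        counts[k] += 1
    density = [c / (len(values) * width) for c in counts]
    edges = [lo + k * width for k in range(n_bins + 1)]
    return density, edges


def skewness(values):
    """Third standardized moment: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical test series: a symmetric distribution, so the skewness is near zero.
random.seed(0)
demo = scale_to_unit_mean([random.gauss(1.0, 0.3) for _ in range(20000)])
density, edges = density_histogram(demo)
area = sum(d * (edges[k + 1] - edges[k]) for k, d in enumerate(density))
demo_skew = skewness(demo)
```

With only 2000 samples, as in the reconstructions, the same estimator is noticeably noisier, which is one reason to read the reported skewness values with caution.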
and excursions as observed in palaeomagnetic reconstructions, for example PADM2M and Sint-2000 (Valet et al. 2005; Ziegler et al. 2011), but these events occur over different timescales and, for some of the models, the events also occur over timescales that are different from what is observed in Earth’s axial dipole field. The 3-D model, for example, has a reversal rate of about 0.7 reversals per Myr, but the reversal rates of the DW and P09 models are about 5 reversals per Myr, which is comparable to the average reversal rate of Earth over the last 25 Myr (Ogg 2012). In addition, the way a reversal occurs in each model can be different. Reversals of the G12 model are characterized by a continuous decay in intensity, immediately followed by a rapid increase in intensity. The DW model lingers at low intensity longer than the other models (see also Fig. 2). We also observe that the DW and 3-D models, and to a lesser extent the P09 model, exhibit multiple, rapid fluctuations in sign during a reversal, see Fig. 4. This behaviour is not observed in the G12 model.

Differences between the various models can be illustrated further by comparing histograms of their intensities, shown in Fig. 2, which also reports Pearson’s moment coefficient of skewness [third standardized moment, Kenney & Keeping (1966)] for the four models. It is clear that all models, except G12, are characterized by a negative skew. Thus, the P09, DW and 3-D models spend more time at a lower than average intensity than at a higher than average intensity. The G12 model tends to spend more time at a higher than average intensity than at a lower than average intensity. The DW model has the smallest skew (in amplitude) and a thicker left tail than the other three models, which indicates that it spends a considerable amount of time in low-intensity states (but other formulations of double-well models, with different parameters or even different parametrizations of the potential, may behave differently). Also shown in Fig. 2 are histograms of the intensities of two palaeomagnetic reconstructions, PADM2M and Sint-2000 (Valet et al. 2005; Ziegler et al. 2011), which document the time evolution of the virtual axial dipole moment (VADM) over the past 2 Myr at a frequency of 1 per kyr. PADM2M and Sint-2000 thus contain 2000 points and the histograms are not as well resolved as those of the four models (for which we used substantially longer simulations). This is also evident from the difference in the skewness, which is positive for Sint-2000, but negative for PADM2M. These values should be used with caution, because the estimates of skewness are contaminated by large sampling error (based on only 2000 intensities) and by the fact that low intensities (below 10 per cent) are not present in these reconstructions. In view of the large uncertainty in the reconstructions, all four models are reasonable, at least qualitatively in view of Figs 1 and 2, although all four models are constructed from drastically different assumptions and with very different modelling goals in mind.

2.4 Predictions, skill scores and ROC curves

We want to predict whether a low-dipole event will occur within an a priori specified time interval, called the prediction horizon. We thus consider only two outcomes of an experiment. Outcome 1: yes, the event occurred during the prediction horizon; outcome 2: no, the event did not occur during the prediction horizon. As is common, we denote the outcomes of an experiment by ‘positives’ and ‘negatives’:

Positive (P): the event occurred.
Negative (N): the event did not occur.

With two possible outcomes of an experiment, a prediction can result in one of four possibilities:
True positive (TP): predict that an event will occur and the event occurs.
False positive (FP): predict that an event will occur, but the event does not occur.
True negative (TN): predict that an event will not occur and the event does not occur.
False negative (FN): predict that an event will not occur, but the event occurs.

[Figure 3: schematic with the labels ‘Good strategy’, ‘Worse strategy’, ‘Bad strategy’ and ‘Chance line’, and the points A and B.]

The concepts and ideas described here have been used in many areas. We make an effort to be consistent in the terminology and to bring up only the definitions we need, sticking to commonly used names (Fawcett 2006; Joliffe 2016; Chicco & Jurman 2020). For a more thorough review of such predictions in the context of (medical) imaging, see Barrett & Myers (2003), Chapter 13, where a prediction strategy of the type discussed here is called a
[Figure 4: 3D model, signed dipole (y-axis: Dipole) against time (x-axis: Time, 32 to 33 Myr), with four events marked A, B, C and D.]
Figure 4. Excerpt of the 3-D simulation showing the signed dipole as a function of time (same scaling as in Fig. 1). Four events are labeled A–D.
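[Figure 5: three stacked panels against time. Top: dipole with events A and B, their event durations and the prediction horizon (PH), with ‘Today’ separating past and future. Centre: predictions, annotated TP, FP, TN, FN and ‘No prediction’. Bottom: truth, with positives (P) and negatives (N).]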
Figure 5. Illustration of the prediction strategy. Top graph: dipole (solid blue) as a function of time. The thin blue, green and red horizontal lines represent
the start-of-event, the end-of-event and the warning thresholds. Two low-dipole events are labeled A (reversal) and B (excursion), and we indicate their event
durations. Highlighted in red is a period of low intensity, which is not a low-dipole event, but where the low intensity causes false positives (FP). Towards
the right, we illustrate a prediction over a given prediction horizon, which will lead to true negatives (TN). The prediction horizon also defines the true labels
(see bottom panel). Centre graph: prediction as a function of time. The red line at zero corresponds to the prediction ‘no low-dipole event occurs during
the prediction horizon’, and the red line at one corresponds to the prediction ‘a low-dipole event occurs during the prediction horizon’. The thick black line
segments correspond to periods during which no prediction is made. For events A and B, we first observe TNs, followed by false negatives (FN), caused by the
warning threshold being small; then we observe TPs followed by a period during which no prediction is made. Bottom graph: true occurrences of low-dipole
events within the prediction horizon. The orange line at zero corresponds to negatives (N), that is ‘no low-dipole event occurs during the prediction horizon’.
The orange line at one corresponds to positives (P), that is ‘a low-dipole event occurs during the prediction horizon’.
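The labelling illustrated in Fig. 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the intensity is assumed to be the absolute dipole scaled so that its mean is one (so ST = 10 per cent becomes 0.1), an event is taken to start when the intensity first drops below ST and to end when it first exceeds ET, a warning is issued whenever the intensity is below WT, and the true label is positive when an event starts within the next `horizon` time steps. No prediction is made while an event is in progress.

```python
def find_events(intensity, start_threshold, end_threshold):
    """Return (start, end) index pairs of low-dipole events.

    Assumption: an event opens when the intensity drops below the
    start-of-event threshold and closes when it first exceeds the
    end-of-event threshold. An event still open at the end of the record
    is closed at len(intensity).
    """
    events = []
    start = None
    for i, value in enumerate(intensity):
        if start is None and value < start_threshold:
            start = i
        elif start is not None and value > end_threshold:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(intensity)))
    return events


def label_predictions(intensity, st, et, wt, horizon):
    """Label every time step as 'TP', 'FP', 'TN', 'FN' or None (no prediction)."""
    events = find_events(intensity, st, et)
    event_starts = {s for s, _ in events}
    in_event = [False] * len(intensity)
    for s, e in events:
        for i in range(s, e):
            in_event[i] = True

    labels = []
    for i, value in enumerate(intensity):
        if in_event[i]:
            labels.append(None)  # no prediction during an event
            continue
        # Truth: does an event start within the prediction horizon?
        truth = any(i + k in event_starts for k in range(1, horizon + 1))
        warning = value < wt  # prediction: intensity below the warning threshold
        if warning:
            labels.append("TP" if truth else "FP")
        else:
            labels.append("FN" if truth else "TN")
    return labels


# Hypothetical toy series with one event (intensity dips below 0.1).
series = [1.0, 0.9, 0.6, 0.4, 0.2, 0.05, 0.3, 0.9, 1.0]
events = find_events(series, 0.1, 0.8)
labels = label_predictions(series, st=0.1, et=0.8, wt=0.5, horizon=3)
```

In this toy series the step with intensity 0.6 lies within the horizon of the event but above the warning threshold, so it is a false negative, mirroring the FN segments that precede the TPs in the figure.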
thresholds as: ST = 10 per cent and ET = 80 per cent. With these choices, we focus on events that start when the intensity is very low and which end when the field has nearly fully recovered (see Fig. 4). The choice of ST = 10 per cent is guided by the consideration that we want to focus on events that correspond to reversals and major excursions. During a reversal, the signed dipole can reach an arbitrarily low value, before switching sign. During a major excursion, the dipole amplitude is very low, but we do not necessarily observe a switch in the sign. Moreover, palaeomagnetic reconstructions, such as PADM2M and Sint-2000 (Valet et al. 2005; Ziegler et al. 2011), have difficulties with resolving small dipole values. The palaeomagnetic reconstructions we consider below consist of signed Virtual Axial Dipole Moments (VADM), which are proxies for the true axial dipole magnitude. The weakest VADMs recorded are about 10–20 per cent of the present axial dipole field intensity (see, e.g. Constable & Korte 2006; Hulot et al. 2010a). This is caused by (i) VADM reconstructions sensing the non-dipole field during a low-dipole event; (ii) VADM reconstructions being temporally filtered by sediment recording processes and (iii) additional smoothing being introduced by modelling choices and stacking of the relative palaeointensity (RPI) records (some of the individual records may have a higher resolution and features that are not aligned in time are smoothed out). As we will see, by choosing ST = 10 per cent, we ensure that only events that experienced at least one temporary change of sign in the axial dipole are considered as events of interest within PADM2M and Sint-2000 (see Section 5).

Nonetheless, the precise values of ST and ET are not critical because our overall approach is robust with respect to these choices. This is evident from a limited number of numerical experiments we performed with different choices of ET and ST. Specifically, we tried the combinations ST = 10 per cent and ET = 50 per cent, ST = 20 per cent and ET = 50 per cent, and ST = 20 per cent and ET = 80 per cent and obtained qualitatively and quantitatively similar results.

We rescale time in each model so that the prediction results are comparable across the hierarchy of models. A natural choice for this timescale is the average event duration (AED). That is, we compute the average event duration given the natural timescale of each model, and then rescale time so that one time unit corresponds to one average event duration. The average event duration for each of the models is listed in Table 1. For the simplified models (G12, P09 and DW), the statistics of the event duration are computed from simulations that include about 550 events. For the 3-D model, we use the entire duration of the simulation to compute the statistics of the event duration.

The prediction horizon is defined as a fraction of the average event duration. We focus on the prediction horizon PH = 1 × AED, that is we focus on short-term predictions of low-dipole events,
Predicting dipole reversals 285
Table 1. This table summarizes key results obtained throughout the paper. We list all information in this one table to make it easier to make connections between
the various quantities listed. Description of each column. First column: the model or palaeomagnetic reconstruction considered. Second column: number of
low-dipole events in the verification portion of a simulation/palaeomagnetic reconstruction. Values in brackets are the number of events in the training data.
Third column: maximum MCC (prediction skill) achieved for optimal WT (see Section 2.4 for the definition of MCC). Values in brackets are for training data.
Verification and training data are explained in Section 4.2 for the models and in Section 5.2 for the palaeomagnetic reconstructions. Fourth column: optimal
WT that maximizes MCC over the training data (see Section 3.3). Fifth column: average duration of a low-dipole event (AED). Values in brackets are standard
deviations. Sixth column: average decay time (ADT) with standard deviations in brackets. See sections 4.1.2 (models) and 5.1 (palaeomagnetic reconstructions)
for definitions of average event duration and decay time and their computation. Seventh column: ratio of average decay time to average event duration. All
results listed here correspond to a prediction horizon PH = 1, a start-of-event threshold ST = 10 per cent, and an end-of-event threshold ET = 80 per cent.
Model       # of events   MCC           ŴT       Event duration (AED)   Decay time (ADT)      ρ = ADT/AED
G12         554 (5)       0.96 (0.97)   30.75%   3.2 kyr (0.1 kyr)      26.9 kyr (2.0 kyr)    8.49
P09         551 (5)       0.57 (0.56)   54.50%   6.0 kyr (3.9 kyr)      10.8 kyr (5.0 kyr)    1.81
DW          551 (5)       0.31 (0.30)   69.25%   16.1 kyr (12.2 kyr)    8.5 kyr (4.7 kyr)     0.53
3-D         368 (5)       0.12 (0.14)   17.50%   16.4 kyr (11.6 kyr)    5.7 kyr (4.2 kyr)     0.35
PADM2M      2 (4)         0.62 (0.73)   50.75%   11.7 kyr (8.1 kyr)     25.5 kyr (10.9 kyr)   2.19
Sint-2000   2 (4)         0.44 (0.77)   36.75%   10.2 kyr (8.7 kyr)     32.0 kyr (15.1 kyr)   3.1
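The MCC column reports the Matthews correlation coefficient. Assuming the standard definition from counts of true/false positives and negatives (see, e.g. Chicco & Jurman 2020), the score can be computed as sketched below, together with the F1 and CSI scores that the text uses as alternatives. The function names, the zero-denominator convention and the example counts are assumptions made for this illustration; the counts are not taken from the paper.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator vanishes (a common convention,
    assumed here), e.g. when no positive prediction is ever made.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def f1_score(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    total = 2 * tp + fp + fn
    return 2 * tp / total if total else 0.0


def csi(tp, fp, fn):
    """Critical success index (threat score)."""
    total = tp + fp + fn
    return tp / total if total else 0.0


# Hypothetical counts, for illustration only.
example_mcc = mcc(tp=90, fp=10, tn=80, fn=20)
example_f1 = f1_score(tp=90, fp=10, fn=20)
example_csi = csi(tp=90, fp=10, fn=20)
```

MCC ranges from −1 (every prediction wrong) through 0 (no better than chance) to 1 (every prediction correct), and, unlike F1 and CSI, it uses the true negatives as well.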
[Figure 6: four panels of ROC curves, true positive rate (y-axis: TPR, 0 to 1) against false positive rate (x-axis: FPR, 0 to 1); panel (a) contains an inset (‘Zoom’) near FPR = 0 to 0.005, TPR = 0.995 to 1; panels (c) and (d) are titled DW and 3D.]
Figure 6. ROC curves for the four models and three prediction horizons, with PH = 0.5 in green, PH = 1 in purple and PH = 1.5 in orange. (a) G12, (b) P09,
(c) DW (d) 3-D. A ROC curve is the collection of TPR/FPR pairs one obtains when varying the warning threshold. The thicker line corresponds to TPR/FPR
pairs for which ST < WT < ET. The thin lines continue the ROC curves for WT ≥ ET. The figure-in-figure in (a) (ROC curves for G12) shows a zoom near
the (0,1) point to illustrate that the three ROC curves, corresponding to different PHs, do not overlap. The ROC curves are computed using long simulations,
containing a large number of low-dipole events (see text for details).
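The construction of a ROC curve by varying the warning threshold can be sketched as follows. This is an illustration under assumptions, not the authors' code: `roc_curve` is a hypothetical helper, a warning is assumed to be issued when the intensity is below WT, the list `truth` is assumed to hold the true labels (whether an event occurs within the prediction horizon), and time steps at which no prediction is made are assumed to have been removed beforehand. The toy data are invented.

```python
def roc_curve(intensity, truth, thresholds):
    """Return a list of (FPR, TPR) pairs, one per warning threshold."""
    points = []
    for wt in thresholds:
        tp = fp = tn = fn = 0
        for value, event_ahead in zip(intensity, truth):
            warning = value < wt
            if warning and event_ahead:
                tp += 1
            elif warning:
                fp += 1
            elif event_ahead:
                fn += 1
            else:
                tn += 1
        tpr = tp / (tp + fn) if tp + fn else 0.0  # true positive rate
        fpr = fp / (fp + tn) if fp + tn else 0.0  # false positive rate
        points.append((fpr, tpr))
    return points


# Hypothetical toy data: low intensities tend to precede events.
intensity = [1.0, 0.8, 0.5, 0.3, 0.9, 0.4, 1.1, 0.2]
truth = [False, False, True, True, False, False, False, True]
points = roc_curve(intensity, truth, thresholds=[0.0, 0.35, 0.6, 2.0])
```

A threshold of zero never warns and gives the point (0, 0); a threshold above every intensity always warns and gives (1, 1). Thresholds in between trace the curve, and a skillful strategy passes close to the (0, 1) corner.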
a time interval of 50 non-dimensional time units for each model and each model exhibits two events during this time interval (recall that time is scaled by the average event duration, AED, see Table 1). The orange lines in the bottom of each subfigure are zero if no low-dipole event starts within the prediction horizon, and they are one if a low-dipole event starts during the prediction horizon. Because we show the same time-interval in non-dimensional units, the intervals during which the orange line is at one are of equal width across all four subfigures. The red lines in the centre panels are zero if no low-dipole event is predicted to start during the prediction horizon; the lines are one if a low-dipole event is predicted to start during the prediction horizon. Thus, the overlap of the red and orange lines defines TPs, FPs, TNs and FNs, and a large overlap corresponds to a skillful prediction. For example, a FP corresponds to a situation where the orange line is at zero while the red line is at one; a FN corresponds to a situation where the orange line is at one while the red line is at zero.

We note that predictions for the G12 model lead to a small number of false positives or false negatives. In the excerpt shown for the G12 model in Fig. 7, there is only one false positive, caused by the prediction starting one time step too early (during the first of the two events shown). For the DW and 3-D models, on the other hand, we note a large number of false positives and false negatives, which renders threshold-based predictions unreliable for these models. Comparisons of the graphs for G12 and the DW and 3-D models suggest that threshold-based predictions for the DW or 3-D model are indeed worse, by any measure, than those for the G12 model.

In the case of P09, we note a larger number of false positives and false negatives than in the case of G12, but false positives or false negatives occur less frequently than for the DW or 3-D models. Consistent with what was suggested by Fig. 6, the skill of threshold-based predictions for the P09 model thus seems to fall in between the skills of threshold-based predictions for the G12 (very high skill) and DW/3-D models (very low skills).

4.1.2 Quantitative comparison and ranking

We compute MCC skill scores to quantitatively compare the skill of threshold-based predictions for the various models. To avoid overfitting we now compute skill scores on verification data, that is data that are not used for computing the optimal WT, as described in Section 3.3.

We generate training and verification data as follows. For the G12, P09 and DW models, the training data are the long simulations that were also used in Section 3.2. The verification data are ten independent simulations, each of length $10^4$. For the 3-D model, we create training and verification data by ‘chopping up’ the overall simulation as follows. We split the simulation into two parts of equal length and use one for training and the other for verification. We then repeat the procedure, but split the simulation into three equally long portions, using one for training and two for verification. Finally, we split the simulation into four equally long portions, and use one for training and three for verification. This
procedure leads to six MCC scores over verification data. Generating multiple verification data sets in this way allows us to estimate the variability in the skill of threshold-based predictions for all four models.

Results are shown in Fig. 8, where we plot MCC scores for the four models for threshold-based predictions with prediction horizon PH = 1. We only show the results for a prediction horizon PH = 1, but one obtains qualitatively the same results with PH = 0.5 or PH = 1.5. We note that the variation in skill over the different verification data sets is small. This suggests that the verification data sets are ‘large enough’ so that variation in the verification data does not affect the scores. Moreover, our results confirm the ranking of the skill of threshold-based predictions that we anticipated from inspection of ROC curves. Specifically, we rank the models (skill from high to low) in terms of their predictability via intensity thresholding as: G12, P09, DW and 3-D. Indeed, we found that this result is independent of the choice of skill score: one obtains qualitatively and, to a large extent, quantitatively the same results using, for example, the F1 or CSI skill scores.

This ranking and, more generally, the skill of threshold-based predictions appears to be determined by an interplay of:

(i) The extent of variation in the dipole intensity: a high potential for false positives results if the intensity regularly dips to low values without a low-dipole event following.
(ii) The decay rate prior to a low-dipole event: a quick decay results in a high potential for false negatives.

For (i), we recall the intensity histograms of Fig. 2, which show that the P09, DW and 3-D models (low skill) spend more time at low intensity values than the G12 model (high skill). For (ii), we compute the average decay time (ADT), which measures how quickly the dipole intensity decays prior to a low-dipole event. We define the decay time as the absolute value of the time difference between the start of the event and the last previous instance at which
[Figure 8: MCC skill scores (y-axis: MCC, 0 to 1) for the G12, P09, DW and 3D models.]

duration of the training data set. This means that, for this model, one can find a useful WT from a rather short training period. For all other models, we observe a variation of the optimal WT as we vary the duration of the training period. The variations are most significant for the DW model, for which the optimal WT varies from about 25 per cent to nearly 80 per cent (which is the maximum

Fig. 10(b). Here, we use the optimal WTs obtained from the same various training periods (and shown in Fig. 10a), but compute the
[Figure 9: (a) ratio ADT/AED (ρ) for G12, P09, DW, 3D, PADM2M and Sint-2000; (b) G12 and (c) 3D excerpts of the dipole against time, annotated with the decay time and the event duration.]
Figure 9. (a) Ratio ρ of the average decay time (ADT) to the average event duration (AED) for the four models and two palaeomagnetic reconstructions
(PADM2M and Sint-2000, see Section 5). Also shown is the ρ = 1 line (dashed). (b) Illustration of the decay time and event duration of an event for G12. (c)
Illustration of the decay time and event duration of an event for the 3-D model. The beginning of the decay is marked in green, the start of an event is marked in
orange and the end of an event is marked in red. The decay time is the time interval between the start of the decay and the start of an event. The event duration
is the time interval between the start and end of an event.
[Figure 10: (a) warning threshold (%) and (b) MCC, each plotted against the number of events (0 to 50) for the G12, P09, DW and 3D models.]
Figure 10. (a) Optimal warning threshold as a function of the number of events contained in the training data. (b) MCC computed over verification data as a
function of the number of events contained in the training data. The prediction horizon is PH = 1.
most interested in here, and its short timescales. Smoothing over a time period of 4τ typically removes such short timescales (Hulot & Le Mouël 1994). This corresponds to about 2 kyr. This is the value we tested, as it also is roughly consistent with the smoothing due to the sedimentary process in palaeomagnetic reconstructions such as PADM2M and Sint-2000. For example, the regularization used to obtain the PADM2M reconstruction suppresses energy at timescales of 5–10 kyr (Ziegler et al. 2011). Finally, it is short enough compared to the decay time and event durations we identified for the field produced by the 3-D (and DW) model (see Table 1). For consistency, we then also used the same time filtering to filter the time-series produced by the DW model. In both cases, we used a moving average filter. Results are provided in Table 2, which lists the optimal MCC of threshold-based predictions for the DW and 3-D models with and without smoothing for three prediction horizons. We found that the skill of threshold-based predictions only slightly increases for the 3-D model, but hardly at all (to two digits) for the DW model. Thus, skills associated with the DW and 3-D models are nearly unchanged by the smoothing process, and remain smaller than the skills associated with the P09 and G12 models.

4.4 Summary of results from the hierarchy of models

The hierarchy of models is consistent in that threshold-based predictions become more difficult, or, equivalently, less skillful, when the prediction horizon increases. This suggests that threshold-based predictions are at best useful for predicting low-dipole events with a lead time that is comparable to the average duration of the event (about 10 kyr on Earth’s timescales). Moreover, the machinery of identifying thresholds by maximizing a skill score is robust in the
[Figure 11: four panels, (a) G12, (b) P09, (c) DW and (d) 3D, each plotting MCC (0 to 1) against the warning threshold (20 to 120 per cent).]
Figure 11. MCC skill score as a function of WT for the four models. (a) G12, (b) P09, (c) DW and (d) 3-D. The various graphs shown for each model differ
in the number of events contained in the training data (see text for details). The thin lines continue the curves for WT ≥ ET.
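A curve of this kind, and the warning threshold that maximizes it, can be obtained by a simple grid search. The sketch below is an illustration under assumptions, not the authors' procedure: the function names, the grid spacing and the toy data are invented, a warning is assumed to be issued when the intensity is below WT, and the standard definition of MCC is assumed, with a score of zero when its denominator vanishes.

```python
import math


def confusion_counts(intensity, truth, wt):
    """Count (tp, fp, tn, fn) for the rule 'warn if intensity < wt'."""
    tp = fp = tn = fn = 0
    for value, event_ahead in zip(intensity, truth):
        warning = value < wt
        if warning and event_ahead:
            tp += 1
        elif warning:
            fp += 1
        elif event_ahead:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_warning_threshold(intensity, truth, candidates):
    """Return (wt, score) for the first candidate with the highest MCC."""
    scores = [(mcc(*confusion_counts(intensity, truth, wt)), wt)
              for wt in candidates]
    best_score, best_wt = max(scores, key=lambda pair: pair[0])
    return best_wt, best_score


# Hypothetical training data and a grid of thresholds from 5 to 120 per cent.
intensity = [1.0, 0.8, 0.5, 0.3, 0.9, 0.4, 1.1, 0.2]
truth = [False, False, True, True, False, False, False, True]
grid = [k / 20 for k in range(1, 25)]
best_wt, best_score = best_warning_threshold(intensity, truth, grid)
```

In this toy example every threshold between 0.55 and 0.8 yields the same score, so the maximum sits on a plateau and the search simply returns its first point. When the skill score is a nearly flat function of the threshold, the location of the optimum is poorly constrained even though the achievable score is not.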
Table 2. Maximum MCC of threshold-based predictions for the DW and 3-D models with and without smoothing (smoothing window is 4τ ≈ 2 kyr) for three different prediction horizons. Optimal warning thresholds and MCC scores are computed over the entire run (no verification).

                          Prediction horizon
Model   Smoothing         0.5     1       1.5
DW      No smoothing      0.39    0.32    0.27
DW      2 kyr smoothing   0.39    0.32    0.27
3-D     No smoothing      0.28    0.22    0.18
3-D     2 kyr smoothing   0.32    0.24    0.19

sense that the skill during training is comparable to the skill during verification. Our overall approach is also robust with respect to the precise choices of start-of-event and end-of-event thresholds, and with respect to the choice of skill score (MCC, F1 or CSI).

We observe strong differences in the skills of threshold-based predictions across the various models. The DW and 3-D models exhibit complex behaviour during reversals or excursions, with many polarity changes during the low-dipole event, and the decay time is short compared to the event duration (fast reversals). The G12 model behaves differently: we do not observe quick polarity changes during a G12 reversal, no major excursions occur, and the decay time is larger than the event duration (slow reversals). The G12 model is more amenable to threshold-based predictions than the DW or 3-D models, because of its simpler reversing behaviour and because reversals are approached slowly. The P09 model falls in between the DW and 3-D models and the G12 model.

Our numerical experiments with short training data sets suggest that the main difficulty for threshold-based predictions may not be the shortness of the observational record. The hierarchy of models is surprisingly consistent in that one may be able to determine useful WTs, even if the training data are limited. The reasons why this stability occurs, however, vary across the hierarchy of models. For the G12 model, low-dipole events are indeed easy to predict by a threshold and this threshold can be found by optimizing skill scores over short data sets. For the other models, the skill score is a nearly flat function of the threshold, that is different thresholds can lead to similar skill scores (recall Fig. 11). More importantly, the overall skill of threshold-based predictions is low for the DW and 3-D models, even when introducing some smoothing. Thus, threshold-based predictions may be of limited use for the DW and 3-D models, because false positives and false negatives occur frequently. Again, the P09 model falls in between the G12 and DW/3-D models.

We summarize our main results about threshold-based predictions for dipole models as follows.

(i) Across the hierarchy of models, the skill of threshold-based predictions degrades with the prediction horizon.
(ii) Across the hierarchy of models, threshold-based predictions are robust to minor variations of numerical details, such as choice of skill score (MCC or F1 or CSI), or choices of start-of-event and end-of-event thresholds.
(iii) Across the hierarchy of models, useful WTs can be found even if the duration of the training period is short and comparable to the observational record. This suggests that the shortness of the observational record is not the main issue that makes computing WTs difficult. The reasons why this is the case, however, differ across the hierarchy of models.
(iv) The G12 model is more amenable (highest skill) to threshold-based predictions than the DW or 3-D models (lowest skill). The skill
of threshold-based predictions for the P09 model falls in between the skills for G12 and DW/3-D. Furthermore, we found that skills strongly correlate with the ratio of the average decay time to the average event duration.

[Figure: ADT (kyr) plotted against AED (kyr), both axes from 0 to 50, for G12, P09, DW, 3D, PADM2M and Sint-2000.]

5 APPLICATION TO PALAEOMAGNETIC RECONSTRUCTIONS

We now take advantage of the lessons learned from the hierarchy of models and apply threshold-based predictions to the PADM2M and Sint-2000 palaeomagnetic reconstructions, which provide proxies of the Earth’s axial dipole intensity over the past 2 Myr (Valet et al. 2005; Ziegler et al. 2011). More specifically, PADM2M and Sint-2000 report the virtual axial dipole moment (VADM) in increments of 1 kyr for the past 2 Myr. We scale each reconstruction so that one
5.2 Threshold-based predictions and their skills

We now apply threshold-based predictions to PADM2M and Sint-2000 using the same techniques as above and, as before, consider prediction horizons PH = 0.5, PH = 1 and PH = 1.5. Note that these PHs correspond to about 6, 11 and 17 kyr in geophysical time. The ROC curves of threshold-based predictions for PADM2M and Sint-2000 are shown in Figs 13(a) and (b). These curves are computed over the entire 2 Myr time window covered by the palaeomagnetic reconstructions. Inspecting the ROC curves qualitatively, we see that the skill of threshold-based predictions decreases with the prediction horizon. We observed this also for all four models. Comparing the ROC curves of the palaeomagnetic reconstructions in Figs 13(a) and (b) with the ROC curves of the models in Fig. 6, we find that the former resemble those of the P09 model. Fig. 13(c) shows the curve traced out by the MCC

–0.25 Myr marks). One of these instances of false positives occurs during training, the other during verification. Such false positives do not occur in the case of Sint-2000, which also has a lower optimal WT of $\widehat{\mathrm{WT}}_{\text{Sint-2000}} = 36.75$ per cent (corresponding to $2.14 \times 10^{22}$ Am$^2$). One may thus intuitively expect that the predictions will have a lower skill when applied to PADM2M than to Sint-2000, but in fact this is not the case: the skill during verification is higher for PADM2M than for Sint-2000, but the skill for training is higher for Sint-2000 than for PADM2M. This is perhaps counterintuitive because one is tempted to think of false positives that occur ‘far’ from a reversal as more severe than false positives or false negatives that occur ‘close’ to a reversal. The MCC score, however, does not apply special meaning to the categories of ‘positive’ and ‘negative’, so that predictions of the timing of the two reversals, for example during verification, are more accurate for PADM2M than for Sint-2000.
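
The remark that the MCC does not attach special meaning to the categories of ‘positive’ and ‘negative’ can be checked with a minimal Python sketch that evaluates the three skill scores (MCC, F1 and CSI) from the four confusion-matrix counts, using their standard textbook definitions. The function name `skill_scores` and the counts below are illustrative assumptions, not values or code from this study.

```python
import math


def skill_scores(tp, fp, fn, tn):
    """Return (MCC, F1, CSI) from confusion-matrix counts.

    tp, fp, fn, tn: numbers of true positives, false positives,
    false negatives and true negatives. Scores with a zero
    denominator are set to 0.0 (a convention chosen for this sketch).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) > 0 else 0.0
    csi = tp / (tp + fp + fn) if (tp + fp + fn) > 0 else 0.0
    return mcc, f1, csi


# Hypothetical counts: low-dipole events (positives) are rare.
tp, fp, fn, tn = 30, 10, 20, 940

original = skill_scores(tp, fp, fn, tn)
# Relabel the classes: positives become negatives and vice versa,
# so TP <-> TN and FP <-> FN.
swapped = skill_scores(tn, fn, fp, tp)

print("MCC, F1, CSI (original labels):", original)
print("MCC, F1, CSI (swapped labels): ", swapped)
```

With these counts the MCC is unchanged when the two classes are relabelled, whereas F1 and CSI change because they ignore true negatives. This is one way to see why a false positive ‘far’ from a reversal carries no extra weight in the MCC.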
Figure 13. Panels (a) and (b): ROC curves for two palaeomagnetic reconstructions and three prediction horizons, with PH = 0.5 (about 6 kyr) in green, PH =

these difficulties away and fix the average intensity a priori. We find that this is more practically relevant because the average intensity may be determined by using additional information. Nonetheless, we also made threshold-based predictions for which we compute the average event duration based on training data and the results are

[Figure: MCC for G12, P09, DW, 3D, PADM2M and Sint-2000.]

6 CONCLUDING COMMENTS

The main purpose of this study is to test the possibility that a low
A first major conclusion is that the skills of intensity threshold-based predictions vary surprisingly widely within the hierarchy of numerical models we investigated (G12, P09, DW and 3-D models). The only model that leads to a high skill (implying that the intensity threshold-based predictions are reliable) is the G12 model. This result is in line with the results obtained by Morzfeld et al. (2017), who identified a high skill of intensity threshold-based predictions for this model, using a simpler strategy and a less robust analysis. All other models lead to lower skills, implying that the intensity threshold-based predictions are less reliable. This is, again, consistent with Morzfeld et al. (2017), who investigated the P09 model and a model (B13, Buffett et al. 2013) similar to the DW model, but did not investigate the 3-D model. In this study, we were able to rank these skills more accurately and identify one key property that may play a major role in defining the skills of threshold-based predictions in the context of numerical dynamos and VADM reconstructions (PADM2M and Sint-2000).

This key property is that skills of intensity threshold-based predictions correlate with the ratio of the average decay time (defined as the time between the start of the event and the most recent time instance at which the intensity is equal to the end-of-event threshold) to the average event duration. The larger this ratio, the better the skill. The models and the PADM2M and Sint-2000 reconstructions are consistent with this rule. As already noted, this asymmetry between the way the field decreases towards a reversal and the way it recovers its strength after the reversal is a well-known property of the field (Valet et al. 2005), which in fact may be related to the more general tendency of the Earth’s magnetic field to spend more time decreasing than increasing at any time (see, e.g. Ziegler & Constable 2011; Avery et al. 2017). What this study thus suggests is that this slight asymmetry is what defines the skill of intensity threshold-based predictions when applied to Earth’s magnetic field. Unfortunately, because this ratio is about two to three, the skill of threshold-based predictions is limited. As our study further shows, this, more than the relatively short duration of the Sint-2000 and PADM2M reconstructions, is what likely makes intensity threshold-based predictions using these data only modestly reliable.

Despite the limitations we identified for intensity threshold-based predictions, it is worth pointing out that today’s axial dipole field, with a magnitude of about $7.8 \times 10^{22}$ Am$^2$ (Constable & Korte 2006), is much larger than the WTs we identified by using either Sint-2000 ($\widehat{\mathrm{WT}}_{\text{Sint-2000}} = 36.75$ per cent of the average $5.81 \times 10^{22}$ Am$^2$, which amounts to $2.14 \times 10^{22}$ Am$^2$) or PADM2M ($\widehat{\mathrm{WT}}_{\text{PADM2M}} = 50.75$ per cent of the average $5.32 \times 10^{22}$ Am$^2$, amounting to $2.70 \times 10^{22}$ Am$^2$). Intensity threshold-based predictions thus suggest that no low-dipole event will occur within the next 10 kyr. This is in line with many other recent predictions (see, e.g. Constable & Korte 2006; Morzfeld et al. 2017; Brown et al. 2018).

As an interesting additional outcome of this study, we note that testing the skills of threshold-based predictions on numerical dynamos is a fairly discriminating way of testing the Earth-like nature
of the axial dipole field behaviour of the models. This skill is distinct from the ability of numerical simulations to reproduce the frequency with which reversals occur. This is evident from the fact that threshold-based predictions have different skills for the DW and P09 models, whereas both models are characterized by reversal frequencies comparable to that of the Earth over the last 25 Myr (about 5 reversals per Myr). As this skill appears to be correlated with the ratio of the average decay time to the average event duration (a measure of the asymmetry with which the field evolves towards a reversal and next recovers its full strength), it also appears to be distinct from other criteria often used to characterize the Earth’s dipole field behaviour, such as its frequency content (Constable & Johnson 2005), or the relative time spent in transitional periods (based on dipole latitudes being less than 45°), as recently suggested by Sprain et al. (2019). Furthermore, in spite of its favourable ratings according to the criteria defined by Christensen et al. (2010) for the

This leads to the interesting possibility of finding a better suited low-dimensional model with properties intermediate between the G12 model (whose decay-time properties make it well suited for DA) and P09 (with intensity threshold-based prediction properties closest to those of the palaeomagnetic reconstructions), leading to better predictions of reversals several kyr ahead.

ACKNOWLEDGEMENTS

KG acknowledges that this work was supported by NASA Headquarters under the NASA Earth and Space Science Fellowship Program, Grant ‘80NSSC18K1351’. This work was supported in part by the French Agence Nationale de la Recherche under grant ANR-19-CE31-0019 (revEarth). All authors would like to thank Nathanael Schaeffer (ISTerre, CNRS, Université Grenoble Alpes)
Christensen, U.R. & Wicht, J., 2015. Numerical dynamo simulations, in Core Dynamics, Vol. 8: Treatise on Geophysics, Chapter 8, 2nd edn, pp. 245–277, eds Olson, P. & Schubert, G., Elsevier.
Christensen, U.R., Aubert, J. & Hulot, G., 2010. Conditions for Earth-like geodynamo models, Earth planet. Sci. Lett., 296(3–4), 487–496.
Constable, C. & Johnson, C., 2005. A paleomagnetic power spectrum, Phys. Earth planet. Inter., 153, 61–73.
Constable, C. & Korte, M., 2006. Is Earth’s magnetic field reversing?, Earth planet. Sci. Lett., 246, 1–16.
Fawcett, T., 2006. An introduction to ROC analysis, Pattern Recognit. Lett., 27(8), 861–874.
Finlay, C.C., Aubert, J. & Gillet, N., 2016. Gyre-driven decay of the Earth’s magnetic dipole, Nat. Commun., 7, 10422.
Fournier, A., et al., 2010. An introduction to data assimilation and predictability in geomagnetism, Space Sci. Rev., 155, 247–291.
Gissinger, C., 2012. A new deterministic model for chaotic reversals, Eur. Phys. J. B., 85, 137.
Lowrie, W. & Kent, D., 2004. Geomagnetic polarity time scale and reversal frequency regimes, Timescal. Paleomag. Field, 145, 117–129.
Meduri, D. & Wicht, J., 2016. A simple stochastic model for dipole moment fluctuations in numerical dynamo simulations, Front. Earth Sci., 4, 38.
Morzfeld, M. & Buffett, B.A., 2019. A comprehensive model for the kyr and Myr timescales of Earth’s axial magnetic dipole field, Nonlin. Proc. Geophys., 26(3), 123–142.
Morzfeld, M., Fournier, A. & Hulot, G., 2017. Coarse predictions of dipole reversals by low-dimensional modeling and data assimilation, Phys. Earth planet. Inter., 262, 8–27.
Ogg, J., 2012. Geomagnetic polarity time scale, in The Geologic Time Scale 2012, Chapter 5, pp. 85–113, eds Gradstein, F., Ogg, J., Schmitz, M. & Ogg, G., Elsevier Science.
Olson, P., Driscoll, P. & Amit, H., 2009. Dipole collapse and reversal precursors in a numerical dynamo, Phys. Earth planet. Inter., 173(1), 121–140.
Olson, P., Deguen, R., Hinnov, L.A. & Zhong, S., 2013. Controls on geomagnetic reversals and core evolution by mantle convection in the phanero-