Abstract
This short note outlines the general approach used for the forecasting part of the M6 forecasting competition. It describes a meta-learning approach based on an encoder-decoder hypernetwork, capable of identifying the most appropriate parametric model for a given family of related prediction tasks. In addition to its application in the M6 forecasting competition, we also evaluate it on the sinusoidal regression problem. There, the proposed method outperforms established methods by an order of magnitude, achieving near-oracle performance.
* This research project is ongoing, and this text is only a rough preliminary draft intended
to spark discussion and elicit feedback. With that in mind, thank you for taking the time to
read it.
E-mail: filip.stanek@cerge-ei.cz.
CERGE-EI, a joint workplace of Charles University and the Economics Institute of the
Czech Academy of Sciences, Politickych veznu 7, 111 21 Prague, Czech Republic.
2 Method
2.1 Notation
Following the notation of Hospedales et al. (2021), a task $\mathcal{T} = \{D, L\}$ consists of data $D = \{(x_i, y_i)\}_{i=1}^{N}$ and a loss function $L$. The loss function $L(D; \theta, \omega)$ measures the prediction performance on dataset $D$, given a vector of task-specific parameters $\theta$ and a vector of meta parameters $\omega$, which are shared across tasks.
Throughout the text, we focus on the canonical supervised learning problem. Given a collection of $M$ observed tasks $\{\mathcal{T}_m\}_{m=1}^{M}$, the finite sample equivalent of the problem is stated as follows:
$$\hat{\omega} = \arg\min_{\omega} \frac{1}{M} \sum_{m=1}^{M} L(D_m^{\mathrm{val}}; \hat{\theta}_m, \omega) \tag{4}$$
$$\text{s.t.:}\quad \hat{\theta}_m = \kappa_\omega(D_m^{\mathrm{train}}, L_m) \approx \arg\min_{\theta_m} L(D_m^{\mathrm{train}}; \theta_m, \omega).$$
where both $\hat{\omega}$ and $\{\theta_m\}_{m=1}^{M}$ are optimized simultaneously. In effect, a simultaneous search over both the parametric function $f_\omega(\cdot; \theta_m) = f(\cdot; g(\theta_m; \omega))$ and the corresponding parameters $\{\theta_m\}_{m=1}^{M}$ is performed.
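To make this simultaneous search concrete, the following is a minimal PyTorch sketch of jointly optimizing $\omega$ and all $\theta_m$ by gradient descent on the average task loss. The linear choices of $f$ and $g$, the synthetic data, and all hyperparameters are illustrative assumptions, and the train/validation split of Eq. (4) is omitted for brevity; this is not the paper's implementation.

```python
import torch

# Toy instance of Eq. (4): M tasks, each with its own theta_m; omega is shared.
M, d_theta = 8, 2
tasks = [(torch.randn(20, d_theta), torch.randn(20, 1)) for _ in range(M)]

Theta = torch.zeros(M, d_theta, requires_grad=True)  # task-specific {theta_m}
omega = torch.ones(d_theta, requires_grad=True)      # shared meta parameters

opt = torch.optim.Adam([Theta, omega], lr=0.05)      # one optimizer for both
for _ in range(500):
    loss = 0.0
    for m, (x, y) in enumerate(tasks):
        beta = omega * Theta[m]                  # toy linear g: beta = g(theta_m; omega)
        y_hat = x @ beta.unsqueeze(1)            # toy linear f: y_hat = f(x; beta)
        loss = loss + ((y - y_hat) ** 2).mean()  # L(D_m; theta_m, omega)
    (loss / M).backward()
    opt.step(); opt.zero_grad()
```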
To allow for maximal flexibility, we express both the function $f(\cdot; \beta)$ and $\beta = g(\cdot; \omega)$ as feedforward neural networks. The total size of the network $f(\cdot; \beta)$, represented by $d_\beta = \mathrm{card}(\beta)$, controls the level of complexity with which the ...
[Figure: Schematic of the MtMs architecture. An input $x$ and a one-hot task indicator $q$ enter the network; the mesa module selects the task-specific parameters, $\theta = \Theta q^\top$; the meta module maps them to the base-model weights, $\beta = g(\theta; \omega)$; and the base model produces the prediction, $\hat{y} = f(x; \beta)$.]
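To connect the figure to code, below is a minimal PyTorch sketch of the three modules; all names are ours, as the paper's implementation is not reproduced in this excerpt. The base-model layout anticipates the one used in Section 3.1: two hidden layers of width 40 with ReLU non-linearities, whose weights $\beta$ are produced by a linear meta module rather than trained directly.

```python
import torch
import torch.nn as nn

SIZES = (1, 40, 40, 1)  # base-model layout used later in Section 3.1
D_BETA = sum(SIZES[k] * SIZES[k + 1] + SIZES[k + 1] for k in range(len(SIZES) - 1))

def base_forward(x, beta):
    """Base model f(x; beta): an MLP whose weights are unpacked from the flat vector beta."""
    h, i = x, 0
    for k in range(len(SIZES) - 1):
        n_in, n_out = SIZES[k], SIZES[k + 1]
        W = beta[i:i + n_in * n_out].view(n_out, n_in); i += n_in * n_out
        b = beta[i:i + n_out]; i += n_out
        h = h @ W.T + b
        if k < len(SIZES) - 2:  # ReLU on hidden layers only
            h = torch.relu(h)
    return h

class MtM(nn.Module):
    def __init__(self, n_tasks, d_theta=2):
        super().__init__()
        # mesa module: theta = Theta q^T picks out the entry of Theta belonging
        # to the task indicated by the one-hot vector q (tasks stored as rows here).
        self.Theta = nn.Parameter(0.01 * torch.randn(n_tasks, d_theta))
        # meta module g(.; omega): a single linear map, as in Section 3.1.
        self.g = nn.Linear(d_theta, D_BETA)

    def forward(self, x, m):
        theta = self.Theta[m]           # mesa module
        beta = self.g(theta)            # meta module: beta = g(theta; omega)
        return base_forward(x, beta)    # base model: y_hat = f(x; beta)
```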
3 Applications
3.1 Sinusoidal Task
To evaluate the potential of MtMs to find the most appropriate parametric model for a given family of prediction problems, we first consider a simulation exercise originally proposed by Finn et al. (2017) to test the performance of MAML. Since then, this environment has frequently been used to compare competing meta-learning methods.
In particular, the tasks $\mathcal{T}_m = \{D_m, L_m\}$ are generated according to the following DGP:
$$
\begin{aligned}
A_m &\sim U(0.1, 5) \\
b_m &\sim U(0, \pi) \\
x_{m,i} \mid A_m, b_m &\sim U(-5, 5) \\
y_{m,i} \mid x_{m,i}, A_m, b_m &= A_m \sin(x_{m,i} + b_m)
\end{aligned}
\tag{10}
$$
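A minimal sketch of this DGP follows; the function name, the task size n, and the split point k between training and validation observations are our assumptions (the excerpt does not specify $K$).

```python
import torch

def sample_task(n=10, k=5):
    """Draw one task T_m from the DGP in Eq. (10).
    The first k points serve as D_train, the rest as D_val (k is assumed)."""
    A = 0.1 + 4.9 * torch.rand(1)       # A_m ~ U(0.1, 5)
    b = torch.pi * torch.rand(1)        # b_m ~ U(0, pi)
    x = 10.0 * torch.rand(n, 1) - 5.0   # x_{m,i} ~ U(-5, 5)
    y = A * torch.sin(x + b)            # y_{m,i} = A_m * sin(x_{m,i} + b_m)
    return (x[:k], y[:k]), (x[k:], y[k:])
```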
The goal is to find the best model that can predict $y_{m,i}$ based on $x_{m,i}$ for $i > K$ after observing only $D_m^{\mathrm{train}}$, as measured by the mean squared error:
$$L_m(D_m^{\mathrm{val}}; \hat{\theta}_m, \omega) = \frac{1}{N-K} \sum_{i=K+1}^{N} \left(y_{m,i} - f_\omega(x_{m,i}; \hat{\theta}_m)\right)^2 \tag{11}$$
$$\text{s.t.:}\quad \hat{\theta}_m = \kappa_\omega(D_m^{\mathrm{train}}, L_m)$$
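How $\kappa_\omega$ is carried out for an unseen task is not specified in this excerpt; since $d_\theta$ is small, one natural choice is to fit a fresh $\theta$ on $D^{\mathrm{train}}$ by gradient descent with $\omega$ frozen. The sketch below, reusing base_forward and MtM from Section 2, illustrates this assumed adaptation step followed by the validation MSE of Eq. (11).

```python
import torch

def adapt_and_evaluate(model, task, steps=200, lr=0.1):
    """Fit theta on D_train with omega frozen (an assumed form of kappa_omega),
    then return the validation MSE of Eq. (11). Settings are illustrative."""
    (x_tr, y_tr), (x_val, y_val) = task
    for p in model.parameters():        # freeze omega (and Theta) of the trained model
        p.requires_grad_(False)
    theta = torch.zeros(model.Theta.shape[1], requires_grad=True)
    opt = torch.optim.Adam([theta], lr=lr)
    for _ in range(steps):
        loss = ((base_forward(x_tr, model.g(theta)) - y_tr) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        return ((base_forward(x_val, model.g(theta)) - y_val) ** 2).mean().item()
```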
For a fair comparison, we follow Finn et al. (2017) and set the base model to be a feedforward neural network with two hidden layers of size 40 and ReLU non-linearities. The number of mesa parameters, $d_\theta$, is set to 2, and the meta module $g(\cdot; \omega)$ is a simple fully connected feedforward network with no hidden layers or non-linearities. For training the MtMs, it is entirely sufficient to use only 1,000 distinct tasks. This is far fewer than the 70,000 tasks originally used in Finn et al. (2017) and in the follow-up studies. Likewise, the training is done with a fraction of the computational resources: it takes approximately half an hour on a consumer-grade, mid-range CPU1, which is in sharp contrast to the powerful ...
1 AMD Ryzen 7 4700U
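For completeness, here is a sketch of the meta-training loop just described, reusing MtM and sample_task from the earlier sketches. The paper specifies 1,000 distinct tasks, $d_\theta = 2$, and a linear meta module; the optimizer, batch size, and number of steps below are our assumptions, and for brevity both $\omega$ and $\Theta$ are updated on the same loss, whereas Eq. (4) distinguishes the two splits.

```python
import torch

M = 1000                                     # number of distinct training tasks
tasks = [sample_task() for _ in range(M)]    # sinusoidal tasks from Eq. (10)
model = MtM(n_tasks=M, d_theta=2)            # mesa + meta + base modules
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer settings

for step in range(20_000):                   # assumed number of steps
    batch = torch.randint(0, M, (32,)).tolist()  # assumed task mini-batch size
    loss = 0.0
    for m in batch:
        _, (x_val, y_val) = tasks[m]
        loss = loss + ((model(x_val, m) - y_val) ** 2).mean()  # MSE of Eq. (11)
    (loss / len(batch)).backward()
    opt.step(); opt.zero_grad()
```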
[Figure: Predictions $\hat{y} = f_\omega(x; [\theta_1, \theta_2])$ of the trained model as a function of $x \in [-5, 5]$, for a range of values of the first mesa parameter $\theta_1$ (top panel) and of the second mesa parameter $\theta_2$ (bottom panel).]
4 Concluding remarks
The satisfactory performance in the M6 forecasting competition and the unpar-
alleled performance in the sinusoidal regression task indicate that the proposed
MtMs approach might be useful in practical applications. Currently, we are
testing MtMs on data from the M4 forecasting competition (Makridakis et al.,
2020), and preliminary results seem promising. The mesa parameters here gen-
erally tend to capture some combination of time series persistence and season-
ality. Generally, MtMs appears well-suited for time series forecasting, as the
DGPs typically have some common patterns (e.g., seasonality), yet at the same
time, they are not identical. Another promising area might be few-shot image
recognition, a typical domain of meta-learning approaches. However, we are not
currently pursuing this avenue due to the lack of computational resources.