POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem
Jürgen Schmidhuber
The Swiss AI Lab IDSIA, Galleria 2, 6928 Manno-Lugano
University of Lugano & SUPSI, Switzerland
22 December 2011 (arXiv:1112.5309v1), revised 4 November 2012
Abstract
Most of computer science focuses on automatically solving given computational problems. I focus on automatically inventing or discovering problems in a way inspired by the playful behavior of animals and humans, to train a more and more general problem solver from scratch in an unsupervised fashion. Consider the infinite set of all computable descriptions of tasks with possibly computable solutions. Given a general problem solving architecture, at any given time, the novel algorithmic framework POWERPLAY [46] searches the space of possible pairs of new tasks and modifications of the current problem solver, until it finds a more powerful problem solver that provably solves all previously learned tasks plus the new one, while the unmodified predecessor does not. Newly invented tasks may require achieving a wow-effect by making previously learned skills more efficient such that they require less time and space. New skills may (partially) re-use previously learned skills. The greedy search of typical POWERPLAY variants uses time-optimal program search to order candidate pairs of tasks and solver modifications by their conditional computational (time & space) complexity, given the stored experience so far. The new task and its corresponding task-solving skill are those first found and validated. This biases the search towards pairs that can be described compactly and validated quickly. The computational costs of validating new tasks need not grow with task repertoire size. Standard problem solver architectures of personal computers or neural networks tend to generalize by solving numerous tasks outside the self-invented training set; POWERPLAY's ongoing search for novelty keeps breaking the generalization abilities of its present solver. This is related to Gödel's sequence of increasingly powerful formal theories based on adding formerly unprovable statements to the axioms without affecting previously provable theorems. The continually increasing repertoire of problem solving procedures can be exploited by a parallel search for solutions to additional externally posed tasks. POWERPLAY may be viewed as a greedy but practical implementation of basic principles of creativity [42, 45]. A first experimental analysis can be found in separate papers [53, 52].
1 Introduction
Given a realistic piece of computational hardware with specific resource limitations, how can one devise
software for it that will solve all, or at least many, of the a priori unknown tasks that are in principle easily
solvable on this architecture? In other words, how to build a practical general problem solver, given the
computational restrictions? It does not need to be universal and asymptotically optimal [17, 13, 41, 44] like
the recent (not necessarily practically feasible) general problem solvers discussed in Section 9.1; instead it
should take into account all constant architecture-specific slowdowns ignored in the asymptotic optimality
notation of theoretical computer science, and be generally useful for real-world applications.
Let us draw inspiration from biology. How do initially helpless human babies become rather general
problem solvers over time? Apparently by playing. For example, even in the absence of external reward or
hunger they are curious about what happens if they move their eyes or fingers in particular ways, creating
little experiments which lead to initially novel and surprising but eventually predictable sensory inputs,
while also learning motor skills to reproduce these outcomes. (See [31, 30, 37, 42, 45, 62] and Section
9.3 for previous artificial systems of this type.) Infants continually seem to invent new tasks that become
boring as soon as their solutions become known. Easy-to-learn new tasks are preferred over unsolvable or
hard-to-learn tasks. Eventually the numerous skills acquired in this creative, self-supervised way may get
re-used to facilitate the search for solutions to external problems, such as finding food when hungry. While
kids keep inventing new problems for themselves, they move through remarkable developmental stages
[23, 2, 11].
Here I introduce a novel unsupervised algorithmic framework for training a computational problem
solver from scratch, continually searching for the simplest (fastest to find) combination of task and corre-
sponding task-solving skill to add to its growing repertoire, without forgetting any previous skills (Section
2), or at least without decreasing average performance on previously solved tasks (Section 7.1). New skills
may (partially) re-use previously learned skills. Every new task added to the repertoire is essentially de-
fined by the time required to invent it, to solve it, and to demonstrate that no previously learned skills got
lost. The search takes into account that typical problem solvers may learn to solve tasks outside the growing
self-made training set due to generalization properties of their architectures. The framework is called POWERPLAY because it continually [25] aims at boosting computational prowess and problem solving capacity,
reminiscent of humans or human societies trying to boost their general power/capabilities/knowledge/skills
in playful ways, even in the absence of externally defined goals, although the skills learned by this type of
pure curiosity may later help to solve externally posed tasks.
Unlike our first implementations of curious/creative/playful agents from the 1990s [30, 54, 37] (Section 9.3; compare [1, 5, 22, 20]), POWERPLAY provably (by design) does not have any problems with online learning—it cannot forget previously learned skills, automatically segmenting its life into a sequence of clearly identified tasks with explicitly recorded solutions. Unlike the task search of theoretically optimal creative agents [42, 45] (Section 9.3), POWERPLAY's task search is greedy, but at least practically feasible.
Some claim that scientists often invent appropriate problems for their methods, rather than inventing
methods to solve given problems. The present paper formalizes this in a way that may be more convenient
to implement than those of previous work [30, 37, 42, 45], and describes a simple practical framework for
building creative artificial scientists or explorers that by design continually come up with the fastest to find,
initially novel, but eventually solvable problems.
the new problem solver (some modification of the old one). (3) The new solver can still solve the known
set of previously learned tasks.
Once such a pair is found, the cycle repeats itself. This will result in a continually growing set of known tasks solvable by an increasingly powerful problem solver. Solutions to new tasks may (partially) re-use solutions to previously learned tasks.
Smart search (e.g., Section 4.1, Algorithm 4.1) orders candidate pairs of the type (task, solver) by
computational complexity, using concepts of optimal universal search [17, 41], with a bias towards pairs
that can be described by few additional bits of information (given the experience so far) and that can be
validated quickly.
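For concreteness, here is a minimal Python sketch of the length-based ordering idea, assuming candidate (task, solver-modification) pairs are encoded as bitstrings over B = {0, 1}; the function name is illustrative, not part of the original framework:

from itertools import count, product

def candidate_programs():
    # Enumerate all bitstrings over B = {0, 1} in order of increasing
    # length (shortlex), so that shorter encodings of (task, solver
    # modification) pairs are tried first -- the bias towards pairs
    # describable by few additional bits of information.
    for n in count(1):
        for bits in product("01", repeat=n):
            yield "".join(bits)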
At first glance it might seem harder to search for pairs of tasks and solvers instead of solvers only, due
to the apparently larger search space. However, the additional freedom of inventing the tasks to be solved
may actually greatly reduce the time intervals between problem solver advances, because the system may
often have the option of inventing a rather simple task with an easy-to-find solution.
A new task may be about simplifying the old solver such that it can still solve all tasks learned so far, but with fewer computational resources such as time and storage space (e.g., Section 3.1 and Algorithm 7.1).
Since the new pair (task, solver) is the first one found and validated, the search automatically trades off the time-varying efforts required to either invent completely new, previously unsolvable problems, or to compress/speed up previous solutions. Sometimes it is easier to refine or simplify known skills, sometimes to invent new skills.
On typical problem solver architectures of personal computers (PCs) or neural networks (NNs), whenever a limited, known number of previously learned tasks has become solvable, a large number of unknown, never-tested tasks typically has become solvable as well (in the field of Machine Learning, this is known as generalization). POWERPLAY's ongoing search is continually testing (and always trying to go beyond) the generalization abilities of the most recent solver instance; some of its search time has to be spent on demonstrating that self-invented new tasks are not already solvable.
Often, however, much more time will have to be spent on making sure that a newly modified solver did not forget any of the possibly many previously learned skills. Problem solver modularization (Section 3.3, especially 3.3.2) may greatly reduce this time though, making POWERPLAY prefer pairs whose validation does not require the re-testing of too many previously learned skills, thus decomposing at least part of the search space into somewhat independent regions, realizing divide and conquer strategies as by-products of its built-in drive to invent and validate novel tasks/skills as quickly as possible.
A biologically inspired hope is that as the problem solver is becoming more and more general, it will
find it easier and easier to solve externally posed tasks (Section 6), just like growing infants often seem to
re-use their playfully acquired skills to solve teacher-given problems.
2 Notation & Algorithmic Framework POWERPLAY (Variant I)
B* denotes the set of finite sequences or bitstrings over the binary alphabet B = {0, 1}, λ the empty string, x, y, z, p, q, r, u strings in B*, N the natural numbers, R the real numbers, ε ∈ R a positive constant, m, n, n0, i, j, k, l non-negative integers, L(x) the number of bits in x (where L(λ) = 0), f, g functions mapping integers to integers. We write f(n) = O(g(n)) if there exist positive c, n0 such that f(n) ≤ c g(n) for all n > n0.
The computational architecture of the problem solver may be a deterministic universal computer, or a
more limited device such as a finite state automaton or a feedforward neural network (NN) [3]. All such
problem solvers can be uniquely encoded [9] or implemented on universal computers such as universal
Turing Machines (TM) [56]. Therefore, without loss of generality, the remainder of this paper assumes a fixed universal reference computer whose input programs and outputs are elements of B*. A user-defined subset S ⊂ B* defines the set of possible problem solvers. For example, if the problem solver's architecture is itself a binary universal TM or a standard computer, then S represents its set of possible programs, or a limited subset thereof—compare Sections 3.2 and 4.1. If it is a feedforward NN, then S could be a highly restricted subset of programs encoding the NN's possible topologies and weights (floating point numbers)—compare Section 8 and the original SLIM NN paper [47].

In what follows, for convenience I will often identify bitstrings in B* with things they encode, such as integers, real-valued vectors, weight matrices, or programs—the context will always make clear what is meant.
The problem solver's initial program is called s_0. There is a set of possible task descriptions T ⊂ B*. T may be the infinite set of all possible computable descriptions of tasks with possibly computable solutions, or just a small subset thereof. For example, a simple task may require the solver to answer a particular input pattern with a particular output pattern (more formal details on pattern recognition tasks are given in Section 3.1.1). Or it may require the solver to steer a robot towards a goal through a sequence of actions (more formal details on sequential decision making tasks in unknown environments are given in Section 3.1.2).

There is a particular sequence of task descriptions T_1, T_2, ..., where each unique T_i ∈ T (i = 1, 2, ...) is chosen or "invented" by a search method described below such that the solutions of T_1, T_2, ..., T_i can be computed by s_i, the i-th instance of the program, but not by s_{i-1} (i = 1, 2, ...). Each T_i consists of a unique problem identifier that can be read by s_i through some built-in input processing mechanism (e.g., input neurons of an NN [47]), and a unique description of a deterministic procedure for determining whether the problem has been solved. Denote T_{≤i} = {T_1, ..., T_i}; T_{<i} = {T_1, ..., T_{i-1}}.
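To make this concrete, here is a minimal Python sketch of such a task description, assuming a task is stored as an identifier plus a deterministic goal test over a recorded trace; all field and variable names are illustrative assumptions of this sketch:

from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Task:
    # A unique problem identifier the solver reads as input, plus a
    # deterministic procedure deciding from a recorded trace whether
    # the task has been solved.
    identifier: bytes
    solved: Callable[[bytes], bool]

# Illustrative example: the solver must output the identifier reversed.
t = Task(identifier=b"0111", solved=lambda trace: trace == b"1110")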
A valid task T_i (i > 1) may require solving at least one previously solved task T_k (k < i) more efficiently, using fewer resources such as storage space, computation time, energy, etc., thus achieving a wow-effect. See Section 3.1.
Tasks and problem solver modifications are computed and validated by elements of another appropriate set of programs P ⊂ B*. Programs p ∈ P may contain instructions for reading and executing (parts of) the code of the present problem solver and reading (parts of) a recorded history Trace ∈ B* of previous events that led to the present solver. The algorithmic framework (Alg. 2) incrementally trains the problem solver by finding p ∈ P that increase the set of solvable tasks.
Alg. 2: Algorithmic Framework POWERPLAY (Variant I)

Initialize s_0 in some way.
for i := 1, 2, ... do
  repeat
    Let a search algorithm (examples in Section 4) create a new candidate program p ∈ P. Give p limited time to do (not necessarily in this order):
    * TASK INVENTION: Let p compute a task T ∈ T. See Section 3.1.
    * SOLVER MODIFICATION: Let p compute a value of the variable q ∈ S ⊂ B* (a candidate for s_i) by computing a modification of s_{i-1}. See Section 3.2.
    * CORRECTNESS DEMONSTRATION: Let p try to show that T cannot be solved by s_{i-1}, but that T and all T_k (k < i) can be solved by q. See Section 3.3.
  until CORRECTNESS DEMONSTRATION was successful
  Set p_i := p; T_i := T; s_i := q; update Trace.
end for
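A compact Python sketch of this loop may help; it assumes hypothetical callables candidates (yielding triples of a program p, its invented task T, and its solver modification q) and demonstrate (performing CORRECTNESS DEMONSTRATION), and is a sketch under those assumptions, not a definitive implementation:

def powerplay(s0, candidates, demonstrate):
    # Skeleton of Alg. 2. `candidates(solver, repertoire)` yields triples
    # (p, T, q): a candidate program p, the task T it invented, and the
    # solver modification q it computed. `demonstrate(q, solver, tasks)`
    # performs CORRECTNESS DEMONSTRATION: q solves all tasks, while the
    # unmodified solver does not solve the newest one.
    solver, repertoire, trace = s0, [], []
    while True:                                  # for i := 1, 2, ... do
        for p, T, q in candidates(solver, repertoire):
            if demonstrate(q, solver, repertoire + [T]):
                repertoire.append(T)             # T_i := T
                solver = q                       # s_i := q
                trace.append((p, T))             # update Trace
                break                            # proceed to i + 1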
environmental input x_i(t) ∈ B* and a reward signal r_i(t) ∈ B* (interpreted as a real number) into parts of u_i(t), then update other parts of u_i(t) (a function of u_i(t-1)) and compute an action y_i(t) ∈ B* encoded as a part of u_i(t). y_i(t) may affect the environment, and thus future inputs.

If P allows for programs that can dynamically acquire additional physical computational resources such as additional CPUs and storage, then the above constant number of elementary computational instructions should be replaced by a constant amount of real time, to be measured by a reliable physical clock.
The sequence of 4-tuples (x_i(t), r_i(t), u_i(t), y_i(t)) (t = 1, ..., t_i) gets recorded by the so-called trace Trace_i ∈ B*. If at the end of the interaction a desirable computable property J_i(Trace_i) (computed by applying program J_i to Trace_i) is satisfied, then by definition the task is solved. The set J of possible J_i may represent an infinite set of all computable tasks with solutions computable by the given hardware. For practical reasons, however, the set J of possible J_i may also be restricted to bit sequences encoding just a few possible goals. For example, J_i may only encode goals of the form: a robot arm steered by program or "policy" s_i has reached a certain target (a desired final observation x_i(t_i) recorded in Trace_i) without measurably bumping into an obstacle along the way, that is, there were no negative rewards, that is, r_i(τ) ≥ 0 for τ = 1, ..., t_i.
If the environment is deterministic, e.g., a digital physics simulation of a robot, then its current state can be encoded as part of u(t), and it is straightforward for CORRECTNESS DEMONSTRATION to test whether some s_i can still solve a previously solved task T_j (j < i). However, what if the environment is only partially observable, like the real world, and non-stationary, changing in unknown ways? Then CORRECTNESS DEMONSTRATION must check whether s_i still produces the same action sequence in response to the input sequence recorded in Trace_j (often this replay-based test will actually be computationally cheaper than a test involving the environment). Achieving the same goal in a changed environment must be considered a different task, even if the changes are just due to noise on the environmental inputs. (Sure, in the real world s_j (j > i) might actually achieve J_i faster than s_i, given the description of T_i, but CORRECTNESS DEMONSTRATION in general cannot know whether this acceleration was due to plain luck—it must stick to reproducing Trace_j to make sure it did not forget anything.)
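The replay-based test might look as follows in Python, assuming a solver object with hypothetical reset and act methods; these interface names are assumptions of this sketch:

def still_solves(solver, recorded_trace):
    # Replay test: feed the recorded input/reward sequence to the solver
    # (reset to its initial internal state) and require that it reproduces
    # the recorded action sequence exactly.
    solver.reset()
    for (x, r, u, y) in recorded_trace:
        if solver.act(x, r) != y:
            return False   # diverged from Trace_j: counts as forgetting
    return True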
See Section 7.2, however, for a less strict POWERPLAY variant whose CORRECTNESS DEMONSTRATION directly interacts with the real world to collect sufficient problem-solving statistics through repeated trials, making certain assumptions about the probabilistic nature of the environment and the repeatability of experiments.
can solve T_1, T_2, ..., T_i, because one naive way of ensuring correctness is to re-test s_i on all previously solved tasks. Theoretically more efficient ways are considered next.
3.3.2 Keeping Track of Which Components of the Solver Affect Which Tasks

Often it is possible to partition s ∈ S into components, such as individual bits of the software of a PC, or weights of an NN. Here the k-th component of s is denoted s^k. For each k (k = 1, 2, ...) a variable list L^k = (T^k_1, T^k_2, ...) is introduced. Its initial value before the start of POWERPLAY is L^k_0, an empty list. Whenever p_i has found s_i and T_i at the end of CORRECTNESS DEMONSTRATION, each L^k is updated as follows: its new value L^k_i is obtained by appending to L^k_{i-1} those T_j ∉ L^k_{i-1} (j = 1, ..., i) whose current (possibly revised) solutions now need s^k at least once during the solution-computing process, and deleting those T_j whose current solutions do not use s^k any more.

POWERPLAY's CORRECTNESS DEMONSTRATION thus has to test only tasks in the union of all L^k_i. That is, if the most recent task does not require changes of many components of s, and if the changed bits do not affect many previous tasks, then CORRECTNESS DEMONSTRATION may be very efficient.
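A small Python sketch of this bookkeeping, assuming the lists L^k are stored in a dictionary mapping component indices to task lists; names are illustrative:

def tasks_to_retest(usage_lists, changed_components):
    # `usage_lists` maps each component index k to L^k, the list of tasks
    # whose current solutions use s^k. After a modification touching only
    # `changed_components`, CORRECTNESS DEMONSTRATION must re-test just
    # the union of the affected lists.
    retest = set()
    for k in changed_components:
        retest |= set(usage_lists.get(k, ()))
    return retest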
Since every new task added to the repertoire is essentially defined by the time required to invent it, to solve it, and to show that no previous tasks became unsolvable in the process, POWERPLAY is generally "motivated" to invent tasks whose validity check does not require too much computational effort. That is, POWERPLAY will often find p_i that generate s_{i-1}-modifications that don't affect too many previous tasks, thus decomposing at least part of the spaces of tasks and their solutions into more or less independent regions, realizing divide and conquer strategies as by-products. Compare a recent experimental analysis of this effect [53, 52].
not of s_{i-1}, where s'_i may read s_{i-1} or invoke parts of s_{i-1} as sub-programs to solve T_{≤i}—only then does CORRECTNESS DEMONSTRATION have to test s_i not only on T_i but also on T_{<i} (see [41] for details).

A simple but not very general way of doing something similar is to interleave TASK INVENTION, SOLVER MODIFICATION, and CORRECTNESS DEMONSTRATION as follows: restrict all p ∈ P such that they must define I_i := i as the unique task identifier I_i for T_i (see Section 3.1.2); restrict all s ∈ S such that the input of I_i = i automatically invokes a sub-program s'_i, a part of s_i but not of s_{i-1} (although s'_i may read s_{i-1} or invoke parts of s_{i-1} as sub-programs to solve T_i). Restrict J_i to a subset of acceptable computational outcomes (Section 3.1.2). Run s_i until it halts and has computed a novel output acceptable by J_i that is different from all outputs computed by the (halting) solutions to T_{<i}; this novel output becomes T_i's goal. By induction over i, since all previously used components of s_{i-1} remain unmodified, the set T_{<i} is guaranteed to remain solvable, no matter what s'_i does. That is, CORRECTNESS DEMONSTRATION on previous tasks becomes trivial. However, in this simple setup there is no immediate generalization across tasks as in OOPS [41] and the previous paragraph: the trivial task identifier i will always first invoke some s'_i different from all s'_k (k ≠ i), instead of allowing for solving a new task solely by previously found code.
The i-th problem is to find a program p_i ∈ P that creates s_i and T_i and demonstrates that s_i but not s_{i-1} can solve T_1, T_2, ..., T_i. This yields a perfectly ordered problem sequence for a variant of the Optimal Ordered Problem Solver OOPS [41] (Algorithm 4.1).
While a candidate program p ∈ P is executed, at any given discrete time step t = 1, 2, ..., its internal state or dynamical storage U at time t is denoted U(t) ∈ B* (not to be confused with the solver's internal state u(t) of Section 3.1.2). Its initial default value is U(0). E.g., U(t) could encode the current contents of the internal tape of a TM (to be modified by p), or of certain cells in the dynamic storage area of a PC.

Once p_i is found, p_i, s_i, T_i, Trace_i (if applicable; see Section 3.1.2) will be saved in unmodifiable read-only storage, possibly together with other data observed during the search so far. This may greatly facilitate the search for p_k, k > i, since p_k may contain instructions for addressing and reading p_j, s_j, T_j, Trace_j (j = 1, ..., k-1) and for copying the read code into modifiable storage U, where p_k may further edit the code and execute the result, which may be a useful subprogram [41].
Define a probability distribution P(p) on P to represent the searcher's initial bias (more likely programs p will be tested earlier [17]). P could be based on program length, e.g., P(p) = 2^{-L(p)}, or on a probabilistic syntax diagram [41, 40]. See Algorithm 4.1.

OOPS keeps doubling the time limit until there is sufficient runtime for a sufficiently likely program to compute a novel, previously unsolvable task, plus its solver, which provably does not forget previous solutions. OOPS allocates time to programs according to an asymptotically optimal universal search method [17] for problems with easily verifiable solutions, that is, solutions whose validity can be quickly tested. Given some problem class, if some unknown optimal program p requires f(k) steps to solve a problem instance of size k and demonstrate the correctness of the result, then this search method will need at most O(f(k)/P(p)) = O(f(k)) steps—the constant factor 1/P(p) may be large but does not depend on k. Since OOPS may re-use previously generated solutions and solution-computing programs, however, it may be possible to greatly reduce the constant factor associated with plain universal search [41].
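The time allocation of such a search can be sketched in a few lines of Python; this toy version assumes a finite candidate pool and a hypothetical run(p, steps) interface that executes p for a bounded number of steps and verifies any solution it produces:

def universal_search(candidates, run, budget0=1):
    # Levin-style time allocation (toy version): in phase
    # c = budget0, 2*budget0, 4*budget0, ..., run each program p for up
    # to P(p) * c steps. `candidates` is a list of (p, prob) pairs;
    # `run(p, steps)` returns a verified result or None.
    c = budget0
    while True:
        for p, prob in candidates:
            result = run(p, max(1, int(prob * c)))
            if result is not None:
                return p, result
        c *= 2   # double the time limit, as OOPS does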
The big difference to previous implementations of OOPS is that POWERPLAY has the additional freedom to define its own tasks. As always, every new task added to the repertoire is essentially defined by the time required to invent it, to solve it, and to demonstrate that no previously learned skills got lost.
3-dimensional brain-like multi-processor hardware to be expected in the future. This encourages SLIM NNs to solve many subtasks by subsets of neurons that are physically close [47].
Alg. 4.3: POWERPLAY for RNNs Using Stochastic or Evolutionary Search

Randomly initialize RNN1's variable weight matrix ⟨w_lk⟩ and use the result as s_0 (see Section 4.1.2)
for i := 1, 2, ... do
  set Boolean variable DONE := FALSE
  repeat
    use a black box optimization algorithm BBOA (many are possible [24, 10, 60, 49]) with adaptive parameter vector θ to create some T ∈ T (to define the task input to RNN1; see Section 3.1) and a modification of s_{i-1}, the current ⟨w_lk⟩ of RNN1, thus obtaining a new candidate q ∈ S
    if q but not s_{i-1} can solve T and all T_k (k < i) (see Sections 3.3, 3.3.2) then
      set DONE := TRUE
    end if
  until DONE
  set s_i := q; ⟨w_lk⟩ := q; T_i := T (also store Trace_i if applicable, see Section 3.1.2). Use the information stored so far to adapt the parameters θ of the BBOA, e.g., by gradient-based search [60, 49], or according to the principles of evolutionary computation [24, 10, 60].
end for
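A minimal Python rendering of one iteration of this loop, with a fixed Gaussian perturbation standing in for an adaptive black box optimizer; solves and sample_task are assumed callables of this sketch, not part of the original algorithm:

import numpy as np

def powerplay_rnn_step(w_prev, tasks, solves, sample_task, sigma=0.1, seed=0):
    # Perturb the current weight matrix <w_lk> to obtain a candidate q,
    # sample a candidate task T, and accept the pair only if q solves T
    # and all stored tasks while w_prev does not solve T.
    rng = np.random.default_rng(seed)
    while True:
        T = sample_task(rng)
        q = w_prev + sigma * rng.standard_normal(w_prev.shape)
        if (solves(q, T) and not solves(w_prev, T)
                and all(solves(q, Tk) for Tk in tasks)):
            return q, T   # s_i := q; T_i := T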
6 Adding External Tasks
The growing repertoire of the problem solver may facilitate learning of solutions to externally posed tasks. For example, one may modify POWERPLAY such that for certain i, T_i is defined externally, instead of being invented by the system itself. In general, the resulting s_i will contain an externally inserted bias in the form of code that will make some future self-generated tasks easier to find than others. It should be possible to push the system in a human-understandable or otherwise useful direction by regularly inserting appropriate external goals. See Algorithm 7.1.
Another way of exploiting the growing repertoire is to simply copy s_i for some i and use it as a starting point for a search for a solution to an externally posed task T, without insisting that the modified s_i can still solve T_1, T_2, ..., T_i. This may be much faster than trying to solve T from scratch, to the extent that the solutions to self-generated tasks reflect general knowledge (code) re-usable for T.
In general, however, it will be possible to design external tasks whose solutions do not profit from those of self-generated tasks—the latter may even turn out to slow down the search.
On the other hand, in the real world the benefits of curious exploration seem obvious. One should
analyze theoretically and experimentally under which conditions the creation of self-generated tasks can
accelerate the solution to externally generated tasks—see [30, 54, 37, 38, 19, 4, 28, 62] for previous simple
experimental studies in this vein.
7.1 POWERPLAY Variant II: Explicitly Penalizing Time and Space Complexity
Let us remove time and space bounds from the task definitions of Section 3.1.2, since the modified cost-based POWERPLAY framework below (Algorithm 7.1) will handle computational costs (such as time and space complexity of solutions) more directly. In the present section, T_i encodes a tuple (I_i, J_i) ∈ I × J with the interpretation: s_i must first read I_i and then interact with an environment through a sequence of perceptions and actions, to achieve some computable goal defined by J_i within a certain maximal time interval t_max (a positive constant). Let t'_s(T) be t_max if s cannot solve task T; otherwise it is the time needed to solve T by s. Let l'_s(T) be the positive constant l_max if s cannot solve T; otherwise it is the number of components of s needed to solve task T by s. The non-negative real-valued reward r(T) for solving T is a positive constant r_new for self-defined previously unsolvable T, or user-defined if T is an external task solved by s (Section 6). The real-valued cost Cost(s, TSET) of solving all tasks in a task set TSET through s is a real-valued function of: all l'_s(T), t'_s(T) (for all T ∈ TSET), L(s), and Σ_{T ∈ TSET} r(T). For example, the cost function

Cost(s, TSET) = L(s) + α Σ_{T ∈ TSET} [t'_s(T) − r(T)]

encourages compact and fast solvers solving many different tasks with the same components of s, where the real-valued positive parameter α weighs space costs against time costs, and r_new should exceed t_max to encourage solutions of novel self-generated tasks, whose cost contributions should be below zero (alternative cost definitions could also take into account energy consumption, etc.).
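As a worked example, the cost function above is a one-liner in Python; the dictionary-based bookkeeping of t'_s(T) and r(T) is an assumption of this sketch:

def cost(s_length, t_solve, rewards, alpha=1.0):
    # Cost(s, TSET) = L(s) + alpha * sum_{T in TSET} [t'_s(T) - r(T)].
    # `t_solve[T]` is t'_s(T) (t_max if T is unsolved); `rewards[T]` is
    # r(T). With r_new > t_max, a newly solved self-generated task
    # contributes a negative term, so adding it lowers the total cost.
    return s_length + alpha * sum(t_solve[T] - rewards[T] for T in t_solve)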
Let us keep an analogue of the remaining notation of Section 3.1.2, such as u_i(t), x_i(t), r_i(t), y_i(t), Trace_i, J_i(Trace_i). As always, if the environment is unknown and possibly changing over time, then to test the performance of a new solver s on a previous task T_k, only Trace_k is necessary—see Section 3.1.2. As always, let T_{≤i} denote the set containing all tasks T_1, ..., T_i (note that if T_i = T_k for some k < i then it will appear only once in T_{≤i}), and let ε > 0 again define what counts as acceptable progress:
Alg. 7.1: POWERPLAY Framework (Variant II) Explicitly Handling Costs of Solving Tasks
By Algorithm 7.1, s_i may forget certain abilities of s_{i-1}, provided that the overall performance as measured by Cost(s_i, T_{≤i}) has improved, either because a new task became solvable or because previous tasks became solvable more efficiently.

Following Section 3.3, CORRECTNESS DEMONSTRATION can often be facilitated, for example, by tracking which components of s_i are used for solving which tasks (Section 3.3.2).
To further refine this approach, consider that in phase i, the list L^k_i (defined in Section 3.3.2) contains all previously learned tasks whose solutions depend on s^k. This can be used to determine the current value Val(s^k_i) of some component s^k of s: Val(s^k_i) = − Σ_{T ∈ L^k_i} Cost(s_i, T_{≤i}). It is a simple exercise to invent POWERPLAY variants that do not forget valuable components as easily as less valuable ones.
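In code, under the same dictionary-based bookkeeping assumption as in Section 3.3.2 above, the value of a component reduces to:

def component_value(k, usage_lists, repertoire_cost):
    # Val(s^k_i) = - sum_{T in L^k_i} Cost(s_i, T_<=i). Since the summand
    # does not depend on T, this is -|L^k_i| times the current repertoire
    # cost: components used by many solutions are the most valuable ones
    # to protect from modification.
    return -len(usage_lists.get(k, ())) * repertoire_cost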
The implementations of Sections 4.1 and 4.3 are easily adapted to the cost-based POWERPLAY framework. Compare separate papers [53, 52].
7.2 Probabilistic POWERPLAY Variants
Section 3.1.2 pointed out that in partially observable and/or non-stationary unknown environments CORRECTNESS DEMONSTRATION must use Trace_k to check whether a new s_i still knows how to solve an earlier task T_k (k < i). A less strict variant of POWERPLAY, however, will simply make certain assumptions about the probabilistic nature of the environment and the repeatability of trials, assuming that a limited fixed number of interactions with the real world is sufficient to estimate the costs c*_i, c_i in Algorithm 7.1.

Another probabilistic way of softening POWERPLAY is to add new tasks without proof that s won't forget solutions to previous tasks, provided CORRECTNESS DEMONSTRATION can at least show that the probability of forgetting any previous solution is below some real-valued positive constant threshold.
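One possible (purely illustrative) way to implement such a check in Python, with a crude empirical estimate standing in for a proper statistical bound; trial(solver, T) is an assumed interface that runs one stochastic attempt and returns True on success:

def unlikely_to_forget(solver, old_tasks, trial, threshold, n_trials=100):
    # Estimate, by repeated trials, the probability that the new solver
    # fails some previously solved task; accept only if every per-task
    # empirical failure rate stays below the threshold. Repeatability of
    # trials is assumed.
    for T in old_tasks:
        failures = sum(1 for _ in range(n_trials) if not trial(solver, T))
        if failures / n_trials >= threshold:
            return False
    return True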
some problem, the Gödel machine may decide to replace HSEARCH by a faster method suffering less from large constant overhead, but even if it doesn't, its performance won't be less than asymptotically optimal.

Why doesn't everybody use such universal problem solvers for all computational real-world problems? Because most real-world problems are so small that the ominous constant slowdowns (potentially relevant at least before the first self-rewrite) may be large enough to prevent the universal methods from being feasible.
POWERPLAY, on the other hand, is designed to incrementally build a practical, more and more general problem solver that can solve numerous tasks quickly, not in the asymptotic sense, but by exploiting to the max its given particular search algorithm and computational architecture, with all its space and time limitations, including those reflected by constants ignored by the asymptotic optimality notation.
As mentioned in Section 6, however, one must now analyze under which conditions POWERPLAY's self-generated tasks can accelerate the solution to externally generated tasks (compare previous experimental studies of this type [30, 54, 37, 38]).
capacity and its unavoidable generalization effects (many never-tried tasks will become solvable by solutions to the few explicitly tested T_i). Compare Section 5.
2. The general creative agent above [42, 45] is motivated to improve performance on the entire history of previous still unsolved tasks, while POWERPLAY may discard much of this history, keeping only a selective list of previously solved tasks. However, as the system is interacting with its environment, one could store the entire continually growing history, and make sure that T always allows for defining the task of better compressing the history so far.

3. POWERPLAY as in Section 2 has a binary criterion for adding knowledge (was the new task solvable without forgetting old solutions?), while the general agent [42, 45] uses a more informative information-theoretic measure. The cost-based POWERPLAY framework (Alg. 7.1) of Section 7, however, offers similar, more flexible options, rewarding compression or speedup of solutions to previously solved tasks.
On the other hand, drawbacks of previous implementations of formal creativity theory include:
1. Some previous approximative implementations [30, 54] used traditional RL methods [15] with theoretically unlimited look-ahead, but those are not guaranteed to work well in partially observable and/or non-stationary environments where the reward function changes over time, and won't necessarily generate an optimal sequence of future tasks or experiments.
2. Theoretically optimal implementations [42, 45] are currently still impractical, for reasons similar to
those discussed in Section 9.1.
Hence POWERPLAY may be viewed as a greedy but feasible implementation of certain basic principles of creativity [42, 45]. POWERPLAY-based systems are continually motivated to invent new tasks solvable by formerly unknown procedures, or to compress or speed up problem solving procedures discovered earlier. Unlike previous implementations, POWERPLAY extracts from the lifelong experience history a sequence of clearly identified and separated tasks with explicitly recorded solutions. By design it cannot suffer from online learning problems affecting its solver's performance on previously solved problems.
However, the previous system [37, 38] did not have a built-in guarantee that it cannot forget previously learned skills, while POWERPLAY as in Section 2 does (and the time and space complexity-based variant Alg. 7.1 of Section 7 can forget only if this improves the average efficiency of previous solutions).
To analyze the novel framework’s consequences in practical settings, experiments are currently being
conducted with various problem solver architectures with different generalization properties. See separate
papers [53, 52] and Section 8.
10 Words of Caution
The behavior of POWERPLAY is determined by the nature and the limitations of T, S, P, and its algorithm for searching P. If T includes all computable task descriptions, and both S and P allow for implementing arbitrary programs, and the search algorithm is a general method for search in program space (Section 4), then there are few limits to what POWERPLAY may do (besides the limits of computability [9]).
It may not be advisable to let a general variant of POWERPLAY loose in an uncontrolled situation, e.g., on a multi-computer network on the internet, possibly with access to control of physical devices, and the potential to acquire additional computational and physical resources (Section 3.1.2) through programs executed during POWERPLAY. Unlike, say, traditional virus programs, POWERPLAY-based systems will continually change in a way hard to predict, incessantly inventing and solving novel, self-generated tasks, driven only by a desire to increase their general problem-solving capacity, perhaps a bit like many humans seek to increase their power once their basic needs are satisfied. This type of artificial curiosity/creativity, however, may conflict with human intentions on occasion. On the other hand, unchecked curiosity may sometimes also be harmful or fatal to the learning system itself (Section 6)—curiosity can kill the cat.
11 Acknowledgments
Thanks to Mark Ring, Bas Steunebrink, Faustino Gomez, Sohrob Kazerounian, Hung Ngo, Leo Pape,
Giuseppe Cuccu, for useful comments.
References
[1] A. Barto. Intrinsic motivation and reinforcement learning. In G. Baldassarre and M. Mirolli, editors,
Intrinsically Motivated Learning in Natural and Artificial Systems. Springer, 2012. In press.
[2] D. E. Berlyne. A theory of human curiosity. British Journal of Psychology, 45:180–191, 1954.
[3] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[4] G. Cuccu, M. Luciw, J. Schmidhuber, and F. Gomez. Intrinsically motivated evolutionary search for
vision-based reinforcement learning. In Proceedings of the 2011 IEEE Conference on Development
and Learning and Epigenetic Robotics IEEE-ICDL-EPIROB. IEEE, 2011.
[5] P. Dayan. Exploration from generalization mediated by multiple controllers. In G. Baldassarre and
M. Mirolli, editors, Intrinsically Motivated Learning in Natural and Artificial Systems. Springer,
2012. In press.
[6] V. V. Fedorov. Theory of optimal experiments. Academic Press, 1972.
[7] M. C. Fitting. First-Order Logic and Automated Theorem Proving. Graduate Texts in Computer
Science. Springer-Verlag, Berlin, 2nd edition, 1996.
[8] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an applica-
tion to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
[9] K. Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I.
Monatshefte für Mathematik und Physik, 38:173–198, 1931.
[10] F. J. Gomez, J. Schmidhuber, and R. Miikkulainen. Efficient non-linear control through neuroevolu-
tion. Journal of Machine Learning Research JMLR, 9:937–965, 2008.
[11] H. F. Harlow, M. K. Harlow, and D. R. Meyer. Novelty and curiosity as determinants of exploratory
behavior. Journal of Experimental Psychology, 41:68–80, 1950.
[12] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780,
1997.
[13] M. Hutter. The fastest and shortest algorithm for all well-defined problems. International Journal of
Foundations of Computer Science, 13(3):431–443, 2002. (On J. Schmidhuber’s SNF grant 20-61847).
[14] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability.
Springer, Berlin, 2005. (On J. Schmidhuber’s SNF grant 20-61847).
[15] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: a survey. Journal of AI
research, 4:237–285, 1996.
[16] A. N. Kolmogorov. Three approaches to the quantitative definition of information. Problems of
Information Transmission, 1:1–11, 1965.
[17] L. A. Levin. Universal sequential search problems. Problems of Information Transmission, 9(3):265–
266, 1973.
[18] M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and its Applications (2nd
edition). Springer, 1997.
[19] M. Luciw, V. Graziano, M. Ring, and J. Schmidhuber. Artificial curiosity with planning for au-
tonomous perceptual and cognitive development. In Proceedings of the First Joint Conference on
Development Learning and on Epigenetic Robotics ICDL-EPIROB, Frankfurt, August 2011.
[29] J. Schmidhuber. Dynamische neuronale Netze und das fundamentale raumzeitliche Lernproblem.
Dissertation, Institut für Informatik, Technische Universität München, 1990.
[30] J. Schmidhuber. Curious model-building control systems. In Proceedings of the International Joint
Conference on Neural Networks, Singapore, volume 2, pages 1458–1463. IEEE press, 1991.
[31] J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural
controllers. In J. A. Meyer and S. W. Wilson, editors, Proc. of the International Conference on
Simulation of Adaptive Behavior: From Animals to Animats, pages 222–227. MIT Press/Bradford
Books, 1991.
[32] J. Schmidhuber. A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243–248, 1992.
[33] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural
Computation, 4(1):131–139, 1992.
[34] J. Schmidhuber. On decreasing the ratio between learning complexity and number of time-varying
variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural
Networks, Amsterdam, pages 460–463. Springer, 1993.
[35] J. Schmidhuber. A self-referential weight matrix. In Proceedings of the International Conference on
Artificial Neural Networks, Amsterdam, pages 446–451. Springer, 1993.
[36] J. Schmidhuber. What’s interesting? Technical Report IDSIA-35-97, IDSIA, 1997.
ftp://ftp.idsia.ch/pub/juergen/interest.ps.gz; extended abstract in Proc. Snowbird’98, Utah, 1998; see
also [38].
[37] J. Schmidhuber. Artificial curiosity based on discovering novel algorithmic predictability through co-
evolution. In P. Angeline, Z. Michalewicz, M. Schoenauer, X. Yao, and Z. Zalzala, editors, Congress
on Evolutionary Computation, pages 1612–1618. IEEE Press, 1999.
[38] J. Schmidhuber. Exploring the predictable. In A. Ghosh and S. Tsutsui, editors, Advances in Evolutionary Computing, pages 579–612. Springer, 2002.
[39] J. Schmidhuber. Bias-optimal incremental problem solving. In S. Becker, S. Thrun, and K. Ober-
mayer, editors, Advances in Neural Information Processing Systems 15 (NIPS 15), pages 1571–1578,
Cambridge, MA, 2003. MIT Press.
[40] J. Schmidhuber. OOPS source code in crystalline format: http://www.idsia.ch/~juergen/oopscode.c, 2004.
[41] J. Schmidhuber. Optimal ordered problem solver. Machine Learning, 54:211–254, 2004.
[42] J. Schmidhuber. Developmental robotics, optimal artificial curiosity, creativity, music, and the fine
arts. Connection Science, 18(2):173–187, 2006.
[43] J. Schmidhuber. Gödel machines: Fully self-referential optimal universal self-improvers. In B. Go-
ertzel and C. Pennachin, editors, Artificial General Intelligence, pages 199–226. Springer Verlag,
2006. Variant available as arXiv:cs.LO/0309048.
[44] J. Schmidhuber. Ultimate cognition à la Gödel. Cognitive Computation, 1(2):177–193, 2009.
[45] J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990-2010). IEEE Trans-
actions on Autonomous Mental Development, 2(3):230–247, 2010.
[46] J. Schmidhuber. P OWER P LAY: Training an Increasingly General Problem Solver by Continually
Searching for the Simplest Still Unsolvable Problem. Technical Report arXiv:1112.5309v1 [cs.AI],
2011.
[47] J. Schmidhuber. Self-delimiting neural networks. Technical Report IDSIA-08-12, arXiv:1210.0118v1
[cs.NE], IDSIA, 2012.
[48] J. Schmidhuber, J. Zhao, and M. Wiering. Shifting inductive bias with success-story algorithm, adap-
tive Levin search, and incremental self-improvement. Machine Learning, 28:105–130, 1997.
[52] R. K. Srivastava, B. R. Steunebrink, and J. Schmidhuber. First Experiments with P OWER P LAY.
Technical Report arXiv:1210.8385v1 [cs.AI], 2012.
[53] R. K. Srivastava, B. R. Steunebrink, M. Stollenga, and J. Schmidhuber. Continually adding self-
invented problems to the repertoire: First experiments with POWERPLAY. In Proceedings of the
2012 IEEE Conference on Development and Learning and Epigenetic Robotics IEEE-ICDL-EPIROB.
IEEE, 2012.
[54] J. Storck, S. Hochreiter, and J. Schmidhuber. Reinforcement driven information acquisition in non-
deterministic environments. In Proceedings of the International Conference on Artificial Neural Net-
works, Paris, volume 2, pages 159–164. EC2 & Cie, 1995.
[55] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge,
MA, 1998.
[56] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceed-
ings of the London Mathematical Society, Series 2, 41:230–267, 1936.
[57] C. S. Wallace and D. M. Boulton. An information theoretic measure for classification. Computer
Journal, 11(2):185–194, 1968.
[58] C. S. Wallace and P. R. Freeman. Estimation and inference by compact coding. Journal of the Royal Statistical Society, Series B, 49(3):240–265, 1987.
[59] P. J. Werbos. Generalization of backpropagation with application to a recurrent gas market model.
Neural Networks, 1, 1988.
[60] D. Wierstra, T. Schaul, J. Peters, and J. Schmidhuber. Natural evolution strategies. In Congress of
Evolutionary Computation (CEC 2008), 2008.
[61] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their
computational complexity. In Back-propagation: Theory, Architectures and Applications. Hillsdale,
NJ: Erlbaum, 1994.
[62] Y. Sun, F. Gomez, and J. Schmidhuber. Planning to be surprised: Optimal Bayesian exploration in dynamic environments. In Proc. Fourth Conference on Artificial General Intelligence (AGI), Google, Mountain View, CA, 2011.