The symbol grounding problem part II
Manuel Rodriguez
Feb 28, 2024*
Abstract
The dominant challenge in robotics is the large state space. Any algorithm that traverses the entire state space becomes too slow, even on supercomputing hardware. The bottleneck can be overcome with guided instructions, which result in a speaker-hearer interaction: the high level commands from the speaker are interpreted by the hearer, which is the robot. Even if the concept sounds a bit uncommon, it makes sense to investigate it in detail because it allows complex robotics problems to be solved with natural language.
After a short introduction to the last AI Winter, which lasted until the year 1992, the dominant part of the following paper is dedicated to the instruction following problem. A simple maze game is taken as an example to implement a working prototype in Python which allows a robot swarm to be controlled with basic commands.
Keywords: AI winter, instruction following, symbol grounding problem, robotics
Contents

1 Inventing the benchmark for AI
  1a Evaluation function as datasets
  1b Heuristics are feature vectors

2 From backtracking algorithms to heuristics
  2a Memory based heuristics
  2b Pattern databases in the late 1990s
  2c Early robots in the 1950s
  2d Artificial Intelligence in the 1990s
    2d1 Developments since the last AI Winter
    2d2 The invisible 5th generation computers
  2e Early teleoperated robots
    2e1 Anti pattern in teleoperation
  2f Textual teleoperation
    2f1 From solving games towards generating games
    2f2 Early adventure games in the 1980s
    2f3 A review of the Castle adventure game
    2f4 Speaker hearer interaction in adventure games of the mid 1980s
    2f5 User interfaces as a grounding mechanism
    2f6 Creating toy problems from scratch
    2f7 A closer look into action adventure games from the past
    2f8 Communication with a game engine
    2f9 Simplified instruction following example with video games
* This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License <http://creativecommons.org/licenses/by-sa/4.0/>.
3 Instruction following
  3a Instruction following for goal formulation
  3b Cost function as instruction following
  3c From line following to instruction following
    3c1 History of instruction following
    3c2 NP hard problems vs instruction following
  3d Instruction following for a robot swarm
  3e The logic is hidden in the GUI menu
  3f The auto mode for instruction following
    3f1 Programming an AI with a communication board
  3g Theoretical reason for instruction following
    3g1 Instruction following as communication paradigm
    3g2 Programming an instruction following robot from scratch
    3g3 Increasing the automation level with natural language
    3g4 An instrument panel for man machine interaction
    3g5 Communication based AI
    3g6 Programming an instruction following robot step by step
    3g7 Robot control with a protocol
    3g8 Object oriented programming
  3h Instruction following with speaker and hearer
  3i Behavior tree based instruction following
    3i1 From commands to behavior trees
    3i2 Behavior tree as dialogue games
    3i3 Dialogue based teleoperation
    3i4 Voice commands and behavior trees
  3j Connectionist model of dialogue

4 Practical demonstration
  4a Creating a cleaning bot with a dialogue
  4b A grammar controlled gripper robot
  4c How to automate a warehouse with robots
    4c1 Command dictionary for a maze robot
5 Robot programming languages
  5a The symbol grounding problem
    5a1 Symbol grounding mindmap
  5b Inventing a Domain specific language (DSL)
  5c Interactive robot control
    5c1 Interactive animation with datasets
    5c2 Stick figure dataset to animation
    5c3 Interface for character animation
    5c4 Early parameterized computer animation

References
1 Inventing the benchmark for AI
A common assumption about Artificial Intelligence is that it has to be realized within a computer. AI, at least according to this claim, is the result of an intelligent piece of software. Of course the software works with an algorithm, so the open question is which sort of algorithm will match human intelligence most closely.
The surprising situation is that such an understanding does not motivate AI research but prevents it. The more sensible goal is to postpone the development of intelligent software in favor of creating an objective scale to measure intelligence. The task for the computer is to pass the test by achieving a high score.
Even if this concept sounds unusual, it can be realized much more easily than programming a classical AI. The idea is that intelligence is realized as a dialogue. One instance provides a problem, while the opponent has to provide the answer to the problem. In the simple case the problem is “3+2” and the correct answer is 5. The reason why such an interactive definition of AI makes sense is that it allows the computer to be programmed towards a certain objective. In the concrete case the problem for the computer is to add 3+2. Even if the computer isn’t equipped with this capability, it is possible to write a small calculator app.
Let me give another example. Suppose the question is about a different operation involving floating point numbers. The problem would be “7.3/2”. This time, the difficulty for the problem solving algorithm is of course greater. The computer has to handle the floating point representation and needs to know how to divide two numbers. The required calculator app is more complex to program. But similar to the first trial, it is possible to write a program which is able to do so. In high level programming languages like Python the feature is built in by default, while in other languages like assembly the programmer has to type in the code manually.
The general idea is to scale up question answering systems from easy to hard questions. Easy questions are, similar to the previously mentioned example, ordinary math problems which can be solved with a calculator. Harder questions are about word matching games and visual Q&A tasks. All these problems have one thing in common: there is always a dialogue visible. One instance formulates a task, while the second instance provides the answer.
Let us take a step backward to understand the situation better. Before it is possible to program an Artificial Intelligence, there is a need to program an automatic referee. The referee judges whether an answer is correct or wrong. In a video game the referee is equal to the scoring unit. For example, in the game of Pong this unit determines whether the ball is out and then adjusts the score for the player. For problems which are provided as a dataset, the score gets determined by comparing the given answer with the correct answer. For example, the correct answer is “3+2=5”, and if the computer provides “6” as the answer it is wrong.
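The automatic referee described above can be sketched in a few lines of Python. This is a minimal illustration only; the dataset entries, the solver and all names are made up for this sketch and are not taken from the paper.

```python
# Minimal sketch of an automatic referee for a Q&A dataset.
# The dataset entries below are illustrative examples only.

qa_dataset = [
    {"question": "3+2", "answer": "5"},
    {"question": "7.3/2", "answer": "3.65"},
    {"question": "Correct spelling: aple, apple, appel", "answer": "apple"},
]

def referee(candidate_answer: str, correct_answer: str) -> bool:
    """Judge a single answer: True if it matches the stored solution."""
    return candidate_answer.strip() == correct_answer.strip()

def score(solver, dataset) -> int:
    """Run a solver over the whole dataset and count correct answers."""
    return sum(referee(solver(item["question"]), item["answer"]) for item in dataset)

# A toy solver that only knows the first question:
toy_solver = lambda q: "5" if q == "3+2" else "?"
print(score(toy_solver, qa_dataset))  # prints 1
```

The intelligence is judged purely by the final score; the referee never inspects how the solver works internally, which matches the point made in the text.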
More advanced computer software like Large Language Models is trained with question answering tasks, and it is also tested with these tasks. So the intelligence is not located in the large language model itself but in the dataset. A Q&A dataset is a method for storing knowledge. This knowledge is used to create intelligent software, and it is also useful to evaluate intelligent software.
In the history of AI research there was an interesting milestone. Since the 1980s, lots of robotics challenges and toy problems have been invented or rediscovered, like the micromouse challenge, the 15 puzzle sliding game and the MNIST OCR dataset. What these problems have in common is that they are not useful for practical applications; their only purpose is to get a better understanding of AI. For example, in the 15 puzzle game the user has to sort the pieces. The overall structure works with the previously mentioned dialogue, which consists of a problem and an answer to the problem. The problem might be to navigate a robot to the goal, and the answer is that the robot does so.
What makes these challenges interesting is the ability to score the situation from an objective perspective. It is possible to determine whether the robot has reached the goal, and the precise number of seconds gets measured. This evaluation allows robots to be compared against each other. The same situation holds for the MNIST OCR dataset, which is an example of image to text translation.
Perhaps it makes sense to give examples of other benchmarks which can be used to judge computer programs. The math example “3+2” was mentioned already. A more complicated challenge would be a question like “sin(pi)”. For language oriented questions, an entry level task would be to find the correct spelling of a word among “aple, apple, appel”; of course only the word in the middle is correct while the others are wrong. The interesting situation is that such questions, including the correct answers, can be collected in a large table, and then a certain computer program can be scored by how many questions were answered correctly. If the computer program is only familiar with math questions it will fail to answer the language problems.
It should be mentioned that the inner workings of such a computer are less important. The computer can use neural networks, handwritten code, an expert system or even a random algorithm which makes elaborate guesses. The only thing that matters is the score at the end.
What makes a Q&A dataset interesting is that the same principle can be adapted to any problem. It is possible to ask simple math questions, include natural language tasks and even encode visual question answering problems. Also, the principle scales well. If a certain Artificial Intelligence isn’t able to answer the dataset, the difficulty can be lowered. This allows determining exactly which questions are a source of failure, and the reason why can be investigated.
The cause why Q&A datasets were practically unknown before the year 2000 is that they work with the opposite assumption of existing tools within computer science. In the past, the self-understanding of computer science was to solve problems. An algorithm is a computer recipe to solve a problem, while a processor has to do something. In contrast, a Q&A dataset does nothing but formulate a problem.
A common paradigm until the year 2000 when talking about artificial intelligence was to imagine how to solve AI problems. The idea was to build a computer and write a program which is able to do so. Such attempts failed because it was unclear what the problem was about. There was only a vague definition of what intelligence is, and it is normal that no algorithm in the world is able to solve a vague problem. The advantage of a Q&A dataset is that it inverts the situation. The focus is on the problem definition, while the ability to solve it gets delayed. The new paradigm is that a question is created now, like “what color has the ball in the image?”, while the answer to the problem is left open. The assumption is that computers in 30 years will be able to answer this question.
1a Evaluation function as datasets
The need for an evaluation function was recognized early in computer chess. Nearly all chess engines programmed after the 1970s use a simple or a more advanced equation to judge the board situation. Combined with the move generator, the evaluation function helps to determine the optimal move.
The new insight is that an evaluation function can be realized in a data driven format. A typical example is to formulate a series of chess puzzles as a Q&A dataset, and the computer has to answer the problems. [Cirik2015] Typical questions are:
1. How many pieces are on the board?
2. Is b2b3 a legal move?
3. What is the material advantage of black?
Most of these questions are easy to answer, and it is likely that existing chess engines have a built in module which provides the correct answer. But the innovation, which can also be called a small revolution, is that the questions are formulated in an explicit way within a dataset, outside of any computer program. There is no need to write the computer code for an evaluation function; the more important step is to invent the chess puzzles and format them as question answer pairs.
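The first of the listed questions can be answered directly from a board description. As a hedged sketch, the snippet below counts the pieces in the board part of a FEN string (the standard chess position notation); the helper name and the stored Q&A pair are illustrative choices, not taken from the paper.

```python
# Sketch: answering one chess Q&A question type ("How many pieces are
# on the board?") from a FEN string. The position used is the standard
# chess starting position; the dictionary layout is illustrative.

def piece_count(fen: str) -> int:
    """Count the pieces in the board field of a FEN string."""
    board_part = fen.split()[0]          # first FEN field is the piece placement
    return sum(1 for ch in board_part if ch.isalpha())

qa_pairs = [
    {"question": "How many pieces are on the board?",
     "position": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
     "answer": 32},
]

for item in qa_pairs:
    assert piece_count(item["position"]) == item["answer"]
print(piece_count(qa_pairs[0]["position"]))  # prints 32
```

The point is that the knowledge lives in the question/answer pair; the checking code is a small, replaceable module on the side.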
There is a similarity between a Q&A dataset and a classical evaluation function. Both are used to judge the current chess position. The board is taken as input, and then a score or a judgment is provided as output. This score helps to guide the search process in the game tree. That is why it is called a heuristic evaluation function. The term heuristic refers to domain knowledge taken from human experts.
It should be mentioned that the problem is not very new. Even in the 1980s there was a debate about how to convert expert knowledge into computer programs. The understanding in the past was that expert knowledge should be encoded in a rule base, or in computer software. Such an understanding was not very powerful. The better idea is to encode expert knowledge in questions. Each question in a Q&A dataset defines a new problem, and it is up to the solver, which might be a human or a computer, to answer the question.
1b Heuristics are feature vectors
From a technical perspective it is possible to store a heuristic in a feature vector. In the simplest case the vector consists of only boolean values and stores information about recognized motion capture events. The harder task is to give a reason why such an understanding was not available before the year 2000, especially not the addition that the encoding in the feature vector represents natural language information.
But let us make a practical example. Suppose there is a robot in a maze, and the current state is encoded in a vector with:

[direction, speed, xpos, ypos, has_object, energy]

If the robot updates its position, the feature vector gets filled with new values too. So the feature vector mirrors the current game state in an abstraction layer. This concept sounds logical and allows advanced planners to be constructed on top of this information, which is known as heuristic search.
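The six-feature state vector above can be written down as a small Python struct. The field names follow the example in the text; the concrete values and comments are illustrative assumptions.

```python
# Sketch of the six-feature robot state vector as a dataclass.
# Field names follow the example in the text; values are made up.
from dataclasses import dataclass, astuple

@dataclass
class RobotState:
    direction: float   # heading in degrees
    speed: float       # current speed
    xpos: int          # x position in the maze
    ypos: int          # y position in the maze
    has_object: bool   # whether the robot carries an object
    energy: float      # remaining energy

state = RobotState(direction=90.0, speed=1.0, xpos=3, ypos=4,
                   has_object=False, energy=0.8)

# The abstraction layer mirrors the game state as a plain vector:
print(astuple(state))  # prints (90.0, 1.0, 3, 4, False, 0.8)
```

Each cell carries a meaning given by its feature name, which is exactly the encoding system the text argues was the missing precondition for a heuristic.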
The surprising insight is that such a feature vector was practically unknown in the robotics of the past. Not because of technical limitations. It is certain that computers before the year 2000 were more than capable of storing a six-cell array, but because of a philosophical mismatch. The first problem was that in the past the advantages of heuristics were not seen, and secondly it was unknown that the precondition for a heuristic is an encoding system. Encoding means that the information about the direction of the robot is stored in the first cell, the speed in the second cell, and so on. Each cell of the array has a meaning which is provided by its feature name and can be fully described in a sentence. It is some sort of struct variable used to capture outside knowledge.
Let me explain the philosophical mismatch the other way around. Suppose in the robot simulation there is no feature vector available, and we ask a programmer from the mid 1990s whether the robot project is missing some important elements. It is certain that the programmer won’t recognize the missing feature vector. He would perhaps say that the AI is missing, but he would also say that it is unclear how to program such an AI in software.
The main problem with feature vectors, and with the more general theory, the symbol grounding problem, is that from a technical perspective they are pretty easy to realize, but at the same time it is very hard to explain the advantages. The cause is that a feature vector is not needed in standard computer software. Software programs are created to fulfill the needs of a computer. They make sure that a certain algorithm runs on a machine. In contrast, a feature vector has its origin in a domain specific problem. It encodes a sentence like “direction of the robot in degrees” as a numerical value. Such information is highly important for the task description but is ignored by a computer program.
The interesting situation is that even chess engines which are equipped with an evaluation function have no dedicated feature vector. Even though the concept is strongly related to evaluating a chess board, most chess engine documentation doesn’t even mention the term. Instead, the evaluation function is realized in some other way. The common approach is to describe the features directly in a programming language like C or Java. The idea is that the source code is what matters and the encoding of possible chess puzzles is less important. This results in a certain understanding of what a chess engine is. The programmer assumes that the chess engine is a collection of source code lines formulated in a certain programming language. This source code gets compiled into binary code, and the main objective is that no error occurs during this translation step.
In contrast, the perspective from a feature vector is the opposite. Here, the written source code gets ignored and the only important facts are the single features which encode chess knowledge in an array. The computer has to store the array inside its memory and has to update the information in realtime.
A feature vector encoding is an example of a memory based architecture, while algorithm centric programming is based on the Von Neumann CPU model. Only the latter was accepted until the year 2000 in computer science as a valid model to discuss problems. A typical question in this period was about the runtime of an algorithm, which might be NP hard. Also it was asked how to program the Von Neumann machine in a way that a certain algorithm gets executed. In contrast, it was ignored what is stored in the memory. The untold assumption was that robot problems are formulated as a graph, so the datastructure in the memory will become a linked list. The more interesting problem until the year 2000 was what to do with this graph, because this question is strongly related to programming the computer with an algorithm.
To emphasize the difference between both philosophies, let us describe what is important for a feature vector. A feature vector is an array with cells. Each cell represents something about the problem. The cells are stored in the memory of the computer, which is mostly the RAM. In contrast, other elements of a computer like the CPU are ignored. There is no need to do something with the values; they are simply stored in the memory and that is all. What gets discussed instead is which values are in the cells and, very importantly, which features are needed for a certain problem. For example, “are 6 features enough, or should the vector store at least 12 features to get a better problem understanding?”
2 From backtracking algorithms to heuristics
Artificial Intelligence until the year 2000 can be roughly summarized as algorithms for uninformed search. The assumption was that a robot has to do an exhaustive search in the state space and this would allow it to solve any problem. Possible problems with this approach were known, and the understanding was that more efficient search algorithms and especially faster computers working with optical and parallel components would become available in the future.
It is obvious why the understanding in the past was dominated by backtracking related search algorithms: the notable examples of artificial intelligence which were able to plan a path in a maze, and especially chess engines, worked with this single principle. In all these cases the computer traverses a graph in a certain sequence, and this allows the optimal actions to be planned. The possible improvement to this approach (a heuristic) was known in theory but never applied to real problems.
The transition from past Artificial Intelligence, which was common until the year 2000, towards more recent AI approaches can be described as a stronger focus on heuristics and heuristic search. A heuristic allows the efficiency of a solver to be improved dramatically. In contrast, the existing backtracking and depth first search algorithms are only a backend. Even with a better implementation directly in C source code and with faster computer hardware, they won’t scale up well enough. The problem with AI related problems like motion planning for a robot arm is that a simple improvement by a factor of 2x or 10x is not enough to solve the problem.
Let me give an example. Suppose somebody implements a chess engine in Turbo Pascal on a 286 Intel MS DOS computer in the 1980s. The algorithm traverses the game tree with a horizon of 5 moves into the future. With a classical understanding of computing, the algorithm can be improved drastically. At first a different programming language is used, for example C, which is around 4x faster than Turbo Pascal. The next improvement is to use a more recent computer architecture. The Pentium I is around 150x faster than a 286 CPU. Overall the improvement would be 4x150=600x, which sounds like a lot. But for winning the game of chess a much higher improvement is needed. Even if the algorithm is executed on a supercomputer with a huge electricity consumption, the performance is too slow.
The only sensible tool is a heuristic. And exactly this part of Artificial Intelligence has been investigated in detail since the year 2000. It allows AI problems to be treated as informed search problems. The term informed refers to any sort of domain knowledge which goes beyond simple trial and error backtracking search.
Or let me formulate the situation differently. The reason why robotics after the year 2000 is so much more powerful than former attempts is that modern robotics uses heuristics to different degrees. In the minimal case, the robot has a built in heuristic in the format of an evaluation function. There is a unit in the source code which determines a score for the current game node. In the case of a chess engine, the evaluation function determines the winning probability, while in a motion planning algorithm, the evaluation function calculates the badness of a certain trajectory with respect to obstacle avoidance. More advanced approaches do not work with a hard coded evaluation function; instead the domain knowledge is encoded in an input dataset. There are examples which are converted into natural language, and the detected events are converted into a score. The score can be utilized for the search process.
So we can say that the only difference between past robotics and modern robotics is the heuristics. The existence or absence of domain specific knowledge is the single cause why modern robotics is more powerful.
Even before the year 2000, some papers were published about the need for heuristic planning. So the subject was recognized early as a bottleneck in Artificial Intelligence. The main problem in the past was that it was unclear how exactly human knowledge should be converted into heuristics. The only things known in the 1990s were that heuristic search would give a major improvement, and how to implement an algorithm like A* in software.
It should be mentioned that without a dedicated heuristic nearly all AI related problems, like robotics and game playing, remain unsolved. The state space of these domains is too large for uninformed search algorithms. The only thing that is possible is to prove that a certain problem is NP hard, for example the game of Lemmings. This understanding is equal to the confession that no computer in the world can play the game by itself.
There is a reason why it was hard or even impossible to create heuristics for certain problems: this task falls outside the scope of computer science. The sad situation is that a well formulated mathematical problem doesn’t contain any heuristic knowledge. Even if games like Lemmings, chess or Tetris are formulated as mathematical equations, there is no such thing as domain knowledge which can be converted into programs. One explanation might be that mathematics is about reducing the amount of detail, but these details are important for describing the problem.
To grasp the development which resulted in a better understanding of heuristics, we have to describe the situation differently. The focus is not on how to solve problems with a computer; the more interesting question is how to describe the problems. There are two important problem classes which can be utilized to create heuristics: first, motion capture problems, and secondly toy problems like the 15 puzzle. Both are strongly related to heuristics. In the case of mocap recording, a dominant problem is to determine possible binary features like the position of a marker or the collision of two markers. If two markers collide, the binary feature is true, otherwise it is false. This boolean feature is the natural heuristic to describe the problem.
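The marker-collision feature just described can be sketched as a distance test. This is a hedged illustration: the coordinates, the threshold value and the function name are made-up example values, not taken from any mocap system.

```python
# Sketch: deriving a boolean feature from motion capture markers.
# Two markers "collide" when their 3D distance falls below a
# threshold; coordinates and threshold are illustrative values.
import math

def markers_collide(a, b, threshold=0.05):
    """Boolean feature: True if two 3D markers are closer than threshold."""
    return math.dist(a, b) < threshold

left_hand  = (0.50, 1.20, 0.30)
right_hand = (0.52, 1.21, 0.30)
print(markers_collide(left_hand, right_hand))  # prints True
```

The resulting True/False value is exactly the kind of boolean cell that fills the feature vector discussed earlier.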
In the other example, the 15 puzzle toy problem, the heuristic is created with an evaluation function which is the result of a feature vector. The features are: distance to goal, number of cells cleared, position of the empty tile. These features are used to determine the score. In general, a heuristic has to do with identifying a feature vector which gets converted into a numerical score. Such an understanding was not available before the year 2000.
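A feature-based evaluation of this kind can be sketched for the 15 puzzle as follows. The features loosely follow the ones named in the text (distance to goal, tiles already placed, position of the empty cell); the weighting in `evaluate` is an illustrative assumption, not a claim about any particular solver.

```python
# Sketch: a feature-based evaluation for the 15 puzzle.
# The board is a tuple of 16 values, 0 marking the empty cell.

def manhattan_distance(board):
    """Sum of Manhattan distances of all tiles to their goal cells."""
    total = 0
    for idx, tile in enumerate(board):
        if tile == 0:
            continue
        goal = tile - 1                      # tile t belongs at index t-1
        total += abs(idx // 4 - goal // 4) + abs(idx % 4 - goal % 4)
    return total

def features(board):
    """Feature vector: [distance to goal, tiles placed, empty position]."""
    placed = sum(1 for idx, t in enumerate(board) if t != 0 and t == idx + 1)
    return [manhattan_distance(board), placed, board.index(0)]

def evaluate(board):
    """Combine the feature vector into one score (illustrative weights)."""
    dist, placed, empty_pos = features(board)
    return dist - placed                     # lower is better

goal = tuple(list(range(1, 16)) + [0])
print(evaluate(goal))  # prints -15
```

The point of the sketch is the pipeline: board state, then feature vector, then a single numerical score which can guide a heuristic search.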
In contrast, there was an early paper published in 1993 which recognized the importance of a feature vector for encoding a heuristic. [Rumelhart1993] describes the problem solving capabilities of a connectionist neural network. Such a network takes a vector as its input layer.
What we can say for sure is that after the year 2000 the advantages of feature vectors and other heuristics were investigated in depth by an endless amount of papers. Even the symbol grounding problem, which is a theory to explain how to create heuristics in general, was widely discussed after the year 2000.
2a Memory based heuristics
The transition from former backtracking based algorithms towards more advanced memory based abstraction mechanisms can be traced back in the literature to the year 2000. There is a paper published in 1999 which discusses, from a very theoretical perspective, the advantages of a state space abstraction mechanism realized with a pattern database. The promise is to make existing search based planners for the 15 puzzle game more efficient. [Holte1999] And indeed the promise can be realized: memory based heuristics provide a huge improvement.
The only problem with these early attempts is that the explanation is a bit cumbersome. Especially for non experts in the domain of planning problems, it remains unclear what exactly the task of a pattern database is. But this problem gets solved automatically because many more papers have been published since the year 2000 about the same subject which describe the situation more precisely. From an objective perspective there are two sorts of search algorithms: memory based and CPU based.
It was uncommon in the late 1990s to use terms like “symbol grounding problem” and “feature vector” to describe memory based search. Instead the principle was known as a “pattern database”, which allows storing heuristics for dedicated planning problems:

  Name           Principle
  CPU based      uninformed algorithm
  Memory based   heuristics, feature vector

  Table 1: Search algorithms

  “Pattern databases (PDBs) are dictionaries of heuristic values that have been originally applied to the Fifteen Puzzle” [Edelkamp2002]
In the late 1990s the subject of heuristic search was very new, and possible improvements like natural language tags were not invented yet. So it is normal that the definition of what a pattern database is was a bit theoretical:

  “A pattern is the partial specification of a permutation (or state). That is, the tiles occupying certain locations are unspecified.” [Culberson1998]
The definition was closely related to the application of planning for the 15 puzzle game. An example of a pattern would be to sort only the first row and then label it as “firstrowcleared”. It is unclear whether the patterns were labeled with natural language in the late 1990s, but presumably not. Nevertheless, the paper recognized the advantages of state space abstraction and gives an introduction to the subject.
It should be mentioned that pattern databases for planning problems in the 1990s were equal to state of the art high performance supercomputing. Experiments were realized on the fastest supercomputers of that time:

  “The programs were written in C and run on a BBN TC2000 [...] The TC2000 has 128 processors and 1 GB of RAM” [Culberson1998], page 11
Even from today’s perspective (the year 2024) such a computer is a super fast parallel cluster. From a technical perspective a pattern database can be realized on a much smaller home computer, for example a 286 MS DOS PC from the mid 1980s. But before someone can write the code, the subject has to be explored from a theoretical perspective, and this knowledge was missing in the 1980s.
2b Pattern databases in the late 1990s
The transition from uninformed search algorithms to more efficient memory based search can be traced back to the late 1990s. Instead of describing the situation with the more recent term “symbol grounding problem”, the widely used term was “pattern database”. In the published papers it was clearly recognized what the steps towards heuristic search are. At first, an abstraction mechanism is needed:
quote “The first step is to translate the physical puzzle into a symbolic problem space to be manipulated by a computer.” [Korf1997] page 1
In contrast to today’s attempt at encoding the heuristic in a feature set, the understanding of the late 1990s was dominated by an algorithmic perspective. The idea in the past was to handle everything as a for loop in a computer program:
quote “for reasons of efficiency, heuristic functions are commonly precomputed and stored in memory.” [Korf1997] page 3
Translated into today’s terms, the idea was to loop through the state space and create the pattern database at runtime, instead of the more static understanding as a feature vector. A possible guess why this philosophy was preferred is that the objective of the paper was to solve the 15-puzzle game. This objective was similar to the former uninformed search algorithms, which also have a strong focus on solving problems. A pattern database was interpreted only as a faster technique to search the state space. It was an add-on to the main program but not the center of attention.
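The described “loop through the state space” philosophy can be made concrete with a small sketch: a breadth-first search outward from the goal precomputes a table of heuristic values, and each concrete state is collapsed into a pattern. The tiny 2x3 puzzle and the tracked tile group are simplifications chosen here for readability, not the setup of [Korf1997]:

```python
from collections import deque

# Tiny 2x3 sliding puzzle; board positions 0..5 are laid out as
#   0 1 2
#   3 4 5
# A state is a tuple with tiles 1..5 and 0 for the blank.
NEIGHBORS = {0: (1, 3), 1: (0, 2, 4), 2: (1, 5),
             3: (0, 4), 4: (1, 3, 5), 5: (2, 4)}

def moves(state):
    """All states reachable by sliding one tile into the blank."""
    blank = state.index(0)
    for pos in NEIGHBORS[blank]:
        s = list(state)
        s[blank], s[pos] = s[pos], s[blank]
        yield tuple(s)

def build_pdb(goal, pattern_tiles):
    """BFS from the goal; store the minimum distance per pattern."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        state = queue.popleft()
        for nxt in moves(state):
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    pdb = {}
    for state, d in dist.items():
        key = tuple(t if t in pattern_tiles else "*" for t in state)
        pdb[key] = min(pdb.get(key, d), d)
    return pdb

goal = (1, 2, 3, 4, 5, 0)
pdb = build_pdb(goal, {1, 2})           # track only tiles 1 and 2
print(len(pdb))                          # number of stored patterns
print(pdb[(1, 2, "*", "*", "*", "*")])   # solved pattern has heuristic 0
```

Because sliding moves are reversible, a forward BFS from the goal yields the same distances as a backward search, and taking the minimum over all states that share a pattern keeps the heuristic admissible.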
The interesting situation is that only a few years later, in the year 2008, the focus was strictly set on the pattern itself and not on the algorithm for creating it. Also, the relationship between state space, feature vector and pattern database was clearly recognized:
quote “The state space was partitioned based on feature vector” [Samadi2008] page 1
In the concrete paper [Samadi2008] the idea was to store the feature vector in a neural network to solve the 15-puzzle problem. Such an attempt matches closely with the current Artificial Intelligence understanding, which has a strong focus on feature vectors for storing problem heuristics.
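What such a feature vector for a sliding puzzle might look like is sketched below. The two chosen features (misplaced tile count and total Manhattan distance) are standard textbook choices, not necessarily the ones used in [Samadi2008]:

```python
def features(state, goal, width=4):
    """Turn a sliding-puzzle state into a small feature vector."""
    # Feature 1: number of tiles not on their goal position (blank ignored).
    misplaced = sum(1 for t, g in zip(state, goal) if t != g and t != 0)
    # Feature 2: sum of Manhattan distances of all tiles to their goal cells.
    manhattan = 0
    for pos, tile in enumerate(state):
        if tile == 0:
            continue
        goal_pos = goal.index(tile)
        manhattan += (abs(pos // width - goal_pos // width)
                      + abs(pos % width - goal_pos % width))
    return [misplaced, manhattan]

goal = tuple(range(1, 16)) + (0,)
start = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 14, 0)
print(features(start, goal))  # [2, 2]
```

A regression model such as a small neural network could then be trained to map such vectors to heuristic values, which is the core idea of storing the heuristic in a feature representation.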
2c Early robots in the 1950s
Before the advent of the computer revolution in the 1980s there was an interesting decade which resulted in many electronic projects around robotics. In the 1950s household robots were created by artists, and they were mostly working by remote control. The human operator stands behind the robot and moves the base back and forth, and the operator has a joystick to control the arm of the robot. After pressing a button, the robot plays back a short speech sample stored on a vinyl record, to make the audience think that the robot is real. Of course, the robot was only a theater presentation and not a scientific breakthrough.
Nevertheless, it makes sense to investigate the principle from a technical perspective, and it is even possible to reproduce the behavior. Everything that is needed is a lifelike humanoid robot controlled by servo motors, a human operator in the background and some wires to connect both. The overall project is an example of remote manipulation, or in other words, it is an interactive game. The human operator has an input device which moves the robot’s arm and base.
A more experienced observer will notice what the difference is to a real robot. Industrial robots usually do not work interactively with a human operator but are controlled by a computer program in an autonomous fashion. This allows running the robot without a human operator, which is the main difference between a theater robot and a scientific project. But what if such an assumption is wrong? What if so-called interactive robots are the real robots?
The principle of an interactive robot is the same as in a video game. The computer provides the environment and the human user has to take decisions in the game. This is called a simulation. The computer is in charge of transmitting the joystick movements of the human into actions inside the simulation. For the sake of simplification we can assume that early robots in the 1950s work like a modern video game. This description allows a better understanding of which component is missing. It is about the difference between a normal video game and an AI controlled video game.
Let me give an example. Suppose a Box2D simulation is written which allows a human to control a simulated robot arm. Such a game has much in common with early robots in the 1950s. The only difference is that in the 1950s the robot arm was a physical object in reality, while the Box2D robot arm is only available on the computer screen. The challenge is to automate the movements of the arm. The arm should grasp an object without the input signal of a human operator.
2d Artificial Intelligence in the 1990s
In the 1990s the last AI Winter was visible. The perception of the subject was negative in the public and even among experts in the field. Despite the existence of powerful 8-bit microcomputers and more powerful 32-bit personal computers, it was not realistic to program robots and do useful tasks with them. Even the more modest attempt at creating research-only robots, which should demonstrate a basic function like object grasping or walking within the safe space of a university laboratory, failed.
What was unknown in the 1990s was the future development, especially the 2000s with the advent of deep learning. The general outlook in the 1990s was pessimistic. The idea was that robotics in the 1990s had failed and would show the same outcome 20 years into the future. Nevertheless, some books and papers were published about the subject during the sad 1990s, mostly from a philosophical perspective. A common question was how to convert the human brain into a computer brain.
If the 1990s were equal to an AI Winter, it should be important to know how exactly a more positive environment was created which resulted in today’s euphoric perception of robotics in the public and among AI programmers. The interesting situation is that the AI Winter took place only on the surface; in parallel there was also a lot of research, but with very little visibility. Let me give some examples. The teaching language “Karel the Robot” was invented in the 1980s and it was available during the 1990s. But it was nearly unknown, or it was ignored. From today’s perspective the robot language provides a hands-on mentality and is a great choice for explaining to newbies how to program robots. Another example are motion capture projects from the 1980s and 1990s. These projects were started with the attempt to create realistic computer animation. They can be seen as the forerunner of modern biped robotics, which works with a similar principle. And last but not least, many neural network experiments were realized in the 1990s, so we can say that from a technical perspective this decade was highly productive.
One possible reason why these projects were ignored during the 1990s might be the absence of the Internet. During the 1990s most of the research literature was published in printed format only. Specialist proceedings presenting the research outcomes were distributed in a small number of copies. Even well informed book authors of this time were not aware of the projects. This resulted in a superficial perception of reality. Only papers which are available can be referenced in a book, and access to such literature was not available.
The main problem in Artificial Intelligence is that it is hard to define the boundaries of the subject. Existing university disciplines have a precise discourse space which can be captured by a handful of important books. In contrast, Artificial Intelligence references many subjects at the same time, like mathematics, psychology, neurobiology, linguistics and computer science. This makes it hard to provide the literature in a single bookshelf. A closer look into a modern AI related paper published at arxiv will show that the paper is some sort of overview paper about progress in many subjects from the humanities, arts and mathematics.
Perhaps a concrete example can demonstrate the situation. Suppose the idea is to write a paper about a state-of-the-art robot system. For doing so, lots of subjects need to be known, like motion capture for recording human motion, computer science for programming the software, and natural language processing for annotating the dataset. Unfortunately, these subjects are taught at the university in different departments. NLP is located within language studies, motion capture is researched as art, while programming is the interest of computer scientists. During the 1990s it was nearly impossible to combine these efforts. It was simply too complicated to get books from these subjects, because no library in the world is large enough to collect this amount of information. Even libraries at larger technical universities can’t provide the needed information, because subjects like the history of filmmaking and grammar based parsing of languages aren’t taught in a mathematics or computer class.
2d1 Developments since the last AI Winter
A decline in AI research is frequently called an AI Winter, which includes failed AI projects and a missing theoretical understanding of what AI is about. In short, an AI Winter means that the researchers have no idea how to program robots, and more importantly, they don’t even know what is wrong with the software and algorithms they are creating.
The last AI Winter lasted from 1987 to 1992. After this period, AI research has shown many successes. An unsolved question is what exactly ended the last AI Winter and resulted in AI progress. To discuss this problem in detail, we should first explain the situation from 1987 to 1992 and make clear why exactly the robots from this time were not very advanced.
In the late 1980s, a lot of advanced computer hardware and software had already been invented. There was a 32-bit microprocessor available (Intel 386) with a high amount of RAM, modern operating systems including UNIX and GUI windowing systems existed, and the Internet was in an early stage. Even beginner friendly programming languages like Turbo Pascal and video games were very popular in the mainstream market. At the same time, it was unclear how to use this technology to build robots. The dominant reason was that robot related problems including path planning were recognized as NP-hard, and no algorithm was available to address these problems.
Let me explain the situation from a technical perspective. In the late 1980s it was common knowledge how to print out something with a printer driver, it was common knowledge how to program a 2D platformer game, and it was possible to create a database like dBASE for handling tables. So computer technology demonstrated that it can be used for many purposes. But it was completely unclear how to program robots so that they can grasp objects or walk on two legs. Such technology was only available in science fiction literature. The problem of missing robotics algorithms was known in the late 1980s, and because of this situation the period was called an AI Winter.
The interesting and seldom explained question is what exactly made the AI Winter disappear. After the year 1992, and especially after the year 2000, the robotics and AI situation was more relaxed. Many new projects were started, most of them with success. It should be important to know what exactly was different than before.
The core problem in AI is complexity, which is equal to a large state space. Even a fast 32-bit CPU with 50 MHz has a limited amount of resources. So the question is which sort of hardware or software fits robotics well. In the 1980s a typical approach was to build dedicated computer chips and invent new programming languages like Prolog with the goal of addressing this issue. All these projects failed too. It seems that even with a parallel processor and a 5th generation programming language, it is impossible to program robots.
If we want to explain in a single sentence what started the AI revolution since 1992, we can focus on robotics programming challenges. Until the year 1992 it was unknown how important puzzle problems like the 15-puzzle, chess, or the line following problem are. Even if the puzzles were known, they weren’t recognized as relevant for AI research. In contrast to an algorithm, a puzzle problem defines not the answer but the question. It is up to a programmer to find a solution for the puzzle. There are many indicators that the shift from solution oriented algorithm thinking to puzzle oriented problem thinking has helped to overcome the last AI Winter.
Perhaps it makes sense to give a concrete example. A typical approach in handling AI problems after the year 1992 would be the following. There is a toy problem available, like the 15-puzzle. A human plays the game, and the current game state including the player’s decisions is dumped into a game log. The game log is the dataset which gets fed into a neural network.
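The logging step of this pipeline can be sketched as follows. The column names, the CSV layout and the example rows are hypothetical choices for illustration:

```python
import csv
import io

# Each row of the game log stores the state before a move and the player's
# decision; the rows below are invented example data for a 15-puzzle session.
log = [
    {"step": 0, "state": "1 2 3 4 5 6 7 8 9 10 11 12 13 15 14 0", "action": "left"},
    {"step": 1, "state": "1 2 3 4 5 6 7 8 9 10 11 12 13 15 0 14", "action": "left"},
]

# Write the log in CSV format (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["step", "state", "action"])
writer.writeheader()
writer.writerows(log)

print(buffer.getvalue())
```

The resulting CSV text is the dataset; what to do with it (e.g. training a neural network) can be postponed to a later project.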
Such a pipeline was too advanced for the last AI Winter from 1987 to 1992. It was beyond the horizon during this time. Even the first sentence wouldn’t have been understood by researchers in the past. They wouldn’t see what the advantage is of researching a toy puzzle and what the purpose is of creating a game log of human actions. The reason is that during the last AI Winter the focus was on solving problems with algorithms, which is the opposite of the described workflow.
The best practice method for tackling AI domains is problem centric, dataset centric and focused on human interaction. These elements combined explain the success of modern AI research. From a technical perspective it is trivial to create game logs and research toy problems like the 15-puzzle. A dataset is mostly a CSV file, and the 15-puzzle was known for sure in the late 1980s. What was unknown was why these tools are needed in AI research.
Let us go back to the hard problems during the last AI Winter. The typical problem was how to search a large state space. The assumption was that a robot can move in many directions and the computer program has to traverse the game tree to find an answer. Even for simple games like chess the game tree is very large, and more advanced robotics problems generate a much larger state space. During the last AI Winter, the AI researchers were not able to search such a state space. What was available during this period was the problem itself, including a precise understanding of the size of the state space. It can be calculated precisely how many possible game states are available in chess and other problems, which are millions of millions. Then it can be calculated how long a typical 32-bit CPU will need to traverse this space. The conclusion was that even a supercomputer is not able to play such games, and this stopped AI research entirely.
In other words, in the late 1980s a certain problem was discovered as the core problem in AI, namely how to search a large state space. This problem was unsolvable, and therefore an AI Winter was the result.
In contrast, the situation after the year 1992 has changed in the sense that now the main problem is described differently. The researchers have learned that they can’t solve large state space problems. They don’t even try out a depth-first search algorithm on a robotics problem, because they know that the algorithm will run 100 years and longer to figure out a simple path planning problem. What the AI researchers prefer instead are modified problems, datasets, and heuristics. All these approaches work with a low amount of CPU resources and have a higher chance of success. Let me go into the details. Suppose the idea is to create a game log for the 15-puzzle game. It is likely that a programmer is able to program such an algorithm. It won’t need much CPU time or other resources. The unsolved problem is what to do with the game log in the CSV format. But this problem can be postponed. The next AI researcher will perhaps create a dataset with a motion capture recording device, and a third researcher will try to make sense of these data with a neural network.
Such a distributed pipeline has become popular with the advent of deep learning after the year 2000. Since then, endless datasets and lots of projects to feed the data into neural networks have been started. The main difference to AI projects until 1992 is that dataset oriented AI projects have a great success probability. Instead of building entire robots, the projects are mostly about preliminary steps towards this goal. The sub-problem of creating a dataset for an existing game like Tetris or chess is so easy to master that even programming newbies in their first semester at the university are able to program such software with success. This is maybe the biggest difference to AI projects during the last AI Winter.
Let me give an example. A typical robot project in the late 1980s would be to create a software which controls a robot. The chance of failure for such a goal is very high. The robot won’t do anything. In contrast, the self selected project goals after the year 1992 are much smaller. A typical project is about recording the position of a teleoperated robot and converting the file into an MS Excel table. The result of such a newbie friendly AI project is a simple table with 2 columns for the x and y position. The idea is that such a table may help in the future to program robots. And the assumption is right.
In other words, post AI Winter projects tackle the AI problem by dividing it into smaller chunks. Possible strategies are:
1. Dataset creation
2. Speaker hearer interaction
3. Teleoperated robots
4. Heuristics
These four elements alone explain the success of AI research since 1992. Most of the projects focus on these attempts at building intelligent machines. Especially the first approach (dataset creation) has been identified as a Swiss army knife for any sort of AI problem. No matter if the robot is a logistics robot, a flying drone or a grasping robot – a modern research project always works with a dataset. Either a new dataset is created, an existing dataset is selected, or a past dataset gets updated.
The last AI Winter from 1987 to 1992 is often referenced as the 5th computer generation. To make the point clear, we have to explain first what the other 4 generations are. A computer generation is about a certain mixture of hardware and software. For example, 4th generation computers are equal to current computer science, which is about object oriented programming in Java plus modern multiprocessor CPUs. The idea of so-called fifth generation computers was to invent more advanced systems which go beyond the capabilities of 4th generation computers. This attempt has to be called a failure. Even the latest computers from 2023 are mostly 4th generation computers. The user sees a GUI menu and has the ability to run web browsers, databases and video games. Roughly spoken, computer development stopped in the late 1980s, and since then the researchers have been trying to build robots, which is more complex than previous computer generations.
The difference between the 4th and the possible future 5th computer generation may explain why AI projects are so hard to realize. The problem is that apart from the tools of the 4th generation computers, which are object oriented programming languages and modern 32-bit CPUs, there are no additional tools available. What is missing are dedicated AI libraries, AI hardware and AI programming languages, and the chance is high that such tools will never be invented. In other words, since the 4th computer generation no measurable progress is visible, and this is equal to the last AI Winter.
The mentioned tools for overcoming the AI Winter, which are datasets, heuristics and so forth, are located outside of computer science. They have nothing to do with classical hardware and software, but they use existing technology in a different form. That means modern AI software is written in classical 4th generation programming languages and runs on an ordinary 32-bit computer. And dumping a dataset into a CSV file doesn’t need a dedicated AI library; it can be realized much better with existing tools and libraries. This situation made it hard to localize modern AI tools. AI has nothing to do with computers, hardware and software, but is located outside the box. The development of AI can be traced back to certain subjects which are discussed in academic papers.
For example, the term dataset was nearly unknown in papers until 1992, but it was used frequently with the advent of deep learning.
2d2 The invisible 5th generation computers
It is possible to describe the entire computing history in generations. Every possible piece of hardware or software can be assigned to one of the four computer generations. For example, the C programming language was invented during the 3rd generation computer period. These simple 4 categories give a great overview of a complex subject and are used in most museums as a timeline to explain how computing technology has evolved.
Unfortunately, the concept shows a contradiction for 5th generation computers and Artificial Intelligence. According to a purely numerical understanding, the 4th computer generation ended in the mid 1980s, so this was the starting year for the 5th generation. The problem is that the 5th generation of computers is not defined precisely, and even worse, it is equal to the last AI Winter, which means that the attempt at building such computers has failed. This would imply that modern computers built in the 2020s work with the same hardware and software principles as the 4th generation computers, so this period has never ended.
The idea in the mid 1980s was that the 5th generation of computers is equal to Artificial Intelligence working with parallel supercomputers and the Prolog language. But this technology was a failure. The mentioned tools are not used in AI projects. So the question is: where exactly is the 5th computer generation?
The working thesis in this chapter is that the technology is available, but it is invisible. The 5th computer generation works with a different principle than the generations before; it has nothing to do with hardware or software but with a philosophical standpoint. The only place in which the 5th computer generation can be localized for sure is the Gutenberg galaxy, which are books and papers about AI. The advent of deep learning was equal to an increased publication activity around this subject. An endless amount of papers were written about deep neural networks including their applications. There is no single computer chip and no single programming library which has powered the deep learning hype; it is a purely paper based discipline.
To visualize the 5th computer generation we have to draw a bar chart with the number of keywords used in academic papers. Since the mid 1980s (which was the starting point of 5th generation computers) many new papers have been written about neural networks, robotics and natural language processing. These papers including their content are the proof that the 5th generation computing technology exists.
The surprising situation is that practical demonstrations of robots work with 4th generation and even with 3rd generation computer technology. The typical neural network was programmed in the C language without further libraries, and it runs on outdated hardware. That means, from the perspective of computer history such a project is boring. The innovation is located in the application of this technology. The well known computers are used in a different way than before. The same limited CPU is used for solving a new sort of problem, not known before.
The self understanding of any computer museum is to make the development visible, so there is a need to overcome a hidden technology. A possible attempt to visualize AI development since the 1980s is to focus on puzzles instead of describing algorithms which can solve the puzzle. A well designed AI museum would present some of these puzzles, like Rubik’s cube, a chess board, the 15-puzzle and a Micromouse maze, to the audience. The idea is to use 4th generation computers to solve 5th generation puzzles. Especially the 15-puzzle is an important milestone because it is commonly used to explain the advantages of heuristics. A heuristic is a tool to solve more advanced AI related problems.
Most AI related puzzles like Micromouse, the 15-puzzle and Karel the Robot have a background in teaching computer science. In most cases these puzzles were invented with the only objective of teaching students how to program a computer. It sounds a bit contradictory, but this sort of puzzle is well suited for explaining what Artificial Intelligence is about. Unfortunately, it is difficult to explain these tools similar to 4th generation computers. A puzzle in the strict sense is mostly a computer course or a paper which describes such a course.
The assumption is that future robots (not invented yet) will have much in common with a puzzle but low similarity with an algorithm.
2e Early teleoperated robots
Most robots and pseudo-automatons from the past are undocumented. Only textual descriptions or hardware parts are available. These machines were created from around the 1940s until the 1970s. Sometimes they work with analog circuits and sometimes with mechanical components, which makes it hard to describe the logic with a modern understanding. But there is a certain perspective available which makes it much easier to understand the principle.
The working thesis is that every robot built until the 1970s was teleoperated. Teleoperation means that the signals for the servo motors are provided from a distance, either with a wire or with wireless transmission. This understanding helps to reduce the complexity of the early robots. Instead of analyzing the inner workings including the cables and the analog circuits, it is enough to know that there was a remote control device with some buttons and knobs and a human operator was in charge of pressing the buttons. The resulting robot can be divided into two parts. The robot itself, shown to the audience, was a platform on wheels and maybe a robot hand; in addition, there was a remote control panel used by the human to control the robot.
The only interesting thing is the remote control. In the easiest case it is equal to a joystick plus some buttons to activate the voice or a light bulb. The user interaction between a human and the remote control is important to understand robots from the past. The technical evolution is not located inside a robot but has to do with improved remote control panels.
Let us imagine some possible interactions between humans and the remote control device. In the primitive version a cable is needed which transmits joystick signals to the robot. The human presses the joystick forward, and this makes the robot move forward. More advanced remote control devices work with wireless (radio) controlled panels. And the most recent remote control technique uses a high abstraction level. Let me give an example.
Suppose there is a snake robot controlled with a remote control. In the basic setup, the human has to control the joints one by one, which results in a slow movement. The human has to press lots of buttons until the snake does something. In a more advanced setup, the movement is generated by an oscillator. An oscillator is a sine function realized in hardware which has some parameters. The joystick allows the human to adjust the parameters, which is equal to providing only high level adjustments to the ongoing movement. The resulting movement produces a lower workload for the human operator.
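How such an oscillator reduces the operator workload can be sketched in software. The parameter set (amplitude, frequency and a phase lag between the joints) is a typical choice for snake gaits, assumed here for illustration:

```python
import math

def joint_angles(t, amplitude, frequency, phase_lag, num_joints=6):
    """Generate one angle per joint from a sine oscillator at time t.

    Instead of commanding every joint individually, the operator only
    adjusts the three oscillator parameters, e.g. with a joystick.
    """
    return [amplitude * math.sin(2 * math.pi * frequency * t + i * phase_lag)
            for i in range(num_joints)]

# The operator nudges the parameters; the oscillator drives the joints.
for t in (0.0, 0.1, 0.2):
    angles = joint_angles(t, amplitude=30.0, frequency=1.0, phase_lag=0.5)
    print([round(a, 1) for a in angles])
```

Three numbers now control six (or sixty) joints, which is exactly the reduction from low level joint commands to high level adjustments described above.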
Perhaps an example makes sense at this point. Suppose there is a robot in a museum built in the 1960s by an unknown engineer. There is no documentation available, and the robot doesn’t work anymore. After opening the hardware there is an endless amount of cables and circuitry inside and no one knows what the purpose is. The working thesis is that the inner workings of the machine can be ignored. It is for sure that the electronics was in charge of controlling the robot and that there is somewhere an input module which takes the signals from a remote control. The more interesting part of the machine is located outside of the robot. The remote control itself is the single device which is in charge of generating the signals. These signals are sent to the robot and then the arm moves. So the dominant questions are: which sort of remote control belongs to the robot? Is this device documented? Can the remote control be rebuilt from scratch? How many buttons are on the panel? What was their purpose?
Early humanoid robots shouldn’t be interpreted as a calculator but as a transistor radio. A radio receives and transmits signals from a remote location. It never works by itself. The voice from the loudspeaker isn’t generated by a person inside the radio; the person sits kilometers outside of the device.
2e1 Anti pattern in teleoperation
Teleoperation is mostly rejected by computer scientists because it looks like a dead end. The typical remote control for a robot is simply a joystick which is connected to a robot arm. Even if it is possible to control a robot with this approach, it sounds trivial to do so. What is missing is the ability to fulfill a task autonomously by a computer program.
But what if this sort of teleoperation was only made wrong in the details and the overall idea is great? Let us investigate why typical joystick based remote control seems like a dead end. The first problem is that the analog joystick movements can’t be played back. If the same actuator movements are recorded and repeated, the robot arm will for sure collide with an object and it will miss the object. The second problem with joystick control is that apart from movements to left/right/up/down no further action is possible, so the control is located only on a low level.
Let us modify the setup drastically so that teleoperation becomes more interesting. The first thing to do is to define a list of commands instead of using an analog joystick. The idea is that the human operator has a dozen buttons and has to press one of them. Such a user interface might be harder to use, but it is much easier to record and play back. Every possible sequence of actions consists of the pressed buttons. The second modification is to reduce the problem space. Instead of grasping objects in the normal 3D space, the idea is to reduce the target position to a grid. That means the robot gripper can be at (1,0), (1,1), (1,2) and so on, but not at (1.231, 3.212).
With these two simple modifications the teleoperated control makes more sense than in the initial setup. To navigate the gripper to a goal position the human operator has to press buttons for left and right movements, and a third button will grasp the object. This allows grasping most objects with a high success rate. And very importantly, it is possible to write down the command sequence into a script and run it in autonomous mode.
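A minimal sketch of this discretized teleoperation, assuming an invented four-word command vocabulary plus a grasp button:

```python
# The gripper lives on a grid; every button press is one symbolic command.
COMMANDS = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}

def run_script(script, start=(0, 0)):
    """Play back a recorded command sequence in autonomous mode."""
    x, y = start
    grasped = False
    for cmd in script:
        if cmd == "grasp":
            grasped = True
        else:
            dx, dy = COMMANDS[cmd]
            x, y = x + dx, y + dy
    return (x, y), grasped

# Recorded from the button presses of the human operator, then replayed.
script = ["right", "right", "down", "grasp"]
print(run_script(script))  # ((2, 1), True)
```

Because the session is just a list of symbols, recording and replaying it is trivial, which is exactly what analog joystick control cannot offer.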
This simple example has shown that teleoperation is a great idea if some parameters are fulfilled. It is important to design the GUI interface in a certain way which allows reducing the state space of the robot. With the improved GUI interface the teleoperated robot has much in common with a toy problem in a maze game. The human controls a gripper which can reach some positions, and then the gripper can grasp an object and bring it to the start position.
2f Textual teleoperation
Teleoperation in the classical sense refers to joystick control and visual feedback [Farkhatdinov2008]. The human operator moves the joystick forward and sees on a monitor what the robot is doing. Unfortunately, this control strategy can’t be recorded and played back, which prevents the robot from doing the task autonomously. A typical assumption is that because of these disadvantages, teleoperation as a whole is useless.
There is another mode in which a human operator can interact with a robot. It has no dedicated name but can be coined "textual teleoperation". The idea is that the human operator types in commands and the robot answers with textual information. Such a communication has much in common with a point&click adventure like Maniac Mansion, but it is seldom utilized for robot control. The main advantage is that the communication can be recorded easily, and it is possible to write a script which automates the control of the robot.
The most surprising effect of textual teleoperation is that the robot has no limitations anymore. It is possible to control the robot even for complex tasks because every command is provided by a human expert who is aware of the situation. The only drawback is that programming the interface is a demanding task. Somebody has to figure out the vocabulary and make sure that the robot understands the words. From a programming perspective this is equal to programming a medium complex point&click adventure in which the human player has some action words in a menu and can select them to execute longer sequences.
2f1 From solving games towards generating games
A typical misunderstanding in the history of Artificial Intelligence concerns the concrete task of a robot or a software agent. The untold assumption has evolved over the decades. Until the 1990s the idea was that a computer program needs to be intelligent in a sense that is comparable to a human. Later the assumption evolved into the goal that the AI should play games like chess, Tetris and RoboCup. A more recent approach in defining the role of Artificial Intelligence is that the AI agent should create a game from scratch which can be solved by another party.
Let us discuss the last two assumptions in detail. A frequent assumption is that there is a game available like Pacman and the task is to act in this game. This task is delegated to an AI. Endless amounts of projects are built around the goal of programming an AI which can play a single game or multiple games. The main disadvantage is that only simple games like the 15 puzzle or Tetris can be played with this principle, while more complex situations from robotics are ignored.
The obvious problem with robotics domains is that no such thing as a game is available. First, a robot acts outside of a computer in reality, and second, there is no simulator available for the domain. To overcome this obstacle there is a need to redefine the goal for an AI: instead of solving games (which are not there), the idea is to design a game. Such a task is more complex because game design works on a different principle.
From an abstract perspective a text adventure game is defined by its vocabulary, which is stored in a 2-tuple BNF grammar. The player can execute actions like "moveto location", "grasp object", "speakto person". Every text adventure is programmed around such a grammar, so the first task for a robot is to invent the BNF grammar from scratch.
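Such a 2-tuple grammar can be sketched as two word lists plus a validity check; the verbs and nouns below are hypothetical examples, not the vocabulary of an actual game:

```python
# A text adventure vocabulary as a 2-tuple (verb, noun) grammar.
# The word lists are hypothetical examples, not from a concrete game.

VERBS = ["moveto", "grasp", "speakto"]
NOUNS = ["kitchen", "key", "guard"]

def is_valid(sentence):
    """A sentence is allowed if it matches '<verb> <noun>'."""
    parts = sentence.split()
    return len(parts) == 2 and parts[0] in VERBS and parts[1] in NOUNS

def language():
    """Enumerate every sentence the grammar allows."""
    return [f"{v} {n}" for v in VERBS for n in NOUNS]

print(is_valid("grasp key"))   # True
print(is_valid("eat key"))     # False
print(len(language()))         # 9 sentences in total
```

The entire language is the cross product of the two word lists, which is why the state space of a text adventure stays small by construction.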
An existing text adventure which already contains a vocabulary can be solved in a second step by a different algorithm. Such a solver works more like a classical AI program which is able to find the shortest path in a state space. The AI has to investigate possible alternative actions and select the best one with the help of a heuristic. The principle is the same as the inner working of a chess engine, which also works on top of an existing game engine. Solving a text adventure with a search algorithm like A* can be called a trivial task in computer science because endless examples from the past are available.
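A minimal sketch of such a solver follows, with an invented three-room state space; breadth-first search stands in for A* because all actions in the toy graph cost the same:

```python
# Sketch of a solver on top of an existing adventure grammar.
# Rooms and actions are invented for illustration; breadth-first
# search stands in for A* because the toy graph has uniform costs.
from collections import deque

# state = (location, has_key); actions are sentences from the grammar
def successors(state):
    loc, has_key = state
    if loc == "hall":
        yield "moveto cellar", ("cellar", has_key)
        if has_key:
            yield "moveto exit", ("exit", has_key)
    elif loc == "cellar":
        yield "grasp key", ("cellar", True)
        yield "moveto hall", ("hall", has_key)

def solve(start, goal_loc):
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if state[0] == goal_loc:
            return plan
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [action]))

print(solve(("hall", False), "exit"))
# ['moveto cellar', 'grasp key', 'moveto hall', 'moveto exit']
```

The solver never touches pixels or physics; it operates purely on the grammar sentences, which is the point of grounding the domain in a vocabulary first.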
2f2 Early adventure games in the 1980s
In contrast to a famous belief there is no need to reinvent the wheel and program advanced robotics with a textual interface from scratch, because in the 1980s lots of working examples were already created. The game "Castle adventure" was released in 1984 and combines a textual parser with a graphical representation. The player has to escape from a castle by collecting items and performing actions. He can control the character with the keyboard and has to enter commands, while the result is shown on the display. Apart from Castle adventure there are many games with a similar principle, and exactly such a user interface can be used to control AI related robots. The logic of the robot isn't defined by some sort of advanced algorithm but is connected to a game. The genre of MS DOS based adventure games is the ideal blueprint for programming robots. Thanks to the textual interface it is easy to automate the control.
From a cognitive science perspective an MS DOS adventure game is a man to machine dialogue with grounded actions. That means, after entering a command something happens on the screen because it is preprogrammed in the game. Such a dialogue allows complex problems to be solved, like escaping from the castle. The only problem with these games is that it is very complex to program them from scratch. The textual parser, the visual map and the behavior of the character have to be defined in source code. Even with more modern programming languages like Python and pygame such a programming project will become a larger one.
The overall principle behind "Castle adventure (1984)" is that it creates a simulated world in which the player has to solve puzzles. The game provides an environment in which actions, events and objects are available. The human player has to interact with the game engine to escape from the world.
Let us assume that the game setting is a different one. The same user interface is used to simulate a kitchen in which the player has to prepare a meal. Such a game can be connected to a real kitchen robot, and then the robot can be controlled with the keyboard and textual commands.
The main task for an AI isn't to solve an existing game but to design such a game. The AI has to invent a game like "Castle adventure (1984)" adapted to a certain domain like self driving cars or the kitchen domain. After such a game is available it is possible to interact with the robot through an interface, and it is even possible to program a behavior tree which solves the game automatically. Adventure games reduce the state space because the amount of possible actions is restricted to the preprogrammed commands, objects and locations in the game. Instead of figuring out hundreds of millions of possible actions for the robot gripper, the robot has only 10 different actions.
With the help of an adventure game it is possible to convert any domain into a toy problem. Toy problems can be solved much faster than real problems, and the key element in doing so is a mini language which provides a natural language interface to interact with a game. Such a mini language was preprogrammed in the Castle adventure (1984) game.
2f3 A review of the Castle adventure game
The project was created in 1984 by Kevin Bales. Some remakes written in Javascript and C++ are available on the Internet which are around 60 kB in size. The original program occupied around 50 kB on a floppy disc. Even if the game has only low quality graphics, it takes a long time to code such a game. And this is perhaps the biggest disadvantage which can hinder similar projects from being useful for robotics.
Before a robot can be controlled interactively with a two word command parser, somebody has to write a game similar to Castle adventure. Suppose a single line of code needs 40 bytes; then the overall game consists of around 1500 lines of code. If an average programmer can write 10 LoC per day, it will take 150 days to create such a game from scratch. For more complex robot control like biped walking the needed simulation game might be more advanced, so the lines of code will grow quickly.
Nevertheless, from a technical perspective it makes sense to see Castle adventure as a blueprint for creating a human to robot interface. The idea of the game was that a human player controls the game character with the help of arrow keys and simple commands like "drink water". Such a user interface can be scripted in a behavior tree, which results in autonomous behavior. Another interesting advantage is that it is pretty easy to formulate a walkthrough tutorial for an existing game. It can be written down on a single sheet of paper which actions the player has to take in which sequence to reach the overall goal of the game. The formulated actions like "go, pick, place" have a meaning because they reference the predefined actions in the game. This makes communication about the game much easier.
In contrast, controlling a robot without a game engine is nearly impossible. If the robot doesn't provide predefined actions and a map within a game, it is not possible to formulate a walkthrough tutorial because a grounded language is missing. It seems that the precondition for any sort of robot is to first program a game environment in which a human to robot dialogue takes place, and only in a second step create a solver or a behavior tree for this game.
2f4 Speaker hearer interaction in adventure games of the mid 1980s
In the period from 1983 to 1987 there were some computer games available which have to be called revolutionary. Former text adventures were improved with basic ASCII graphics, which resulted in a better user experience. What is seldom explained in the literature is the relationship between a human player and the underlying game engine. The technical term is client server architecture, but the more precise linguistic description is to call the interaction mode a speaker hearer system.
The human player acts in the role of a speaker. He presses arrow keys and, very important, selects commands from the menu like "go north". In contrast, the game engine receives the commands and generates a response on the screen, which is the role of a hearer. For solving ingame puzzles both parties have to cooperate. This allows a game to be played from start to finish.
Let us focus on the workflow in detail. Every adventure game of the 1980s communicates with the human user over a user interface which is a mixture of ASCII graphics and pure textual information. This user interface is the central element of the game and connects the speaker with the hearer over a protocol. The user interface is fixed over time and establishes a common ground during communication. That means a certain command has the same meaning all the time.
From a programming perspective the interaction is realized with the help of a game engine which takes the user input and produces a new state of the game. For example, after entering the command "go north" the position of the player gets modified.
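This input-to-state loop can be sketched as follows; the command table is an assumed example, not the engine of an actual 1980s game:

```python
# Minimal game engine loop: parse a command, update the game state.
# The command table is an assumed example, not an actual 1980s engine.

def make_engine():
    state = {"pos": [0, 0]}
    moves = {"north": (0, 1), "south": (0, -1),
             "east": (1, 0), "west": (-1, 0)}

    def step(command):
        verb, _, arg = command.partition(" ")
        if verb == "go" and arg in moves:
            dx, dy = moves[arg]
            state["pos"][0] += dx
            state["pos"][1] += dy
            return f"you walk {arg}"
        return "I don't understand"   # unknown commands keep the state fixed

    return state, step

state, step = make_engine()
print(step("go north"))   # you walk north
print(state["pos"])       # [0, 1]
print(step("fly moon"))   # I don't understand
```

The fixed command table is what makes the common ground stable: "go north" produces the same state change at every point of the game.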
It should be mentioned that adventure games from the mid 1980s are usually not referenced as examples of artificial intelligence, because in most cases they do not provide non player characters or other examples of autonomous intelligence. But they are a good example of man machine interaction working with natural language.
Perhaps it makes sense to explain the advantages of a speaker hearer dialogue. It allows complex problems to be divided into two layers. The first layer, which is realized by the game engine, is the simulation of a domain, for example drawing a dungeon on the screen. The second layer is about acting inside this simulation, which is usually under the control of a human user. Both layers have to work together.
2f5 User interfaces as a grounding mechanism
All existing adventure games work with a user interface which takes the input of a human and transmits it to the game engine. Possible examples are text based parsers, point&click word lists or a graphical panel with icons. It is very common that the user has different options available at the same time. He can decide whether to use a mouse, a joystick, a keyboard, or to enter textual commands. In real adventure games it is a typical interaction method to combine these modalities at the same time. For example, the avatar movement is controlled with arrow keys, and in addition there is a list of icons to activate high level commands like "grasp" or "speakto".
It is important to focus on the user interface because it is equal to a common understanding between the human player and the game engine in the backend. Grounding usually means that the communication is directed through the user interface. At first the programmer of a video game invents a user interface, and then all the man machine communication is fed into this user interface.
So the user interface affects how the communication takes place. If there is no icon available to pick up an object, the human user can't perform the action. For reasons of simplification it is possible to reduce all sorts of user interfaces to a word list. There are words for movements (up, down, left, right), words for actions (grasp, give, open) and words for events (key is missing, door is closed). These words can be translated into different graphics/sounds on the screen. For example the event "door is closed" can be shown in a textual format or played as a sound. Because words are part of a natural language it makes sense to treat a user interface as a grammar in natural language processing. The BNF grammar contains all the allowed language and describes the user interface.
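The reduction of a user interface to a word list can be sketched like this; the categories and the rendering rules are assumptions for illustration:

```python
# A user interface reduced to a word list. Each word can be rendered
# in different modalities; the mapping below is a hypothetical sketch.

WORDLIST = {
    "movement": ["up", "down", "left", "right"],
    "action":   ["grasp", "give", "open"],
    "event":    ["key is missing", "door is closed"],
}

def render(word, modality="text"):
    """Translate a word of the UI language into screen output."""
    if modality == "text":
        return f"[{word}]"
    if modality == "sound":
        return f"<play beep for '{word}'>"
    raise ValueError("unknown modality")

# the same event can reach the player as text or as a sound
print(render("door is closed", "text"))
print(render("door is closed", "sound"))
```

The word list is the grammar; the graphics and sounds are merely interchangeable surface forms of the same underlying language.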
2f6 Creating toy problems from scratch
In contrast to a famous assumption, the amount of toy problems isn't limited to the 15 puzzle and the tictactoe game; any domain can be converted into a toy problem. In the case of complex robotics domains like warehouse robots this transformation is realized with a game engine based on an adventure game similar to the Zelda game. The robot acts in a grid maze, gets controlled by arrow keys, and certain actions like grasp, ungrasp and rotate can be activated. From a technical perspective such a game consists of simple 2d graphics plus a word list with possible actions. These ingredients make sure that the state space becomes smaller.
After converting a robotics domain into an action adventure the discourse space is different. It is possible to talk about the actions of the robot in a highly structured manner. For example, a possible plan would be to move to waypoint A, pick up an object and then move to waypoint B. Such a plan can only be formulated if the game supports actions like "moveto" and "grasp". The game engine provides the common ground and simplifies the communication between man and machine. After the game is programmed it becomes much easier to give a command to the robot. Every possible command is predefined in the game engine. Possible alternative commands can be ignored. This allows defining what sort of software is needed for the robot.
Suppose there is an action adventure game available for a warehouse robot. In the easiest case the robot movements are teleoperated, similar to playing a normal video game. That means the human operator moves the robot by pressing arrow keys and selects commands from the menu. The commands are sent to the robot in the real world. In a more advanced case, some scripts can be written to automate a task. For example a pick&place script would move to the initial position, pick up the object and then move to the target position. Such a script can be executed within a for loop, which results in pseudo-autonomous behavior of the robot. The precondition for such an elegant example of automation is of course the existence of a warehouse action adventure. So we can say that the bottleneck is located in the man to machine communication. If such a communication is available, the robot will act in a toy world with a small state space and a low amount of possible commands.
Figure 1: Symbol grounding pipeline in an action adventure game (human user, keyboard/mouse, command parser, game engine, virtual character)
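The pick&place script in a for loop might look like the following sketch; the command names and warehouse locations are hypothetical:

```python
# Pseudo-autonomous warehouse behavior: a recorded pick&place script
# executed in a for loop. Command names and locations are assumptions.

def run(commands, state):
    for cmd in commands:
        if cmd.startswith("moveto "):
            state["pos"] = cmd.split(" ", 1)[1]
        elif cmd == "pickup":
            state["carrying"] = True
        elif cmd == "drop":
            state["carrying"] = False
            state["delivered"] += 1
    return state

pick_and_place = ["moveto shelf", "pickup", "moveto table", "drop"]

state = {"pos": "start", "carrying": False, "delivered": 0}
for _ in range(3):                  # repeat the script three times
    state = run(pick_and_place, state)
print(state)   # {'pos': 'table', 'carrying': False, 'delivered': 3}
```

The same command list serves both as a teleoperation log and as an autonomous program, which is the bottleneck-removing property claimed above.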
2f7 A closer look into action adventure games from the past
During the period from 1984 until 1990 there were some so called action adventure video games available. The game principle was to improve existing text adventures with 2d topdown graphics, in which the player has to find keys, search for swords and rescue a princess. The most famous example is Zelda I, but lots of other examples are available for the C64 and the IBM PC. The game genre is mostly forgotten today because the graphics were poor and these games were never described in the context of artificial intelligence. But there is a certain element in these games which makes them interesting from the perspective of cognitive science.
The overall idea behind an action adventure is that the human user interacts with the avatar with the help of a keyboard and a word list. In addition, the user gets visual feedback from the rendered maze. This allows complex ingame puzzles to be solved. The player has to visit certain places and fulfill tasks. What makes the game genre advanced is that it demonstrates very well what grounded communication is about. Grounded communication means that a human player interacts with a robot. The human plays the role of a speaker who gives commands, while the game engine is the hearer which executes the commands.
From a robotics perspective, the character is controlled with teleoperation. The human user presses buttons on the keyboard and this moves the character. But this interaction works on a high abstraction level. For a certain game it is possible to write down a walkthrough tutorial which references places in the game and the available actions from the menu. Such a walkthrough tutorial can only be formulated if the reader of the tutorial understands the meaning.
The game engine behind an action adventure provides a communication space. Such a high level space allows the state space to be reduced drastically. Instead of describing robot movements from a low level perspective it is possible to formulate high level statements like "Walk to city A and grasp object B". Even if the game engine isn't able to understand such a statement directly, it is possible to convert the sentence into allowed actions in the game.
Most action adventures were realized with a split screen. On top there is a 2d maze visible in a topdown perspective, in which the avatar can move in all directions. At the bottom of the screen there is an inventory and, very important, a list of possible actions like talk, use, open and pickup. Before the human player is able to interact with the world he has to memorize the keys and understand which sorts of actions are possible. This allows the player to solve longer puzzle problems in the game. The interesting situation is that a similar interface is useful for controlling service robots in a warehouse or in a kitchen. It makes sense to show the position of a robot in a visual map and provide a list of possible action verbs. Similar to an action adventure in the mid 1980s there is no artificial intelligence available in the classical sense, but a human operator takes control of the character.
Such an interactive system can be converted into an autonomous system easily, because an action adventure is a toy problem with a small state space. It is possible to program an A* solver which controls the character in a fully autonomous fashion. The reason why such a solver can be programmed is that the game engine of an action adventure has solved the grounding problem already; that means the domain was converted into a low state space problem. The result is a simple 2d maze game with a small amount of action words. It is possible to play this game with an AI.
2f8 Communication with a game engine
Game engines are usually seen as the technical part of a computer game. They are the place in which the simulation takes place, and it is possible to use a single game engine in many different games. Any programming language from C++ to Python can be used to create the game engine, which affects its performance.
In addition, there is an alternative approach to the subject which is related to the symbol grounding problem. Grounded communication is about a common understanding between a speaker and a hearer based on a language. This language is usually not defined very precisely, but the assumption is that a language works with a syntax and a grammar. What if the shared understanding is realized with a game engine?
A game engine provides basic commands which allow a human player to control a game. For example the player can decide to move to the left or pick up an object [Tellex2007]. These primitives simplify the man to machine communication. So a game engine is similar to an AAC communication board: a tool which allows two parties to speak with each other. In the case of a game engine the parties are a human user in front of a computer, and the computer which contains a graphics card and an operating system.
Let us take a closer look at how a game engine works in an action adventure like the Castle adventure game (1984). The game engine monitors keyboard events like pressing arrow keys or entering a textual command. This input is parsed, and if it fits the internal command list a certain action gets triggered, e.g. the player's position gets modified. The result is that a human user communicates with the virtual avatar on the screen. He gives a command to the sprite and then the sprite does something.
Such a kind of interaction was mostly ignored by AI researchers in the past because there is no puzzle available which can be solved; writing a game engine is a trivial programming challenge. On the other hand, a game engine might be important for understanding the symbol grounding problem. If the man to robot communication fails, the cause is mostly a lack of understanding. The human gives an order but the robot won't react. This missing communication has its source in a malfunctioning game engine.
2f9 Simplified instruction following example with video games
Most existing demonstrations of instruction following work with voice recognition, free form textual parsers and physical robots. Such demonstrations look impressive but they are difficult to reproduce. The easier attempt is to focus on a video game similar to the Castle adventure (1984) video game. That game genre is also called an action adventure, and the most interesting element is the GUI.
In the Castle adventure (1984) game the user controls the character with arrow keys plus a 2 word parser like "use sword". This combined interface allows a human operator to perform any possible action. Grasping objects is also simplified: by simply running over an object the item is added to the inventory. Programming a similar interface from scratch for a grid maze game can be seen as a good introduction to the symbol grounding problem. Let us describe the technical perspective.
Symbol grounding means interpreting commands. The human operator does something with the keyboard and the game engine responds with a low level action. For example, the human presses the left arrow key and the game engine modifies the current position on the screen. All possible actions are formalized in a grammar. So the basic task from a programming perspective is to implement the entire command vocabulary in the game engine.
After the commands from the human are interpreted by the game engine it is possible to start a man machine dialogue, in the sense that the human is able to play the game. He can enter commands and the character in the game will do something. In other words, programming the GUI for an action adventure is solving the symbol grounding problem.
Figure 2: One wheel balancing robot
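A minimal sketch of such a combined interface follows; the arrow keys and the word lists are hypothetical assumptions, not the data of the original game:

```python
# Combined interface in the style of Castle adventure: arrow keys move
# the avatar, a two word parser handles everything else. The vocabulary
# below is a hypothetical sketch of such a command grammar.

ARROWS = {"LEFT": (-1, 0), "RIGHT": (1, 0), "UP": (0, -1), "DOWN": (0, 1)}
VERBS = {"use", "read", "open"}
ITEMS = {"sword", "book", "door"}

def handle(event, game):
    if event in ARROWS:                       # low level: arrow key
        dx, dy = ARROWS[event]
        game["x"] += dx
        game["y"] += dy
        return "moved"
    verb, _, noun = event.partition(" ")      # high level: 2 word parser
    if verb in VERBS and noun in ITEMS:
        game["log"].append((verb, noun))
        return f"you {verb} the {noun}"
    return "I don't know how to do that"

game = {"x": 0, "y": 0, "log": []}
print(handle("RIGHT", game))        # moved
print(handle("use sword", game))    # you use the sword
print(handle("use laser", game))    # I don't know how to do that
```

Implementing this command vocabulary completely is exactly the grounding task: every symbol the player can utter is tied to a state change in the engine.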
3 Instruction following
The paradoxical situation in robotics is the absence of a task. Suppose there is a clearly defined problem; then it could be solved by an algorithm. Unfortunately real robotics projects are missing tasks, so it remains unclear what the robot should do next.
The example screenshot shows a balancing robot within a Box2d simulation. On first look the picture shows a successful robot, because the system is able to balance on one wheel. The user can even modify the angle, which allows the robot to move left and right. The simulation has a frame rate of 20 fps, so it runs quite smoothly. But the shown simulator is missing an important point. It is unclear what the purpose is; the robot can't accumulate a score. It can't be called a meaningful game with clear objectives, but it is an attempt to create such a game.
The situation is common for AI related problems. Suppose a different robot, a robot arm, were put into the simulation. Even if the arm can be controlled with accuracy, it remains unclear what the objective is. To overcome the missing game rule a certain sort of meta game has to be invented, which is a dialogue oriented instruction following game. Such a game consists of two parties: one speaker and one hearer. The speaker formulates a goal and the hearer has to execute it.
For example, in the shown one wheel robot simulation possible sentences of the speaker might be:
1. wait for 5 seconds
2. move to the left and rest at the wall
3. then move in fast speed to the right
4. move back to the middle and wait for 4 seconds
These instructions are formulated in a high level syntax which is equal to natural language. They provide goals for the simulation. It is possible to measure whether the goals are fulfilled or not. The shared principle in instruction following is to treat any domain as a dialogue game, which results in clearly defined objectives.
Let us take a closer look at the four goals. In contrast to the common principle in AI programming, these instructions are not solving a problem like a path planning algorithm; they are creating a problem. It is up to the human or the AI controller to fulfill the objectives. The task for the speaker is to invent a sensemaking goal and monitor whether it was fulfilled. In other words, the speaker in the dialogue provides the meaning to the hearer.
Another important characteristic is that the goals are formulated in natural language, not in terms used in the simulation. The goals of the speaker have nothing to do with the Box2d simulation nor with the ability of the robot to balance on one wheel; they are formulated from an objective perspective.
The next figure shows an improved simulation. Most of the instructions are detected as events. In the concrete example the robot is at the left wall because this was the formulated instruction of the speaker. It is possible to compare the goal with the current situation, which results in a positive score for the robot. The look&feel of the simulation has a stronger focus on the text box. The GUI widget with the detected events is no longer seen as additional information; it is the dominant element. The newly defined objective for the robot is to fulfill the instructions of the speaker. It is up to the robot to decide how to do so.
A sequence of natural language instructions allows an abstract goal trajectory to be formulated. The command sentences can describe longer behavior patterns which have to be fulfilled by the robot. This is equal to providing meaning. From a technical perspective, meaning is simply the error between goal and current situation, similar to a PID controller which also compares the goal state with the current state.
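This notion of meaning as an error term can be stated in a few lines; the goal value and the trajectory are invented numbers:

```python
# Meaning as an error signal: the goal state is compared with the
# current state, like the error term of a PID controller. The goal
# position and the trajectory are hypothetical values.

def error(goal, current):
    """Meaning, reduced to a number: distance between goal and state."""
    return abs(goal - current)

goal_x = 0.0            # instruction: "rest at the left wall"
trajectory = [8.0, 5.0, 2.0, 0.3]

for x in trajectory:
    print(f"x={x:4.1f}  error={error(goal_x, x):4.1f}")
# the error shrinks as the robot fulfills the instruction
```

A shrinking error over the trajectory is the quantitative counterpart of "the hearer is fulfilling the speaker's instruction".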
3a Instruction following for goal formulation
A seldom described problem in robotics is the absence of a goal. Without such goals it is impossible to determine the cost and the reward which are needed in reinforcement learning. This prevents robotics problems from being treated as RL problems.
Instead of asking which algorithm is needed to control a robot, the more elaborate approach is to create an instruction following scenario, which can be seen as an articulated sort of teleoperation. A human operator (the speaker) provides the commands and the robot has to fulfill these goals. During the pipeline different goals can be formulated like:
1. go to left region
2. grasp an object
3. bring the object to the table
4. ungrasp the object
5. go to middle
6. jump into the air and so on.
The formulated textual goals are the input for the robot control system. The behavior of the robot can either match the requirements or not, and in both cases a score can be determined. This score is highly important for a machine learning task formulation.
In the literature the overall subject is described as the instruction following task. The textual goals are provided as a dataset and the robot has to convert the commands into actions. The interesting situation is that instruction following doesn't need a certain library or algorithm; it is about creating a game in which a robot is doing something.
Instruction following closes the gap between a normal simulation like Box2d and the need in reinforcement learning for a reward function. The formulated goals in the dataset are equal to scoring the robot.
Let me give an example of a human formulated textual goal. The operator can say to the robot "I'd like you to drive with a speed of 30 mph straight until the traffic light". The formulated goal contains a measurement rule. It is possible to compare the robot's behavior with the announced goal. For example, if the robot drives at 26 mph and stops at the traffic light as requested, this would generate a high score, which means the robot has done most things correctly.
What makes textual commands interesting is that they are not formulated in AI related terms, they have nothing to do with computer programming, and they don't belong to a classical simulation. A textual goal is formulated on a very high abstraction level. It is some sort of textual interaction with a machine.
From an outside perspective, instruction following has much in common with teleoperation. In both cases there is no classical AI system available which takes decisions; the human operator is in charge of controlling the robot. In the case of joystick based interaction, the human communicates with the robot over the joystick, while in the case of instruction following the communication medium is natural language. In both cases the robot is doing what the human wants.
In a classical robotics competition like Micromouse such an interaction would be perceived as cheating. If the human operator controls the robot with the joystick, it would be the opposite of the desired behavior. A joystick in the loop means that there is no software which controls the robot, so it can't be called a robot at all. The same situation exists for a speech controlled robot. If the human has a panel with actions like forward, stop, moveleft, backward, the robot can't be called autonomous anymore.
Nevertheless such an interaction makes sense. If instruction following is perceived as wrong robotics, then there is a need to create a sort of robot competition which puts a stronger focus on human machine interaction.
An interesting side problem of instruction following is activity recognition. The idea here is that the robot can't execute actions by itself; it is only able to monitor certain events like a collision, a threshold speed or a turn left action. A detected event is shown on the command line, which at first glance doesn't look very interesting, because detecting a collision is an easy task and in the classical understanding it has nothing to do with artificial intelligence. From a technical side, such an event gets recognized by a sensor, and a single line of Python allows such a situation to be perceived.
But an event detection system is at the same time a powerful element of an AI system. It can be extended into an instruction following robot. The idea is that possible events are equal to possible instructions, and it is only a detail problem how to generate the action signals for the robot. Let me give an example.
Suppose a robot car can detect whether its current speed is around 30 mph. In such a situation a light informs the human operator. The newly formulated task for an instruction following system would be that the human operator formulates the goal and the robot car acts accordingly. The robot has to use the information from the event detection system, determine a cost value, and this cost information is used to search for the optimal action. The model predictive control solver gets the constraint: "control the car so that the costs are reduced". The interesting point is that any available MPC solver can do this very well. The only precondition is that the term "costs" is defined.
3b Cost function as instruction following
Most newcomers to Artificial Intelligence would locate the AI of a robot in the solver subroutine. A certain algorithm analyzes the game tree, searches for the optimal action and executes it. With this understanding, an AI is equal to a depth-first search algorithm, the minimax approach, or in the case of robotics the MPC solver. Unfortunately, in a real robot the MPC solver needs cost information as input. Otherwise the algorithm can't minimize the cost. This requirement is often ignored. And even more complicated, most robotics domains have no cost function at all, which makes it hard to provide the needed information.
Let me give a practical example. Suppose there is a robot in a maze which can move around and pick up objects. Even if the robot is able to do something, it is undefined what the current costs are. Roughly speaking, the AI community doesn't know the answer, and without the cost function the MPC solver can't find the optimal action. And this means that the robot can't be controlled by Artificial Intelligence at all.
In the past the problem was either ignored or it was explained that robot control works completely differently from chess engines, which work with an evaluation function. The assumption was that robot control can be realized without a cost value but with behavior trees or other bottom-up approaches. This understanding is outdated: robotics works the same way as a chess AI, and the cost function can be provided with an instruction following dataset.
The idea is that the human gives a command like "pick up object", and if the robot does so, the robot gets a reward of +1. If it fulfills the goal only in part, the reward is +0.5, and if it ignores the command entirely, the reward is 0, which corresponds to the maximum cost. In other words, the costs are provided by natural language commands from a human operator. The robot has to do what the operator demands; then the reward gets maximized.
It is surprisingly easy to implement such a paradigm in software. At first a table is created with possible commands which are numbered from 0 to 7. Then the human operator is asked to select one of the predefined goals. He chooses 0 = "pickup object".

A dedicated event recognition module in the robot determines which of the actions is executed in reality. The value is compared with the selected goal, and this allows to determine the reward for the robot. No matter which goal was selected by the human, the robot knows the current reward value. And this value can be utilized by an MPC solver to maximize the accumulated reward.

With this pipeline it is possible to control longer action sequences in a semi-automatic fashion. The human operator can provide the goal sequence [0,3,7,5,5,4,0,2,2,7] and the robot has to execute it. For each single command the reward gets calculated, and this allows to plan the low level actions of the robot.

id  command
0   pickup object
1   stop
2   forward
3   waypoint A
4   waypoint B
5   left
6   right
7   release object

Table 2: Commands for a maze robot
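The reward pipeline can be sketched as follows. The event recognition module is mocked by a stub that simply reports the executed action id; in a real robot this id would be derived from sensor data.

```python
# Sketch of the reward pipeline: the commanded goal id is compared with
# the action id reported by the (mocked) event recognition module.

commands = {0: "pickup object", 1: "stop", 2: "forward", 3: "waypoint A",
            4: "waypoint B", 5: "left", 6: "right", 7: "release object"}

def recognize_event(executed_id):
    # Placeholder for the event recognition module.
    return executed_id

def reward(goal_id, executed_id):
    # Full reward if the executed action matches the commanded goal.
    return 1.0 if recognize_event(executed_id) == goal_id else 0.0

goal_sequence = [0, 3, 7, 5, 5, 4, 0, 2, 2, 7]
executed      = [0, 3, 7, 5, 1, 4, 0, 2, 2, 7]   # one command was ignored
total = sum(reward(g, e) for g, e in zip(goal_sequence, executed))
print(total)  # prints 9.0
```

The accumulated reward is exactly the quantity an MPC solver would maximize over the sequence.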
3c From line following to instruction following
The advantage of line following robots is that the task is easy to explain to newcomers. The robot has to be programmed so that it reaches the end of the line. If the robot moves outside the line, the situation has become worse. This allows to decide which actions are good and which are bad.
An improvement over a line following robot is an instruction following game. The natural language command from a human operator can be interpreted as a virtual black line on the ground. The operator might say "forward, right, forward, stop". The action sequence defines a high level trajectory. In contrast to a physical black line on the ground, a natural language command can reference many possible goals. Let me give an example.
The black line on the ground can have different shapes: straight, left, right, crossing. Each possible segment generates a command for the robot. If the robot recognizes a left line segment, the robot has to follow the line by steering to the left. The mentioned instruction following task works with a similar principle, except that any possible word can be utilized. Possible commands are simple navigation instructions like "forward, right", but it can also be a vocabulary from a different domain. A grasping robot can be commanded with commands like "open_gripper, close_gripper".
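The analogy can be made concrete with a lookup table: each recognized segment (or word) maps to a steering command, so line following and instruction following share one dispatch mechanism. The vocabulary below is illustrative.

```python
# Hypothetical mapping from recognized line segments to steering commands.
segment_to_command = {
    "straight": "forward",
    "left": "steer_left",
    "right": "steer_right",
    "crossing": "stop",   # an ambiguous segment needs operator input
}

def follow(segment):
    # Unknown segments default to a safe stop.
    return segment_to_command.get(segment, "stop")

print(follow("left"))  # prints steer_left
```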
Similar to line following, an instruction following problem can be divided into two subproblems. The first one is how to control the robot and the second is how to draw the line on the ground. It is possible to use a simple map in which the line forms a circle. More advanced setups use a more
year  project                                    author
1967  LOGO programming language interpreter      Papert
1972  SHRDLU interactive animation               Winograd
1980  Put-That-There voice gesture interface     Bolt
1981  Karel the robot, domain specific language  Pattis
1993  AnimNL computer animation                  Badler
2006  MARCO route instruction following          MacMahon
2010  M.I.T. forklift                            Tellex

Table 3: Timeline of instruction following projects
complex line trajectory which consists of crossings, missing line segments, obstacles on the path and so on. These two subproblems make it possible to adjust the difficulty. For example, an entry level robot parkour consists of a short path with easy to follow line segments. If enough robots are able to solve this problem, the difficulty can be increased.
The same principle is available for the instruction following task. In the easiest case, the set of instructions is small. The operator is only allowed to select one of four possible words from a predefined vocabulary. In a more advanced setup the operator can formulate 2-tuple commands like "open gripper, moveto waypointA". Executing these instructions is harder because a dedicated command parser is needed.
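A parser for such 2-tuple commands can be sketched in a few lines. The verb vocabulary and the splitting rules are assumptions for illustration, not from an existing system.

```python
# Minimal parser for comma-separated 2-tuple commands like
# "open gripper, moveto waypointA".

VERBS = {"open", "close", "moveto", "forward", "stop"}

def parse(command_line):
    # Split the line into single instructions, then each instruction
    # into a verb and an optional argument.
    parsed = []
    for part in command_line.split(","):
        tokens = part.split()
        verb, args = tokens[0], tokens[1:]
        if verb not in VERBS:
            raise ValueError(f"unknown verb: {verb}")
        parsed.append((verb, args[0] if args else None))
    return parsed

print(parse("open gripper, moveto waypointA"))
# prints [('open', 'gripper'), ('moveto', 'waypointA')]
```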
3c1 History of instruction following
The history of instruction following projects is short. After the publication of the SHRDLU project in the 1970s there was a long period in which the subject wasn't researched very much. In 2006 some effort was made to build a dedicated route instruction following system. Instead of controlling an entire robot with textual commands, the reduced goal was to navigate only in a maze.
The obvious reason why the subject wasn't analyzed with more effort is that it doesn't sound very interesting at first glance. According to the paper of MacMahon, the idea is that a speaker sends commands to a hearer (= robot), while the robot has to interpret the command and move into a certain direction. In a concrete example, the speaker may say "move north" and the robot does so. Or the speaker might say "move north for 3 cells".
Such an interaction sounds trivial because text adventures and especially the Karel the Robot programming language provide a more advanced command parser. Nevertheless there are many arguments why route instruction following might be an interesting subject. The key element is to split the problem solving task into a director (= speaker) and a follower (= hearer). A second element is that the commands are formulated in natural language. Such a combination is a rare case in Artificial Intelligence, and apart from the instruction following problem no further projects are available.
Perhaps it makes sense to explain why exactly route instruction following sounds boring at first glance. The cause is that it is not an algorithm and not an AI related library, but it has more in common with a robot challenge similar to Micromouse. The route instruction problem is first of all a problem. The assumption is that there is a maze, a robot and a speaker. A further assumption is that the speaker and the hearer communicate back and forth, and this allows to reach a position in the maze. It is up to a programmer how to implement such a protocol and which commands are needed in detail.
The route instruction problem can be seen as an example of the symbol grounding problem. It provides a concrete puzzle which demonstrates what symbol grounding is about. Grounding is a synonym for "interpreting a command" [Matuszek2012, page 1]. It is the action which is taken after the robot receives the command "move north for 3 cells".
3c2 NP hard problems vs instruction following
The symbol grounding problem can be converted into the instruction following task, which is basically a speaker-hearer language game. The promise is that such interactive games can make robots intelligent. To understand how exactly instruction following works, we first have to describe the situation around 1992, which is seen as the end of the last AI winter. The understanding during this period was that robots can't be realized at all because robot control creates an NP-hard problem, which is equal to a large state space. Even future optical supercomputers are too slow to search the large state space for the optimal action. On the other hand, algorithms based on instruction following promise to solve exactly such a problem category, so it makes sense to investigate the approach in detail.
The understanding around 1992 was that a computer consists of hardware based on CPU and RAM, and it also contains software, which is a programming language, an operating system and libraries. From a theoretical standpoint a computer system works with an algorithm which solves a certain problem. The approach during the last AI winter was to use this technology for solving robotic problems, but the attempt failed. No programming language is powerful enough for controlling robots, and inventing new languages or new algorithms failed too. Some improved AI related languages like Prolog are available, and modern graph search algorithms like Rapidly-exploring Random Tree (RRT) were designed, but they failed too. The unfixed problem is that the state space of a robot is too large.
The idea behind instruction following can be summarized as an improved heuristic. It is not based on a classical algorithm definition; instruction following is located outside of computer science altogether. This might explain why the subject was ignored over decades. It is basically a puzzle similar to Micromouse or the 15 puzzle. After the puzzle has been described, it is possible to solve it with a concrete piece of software.
The starting point of the instruction following paradigm is the trivial recognition that any robot can be teleoperated. All that is needed is a joystick and a human operator who moves the joystick. This allows the robot to grasp objects and navigate in complex environments. To automate a teleoperated robot, the interaction between man and machine has to be recorded. Instead of capturing the joystick movements, the idea is to record high level commands formulated in natural language. For doing so a certain user interface is needed, which has much in common with a text adventure from the mid 1980s. The human operator controls a character with an interface like "go north", "take lamp", and the game engine executes the commands. Such an interaction is known as speaker-hearer interaction, which forms the basis of the instruction following game.
Classical computer science until 1992 ignored man-machine interaction. The idea was that a robot needs to be programmed in software, and computer science is the only discipline which understands how to create such software, including the algorithms. In contrast, the instruction following paradigm assumes that classical computer science is obsolete and a broader perspective is needed to understand robot programming.
The core element in the instruction following game is a corpus with the recorded speech between a speaker and a hearer. The speaker will say sentences like "go north", "stop", and the hearer interprets these commands. The task of the interpreter is the core element in symbol grounding. The term grounding means "interpreting". A given statement like "go north" is translated into low level actions, which means the hearer activates his muscles to change his position in space.
The newly created corpus defines what the problem is; it contains a certain vocabulary which includes commands, questions, objects and locations, and it shows a concrete interaction between both parties. The role of computer science is to emulate the interaction in software. The human hearer has to be replaced with a computer program which accepts a high level command and generates the low level output. The resulting robot can be controlled with natural language.
The most surprising point is that the CPU demand of such an interpreter is low. The software doesn't have to traverse the state space and search for the optimal action; the correct command is given by a human speaker. The only task for the robot is to execute such a command. So it is not a classical artificial intelligence, but it has more in common with a textual interface known from the Zork I video game. Zork I was released in 1980 for the Apple II, so it has been available in mainstream computing for over 40 years.
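The grounding step described above can be sketched as a tiny interpreter that translates a high level command into low level position changes. The command grammar and the grid representation are illustrative assumptions.

```python
# Sketch of grounding: a speaker command like "move north for 3 cells"
# is translated into position changes of the hearer on a grid.

position = [5, 5]  # (x, y) of the hearer

def interpret(command):
    # "Grounding" a command means translating it into state changes.
    words = command.split()
    steps = int(words[-2]) if "for" in words else 1
    direction = {"north": (0, -1), "south": (0, 1),
                 "west": (-1, 0), "east": (1, 0)}[words[1]]
    for _ in range(steps):
        position[0] += direction[0]
        position[1] += direction[1]

interpret("move north for 3 cells")
print(position)  # prints [5, 2]
```

No search over a state space is needed; the interpreter only executes what the speaker already decided.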
3d Instruction following for a robot swarm
The ability of a robot to follow natural language instructions seems a powerful example of man-machine interaction, but the principle can be leveraged further by introducing a command based swarm. The picture shows a minimalist prototype programmed in the Python language which contains a 10x10 grid, 10 robots on top and some objects in the grid. The task for the robots is to collect the objects, but they are not allowed to decide the movement by themselves; instead it is an instruction following task.
Possible commands for the human operator are:
1. movedown
2. moveup
3. movetoobject
4. movetotop
It is important to know that the instructions reference the entire robot swarm. The command "movedown" will result in an action of 10 robots at the same time. Similar to instruction following for a single robot, there is no AI algorithm needed in the classical sense; the robots simply execute the commands. It is up to the human operator to provide meaningful actions, and it is up to the robots to parse and execute the statements.

Figure 3: Swarm simulator in Python
The idea is not completely new. The Karel the Robot environment works with a similar principle, except that Karel is a single robot, while the example in the screenshot contains a robot swarm. Perhaps it makes sense to explain the inner working of the action parser in detail. The first command, "movedown", was realized with a Python for loop which goes through the robots and changes each position one cell down. That means the logic is encoded as a Python function which occupies around 10 lines of code including the boundary check. The human operator enters the command, and the described function gets executed.
The power of the proposed system isn't provided in the source code, which can be called trivial; it has to do with the set of possible commands. By providing four and more natural language commands it is possible for the human operator to control the swarm. This allows the operator to execute more complex tasks like collecting all the objects.
For a better user experience, it is possible to map the commands to keypresses from 1 to 9, similar to a piano keyboard. This allows the human operator to control the swarm similar to playing an organ. The system is still teleoperated, that means a human gives the commands, but these commands are formulated on a high abstraction level which allows the human to control multiple robots.
Suppose the idea is that the robot swarm doesn't need human interaction but should work autonomously. Even this objective can be realized. All that is needed is a small macro-like subroutine which generates the commands by itself. This allows to start the movements with a click on run, and then the swarm collects the objects and returns to the top position. The reason why such a complex macro can be realized is the ability of the robots to execute natural language instructions. So we can say that instruction following is a practical example of the symbol grounding problem.
Figure 4: Robot swarm with possible instructions
3e The logic is hidden in the GUI menu
An instruction following robot works differently from a classical robot because of the absence of any algorithm. Even if instruction following uses some algorithms written in a programming language, it doesn't make sense to explain their inner working because they are trivial, too specific, and can be replaced by better alternatives. This makes it hard to explain how exactly such a robot works.
The answer is that not the robot is interesting but the challenge it has to solve. The challenge is the true source of logic. Let us take a look at the example screenshot which shows a modified swarm robot simulator. The new GUI element is the menu widget on the bottom right, which contains possible commands for the robot. The inner logic of the robots can be traced back to this widget.
The user can press a button, e.g. 0, and then the robots do what is written in the menu. The robots will either locate the red object, clear the object, or change their position. It is up to the human user which button he presses.
A menu can be seen as a game rule which describes which actions are available. The user can decide for one of these actions. And for this reason it doesn't make sense to explain the algorithm inside the robot, because there is no such algorithm. What is available instead is the screenshot of a grid game and the menu with possible options. Such a game can be implemented in any programming language and with any source code. The only fixed parameter is that the game rules stay the same. That means there is a 10x10 grid and exactly 6 different commands to control the robot.
But let us go a step backward to understand why it is hard to explain what instruction following is about. In classical AI there is a bottom-up principle. The idea is that a robot is equipped with an internal algorithm, the AI, and this Artificial Intelligence decides what the robot does next. This bottom-up paradigm can't be applied to this case because the robots are teleoperated and are controlled with natural language. The better attempt to understand the robots is to focus on the game in which they operate, including the controlled vocabulary they have to understand.
From an outside perspective, the robots have no AI; the decisions are made by the human operator. The GUI is shown on the display, and before a robot moves forward, the human has to press a button. So it is a normal video game like Pong, not an artificial intelligence. The new element is that the game was designed as an instruction following game, which makes it easy to automate the robots in this game.
3f The auto mode for instruction following
At first glance an instruction following robot looks a bit boring, because it is teleoperated and can't do anything on its own. But with a small modification in the software it is possible to enable an auto mode.
The auto mode built on top of an instruction following robot simply means executing the commands in a script. In the screenshot the button 6 activates this behavior. In the background a fixed sequence of actions is repeated for an unlimited time span. The swarm will act with this pattern and the actions are meaningful. At first the swarm locates the objects in the map, then it moves to the objects, then the objects get cleared and the swarm moves back to the base station to charge the battery.
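The auto mode can be sketched as a fixed command sequence executed in a loop. The command names echo the menu entries described in the text; the executor is a stub for illustration.

```python
# Auto mode: a fixed script of menu commands repeated in a loop.
script = ["findobject", "movetoobject", "clearobject", "movetotop"]

log = []
def execute(command):
    # In the real simulator each command triggers a Python method;
    # here the call is only recorded.
    log.append(command)

def auto_mode(rounds):
    for _ in range(rounds):
        for command in script:
            execute(command)

auto_mode(2)
print(len(log))  # prints 8
```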
To make the process visible, a score is shown on the screen which simply counts the number of cleared objects. Such a simulation can run forever, and the swarm will collect a very large number of objects without human intervention.
It seems that with a set of predefined actions it is much easier to automate a task, because the main program which controls the swarm uses the existing methods. Every method is listed in the menu because it is a natural language instruction to the robot. The harder task is to describe the pipeline from an AI perspective. It is a combination of teleoperation, object oriented programming, instruction following and the symbol grounding problem.
An open problem is to locate the source of intelligence. In the example there is a robot swarm which is doing a complex task, namely collecting objects in a grid. The swarm is doing this task autonomously. At the same time, there is no dedicated AI routine or algorithm written in software. The overall source code for the project has around 500 lines of code. There are classes for drawing the GUI, for sending commands to the robots and for processing keyboard inputs. So the question is: why are the robots doing a task without a dedicated program?
In the previous section a first attempt was made to address this problem. What we can say is that the project is not a classical robot; it has more in common with a robot challenge. It works similar to a Micromouse simulator or a chess simulator which asks a human to control a robot in a maze. From the development perspective, the first prototype was created with manual control. The user was able to control each robot in the grid with the up/down keys. So it was a normal video game which was boring to play. The user presses a button and the robot on the screen moves. The AI component was added as a textual input parser. The set of possible commands was formatted as a list. This list is the controlled vocabulary for the interaction between the game and the human user. So we can say that in the game there is no intelligence, but the game assumes that the human operator is intelligent. The operator has to decide which action comes next, and the operator is in charge of writing a script for the auto mode.
Let me give an example. From a technical perspective a script without any meaning can be executed in the simulator. The script simply moves the robot swarm up and down but doesn't collect any object. This script won't produce an error message; according to the parser it is valid. So it depends on the human operator which sort of script he enters, very similar to writing a script for Karel the Robot. In this game it is possible to move the robot in more directions, but there are also many possible scripts which won't result in anything useful.
Perhaps it makes sense to label the project not with "Artificial Intelligence"; it has more in common with a robot environment which works with natural language instructions. Not the robot gets embedded into the game; it is the human operator who uses the mini scripting language to program the robot swarm.

script without meaning    sense making script
0 alldown                 2 findobject
1 allup                   3 movetoobject (10 times)
0 alldown                 4 clearobject
1 allup                   1 allup (10 times)
0 alldown
1 allup

Table 4: Script for auto mode

3f1 Programming an AI with a communication board
The screenshot shows the same game as in the previous section. The blue robot on top has to collect food, and it can only move up and down but not in other directions. After collecting an item,
Figure 5: Food collecting game including communication board
self.comboard = {
    0: "up", 1: "down", 2: "findfood", 3: "movetogoal",
    4: "eatfood", 5: "findbase", 6: "auto",
}

Figure 6: Communication board as Python dict
the robot has to return to the base, which is at the top of the maze. The game mechanics were chosen because they can be realized with a few lines of code in the programming language of choice. The much harder element to explain is the AI which should control the robot.
Let us explain the situation from the end. The game including the AI is already working, and the robot gets controlled with a scripting AI. The commands shown in the menu are used inside a macro, and this moves the robot autonomously. Unfortunately, a menu with possible commands isn't available for a game by default; it has to be invented first. Suppose there are no commands available which can be executed by the robot; then it is impossible to use a scripting AI. The first step is to explain how to create possible commands for the robot.
In the example case with the food eating robot, the vocabulary was stored in a Python dictionary. Each entry has a number and a text. This reduces the complexity. Instead of programming a robot which can parse all sorts of words and sentences, the vocabulary is numbered from 0 to 6.
The difference between algorithm driven Artificial Intelligence and communication board oriented AI is that in the second case the source of intelligence is the human user. The vocabulary provides only a list of commands, and the operator has to decide which of them should be activated. Each command is mapped to a key on the keyboard.
The most obvious commands are "up" and "down", which were implemented first. They allow to move the robot in its column. The other commands are a bit more complicated to implement. They have been realized in Python source code with for loops and if-then statements, like a normal Python method.
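A command like "findfood" can be sketched with exactly such loops and if-then statements, together with the num-key dispatch into the communication board. The grid representation and handler wiring are assumptions for illustration.

```python
# "findfood" realized with a for loop and an if-then statement, dispatched
# through a comboard entry selected by a num key.

grid = [["", "", ""],
        ["", "food", ""],
        ["", "", ""]]

def findfood(grid):
    # Scan the grid row by row and return the position of the first food.
    for y, row in enumerate(grid):
        for x, cell in enumerate(row):
            if cell == "food":
                return (x, y)
    return None

comboard = {0: "up", 1: "down", 2: "findfood"}
handlers = {"findfood": lambda: findfood(grid)}

key = 2  # the operator presses num key 2
print(handlers[comboard[key]]())  # prints (1, 1)
```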
Not the written source code is important; it is the communication protocol, which is the communication board. The board including the words specifies a certain domain. It is used for man-to-machine communication. Such a board can be operated in a manual mode, by pressing a num key on the keyboard, or it can be referenced in a script so that the robot runs autonomously. Let me explain the situation from a different perspective.
To estimate what sort of tasks can be executed by the robot, it helps to read through the vocabulary list. There are commands for findfood, movetogoal, eatfood and findbase. Outside of this scope the robot can't do anything. It can't rotate because there is no such command, and it can't avoid a collision with other robots. If these features are needed, the relevant words have to be added to the communication board first, and then the Python code has to be written to execute the statements.
In general, the described workflow of creating an AI bot is organized with natural language. The English language is a tool to capture domain specific knowledge. All possible actions and events certainly have a word. To create a robot, the robot needs to understand a certain vocabulary.
3g Theoretical reason for instruction following
Similar to the symbol grounding problem, which connects language with sensory perception, the instruction following task is located within linguistics. Natural language is utilized to describe robotic actions. As with language in general, a word references a meaning outside of the word itself. For example, the word "movedown" has no internal meaning because it is a string of ASCII characters. The meaning is provided only for someone who understands the word. It is also provided in a dictionary which explains what the word is about.
This linguistic perspective might help to understand why robots which are able to parse commands differ from the classical understanding of Artificial Intelligence. In the past the assumption was that inside an intelligent robot there is an algorithm which makes the robot smart. In contrast, from a linguistic perspective the origin of meaning is located outside of the individual. The sources are dictionaries or the Gutenberg galaxy, which explain one word with lots of other words. The consequence is that the meaning which is needed for intelligence is never located inside a computer program; it is located outside of the software.
It is true that the described prototype for a robot swarm in sections 3d-3f has no built-in Artificial Intelligence. On the other side, the vocabulary used for controlling the swarm has a well defined meaning because the normal English vocabulary was used, which is spoken worldwide. That means there is no secret why the robot swarm is doing something useful: the words for interacting with the robot have a meaning.
The interesting point about the English language is that any possible action, perception or task can be described in a single word or in a short sentence. No matter if the domain is a grasping robot, a one-wheel robot or a UAV, all the possible interactions can be encoded in words. Exactly this feature makes English a general language, more powerful than a computer language like C or Java. It makes sense to utilize the vocabulary for human-to-machine interaction. The only tool needed to communicate with a robot is a short list of maybe 10 words. This word list is used to send messages back and forth between a robot and a human operator. This communication process allows to solve any task in the real world.
In section 3d a concrete example of an instruction following robot was presented, including a screenshot of the GUI. The open question is how to generalize this example to other domains. The elements of the simulation were a robot swarm plus a menu with possible instructions. Each instruction was equal to a single word, or sometimes it consisted of two words. This allows to give an abstract definition of the instruction following task:
The robot has to understand a limited vocabulary of commands.
From a programming perspective this paradigm matches closely to object oriented programming, which assumes that there is a robot object which has methods. The methods are executed from the outside and are equal to the commands from the menu.
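This object oriented reading can be sketched directly: the robot is a class, each menu command is a method, and a small dispatcher grounds the word into a method call. The class and method names are illustrative.

```python
# The robot as a class whose methods are the menu commands; grounding a
# word means invoking the method with the same name.

class Robot:
    def __init__(self):
        self.y = 0

    def movedown(self):
        self.y += 1

    def moveup(self):
        self.y = max(0, self.y - 1)

def ground(robot, command):
    # Dispatch: the natural language word selects the method.
    getattr(robot, command)()

r = Robot()
for word in ["movedown", "movedown", "moveup"]:
    ground(r, word)
print(r.y)  # prints 1
```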
3g1 Instruction Following as communication paradigm
Sending messages between two parties is seldom discussed in the context of robotics. Instruction following is the exception: it works with a speaker and a hearer who use a shared vocabulary. The reason why message passing is usually located outside of Artificial Intelligence is that a single robot doesn't create such a need.
The microcontroller in a robot is programmed in a certain programming language, for example Java. The source code contains data structures plus algorithms, but everything is encapsulated into a single program. The robot is a single unit which has no need to answer messages from the outside. The only exception to this understanding is a teleoperated robot, but teleoperation is the opposite goal of an autonomous robot, so it can be ignored.
But it seems that Artificial Intelligence can't be described this way. Encapsulating all the domain knowledge into a single instance, the AI agent, produces a high complexity. The better idea is to divide the problem into two parts: a speaker who is familiar with natural language and symbolic reasoning on the one hand, and a hearer who understands language and executes commands on the other. If such an AI model gets converted into a programming language, it is equal to defining a speaker class and a hearer class. Both classes are stored in different files and can be created by different human programmers. The classes communicate back and forth, and this produces Artificial Intelligence.
In terms of high level and low level, the speaker class is of course the high level instance which
describes a situation in linguistic terms, while the hearer class is responsible for low level actions and
low level sensor perception. The idea is that each class can delegate tasks which don't fit into its own
responsibility to the other class.
Let me give an example for a household robot. From the perspective of the speaker, the domain
consists of objects (table, drawer, stove), actions (move, open, grasp) and ingredients (rice,
potato, apple). A typical simulation game for the speaker class is a text adventure which ignores all
the details and is only about words. On the other hand, the hearer class is a 3d videogame which has
sensors, motor patterns and a physics engine. Only if both classes communicate with each other can
the needed robot intelligence be created.
There is no need to introduce more than these two instances. A speaker and a hearer are more than
capable of solving robot tasks. On the other hand, the absence of such a communication paradigm will
result in a dysfunctional robot. Such a robot remains ungrounded and has no internal communication
system, only algorithms and computer code.
3g2 Programming an instruction following robot from scratch
Classical software engineering is built around libraries. A typical question of a programmer is which
sort of library is needed to solve a task. Then the concrete functions in this library are used
to implement a certain design pattern. Additional tools like a version control system and iterated
prototyping help to develop any sort of software.
Unfortunately this paradigm can't be adapted to AI applications, because there is no such thing
as an AI library. This makes it hard to describe what the workflow for creating autonomous
robots should look like. To overcome the obstacle, there is a need for a paradigm shift. AI related projects are seldom
located within classical computer science but have to be seen from a different perspective.
The symbol grounding problem, including instruction following robots, is not the answer to a
problem; it is the problem itself. What is needed is never the source code which simulates a
game; the task is to invent a game. The term instruction following refers to a certain sort
of games which have much in common with point&click adventures from the 1980s. There is a visual
scene representation plus a textual menu. The menu allows to activate actions in the game.
It's a good starting point to implement any sort of game as a Maniac Mansion clone. The
textual menu is needed because the focus is on man machine communication. In contrast
to normal video games, which have a strong focus on haptic input devices like a mouse, the focus in
instruction following is on textual interfaces.
The reason for this preference is that textual commands can be recorded while mouse movements
cannot. For example, a command sequence like “moveto table, grasp apple, moveto drawer, open drawer”
can be executed in a video game and at the same time the sequence can be dumped into a text file. The
ability to store commands in a text file allows to construct high level macros from these commands
which can run autonomously on the robot.
In other terms, instruction following means to add a scripting module to a video game. The
scripting module provides a language for requesting sensor data and executing actions. A typical script
written in such a language would be “if no_obstacle then move(forward)”. The precondition for formulating
such a high level statement is that the robot can parse commands like “no_obstacle” and “move()”.
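A minimal Python sketch of such a scripting module. The names no_obstacle() and move() follow the example above; the grid positions and the obstacle location are invented assumptions for illustration:

```python
# Sketch of a scripting module: the script "if no_obstacle then move(forward)"
# becomes an ordinary conditional over the robot's parsed commands.

class RobotScripting:
    def __init__(self):
        self.position = 0
        self.obstacle_at = 5  # assumed obstacle location on a 1d track

    def no_obstacle(self):
        # sensor request: is the next cell free?
        return self.position + 1 != self.obstacle_at

    def move(self, direction):
        # action request: advance one step
        if direction == "forward":
            self.position += 1

robot = RobotScripting()
# the script "if no_obstacle then move(forward)" in Python form:
if robot.no_obstacle():
    robot.move("forward")
print(robot.position)  # robot advanced one step, prints 1
```

The point of the sketch is that the scripting layer only needs the two parsed commands; everything else is ordinary control flow.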
3g3 Increasing the automation level with natural language
An example for a low amount of automation is a teleoperated robot. The human operator is forced to
provide every detailed action to the robot. The robot will stop working if the human steps away from the
panel. The reason why teleoperation is so common is that it allows to control robot arms without
programming them. There is no need to write advanced software; the human operator is in
charge of controlling the system in realtime.
Increasing the automation level is equal to recording a demonstration and playing it back later. Recording
a trajectory is not possible with classical teleoperated arms. The best way to record something is by
using natural language events and actions. If the interaction between human and robot works with
words, it's possible to store these words together with a timecode in a csv file.
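A small sketch of such a recording, assuming a hypothetical session format with timecode, role and word columns:

```python
import csv

# Hypothetical sketch: store the spoken command words of a session
# together with a timecode (seconds) in a CSV file.
def record_session(events, filename="session.csv"):
    # events: list of (timecode, role, word) tuples
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["time", "role", "word"])
        for t, role, word in events:
            writer.writerow([t, role, word])

events = [(0.0, "speaker", "moveto"), (2.5, "hearer", "done"),
          (3.0, "speaker", "grasp"), (5.1, "hearer", "failure")]
record_session(events)
```

Played back later, such a file is already a machine readable macro of the demonstration.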
Unfortunately, a classical joystick doesn't understand words, and pressing the joystick slightly
forward won't generate a sentence. What is needed is a different sort of teleoperation interface. Such an
interface is created in a two step process. The first step is to create a dialogue between a hearer and a
speaker; secondly, the language corpus is extracted from this dialogue. Let me give an example.
Suppose the task is to control a robot in a maze. A possible dialogue would sound like:
Speaker: locate the next object
Hearer: done
Speaker: move towards the object
Hearer: done
Speaker: grasp the object
Hearer: failure
Speaker: move back to base station
Hearer: ok
In the dialogue a certain vocabulary was used: (moveto_object, grasp, moveto_basestation,
failure, ok). All the words have to be implemented in software. The idea is that a real robot can
communicate the same way. That means the object oriented class in the software which represents the
robot has methods like “moveto()” and “grasp()” which can be executed. The methods allow to model the
interaction between a speaker and a hearer.
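A sketch of this vocabulary as a class; the method names mirror the dialogue above, while the internal state and the "done"/"failure" return values are illustrative assumptions:

```python
# Sketch: the dialogue vocabulary (moveto_object, grasp, moveto_basestation)
# implemented as methods of a robot class.

class MazeRobot:
    def __init__(self):
        self.holding = None
        self.location = "start"

    def moveto_object(self):
        self.location = "object"
        return "done"

    def grasp(self):
        # grasping only works when standing at the object
        if self.location != "object":
            return "failure"
        self.holding = "object"
        return "done"

    def moveto_basestation(self):
        self.location = "base"
        return "done"

robot = MazeRobot()
print(robot.moveto_object())       # done
print(robot.grasp())               # done
print(robot.moveto_basestation())  # done
```

Each speaker utterance becomes a method call, and each return value is the hearer's answer in the dialogue.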
3g4 An instrument panel for man machine interaction
In the past, Artificial Intelligence was mostly imagined from an algorithm perspective but seldom
explained as interactive intelligence. Interaction means that a human operator sends commands to the
robot. For doing so a certain panel is needed. Such a panel looks like a piano with additional status
LEDs. In comparison to other concepts in computer science the proposed control panel looks trivial. It's
simply a widget with labeled buttons and lamps.
At the same time such a panel is the fundamental building block in advanced robotics. It allows to
encode the communication between a human and a robot. The human has predefined commands which
can be sent to the robot, and the robot can answer the requests with predefined sensor signals. The
system has much in common with an engine order telegraph used in old ships for communicating between
the bridge and the machine room.

Figure 7: control panel of a robot (numbered action buttons 0-3 and sensor LEDs 0-3)

The main idea is to describe artificial intelligence not from a robot perspective, but as a communication
protocol. There is a need to send messages back and forth between the speaker and the hearer. This
communication is equal to the generated intelligence.
The described control panel has the advantage that from a technical perspective it can be realized easily. Each action has a number and each LED has a number too. The
robot can receive a command id from the human, and the robot can activate a sensor id in response. At the
same time, the commands are not random: each button is labeled with a text message. For example,
button1 means “rotate left”, button2 means “stop” and so forth. The meaning is only available for the human.
From a technical perspective the label has no function.
Let me give an example. Suppose the robot receives the command “0”. This command asks the
robot to rotate to the left. The robot does so, and if the operation was completed it responds with
a status message “sensor0=on”, which means “ok, command was executed”. Suppose the robot has
detected an obstacle 10 pixels in front; then the robot will send the message “sensor2=on”. Sensor2
stands for “obstacle detected” and the LED in the front panel gets activated. This allows the human
operator to decide what to do next. Perhaps he will press the button “stop”.
From a mathematical perspective the communication can be encoded in feature vectors: action=[0,0,0,0], sensor=[0,0,0,0]. The vectors themselves have no meaning; it depends on the labels next to each
entry how to interpret the signals. Because of this the communication is distributed between two
parties. The robot alone can't recognize what to do next.
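The panel encoding can be sketched in a few lines; the concrete labels and the response mapping are invented assumptions, only the numbered buttons and LEDs follow the description above:

```python
# Sketch of the control panel: numbered actions and sensors,
# with labels that carry meaning only for the human operator.

action_labels = {0: "rotate left", 1: "rotate right", 2: "stop", 3: "forward"}
sensor_labels = {0: "command executed", 1: "battery low",
                 2: "obstacle detected", 3: "goal reached"}

action = [0, 0, 0, 0]  # feature vector of pressed buttons
sensor = [0, 0, 0, 0]  # feature vector of active LEDs

def press_button(button_id):
    action[button_id] = 1
    # assumed robot behavior: it acknowledges the command
    # by activating sensor 0 ("command executed")
    sensor[0] = 1

press_button(0)
print(action, sensor)  # [1, 0, 0, 0] [1, 0, 0, 0]
```

The vectors alone are meaningless; only the label tables next to them tell the human what was said.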
3g5 Communication based AI
In contrast to existing programming techniques like finite state machines, behavior trees and neural
networks, a communication based AI assumes that the intelligence is located outside of the robot. Every
robot gets teleoperated by a human, and communication based AI puts a high focus on the man machine
interface, which is mostly textual.
In other words, it's not a true AI but a normal videogame in which a human makes the decisions. The
decision making process is simplified to selecting an action from a dropdown menu. Let us investigate
the concept of a GUI in detail. A typical Windows based GUI consists of a menubar at the top
of the window and a status bar at the bottom. The user is asked to select an action, for example “File
-> Open”, and then something will happen. Sometimes feedback is shown in the status bar to
inform the user about possible problems.
It's for sure that a window in an operating system isn't intelligent. All the menus are preprogrammed
into the software. The computer has to render the window for the user. The assumption is that robotic
control can be realized with the same paradigm.
3g6 Programming an instruction following robot step by step
At the beginning, the robot has no vocabulary, so the dictionary is empty. At first, only basic commands
like up, down, stop are added to the dictionary. The dictionary is shown on the screen and for each
entry a Python function is created.
The function names are encoded in the dictionary a second time for two reasons:
first, each command is mapped to a number, and secondly, the dictionary can be shown
on the screen. This visual appearance is important because the primary interaction with the robot is
manual. The human operator has a menu with possible commands, and then he can press a button on
the keyboard.
If the basic commands (up, down, stop) are working, dedicated sensory commands can be implemented:
(isobstacle, distancetogoal, battery). These sensory commands won't do anything visible, but
they measure values and store them in the robot's class. The return value 0 signals that the
measurement was successful.
Suppose the human operator has created around 10 different commands to control the robot and
measure something. These commands are used to interact with the robot and fulfill tasks. For example,
the robot can be commanded to go to a certain waypoint, pick up an object and return to base. In the
last step, the manual interaction with the robot gets automated by writing a script. The script consists
of command calls which are already there, e.g. a sequence would be [0,4,1,1,3]. This macro will run
autonomously.

    class Robot:
        def __init__(self):
            self.comboard = {
                0: "up",
                1: "down",
                2: "stop",
            }

        def up(self):
            return 0  # success (1 = failure, 2 = other)

        def down(self):
            return 0  # success

        def stop(self):
            return 0  # success

Figure 8: source code for instruction following robot

The most important element in the source code isn't the executable code but the dictionary which is
used for human to machine communication. Each entry has an id which is important for the computer
and a text label which is important for the human. The idea is that the human describes domain
knowledge in natural language words. For example, he would say that at first the robot should go to a
waypoint, then pick up the object and then do other tasks. These English sentences won't make any
sense to a robot. The only language a robot speaks is numbers. So the English sentences have to be
converted into command codes. In the interactive mode, the human operator (=client) sends command
codes to the robot (=server), which are answered with a numerical response code, e.g. 0=success. So
the interaction works by sending numbers back and forth.
One possible concern against an instruction following robot is the absence of a communication
protocol. In contrast to computer related data communication protocols like TCP/IP or RS-232,
there is no standard available for the possible commands. So the protocol has to be invented from
scratch, and each domain has a different list of commands and response codes. This situation increases
the complexity. Instead of simply implementing an existing protocol in software, the first step is to
design the protocol itself.

    time  request (human = client)  response (robot = server)
    0     0                         0
    1     4                         0
    2     1                         1
    3     1                         0
    4     3                         2
    5     2                         0
    6     3                         0

Table 5: Sending command codes to the robot
3g7 Robot control with a protocol
Printers are usually controlled with a printer protocol, for example Epson ESC/P. The protocol is defined in a table which provides different
commands for selecting a font, requesting status information and feeding the paper forward. The software which
sends the commands to the printer is called a driver.
The principle can be adapted to robotics programming too. The idea is that the robot is a printer and
the communication is realized with a protocol. The commands depend on the domain. A line following
robot, for example, needs commands for detecting light on the ground and for rotating left
and right. The main difference in robotics is that there is no standardized language available; the
protocol has to be invented from scratch.
In contrast, there is no need to program an artificial intelligence. What is called an AI is a normal
driver which sends the commands to the robot. The logic of the robot is defined in the protocol. Complex
robots like a biped robot will need a more advanced command protocol.
Let me give an example. Suppose there is a robot protocol available with 30 different commands.
The driver sends the following command sequence to the robot: [4,1,12,11,2]. This sequence can be
roughly translated into: [start, getstatus, calibrate, searchline, forward]. It should be mentioned that a
protocol is created as a communication tool between two parties. One device sends a command,
and the other device receives the command. The question is not how to program an algorithm on a
single device; it's an interactive process between two parties. For reasons of simplification, a protocol
allows to teleoperate the robot, except that the teleoperation doesn't work with a joystick but with the
protocol, which grounds the communication in words.
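A minimal sketch of such a driver for the example sequence; the protocol table and the response behavior are invented assumptions, only the command codes [4,1,12,11,2] follow the example above:

```python
# Sketch of a "driver" for a line following robot with a
# hypothetical protocol table of numbered commands.

protocol = {1: "getstatus", 2: "forward", 4: "start",
            11: "searchline", 12: "calibrate"}

def robot_receive(command_id):
    # the robot side only sees the number, not the label
    return 0 if command_id in protocol else 1  # 0 = success, 1 = failure

def driver_send(sequence):
    # the driver sends each command code and collects the responses
    return [robot_receive(cid) for cid in sequence]

print([protocol[c] for c in [4, 1, 12, 11, 2]])
# ['start', 'getstatus', 'calibrate', 'searchline', 'forward']
print(driver_send([4, 1, 12, 11, 2]))  # [0, 0, 0, 0, 0]
```

The driver and the robot only exchange numbers; the labels in the protocol table exist for the human who designs it.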
3g8 Object oriented programming
One possible reason why instruction following and teleoperated robots aren't very popular is that they
seem to offer no advantage over object oriented programming (OOP). OOP usually means to create an
object (e.g. an agent) and then attach methods to this object like “openhand()”, “standup()” etc. The
surprising situation is that the modern term “instruction following” means basically the same: the
robot is capable of running some predefined methods.
Roughly spoken, classical OOP and command based natural language instruction are the same. In
both cases, a single object receives messages from the outside, which might be a human with a GUI
interface or a main program which runs a script. Since object oriented programming
has become mainstream computer science, the assumption is that there is no need to reinvent the wheel.
On the other hand, the newly published papers about instruction following are completely different
from former papers about object oriented programming, so there must be a difference. One
possible advantage is that instruction following is closely related to knowledge modelling, while OOP
is about the internal structure of a computer program. OOP related programming languages like C++
and Java were created with the goal to simplify the programming itself, which is typing in the source
code and bugtesting an application. In contrast, dialogue based instruction following has its roots in the
symbol grounding problem, which tries to encode robot commands in a dictionary.
In other words, OOP has its origin in computer science, while instruction following is closely
related to natural language processing.
3h Instruction following with speaker and hearer
The communication process between speaker and hearer is very different from the existing understanding
of robot programming. The classical assumption in the past was that a robot is an embedded agent
which runs an intelligent algorithm. In contrast, the speaker hearer model assumes that the task of robot
control consists of two different layers which work independently from each other.
The task of the speaker is to provide a sequence of useful actions. From a technical perspective
it's equal to writing a script. For example, to navigate a robot from start to goal a certain script is needed
which makes sure that the robot has enough energy, avoids the obstacles and moves in a certain
direction. Apart from this speaker related task there is a second task. The hearer (=listener)
has to make sure that the instructions are executed. For example, the command “move north” has to be
translated into control actions for the servo motor.
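The two layers can be sketched as two classes; the concrete script and the mapping from instructions to servo values are invented assumptions:

```python
# Sketch of the two-layer model: a Speaker produces high level
# instructions, a Hearer translates them into low level motor values.

class Speaker:
    def plan(self):
        # high level script for reaching the goal
        return ["check energy", "avoid obstacle", "move north"]

class Hearer:
    # assumed mapping from instruction to (left, right) servo values
    MOTOR = {"check energy": (0.0, 0.0),
             "avoid obstacle": (0.5, -0.5),
             "move north": (1.0, 1.0)}

    def execute(self, instruction):
        return self.MOTOR.get(instruction, (0.0, 0.0))

speaker, hearer = Speaker(), Hearer()
for instruction in speaker.plan():
    print(instruction, "->", hearer.execute(instruction))
```

The two classes could be written by different programmers; only the shared instruction vocabulary connects them.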
The communication between a speaker and a hearer is realized with a grammar. The BNF grammar
defines a domain specific language. It's surprising that such a robot control paradigm has been
discussed in the literature since the 1980s. The problem with DSLs for robot control was, and is even
today, that it's very complicated to invent such a language from scratch. This might explain why there
are only a few working examples available.
Instead of rejecting the concept altogether, it makes sense to figure out a more efficient strategy to create
a domain specific language. One possible attempt in doing so is to divide the problem further. There is
a dataset which stores a speaker/hearer interaction, and there is an implementation
which is the source code in a programming language. Let us focus on the first element in detail.
A speaker/hearer dataset stores the gamelog with textual interaction. It records the natural language
between two human operators. One of them gives the commands, the other executes them. It's some
sort of motion capture recording, but with two persons at the same time. Perhaps it makes sense to
describe the situation from an outside perspective.
37
dataset
domain
specific
language
speaker
hearer
speaker
hearer
Figure 9: speaker hearer dataset
There is a normal mocap studio. The cameras track the markers of the human actors. In
addition, there is a room in which the speaker sits. Both persons are connected with a microphone, and
everything is recorded with a computer. The speaker says something, e.g. “jump”, and the mocap actor
has to fulfill the request. The mocap actor can return a response if he likes, like “ok”. The result of the
interaction is fed into a dataset, which is basically a time based table with different columns.
The only purpose of the mocap recording is to generate a dataset. The dataset has some properties:
it contains natural language instructions, it has a time code and, very important, it stores the mocap data.
So it's a multimodal speaker hearer dataset.
The dataset is used in a second step to create a domain specific language. The task is to translate
the dataset into a BNF grammar which can be parsed by a computer. The natural language instructions
would overwhelm a computer program, so they need to be converted first into a controlled vocabulary of less
than 20 words. Complex sentences like “could you please jump one time” are converted into machine
readable commands like “jump()”.
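A sketch of this reduction step; the keyword table is an illustrative assumption for a controlled vocabulary of a few words:

```python
# Sketch: reduce free natural language to a controlled vocabulary
# by scanning the sentence for known keywords.

vocabulary = {"jump": "jump()", "sit": "sit()", "walk": "walk()"}

def to_command(sentence):
    # pick the first known keyword from the sentence
    for word in sentence.lower().split():
        if word in vocabulary:
            return vocabulary[word]
    return None  # sentence not covered by the vocabulary

print(to_command("could you please jump one time"))  # jump()
```

Everything outside the vocabulary is simply dropped; this is what keeps the parsing problem tractable for a computer.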
3i Behavior tree based instruction following
In the minimal case, instruction following means that the human operator selects a command, e.g.
“moveto waypointA”, and the robot executes the command. Such kind of interaction seems
interesting, but in a technical sense the robot is controlled by teleoperation and can't solve problems by
itself. There are two possible techniques to overcome the challenge: scripting AI and behavior
trees. A behavior tree stores a longer action sequence in a hierarchical fashion, while scripting AI means
basically the same, but the notation is written down as computer source code.
Perhaps it makes sense to give an example. Suppose the vocabulary for a maze robot consists of
the commands from the table. Then it's possible to concatenate the commands into a sequence like
[2,3,4] or [0,5,4,1]. A behavior tree allows to formulate such a sequence. Executing the behavior tree
will generate the commands which are sent to the robot.
It's a bit complicated to explain where exactly the intelligence of a robot is hidden, because both
techniques, the behavior tree and the command vocabulary, are technically trivial. The advantage
has to do with the abstraction mechanism. The ability to send a command to the robot improves
the interaction with the robot. Instead of determining the servo signals, the human operator selects
only a command id and the robot executes the action. In addition, the behavior tree provides an
additional layer which encodes a longer sequence of possible actions. The result is a robot which looks
like a teleoperated robot, but no human is needed; the behavior tree is the source of the commands.

    id  description
    0   moveto waypointA
    1   direction north
    2   pickup object
    3   release object
    4   moveto waypointB
    5   direction south

Table 6: Command vocabulary
The inner working will become clearer with the attempt to improve the robot system. Suppose
the robot should solve a different task. Then a new behavior tree is needed. If the entire domain is
different, the command vocabulary has to be modified too. So we can say the current commands plus
the behavior tree determine the limits of the robot.
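A minimal sketch of a sequence node which emits command ids to the robot; the vocabulary mirrors Table 6, while the success codes are an assumption:

```python
# Sketch: a behavior tree sequence node over the command vocabulary.
# Executing the tree generates command ids which are sent to the robot.

vocabulary = {0: "moveto waypointA", 1: "direction north", 2: "pickup object",
              3: "release object", 4: "moveto waypointB", 5: "direction south"}

class Sequence:
    def __init__(self, children):
        self.children = children  # list of command ids

    def tick(self, send):
        # execute children in order; abort on the first failure
        for command_id in self.children:
            if send(command_id) != 0:
                return "failure"
        return "success"

log = []
def send_to_robot(command_id):
    log.append(vocabulary[command_id])
    return 0  # assumed: every command succeeds

tree = Sequence([2, 3, 4])
print(tree.tick(send_to_robot))  # success
print(log)  # ['pickup object', 'release object', 'moveto waypointB']
```

Changing the task means only swapping the child list; changing the domain means rewriting the vocabulary.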
3i1 From commands to behavior trees
Even if behavior trees are a common tool in creating game AI characters and are explained in many
tutorials, they are only one among many possible approaches in Artificial Intelligence. It's not because
they have a disadvantage, but because it's complicated to explain what exactly a behavior tree is about. From
a technical perspective, behavior trees are a graphical representation of a computer program. They
have much in common with scripting AI. The programmer has to write down a sequence of actions
which can have sub actions, and this program gets executed. Writing a behavior tree with a textual
programming language like Lua or Python is possible, and this increases the confusion, because
writing a Python script and creating a behavior tree is the same.
Nevertheless, behavior trees are a powerful AI tool and they are used frequently in videogames.
So there is a need to explain the advantages more precisely. The basic element of a behavior tree is a
node which stands for an action. Such a node is equal to a command given to a robot, e.g. “moveto
waypointA”. If the node was executed, the next node can be executed. The open question is: what
exactly is a command which is encoded as a node? This question is indeed hard to answer. It has
nothing to do with the inner working of a robot; a command is a speech act sent from the human to
the robot.
The interesting situation is that any possible node has to be grounded in advance. The robot needs
to understand what the meaning of “moveto()” is. Before a certain behavior tree can be executed on a
robot, all the possible node commands have to be implemented in a robot language. Such a language is
the underlying concept which explains why the approach is powerful. A behavior tree is only a concrete
program which takes advantage of the implemented commands. The real power has its origin in the
single commands and the ability of the robot to understand them.
Let me give an example. Suppose a robot can be controlled with the following commands:
moveto(), pickup(), release(), rotate(), speed(), stop(). Then it's possible to formulate complex actions
by concatenating these commands. The action sequence can be visualized in a behavior tree, it can
be written down in a textual program, or the commands can be executed interactively by the human
operator.
Domain specific languages are used to ground natural language commands into actions. A DSL is the
computer program needed to understand a command like “moveto()”.[Howard2022] So we can say
that not the behavior tree itself is the source of intelligence, but the ability to translate an action node
into an action. This allows to control robots on a higher abstraction level. Instead of using a joystick to
move the robot forward, a textual command is entered.
3i2 Behavior tree as dialogue games
One possible cause why behavior trees were difficult to understand in the past is that they were explained
from a computer perspective. The idea was that a behavior tree is a C++ library to create an agent.
The problem is that a behavior tree is located above existing computer software. A more precise
explanation is that a behavior tree is a language game between two humans: a speaker and a hearer.
This language game can be implemented on a computer, but there is no need for doing so.
Let me describe a certain situation. Suppose the task is to prepare a meal in the kitchen. The
described language game has to be adapted to this situation. The speaker might say “go to the table and
take some ingredients. Then go to the stove and put the ingredients into the stove”. In response the
hearer of the game might say things like “Acknowledged”, or he might ask back “Which ingredients
exactly?”. The overall interaction game works by sending natural language statements between both
actors in the game.
A behavior tree is only the computerized version of such a game. The dialogue is reformulated
in actions, and the single actions are connected into longer sequences. This paradigm allows to control
the robot. But let us go back to the dialogue. The idea is that speaker and hearer use the same
vocabulary. All the possible commands are numbered from 0 to 99. This ensures that both actors
understand each other. It formulates a discourse space in which possible actions have a position.
This concept has much in common with a domain specific language, which is a computer representation
of a vocabulary.
3i3 Dialogue based teleoperation
The main advantage of constructing an interactive speech related game is to make the natural language
visible, which includes the ability to record it. In a normal single user teleoperation system the human
operator remains silent while he controls the robot arm. The silence is needed because the operator
needs to increase his awareness of the situation. Any sort of noise would distract from the robot control.
Unfortunately the silence makes it hard to reproduce the task. Even if the human operator is asked
to think aloud, the amount of generated speech isn't enough to get a deep understanding of the situation.
The elaborated attempt at capturing the natural language, including the hidden domain knowledge, is
a dialogue oriented teleoperation which consists of a speaker and a hearer role. The speaker gives a
command like “grasp the apple, please” while the hearer controls the robot and gives feedback like
“done” or “there is no apple”. Such a dialogue can be recorded into a text file, and it's important to extract
the relevant vocabulary, for example all the nouns and all the action words.
This extracted vocabulary helps a lot to model a domain. The recorded dataset formulates a
challenge known as instruction following. This is the reverse interaction: there is a dialogue available
as input and the robot has to execute the actions. That means the robot has to imitate either the speaker
or the hearer from the dataset. The surprising insight is that such an imitation game can be fulfilled by
a computer, especially if the natural language is equal to a controlled vocabulary which consists of 10
actions and 4 nouns.
So we can say the robot domain gets converted into a linguistic dataset which allows to program
and benchmark a robot. The linguistic dataset acts as an abstraction mechanism which encodes
a problem into a machine readable format.
3i4 Voice commands and behavior trees
Voice commands for robot control have been known for years in robot programming, but apart from simple
demonstrations they have never become very popular. The first problem is that such a robot isn't perceived
as an autonomous robot, and secondly it seems that such robots can't solve a longer task.
The surprising situation is that with a simple modification, namely a behavior tree, both issues can
be overcome. A behavior tree is a longer and autonomous variant of voice control. Instead of interacting
with the robot in a dialogue, the commands and possible outcomes are scripted in a computer program
which is played back on the robot. The resulting system is similar to the Karel the Robot programming
language, which is also a powerful scripting technique. A Karel robot can be programmed to do anything.
It's possible that the robot will move around, or it will put some objects into the map. The robot is
limited only in two ways: first by the script and secondly by the amount of possible commands.
A voice controlled robot is used to test and implement a vocabulary. It allows to recognize which
commands are needed and to evaluate if the robot understands the commands. It's a prestep to programming
a longer script.
Let me give a concrete example. Suppose the robot should cook a meal in the kitchen. For doing so
the robot needs a list of possible commands like “open drawer”, “take object”, “cut food”, “clean the
table”. Every command is a subset of the overall activity. A meal can be prepared by combining the
actions into a longer sequence.
40
command
ID
0
1
2
3
Text
Speaker: Go to north please.
Hearer: Ok.
Speaker: Is there an obstacle?
Hearer: No
Tagging
command, direction
response, ack
question, lidar sensor
response, no
direction
no
response
ack
question
lidar
Table 7: Dialogue dataset and connectionist encoding
3j Connectionist model of dialogue
A wizard of oz experiment has the aim to create a dataset which contains a natural language dialogue.
The speech between a hearer and a speaker is recorded with the attempt to extract commands, action
words and object names from the utterances. The next logical step is to convert the dialogue dataset
into a computer program which emulates the roles of the speaker and hearer. This translation can be
realized with a connectionist model which consists of input and output neurons.
The main advantage is that in a connectionist model the information is represented in an atomic
structure which makes it easier to process the information with a computer. The table “Dialogue
dataset” provides a simple example in which the speaker gives an order to the hearer. Both parties
communicate in normal natural language, which is easy to generate for humans but difficult to
understand for a computer. A possible intermediate representation is provided by tagging the speech
acts. These tags can be converted into neurons of a connectionist model.
The rough idea is that during the dialogue between speaker and hearer one or multiple neurons in
the connectionist model get activated, which allows to capture the interaction with mathematical
precision. In contrast to a natural language sentence which is stored as a string, a single neuron can be
either on or off and might be stored in a feature vector.
There is a similarity to a communication board which is used in language experiments. A communication board is a series of word buttons which can be activated or not, and this simplifies the
communication. Or at least it allows a computer to understand the meaning easily.
Now its possible to solve the original problem which is how to emulate two humans which are
doing the dialogue. The connectionist model has to activate the same neurons like the human actors
at the right moment. Such a behavior can be learned with a neural network algorithm. This allows to
replace humans with machines.
Assigning possible answers to predefined nodes in a graph isn't completely new; it is used frequently
in dialogue design. A dialogue usually consists of a text field, e.g. “are you sure that the program
should be ended?”, and a list of possible options like “yes, no, cancel”. This paradigm allows
formulating the interaction in a precise form. The user can't answer with free text but has to select a
button. In the program, the possible button events are processed into a certain behavior of the software.
This allows a GUI application to be created and debugged.
In the case of instruction following for human-to-robot interaction the same principle can be adapted.
The dialogue gets transformed into a predefined graph of question/answer pairs and each possible
interaction will produce an event. For example the question “is there an obstacle?” can be answered
with yes, no or “don't know”, and each possible answer results in a predefined situation. In other words,
possible bugs of the robot program can be traced back to missing dialogue modeling. If the robot
answers a question wrongly, the problem can be analyzed and fixed. This ability may explain why the
“instruction following” task is an interesting problem in robotics. It usually means sending a series of
questions and commands to the robot and benchmarking whether the robot is able to execute the
commands. In the best case the robot can understand 100 different commands which consist of object
names, possible locations, and obstacle-finding questions.
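Such a predefined question/answer graph can be sketched as a plain lookup table. The question, answer set and event names below are hypothetical, chosen only to mirror the obstacle example in the text.

```python
# Predefined dialogue graph: each question has a fixed set of allowed answers,
# and each answer triggers a named event, similar to button events in a GUI.
DIALOGUE_GRAPH = {
    "is there an obstacle?": {
        "yes":        "event_replan_route",
        "no":         "event_continue",
        "don't know": "event_request_sensor_scan",
    },
}

def handle_answer(question, answer):
    """Return the event for an answer; an unknown answer exposes a modeling gap."""
    options = DIALOGUE_GRAPH[question]
    if answer not in options:
        # A wrong answer is traced back to missing dialogue modeling.
        return "event_dialogue_gap"
    return options[answer]
```

Because every interaction maps to a named event, a wrong robot response can be traced to a specific missing entry in the graph and fixed there.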
Figure 10: Ship steering control with commands
An instruction-following task is the opposite of the former understanding of Artificial Intelligence,
which was dominated by the goal of implementing a certain amount of onboard intelligence. In the
past, most robotics projects were created with the goal that the robot itself is able to think and execute
an action. In contrast, the instruction-following task assumes that the robot has to interact with a human
operator who asks the questions and gives the commands. The only task for the robot is to fulfill
these requests.
4 Practical demonstration
4a Creating a cleaning bot with a dialogue
Suppose there is a grid and the task for the robot is to clean up a certain cell which is highlighted in
red. Such a robot can't be programmed directly, because the inner workings of the robot are either
unimportant or trivial. The better way to describe the behavior is to assume an abstraction layer in the
form of a natural-language dialogue. There is a virtual speaker outside of the maze who is talking to the
robot. And the language sent back and forth has to be converted into robot commands and into a
behavior tree.
The described dialogue doesn't belong to the internal Artificial Intelligence of a robot agent, but it is
part of the game description. There is the mentioned grid game which has a grid of 10x10 cells, a robot
which can move in 4 directions and the speaker-hearer dialogue. With these constraints it's possible to
program the robot. The robot has to act inside this rule system. And if the robot isn't able to solve the
task, the dialogue has to be extended. That means more question-and-answer pairs are needed which
reflect additional knowledge encoded in natural language.

speaker                         hearer
What position has the object?   (3,5)
movetoX()                       ok
Isobstacle()                    No
movetoY()                       ok
Cleanobject()                   Failure
returntobase()                  ok
yourbattery()                   54%

Table 8: Speaker hearer interaction for robot cleaning domain

For reasons of simplification we can assume that a speaker-hearer dataset is the knowledge layer for a
robot. It reduces the state space drastically. The robot in the maze isn't solving a path planning problem
and it's not solving a reinforcement learning problem; instead, the robot produces and understands
natural language.
Natural language is a collection of nouns, action words and speech acts like a question or an answer.
For example the robot might answer with a simple “failure” or it can say “Sorry, but I can't clean the object”.
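The speaker-hearer exchange of Table 8 can be sketched as a small Python stub. The canned responses simply replay the table; a real robot would query its sensors, and the class name is illustrative.

```python
# Minimal hearer stub for the cleaning domain of Table 8.
# The responses are canned to mirror the table; a real robot would
# answer from its sensors and actuators instead.
class CleaningHearer:
    def __init__(self):
        self.battery = 54  # percent

    def execute(self, command):
        """Map a speaker command to a natural-language style response."""
        if command == "Isobstacle()":
            return "No"
        if command == "Cleanobject()":
            return "Failure"
        if command == "yourbattery()":
            return f"{self.battery}%"
        return "ok"   # movetoX(), movetoY(), returntobase(), ...

hearer = CleaningHearer()
dialogue = [(cmd, hearer.execute(cmd))
            for cmd in ["movetoX()", "Isobstacle()", "Cleanobject()", "yourbattery()"]]
```

The recorded `dialogue` list is exactly the kind of speaker-hearer log the paper proposes as the knowledge layer for the robot.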
The main advantage of a language-based interaction is that it can be debugged quite easily. Even a
non-programmer can interact with the robot and will recognize in exactly which situation the robot
doesn't fulfill the request.

Figure 11: Screenshot of gripper task
4b A grammar controlled gripper robot
A movable gripper in a box2d world has to pick&place a ball. This is realized with a simple vocabulary:
action ::= stop | rotleft | rotright | forward | open | close
The human operator can select one of these actions from the menu and this will move the robot.
Until now the situation doesn't sound very advanced and there is no advanced Artificial Intelligence
implemented. Apart from the mentioned BNF grammar no further libraries or functionality were
implemented in the prototype which is shown in the figure.
In other words, the robot gripper is controlled manually by pressing buttons on a keyboard. The
surprising situation is that the given commands can be executed in a script, so it's possible to program
the robot in an autonomous fashion. The situation is the same as in the “Karel the Robot” example,
except that the robot world is not a simple grid maze but a box2d physics engine.
Let us investigate how a typical robot program might look. What the gripper can do is: forward for
5 seconds, closegripper, rotleft for 5 seconds, forward.
Such an action sequence will grasp the ball and bring it back to the start. In contrast to a famous myth,
a teleoperated robot can be automated easily; the only requirement is that a vocabulary is available,
which was given in the beginning as a BNF grammar. The programmer has to design a word list (in
the example it consists of 6 actions) and then a script references these words. After executing a
single action, for example rotleft, the robot is doing something. In the concrete case, rotleft triggers
a Python code segment which modifies the angular velocity in the physics engine. The human operator
doesn't need to know the details; it's enough to recognize that the button will rotate the robot.
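The dispatch from action words to motor commands can be sketched as follows. The setter functions stand in for the real physics-engine calls, which are not reproduced here; the velocity values are invented placeholders.

```python
# Dispatch table from the BNF action vocabulary to motor commands.
# set_angular_velocity / set_linear_velocity are stand-ins for the real
# box2d calls; they only record the state so the dispatch stays visible.
state = {"angular": 0.0, "linear": 0.0, "gripper": "open"}

def set_angular_velocity(v): state["angular"] = v
def set_linear_velocity(v):  state["linear"] = v
def set_gripper(mode):       state["gripper"] = mode

ACTIONS = {
    "stop":     lambda: (set_angular_velocity(0.0), set_linear_velocity(0.0)),
    "rotleft":  lambda: set_angular_velocity(+1.0),
    "rotright": lambda: set_angular_velocity(-1.0),
    "forward":  lambda: set_linear_velocity(2.0),
    "open":     lambda: set_gripper("open"),
    "close":    lambda: set_gripper("closed"),
}

def execute(word):
    """Run one vocabulary word; words outside the grammar raise KeyError."""
    ACTIONS[word]()
```

The point of the sketch is that the vocabulary itself is just a lookup table; whether a word moves a keyboard-driven prototype or a physics engine is hidden behind the dispatch.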
The reason why an action vocabulary is a powerful tool is that it increases the abstraction level.
Instead of programming a robot in a certain language like Python or C++, the interaction with the
machine is realized with natural-language commands. Nobody cares whether the robot was programmed
internally in Java, Python or any other language; the only interesting subject is the BNF grammar
which defines the textual user interface to the robot. In theory it's possible to extend the grammar with
some sensor commands, for example: getdistance, getcontactforce, getangle. These pieces of information
might help to locate the ball and grasp it with the correct amount of force. For reasons of simplification
this sensor feedback isn't implemented.
It's a bit hard to explain what the program is doing from a technical perspective, because the software
is different from a classic robot AI. In technical terms the program is mostly a graphical user interface
which shows the available commands to the user and makes sure that after pressing a number key the
action gets executed. So it's some sort of GUI interface without any further algorithm. If the idea is
reduced even more, the BNF grammar is the core element, which is no algorithm but a list of possible
words to interact with a robot. So it's a pulldown menu similar to what is known from classical
desktop applications.
The surprising situation is that such a trivial GUI interface allows the robot to be controlled with ease.
After a short amount of time, every newbie can send the correct commands to the robot.
Let us try to invent a more advanced example in the context of instruction following. Suppose the
task for the robot is to collect not one but hundreds of balls which appear at random positions on
the map. In each case, the robot has to move to the ball, grasp it and bring it back to the starting
position. Such a task can be realized with a script which uses the predefined commands. All that
is needed are some if-then statements and a for loop. Another way to program the robot would
be a behavior tree, which works on the same principle as scripting AI. Even if the proposed
vocabulary list is very simple, it's possible to solve complex tasks.
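Such a collection script might look like the sketch below. The helper `run` is hypothetical; here it only records each replayed command so the structure of the macro stays visible.

```python
# Autonomous script on top of the manual command vocabulary.
# run(action, seconds) is a hypothetical helper that replays one command;
# in this sketch it only logs the call instead of moving a physics body.
log = []

def run(action, seconds=0):
    log.append((action, seconds))

def collect_ball():
    """The manually found action sequence, replayed as a macro."""
    run("forward", 5)
    run("close")
    run("rotleft", 5)
    run("forward", 5)
    run("open")

# Collect hundreds of balls with a plain for loop and the same primitives.
for ball in range(100):
    collect_ball()
```

No new intelligence is added; the loop simply reuses the six-word vocabulary, which is the point the section makes.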
4c How to automate a warehouse with robots
A swarm of pick&place robots in a warehouse is an easy-to-realize automation example. The robots
navigate in a controlled environment, they are monitored by a computer and they can transport
a load over a long distance. The only challenge left is how to program these robots, which is
explained in the following section.
It makes sense to build a robot swarm in a step-by-step fashion. The first thing to do is to ignore the
technical side of robot programming, because this hardware-related side is surprisingly easy
to realize. A robot is basically an electric car which consists of a battery, a motor and some sensors.
Building such a machine from scratch isn't a real problem, especially not if modern technology like a
microcontroller is available. The more serious, and seldom explained, problem is how to describe the
autonomous working of a robot in computer code. And this task is realized as a natural-language
dialogue.
Before a robot can be programmed to understand commands, it makes sense to test a self-created
vocabulary in a dry run with two humans. Both humans have a sheet with words and one person gives
the other person the instructions. For example, the speaker says “goto shelf 5”. The hearer takes a
look at his vocabulary sheet and if the words goto, shelf and 5 are available he can execute the task.
Both humans are asked to use only the words on the sheet. Even if they know much more about the
warehouse, they are restricted to the vocabulary on the sheet.
The reason is that the newly created sheet is taken as input for programming the robots. A robot
will need some expert knowledge and this knowledge is encoded in the vocabulary used in the
human-to-human interaction. Typical words in a warehouse setting are “shelf, pickup, place, bringme,
moveto, loadbattery, stop, obstacleinfront, obstacleleft” and so on. The assumption is that an entry-level
warehouse robot needs a vocabulary of less than 50 words to describe all the possible scenarios. If the
words are available, it's possible to put them into a sequence and write longer programs with the words.
This task has much in common with classical computer programming but has a strong focus on solving
domain-related problems.
Even without such a program a robot can be used in an interactive fashion if it knows the important
words. It's possible to teleoperate such a robot by pressing buttons from 0 to 50, one for each command.
For example, if the operator wants the robot to move to shelf 5, he will enter “20 5” (20=moveto). If the
operator wants the robot to pick up the object he will enter “16” (16=pickup) and so on. If the
robot sends a message back like “4”, this is equal to obstacleahead (4=obstacleahead). And if the robot
sends the feedback (5=batterylow), the machine needs to find the next charging station.
In summary, a fully working warehouse robot consists of a vocabulary of 50 words which describe
possible actions and sensor readings. This vocabulary is used to control the robot in an
interactive mode or in an autonomous mode, which is equal to a behavior tree.
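The numeric teleoperation described above amounts to a small codebook plus a decoder. The id assignments below follow the examples in the text (20=moveto, 16=pickup, 4=obstacleahead, 5=batterylow); the rest of the 50-word dictionary is omitted.

```python
# Numeric codebook for teleoperating a warehouse robot.
# Only the ids mentioned in the text are listed; a full robot would
# carry around 50 entries.
CODEBOOK = {
    4:  "obstacleahead",
    5:  "batterylow",
    16: "pickup",
    20: "moveto",
}

def decode(message):
    """Translate an operator keystroke like '20 5' into a symbolic command."""
    parts = message.split()
    word = CODEBOOK[int(parts[0])]
    args = parts[1:]
    return (word, *args)
```

The same decoder works in both directions: operator keystrokes become commands, and numeric robot feedback becomes sensor words.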
4c1 Command dictionary for a maze robot
In contrast to a famous myth, the main problem in Artificial Intelligence isn’t about inventing an
algorithm which controls a robot but about inventing a dictionary for man to machine communication.
A possible example is provided in the table, which contains 14 different commands. These commands
can be used to control the position, scan the maze and check the battery.

id  action
0   move forward
1   turn left
2   turn right
3   check wall
4   take a picture
5   stop
6   restart
7   display map
8   speed up
9   slow down
10  show battery
11  scan surrounding
12  play sound
13  toggle lights

Table 9: command dictionary for a maze robot

The surprising insight is that such a command list isn't an algorithm but a codebook, which is a list.
Each entry has an id. The human operator can enter the code, e.g. 12, and this will start a certain
behavior of the robot (here: 12=play a sound). In other words, the command dictionary is an elaborated
remote control. Instead of providing only 4 arrow keys for left, right, up and down, the human operator
can take advantage of a long list of possible commands. So it's some sort of user interface known
from computer games. The commands are mapped to shortcuts and this allows the human to interact
with the game as fast as possible.
Such a user interface might sound trivial, because most video games have such a feature implemented
already. On closer inspection, such an interface can be used to create an advanced AI-controlled robot.
The reason is that a command dictionary allows actions to be recorded and played back. Suppose
the human operator is interacting with a robot and all the keystrokes are recorded in a game log. This
will result in a MIDI-like notation, for example [0,1,0,0,11,10,2,2,0,0]. The numerical sequence
captures the actions of the robot. Such a sequence can be played back as a macro. This allows the robot
to be controlled autonomously. The new (easier to solve) task is: which sort of action sequence is
needed to fulfill a certain task? The goal for the onboard AI is to generate an optimized action sequence
which consists of actions from the codebook. In other words, with a command codebook every domain
becomes a toy puzzle game.
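The record-and-playback idea can be sketched directly from Table 9. The command list reproduces the table; the two helper functions are illustrative.

```python
# Record and play back actions from the command dictionary of Table 9.
COMMANDS = ["move forward", "turn left", "turn right", "check wall",
            "take a picture", "stop", "restart", "display map", "speed up",
            "slow down", "show battery", "scan surrounding", "play sound",
            "toggle lights"]

def record(keystrokes):
    """A game log is just the list of pressed ids, a MIDI-like notation."""
    return list(keystrokes)

def playback(game_log):
    """Replay the macro by resolving each id back to its action name."""
    return [COMMANDS[i] for i in game_log]

macro = record([0, 1, 0, 0, 11, 10, 2, 2, 0, 0])
```

Once a keystroke log exists, the planning problem shrinks to finding a good id sequence, which is the "toy puzzle game" framing of the section.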
5 Robot programming languages
In the 1980s many robot programming languages were available, like Autopass and VAL. Some of
them worked with the client-server communication model similar to the description in section “3
instruction following”. Nevertheless, the concept of a robotics language seems to have fallen out of
fashion. The main obstacle is for sure that for every domain a new vocabulary set has to be created and
an existing language like VAL can't be adapted.
The main problem with a domain-specific language for robots is that such a language is missing
and has to be invented from scratch. The initial situation is that the robot has only a small set
of commands like [on, off, moveleft] and these commands are not powerful enough to script more
complex actions. Creating a robot programming language has much in common with the chicken-and-egg
problem. Without a command set it's not possible to create a robot program, and without a
robot program there is no vocabulary. To solve the issue we have to describe the situation from a
non-computing perspective. A dataset might be the source of any language. In contrast to a robot
language, a dataset isn't implemented in a computer; it is a table, similar to a mocap dataset.
In the optimal case, a robot dataset is a game log which has pictures and natural-language instructions
from a speaker. It's the recorded interaction between a speaker and a hearer, including the picture of
the robot. The main advantage of the dataset is that it can be created with less effort than a full-blown
domain-specific language. There is no BNF grammar and no parser; the dataset is a normal MS
Word document which describes a situation.
It's correct that such a dataset is different from a robot programming language, because the table
can't be compiled into Java. On the other hand, it can be used as a valuable source for software
engineering. It's possible to send the Word document to a software programmer and ask him whether
he can create a robot programming language according to the textual dialogue between the speaker and
the hearer. That means the textual document contains the specification for creating a DSL.
t  Speaker                          Tag speaker     Hearer      Tag hearer  Picture
0  Start the engine, please.        object          done        ok          picture1.jpg
1  Move to north and grasp the box  action, object  which one?  clarify     picture2.jpg
2  The box id=3.                    object number   thank you   ok          picture3.jpg
3  Can you return to base?          waypoint        One second  delay       picture4.jpg

Table 10: Dataset for instruction following between two humans
In the example table, the resulting robot language would contain commands like “start, move,
returntobase”, a variable like “boxid” and response codes like “ok, id_needed”. In theory it's possible to
convert this information into a Backus–Naur form grammar and then implement an executable robot
programming language.
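The first step of that conversion, extracting a vocabulary from the dataset and emitting a BNF-style production, can be sketched as follows. The rows paraphrase Table 10; the column names are illustrative.

```python
# Derive a command vocabulary from a Table-10-style dialogue dataset and
# print it as a BNF-style production. The rows paraphrase Table 10.
rows = [
    {"speaker": "start",        "hearer": "ok"},
    {"speaker": "move",         "hearer": "id_needed"},
    {"speaker": "returntobase", "hearer": "ok"},
]

commands  = sorted({r["speaker"] for r in rows})
responses = sorted({r["hearer"] for r in rows})

grammar = "command ::= " + " | ".join(commands)
```

The emitted production is only the skeleton of a DSL, but it shows how the dataset, not the programmer's intuition, determines the vocabulary.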
5a The symbol grounding problem
The dominant problem in Artificial Intelligence in the past was a missing understanding of intelligence.
Robots were known from science fiction books but it was unclear whether it's possible to realize them
in reality. There was no well-defined detail problem available; the situation in general was too complex
for the programmers.
Overcoming the bottleneck can be realized with two techniques: a dataset and the description of the
symbol grounding problem. Both techniques close the gap between human and machine intelligence.
Let us start with the easier-to-explain technique, which is a dataset. A dataset is a tool to define a
problem, for example a table with image-to-text entries, a motion capture database or a game log of
a videogame. All these data structures allow capturing human knowledge. They are stored in a
well-known database, which is a list of files stored in a directory. The dataset never contains an
algorithm or the Artificial Intelligence itself; the dataset is the input value for creating an AI algorithm.
It allows benchmarking existing neural network algorithms and it helps to describe problems in a
machine-readable format.
The second tool, the symbol grounding problem, means mapping numerical values into natural
language. For example the command “go north” has a certain representation in a computer program,
while a certain sensor reading can be translated into a sentence like “obstacle ahead”. A solved symbol
grounding problem allows interacting with a computer in natural language. Similar to a dataset, the
symbol grounding problem never consists of an algorithm or a computer program; it defines a
requirement for how a computer has to work. It's a specification for the GUI which is used by the
human operator to send commands to a robot. If the GUI has a text entry field which allows entering a
command similar to a text adventure, then the domain was grounded in natural language.
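Both grounding directions can be sketched in a few lines: a command string maps to an internal representation, and a raw sensor value maps back to a sentence. The direction vectors and the distance threshold are invented for illustration.

```python
# Two grounding directions: a command string maps to an internal
# representation, and a raw sensor value maps back to a sentence.
COMMAND_TABLE = {"go north": (0, 1), "go south": (0, -1)}   # word -> direction vector

def ground_command(text):
    """Symbol -> number: look up the internal representation of a command."""
    return COMMAND_TABLE[text]

def verbalize_distance(distance_cm):
    """Number -> symbol: translate a lidar reading into a speech act."""
    return "obstacle ahead" if distance_cm < 30 else "path is clear"
```

In this view, solving the symbol grounding problem means filling in these two tables for a whole domain.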
For an AI newbie, it might be hard to see why Artificial Intelligence can be realized with
non-algorithms like a dataset and a certain text-based GUI. This understanding contradicts the common
assumption that AI has to do with writing software. The misconception has to do with the
self-understanding of computer science. In the past, until the year 2000, computer science was equal to
inventing algorithms and converting them into executable programs. A more recent description of
computer science tries to first identify the problems which can be solved later with a certain algorithm.
This problem-focused computer science is more challenging to describe, because the problem itself is
not given at the beginning.
Let me give an example: a typical computer science problem until the year 2000 was how to solve the
traveling salesman problem. This path planning problem was introduced in computer science decades
ago and was then discussed and solved multiple times. The only debate was which algorithm works
well and how to implement the algorithm in a certain programming language. What was ignored in the
past is the question of whether the traveling salesman problem makes sense, how to modify the problem
in a new direction and which other problems might be out there.
5a1 Symbol grounding mindmap
The symbol grounding problem is hard to explain and it contains many references to existing
subjects like natural language processing, Artificial Intelligence, cognitive science and man-to-machine
interaction. A possible attempt to explore the subject is a mindmap which contains keywords. Of course
the following mindmap is a bit long because it contains 142 entries, so it makes sense to reduce the
subject to only 5 keywords, which are: speaker hearer dialogue, text to animation, abstraction
mechanism, controlled vocabulary and game log.
The main objective of the following mindmap is to provide an entry point for finding more literature
about the subject. In general, symbol grounding is about a mapping from language to pictures, which
sounds trivial at first glance but is surprisingly difficult to realize with a computer. The assumption is
that the symbol grounding problem is the core problem within Artificial Intelligence. It's basically a
tool to reduce the state space. Only with a small state space is it possible to solve robotics problems on
normal computer hardware.
1 sensor
1a clustering ID3
1a1 classification →3a2b
1a2 space partitioning →3b11
1b qualitative sensor
1b1 symbolic event recognition →3a2b
1b2 human activity recognition
1b3 system identification
1b4 game log
1b5 event taxonomy
1b6 pattern recognition
1c qualitative physics
1c1 Psychophysics
1c1a cognitive science
1c1b minimal quantum theory
1c1c spatial grounding
1c1d quantum system identification
1d bar chart
1d1 spectrogram
2 origin
2a Stevan Harnad, 1990
2b Rodney Brooks, 1990
2b1 physical grounding hypothesis
2c John Searle, 1980
2d Luc Steels, 2000
2d1 cognitive science
2e practical application
2e1 Yiannis Aloimonos
2e2 Jeffrey Siskind
2e3 Neil Dantam
3 symbols
3a language
3a1 concept
3a1a concept learning
3a2 sensory words
3a2a measurement
3a2b categorical perception
3a3 natural language command →4
3a4 voice command
3a5 mini language
3a6 activity language
3a7 motion description language
3a8 instruction following
3a9 route instructions
3a10 Scripting AI
3a11 BNF grammar
3a12 Behavior tree →1b5
3a13 Domain specific language
3b representation
3b1 feature vector
3b2 ontology
3b2a grounding graph
3b2b neural blackboard architecture
3b2c multigraph
3b2d lexigram
3b2e gantt chart
3b2f communication board
3b2g bag of word
3b2h tagging →3b6b
3b2i data labeling
3b2j scoreboard
3b2k tag vocabulary →4c5
3b2l semiotic network
3b3 boolean variables in feature set
3b4 short notation
3b5 morse code
3b6 case representation
3b6a knowledge representation
3b6b case frame
3b6c case grammar
3b7 Bayesian network
3b8 printer error code
3b8a ASCII code
3b9 telegraphic code book
3b10 information theory
3b11 Semantic memory
3b11a Connectionist
3b12 sender/receiver data communication
4 Natural language
4a meaning
4a1 Translation between iconic and textual language
4a2 user interface for man machine communication
4a3 Question Answering
4a4 Questionnaire
4a5 language games
4a6 grounded communication
4a7 Artificial language
4a8 semiotic cycle
4a9 dialogue
4a10 speaker hearer dialogue →3b12
4a11 Wizard of Oz experiment
4a12 human to robot dialogue
4b word embedding
4b1 Word2vec →3b2g
4b2 semantic indexing
4c minimal dictionary
4c1 grounding transfer
4c2 Structured English
4c3 conceptual spaces
4c4 multimodal translation →4a1
4c5 controlled vocabulary
4c6 dictionary learning
4c7 Thesaurus
5 Artificial Intelligence
5a video annotation
5a1 dataset
5a2 json format
5a3 motion capture
5a4 Therbligs
5a5 kinesiology
5a6 motion graph →3a6
5a7 annotation vocabulary
5a8 data centric AI
5b evaluation function
5b1 cost map
5b2 reward function
5b3 potential field
5b4 grounded reinforcement learning
5b5 heuristic algorithm
5b6 best first search
5b7 heuristic search
5b8 memory based heuristic
5b9 pattern database
5c soccer referee
5c1 event parser
5d Karel the robot →3a9
6 Anchoring
6a text to animation
6a1 Keyframe
6a2 interactive computer animation →5a6
6b text adventure →4c2
6b1 Text adventure parser
6b2 game engine
7 idea
7a sensory projections to categorize objects
7b language to meaning relationship
7c abstraction mechanism
7c1 state space reduction →3b3
7d in opposition
7d1 computationalism
7d2 algorithms
7d3 spatial map
7e programming exercise
5b Inventing a Domain specific language (DSL)
In the past, domain-specific languages were often explained from the implementation perspective. The
programmer was trying to formalize the specification in a BNF grammar and a parser was programmed.
Such a technical perspective prevents understanding what the real purpose of a domain-specific
language is. A DSL is first and foremost not a computer program but a communication device. It allows
communicating back and forth between a speaker and a hearer. So it's part of a dialogue game.
In contrast, the technical implementation including the BNF grammar has to be called trivial. Even
without programming a dedicated parser it's possible to implement a DSL in a computer program. In
the easiest case it's simply a Python class with some methods. Each method is executed from the
outside, so there is no dedicated DSL at all. The harder-to-grasp element of a DSL is its role during
the interactive control of a robot.
Let me give an example. There is a warehouse robot which can do simple pick&place tasks and
there is a human speaker. The speaker sends commands to the robot and receives feedback. For example
the speaker says “moveto B3”, “moveto B7”, “pick object”, “moveto A1” and so on. Possible answers
from the robot might be “Yes”, “obstacleahead”, “Yes”. The language utterances between speaker and
hearer are formulated in a domain-specific language, or to be more precise, it's a shared vocabulary.
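The "Python class with some methods" form of such a DSL could look like the sketch below. The cell labels and the "Yes" replies follow the example above; the attribute names are illustrative.

```python
# A DSL as a plain Python class: each vocabulary word becomes a method,
# so no dedicated parser is needed. Replies mirror the warehouse example.
class WarehouseRobot:
    def __init__(self):
        self.position = "A1"
        self.carrying = False

    def moveto(self, cell):
        self.position = cell
        return "Yes"

    def pick(self):
        self.carrying = True
        return "Yes"

    def place(self):
        self.carrying = False
        return "Yes"

robot = WarehouseRobot()
replies = [robot.moveto("B3"), robot.pick(), robot.moveto("A1"), robot.place()]
```

The method names are the shared vocabulary; the speaker script above is already a robot program, even though no grammar or parser was written.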
So we can say that a DSL plays the role of a communication interface. It enables a speaker and
a hearer to understand each other. A task is never solved by the robot or the DSL alone; the mentioned
warehouse task is solved with interactive communication. The reason why the communication is needed
is complexity reduction. The overall task is split into two subtasks which are realized by a
speaker and a hearer. This server/client model allows solving very complex problems from robotics.
5c Interactive robot control
Artificial Intelligence in the past was mostly treated as an algorithm-centric search in a well-defined
problem space. A typical example is a path finding algorithm or a chess engine, which are both
searching the state space with a highly efficient strategy. Unfortunately this perspective prevents
solving more advanced problems from robotics. To address this issue a different paradigm is needed,
which is interactive control.
In short, interactive control is a grounded communication between man and machine. Natural-language
instructions are used to send commands to the machine. A working example is a classical text
adventure from the early 1980s in which a human operator types in a command and the game engine
executes the command. Such an interaction isn't described as Artificial Intelligence; it is perceived
as a normal video game. The interface in this game can be realized with a mouse, a textual parser, a
keyboard or a joystick. This interface allows a grounded communication.
It's a seldom discussed fact that the implementation of teleoperated control has a huge impact on
the state space. It's possible to use a joystick to control a robot arm, or it's possible to use a vocabulary
list. The goal is to discretize the state space with the help of natural-language words. If the robot can be
controlled with a list of less than 20 words, it's pretty easy to automate the teleoperated control into
fully autonomous control.
Let me give a simple example of working interactive control of a clock. There is a speaker who
gives a command like “Its 2:30 o'clock” and the hearer has to put the little and the big hand into the
correct position. In the case of a videogame, the speaker is the human operator and the hearer is the
game engine. The task for the game engine is to take a clock time as input and render the hands as
output.
The surprising situation is that rendering a clock on a computer display is nothing new, and it doesn't
even belong to Artificial Intelligence in the classical sense; it is a normal programming task. But at the
same time it has a lot to do with AI, because it's a practical example of man-to-machine communication.
Let me explain it the other way around. The main bottleneck in AI is missing man-to-machine
communication. If a machine understands the commands of a human better, it's pretty easy to program
a robot.

Figure 12: Speaker Hearer interaction for clock animation (Speaker: “Its 2:30 o'clock”)

There are many examples available for interactive control like the mentioned clock setting animation,
an action adventure or the interactive animation of a biped robot. In all these cases the human is asked
to enter commands or press a button and this will affect the rendering on the computer screen. The user
interface gets adapted to a certain domain and this is equal to grounding the domain. That means the
communication with the user interface is realized.
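The clock example above can be grounded in a few lines: the utterance is parsed into a time and the time is converted into the two hand angles the game engine has to render. The accepted phrasing is deliberately simplified.

```python
# Ground an utterance like "Its 2:30 o'clock" into the two hand angles
# the game engine has to render. Angles are degrees, clockwise from 12.
def clock_angles(hour, minute):
    hour_angle = (hour % 12) * 30 + minute * 0.5   # hour hand creeps between hours
    minute_angle = minute * 6
    return hour_angle, minute_angle

def parse_time(utterance):
    """Extract 'H:MM' from a simple sentence; only this one pattern is handled."""
    for token in utterance.split():
        if ":" in token:
            h, m = token.split(":")
            return int(h), int(m)
    raise ValueError("no time found")

angles = clock_angles(*parse_time("Its 2:30 o'clock"))
```

The rendering itself remains a normal programming task; the AI-relevant part is only this mapping from a sentence to numbers.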
5c1 Interactive animation with datasets
Since the advent of interactive animation [Licklider1976], lots of attempts have been made to create
computer-generated images. The problem with most of these projects, like SIMBICON and other
parametrized animation systems, is that the underlying software is very complex and it's difficult to
reproduce a result. A promising attempt to overcome the complexity is to postpone the implementation
of such software and focus only on the creation of a dataset which specifies the desired outcome.
A dataset for interactive animation has much in common with a question answering dataset, except
that the domain is related to computer animation. The speaker in the dataset formulates some
requests like “can you draw a circle in the middle”, “please wave the left arm”, “jump”, while the role
of the hearer is to show a certain animated behavior. In the easiest case, the dataset has the same
format as a serious comic book: it consists of motion primitives together with their textual descriptions.
There is no need to convert the dataset into a computer program or even into a neural network;
the dataset itself is the project goal. After creating such a dataset no further algorithm or library is
needed. This allows focusing on the content and capturing the expert knowledge in a long table. It's up
to machine learning experts how to translate a dataset into a computer model.
Let us focus on the dataset itself. The structure contains some fixed assumptions. At first there is
a speaker-hearer role model, so the animation is always generated in a dialogue. The reason is that a
dialogue allows capturing high-level and low-level knowledge at the same time. The speaker knows
about the key elements, like waving the hand or moving a rigid body, while the hearer has detailed
knowledge about how to draw a certain request with vector graphics on the screen. The second fixed
element of the dataset is a multimodal information storage, which means that natural language and
pictures are both available. This allows solving the symbol grounding problem, which is basically a
translation from language into graphics.
In general an interactive animation dataset is a long table which contains columns for a speaker
and a hearer, while the content consists of both words and drawings. Such a dataset encodes all the
knowledge about animating characters.

Figure 13: Dataset for stick figure animation (Speaker commands: Stand, Run, Sit down)

Modern Artificial Intelligence works with a dataset in the background. At first, the domain
knowledge is captured in a long structured table and secondly this table is
converted into a mathematical model. Such kind of workflow makes it easy to bugfix possible errors
because it creates two different projects which are working independent from each other. In case
of computer animation the follow up problem after the dataset was created is how to reproduce the
information automatically. For example, a human user types in a command like “draw a circle in the
middle” and the AI has to execute the task. The reason why its possible for the computer to do so is
because the same request-response entry was given in the dataset, so the computer takes a look into the
correct row and extracts the information. In other words, its a retrieval task in a database.
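The retrieval step can be sketched in a few lines of Python. The table below is a hypothetical miniature dataset; in a real dataset each command would be paired with a drawing or motion primitive rather than a placeholder string.

```python
# Minimal sketch of the retrieval task, assuming a hypothetical
# request-response table. The action strings are placeholders for
# the drawings or motion primitives stored in a real dataset.
dataset = {
    "draw a circle in the middle": "circle(x=50, y=50, r=10)",
    "wave the left arm": "wave(joint='left_arm')",
    "jump": "jump(height=1.0)",
}

def execute(command: str) -> str:
    """Look up the command row in the dataset and return the stored action."""
    return dataset.get(command, "unknown command")

print(execute("jump"))   # -> jump(height=1.0)
print(execute("dance"))  # -> unknown command
```

No search or planning is involved; the system only looks into the correct row, which is exactly what makes the approach fast.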
5c2 Stick figure dataset to animation
The main reason why Artificial Intelligence was complicated in the past is its computer science perspective. The implicit assumption was that Artificial Intelligence and robotics have something to do with mathematics, algorithms and convex optimization. All these topics are interesting, but they can't explain what Artificial Intelligence is about. The more promising approach to develop an AI system is to describe a certain domain from a non computer perspective and then, in a second step, think about how to convert this description into a computer program.
The first step in developing an AI animation system is to create a dataset which contains possible stick figure images together with a textual description. A single stick figure can run, stand, sit or jump, and each entry gets annotated with one to two words. So the overall dataset is a long table written in the MS Word .doc format. This table forms the basis for the AI system; it can be compared to a manually created form which is used in database design as an input to derive the SQL tables.
Even if a MS Word table is the opposite of a machine readable format, it defines very clearly what a domain is about. Suppose there are 10 different stick figures available in the dataset, then it is possible to retrieve one of the entries. The retrieval process is some sort of dialogue game in which one person says a word like “walk” and the other person has to draw the stick figure from the table. Such kind of interaction can be simulated with a computer too, which results in the second step of an AI animation system. Such a system looks different from a robot control system, but at the same time it comes very close to a robot. The software won't need an AI algorithm; it is a simple text to drawing converter. A human user can map the words to a keyboard, and this allows to interact with the system by pressing buttons. After pressing 0 on the keypad, the stick figure will stand up, after pressing 1 it will move, and button 2 will make it jump. So it is some sort of animated character in a video game.
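The button interaction can be sketched as a small text to drawing converter. The ASCII pose strings are an assumption here; they stand in for the actual stick figure drawings in the dataset.

```python
# Sketch of the keyboard-driven animation system. The ASCII strings
# are invented placeholders for the stick figure drawings that the
# dataset would really contain.
poses = {
    "0": ("stand", " o \n/|\\\n/ \\"),
    "1": ("walk",  " o \n/|\\\n/ >"),
    "2": ("jump",  "\\o/\n | \n/ \\"),
}

def press(button: str) -> str:
    """Return the pose name for a pressed button; unknown keys default to stand."""
    name, drawing = poses.get(button, poses["0"])
    print(drawing)  # the system draws the requested stick figure
    return name
```

Pressing "2" prints the jumping figure and returns "jump"; any unmapped key falls back to the standing pose, so the character is always in a defined state.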
From a technical perspective, the dataset with the stick figures is a sprite sheet, while the animation system is the user interface of a game engine. The interesting situation is that no further programming effort is needed; the robot control system is ready to deploy. The logic of such a system is not located in the written software but in the dataset which was defined in step 1 as a MS Word table. The entries in the dataset specify what the robot can do and what not. If the robot animation looks unrealistic, more entries in the dataset are needed.
5c3 Interface for character animation
In an old paper from the year 2000, a simple but effective example for interactive character animation was presented.[Laszlo2000] For the domain of a hopping lamp, called Luxo, the movement was controlled with keystrokes like k, l, o, i, u and so forth. The human operator was invited to press a single button on the keyboard, and this activated a predefined pattern like “medium hop”. This gives the human operator full control over the character. It is an example for high level interaction.
Somebody may ask what the innovation is, because it is simply a mapping from keystrokes to animation scripts. And exactly here is the source of a misunderstanding. This simple mapping is the key element in the symbol grounding problem. Natural language commands are mapped to motion primitives, and this reduces the state space. Instead of figuring out which angle is needed for each joint of the robot, there is only a need to decide for one of 26 letters on the keyboard, and this will activate the movements. It is pretty easy to imagine a longer sequence of these actions which results in a realistic movement.
One important element is still missing from such an animation control system, which is a dataset. A dataset allows to specify the mapping in a more elaborate way. Instead of explaining which key is mapped to a certain behavior, the idea is to list all the movements in a table, and it is up to the programmer to map the movements to textual commands, a menu or a keystroke.
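The separation between the movement table and the key mapping can be illustrated as follows. The motion primitive names are hypothetical and only mimic the Luxo example; the point is that the table exists first and the keystroke assignment is a separate, later decision.

```python
# Step 1: the dataset lists the available movements. It contains no
# information about keys; the primitive names are hypothetical.
movement_table = ["small hop", "medium hop", "large hop", "lean forward"]

# Step 2: the programmer separately maps the table onto an interface,
# here a keystroke mapping in the style of the Luxo controller.
keymap = {key: move for key, move in zip("kloiu", movement_table)}

def on_keystroke(key: str) -> str:
    """Activate the predefined motion primitive behind a single key."""
    return keymap.get(key, "no action")
```

The same `movement_table` could instead be wired to a menu or to textual commands without touching the dataset itself.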
5c4 Early parameterized computer animation
Until the year 2000 it was very popular in computer graphics to use interactive and parameterized animation systems.[Laszlo2000] The idea was to simplify the process of creating moving pictures and to give the human operator sliders and mouse gestures to influence the output. In contrast to a purely mathematical understanding of calculating inverse kinematics and rigid body forces, it was an improvement. At the same time there is a weakness in parameterized animation which has to do with creating new animation systems.
Any parameterized animation system works with a mathematical formula which gets fine-tuned in an interactive manner. The human operator can move a slider, and this will affect the jump height of a character on the screen. But the same equation is useless if the character should run or take an object, so there is a need to increase the abstraction level further. This can be realized with a natural language dialogue. The idea here is that no mathematical background at all is needed; the animation is defined in a dataset which contains language annotations.
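A minimal sketch of such a parameterized formula is shown below. The sinusoidal arc is an assumption for illustration; the `height` and `period` parameters are what the sliders of a parameterized animation system would control. The same equation obviously cannot express a different behavior like grasping an object, which is the weakness discussed above.

```python
import math

def hop_height(t: float, height: float = 1.0, period: float = 1.0) -> float:
    """Vertical position of a hopping character at time t.

    'height' and 'period' are the slider-controlled parameters of the
    (assumed) animation formula; the arc peaks at the middle of a hop.
    """
    phase = (t % period) / period              # normalized time within one hop
    return height * math.sin(math.pi * phase)  # simple arc, zero at lift-off
```

Moving the height slider rescales every hop, but no slider setting will ever make the character run, which is why the abstraction level has to be raised.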
Such a dataset works with a speaker hearer role description. One human gives a command like “jump upwards”, while the other human executes the motion with a puppet. The overall dialogue including the demonstrated motion gets recorded with a computer. The main advantage over former parameterized animation systems is the absence of computer code and mathematical equations; it is a simple table. Such a table gets converted into a model in the second step, which is a completely different problem and can be realized by hand coded software as well as by neural networks.
The more important step is the first one, which defines the motion vocabulary. Instead of figuring out how a computer is useful to animate virtual characters, the idea is to describe only the problem in a semi structured format. The dataset is some sort of game or challenge which contains possible movements and word based annotations.
Let us go into the details of how to create such a dataset from scratch. Suppose there is a puppet show in which one person controls the puppet with a cross brace, while the speaker observes the situation and gives the commands. The speaker may say “move a bit to the left, now wave the hand, let the puppet jump”. The hearer, who is in control of the puppet, has to translate the orders into actions and gives feedback like “sure, nice idea”. Such a language guided dialogue is the core element of a speaker hearer dataset. It generates the discourse space in which action and language take place. During repetition, the same corpus of commands and behavior is shown again. That means it is only allowed to reference existing interactions, but not to invent new ones.
The interesting situation is that after the creation of a speaker hearer dataset, a new sort of AI problem is available. The new problem is how to convert the table into a computer model. The computer has to imitate the roles of the speaker and the hearer both. That means the software has to generate possible commands, and it also has to convert the commands into actions. But such a task can be realized much easier than generating the animation from scratch, because it was defined clearly what the goal is. There is a clear command like “move to left” which has to be converted into a precise movement of the character, which can be either right or wrong. So the process can of course be done by a computer program.
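The two imitation tasks can be sketched against a small speaker hearer table. The recorded dialogue entries below are invented for illustration; the key property is that both roles only reference the corpus and never invent new interactions.

```python
import random

# Recorded speaker hearer dialogue (step 1); the entries are invented
# placeholders for a real table of commands and puppet movements.
table = {
    "move to left": "translate(dx=-1)",
    "wave the hand": "rotate(joint='hand')",
    "jump upwards": "translate(dy=+1)",
}

def speaker() -> str:
    """Imitate the speaker: generate a possible command from the corpus."""
    return random.choice(list(table))

def hearer(command: str) -> str:
    """Imitate the hearer: convert a command into the recorded action."""
    return table.get(command, "no such interaction")  # no new inventions

# Because the goal is defined in the table, the hearer's response can
# be checked as right or wrong by comparing it against the dataset.
cmd = speaker()
assert hearer(cmd) == table[cmd]
```

Judging correctness is a simple table comparison, which is exactly why this task is so much easier than generating animation from scratch.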
References
[Cirik2015] Cirik, Volkan, Louis-Philippe Morency, and Eduard Hovy. "Chess Q&A: Question
answering on chess games." Reasoning, Attention, Memory (RAM) Workshop, Neural Information
Processing Systems. Vol. 7. No. 5. 2015.
[Culberson1998] Culberson, Joseph C., and Jonathan Schaeffer. "Pattern databases." Computational
Intelligence 14.3 (1998): 318-334.
[Edelkamp2002] Edelkamp, Stefan. "Symbolic Pattern Databases in Heuristic Search Planning."
AIPS. 2002.
[Farkhatdinov2008] Farkhatdinov, Ildar, Jee-Hwan Ryu, and Jury Poduraev. "Control strategies
and feedback information in mobile robot teleoperation." IFAC Proceedings Volumes 41.2 (2008):
14681-14686.
[Holte1999] Holte, Robert C., and István T. Hernádvölgyi. "A space-time tradeoff for memory-based heuristics." AAAI/IAAI. 1999.
[Howard2022] Howard, Thomas, et al. "An intelligence architecture for grounded language
communication with field robots." (2022).
[Korf1997] Korf, Richard E. "Finding optimal solutions to Rubik’s Cube using pattern databases."
AAAI/IAAI. 1997.
[Laszlo2000] Laszlo, Joseph, Michiel van de Panne, and Eugene Fiume. "Interactive control for
physically-based animation." Proceedings of the 27th annual conference on Computer graphics and
interactive techniques. 2000.
[Licklider1976] Licklider, Joseph CR. "User-oriented interactive computer graphics." Proceedings
of the ACM/SIGGRAPH Workshop on User-oriented Design of Interactive Graphics Systems. 1976.
[Matuszek2012] Matuszek, Cynthia, et al. Learning to parse natural language to a robot execution
system. Technical Report UW-CSE-12-01-01, University of Washington, 2012.
[Rumelhart1993] Rumelhart, David E., and Peter M. Todd. "Learning and connectionist representations." Attention and performance XIV: Synergies in experimental psychology, artificial intelligence, and cognitive neuroscience 2 (1993): 3-30.
[Samadi2008] Samadi, Mehdi, et al. "Compressing pattern databases with learning." ECAI 2008.
IOS Press, 2008. 495-499.
[Stuart1995] Russell, Stuart, and Peter Norvig. "Artificial intelligence: a modern approach." (1995).
[Tellex2007] Tellex, Stefanie, and Deb Roy. "Grounding language in spatial routines." AAAI
Spring Symposium: Control Mechanisms for Spatial Knowledge Processing in Cognitive/Intelligent
Systems. 2007.