The symbol grounding problem part II
Manuel Rodriguez
Feb 28, 2024*
Abstract
The dominant challenge in robotics is the large state space. Any algorithm that traverses the entire state space becomes too slow, even on supercomputing hardware. The bottleneck can be overcome with guided instructions, which result in a speaker-hearer interaction: the high level commands from the speaker are interpreted by the hearer, which is the robot. Even if the concept sounds a bit uncommon, it makes sense to investigate it in detail because it allows complex robotics problems to be solved with natural language.
After a short introduction to the last AI Winter, which lasted until the year 1992, the dominant part of the following paper is dedicated to the instruction following problem. A simple maze game is taken as an example to implement a working prototype in Python which allows a robot swarm to be controlled with basic commands.
Keywords: AI winter, instruction following, symbol grounding problem, robotics
Contents

1 Inventing the benchmark for AI
  1a Evaluation function as datasets
  1b Heuristics are feature vectors

2 From backtracking algorithms to heuristics
  2a Memory based heuristics
  2b Pattern databases in the late 1990s
  2c Early robots in the 1950s
  2d Artificial Intelligence in the 1990s
    2d1 Developments since the last AI Winter
    2d2 The invisible 5th generation computers
  2e Early teleoperated robots
    2e1 Anti pattern in teleoperation
  2f Textual teleoperation
    2f1 From solving games towards generating games
    2f2 Early adventure games in the 1980s
    2f3 A review of the Castle adventure game
    2f4 Speaker hearer interaction in adventure games of the mid 1980s
    2f5 User interfaces as a grounding mechanism
    2f6 Creating toy problems from scratch
    2f7 A closer look into action adventure games from the past
    2f8 Communication with a game engine
    2f9 Simplified instruction following example with video games
* This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License <http://creativecommons.org/licenses/by-sa/4.0/>.
3 Instruction following
  3a Instruction following for goal formulation
  3b Cost function as instruction following
  3c From line following to instruction following
    3c1 History of instruction following
    3c2 NP hard problems vs instruction following
  3d Instruction following for a robot swarm
  3e The logic is hidden in the GUI menu
  3f The auto mode for instruction following
    3f1 Programming an AI with a communication board
  3g Theoretical reason for instruction following
    3g1 Instruction following as communication paradigm
    3g2 Programming an instruction following robot from scratch
    3g3 Increasing the automation level with natural language
    3g4 An instrument panel for man machine interaction
    3g5 Communication based AI
    3g6 Programming an instruction following robot step by step
    3g7 Robot control with a protocol
    3g8 Object oriented programming
  3h Instruction following with speaker and hearer
  3i Behavior tree based instruction following
    3i1 From commands to behavior trees
    3i2 Behavior tree as dialogue games
    3i3 Dialogue based teleoperation
    3i4 Voice commands and behavior trees
  3j Connectionist model of dialogue

4 Practical demonstration
  4a Creating a cleaning bot with a dialogue
  4b A grammar controlled gripper robot
  4c How to automate a warehouse with robots
    4c1 Command dictionary for a maze robot
5 Robot programming languages
  5a The symbol grounding problem
    5a1 Symbol grounding mindmap
  5b Inventing a Domain specific language (DSL)
  5c Interactive robot control
    5c1 Interactive animation with datasets
    5c2 Stick figure dataset to animation
    5c3 Interface for character animation
    5c4 Early parameterized computer animation

References
1 Inventing the benchmark for AI
A common assumption about Artificial Intelligence is that it has to be realized within a computer. AI, at least according to this claim, is the result of an intelligent piece of software. Of course the software works with an algorithm, so the open question is which sort of algorithm will match human intelligence most closely.
The surprising situation is that such an understanding does not motivate AI research but prevents it. The more sensible goal is to postpone the development of intelligent software in favor of creating an objective scale to measure intelligence. The task for the computer is to pass the test by achieving a high score.
Even if this concept sounds unusual, it can be realized much more easily than programming a classical AI. The idea is that intelligence is realized as a dialogue. One instance provides a problem, while the opponent has to provide the answer to the problem. In the simple case the problem is “3+2” and the correct answer is 5. The reason why such an interactive definition of AI makes sense is that it allows the computer to be programmed towards a certain objective. In the concrete case the problem for the computer is to add 3+2. Even if the computer isn’t equipped with this capability, it is possible to write a small calculator app.
Let me give another example. Suppose the question is about a different operation involving floating point numbers. The problem would be “7.3/2”. This time, the difficulty for the problem solving algorithm is of course greater. The computer has to handle the floating point representation and needs to know how to divide two numbers. The required calculator app is more complex to program. But similar to the first trial, it is possible to write a program which is able to do so. In high level programming languages like Python the feature is built in by default, while in other languages like assembly the programmer has to type in the code manually.
The general idea is to scale up question answering systems from easy to hard questions. Easy questions are, similar to the previously mentioned example, ordinary math problems which can be solved with a calculator. Harder questions are about word matching games and visual Q&A tasks. All these problems have one thing in common: there is always a dialogue visible. One instance formulates a task, while the second instance provides the answer.
Let us take a step backward to understand the situation better. Before it is possible to program an Artificial Intelligence, there is a need to program an automatic referee. The referee judges whether an answer is correct or wrong. In a video game the referee is equal to the scoring unit. For example, in the game of Pong this unit determines whether the ball is out and then adjusts the score for the player. For problems which are provided as a dataset, the score gets determined by comparing the given answer with the correct answer. For example, the correct answer is “3+2=5”, and if the computer provides “6” as the answer it is wrong.
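The automatic referee described above can be sketched in a few lines of Python. This is a minimal illustration only; the dataset entries, the solver and all names are made up for this sketch and are not taken from the paper.

```python
# Minimal sketch of an automatic referee for a Q&A dataset.
# The dataset entries below are illustrative examples only.

qa_dataset = [
    {"question": "3+2", "answer": "5"},
    {"question": "7.3/2", "answer": "3.65"},
    {"question": "Correct spelling: aple, apple, appel", "answer": "apple"},
]

def referee(candidate_answer: str, correct_answer: str) -> bool:
    """Judge a single answer: True if it matches the stored solution."""
    return candidate_answer.strip() == correct_answer.strip()

def score(solver, dataset) -> int:
    """Run a solver over the whole dataset and count correct answers."""
    return sum(referee(solver(item["question"]), item["answer"]) for item in dataset)

# A toy solver that only knows the first question:
toy_solver = lambda q: "5" if q == "3+2" else "?"
print(score(toy_solver, qa_dataset))  # prints 1
```

The intelligence is judged purely by the final score; the referee never inspects how the solver works internally, which matches the point made in the text.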
More advanced computer software like Large Language Models is trained with question answering tasks, and it is also tested with these tasks. So the intelligence is not located in the large language model itself but in the dataset. A Q&A dataset is a method for storing knowledge. This knowledge is used to create intelligent software, and it is also useful to evaluate intelligent software.
In the history of AI research there was an interesting milestone. Since the 1980s, lots of robotics challenges and toy problems have been invented or rediscovered, like the micromouse challenge, the 15 puzzle sliding game and the MNIST OCR dataset. What these problems have in common is that they are not useful for practical applications; their only purpose is to get a better understanding of AI. For example, in the 15 puzzle game the user has to sort the pieces. The overall structure works with the previously mentioned dialogue, which consists of a problem and an answer to the problem. The problem might be to navigate a robot to the goal, and the answer is that the robot does so.
What makes these challenges interesting is the ability to score the situation from an objective perspective. It is possible to determine whether the robot has reached the goal, and the precise number of seconds gets measured. This evaluation allows robots to be compared against each other. The same situation holds for the MNIST OCR dataset, which is an example of image to text translation.
Perhaps it makes sense to give examples of other benchmarks which can be used to judge computer programs. The math example “3+2” was mentioned already. A more complicated challenge would be a question like “sin(pi)”. For language oriented questions, an entry level task would be to find the correct spelling of a word among “aple, apple, appel”; of course only the word in the middle is correct while the others are wrong. The interesting situation is that such questions, including the correct answers, can be collected in a large table, and then a certain computer program can be scored by how many questions were answered correctly. If the computer program is only familiar with math questions it will fail to answer the language problems.
It should be mentioned that the inner workings of such a computer are less important. The computer can use neural networks, handwritten code, an expert system or even a random algorithm which makes elaborate guesses. The only thing that matters is the score at the end.
What makes a Q&A dataset interesting is that the same principle can be adapted to any problem. It is possible to ask simple math questions, include natural language tasks and even encode visual question answering problems. Also, the principle scales well. If a certain Artificial Intelligence isn’t able to answer the dataset, the difficulty can be lowered. This allows determining exactly which questions are a source of failure, and the reason why can be investigated.
The cause why Q&A datasets were practically unknown before the year 2000 is that they work with the opposite assumption of existing tools within computer science. In the past, the self-understanding of computer science was to solve problems. An algorithm is a computer recipe to solve a problem, while a processor has to do something. In contrast, a Q&A dataset does nothing but formulate a problem.
A common paradigm until the year 2000 when talking about artificial intelligence was to imagine how to solve AI problems. The idea was to build a computer and write a program which is able to do so. Such attempts failed because it was unclear what the problem was about. There was only a vague definition of what intelligence is, and it is normal that no algorithm in the world is able to solve a vague problem. The advantage of a Q&A dataset is that it inverts the situation. The focus is on the problem definition, while the ability to solve it gets delayed. The new paradigm is that a question is created now, like “what color has the ball in the image?”, while the answer to the problem is left open. The assumption is that computers in 30 years will be able to answer this question.
1a Evaluation function as datasets
The need for an evaluation function was recognized early in computer chess. Nearly all chess engines programmed after the 1970s use a simple or a more advanced equation to judge the board situation. Combined with the move generator, the evaluation function helps to determine the optimal move.
The new insight is that an evaluation function can be realized in a data driven format. A typical example is to formulate a series of chess puzzles as a Q&A dataset, and the computer has to answer the problems. [Cirik2015] Typical questions are:
1. How many pieces are on the board?
2. Is b2b3 a legal move?
3. What is the material advantage of black?
Most of these questions are easy to answer, and it is likely that existing chess engines have a built in module which provides the correct answer. But the innovation, which can also be called a small revolution, is that the questions are formulated in an explicit way within a dataset, outside of any computer program. There is no need to write the computer code for an evaluation function; the more important step is to invent the chess puzzles and format them as question answer pairs.
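The first of the listed questions can be answered directly from a board description. As a hedged sketch, the snippet below counts the pieces in the board part of a FEN string (the standard chess position notation); the helper name and the stored Q&A pair are illustrative choices, not taken from the paper.

```python
# Sketch: answering one chess Q&A question type ("How many pieces are
# on the board?") from a FEN string. The position used is the standard
# chess starting position; the dictionary layout is illustrative.

def piece_count(fen: str) -> int:
    """Count the pieces in the board field of a FEN string."""
    board_part = fen.split()[0]          # first FEN field is the piece placement
    return sum(1 for ch in board_part if ch.isalpha())

qa_pairs = [
    {"question": "How many pieces are on the board?",
     "position": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
     "answer": 32},
]

for item in qa_pairs:
    assert piece_count(item["position"]) == item["answer"]
print(piece_count(qa_pairs[0]["position"]))  # prints 32
```

The point is that the knowledge lives in the question/answer pair; the checking code is a small, replaceable module on the side.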
There is a similarity between a Q&A dataset and a classical evaluation function. Both are used to judge the current chess position. The board is taken as input, and then a score or a judgment is provided as output. This score helps to guide the search process in the game tree. That is why it is called a heuristic evaluation function. The term heuristic refers to domain knowledge taken from human experts.
It should be mentioned that the problem is not very new. Even in the 1980s there was a debate about how to convert expert knowledge into computer programs. The understanding in the past was that expert knowledge should be encoded in a rule base, or in computer software. Such an understanding was not very powerful. The better idea is to encode expert knowledge in questions. Each question in a Q&A dataset defines a new problem, and it is up to the solver, which might be a human or a computer, to answer the question.
1b Heuristics are feature vectors
From a technical perspective it is possible to store a heuristic in a feature vector. In the simplest case the vector consists of only boolean values and stores information about recognized motion capture events. The harder task is to give a reason why such an understanding was not available before the year 2000, especially not the addition that the encoding in the feature vector represents natural language information.
But let us make a practical example. Suppose there is a robot in a maze, and the current state is encoded in a vector with:

[direction, speed, xpos, ypos, has_object, energy]

If the robot updates its position, the feature vector gets filled with new values too. So the feature vector mirrors the current game state in an abstraction layer. This concept sounds logical and allows advanced planners to be constructed on top of this information, which is known as heuristic search.
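The six-feature state vector above can be written down as a small Python struct. The field names follow the example in the text; the concrete values and comments are illustrative assumptions.

```python
# Sketch of the six-feature robot state vector as a dataclass.
# Field names follow the example in the text; values are made up.
from dataclasses import dataclass, astuple

@dataclass
class RobotState:
    direction: float   # heading in degrees
    speed: float       # current speed
    xpos: int          # x position in the maze
    ypos: int          # y position in the maze
    has_object: bool   # whether the robot carries an object
    energy: float      # remaining energy

state = RobotState(direction=90.0, speed=1.0, xpos=3, ypos=4,
                   has_object=False, energy=0.8)

# The abstraction layer mirrors the game state as a plain vector:
print(astuple(state))  # prints (90.0, 1.0, 3, 4, False, 0.8)
```

Each cell carries a meaning given by its feature name, which is exactly the encoding system the text argues was the missing precondition for a heuristic.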
The surprising insight is that such a feature vector was practically unknown in the robotics of the past. Not because of technical limitations. It is certain that computers before the year 2000 were more than capable of storing a six-cell array, but because of a philosophical mismatch. The first problem was that in the past the advantages of heuristics were not seen, and secondly it was unknown that the precondition for a heuristic is an encoding system. Encoding means that the information about the direction of the robot is stored in the first cell, the speed in the second cell, and so on. Each cell of the array has a meaning which is provided by its feature name and can be fully described in a sentence. It is some sort of struct variable used to capture outside knowledge.
Let me explain the philosophical mismatch the other way around. Suppose in the robot simulation there is no feature vector available, and we ask a programmer from the mid 1990s whether the robot project is missing some important elements. It is certain that the programmer won’t recognize the missing feature vector. He would perhaps say that the AI is missing, but he would also say that it is unclear how to program such an AI in software.
The main problem with feature vectors, and with the more general theory, the symbol grounding problem, is that from a technical perspective they are pretty easy to realize, but at the same time it is very hard to explain the advantages. The cause is that a feature vector is not needed in standard computer software. Software programs are created to fulfill the needs of a computer. They make sure that a certain algorithm runs on a machine. In contrast, a feature vector has its origin in a domain specific problem. It encodes a sentence like “direction of the robot in degrees” as a numerical value. Such information is highly important for the task description but is ignored by a computer program.
The interesting situation is that even chess engines which are equipped with an evaluation function have no dedicated feature vector. Even though the concept is strongly related to evaluating a chess board, most chess engine documentation doesn’t even mention the term. Instead, the evaluation function is realized in some other way. The common approach is to describe the features directly in a programming language like C or Java. The idea is that the source code is what matters and the encoding of possible chess puzzles is less important. This results in a certain understanding of what a chess engine is. The programmer assumes that the chess engine is a collection of source code lines formulated in a certain programming language. This source code gets compiled into binary code, and the main objective is that no error occurs during this translation step.
In contrast, the perspective from a feature vector is the opposite. Here, the written source code gets ignored and the only important facts are the single features which encode chess knowledge in an array. The computer has to store the array inside its memory and has to update the information in realtime.
A feature vector encoding is an example of a memory based architecture, while algorithm centric programming is based on the Von Neumann CPU model. Only the latter was accepted until the year 2000 in computer science as a valid model to discuss problems. A typical question in this period was about the runtime of an algorithm, which might be NP hard. Also it was asked how to program the Von Neumann machine in a way that a certain algorithm gets executed. In contrast, it was ignored what is stored in the memory. The untold assumption was that robot problems are formulated as a graph, so the datastructure in the memory will become a linked list. The more interesting problem until the year 2000 was what to do with this graph, because this question is strongly related to programming the computer with an algorithm.
To emphasize the difference between both philosophies, let us describe what is important for a feature vector. A feature vector is an array with cells. Each cell represents something about the problem. The cells are stored in the memory of the computer, which is mostly the RAM. In contrast, other elements of a computer like the CPU are ignored. There is no need to do something with the values; they are simply stored in the memory and that is all. What gets discussed instead is which values are in the cells and, very importantly, which features are needed for a certain problem. For example, “are 6 features enough, or should the vector store at least 12 features to get a better problem understanding?”
2 From backtracking algorithms to heuristics
Artificial Intelligence until the year 2000 can be roughly summarized as algorithms for uninformed search. The assumption was that a robot has to do an exhaustive search in the state space and this would allow it to solve any problem. Possible problems with this approach were known, and the understanding was that more efficient search algorithms and especially faster computers working with optical and parallel components would become available in the future.
It is obvious why the understanding in the past was dominated by backtracking related search algorithms: the notable examples of artificial intelligence which were able to plan a path in a maze, and especially chess engines, worked with this single principle. In all these cases the computer traverses a graph in a certain sequence, and this allows the optimal actions to be planned. The possible improvement to this approach (a heuristic) was known in theory but never applied to real problems.
The transition from past Artificial Intelligence, which was common until the year 2000, towards more recent AI approaches can be described as a stronger focus on heuristics and heuristic search. A heuristic allows the efficiency of a solver to be improved dramatically. In contrast, the existing backtracking and depth first search algorithms are only a backend. Even with a better implementation directly in C source code and with faster computer hardware, they won’t scale up well enough. The problem with AI related problems like motion planning for a robot arm is that a simple improvement by a factor of 2x or 10x is not enough to solve the problem.
Let me give an example. Suppose somebody implements a chess engine in Turbo Pascal on a 286 Intel MS DOS computer in the 1980s. The algorithm traverses the game tree with a horizon of 5 moves into the future. With a classical understanding of computing, the algorithm can be improved drastically. At first a different programming language is used, for example C, which is around 4x faster than Turbo Pascal. The next improvement is to use a more recent computer architecture. The Pentium I is around 150x faster than a 286 CPU. Overall the improvement would be 4x150=600x, which sounds like a lot. But for winning the game of chess a much higher improvement is needed. Even if the algorithm is executed on a supercomputer with a huge electricity consumption, the performance is too slow.
The only sensible tool is a heuristic. And exactly this part of Artificial Intelligence has been investigated in detail since the year 2000. It allows AI problems to be treated as informed search problems. The term informed refers to any sort of domain knowledge which goes beyond simple trial and error backtracking search.
Or let me formulate the situation differently. The reason why robotics after the year 2000 is so much more powerful than former attempts is that modern robotics uses heuristics to different degrees. In the minimal case, the robot has a built in heuristic in the format of an evaluation function. There is a unit in the source code which determines a score for the current game node. In the case of a chess engine, the evaluation function determines the winning probability, while in a motion planning algorithm, the evaluation function calculates the badness of a certain trajectory with respect to obstacle avoidance. More advanced approaches do not work with a hard coded evaluation function; instead the domain knowledge is encoded in an input dataset. There are examples which are converted into natural language, and the detected events are converted into a score. The score can be utilized for the search process.
So we can say that the only difference between past robotics and modern robotics is the heuristics. The existence or absence of domain specific knowledge is the single cause why modern robotics is more powerful.
Even before the year 2000, some papers were published about the need for heuristic planning. So the subject was recognized early as a bottleneck in Artificial Intelligence. The main problem in the past was that it was unclear how exactly human knowledge should be converted into heuristics. The only things known in the 1990s were that heuristic search would give a major improvement, and how to implement an algorithm like A* in software.
It should be mentioned that without a dedicated heuristic nearly all AI related problems, like robotics and game playing, remain unsolved. The state space of these domains is too large for uninformed search algorithms. The only thing that is possible is to prove that a certain problem is NP hard, for example the game of Lemmings. This understanding is equal to the confession that no computer in the world can play the game by itself.
There is a reason why it was hard or even impossible to create heuristics for certain problems: this task falls outside the scope of computer science. The sad situation is that a well formulated mathematical problem doesn’t contain any heuristic knowledge. Even if games like Lemmings, chess or Tetris are formulated as mathematical equations, there is no such thing as domain knowledge which can be converted into programs. One explanation might be that mathematics is about reducing the amount of detail, but these details are important for describing the problem.
To grasp the development which resulted in a better understanding of heuristics, we have to describe the situation differently. The focus is not on how to solve problems with a computer; the more interesting question is how to describe the problems. There are two important problem classes which can be utilized to create heuristics: first, motion capture problems, and secondly toy problems like the 15 puzzle. Both are strongly related to heuristics. In the case of mocap recording, a dominant problem is to determine possible binary features like the position of a marker or the collision of two markers. If two markers collide, the binary feature is true, otherwise it is false. This boolean feature is the natural heuristic to describe the problem.
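The marker-collision feature just described can be sketched as a distance test. This is a hedged illustration: the coordinates, the threshold value and the function name are made-up example values, not taken from any mocap system.

```python
# Sketch: deriving a boolean feature from motion capture markers.
# Two markers "collide" when their 3D distance falls below a
# threshold; coordinates and threshold are illustrative values.
import math

def markers_collide(a, b, threshold=0.05):
    """Boolean feature: True if two 3D markers are closer than threshold."""
    return math.dist(a, b) < threshold

left_hand  = (0.50, 1.20, 0.30)
right_hand = (0.52, 1.21, 0.30)
print(markers_collide(left_hand, right_hand))  # prints True
```

The resulting True/False value is exactly the kind of boolean cell that fills the feature vector discussed earlier.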
In the other example, the 15 puzzle toy problem, the heuristic is created with an evaluation function which is the result of a feature vector. The features are: distance to goal, number of cells cleared, position of the empty tile. These features are used to determine the score. In general, a heuristic has to do with identifying a feature vector which gets converted into a numerical score. Such an understanding was not available before the year 2000.
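A feature-based evaluation of this kind can be sketched for the 15 puzzle as follows. The features loosely follow the ones named in the text (distance to goal, tiles already placed, position of the empty cell); the weighting in `evaluate` is an illustrative assumption, not a claim about any particular solver.

```python
# Sketch: a feature-based evaluation for the 15 puzzle.
# The board is a tuple of 16 values, 0 marking the empty cell.

def manhattan_distance(board):
    """Sum of Manhattan distances of all tiles to their goal cells."""
    total = 0
    for idx, tile in enumerate(board):
        if tile == 0:
            continue
        goal = tile - 1                      # tile t belongs at index t-1
        total += abs(idx // 4 - goal // 4) + abs(idx % 4 - goal % 4)
    return total

def features(board):
    """Feature vector: [distance to goal, tiles placed, empty position]."""
    placed = sum(1 for idx, t in enumerate(board) if t != 0 and t == idx + 1)
    return [manhattan_distance(board), placed, board.index(0)]

def evaluate(board):
    """Combine the feature vector into one score (illustrative weights)."""
    dist, placed, empty_pos = features(board)
    return dist - placed                     # lower is better

goal = tuple(list(range(1, 16)) + [0])
print(evaluate(goal))  # prints -15
```

The point of the sketch is the pipeline: board state, then feature vector, then a single numerical score which can guide a heuristic search.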
In contrast, there was an early paper published in 1993 which recognized the importance of a feature vector for encoding a heuristic. [Rumelhart1993] describes the problem solving capabilities of a connectionist neural network. Such a network takes a vector as its input layer.
What we can say for sure is that after the year 2000 the advantages of feature vectors and other heuristics were investigated in depth by an endless amount of papers. Even the symbol grounding problem, which is a theory to explain how to create heuristics in general, was widely discussed after the year 2000.
2a Memory based heuristics
The transition from former backtracking based algorithms towards more advanced memory based abstraction mechanisms can be traced back in the literature to the year 2000. There is a paper published in 1999 which discusses, from a very theoretical perspective, the advantages of a state space abstraction mechanism realized with a pattern database. The promise is to make existing search based planners for the 15 puzzle game more efficient. [Holte1999] And indeed the promise can be realized: memory based heuristics provide a huge improvement.
The only problem with these early attempts is that the explanation is a bit cumbersome. Especially for non experts in the domain of planning problems, it remains unclear what exactly the task of a pattern database is. But this problem gets solved automatically because many more papers have been published since the year 2000 about the same subject which describe the situation more precisely. From an objective perspective there are two sorts of search algorithms: memory based and CPU based.
It was uncommon in the late 1990s to use terms like “symbol grounding problem” and “feature vector” to describe memory based search. Instead the principle was known as a “pattern database”, which allows storing heuristics for dedicated planning problems:

  Name           Principle
  CPU based      uninformed algorithm
  Memory based   heuristics, feature vector

  Table 1: Search algorithms

  “Pattern databases (PDBs) are dictionaries of heuristic values that have been originally applied to the Fifteen Puzzle” [Edelkamp2002]
In the late 1990s the subject of heuristic search was very new, and possible improvements like natural language tags were not invented yet. So it is normal that the definition of what a pattern database is was a bit theoretical:

  “A pattern is the partial specification of a permutation (or state). That is, the tiles occupying certain locations are unspecified.” [Culberson1998]
The definition was closely related to the application of planning for the 15 puzzle game. An example of a pattern would be to sort only the first row and then label it as “firstrowcleared”. It is unclear whether the patterns were labeled with natural language in the late 1990s, but presumably not. Nevertheless, the paper recognized the advantages of state space abstraction and gives an introduction to the subject.
It should be mentioned that pattern databases for planning problems in the 1990s were equal to state of the art high performance supercomputing. Experiments were realized on the fastest supercomputers of that time:

  “The programs were written in C and run on a BBN TC2000 [...] The TC2000 has 128 processors and 1 GB of RAM” [Culberson1998], page 11
Even from today’s perspective (the year 2024) such a computer is a super fast parallel cluster. From a technical perspective a pattern database can be realized on a much smaller home computer, for example a 286 MS DOS PC from the mid 1980s. But before someone can write the code, the subject has to be explored from a theoretical perspective, and this knowledge was missing in the 1980s.
2b Pattern databases in the late 1990s
The transition from uninformed search algorithms to more efficient memory based search can be traced back to the late 1990s. Instead of describing the situation with the more recent term “symbol grounding problem”, the widely used term was “pattern database”. In the published papers it was clearly recognized what the steps towards heuristic search are. At first, an abstraction mechanism is needed:
quote “The first step is to translate the physical puzzle into a symbolic problem space to be manipulated by a computer.” [Korf1997] page 1
In contrast to today’s attempt at encoding the heuristic in a feature set, the understanding of the late 1990s was dominated by an algorithmic perspective. The idea in the past was to handle everything as a for loop in a computer program:
quote “for reasons of efficiency, heuristic functions are commonly precomputed and stored in memory.” [Korf1997] page 3
Translated into today’s terms, the idea was to loop through the state space and create the pattern database at runtime, instead of the more static understanding as a feature vector. A possible guess why this philosophy was preferred is that the objective of the paper was to solve the 15-puzzle game. This objective was similar to the former uninformed search algorithms, which also have a strong focus on solving problems. A pattern database was interpreted only as a faster technique to search the state space. It was an add-on to the main program but not the center of attention.
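The described “loop through the state space” philosophy can be made concrete with a small sketch: a breadth-first search outward from the goal precomputes a table of heuristic values, and each concrete state is collapsed into a pattern. The tiny 2x3 puzzle and the tracked tile group are simplifications chosen here for readability, not the setup of [Korf1997]:

```python
from collections import deque

# Tiny 2x3 sliding puzzle; board positions 0..5 are laid out as
#   0 1 2
#   3 4 5
# A state is a tuple with tiles 1..5 and 0 for the blank.
NEIGHBORS = {0: (1, 3), 1: (0, 2, 4), 2: (1, 5),
             3: (0, 4), 4: (1, 3, 5), 5: (2, 4)}

def moves(state):
    """All states reachable by sliding one tile into the blank."""
    blank = state.index(0)
    for pos in NEIGHBORS[blank]:
        s = list(state)
        s[blank], s[pos] = s[pos], s[blank]
        yield tuple(s)

def build_pdb(goal, pattern_tiles):
    """BFS from the goal; store the minimum distance per pattern."""
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        state = queue.popleft()
        for nxt in moves(state):
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    pdb = {}
    for state, d in dist.items():
        key = tuple(t if t in pattern_tiles else "*" for t in state)
        pdb[key] = min(pdb.get(key, d), d)
    return pdb

goal = (1, 2, 3, 4, 5, 0)
pdb = build_pdb(goal, {1, 2})           # track only tiles 1 and 2
print(len(pdb))                          # number of stored patterns
print(pdb[(1, 2, "*", "*", "*", "*")])   # solved pattern has heuristic 0
```

Because sliding moves are reversible, a forward BFS from the goal yields the same distances as a backward search, and taking the minimum over all states that share a pattern keeps the heuristic admissible.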
The interesting situation is that only a few years later, in the year 2008, the focus was strictly set on the pattern itself and not on the algorithm for creating it. Also, the relationship between state space, feature vector and pattern database was clearly recognized:
quote “The state space was partitioned based on feature vector” [Samadi2008] page 1
In the concrete paper [Samadi2008] the idea was to store the feature vector in a neural network to solve the 15-puzzle problem. Such an attempt matches closely with the current Artificial Intelligence understanding, which has a strong focus on feature vectors for storing problem heuristics.
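What such a feature vector for a sliding puzzle might look like is sketched below. The two chosen features (misplaced tile count and total Manhattan distance) are standard textbook choices, not necessarily the ones used in [Samadi2008]:

```python
def features(state, goal, width=4):
    """Turn a sliding-puzzle state into a small feature vector."""
    # Feature 1: number of tiles not on their goal position (blank ignored).
    misplaced = sum(1 for t, g in zip(state, goal) if t != g and t != 0)
    # Feature 2: sum of Manhattan distances of all tiles to their goal cells.
    manhattan = 0
    for pos, tile in enumerate(state):
        if tile == 0:
            continue
        goal_pos = goal.index(tile)
        manhattan += (abs(pos // width - goal_pos // width)
                      + abs(pos % width - goal_pos % width))
    return [misplaced, manhattan]

goal = tuple(range(1, 16)) + (0,)
start = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 15, 14, 0)
print(features(start, goal))  # [2, 2]
```

A regression model such as a small neural network could then be trained to map such vectors to heuristic values, which is the core idea of storing the heuristic in a feature representation.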
2c Early robots in the 1950s
Before the advent of the computer revolution in the 1980s there was an interesting decade which resulted in many electronic projects around robotics. In the 1950s household robots were created by artists, and they were mostly working by remote control. The human operator stands behind the robot and moves the base back and forth, and the operator has a joystick to control the arm of the robot. After pressing a button, the robot plays back a short speech sample stored on a vinyl record, to make the audience think that the robot is real. Of course, the robot was only a theater presentation and not a scientific breakthrough.
Nevertheless, it makes sense to investigate the principle from a technical perspective, and it is even possible to reproduce the behavior. Everything that is needed is a lifelike humanoid robot controlled by servo motors, a human operator in the background and some wires to connect both. The overall project is an example of remote manipulation, or in other words, it is an interactive game. The human operator has an input device which moves the robot’s arm and base.
A more experienced observer will notice what the difference is to a real robot. Industrial robots usually do not work interactively with a human operator but are controlled by a computer program in an autonomous fashion. This allows running the robot without a human operator, which is the main difference between a theater robot and a scientific project. But what if such an assumption is wrong? What if so-called interactive robots are the real robots?
The principle of an interactive robot is the same as in a video game. The computer provides the environment and the human user has to take decisions in the game. This is called a simulation. The computer is in charge of transmitting the joystick movements of the human into actions inside the simulation. For the sake of simplification we can assume that early robots in the 1950s work like a modern video game. This description allows a better understanding of which component is missing. It is about the difference between a normal video game and an AI controlled video game.
Let me give an example. Suppose a Box2D simulation is written which allows a human to control a simulated robot arm. Such a game has much in common with early robots in the 1950s. The only difference is that in the 1950s the robot arm was a physical object in reality, while the Box2D robot arm is only available on the computer screen. The challenge is to automate the movements of the arm. The arm should grasp an object without the input signal of a human operator.
2d Artificial Intelligence in the 1990s
In the 1990s the last AI Winter was visible. The perception of the subject was negative in the public and even among experts in the field. Despite the existence of powerful 8-bit microcomputers and more powerful 32-bit personal computers, it was not realistic to program robots and do useful tasks with them. Even the more modest attempt at creating research-only robots, which should demonstrate a basic function like object grasping or walking within the safe space of a university laboratory, failed.
What was unknown in the 1990s was the future development, especially the 2000s with the advent of deep learning. The general outlook in the 1990s was pessimistic. The idea was that robotics in the 1990s had failed and would show the same outcome 20 years into the future. Nevertheless, some books and papers were published about the subject during the sad 1990s, mostly from a philosophical perspective. A common question was how to convert the human brain into a computer brain.
If the 1990s were equal to an AI Winter, it should be important to know how exactly a more positive environment was created which resulted in today’s euphoric perception of robotics in the public and among AI programmers. The interesting situation is that the AI Winter took place only on the surface; in parallel there was also a lot of research, but with very little visibility. Let me give some examples. The teaching language “Karel the Robot” was invented in the 1980s and it was available during the 1990s. But it was nearly unknown, or it was ignored. From today’s perspective the robot language provides a hands-on mentality and is a great choice for explaining to newbies how to program robots. Another example are motion capture projects from the 1980s and 1990s. These projects were started with the attempt to create realistic computer animation. They can be seen as the forerunner of modern biped robotics, which works with a similar principle. And last but not least, many neural network experiments were realized in the 1990s, so we can say that from a technical perspective this decade was highly productive.
One possible reason why these projects were ignored during the 1990s might be the absence of the Internet. During the 1990s most of the research literature was published in printed format only. Specialist proceedings presenting the research outcomes were distributed in a small number of copies. Even well informed book authors of this time were not aware of the projects. This resulted in a superficial perception of reality. Only papers which are available can be referenced in a book, and access to such literature was not available.
The main problem in Artificial Intelligence is that it is hard to define the boundaries of the subject. Existing university disciplines have a precise discourse space which can be captured by a handful of important books. In contrast, Artificial Intelligence references many subjects at the same time, like mathematics, psychology, neurobiology, linguistics and computer science. This makes it hard to provide the literature in a single bookshelf. A closer look into a modern AI related paper published at arxiv will show that the paper is some sort of overview paper about progress in many subjects from the humanities, arts and mathematics.
Perhaps a concrete example can demonstrate the situation. Suppose the idea is to write a paper about a state-of-the-art robot system. For doing so, lots of subjects need to be known, like motion capture for recording human motion, computer science for programming the software, and natural language processing for annotating the dataset. Unfortunately, these subjects are taught at the university in different departments. NLP is located within language studies, motion capture is researched as art, while programming is the interest of computer scientists. During the 1990s it was nearly impossible to combine these efforts. It was simply too complicated to get books from these subjects, because no library in the world is large enough to collect this amount of information. Even libraries at larger technical universities can’t provide the needed information, because subjects like the history of filmmaking and grammar based parsing of languages aren’t taught in a mathematics or computer class.
2d1 Developments since the last AI Winter
A decline in AI research is frequently called an AI Winter, which includes failed AI projects and a missing theoretical understanding of what AI is about. In short, an AI Winter means that the researchers have no idea how to program robots, and more importantly, they don’t even know what is wrong with the software and algorithms they are creating.
The last AI Winter lasted from 1987 to 1992. After this period, AI research has shown many successes. An unsolved question is what exactly ended the last AI Winter and resulted in AI progress. To discuss this problem in detail, we should first explain the situation from 1987 to 1992 and make clear why exactly the robots from this time were not very advanced.
In the late 1980s, a lot of advanced computer hardware and software had already been invented. There was a 32-bit microprocessor available (Intel 386) with a high amount of RAM, modern operating systems including UNIX and GUI windowing systems existed, and the Internet was in an early stage. Even beginner friendly programming languages like Turbo Pascal and video games were very popular in the mainstream market. At the same time, it was unclear how to use this technology to build robots. The dominant reason was that robot related problems including path planning were recognized as NP-hard, and no algorithm was available to address these problems.
Let me explain the situation from a technical perspective. In the late 1980s it was common knowledge how to print out something with a printer driver, it was common knowledge how to program a 2D platformer game, and it was possible to create a database like dBASE for handling tables. So computer technology demonstrated that it can be used for many purposes. But it was completely unclear how to program robots so that they can grasp objects or walk on two legs. Such technology was only available in science fiction literature. The problem of missing robotics algorithms was known in the late 1980s, and because of this situation the period was called an AI Winter.
The interesting and seldom explained question is what exactly made the AI Winter disappear. After the year 1992, and especially after the year 2000, the robotics and AI situation was more relaxed. Many new projects were started, most of them with success. It should be important to know what exactly was different than before.
The core problem in AI is complexity, which is equal to a large state space. Even a fast 32-bit CPU with 50 MHz has a limited amount of resources. So the question is which sort of hardware or software fits robotics well. In the 1980s a typical approach was to build dedicated computer chips and invent new programming languages like Prolog with the goal of addressing this issue. All these projects failed too. It seems that even with a parallel processor and a 5th generation programming language, it is impossible to program robots.
If we want to explain in a single sentence what started the AI revolution since 1992, we can focus on robotics programming challenges. Until the year 1992 it was unknown how important puzzle problems like the 15-puzzle, chess, or the line following problem are. Even if the puzzles were known, they weren’t recognized as relevant for AI research. In contrast to an algorithm, a puzzle problem defines not the answer but the question. It is up to a programmer to find a solution for the puzzle. There are many indicators that the shift from solution oriented algorithm thinking to puzzle oriented problem thinking has helped to overcome the last AI Winter.
Perhaps it makes sense to give a concrete example. A typical approach in handling AI problems after the year 1992 would be the following. There is a toy problem available, like the 15-puzzle. A human plays the game, and the current game state including the player’s decisions is dumped into a game log. The game log is the dataset which gets fed into a neural network.
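The logging step of this pipeline can be sketched as follows. The column names, the CSV layout and the example rows are hypothetical choices for illustration:

```python
import csv
import io

# Each row of the game log stores the state before a move and the player's
# decision; the rows below are invented example data for a 15-puzzle session.
log = [
    {"step": 0, "state": "1 2 3 4 5 6 7 8 9 10 11 12 13 15 14 0", "action": "left"},
    {"step": 1, "state": "1 2 3 4 5 6 7 8 9 10 11 12 13 15 0 14", "action": "left"},
]

# Write the log in CSV format (an in-memory buffer stands in for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["step", "state", "action"])
writer.writeheader()
writer.writerows(log)

print(buffer.getvalue())
```

The resulting CSV text is the dataset; what to do with it (e.g. training a neural network) can be postponed to a later project.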
Such a pipeline was too advanced for the last AI Winter from 1987 to 1992. It was beyond the horizon during this time. Even the first sentence wouldn’t have been understood by researchers in the past. They wouldn’t see what the advantage is of researching a toy puzzle and what the purpose is of creating a game log of human actions. The reason is that during the last AI Winter the focus was on solving problems with algorithms, which is the opposite of the described workflow.
The best practice method for tackling AI domains is problem centric, dataset centric and focused on human interaction. These elements combined explain the success of modern AI research. From a technical perspective it is trivial to create game logs and research toy problems like the 15-puzzle. A dataset is mostly a CSV file, and the 15-puzzle was known for sure in the late 1980s. What was unknown was why these tools are needed in AI research.
Let us go back to the hard problems during the last AI Winter. The typical problem was how to search a large state space. The assumption was that a robot can move in many directions and the computer program has to traverse the game tree to find an answer. Even for simple games like chess the game tree is very large, and more advanced robotics problems generate a much larger state space. During the last AI Winter, the AI researchers were not able to search such a state space. What was available during this period was the problem itself, including a precise understanding of the size of the state space. It can be calculated precisely how many possible game states are available in chess and other problems, which are millions of millions. Then it can be calculated how long a typical 32-bit CPU will need to traverse this space. The conclusion was that even a supercomputer is not able to play such games, and this stopped AI research entirely.
In other words, in the late 1980s a certain problem was discovered as the core problem in AI, namely how to search a large state space. This problem was unsolvable, and therefore an AI Winter was the result.
In contrast, the situation after the year 1992 has changed in the sense that now the main problem is described differently. The researchers have learned that they can’t solve large state space problems. They don’t even try out a depth-first search algorithm on a robotics problem, because they know that the algorithm will run 100 years and longer to figure out a simple path planning problem. What the AI researchers prefer instead are modified problems, datasets, and heuristics. All these approaches work with a low amount of CPU resources and have a higher chance of success. Let me go into the details. Suppose the idea is to create a game log for the 15-puzzle game. It is likely that a programmer is able to program such an algorithm. It won’t need much CPU time or other resources. The unsolved problem is what to do with the game log in the CSV format. But this problem can be postponed. The next AI researcher will perhaps create a dataset with a motion capture recording device, and a third researcher will try to make sense of these data with a neural network.
Such a distributed pipeline has become popular with the advent of deep learning after the year 2000. Since then, endless datasets and lots of projects to feed the data into neural networks have been started. The main difference to AI projects until 1992 is that dataset oriented AI projects have a great success probability. Instead of building entire robots, the projects are mostly about preliminary steps towards this goal. The sub-problem of creating a dataset for an existing game like Tetris or chess is so easy to master that even programming newbies in their first semester at the university are able to program such software with success. This is maybe the biggest difference to AI projects during the last AI Winter.
Let me give an example. A typical robot project in the late 1980s would be to create a software which controls a robot. The chance of failure for such a goal is very high. The robot won’t do anything. In contrast, the self selected project goals after the year 1992 are much smaller. A typical project is about recording the position of a teleoperated robot and converting the file into an MS Excel table. The result of such a newbie friendly AI project is a simple table with 2 columns for the x and y position. The idea is that such a table may help in the future to program robots. And the assumption is right.
In other words, post AI Winter projects tackle the AI problem by dividing it into smaller chunks. Possible strategies are:
1. Dataset creation
2. Speaker hearer interaction
3. Teleoperated robots
4. Heuristics
These four elements alone explain the success of AI research since 1992. Most of the projects focus on these attempts at building intelligent machines. Especially the first approach (dataset creation) has been identified as a Swiss army knife for any sort of AI problem. No matter if the robot is a logistics robot, a flying drone or a grasping robot – a modern research project always works with a dataset. Either a new dataset is created, an existing dataset is selected, or a past dataset gets updated.
The last AI Winter from 1987 to 1992 is often referenced as the 5th computer generation. To make the point clear, we have to explain first what the other 4 generations are. A computer generation is about a certain mixture of hardware and software. For example, 4th generation computers are equal to current computer science, which is about object oriented programming in Java plus modern multiprocessor CPUs. The idea of so-called fifth generation computers was to invent more advanced systems which go beyond the capabilities of 4th generation computers. This attempt has to be called a failure. Even the latest computers from 2023 are mostly 4th generation computers. The user sees a GUI menu and has the ability to run web browsers, databases and video games. Roughly spoken, computer development stopped in the late 1980s, and since then the researchers have been trying to build robots, which is more complex than previous computer generations.
The difference between the 4th and the possible future 5th computer generation may explain why AI projects are so hard to realize. The problem is that apart from the tools of the 4th generation computers, which are object oriented programming languages and modern 32-bit CPUs, there are no additional tools available. What is missing are dedicated AI libraries, AI hardware and AI programming languages, and the chance is high that such tools will never be invented. In other words, since the 4th computer generation no measurable progress is visible, and this is equal to the last AI Winter.
The mentioned tools for overcoming the AI Winter, which are datasets, heuristics and so forth, are located outside of computer science. They have nothing to do with classical hardware and software, but they use existing technology in a different form. That means modern AI software is written in classical 4th generation programming languages and runs on an ordinary 32-bit computer. And dumping a dataset into a CSV file doesn’t need a dedicated AI library; it can be realized much better with existing tools and libraries. This situation made it hard to localize modern AI tools. AI has nothing to do with computers, hardware and software, but is located outside the box. The development of AI can be traced back to certain subjects which are discussed in academic papers.
For example, the term dataset was nearly unknown in papers until 1992, but it was used frequently with the advent of deep learning.
2d2 The invisible 5th generation computers
It is possible to describe the entire computing history in generations. Every possible piece of hardware or software can be assigned to one of the four computer generations. For example, the C programming language was invented during the 3rd generation computer period. These simple 4 categories give a great overview of a complex subject and are used in most museums as a timeline to explain how computing technology has evolved.
Unfortunately, the concept shows a contradiction for 5th generation computers and Artificial Intelligence. According to a purely numerical understanding, the 4th computer generation ended in the mid 1980s, so this was the starting year for the 5th generation. The problem is that the 5th generation of computers is not defined precisely, and even worse, it is equal to the last AI Winter, which means that the attempt at building such computers has failed. This would imply that modern computers built in the 2020s work with the same hardware and software principles as the 4th generation computers, so this period has never ended.
The idea in the mid 1980s was that the 5th generation of computers is equal to Artificial Intelligence working with parallel supercomputers and the Prolog language. But this technology was a failure. The mentioned tools are not used in AI projects. So the question is: where exactly is the 5th computer generation?
The working thesis in this chapter is that the technology is available, but it is invisible. The 5th computer generation works with a different principle than the generations before; it has nothing to do with hardware or software but with a philosophical standpoint. The only place in which the 5th computer generation can be localized for sure is the Gutenberg galaxy, which are books and papers about AI. The advent of deep learning was equal to an increased publication activity around this subject. An endless amount of papers were written about deep neural networks including their applications. There is no single computer chip and no single programming library which has powered the deep learning hype; it is a purely paper based discipline.
To visualize the 5th computer generation we have to draw a bar chart with the number of keywords used in academic papers. Since the mid 1980s (which was the starting point of 5th generation computers) many new papers have been written about neural networks, robotics and natural language processing. These papers including their content are the proof that the 5th generation computing technology exists.
The surprising situation is that practical demonstrations of robots work with 4th generation and even with 3rd generation computer technology. The typical neural network was programmed in the C language without further libraries, and it runs on outdated hardware. That means, from the perspective of computer history such a project is boring. The innovation is located in the application of this technology. The well known computers are used in a different way than before. The same limited CPU is used for solving a new sort of problem, not known before.
The self understanding of any computer museum is to make the development visible, so there is a need to overcome a hidden technology. A possible attempt to visualize AI development since the 1980s is to focus on puzzles instead of describing algorithms which can solve the puzzle. A well designed AI museum would present some of these puzzles, like Rubik’s cube, a chess board, the 15-puzzle and a Micromouse maze, to the audience. The idea is to use 4th generation computers to solve 5th generation puzzles. Especially the 15-puzzle is an important milestone because it is commonly used to explain the advantages of heuristics. A heuristic is a tool to solve more advanced AI related problems.
Most AI related puzzles like Micromouse, the 15-puzzle and Karel the Robot have a background in teaching computer science. In most cases these puzzles were invented with the only objective of teaching students how to program a computer. It sounds a bit contradictory, but this sort of puzzle is well suited for explaining what Artificial Intelligence is about. Unfortunately, it is difficult to explain these tools similar to 4th generation computers. A puzzle in the strict sense is mostly a computer course or a paper which describes such a course.
The assumption is that future robots (not invented yet) will have much in common with a puzzle but low similarity with an algorithm.
2e Early teleoperated robots
Most robots and pseudo-automatons from the past are undocumented. Only textual descriptions or hardware parts are available. These machines were created from around the 1940s until the 1970s. Sometimes they work with analog circuits and sometimes with mechanical components, which makes it hard to describe the logic with a modern understanding. But there is a certain perspective available which makes it much easier to understand the principle.
The working thesis is that every robot built until the 1970s was teleoperated. Teleoperation means that the signals for the servo motors are provided from a distance, either with a wire or with wireless transmission. This understanding helps to reduce the complexity of the early robots. Instead of analyzing the inner workings including the cables and the analog circuits, it is enough to know that there was a remote control device with some buttons and knobs and a human operator was in charge of pressing the buttons. The resulting robot can be divided into two parts. The robot itself, shown to the audience, was a platform on wheels and maybe a robot hand; in addition, there was a remote control panel used by the human to control the robot.
The only interesting thing is the remote control. In the easiest case it is equal to a joystick plus some buttons to activate the voice or a light bulb. The user interaction between a human and the remote control is important to understand robots from the past. The technical evolution is not located inside a robot but has to do with improved remote control panels.
Let us imagine some possible interactions between humans and the remote control device. In the primitive version a cable is needed which transmits joystick signals to the robot. The human presses the joystick forward, and this makes the robot move forward. More advanced remote control devices work with wireless (radio) controlled panels. And the most recent remote control technique uses a high abstraction level. Let me give an example.
Suppose there is a snake robot controlled with a remote control. In the basic setup, the human has to control the joints one by one, which results in a slow movement. The human has to press lots of buttons until the snake does something. In a more advanced setup, the movement is generated by an oscillator. An oscillator is a sine function realized in hardware which has some parameters. The joystick allows the human to adjust the parameters, which is equal to providing only high level adjustments to the ongoing movement. The resulting movement produces a lower workload for the human operator.
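How such an oscillator reduces the operator workload can be sketched in software. The parameter set (amplitude, frequency and a phase lag between the joints) is a typical choice for snake gaits, assumed here for illustration:

```python
import math

def joint_angles(t, amplitude, frequency, phase_lag, num_joints=6):
    """Generate one angle per joint from a sine oscillator at time t.

    Instead of commanding every joint individually, the operator only
    adjusts the three oscillator parameters, e.g. with a joystick.
    """
    return [amplitude * math.sin(2 * math.pi * frequency * t + i * phase_lag)
            for i in range(num_joints)]

# The operator nudges the parameters; the oscillator drives the joints.
for t in (0.0, 0.1, 0.2):
    angles = joint_angles(t, amplitude=30.0, frequency=1.0, phase_lag=0.5)
    print([round(a, 1) for a in angles])
```

Three numbers now control six (or sixty) joints, which is exactly the reduction from low level joint commands to high level adjustments described above.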
Perhaps an example makes sense at this point. Suppose there is a robot in a museum built in the 1960s by an unknown engineer. There is no documentation available, and the robot doesn’t work anymore. After opening the hardware there is an endless amount of cables and circuitry inside and no one knows what the purpose is. The working thesis is that the inner workings of the machine can be ignored. It is for sure that the electronics was in charge of controlling the robot and that there is somewhere an input module which takes the signals from a remote control. The more interesting part of the machine is located outside of the robot. The remote control itself is the single device which is in charge of generating the signals. These signals are sent to the robot and then the arm moves. So the dominant questions are: which sort of remote control belongs to the robot? Is this device documented? Can the remote control be rebuilt from scratch? How many buttons are on the panel? What was their purpose?
Early humanoid robots shouldn’t be interpreted as a calculator but as a transistor radio. A radio receives and transmits signals from a remote location. It never works by itself. The voice from the loudspeaker isn’t generated by a person inside the radio; the person sits kilometers outside of the device.
2e1 Anti pattern in teleoperation
Teleoperation is mostly rejected by computer scientists because it looks like a dead end. The typical remote control for a robot is simply a joystick which is connected to a robot arm. Even if it is possible to control a robot with this approach, it sounds trivial to do so. What is missing is the ability to fulfill a task autonomously by a computer program.
But what if this sort of teleoperation was only made wrong in the details and the overall idea is great? Let us investigate why typical joystick based remote control seems like a dead end. The first problem is that the analog joystick movements can’t be played back. If the same actuator movements are recorded and repeated, the robot arm will for sure collide with an object and it will miss the object. The second problem with joystick control is that apart from movements to left/right/up/down no further action is possible, so the control is located only on a low level.
Let us modify the setup drastically so that teleoperation becomes more interesting. The first thing to do is to define a list of commands instead of using an analog joystick. The idea is that the human operator has a dozen buttons and has to press one of them. Such a user interface might be harder to use, but it is much easier to record and play back. Every possible sequence of actions consists of the pressed buttons. The second modification is to reduce the problem space. Instead of grasping objects in the normal 3D space, the idea is to reduce the target position to a grid. That means the robot gripper can be at (1,0), (1,1), (1,2) and so on, but not at (1.231, 3.212).
With these two simple modifications the teleoperated control makes more sense than in the initial setup. To navigate the gripper to a goal position the human operator has to press buttons for left and right movements, and a third button will grasp the object. This allows grasping most objects with a high success rate. And very importantly, it is possible to write down the command sequence into a script and run it in autonomous mode.
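A minimal sketch of this discretized teleoperation, assuming an invented four-word command vocabulary plus a grasp button:

```python
# The gripper lives on a grid; every button press is one symbolic command.
COMMANDS = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}

def run_script(script, start=(0, 0)):
    """Play back a recorded command sequence in autonomous mode."""
    x, y = start
    grasped = False
    for cmd in script:
        if cmd == "grasp":
            grasped = True
        else:
            dx, dy = COMMANDS[cmd]
            x, y = x + dx, y + dy
    return (x, y), grasped

# Recorded from the button presses of the human operator, then replayed.
script = ["right", "right", "down", "grasp"]
print(run_script(script))  # ((2, 1), True)
```

Because the session is just a list of symbols, recording and replaying it is trivial, which is exactly what analog joystick control cannot offer.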
This simple example has shown that teleoperation is a great idea if some parameters are fulfilled. It is important to design the GUI interface in a certain way which allows reducing the state space of the robot. With the improved GUI interface the teleoperated robot has much in common with a toy problem in a maze game. The human controls a gripper which can reach some positions, and then the gripper can grasp an object and bring it to the start position.
2f Textual teleoperation
Teleoperation in the classical sense refers to joystick control and visual feedback [Farkhatdinov2008]. The human operator moves the joystick forward and sees on a monitor what the robot is doing. Unfortunately, this control strategy can’t be recorded and played back, which prevents the robot from doing the task autonomously. A typical assumption is that because of these disadvantages, teleoperation as a whole is useless.
There is another mode in which a human operator can interact with a robot. It has no dedicated name but can be coined "textual teleoperation". The idea is that the human operator types in commands and the robot answers with textual information. Such a communication has much in common with a point&click adventure like Maniac Mansion, but it is seldom utilized for robot control. The main advantage is that the communication can be recorded easily, and it is possible to write a script which automates the control of the robot.
The most surprising effect of textual teleoperation is that the robot has no limitations anymore. It is possible to control the robot even for complex tasks because every command is provided by a human expert who is aware of the situation. The only drawback is that programming the interface is a demanding task. Somebody has to figure out the vocabulary and make sure that the robot understands the words. From a programming perspective this is equal to programming a medium complex point&click adventure in which the human player has some action words in a menu and can select them to execute longer sequences.
2f1 From solving games towards generating games
A typical misunderstanding in the history of Artificial Intelligence concerns the concrete task of a robot or a software agent. The untold assumption has evolved over the decades. Until the 1990s the idea was that a computer program needs to be intelligent in a sense that is comparable to a human. Later the assumption evolved into the goal that the AI should play games like chess, Tetris and RoboCup. A more recent approach in defining the role of Artificial Intelligence is that the AI agent should create a game from scratch which can be solved by another party.
Let us discuss the last two assumptions in detail. A frequent assumption is that there is a game available like Pacman and the task is to act in this game. This task is delegated to an AI. Endless amounts of projects are built around the goal of programming an AI which can play a single game or multiple games. The main disadvantage is that only simple games like the 15 puzzle or Tetris can be played with this principle, while more complex situations from robotics are ignored.
The obvious problem with robotics domains is that no such thing as a game is available. First, a robot acts outside of a computer in reality, and second, there is no simulator available for the domain. To overcome this obstacle there is a need to redefine the goal for an AI: instead of solving games (which are not there), the idea is to design a game. Such a task is more complex because game design works on a different principle.
From an abstract perspective a text adventure game is defined by its vocabulary, which is stored in a 2-tuple BNF grammar. The player can execute actions like "moveto location", "grasp object", "speakto person". Every text adventure is programmed around such a grammar, so the first task for a robot is to invent the BNF grammar from scratch.
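Such a 2-tuple grammar can be sketched as two word lists plus a validity check; the verbs and nouns below are hypothetical examples, not the vocabulary of an actual game:

```python
# A text adventure vocabulary as a 2-tuple (verb, noun) grammar.
# The word lists are hypothetical examples, not from a concrete game.

VERBS = ["moveto", "grasp", "speakto"]
NOUNS = ["kitchen", "key", "guard"]

def is_valid(sentence):
    """A sentence is allowed if it matches '<verb> <noun>'."""
    parts = sentence.split()
    return len(parts) == 2 and parts[0] in VERBS and parts[1] in NOUNS

def language():
    """Enumerate every sentence the grammar allows."""
    return [f"{v} {n}" for v in VERBS for n in NOUNS]

print(is_valid("grasp key"))   # True
print(is_valid("eat key"))     # False
print(len(language()))         # 9 sentences in total
```

The entire language is the cross product of the two word lists, which is why the state space of a text adventure stays small by construction.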
An existing text adventure which already contains a vocabulary can be solved in a second step by a different algorithm. Such a solver works more like a classical AI program which is able to find the shortest path in a state space. The AI has to investigate possible alternative actions and select the best one with the help of a heuristic. The principle is the same as the inner working of a chess engine, which also works on top of an existing game engine. Solving a text adventure with a search algorithm like A* can be called a trivial task in computer science because endless examples from the past are available.
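A minimal sketch of such a solver follows, with an invented three-room state space; breadth-first search stands in for A* because all actions in the toy graph cost the same:

```python
# Sketch of a solver on top of an existing adventure grammar.
# Rooms and actions are invented for illustration; breadth-first
# search stands in for A* because the toy graph has uniform costs.
from collections import deque

# state = (location, has_key); actions are sentences from the grammar
def successors(state):
    loc, has_key = state
    if loc == "hall":
        yield "moveto cellar", ("cellar", has_key)
        if has_key:
            yield "moveto exit", ("exit", has_key)
    elif loc == "cellar":
        yield "grasp key", ("cellar", True)
        yield "moveto hall", ("hall", has_key)

def solve(start, goal_loc):
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, plan = queue.popleft()
        if state[0] == goal_loc:
            return plan
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [action]))

print(solve(("hall", False), "exit"))
# ['moveto cellar', 'grasp key', 'moveto hall', 'moveto exit']
```

The solver never touches pixels or physics; it operates purely on the grammar sentences, which is the point of grounding the domain in a vocabulary first.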
2f2 Early adventure games in the 1980s
In contrast to a famous belief there is no need to reinvent the wheel and program advanced robotics with a textual interface from scratch, because in the 1980s lots of working examples were already created. The game "Castle adventure" was released in 1984 and combines a textual parser with a graphical representation. The player has to escape from a castle by collecting items and performing actions. He can control the character with the keyboard and has to enter commands, while the result is shown on the display. Apart from Castle adventure there are many games with a similar principle, and exactly such a user interface can be used to control AI related robots. The logic of the robot isn't defined by some sort of advanced algorithm but is connected to a game. The genre of MS DOS based adventure games is the ideal blueprint for programming robots. Thanks to the textual interface it is easy to automate the control.
From a cognitive science perspective an MS DOS adventure game is a man to machine dialogue with grounded actions. That means, after entering a command something happens on the screen because it is preprogrammed in the game. Such a dialogue allows complex problems to be solved, like escaping from the castle. The only problem with these games is that it is very complex to program them from scratch. The textual parser, the visual map and the behavior of the character have to be defined in source code. Even with more modern programming languages like Python and pygame such a programming project will become a larger one.
The overall principle behind "Castle adventure (1984)" is that it creates a simulated world in which the player has to solve puzzles. The game provides an environment in which actions, events and objects are available. The human player has to interact with the game engine to escape from the world.
Let us assume that the game setting is a different one. The same user interface is used to simulate a kitchen in which the player has to prepare a meal. Such a game can be connected to a real kitchen robot, and then the robot can be controlled with the keyboard and textual commands.
The main task for an AI isn't to solve an existing game but to design such a game. The AI has to invent a game like "Castle adventure (1984)" adapted to a certain domain like self driving cars or the kitchen domain. After such a game is available it is possible to interact with the robot through an interface, and it is even possible to program a behavior tree which solves the game automatically. Adventure games reduce the state space because the amount of possible actions is restricted to the preprogrammed commands, objects and locations in the game. Instead of figuring out hundreds of millions of possible actions for the robot gripper, the robot has only 10 different actions.
With the help of an adventure game it is possible to convert any domain into a toy problem. Toy problems can be solved much faster than real problems, and the key element in doing so is a mini language which provides a natural language interface to interact with a game. Such a mini language was preprogrammed in the Castle adventure (1984) game.
2f3 A review of the Castle adventure game
The project was created in 1984 by Kevin Bales. Some remakes written in Javascript and C++ are available on the Internet which are around 60 kB in size. The original program occupied around 50 kB on a floppy disc. Even if the game has only low quality graphics, it takes a long time to code such a game. And this is perhaps the biggest disadvantage which can hinder similar projects from being useful for robotics.
Before a robot can be controlled interactively with a two word command parser, somebody has to write a game similar to Castle adventure. Suppose a single line of code needs 40 bytes; then the overall game consists of around 1500 lines of code. If an average programmer can write 10 LoC per day, it will take 150 days to create such a game from scratch. For more complex robot control like biped walking the needed simulation game might be more advanced, so the lines of code will grow quickly.
Nevertheless, from a technical perspective it makes sense to see Castle adventure as a blueprint for creating a human to robot interface. The idea of the game was that a human player controls the game character with the help of arrow keys and simple commands like "drink water". Such a user interface can be scripted in a behavior tree, which results in autonomous behavior. Another interesting advantage is that it is pretty easy to formulate a walkthrough tutorial for an existing game. It can be written down on a single sheet of paper which actions the player has to take in which sequence to reach the overall goal of the game. The formulated actions like "go, pick, place" have a meaning because they reference the predefined actions in the game. This makes communication about the game much easier.
In contrast, controlling a robot without a game engine is nearly impossible. If the robot doesn't provide predefined actions and a map within a game, it is not possible to formulate a walkthrough tutorial because a grounded language is missing. It seems that the precondition for any sort of robot is to first program a game environment in which a human to robot dialogue takes place, and only in a second step create a solver or a behavior tree for this game.
2f4 Speaker hearer interaction in adventure games of the mid 1980s
In the period from 1983 to 1987 there were some computer games available which have to be called revolutionary. Former text adventures were improved with basic ASCII graphics, which resulted in a better user experience. What is seldom explained in the literature is the relationship between a human player and the underlying game engine. The technical term is client server architecture, but the more precise linguistic description is to call the interaction mode a speaker hearer system.
The human player acts in the role of a speaker. He presses arrow keys and, very important, selects commands from the menu like "go north". In contrast, the game engine receives the commands and generates a response on the screen, which is the role of a hearer. For solving ingame puzzles both parties have to cooperate. This allows a game to be played from start to finish.
Let us focus on the workflow in detail. Every adventure game of the 1980s communicates with the human user over a user interface which is a mixture of ASCII graphics and pure textual information. This user interface is the central element of the game and connects the speaker with the hearer over a protocol. The user interface is fixed over time and establishes a common ground during communication. That means a certain command has the same meaning all the time.
From a programming perspective the interaction is realized with the help of a game engine which takes the user input and produces a new state of the game. For example, after entering the command "go north" the position of the player gets modified.
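This input-to-state loop can be sketched as follows; the command table is an assumed example, not the engine of an actual 1980s game:

```python
# Minimal game engine loop: parse a command, update the game state.
# The command table is an assumed example, not an actual 1980s engine.

def make_engine():
    state = {"pos": [0, 0]}
    moves = {"north": (0, 1), "south": (0, -1),
             "east": (1, 0), "west": (-1, 0)}

    def step(command):
        verb, _, arg = command.partition(" ")
        if verb == "go" and arg in moves:
            dx, dy = moves[arg]
            state["pos"][0] += dx
            state["pos"][1] += dy
            return f"you walk {arg}"
        return "I don't understand"   # unknown commands keep the state fixed

    return state, step

state, step = make_engine()
print(step("go north"))   # you walk north
print(state["pos"])       # [0, 1]
print(step("fly moon"))   # I don't understand
```

The fixed command table is what makes the common ground stable: "go north" produces the same state change at every point of the game.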
It should be mentioned that adventure games from the mid 1980s are usually not referenced as examples of artificial intelligence, because in most cases they do not provide non player characters or other examples of autonomous intelligence. But they are a good example of man machine interaction working with natural language.
Perhaps it makes sense to explain the advantages of a speaker hearer dialogue. It allows complex problems to be divided into two layers. The first layer, which is realized by the game engine, is the simulation of a domain, for example drawing a dungeon on the screen. The second layer is about acting inside this simulation, which is usually under the control of a human user. Both layers have to work together.
2f5 User interfaces as a grounding mechanism
All existing adventure games work with a user interface which takes the input of a human and transmits it to the game engine. Possible examples are text based parsers, point&click word lists or a graphical panel with icons. It is very common that the user has different options available at the same time. He can decide whether to use a mouse, a joystick, a keyboard, or to enter textual commands. In real adventure games it is a typical interaction method to combine these modalities at the same time. For example, the avatar movement is controlled with arrow keys, and in addition there is a list of icons to activate high level commands like "grasp" or "speakto".
It is important to focus on the user interface because it is equal to a common understanding between the human player and the game engine in the backend. Grounding usually means that the communication is directed through the user interface. At first the programmer of a video game invents a user interface, and then all the man machine communication is fed into this user interface.
So the user interface affects how the communication takes place. If there is no icon available to pick up an object, the human user can't perform the action. For reasons of simplification it is possible to reduce all sorts of user interfaces to a word list. There are words for movements (up, down, left, right), words for actions (grasp, give, open) and words for events (key is missing, door is closed). These words can be translated into different graphics/sounds on the screen. For example the event "door is closed" can be shown in a textual format or played as a sound. Because words are part of a natural language it makes sense to treat a user interface as a grammar in natural language processing. The BNF grammar contains all the allowed language and describes the user interface.
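The reduction of a user interface to a word list can be sketched like this; the categories and the rendering rules are assumptions for illustration:

```python
# A user interface reduced to a word list. Each word can be rendered
# in different modalities; the mapping below is a hypothetical sketch.

WORDLIST = {
    "movement": ["up", "down", "left", "right"],
    "action":   ["grasp", "give", "open"],
    "event":    ["key is missing", "door is closed"],
}

def render(word, modality="text"):
    """Translate a word of the UI language into screen output."""
    if modality == "text":
        return f"[{word}]"
    if modality == "sound":
        return f"<play beep for '{word}'>"
    raise ValueError("unknown modality")

# the same event can reach the player as text or as a sound
print(render("door is closed", "text"))
print(render("door is closed", "sound"))
```

The word list is the grammar; the graphics and sounds are merely interchangeable surface forms of the same underlying language.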
2f6 Creating toy problems from scratch
In contrast to a famous assumption, the amount of toy problems isn't limited to the 15 puzzle and the tictactoe game; any domain can be converted into a toy problem. In the case of complex robotics domains like warehouse robots this transformation is realized with a game engine based on an adventure game similar to the Zelda game. The robot acts in a grid maze, gets controlled by arrow keys, and certain actions like grasp, ungrasp and rotate can be activated. From a technical perspective such a game consists of simple 2d graphics plus a word list with possible actions. These ingredients make sure that the state space becomes smaller.
After converting a robotics domain into an action adventure the discourse space is different. It is possible to talk about the actions of the robot in a highly structured manner. For example, a possible plan would be to move to waypoint A, pick up an object and then move to waypoint B. Such a plan can only be formulated if the game supports actions like "moveto" and "grasp". The game engine provides the common ground and simplifies the communication between man and machine. After the game is programmed it becomes much easier to give a command to the robot. Every possible command is predefined in the game engine. Possible alternative commands can be ignored. This allows defining what sort of software is needed for the robot.
Suppose there is an action adventure game available for a warehouse robot. In the easiest case the robot movements are teleoperated, similar to playing a normal video game. That means the human operator moves the robot by pressing arrow keys and selects commands from the menu. The commands are sent to the robot in the real world. In a more advanced case, some scripts can be written to automate a task. For example a pick&place script would move to the initial position, pick up the object and then move to the target position. Such a script can be executed within a for loop, which results in pseudo-autonomous behavior of the robot. The precondition for such an elegant example of automation is of course the existence of a warehouse action adventure. So we can say that the bottleneck is located in the man to machine communication. If such a communication is available, the robot will act in a toy world with a small state space and a low amount of possible commands.
Figure 1: Symbol grounding pipeline in an action adventure game (human user, keyboard/mouse, command parser, game engine, virtual character)
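The pick&place script in a for loop might look like the following sketch; the command names and warehouse locations are hypothetical:

```python
# Pseudo-autonomous warehouse behavior: a recorded pick&place script
# executed in a for loop. Command names and locations are assumptions.

def run(commands, state):
    for cmd in commands:
        if cmd.startswith("moveto "):
            state["pos"] = cmd.split(" ", 1)[1]
        elif cmd == "pickup":
            state["carrying"] = True
        elif cmd == "drop":
            state["carrying"] = False
            state["delivered"] += 1
    return state

pick_and_place = ["moveto shelf", "pickup", "moveto table", "drop"]

state = {"pos": "start", "carrying": False, "delivered": 0}
for _ in range(3):                  # repeat the script three times
    state = run(pick_and_place, state)
print(state)   # {'pos': 'table', 'carrying': False, 'delivered': 3}
```

The same command list serves both as a teleoperation log and as an autonomous program, which is the bottleneck-removing property claimed above.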
2f7 A closer look into action adventure games from the past
During the period from 1984 until 1990 there were some so called action adventure video games available. The game principle was to improve existing text adventures with 2d topdown graphics, in which the player has to find keys, search for swords and rescue a princess. The most famous example is Zelda I, but lots of other examples are available for the C64 and the IBM PC. The game genre is mostly forgotten today because the graphics were poor and these games were never described in the context of artificial intelligence. But there is a certain element in these games which makes them interesting from the perspective of cognitive science.
The overall idea behind an action adventure is that the human user interacts with the avatar with the help of a keyboard and a word list. In addition, the user gets visual feedback from the rendered maze. This allows complex ingame puzzles to be solved. The player has to visit certain places and fulfill tasks. What makes the game genre advanced is that it demonstrates very well what grounded communication is about. Grounded communication means that a human player interacts with a robot. The human plays the role of a speaker who gives commands, while the game engine is the hearer which executes the commands.
From a robotics perspective, the character is controlled with teleoperation. The human user presses buttons on the keyboard and this moves the character. But this interaction works on a high abstraction level. For a certain game it is possible to write down a walkthrough tutorial which references places in the game and the available actions from the menu. Such a walkthrough tutorial can only be formulated if the reader of the tutorial understands the meaning.
The game engine behind an action adventure provides a communication space. Such a high level space allows the state space to be reduced drastically. Instead of describing robot movements from a low level perspective it is possible to formulate high level statements like "Walk to city A and grasp object B". Even if the game engine isn't able to understand such a statement directly, it is possible to convert the sentence into allowed actions in the game.
Most action adventures were realized with a split screen. On top there is a 2d maze visible in a topdown perspective, in which the avatar can move in all directions. At the bottom of the screen there is an inventory and, very important, a list of possible actions like talk, use, open and pickup. Before the human player is able to interact with the world he has to memorize the keys and understand which sorts of actions are possible. This allows the player to solve longer puzzle problems in the game. The interesting situation is that a similar interface is useful for controlling service robots in a warehouse or in a kitchen. It makes sense to show the position of a robot in a visual map and provide a list of possible action verbs. Similar to an action adventure in the mid 1980s there is no artificial intelligence available in the classical sense, but a human operator takes control of the character.
Such an interactive system can be converted into an autonomous system easily, because an action adventure is a toy problem with a small state space. It is possible to program an A* solver which controls the character in a fully autonomous fashion. The reason why such a solver can be programmed is that the game engine of an action adventure has solved the grounding problem already; that means the domain was converted into a low state space problem. The result is a simple 2d maze game with a small amount of action words. It is possible to play this game with an AI.
2f8 Communication with a game engine
Game engines are usually seen as the technical part of a computer game. They are the place in which the simulation takes place, and it is possible to use a single game engine in many different games. Any programming language from C++ to Python can be used to create the game engine, which affects its performance.
In addition, there is an alternative approach to the subject which is related to the symbol grounding problem. Grounded communication is about a common understanding between a speaker and a hearer based on a language. This language is usually not defined very precisely, but the assumption is that a language works with a syntax and a grammar. What if the shared understanding is realized with a game engine?
A game engine provides basic commands which allow a human player to control a game. For example the player can decide to move to the left or pick up an object [Tellex2007]. These primitives simplify the man to machine communication. So a game engine is similar to an AAC communication board: a tool which allows two parties to speak with each other. In the case of a game engine the parties are a human user in front of a computer, and the computer which contains a graphics card and an operating system.
Let us take a closer look at how a game engine works in an action adventure like the Castle adventure game (1984). The game engine monitors keyboard events like pressing arrow keys or entering a textual command. This input is parsed, and if it fits the internal command list a certain action gets triggered, e.g. the player's position gets modified. The result is that a human user communicates with the virtual avatar on the screen. He gives a command to the sprite and then the sprite does something.
Such a kind of interaction was mostly ignored by AI researchers in the past because there is no puzzle available which can be solved; writing a game engine is a trivial programming challenge. On the other hand, a game engine might be important for understanding the symbol grounding problem. If the man to robot communication fails, the cause is mostly a lack of understanding. The human gives an order but the robot won't react. This missing communication has its source in a malfunctioning game engine.
2f9 Simplified instruction following example with video games
Most existing demonstrations of instruction following work with voice recognition, free form textual parsers and physical robots. Such demonstrations look impressive but they are difficult to reproduce. The easier attempt is to focus on a video game similar to the Castle adventure (1984) video game. That game genre is also called an action adventure, and the most interesting element is the GUI.
In the Castle adventure (1984) game the user controls the character with arrow keys plus a 2 word parser like "use sword". This combined interface allows a human operator to perform any possible action. Grasping objects is also simplified: by simply running over an object the item is added to the inventory. Programming a similar interface from scratch for a grid maze game can be seen as a good introduction to the symbol grounding problem. Let us describe the technical perspective.
Symbol grounding means interpreting commands. The human operator does something with the keyboard and the game engine responds with a low level action. For example, the human presses the left arrow key and the game engine modifies the current position on the screen. All possible actions are formalized in a grammar. So the basic task from a programming perspective is to implement the entire command vocabulary in the game engine.
After the commands from the human are interpreted by the game engine it is possible to start a man machine dialogue, in the sense that the human is able to play the game. He can enter commands and the character in the game will do something. In other words, programming the GUI for an action adventure is solving the symbol grounding problem.
Figure 2: One wheel balancing robot
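A minimal sketch of such a combined interface follows; the arrow keys and the word lists are hypothetical assumptions, not the data of the original game:

```python
# Combined interface in the style of Castle adventure: arrow keys move
# the avatar, a two word parser handles everything else. The vocabulary
# below is a hypothetical sketch of such a command grammar.

ARROWS = {"LEFT": (-1, 0), "RIGHT": (1, 0), "UP": (0, -1), "DOWN": (0, 1)}
VERBS = {"use", "read", "open"}
ITEMS = {"sword", "book", "door"}

def handle(event, game):
    if event in ARROWS:                       # low level: arrow key
        dx, dy = ARROWS[event]
        game["x"] += dx
        game["y"] += dy
        return "moved"
    verb, _, noun = event.partition(" ")      # high level: 2 word parser
    if verb in VERBS and noun in ITEMS:
        game["log"].append((verb, noun))
        return f"you {verb} the {noun}"
    return "I don't know how to do that"

game = {"x": 0, "y": 0, "log": []}
print(handle("RIGHT", game))        # moved
print(handle("use sword", game))    # you use the sword
print(handle("use laser", game))    # I don't know how to do that
```

Implementing this command vocabulary completely is exactly the grounding task: every symbol the player can utter is tied to a state change in the engine.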
3 Instruction following
The paradoxical situation in robotics is the absence of a task. Suppose there is a clearly defined problem; then it could be solved by an algorithm. Unfortunately real robotics projects are missing tasks, so it remains unclear what the robot should do next.
The example screenshot shows a balancing robot within a Box2d simulation. On first look the picture shows a successful robot, because the system is able to balance on one wheel. The user can even modify the angle, which allows the robot to move left and right. The simulation has a frame rate of 20 fps, so it runs quite smoothly. But the shown simulator is missing an important point. It is unclear what the purpose is; the robot can't accumulate a score. It can't be called a meaningful game with clear objectives, but it is an attempt to create such a game.
The situation is common for AI related problems. Suppose a different robot, a robot arm, were put into the simulation. Even if the arm can be controlled with accuracy, it remains unclear what the objective is. To overcome the missing game rule a certain sort of meta game has to be invented, which is a dialogue oriented instruction following game. Such a game consists of two parties: one speaker and one hearer. The speaker formulates a goal and the hearer has to execute it.
For example, in the shown one wheel robot simulation possible sentences of the speaker might be:
1. wait for 5 seconds
2. move to the left and rest at the wall
3. then move in fast speed to the right
4. move back to the middle and wait for 4 seconds
These instructions are formulated in a high level syntax which is equal to natural language. They provide goals for the simulation. It is possible to measure whether the goals are fulfilled or not. The shared principle in instruction following is to treat any domain as a dialogue game, which results in clearly defined objectives.
Let us take a closer look at the four goals. In contrast to the common principle in AI programming, these instructions are not solving a problem like a path planning algorithm; they are creating a problem. It is up to the human or the AI controller to fulfill the objectives. The task for the speaker is to invent a sensemaking goal and monitor whether it was fulfilled. In other words, the speaker in the dialogue provides the meaning to the hearer.
Another important characteristic is that the goals are formulated in natural language, not in terms used in the simulation. The goals of the speaker have nothing to do with the Box2d simulation nor with the ability of the robot to balance on one wheel; they are formulated from an objective perspective.
The next figure shows an improved simulation. Most of the instructions are detected as events. In the concrete example the robot is at the left wall because this was the formulated instruction of the speaker. It is possible to compare the goal with the current situation, which results in a positive score for the robot. The look&feel of the simulation has a stronger focus on the text box. The GUI widget with the detected events is no longer seen as additional information; it is the dominant element. The newly defined objective for the robot is to fulfill the instructions of the speaker. It is up to the robot to decide how to do so.
A sequence of natural language instructions allows an abstract goal trajectory to be formulated. The command sentences can describe longer behavior patterns which have to be fulfilled by the robot. This is equal to providing meaning. From a technical perspective, meaning is simply the error between goal and current situation, similar to a PID controller which also compares the goal state with the current state.
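This notion of meaning as an error term can be stated in a few lines; the goal value and the trajectory are invented numbers:

```python
# Meaning as an error signal: the goal state is compared with the
# current state, like the error term of a PID controller. The goal
# position and the trajectory are hypothetical values.

def error(goal, current):
    """Meaning, reduced to a number: distance between goal and state."""
    return abs(goal - current)

goal_x = 0.0            # instruction: "rest at the left wall"
trajectory = [8.0, 5.0, 2.0, 0.3]

for x in trajectory:
    print(f"x={x:4.1f}  error={error(goal_x, x):4.1f}")
# the error shrinks as the robot fulfills the instruction
```

A shrinking error over the trajectory is the quantitative counterpart of "the hearer is fulfilling the speaker's instruction".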
3a Instruction following for goal formulation
A seldom described problem in robotics is the absence of a goal. Without such goals it is impossible to determine the cost and the reward which are needed in reinforcement learning. This prevents robotics problems from being treated as RL problems.
Instead of asking which algorithm is needed to control a robot, the more elaborate approach is to create an instruction following scenario, which can be seen as an articulated sort of teleoperation. A human operator (the speaker) provides the commands and the robot has to fulfill these goals. During the pipeline different goals can be formulated like:
1. go to left region
2. grasp an object
3. bring the object to the table
4. ungrasp the object
5. go to middle
6. jump into the air and so on.
The formulated textual goals are the input for the robot control system. The behavior of the robot can either match the requirements or not, and in both cases a score can be determined. This score is highly important for a machine learning task formulation.
In the literature the overall subject is described as the instruction following task. The textual goals are provided as a dataset and the robot has to convert the commands into actions. The interesting situation is that instruction following doesn't need a certain library or algorithm; it is about creating a game in which a robot is doing something.
Instruction following closes the gap between a normal simulation like Box2d and the need in reinforcement learning for a reward function. The formulated goals in the dataset are equal to scoring the robot.
Let me give an example of a human formulated textual goal. The operator can say to the robot "I'd like you to drive with a speed of 30 mph straight until the traffic light". The formulated goal contains a measurement rule. It is possible to compare the robot's behavior with the announced goal. For example, if the robot drives at 26 mph and stops at the traffic light as requested, this would generate a high score, which means the robot has done most things correctly.
What makes textual commands interesting is that they are not formulated in AI related terms, they have nothing to do with computer programming, and they don't belong to a classical simulation. A textual goal is formulated on a very high abstraction level. It is some sort of textual interaction with a machine.
From an outside perspective, instruction following has much in common with teleoperation. In both cases there is no classical AI system available which takes decisions; the human operator is in charge of controlling the robot. In the case of joystick based interaction, the human communicates with the robot over the joystick, while in the case of instruction following the communication medium is natural language. In both cases the robot is doing what the human wants.
In a classical robotics competition like Micromouse such an interaction would be perceived as cheating. If the human operator controls the robot with the joystick, it would be the opposite of the desired behavior. A joystick in the loop means that there is no software which controls the robot, so it can't be called a robot at all. The same situation exists for a speech controlled robot. If the human has a panel with actions like forward, stop, moveleft, backward, the robot can't be called autonomous anymore.
Nevertheless such an interaction makes sense. If instruction following is perceived as wrong robotics, then there is a need to create a sort of robot competition which puts a stronger focus on human machine interaction.
An interesting side problem of instruction following is activity recognition. The idea here is that the robot can't execute actions by itself; it is only able to monitor certain events like a collision, a threshold speed or a turn left action. A detected event is shown on the command line, which at first glance doesn't look very interesting, because detecting a collision is an easy task and in the classical understanding it has nothing to do with artificial intelligence. From a technical side, such an event gets recognized by a sensor, and a single line of Python allows such a situation to be perceived.
But an event detection system is at the same time a powerful element of an AI system. It can be extended into an instruction following robot. The idea is that possible events are equal to possible instructions, and it is only a detail problem how to generate the action signals for the robot. Let me give an example.
Suppose a robot car can detect whether its current speed is around 30 mph. In such a situation a light informs the human operator. The newly formulated task for an instruction following system would be that the human operator formulates the goal and the robot car acts accordingly. The robot has to use the information from the event detection system, determine a cost value, and this cost information is used to search for the optimal action. The model predictive control solver gets the constraint: "control the car so that the costs are reduced". The interesting point is that any available MPC solver can do this very well. The only precondition is that the term "costs" is defined.
3b Cost function as instruction following
Most newcomers to Artificial Intelligence would locate the AI of a robot in the solver subroutine. A certain algorithm analyzes the game tree, searches for the optimal action and executes it. With this understanding, an AI is equal to a depth-first search algorithm, the minimax approach, or in the case of robotics the MPC solver. Unfortunately, in a real robot the MPC solver needs cost information as input. Otherwise the algorithm can't minimize the cost. This requirement is often ignored. And even more complicated, most robotics domains have no cost function at all, which makes it hard to provide the needed information.
Let me give a practical example. Suppose there is a robot in a maze which can move around and pick up objects. Even if the robot is able to do something, it is undefined what the current costs are. Roughly speaking, the AI community doesn't know the answer, and without the cost function the MPC solver can't find the optimal action. And this means that the robot can't be controlled by Artificial Intelligence at all.
In the past the problem was either ignored or it was explained that robot control works completely differently from chess engines, which work with an evaluation function. The assumption was that robot control can be realized without a cost value but with behavior trees or other bottom-up approaches. This understanding is outdated: robotics works the same way as a chess AI, and the cost function can be provided with an instruction following dataset.
The idea is that the human gives a command like "pick up object", and if the robot does so, the robot gets a reward of +1. If it fulfills the goal only in part, the reward is +0.5, and if it ignores the command entirely, the reward is 0, which corresponds to the maximum cost. In other words, the costs are provided by natural language commands from a human operator. The robot has to do what the operator demands; then the reward gets maximized.
It is surprisingly easy to implement such a paradigm in software. At first a table is created with possible commands which are numbered from 0 to 7. Then the human operator is asked to select one of the predefined goals. He chooses 0 = "pickup object".

A dedicated event recognition module in the robot determines which of the actions is executed in reality. The value is compared with the selected goal, and this allows to determine the reward for the robot. No matter which goal was selected by the human, the robot knows the current reward value. And this value can be utilized by an MPC solver to maximize the accumulated reward.

With this pipeline it is possible to control longer action sequences in a semi-automatic fashion. The human operator can provide the goal sequence [0,3,7,5,5,4,0,2,2,7] and the robot has to execute it. For each single command the reward gets calculated, and this allows to plan the low level actions of the robot.

id  command
0   pickup object
1   stop
2   forward
3   waypoint A
4   waypoint B
5   left
6   right
7   release object

Table 2: Commands for a maze robot
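The reward pipeline can be sketched as follows. The event recognition module is mocked by a stub that simply reports the executed action id; in a real robot this id would be derived from sensor data.

```python
# Sketch of the reward pipeline: the commanded goal id is compared with
# the action id reported by the (mocked) event recognition module.

commands = {0: "pickup object", 1: "stop", 2: "forward", 3: "waypoint A",
            4: "waypoint B", 5: "left", 6: "right", 7: "release object"}

def recognize_event(executed_id):
    # Placeholder for the event recognition module.
    return executed_id

def reward(goal_id, executed_id):
    # Full reward if the executed action matches the commanded goal.
    return 1.0 if recognize_event(executed_id) == goal_id else 0.0

goal_sequence = [0, 3, 7, 5, 5, 4, 0, 2, 2, 7]
executed      = [0, 3, 7, 5, 1, 4, 0, 2, 2, 7]   # one command was ignored
total = sum(reward(g, e) for g, e in zip(goal_sequence, executed))
print(total)  # prints 9.0
```

The accumulated reward is exactly the quantity an MPC solver would maximize over the sequence.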
3c From line following to instruction following
The advantage of line following robots is that the task is easy to explain to newcomers. The robot has to be programmed so that it reaches the end of the line. If the robot moves outside the line, the situation has become worse. This allows to decide which actions are good and which are bad.
An improvement over a line following robot is an instruction following game. The natural language command from a human operator can be interpreted as a virtual black line on the ground. The operator might say "forward, right, forward, stop". The action sequence defines a high level trajectory. In contrast to a physical black line on the ground, a natural language command can reference many possible goals. Let me give an example.
The black line on the ground can have different shapes: straight, left, right, crossing. Each possible segment generates a command for the robot. If the robot recognizes a left line segment, the robot has to follow the line by steering to the left. The mentioned instruction following task works with a similar principle, except that any possible word can be utilized. Possible commands are simple navigation instructions like "forward, right", but it can also be a vocabulary from a different domain. A grasping robot can be commanded with commands like "open_gripper, close_gripper".
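The analogy can be made concrete with a lookup table: each recognized segment (or word) maps to a steering command, so line following and instruction following share one dispatch mechanism. The vocabulary below is illustrative.

```python
# Hypothetical mapping from recognized line segments to steering commands.
segment_to_command = {
    "straight": "forward",
    "left": "steer_left",
    "right": "steer_right",
    "crossing": "stop",   # an ambiguous segment needs operator input
}

def follow(segment):
    # Unknown segments default to a safe stop.
    return segment_to_command.get(segment, "stop")

print(follow("left"))  # prints steer_left
```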
Similar to line following, an instruction following problem can be divided into two subproblems. The first one is how to control the robot and the second is how to draw the line on the ground. It is possible to use a simple map in which the line forms a circle. More advanced setups use a more
year  project                                    author
1967  LOGO programming language interpreter      Papert
1972  SHRDLU interactive animation               Winograd
1980  Put-That-There voice gesture interface     Bolt
1981  Karel the robot, domain specific language  Pattis
1993  AnimNL computer animation                  Badler
2006  MARCO route instruction following          MacMahon
2010  M.I.T. forklift                            Tellex

Table 3: Timeline of instruction following projects
complex line trajectory which consists of crossings, missing line segments, obstacles on the path and so on. These two subproblems make it possible to adjust the difficulty. For example, an entry level robot parkour consists of a short path with easy to follow line segments. If enough robots are able to solve this problem, the difficulty can be increased.
The same principle is available for the instruction following task. In the easiest case, the set of instructions is small. The operator is only allowed to select one of four possible words from a predefined vocabulary. In a more advanced setup the operator can formulate 2-tuple commands like "open gripper, moveto waypointA". Executing these instructions is harder because a dedicated command parser is needed.
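A parser for such 2-tuple commands can be sketched in a few lines. The verb vocabulary and the splitting rules are assumptions for illustration, not from an existing system.

```python
# Minimal parser for comma-separated 2-tuple commands like
# "open gripper, moveto waypointA".

VERBS = {"open", "close", "moveto", "forward", "stop"}

def parse(command_line):
    # Split the line into single instructions, then each instruction
    # into a verb and an optional argument.
    parsed = []
    for part in command_line.split(","):
        tokens = part.split()
        verb, args = tokens[0], tokens[1:]
        if verb not in VERBS:
            raise ValueError(f"unknown verb: {verb}")
        parsed.append((verb, args[0] if args else None))
    return parsed

print(parse("open gripper, moveto waypointA"))
# prints [('open', 'gripper'), ('moveto', 'waypointA')]
```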
3c1 History of instruction following
The history of instruction following projects is short. After the publication of the SHRDLU project in the 1970s there was a long period in which the subject wasn't researched very much. In 2006 some effort was made to build a dedicated route instruction following system. Instead of controlling an entire robot with textual commands, the reduced goal was to navigate only in a maze.
The obvious reason why the subject wasn't analyzed with more effort is that it doesn't sound very interesting at first glance. According to the paper of MacMahon, the idea is that a speaker sends commands to a hearer (= robot), while the robot has to interpret the command and move into a certain direction. In a concrete example, the speaker may say "move north" and the robot does so. Or the speaker might say "move north for 3 cells".
Such an interaction sounds trivial because text adventures and especially the Karel the Robot programming language provide a more advanced command parser. Nevertheless there are many arguments why route instruction following might be an interesting subject. The key element is to split the problem solving task into a director (= speaker) and a follower (= hearer). A second element is that the commands are formulated in natural language. Such a combination is a rare case in Artificial Intelligence, and apart from the instruction following problem no further projects are available.
Perhaps it makes sense to explain why exactly route instruction following sounds boring at first glance. The cause is that it is not an algorithm and not an AI related library, but it has more in common with a robot challenge similar to Micromouse. The route instruction problem is first of all a problem. The assumption is that there is a maze, a robot and a speaker. A further assumption is that the speaker and the hearer communicate back and forth, and this allows to reach a position in the maze. It is up to a programmer how to implement such a protocol and which commands are needed in detail.
The route instruction problem can be seen as an example of the symbol grounding problem. It provides a concrete puzzle which demonstrates what symbol grounding is about. Grounding is a synonym for "interpreting a command" [Matuszek2012, page 1]. It is the action which is taken after the robot receives the command "move north for 3 cells".
3c2 NP hard problems vs instruction following
The symbol grounding problem can be converted into the instruction following task, which is basically a speaker-hearer language game. The promise is that such interactive games can make robots intelligent. To understand how exactly instruction following works, we first have to describe the situation around 1992, which is seen as the end of the last AI winter. The understanding during this period was that robots can't be realized at all because robot control creates an NP-hard problem, which is equal to a large state space. Even future optical supercomputers are too slow to search the large state space for the optimal action. On the other hand, algorithms based on instruction following promise to solve exactly such a problem category, so it makes sense to investigate the approach in detail.
The understanding around 1992 was that a computer consists of hardware based on CPU and RAM, and it also contains software, which is a programming language, an operating system and libraries. From a theoretical standpoint a computer system works with an algorithm which solves a certain problem. The approach during the last AI winter was to use this technology for solving robotic problems, but the attempt failed. No programming language is powerful enough for controlling robots, and inventing new languages or new algorithms failed too. Some improved AI related languages like Prolog are available, and modern graph search algorithms like Rapidly-exploring Random Tree (RRT) were designed, but they failed too. The unfixed problem is that the state space of a robot is too large.
The idea behind instruction following can be summarized as an improved heuristic. It is not based on a classical algorithm definition; instruction following is located outside of computer science altogether. This might explain why the subject was ignored over decades. It is basically a puzzle similar to Micromouse or the 15 puzzle. After the puzzle has been described, it is possible to solve it with a concrete piece of software.
The starting point of the instruction following paradigm is the trivial recognition that any robot can be teleoperated. All that is needed is a joystick and a human operator who moves the joystick. This allows the robot to grasp objects and navigate in complex environments. To automate a teleoperated robot, the interaction between man and machine has to be recorded. Instead of capturing the joystick movements, the idea is to record high level commands formulated in natural language. For doing so a certain user interface is needed, which has much in common with a text adventure from the mid 1980s. The human operator controls a character with an interface like "go north", "take lamp", and the game engine executes the commands. Such an interaction is known as speaker-hearer interaction, which forms the basis of the instruction following game.
Classical computer science until 1992 ignored man-machine interaction. The idea was that a robot needs to be programmed in software, and computer science is the only discipline which understands how to create such software, including the algorithms. In contrast, the instruction following paradigm assumes that classical computer science is obsolete and a broader perspective is needed to understand robot programming.
The core element in the instruction following game is a corpus with the recorded speech between a speaker and a hearer. The speaker will say sentences like "go north", "stop", and the hearer interprets these commands. The task of the interpreter is the core element in symbol grounding. The term grounding means "interpreting". A given statement like "go north" is translated into low level actions, which means the hearer activates his muscles to change his position in space.
The newly created corpus defines what the problem is; it contains a certain vocabulary which includes commands, questions, objects and locations, and it shows a concrete interaction between both parties. The role of computer science is to emulate the interaction in software. The human hearer has to be replaced with a computer program which accepts a high level command and generates the low level output. The resulting robot can be controlled with natural language.
The most surprising point is that the CPU demand of such an interpreter is low. The software doesn't have to traverse the state space and search for the optimal action; the correct command is given by a human speaker. The only task for the robot is to execute such a command. So it is not a classical artificial intelligence, but it has more in common with a textual interface known from the Zork I video game. Zork I was released in 1980 for the Apple II, so it has been available in mainstream computing for over 40 years.
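The grounding step described above can be sketched as a tiny interpreter that translates a high level command into low level position changes. The command grammar and the grid representation are illustrative assumptions.

```python
# Sketch of grounding: a speaker command like "move north for 3 cells"
# is translated into position changes of the hearer on a grid.

position = [5, 5]  # (x, y) of the hearer

def interpret(command):
    # "Grounding" a command means translating it into state changes.
    words = command.split()
    steps = int(words[-2]) if "for" in words else 1
    direction = {"north": (0, -1), "south": (0, 1),
                 "west": (-1, 0), "east": (1, 0)}[words[1]]
    for _ in range(steps):
        position[0] += direction[0]
        position[1] += direction[1]

interpret("move north for 3 cells")
print(position)  # prints [5, 2]
```

No search over a state space is needed; the interpreter only executes what the speaker already decided.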
3d Instruction following for a robot swarm
The ability of a robot to follow natural language instructions seems a powerful example of man-machine interaction, but the principle can be leveraged further by introducing a command based swarm. The picture shows a minimalist prototype programmed in the Python language which contains a 10x10 grid, 10 robots on top and some objects in the grid. The task for the robots is to collect the objects, but they are not allowed to decide the movement by themselves; instead it is an instruction following task.
Possible commands for the human operator are:
1. movedown
2. moveup
3. movetoobject
4. movetotop
It is important to know that the instructions reference the entire robot swarm. The command "movedown" will result in an action of 10 robots at the same time. Similar to instruction following for a single robot, there is no AI algorithm needed in the classical sense; the robots simply execute the commands. It is up to the human operator to provide meaningful actions, and it is up to the robots to parse and execute the statements.

Figure 3: Swarm simulator in Python
The idea is not completely new. The Karel the Robot environment works with a similar principle, except that Karel is a single robot, while the example in the screenshot contains a robot swarm. Perhaps it makes sense to explain the inner working of the action parser in detail. The first command, "movedown", was realized with a Python for loop which goes through the robots and changes each position one cell down. That means the logic is encoded as a Python function which occupies around 10 lines of code including the boundary check. The human operator enters the command, and the described function gets executed.
The power of the proposed system isn't provided in the source code, which can be called trivial; it has to do with the set of possible commands. By providing four and more natural language commands it is possible for the human operator to control the swarm. This allows the operator to execute more complex tasks like collecting all the objects.
For a better user experience, it is possible to map the commands to keypresses from 1 to 9, similar to a piano keyboard. This allows the human operator to control the swarm similar to playing an organ. The system is still teleoperated, that means a human gives the commands, but these commands are formulated on a high abstraction level which allows the human to control multiple robots.
Suppose the idea is that the robot swarm doesn't need human interaction but should work autonomously. Even this objective can be realized. All that is needed is a small macro-like subroutine which generates the commands by itself. This allows to start the movements with a click on run, and then the swarm collects the objects and returns to the top position. The reason why such a complex macro can be realized is the ability of the robots to execute natural language instructions. So we can say that instruction following is a practical example of the symbol grounding problem.
Figure 4: Robot swarm with possible instructions
3e The logic is hidden in the GUI menu
An instruction following robot works differently from a classical robot because of the absence of any algorithm. Even if instruction following uses some algorithms written in a programming language, it doesn't make sense to explain their inner working because they are trivial, too specific, and can be replaced by better alternatives. This makes it hard to explain how exactly such a robot works.
The answer is that not the robot is interesting but the challenge it has to solve. The challenge is the true source of logic. Let us take a look at the example screenshot which shows a modified swarm robot simulator. The new GUI element is the menu widget on the bottom right, which contains possible commands for the robot. The inner logic of the robots can be traced back to this widget.
The user can press a button, e.g. 0, and then the robots do what is written in the menu. The robots will either locate the red object, clear the object, or change their position. It is up to the human user which button he presses.
A menu can be seen as a game rule which describes which actions are available. The user can decide for one of these actions. And for this reason it doesn't make sense to explain the algorithm inside the robot, because there is no such algorithm. What is available instead is the screenshot of a grid game and the menu with possible options. Such a game can be implemented in any programming language and with any source code. The only fixed parameter is that the game rules stay the same. That means there is a 10x10 grid and exactly 6 different commands to control the robot.
But let us go a step backward to understand why it is hard to explain what instruction following is about. In classical AI there is a bottom-up principle. The idea is that a robot is equipped with an internal algorithm, the AI, and this Artificial Intelligence decides what the robot does next. This bottom-up paradigm can't be applied to this case because the robots are teleoperated and are controlled with natural language. The better attempt to understand the robots is to focus on the game in which they operate, including the controlled vocabulary they have to understand.
From an outside perspective, the robots have no AI; the decisions are made by the human operator. The GUI is shown on the display, and before a robot moves forward, the human has to press a button. So it is a normal video game like Pong, not an artificial intelligence. The new element is that the game was designed as an instruction following game, which makes it easy to automate the robots in this game.
3f The auto mode for instruction following
At first glance an instruction following robot looks a bit boring, because it is teleoperated and can't do anything on its own. But with a small modification in the software it is possible to enable an auto mode.
The auto mode built on top of an instruction following robot simply means executing the commands in a script. In the screenshot the button 6 activates this behavior. In the background a fixed sequence of actions is repeated for an unlimited time span. The swarm will act with this pattern and the actions are meaningful. At first the swarm locates the objects in the map, then it moves to the objects, then the objects get cleared and the swarm moves back to the base station to charge the battery.
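The auto mode can be sketched as a fixed command sequence executed in a loop. The command names echo the menu entries described in the text; the executor is a stub for illustration.

```python
# Auto mode: a fixed script of menu commands repeated in a loop.
script = ["findobject", "movetoobject", "clearobject", "movetotop"]

log = []
def execute(command):
    # In the real simulator each command triggers a Python method;
    # here the call is only recorded.
    log.append(command)

def auto_mode(rounds):
    for _ in range(rounds):
        for command in script:
            execute(command)

auto_mode(2)
print(len(log))  # prints 8
```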
To make the process visible, a score is shown on the screen which simply counts the number of cleared objects. Such a simulation can run forever, and the swarm will collect a very large number of objects without human intervention.
It seems that with a set of predefined actions it is much easier to automate a task, because the main program which controls the swarm uses the existing methods. Every method is listed in the menu because it is a natural language instruction to the robot. The harder task is to describe the pipeline from an AI perspective. It is a combination of teleoperation, object oriented programming, instruction following and the symbol grounding problem.
An open problem is to locate the source of intelligence. In the example there is a robot swarm which is doing a complex task, namely collecting objects in a grid. The swarm is doing this task autonomously. At the same time, there is no dedicated AI routine or algorithm written in software. The overall source code for the project has around 500 lines of code. There are classes for drawing the GUI, for sending commands to the robots and for processing keyboard inputs. So the question is: why are the robots doing a task without a dedicated program?
In the previous section a first attempt was made to address this problem. What we can say is that the project is not a classical robot; it has more in common with a robot challenge. It works similar to a Micromouse simulator or a chess simulator which asks a human to control a robot in a maze. From the development perspective, the first prototype was created with manual control. The user was able to control each robot in the grid with the up/down keys. So it was a normal video game which was boring to play. The user presses a button and the robot on the screen moves. The AI component was added as a textual input parser. The set of possible commands was formatted as a list. This list is the controlled vocabulary for the interaction between the game and the human user. So we can say that in the game there is no intelligence, but the game assumes that the human operator is intelligent. The operator has to decide which action comes next, and the operator is in charge of writing a script for the auto mode.
Let me give an example. From a technical perspective a script without any meaning can be executed in the simulator. The script simply moves the robot swarm up and down but doesn't collect any object. This script won't produce an error message; according to the parser it is valid. So it depends on the human operator which sort of script he enters, very similar to writing a script for Karel the Robot. In this game it is possible to move the robot in more directions, but there are also many possible scripts which won't result in anything useful.
Perhaps it makes sense to label the project not with "Artificial Intelligence"; it has more in common with a robot environment which works with natural language instructions. Not the robot gets embedded into the game; it is the human operator who uses the mini scripting language to program the robot swarm.

script without meaning    sense making script
0 alldown                 2 findobject
1 allup                   3 movetoobject (10 times)
0 alldown                 4 clearobject
1 allup                   1 allup (10 times)
0 alldown
1 allup

Table 4: Script for auto mode

3f1 Programming an AI with a communication board
The screenshot shows the same game as in the previous section. The blue robot on top has to collect food, and it can only move up and down but not in other directions. After collecting an item,
Figure 5: Food collecting game including communication board
self.comboard = {
    0: "up", 1: "down", 2: "findfood", 3: "movetogoal",
    4: "eatfood", 5: "findbase", 6: "auto",
}

Figure 6: Communication board as Python dict
the robot has to return to the base, which is at the top of the maze. The game mechanics were chosen because they can be realized with a few lines of code in the programming language of choice. The much harder element to explain is the AI which should control the robot.
Let us explain the situation from the end. The game including the AI is already working, and the robot gets controlled with a scripting AI. The commands shown in the menu are used inside a macro, and this moves the robot autonomously. Unfortunately, a menu with possible commands isn't available for a game by default; it has to be invented first. Suppose there are no commands available which can be executed by the robot; then it is impossible to use a scripting AI. The first step is to explain how to create possible commands for the robot.
In the example case with the food eating robot, the vocabulary was stored in a Python dictionary. Each entry has a number and a text. This reduces the complexity. Instead of programming a robot which can parse all sorts of words and sentences, the vocabulary is numbered from 0 to 6.
The difference between algorithm driven Artificial Intelligence and communication board oriented AI is that in the second case the source of intelligence is the human user. The vocabulary provides only a list of commands, and the operator has to decide which of them should be activated. Each command is mapped to a key on the keyboard.
The most obvious commands are "up" and "down", which were implemented first. They allow to move the robot in its column. The other commands are a bit more complicated to implement. They have been realized in Python source code with for loops and if-then statements, like a normal Python method.
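A command like "findfood" can be sketched with exactly such loops and if-then statements, together with the num-key dispatch into the communication board. The grid representation and handler wiring are assumptions for illustration.

```python
# "findfood" realized with a for loop and an if-then statement, dispatched
# through a comboard entry selected by a num key.

grid = [["", "", ""],
        ["", "food", ""],
        ["", "", ""]]

def findfood(grid):
    # Scan the grid row by row and return the position of the first food.
    for y, row in enumerate(grid):
        for x, cell in enumerate(row):
            if cell == "food":
                return (x, y)
    return None

comboard = {0: "up", 1: "down", 2: "findfood"}
handlers = {"findfood": lambda: findfood(grid)}

key = 2  # the operator presses num key 2
print(handlers[comboard[key]]())  # prints (1, 1)
```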
Not the written source code is important; it is the communication protocol, which is the communication board. The board including the words specifies a certain domain. It is used for man-to-machine communication. Such a board can be operated in a manual mode, by pressing a num key on the keyboard, or it can be referenced in a script so that the robot runs autonomously. Let me explain the situation from a different perspective.
To estimate what sort of tasks can be executed by the robot, it helps to read through the vocabulary list. There are commands for findfood, movetogoal, eatfood and findbase. Outside of this scope the robot can't do anything. It can't rotate because there is no such command, and it can't avoid a collision with other robots. If these features are needed, the relevant words have to be added to the communication board first, and then the Python code has to be written to execute the statements.
In general, the described workflow of creating an AI bot is organized with natural language. The English language is a tool to capture domain specific knowledge. All possible actions and events certainly have a word. To create a robot, the robot needs to understand a certain vocabulary.
3g Theoretical reason for instruction following
Similar to the symbol grounding problem, which connects language with sensory perception, the instruction following task is located within linguistics. Natural language is utilized to describe robotic actions. As with language in general, a word references a meaning outside of the word itself. For example, the word "movedown" has no internal meaning because it is a string of ASCII characters. The meaning is provided only for someone who understands the word. It is also provided in a dictionary which explains what the word is about.
This linguistic perspective might help to understand why robots which are able to parse commands differ from the classical understanding of Artificial Intelligence. In the past the assumption was that inside an intelligent robot there is an algorithm which makes the robot smart. In contrast, from a linguistic perspective the origin of meaning is located outside of the individual. The sources are dictionaries or the Gutenberg galaxy, which explain one word with lots of other words. The consequence is that the meaning which is needed for intelligence is never located inside a computer program; it is located outside of the software.
It is true that the described prototype for a robot swarm in sections 3d-3f has no built-in Artificial Intelligence. On the other side, the vocabulary used for controlling the swarm has a well defined meaning because the normal English vocabulary was used, which is spoken worldwide. That means there is no secret why the robot swarm is doing something useful: the words for interacting with the robot have a meaning.
The interesting point about the English language is that any possible action, perception or task can be described in a single word or in a short sentence. No matter if the domain is a grasping robot, a one-wheel robot or a UAV, all the possible interactions can be encoded in words. Exactly this feature makes English a general language, more powerful than a computer language like C or Java. It makes sense to utilize the vocabulary for human-to-machine interaction. The only tool needed to communicate with a robot is a short list of maybe 10 words. This word list is used to send messages back and forth between a robot and a human operator. This communication process allows to solve any task in the real world.
In section 3d a concrete example of an instruction following robot was presented, including a screenshot of the GUI. The open question is how to generalize this example to other domains. The elements of the simulation were a robot swarm plus a menu with possible instructions. Each instruction was equal to a single word, or sometimes it consisted of two words. This allows to give an abstract definition of the instruction following task:
The robot has to understand a limited vocabulary of commands.
From a programming perspective this paradigm matches closely to object oriented programming, which assumes that there is a robot object which has methods. The methods are executed from the outside and are equal to the commands from the menu.
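This object oriented reading can be sketched directly: the robot is a class, each menu command is a method, and a small dispatcher grounds the word into a method call. The class and method names are illustrative.

```python
# The robot as a class whose methods are the menu commands; grounding a
# word means invoking the method with the same name.

class Robot:
    def __init__(self):
        self.y = 0

    def movedown(self):
        self.y += 1

    def moveup(self):
        self.y = max(0, self.y - 1)

def ground(robot, command):
    # Dispatch: the natural language word selects the method.
    getattr(robot, command)()

r = Robot()
for word in ["movedown", "movedown", "moveup"]:
    ground(r, word)
print(r.y)  # prints 1
```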
3g1 Instruction Following as communication paradigm
Sending messages between two parties is seldom discussed in the context of robotics. Instruction following is the exception: it works with a speaker and a hearer who use a shared vocabulary. The reason why message passing is usually located outside of Artificial Intelligence is that a single robot doesn't create such a need.
The microcontroller in a robot is programmed in a certain programming language, for example Java. The source code contains data structures plus algorithms, but everything is encapsulated into a single program. The robot is a single unit which has no need to answer messages from the outside. The only exception to this understanding is a teleoperated robot, but teleoperation is the opposite goal of an autonomous robot, so it can be ignored.
But it seems that Artificial Intelligence can't be described this way. Encapsulating all the domain knowledge into a single instance, the AI agent, produces a high complexity. The better idea is to divide the problem into two parts: a speaker who is familiar with natural language and symbolic reasoning on the one hand, and a hearer who understands language and executes commands on the other. If such an AI model gets converted into a programming language, it is equal to defining a speaker class and a hearer class. Both classes are stored in different files and can be created by different human programmers. The classes communicate back and forth, and this produces Artificial Intelligence.
In terms of high level and low level, the speaker class is of course the high level instance which
describes a situation in linguistic terms, while the hearer class is responsible for low level actions and
low level sensor perception. The idea is that each class can delegate tasks which don't fit into its own
responsibility to the other class.
Let me give an example for a household robot. From the perspective of the speaker, the domain
consists of objects (table, drawer, stove), actions (move, open, grasp) and ingredients (rice,
potato, apple). A typical simulation game for the speaker class is a text adventure which ignores all
the details and is only about words. On the other hand, the hearer class is a 3d videogame which has
sensors, motor patterns and a physics engine. Only if both classes communicate with each other can
the needed robot intelligence be created.
There is no need to introduce more than these two instances. A speaker and a hearer are more than
capable of solving robot tasks. On the other hand, the absence of such a communication paradigm will
result in a dysfunctional robot. Such a robot remains ungrounded and has no internal communication
system, only algorithms and computer code.
3g2 Programming an instruction following robot from scratch
Classical software engineering is built around libraries. A typical question of a programmer is which
sort of library is needed to solve a task. Then the concrete functions in this library are used
to implement a certain design pattern. Additional tools like a version control system and iterated
prototyping help to develop any sort of software.
Unfortunately this paradigm can't be adapted to AI applications, because there is no such thing
as an AI library. This makes it hard to describe what the workflow for creating autonomous
robots should look like. To overcome the obstacle, there is a need for a paradigm shift. AI related projects are seldom
located within classical computer science but have to be seen from a different perspective.
The symbol grounding problem, including instruction following robots, is not the answer to a
problem; it is the problem itself. What is needed is never the source code which simulates a
game; the task is to invent a game. The term instruction following refers to a certain sort
of games which have much in common with point&click adventures from the 1980s. There is a visual
scene representation plus a textual menu. The menu allows to activate actions in the game.
It's a good starting point to implement any sort of game as a Maniac Mansion clone. The
textual menu is needed because the focus is on man machine communication. In contrast
to normal video games, which have a strong focus on haptic input devices like a mouse, the focus in
instruction following is on textual interfaces.
The reason for this preference is that textual commands can be recorded while mouse movements
cannot. For example, a command sequence like “moveto table, grasp apple, moveto drawer, open drawer”
can be executed in a video game and at the same time the sequence can be dumped into a text file. The
ability to store commands in a text file allows to construct high level macros from these commands
which can run autonomously on the robot.
In other terms, instruction following means to add a scripting module to a video game. The
scripting module provides a language for requesting sensor data and executing actions. A typical script
written in such a language would be “if no_obstacle then move(forward)”. The precondition for formulating
such a high level statement is that the robot can parse commands like “no_obstacle” and “move()”.
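A minimal Python sketch of such a scripting module. The names no_obstacle() and move() follow the example above; the grid positions and the obstacle location are invented assumptions for illustration:

```python
# Sketch of a scripting module: the script "if no_obstacle then move(forward)"
# becomes an ordinary conditional over the robot's parsed commands.

class RobotScripting:
    def __init__(self):
        self.position = 0
        self.obstacle_at = 5  # assumed obstacle location on a 1d track

    def no_obstacle(self):
        # sensor request: is the next cell free?
        return self.position + 1 != self.obstacle_at

    def move(self, direction):
        # action request: advance one step
        if direction == "forward":
            self.position += 1

robot = RobotScripting()
# the script "if no_obstacle then move(forward)" in Python form:
if robot.no_obstacle():
    robot.move("forward")
print(robot.position)  # robot advanced one step, prints 1
```

The point of the sketch is that the scripting layer only needs the two parsed commands; everything else is ordinary control flow.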
3g3 Increasing the automation level with natural language
An example for a low amount of automation is a teleoperated robot. The human operator is forced to
provide every detailed action to the robot. The robot will stop working if the human steps away from the
panel. The reason why teleoperation is so common is that it allows to control robot arms without
programming them. There is no need to write advanced software; the human operator is in
charge of controlling the system in realtime.
Increasing the automation level is equal to recording a demonstration and playing it back later. Recording
a trajectory is not possible with classical teleoperated arms. The best way to record something is by
using natural language events and actions. If the interaction between human and robot works with
words, it's possible to store these words together with a timecode in a csv file.
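A small sketch of such a recording, assuming a hypothetical session format with timecode, role and word columns:

```python
import csv

# Hypothetical sketch: store the spoken command words of a session
# together with a timecode (seconds) in a CSV file.
def record_session(events, filename="session.csv"):
    # events: list of (timecode, role, word) tuples
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["time", "role", "word"])
        for t, role, word in events:
            writer.writerow([t, role, word])

events = [(0.0, "speaker", "moveto"), (2.5, "hearer", "done"),
          (3.0, "speaker", "grasp"), (5.1, "hearer", "failure")]
record_session(events)
```

Played back later, such a file is already a machine readable macro of the demonstration.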
Unfortunately, a classical joystick doesn't understand words, and pressing the joystick slightly
forward won't generate a sentence. What is needed is a different sort of teleoperation interface. Such an
interface is created in a two step process. The first step is to create a dialogue between a hearer and a
speaker; secondly, the language corpus is extracted from this dialogue. Let me give an example.
Suppose the task is to control a robot in a maze. A possible dialogue would sound like:
Speaker: locate the next object
Hearer: done
Speaker: move towards the object
Hearer: done
Speaker: grasp the object
Hearer: failure
Speaker: move back to base station
Hearer: ok
In the dialogue a certain vocabulary was used: (moveto_object, grasp, moveto_basestation,
failure, ok). All the words have to be implemented in software. The idea is that a real robot can
communicate the same way. That means the object oriented class in the software which represents the
robot has methods like “moveto()” and “grasp()” which can be executed. The methods allow to model the
interaction between a speaker and a hearer.
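A sketch of this vocabulary as a class; the method names mirror the dialogue above, while the internal state and the "done"/"failure" return values are illustrative assumptions:

```python
# Sketch: the dialogue vocabulary (moveto_object, grasp, moveto_basestation)
# implemented as methods of a robot class.

class MazeRobot:
    def __init__(self):
        self.holding = None
        self.location = "start"

    def moveto_object(self):
        self.location = "object"
        return "done"

    def grasp(self):
        # grasping only works when standing at the object
        if self.location != "object":
            return "failure"
        self.holding = "object"
        return "done"

    def moveto_basestation(self):
        self.location = "base"
        return "done"

robot = MazeRobot()
print(robot.moveto_object())       # done
print(robot.grasp())               # done
print(robot.moveto_basestation())  # done
```

Each speaker utterance becomes a method call, and each return value is the hearer's answer in the dialogue.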
3g4 An instrument panel for man machine interaction
In the past, Artificial Intelligence was mostly imagined from an algorithm perspective but seldom
explained as interactive intelligence. Interaction means that a human operator sends commands to the
robot. For doing so a certain panel is needed. Such a panel looks like a piano with additional status
LEDs. In comparison to other concepts in computer science the proposed control panel looks trivial. It's
simply a widget with labeled buttons and lamps.
At the same time such a panel is the fundamental building block in advanced robotics. It allows to
encode the communication between a human and a robot. The human has predefined commands which
can be sent to the robot, and the robot can answer the requests with predefined sensor signals. The
system has much in common with an engine order telegraph used in old ships for communicating between
the bridge and the machine room.

Figure 7: control panel of a robot (numbered action buttons 0-3 and sensor LEDs 0-3)

The main idea is to describe artificial intelligence not from a robot perspective, but as a communication
protocol. There is a need to send messages back and forth between the speaker and the hearer. This
communication is equal to the generated intelligence.
The described control panel has the advantage that from a technical perspective it can be realized easily. Each action has a number and each LED has a number too. The
robot can receive a command id from the human, and the robot can activate a sensor id in response. At the
same time, the commands are not random: each button is labeled with a text message. For example,
button1 means “rotate left”, button2 means “stop” and so forth. The meaning is only available for the human.
From a technical perspective the label has no function.
Let me give an example. Suppose the robot receives the command “0”. This command asks the
robot to rotate to the left. The robot does so, and if the operation was completed it responds with
a status message “sensor0=on”, which means “ok, command was executed”. Suppose the robot has
detected an obstacle 10 pixels in front; then the robot will send the message “sensor2=on”. Sensor2
stands for “obstacle detected” and the LED in the front panel gets activated. This allows the human
operator to decide what to do next. Perhaps he will press the button “stop”.
From a mathematical perspective the communication can be encoded in feature vectors: action=[0,0,0,0], sensor=[0,0,0,0]. The vectors themselves have no meaning; it depends on the labels next to each
entry how to interpret the signals. Because of this the communication is distributed between two
parties. The robot alone can't recognize what to do next.
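The panel encoding can be sketched in a few lines; the concrete labels and the response mapping are invented assumptions, only the numbered buttons and LEDs follow the description above:

```python
# Sketch of the control panel: numbered actions and sensors,
# with labels that carry meaning only for the human operator.

action_labels = {0: "rotate left", 1: "rotate right", 2: "stop", 3: "forward"}
sensor_labels = {0: "command executed", 1: "battery low",
                 2: "obstacle detected", 3: "goal reached"}

action = [0, 0, 0, 0]  # feature vector of pressed buttons
sensor = [0, 0, 0, 0]  # feature vector of active LEDs

def press_button(button_id):
    action[button_id] = 1
    # assumed robot behavior: it acknowledges the command
    # by activating sensor 0 ("command executed")
    sensor[0] = 1

press_button(0)
print(action, sensor)  # [1, 0, 0, 0] [1, 0, 0, 0]
```

The vectors alone are meaningless; only the label tables next to them tell the human what was said.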
3g5 Communication based AI
In contrast to existing programming techniques like finite state machines, behavior trees and neural
networks, a communication based AI assumes that the intelligence is located outside of the robot. Every
robot gets teleoperated by a human, and communication based AI puts a high focus on the man machine
interface, which is mostly textual.
In other words, it's not a true AI but a normal videogame in which a human makes the decisions. The
decision making process is simplified to selecting an action from a dropdown menu. Let us investigate
the concept of a GUI in detail. A typical Windows based GUI consists of a menubar at the top
of the window and a status bar at the bottom. The user is asked to select an action, for example “File
-> Open”, and then something will happen. Sometimes feedback is shown in the status bar to
inform the user about possible problems.
It's for sure that a window in an operating system isn't intelligent. All the menus are preprogrammed
into the software. The computer has to render the window for the user. The assumption is that robotic
control can be realized with the same paradigm.
3g6 Programming an instruction following robot step by step
At the beginning, the robot has no vocabulary, so the dictionary is empty. At first, only basic commands
like up, down, stop are added to the dictionary. The dictionary is shown on the screen and for each
entry a Python function is created.
The function names are encoded in the dictionary a second time for two reasons:
first, each command is mapped to a number, and secondly, the dictionary can be shown
on the screen. This visual appearance is important because the primary interaction with the robot is
manual. The human operator has a menu with possible commands, and then he can press a button on
the keyboard.
If the basic commands (up, down, stop) are working, dedicated sensory commands can be implemented:
(isobstacle, distancetogoal, battery). These sensory commands won't do anything visible, but
they measure values and store them in the robot's class. The return value 0 signals that the
measurement was successful.
Suppose the human operator has created around 10 different commands to control the robot and
measure something. These commands are used to interact with the robot and fulfill tasks. For example,
the robot can be commanded to go to a certain waypoint, pick up an object and return to base. In the
last step, the manual interaction with the robot gets automated by writing a script. The script consists
of command calls which are already there, e.g. a sequence would be [0,4,1,1,3]. This macro will run
autonomously.

    class Robot:
        def __init__(self):
            self.comboard = {
                0: "up",
                1: "down",
                2: "stop",
            }

        def up(self):
            return 0  # success (1 = failure, 2 = other)

        def down(self):
            return 0  # success

        def stop(self):
            return 0  # success

Figure 8: source code for instruction following robot

The most important element in the source code isn't the executable code but the dictionary which is
used for human to machine communication. Each entry has an id which is important for the computer
and a text label which is important for the human. The idea is that the human describes domain
knowledge in natural language words. For example, he would say that at first the robot should go to a
waypoint, then pick up the object and then do other tasks. These English sentences won't make any
sense to a robot. The only language a robot speaks is numbers. So the English sentences have to be
converted into command codes. In the interactive mode, the human operator (=client) sends command
codes to the robot (=server), which are answered with a numerical response code, e.g. 0=success. So
the interaction works by sending numbers back and forth.
One possible concern against an instruction following robot is the absence of a communication
protocol. In contrast to computer related data communication protocols like TCP/IP or RS-232,
there is no standard available for the possible commands. So the protocol has to be invented from
scratch, and each domain has a different list of commands and response codes. This situation increases
the complexity. Instead of simply implementing an existing protocol in software, the first step is to
design the protocol itself.

    time  request (human = client)  response (robot = server)
    0     0                         0
    1     4                         0
    2     1                         1
    3     1                         0
    4     3                         2
    5     2                         0
    6     3                         0

Table 5: Sending command codes to the robot
3g7 Robot control with a protocol
Printers are usually controlled with a printer protocol, for example Epson ESC/P. The protocol is defined in a table which provides different
commands for selecting a font, requesting status information and feeding the paper forward. The software which
sends the commands to the printer is called a driver.
The principle can be adapted to robotics programming too. The idea is that the robot is a printer and
the communication is realized with a protocol. The commands depend on the domain. A line following
robot, for example, needs commands for detecting light on the ground and for rotating left
and right. The main difference in robotics is that there is no standardized language available; the
protocol has to be invented from scratch.
In contrast, there is no need to program an artificial intelligence. What is called an AI is a normal
driver which sends the commands to the robot. The logic of the robot is defined in the protocol. Complex
robots like a biped robot will need a more advanced command protocol.
Let me give an example. Suppose there is a robot protocol available with 30 different commands.
The driver sends the following command sequence to the robot: [4,1,12,11,2]. This sequence can be
roughly translated into: [start, getstatus, calibrate, searchline, forward]. It should be mentioned that a
protocol is created as a communication tool between two parties. One device sends a command,
and the other device receives the command. The question is not how to program an algorithm on a
single device; it's an interactive process between two parties. For reasons of simplification, a protocol
allows to teleoperate the robot, except that the teleoperation doesn't work with a joystick but with the
protocol, which grounds the communication in words.
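A minimal sketch of such a driver for the example sequence; the protocol table and the response behavior are invented assumptions, only the command codes [4,1,12,11,2] follow the example above:

```python
# Sketch of a "driver" for a line following robot with a
# hypothetical protocol table of numbered commands.

protocol = {1: "getstatus", 2: "forward", 4: "start",
            11: "searchline", 12: "calibrate"}

def robot_receive(command_id):
    # the robot side only sees the number, not the label
    return 0 if command_id in protocol else 1  # 0 = success, 1 = failure

def driver_send(sequence):
    # the driver sends each command code and collects the responses
    return [robot_receive(cid) for cid in sequence]

print([protocol[c] for c in [4, 1, 12, 11, 2]])
# ['start', 'getstatus', 'calibrate', 'searchline', 'forward']
print(driver_send([4, 1, 12, 11, 2]))  # [0, 0, 0, 0, 0]
```

The driver and the robot only exchange numbers; the labels in the protocol table exist for the human who designs it.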
3g8 Object oriented programming
One possible reason why instruction following and teleoperated robots aren't very popular is that they
seem to offer no advantage over object oriented programming (OOP). OOP usually means to create an
object (e.g. an agent) and then attach methods to this object like “openhand()”, “standup()” etc. The
surprising situation is that the modern term “instruction following” means basically the same: the
robot is capable of running some predefined methods.
Roughly spoken, classical OOP and command based natural language instruction are the same. In
both cases, a single object receives messages from the outside, which might be a human with a GUI
interface or a main program which runs a script. Since object oriented programming
has become mainstream computer science, the assumption is that there is no need to reinvent the wheel.
On the other hand, the newly published papers about instruction following are completely different
from former papers about object oriented programming, so there must be a difference. One
possible advantage is that instruction following is closely related to knowledge modelling, while OOP
is about the internal structure of a computer program. OOP related programming languages like C++
and Java were created with the goal to simplify the programming itself, which is typing in the source
code and bugtesting an application. In contrast, dialogue based instruction following has its roots in the
symbol grounding problem, which tries to encode robot commands in a dictionary.
In other words, OOP has its origin in computer science, while instruction following is closely
related to natural language processing.
3h Instruction following with speaker and hearer
The communication process between speaker and hearer is very different from the existing understanding
of robot programming. The classical assumption in the past was that a robot is an embedded agent
which runs an intelligent algorithm. In contrast, the speaker hearer model assumes that the task of robot
control consists of two different layers which work independently from each other.
The task of the speaker is to provide a sequence of useful actions. From a technical perspective
it's equal to writing a script. For example, to navigate a robot from start to goal a certain script is needed
which makes sure that the robot has enough energy, avoids the obstacles and moves in a certain
direction. Apart from this speaker related task there is a second task. The hearer (=listener)
has to make sure that the instructions are executed. For example, the command “move north” has to be
translated into control actions for the servo motor.
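The two layers can be sketched as two classes; the concrete script and the mapping from instructions to servo values are invented assumptions:

```python
# Sketch of the two-layer model: a Speaker produces high level
# instructions, a Hearer translates them into low level motor values.

class Speaker:
    def plan(self):
        # high level script for reaching the goal
        return ["check energy", "avoid obstacle", "move north"]

class Hearer:
    # assumed mapping from instruction to (left, right) servo values
    MOTOR = {"check energy": (0.0, 0.0),
             "avoid obstacle": (0.5, -0.5),
             "move north": (1.0, 1.0)}

    def execute(self, instruction):
        return self.MOTOR.get(instruction, (0.0, 0.0))

speaker, hearer = Speaker(), Hearer()
for instruction in speaker.plan():
    print(instruction, "->", hearer.execute(instruction))
```

The two classes could be written by different programmers; only the shared instruction vocabulary connects them.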
The communication between a speaker and a hearer is realized with a grammar. The BNF grammar
defines a domain specific language. It's surprising that such a robot control paradigm has been
discussed in the literature since the 1980s. The problem with DSLs for robot control was, and is even
today, that it's very complicated to invent such a language from scratch. This might explain why there
are only a few working examples available.
Instead of rejecting the concept altogether, it makes sense to figure out a more efficient strategy to create
a domain specific language. One possible attempt in doing so is to divide the problem further. There is
a dataset which stores a speaker/hearer interaction, and there is an implementation
which is the source code in a programming language. Let us focus on the first element in detail.
A speaker/hearer dataset stores the gamelog with textual interaction. It records the natural language
between two human operators. One of them gives the commands, the other executes them. It's some
sort of motion capture recording, but with two persons at the same time. Perhaps it makes sense to
describe the situation from an outside perspective.
37
dataset
domain
specific
language
speaker
hearer
speaker
hearer
Figure 9: speaker hearer dataset
There is a normal mocap studio. The cameras track the markers of the human actors. In
addition, there is a room in which the speaker sits. Both persons are connected with a microphone, and
everything is recorded with a computer. The speaker says something, e.g. “jump”, and the mocap actor
has to fulfill the request. The mocap actor can return a response if he likes, like “ok”. The result of the
interaction is fed into a dataset, which is basically a time based table with different columns.
The only purpose of the mocap recording is to generate a dataset. The dataset has some properties:
it contains natural language instructions, it has a time code and, very important, it stores the mocap data.
So it's a multimodal speaker hearer dataset.
The dataset is used in a second step to create a domain specific language. The task is to translate
the dataset into a BNF grammar which can be parsed by a computer. The natural language instructions
would overwhelm a computer program, so they need to be converted first into a controlled vocabulary of less
than 20 words. Complex sentences like “could you please jump one time” are converted into machine
readable commands like “jump()”.
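A sketch of this reduction step; the keyword table is an illustrative assumption for a controlled vocabulary of a few words:

```python
# Sketch: reduce free natural language to a controlled vocabulary
# by scanning the sentence for known keywords.

vocabulary = {"jump": "jump()", "sit": "sit()", "walk": "walk()"}

def to_command(sentence):
    # pick the first known keyword from the sentence
    for word in sentence.lower().split():
        if word in vocabulary:
            return vocabulary[word]
    return None  # sentence not covered by the vocabulary

print(to_command("could you please jump one time"))  # jump()
```

Everything outside the vocabulary is simply dropped; this is what keeps the parsing problem tractable for a computer.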
3i Behavior tree based instruction following
In the minimal case, instruction following means that the human operator selects a command, e.g.
“moveto waypointA”, and the robot executes the command. Such kind of interaction seems
interesting, but in a technical sense the robot is controlled by teleoperation and can't solve problems by
itself. There are two possible techniques to overcome the challenge: scripting AI and behavior
trees. A behavior tree stores a longer action sequence in a hierarchical fashion, while scripting AI means
basically the same, but the notation is written down as computer source code.
Perhaps it makes sense to give an example. Suppose the vocabulary for a maze robot consists of
the commands from the table. Then it's possible to concatenate the commands into a sequence like
[2,3,4] or [0,5,4,1]. A behavior tree allows to formulate such a sequence. Executing the behavior tree
will generate the commands which are sent to the robot.
It's a bit complicated to explain where exactly the intelligence of a robot is hidden, because both
techniques, the behavior tree and the command vocabulary, are technically trivial. The advantage
has to do with the abstraction mechanism. The ability to send a command to the robot improves
the interaction with the robot. Instead of determining the servo signals, the human operator selects
only a command id and the robot executes the action. In addition, the behavior tree provides an
additional layer which encodes a longer sequence of possible actions. The result is a robot which looks
like a teleoperated robot, but no human is needed; the behavior tree is the source of the commands.

    id  description
    0   moveto waypointA
    1   direction north
    2   pickup object
    3   release object
    4   moveto waypointB
    5   direction south

Table 6: Command vocabulary
The inner working will become clearer with the attempt to improve the robot system. Suppose
the robot should solve a different task. Then a new behavior tree is needed. If the entire domain is
different, the command vocabulary has to be modified too. So we can say the current commands plus
the behavior tree determine the limits of the robot.
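A minimal sketch of a sequence node which emits command ids to the robot; the vocabulary mirrors Table 6, while the success codes are an assumption:

```python
# Sketch: a behavior tree sequence node over the command vocabulary.
# Executing the tree generates command ids which are sent to the robot.

vocabulary = {0: "moveto waypointA", 1: "direction north", 2: "pickup object",
              3: "release object", 4: "moveto waypointB", 5: "direction south"}

class Sequence:
    def __init__(self, children):
        self.children = children  # list of command ids

    def tick(self, send):
        # execute children in order; abort on the first failure
        for command_id in self.children:
            if send(command_id) != 0:
                return "failure"
        return "success"

log = []
def send_to_robot(command_id):
    log.append(vocabulary[command_id])
    return 0  # assumed: every command succeeds

tree = Sequence([2, 3, 4])
print(tree.tick(send_to_robot))  # success
print(log)  # ['pickup object', 'release object', 'moveto waypointB']
```

Changing the task means only swapping the child list; changing the domain means rewriting the vocabulary.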
3i1 From commands to behavior trees
Even if behavior trees are a common tool in creating game AI characters and are explained in many
tutorials, they are only one among many possible approaches in Artificial Intelligence. It's not because
they have a disadvantage, but because it's complicated to explain what exactly a behavior tree is about. From
a technical perspective, behavior trees are a graphical representation of a computer program. They
have much in common with scripting AI. The programmer has to write down a sequence of actions
which can have sub actions, and this program gets executed. Writing a behavior tree with a textual
programming language like Lua or Python is possible, and this increases the confusion, because
writing a Python script and creating a behavior tree is the same.
Nevertheless, behavior trees are a powerful AI tool and they are used frequently in videogames.
So there is a need to explain the advantages more precisely. The basic element of a behavior tree is a
node which stands for an action. Such a node is equal to a command given to a robot, e.g. “moveto
waypointA”. If the node was executed, the next node can be executed. The open question is: what
exactly is a command which is encoded as a node? This question is indeed hard to answer. It has
nothing to do with the inner working of a robot; a command is a speech act sent from the human to
the robot.
The interesting situation is that any possible node has to be grounded in advance. The robot needs
to understand what the meaning of “moveto()” is. Before a certain behavior tree can be executed on a
robot, all the possible node commands have to be implemented in a robot language. Such a language is
the underlying concept which explains why the approach is powerful. A behavior tree is only a concrete
program which takes advantage of the implemented commands. The real power has its origin in the
single commands and the ability of the robot to understand them.
Let me give an example. Suppose a robot can be controlled with the following commands:
moveto(), pickup(), release(), rotate(), speed(), stop(). Then it's possible to formulate complex actions
by concatenating these commands. The action sequence can be visualized in a behavior tree, it can
be written down in a textual program, or the commands can be executed interactively by the human
operator.
Domain specific languages are used to ground natural language commands into actions. A DSL is the
computer program needed to understand a command like “moveto()”.[Howard2022] So we can say
that not the behavior tree itself is the source of intelligence, but the ability to translate an action node
into an action. This allows to control robots on a higher abstraction level. Instead of using a joystick to
move the robot forward, a textual command is entered.
3i2 Behavior tree as dialogue games
One possible cause why behavior trees were difficult to understand in the past is that they were explained
from a computer perspective. The idea was that a behavior tree is a C++ library to create an agent.
The problem is that a behavior tree is located above existing computer software. A more precise
explanation is that a behavior tree is a language game between two humans: a speaker and a hearer.
This language game can be implemented on a computer, but there is no need for doing so.
Let me describe a certain situation. Suppose the task is to prepare a meal in the kitchen. The
described language game has to be adapted to this situation. The speaker might say “go to the table and
take some ingredients. Then go to the stove and put the ingredients into the stove”. In response the
hearer of the game might say things like “Acknowledged”, or he might ask back “Which ingredients
exactly?”. The overall interaction game works by sending natural language statements between both
actors in the game.
A behavior tree is only the computerized version of such a game. The dialogue is reformulated
in actions, and the single actions are connected into longer sequences. This paradigm allows to control
the robot. But let us go back to the dialogue. The idea is that speaker and hearer use the same
vocabulary. All the possible commands are numbered from 0 to 99. This ensures that both actors
understand each other. It formulates a discourse space in which possible actions have a position.
This concept has much in common with a domain specific language, which is a computer representation
of a vocabulary.
3i3 Dialogue based teleoperation
The main advantage of constructing an interactive speech related game is to make the natural language
visible, which includes the ability to record it. In a normal single user teleoperation system the human
operator remains silent while he controls the robot arm. The silence is needed because the operator
needs to increase his awareness of the situation. Any sort of noise would distract from the robot control.
Unfortunately the silence makes it hard to reproduce the task. Even if the human operator is asked
to think aloud, the amount of generated speech isn't enough to get a deep understanding of the situation.
The elaborated attempt at capturing the natural language, including the hidden domain knowledge, is
a dialogue oriented teleoperation which consists of a speaker and a hearer role. The speaker gives a
command like “grasp the apple, please” while the hearer controls the robot and gives feedback like
“done” or “there is no apple”. Such a dialogue can be recorded into a text file, and it's important to extract
the relevant vocabulary, for example all the nouns and all the action words.
This extracted vocabulary helps a lot to model a domain. The recorded dataset formulates a
challenge known as instruction following. This is the reverse interaction: there is a dialogue available
as input and the robot has to execute the actions. That means the robot has to imitate either the speaker
or the hearer from the dataset. The surprising insight is that such an imitation game can be fulfilled by
a computer, especially if the natural language is equal to a controlled vocabulary which consists of 10
actions and 4 nouns.
So we can say the robot domain gets converted into a linguistic dataset which allows to program
and benchmark a robot. The linguistic dataset acts as an abstraction mechanism which encodes
a problem into a machine readable format.
3i4 Voice commands and behavior trees
Voice commands for robot control have been known for years in robot programming, but apart from simple
demonstrations they have never become very popular. The first problem is that such a robot isn't perceived
as an autonomous robot, and secondly it seems that such robots can't solve a longer task.
The surprising situation is that with a simple modification, namely a behavior tree, both issues can
be overcome. A behavior tree is a longer and autonomous variant of voice control. Instead of interacting
with the robot in a dialogue, the commands and possible outcomes are scripted in a computer program
which is played back on the robot. The resulting system is similar to the Karel the Robot programming
language, which is also a powerful scripting technique. A Karel robot can be programmed to do anything.
It's possible that the robot will move around, or it will put some objects into the map. The robot is
limited only in two ways: first by the script and secondly by the amount of possible commands.
A voice controlled robot is used to test and implement a vocabulary. It allows to recognize which
commands are needed and to evaluate if the robot understands the commands. It's a prestep to programming
a longer script.
Let me give a concrete example. Suppose the robot should cook a meal in the kitchen. For doing so
the robot needs a list of possible commands like “open drawer”, “take object”, “cut food”, “clean the
table”. Every command is a subset of the overall activity. A meal can be prepared by combining the
actions into a longer sequence.
40
command
ID
0
1
2
3
Text
Speaker: Go to north please.
Hearer: Ok.
Speaker: Is there an obstacle?
Hearer: No
Tagging
command, direction
response, ack
question, lidar sensor
response, no
direction
no
response
ack
question
lidar
Table 7: Dialogue dataset and connectionist encoding
3j Connectionist model of dialogue
A wizard of oz experiment has the aim to create a dataset which contains a natural language dialogue.
The speech between a hearer and a speaker is recorded with the attempt to extract commands, action
words and object names from the utterances. The next logical step is to convert the dialogue dataset
into a computer program which emulates the roles of the speaker and hearer. This translation can be
realized with a connectionist model which consists of input and output neurons.
The main advantage is that in a connectionist model the information is represented in an atomic
structure which makes it easier to process the information with a computer. The table “Dialogue
dataset” provides a simple example in which the speaker gives an order to the hearer. Both parties
communicate in normal natural language, which is easy to generate for humans but difficult to
understand for a computer. A possible intermediate representation is provided by tagging the speech
acts. These tags can be converted into neurons of a connectionist model.
The rough idea is that during the dialogue between speaker and hearer one or multiple neurons in
the connectionist model get activated, which allows to capture the interaction with mathematical
precision. In contrast to a natural language sentence which is stored as a string, a single neuron can be
either on or off and might be stored in a feature vector.
There is a similarity to a communication board which is used in language experiments. A communication board is a series of word buttons which can be activated or not, and this simplifies the
communication. Or at least it allows a computer to understand the meaning easily.
Now its possible to solve the original problem which is how to emulate two humans which are
doing the dialogue. The connectionist model has to activate the same neurons like the human actors
at the right moment. Such a behavior can be learned with a neural network algorithm. This allows to
replace humans with machines.
Assigning possible answers to predefined nodes in a graph isn't completely new; it is used frequently
in dialogue design. A dialogue usually consists of a text field, e.g. “are you sure that the program
should be ended?”, and a list of possible options like “yes, no, cancel”. This paradigm allows
formulating the interaction in a precise form. The user can't answer with free text but has to select a
button. In the program, the possible button events are processed into a certain behavior of the software.
This allows a GUI application to be created and debugged.
In the case of instruction following for human-to-robot interaction the same principle can be adapted.
The dialogue gets transformed into a predefined graph of question/answer pairs and each possible
interaction will produce an event. For example the question “is there an obstacle?” can be answered
with yes, no or “don't know”, and each possible answer results in a predefined situation. In other words,
possible bugs of the robot program can be traced back to missing dialogue modeling. If the robot
answers a question wrongly, the problem can be analyzed and fixed. This ability may explain why the
“instruction following” task is an interesting problem in robotics. It usually means sending a series of
questions and commands to the robot and benchmarking whether the robot is able to execute the
commands. In the best case the robot can understand 100 different commands which consist of object
names, possible locations, and obstacle-finding questions.
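Such a predefined question/answer graph can be sketched as a plain lookup table. The question, answer set and event names below are hypothetical, chosen only to mirror the obstacle example in the text.

```python
# Predefined dialogue graph: each question has a fixed set of allowed answers,
# and each answer triggers a named event, similar to button events in a GUI.
DIALOGUE_GRAPH = {
    "is there an obstacle?": {
        "yes":        "event_replan_route",
        "no":         "event_continue",
        "don't know": "event_request_sensor_scan",
    },
}

def handle_answer(question, answer):
    """Return the event for an answer; an unknown answer exposes a modeling gap."""
    options = DIALOGUE_GRAPH[question]
    if answer not in options:
        # A wrong answer is traced back to missing dialogue modeling.
        return "event_dialogue_gap"
    return options[answer]
```

Because every interaction maps to a named event, a wrong robot response can be traced to a specific missing entry in the graph and fixed there.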
Figure 10: Ship steering control with commands
An instruction-following task is the opposite of the former understanding of Artificial Intelligence,
which was dominated by the goal of implementing a certain amount of onboard intelligence. In the
past, most robotics projects were created with the goal that the robot itself is able to think and execute
an action. In contrast, the instruction-following task assumes that the robot has to interact with a human
operator who asks the questions and gives the commands. The only task for the robot is to fulfill
these requests.
4 Practical demonstration
4a Creating a cleaning bot with a dialogue
Suppose there is a grid and the task for the robot is to clean up a certain cell which is highlighted in
red. Such a robot can't be programmed directly, because the inner workings of the robot are either
unimportant or trivial. The better way to describe the behavior is to assume an abstraction layer in the
form of a natural-language dialogue. There is a virtual speaker outside of the maze who is talking to the
robot. And the language sent back and forth has to be converted into robot commands and into a
behavior tree.
The described dialogue doesn't belong to the internal Artificial Intelligence of a robot agent, but it is
part of the game description. There is the mentioned grid game which has a grid of 10x10 cells, a robot
which can move in 4 directions and the speaker-hearer dialogue. With these constraints it's possible to
program the robot. The robot has to act inside this rule system. And if the robot isn't able to solve the
task, the dialogue has to be extended. That means more question-and-answer pairs are needed which
reflect additional knowledge encoded in natural language.

speaker                         hearer
What position has the object?   (3,5)
movetoX()                       ok
Isobstacle()                    No
movetoY()                       ok
Cleanobject()                   Failure
returntobase()                  ok
yourbattery()                   54%

Table 8: Speaker hearer interaction for robot cleaning domain

For reasons of simplification we can assume that a speaker-hearer dataset is the knowledge layer for a
robot. It reduces the state space drastically. The robot in the maze isn't solving a path planning problem
and it's not solving a reinforcement learning problem; instead, the robot produces and understands
natural language.
Natural language is a collection of nouns, action words and speech acts like a question or an answer.
For example the robot might answer with a simple “failure” or it can say “Sorry, but I can't clean the object”.
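The speaker-hearer exchange of Table 8 can be sketched as a small Python stub. The canned responses simply replay the table; a real robot would query its sensors, and the class name is illustrative.

```python
# Minimal hearer stub for the cleaning domain of Table 8.
# The responses are canned to mirror the table; a real robot would
# answer from its sensors and actuators instead.
class CleaningHearer:
    def __init__(self):
        self.battery = 54  # percent

    def execute(self, command):
        """Map a speaker command to a natural-language style response."""
        if command == "Isobstacle()":
            return "No"
        if command == "Cleanobject()":
            return "Failure"
        if command == "yourbattery()":
            return f"{self.battery}%"
        return "ok"   # movetoX(), movetoY(), returntobase(), ...

hearer = CleaningHearer()
dialogue = [(cmd, hearer.execute(cmd))
            for cmd in ["movetoX()", "Isobstacle()", "Cleanobject()", "yourbattery()"]]
```

The recorded `dialogue` list is exactly the kind of speaker-hearer log the paper proposes as the knowledge layer for the robot.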
The main advantage of a language-based interaction is that it can be debugged quite easily. Even a
non-programmer can interact with the robot and will recognize in exactly which situation the robot
doesn't fulfill the request.

Figure 11: Screenshot of gripper task
4b A grammar controlled gripper robot
A movable gripper in a box2d world has to pick&place a ball. This is realized with a simple vocabulary:
action ::= stop | rotleft | rotright | forward | open | close
The human operator can select one of these actions from the menu and this will move the robot.
Until now the situation doesn't sound very advanced and there is no advanced Artificial Intelligence
implemented. Apart from the mentioned BNF grammar no further libraries or functionality were
implemented in the prototype which is shown in the figure.
In other words, the robot gripper is controlled manually by pressing buttons on a keyboard. The
surprising situation is that the given commands can be executed in a script, so it's possible to program
the robot in an autonomous fashion. The situation is the same as in the “Karel the Robot” example,
except that the robot world is not a simple grid maze but a box2d physics engine.
Let us investigate how a typical robot program might look. What the gripper can do is: forward for
5 seconds, closegripper, rotleft for 5 seconds, forward.
Such an action sequence will grasp the ball and bring it back to the start. In contrast to a famous myth,
a teleoperated robot can be automated easily; the only requirement is that a vocabulary is available,
which was given in the beginning as a BNF grammar. The programmer has to design a word list (in
the example it consists of 6 actions) and then a script references these words. After executing a
single action, for example rotleft, the robot is doing something. In the concrete case, rotleft triggers
a Python code segment which modifies the angular velocity in the physics engine. The human operator
doesn't need to know the details; it's enough to recognize that the button will rotate the robot.
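The dispatch from action words to motor commands can be sketched as follows. The setter functions stand in for the real physics-engine calls, which are not reproduced here; the velocity values are invented placeholders.

```python
# Dispatch table from the BNF action vocabulary to motor commands.
# set_angular_velocity / set_linear_velocity are stand-ins for the real
# box2d calls; they only record the state so the dispatch stays visible.
state = {"angular": 0.0, "linear": 0.0, "gripper": "open"}

def set_angular_velocity(v): state["angular"] = v
def set_linear_velocity(v):  state["linear"] = v
def set_gripper(mode):       state["gripper"] = mode

ACTIONS = {
    "stop":     lambda: (set_angular_velocity(0.0), set_linear_velocity(0.0)),
    "rotleft":  lambda: set_angular_velocity(+1.0),
    "rotright": lambda: set_angular_velocity(-1.0),
    "forward":  lambda: set_linear_velocity(2.0),
    "open":     lambda: set_gripper("open"),
    "close":    lambda: set_gripper("closed"),
}

def execute(word):
    """Run one vocabulary word; words outside the grammar raise KeyError."""
    ACTIONS[word]()
```

The point of the sketch is that the vocabulary itself is just a lookup table; whether a word moves a keyboard-driven prototype or a physics engine is hidden behind the dispatch.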
The reason why an action vocabulary is a powerful tool is that it increases the abstraction level.
Instead of programming a robot in a certain language like Python or C++, the interaction with the
machine is realized with natural-language commands. Nobody cares whether the robot was programmed
internally in Java, Python or any other language; the only interesting subject is the BNF grammar
which defines the textual user interface to the robot. In theory it's possible to extend the grammar with
some sensor commands, for example: getdistance, getcontactforce, getangle. These pieces of information
might help to locate the ball and grasp it with the correct amount of force. For reasons of simplification
this sensor feedback isn't implemented.
It's a bit hard to explain what the program is doing from a technical perspective, because the software
is different from a classic robot AI. In technical terms the program is mostly a graphical user interface
which shows the available commands to the user and makes sure that after pressing a number key the
action gets executed. So it's some sort of GUI interface without any further algorithm. If the idea is
reduced even more, the BNF grammar is the core element, which is no algorithm but a list of possible
words to interact with a robot. So it's a pulldown menu similar to what is known from classical
desktop applications.
The surprising situation is that such a trivial GUI interface allows the robot to be controlled with ease.
After a short amount of time, every newbie can send the correct commands to the robot.
Let us try to invent a more advanced example in the context of instruction following. Suppose the
task for the robot is to collect not one but hundreds of balls which appear at random positions on
the map. In each case, the robot has to move to the ball, grasp it and bring it back to the starting
position. Such a task can be realized with a script which uses the predefined commands. All that
is needed are some if-then statements and a for loop. Another way to program the robot would
be a behavior tree, which works on the same principle as scripting AI. Even if the proposed
vocabulary list is very simple, it's possible to solve complex tasks.
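Such a collection script might look like the sketch below. The helper `run` is hypothetical; here it only records each replayed command so the structure of the macro stays visible.

```python
# Autonomous script on top of the manual command vocabulary.
# run(action, seconds) is a hypothetical helper that replays one command;
# in this sketch it only logs the call instead of moving a physics body.
log = []

def run(action, seconds=0):
    log.append((action, seconds))

def collect_ball():
    """The manually found action sequence, replayed as a macro."""
    run("forward", 5)
    run("close")
    run("rotleft", 5)
    run("forward", 5)
    run("open")

# Collect hundreds of balls with a plain for loop and the same primitives.
for ball in range(100):
    collect_ball()
```

No new intelligence is added; the loop simply reuses the six-word vocabulary, which is the point the section makes.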
4c How to automate a warehouse with robots
A swarm of pick&place robots in a warehouse is an easy-to-realize automation example. The robots
navigate in a controlled environment, they are monitored by a computer and they can transport
a load over a long distance. The only challenge left is how to program these robots, which is
explained in the following section.
It makes sense to build a robot swarm in a step-by-step fashion. The first thing to do is to ignore the
technical side of robot programming, because this hardware-related side is surprisingly easy
to realize. A robot is basically an electric car which consists of a battery, a motor and some sensors.
Building such a machine from scratch isn't a real problem, especially not if modern technology like a
microcontroller is available. The more serious, and seldom explained, problem is how to describe the
autonomous working of a robot in computer code. And this task is realized as a natural-language
dialogue.
Before a robot can be programmed to understand commands, it makes sense to test a self-created
vocabulary in a dry run with two humans. Both humans have a sheet with words and one person gives
the other person the instructions. For example, the speaker says “goto shelf 5”. The hearer takes a
look at his vocabulary sheet and if the words goto, shelf and 5 are available he can execute the task.
Both humans are asked to use only the words on the sheet. Even if they know much more about the
warehouse, they are restricted to the vocabulary on the sheet.
The reason is that the newly created sheet is taken as input for programming the robots. A robot
will need some expert knowledge and this knowledge is encoded in the vocabulary used in the
human-to-human interaction. Typical words in a warehouse setting are “shelf, pickup, place, bringme,
moveto, loadbattery, stop, obstacleinfront, obstacleleft” and so on. The assumption is that an entry-level
warehouse robot needs a vocabulary of less than 50 words to describe all the possible scenarios. If the
words are available, it's possible to put them into a sequence and write longer programs with the words.
This task has much in common with classical computer programming but has a strong focus on solving
domain-related problems.
Even without such a program a robot can be used in an interactive fashion if it knows the important
words. It's possible to teleoperate such a robot by pressing buttons from 0 to 50, one for each command.
For example, if the operator wants the robot to move to shelf 5, he will enter “20 5” (20=moveto). If the
operator wants the robot to pick up the object he will enter “16” (16=pickup) and so on. If the
robot sends a message back like “4”, this is equal to obstacleahead (4=obstacleahead). And if the robot
sends the feedback (5=batterylow), the machine needs to find the next charging station.
In summary, a fully working warehouse robot consists of a vocabulary of 50 words which describe
possible actions and sensor readings. This vocabulary is used to control the robot in an
interactive mode or in an autonomous mode, which is equal to a behavior tree.
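The numeric teleoperation described above amounts to a small codebook plus a decoder. The id assignments below follow the examples in the text (20=moveto, 16=pickup, 4=obstacleahead, 5=batterylow); the rest of the 50-word dictionary is omitted.

```python
# Numeric codebook for teleoperating a warehouse robot.
# Only the ids mentioned in the text are listed; a full robot would
# carry around 50 entries.
CODEBOOK = {
    4:  "obstacleahead",
    5:  "batterylow",
    16: "pickup",
    20: "moveto",
}

def decode(message):
    """Translate an operator keystroke like '20 5' into a symbolic command."""
    parts = message.split()
    word = CODEBOOK[int(parts[0])]
    args = parts[1:]
    return (word, *args)
```

The same decoder works in both directions: operator keystrokes become commands, and numeric robot feedback becomes sensor words.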
4c1 Command dictionary for a maze robot
In contrast to a famous myth, the main problem in Artificial Intelligence isn’t about inventing an
algorithm which controls a robot but about inventing a dictionary for man to machine communication.
A possible example is provided in the table, which contains 14 different commands. These commands
can be used to control the position, scan the maze and check the battery.

id  action
0   move forward
1   turn left
2   turn right
3   check wall
4   take a picture
5   stop
6   restart
7   display map
8   speed up
9   slow down
10  show battery
11  scan surrounding
12  play sound
13  toggle lights

Table 9: command dictionary for a maze robot

The surprising insight is that such a command list isn't an algorithm but a codebook, which is a list.
Each entry has an id. The human operator can enter the code, e.g. 12, and this will start a certain
behavior of the robot (here: 12=play a sound). In other words, the command dictionary is an elaborated
remote control. Instead of providing only 4 arrow keys for left, right, up and down, the human operator
can take advantage of a long list of possible commands. So it's some sort of user interface known
from computer games. The commands are mapped to shortcuts and this allows the human to interact
with the game as fast as possible.
Such a user interface might sound trivial, because most video games have such a feature implemented
already. On closer inspection, such an interface can be used to create an advanced AI-controlled robot.
The reason is that a command dictionary allows actions to be recorded and played back. Suppose
the human operator is interacting with a robot and all the keystrokes are recorded in a game log. This
will result in a MIDI-like notation, for example [0,1,0,0,11,10,2,2,0,0]. The numerical sequence
captures the actions of the robot. Such a sequence can be played back as a macro. This allows the robot
to be controlled autonomously. The new (easier to solve) task is: which sort of action sequence is
needed to fulfill a certain task? The goal for the onboard AI is to generate an optimized action sequence
which consists of actions from the codebook. In other words, with a command codebook every domain
becomes a toy puzzle game.
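The record-and-playback idea can be sketched directly from Table 9. The command list reproduces the table; the two helper functions are illustrative.

```python
# Record and play back actions from the command dictionary of Table 9.
COMMANDS = ["move forward", "turn left", "turn right", "check wall",
            "take a picture", "stop", "restart", "display map", "speed up",
            "slow down", "show battery", "scan surrounding", "play sound",
            "toggle lights"]

def record(keystrokes):
    """A game log is just the list of pressed ids, a MIDI-like notation."""
    return list(keystrokes)

def playback(game_log):
    """Replay the macro by resolving each id back to its action name."""
    return [COMMANDS[i] for i in game_log]

macro = record([0, 1, 0, 0, 11, 10, 2, 2, 0, 0])
```

Once a keystroke log exists, the planning problem shrinks to finding a good id sequence, which is the "toy puzzle game" framing of the section.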
5 Robot programming languages
In the 1980s many robot programming languages were available, like Autopass and VAL. Some of
them worked with the client-server communication model similar to the description in section “3
instruction following”. Nevertheless, the concept of a robotics language seems to have fallen out of
fashion. The main obstacle is for sure that for every domain a new vocabulary set has to be created and
an existing language like VAL can't be adapted.
The main problem with a domain-specific language for robots is that such a language is missing
and has to be invented from scratch. The initial situation is that the robot has only a small set
of commands like [on, off, moveleft] and these commands are not powerful enough to script more
complex actions. Creating a robot programming language has much in common with the chicken-and-egg
problem. Without a command set it's not possible to create a robot program, and without a
robot program there is no vocabulary. To solve the issue we have to describe the situation from a
non-computing perspective. A dataset might be the source of any language. In contrast to a robot
language, a dataset isn't implemented in a computer; it is a table, similar to a mocap dataset.
In the optimal case, a robot dataset is a game log which has pictures and natural-language instructions
from a speaker. It's the recorded interaction between a speaker and a hearer, including the picture of
the robot. The main advantage of the dataset is that it can be created with less effort than a full-blown
domain-specific language. There is no BNF grammar and no parser; the dataset is a normal MS
Word document which describes a situation.
It's correct that such a dataset is different from a robot programming language, because the table
can't be compiled into Java. On the other hand, it can be used as a valuable source for software
engineering. It's possible to send the Word document to a software programmer and ask him whether
he can create a robot programming language according to the textual dialogue between the speaker and
the hearer. That means the textual document contains the specification for creating a DSL.
t  Speaker                          Tag speaker     Hearer      Tag hearer  Picture
0  Start the engine, please.        object          done        ok          picture1.jpg
1  Move to north and grasp the box  action, object  which one?  clarify     picture2.jpg
2  The box id=3.                    object number   thank you   ok          picture3.jpg
3  Can you return to base?          waypoint        One second  delay       picture4.jpg

Table 10: Dataset for instruction following between two humans
In the example table, the resulting robot language would contain commands like “start, move,
returntobase”, a variable like “boxid” and response codes like “ok, id_needed”. In theory it's possible to
convert this information into a Backus–Naur form grammar and then implement an executable robot
programming language.
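The first step of that conversion, extracting a vocabulary from the dataset and emitting a BNF-style production, can be sketched as follows. The rows paraphrase Table 10; the column names are illustrative.

```python
# Derive a command vocabulary from a Table-10-style dialogue dataset and
# print it as a BNF-style production. The rows paraphrase Table 10.
rows = [
    {"speaker": "start",        "hearer": "ok"},
    {"speaker": "move",         "hearer": "id_needed"},
    {"speaker": "returntobase", "hearer": "ok"},
]

commands  = sorted({r["speaker"] for r in rows})
responses = sorted({r["hearer"] for r in rows})

grammar = "command ::= " + " | ".join(commands)
```

The emitted production is only the skeleton of a DSL, but it shows how the dataset, not the programmer's intuition, determines the vocabulary.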
5a The symbol grounding problem
The dominant problem in Artificial Intelligence in the past was a missing understanding of intelligence.
Robots were known from science fiction books but it was unclear whether it's possible to realize them
in reality. There was no well-defined detail problem available; the situation in general was too complex
for the programmers.
Overcoming the bottleneck can be realized with two techniques: a dataset and the description of the
symbol grounding problem. Both techniques close the gap between human and machine intelligence.
Let us start with the easier-to-explain technique, which is a dataset. A dataset is a tool to define a
problem, for example a table with image-to-text entries, a motion capture database or a game log of
a videogame. All these data structures allow capturing human knowledge. They are stored in a
well-known database, which is a list of files stored in a directory. The dataset never contains an
algorithm or the Artificial Intelligence itself; the dataset is the input value for creating an AI algorithm.
It allows benchmarking existing neural network algorithms and it helps to describe problems in a
machine-readable format.
The second tool, the symbol grounding problem, means mapping numerical values into natural
language. For example the command “go north” has a certain representation in a computer program,
while a certain sensor reading can be translated into a sentence like “obstacle ahead”. A solved symbol
grounding problem allows interacting with a computer in natural language. Similar to a dataset, the
symbol grounding problem never consists of an algorithm or a computer program; it defines a
requirement for how a computer has to work. It's a specification for the GUI which is used by the
human operator to send commands to a robot. If the GUI has a text entry field which allows entering a
command similar to a text adventure, then the domain was grounded in natural language.
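Both grounding directions can be sketched in a few lines: a command string maps to an internal representation, and a raw sensor value maps back to a sentence. The direction vectors and the distance threshold are invented for illustration.

```python
# Two grounding directions: a command string maps to an internal
# representation, and a raw sensor value maps back to a sentence.
COMMAND_TABLE = {"go north": (0, 1), "go south": (0, -1)}   # word -> direction vector

def ground_command(text):
    """Symbol -> number: look up the internal representation of a command."""
    return COMMAND_TABLE[text]

def verbalize_distance(distance_cm):
    """Number -> symbol: translate a lidar reading into a speech act."""
    return "obstacle ahead" if distance_cm < 30 else "path is clear"
```

In this view, solving the symbol grounding problem means filling in these two tables for a whole domain.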
For an AI newbie, it might be hard to see why Artificial Intelligence can be realized with
non-algorithms like a dataset and a certain text-based GUI. This understanding contradicts the common
assumption that AI has to do with writing software. The misconception has to do with the
self-understanding of computer science. In the past, until the year 2000, computer science was equal to
inventing algorithms and converting them into executable programs. A more recent description of
computer science tries to first identify the problems which can be solved later with a certain algorithm.
This problem-focused computer science is more challenging to describe, because the problem itself is
not given at the beginning.
Let me give an example: a typical computer science problem until the year 2000 was how to solve the
traveling salesman problem. This path planning problem was introduced in computer science decades
ago and was then discussed and solved multiple times. The only debate was which algorithm works
well and how to implement the algorithm in a certain programming language. What was ignored in the
past is the question of whether the traveling salesman problem makes sense, how to modify the problem
in a new direction and which other problems might be out there.
5a1 Symbol grounding mindmap
The symbol grounding problem is hard to explain and it contains many references to existing
subjects like natural language processing, Artificial Intelligence, cognitive science and man-to-machine
interaction. A possible attempt to explore the subject is a mindmap which contains keywords. Of course
the following mindmap is a bit long because it contains 142 entries, so it makes sense to reduce the
subject to only 5 keywords, which are: speaker hearer dialogue, text to animation, abstraction
mechanism, controlled vocabulary and game log.
The main objective of the following mindmap is to provide an entry point for finding more literature
about the subject. In general, symbol grounding is about a mapping from language to pictures, which
sounds trivial at first glance but is surprisingly difficult to realize with a computer. The assumption is
that the symbol grounding problem is the core problem within Artificial Intelligence. It's basically a
tool to reduce the state space. Only with a small state space is it possible to solve robotics problems on
normal computer hardware.
1 sensor
1a clustering ID3
1a1 classification →3a2b
1a2 space partitioning →3b11
1b qualitative sensor
1b1 symbolic event recognition →3a2b
1b2 human activity recognition
1b3 system identification
1b4 game log
1b5 event taxonomy
1b6 pattern recognition
1c qualitative physics
1c1 Psychophysics
1c1a cognitive science
1c1b minimal quantum theory
1c1c spatial grounding
1c1d quantum system identification
1d bar chart
1d1 spectrogram
2 origin
2a Stevan Harnad, 1990
2b Rodney Brooks, 1990
2b1 physical grounding hypothesis
2c John Searle, 1980
2d Luc Steels, 2000
2d1 cognitive science
2e practical application
2e1 Yiannis Aloimonos
2e2 Jeffrey Siskind
2e3 Neil Dantam
3 symbols
3a language
3a1 concept
3a1a concept learning
3a2 sensory words
3a2a measurement
3a2b categorical perception
3a3 natural language command →4
3a4 voice command
3a5 mini language
3a6 activity language
3a7 motion description language
3a8 instruction following
3a9 route instructions
3a10 Scripting AI
3a11 BNF grammar
3a12 Behavior tree →1b5
3a13 Domain specific language
3b representation
3b1 feature vector
3b2 ontology
3b2a grounding graph
3b2b neural blackboard architecture
3b2c multigraph
3b2d lexigram
3b2e gantt chart
3b2f communication board
3b2g bag of word
3b2h tagging →3b6b
3b2i data labeling
3b2j scoreboard
3b2k tag vocabulary →4c5
3b2l semiotic network
3b3 boolean variables in feature set
3b4 short notation
3b5 morse code
3b6 case representation
3b6a knowledge representation
3b6b case frame
3b6c case grammar
3b7 Bayesian network
3b8 printer error code
3b8a ASCII code
3b9 telegraphic code book
3b10 information theory
3b11 Semantic memory
3b11a Connectionist
3b12 sender/receiver data communication
4 Natural language
4a meaning
4a1 Translation between iconic and textual language
4a2 user interface for man machine communication
4a3 Question Answering
4a4 Questionnaire
4a5 language games
4a6 grounded communication
4a7 Artificial language
4a8 semiotic cycle
4a9 dialogue
4a10 speaker hearer dialogue →3b12
4a11 Wizard of Oz experiment
4a12 human to robot dialogue
4b word embedding
4b1 Word2vec →3b2g
4b2 semantic indexing
4c minimal dictionary
4c1 grounding transfer
4c2 Structured English
4c3 conceptual spaces
4c4 multimodal translation →4a1
4c5 controlled vocabulary
4c6 dictionary learning
4c7 Thesaurus
5 Artificial Intelligence
5a video annotation
5a1 dataset
5a2 json format
5a3 motion capture
5a4 Therbligs
5a5 kinesiology
5a6 motion graph →3a6
5a7 annotation vocabulary
5a8 data centric AI
5b evaluation function
5b1 cost map
5b2 reward function
5b3 potential field
5b4 grounded reinforcement learning
5b5 heuristic algorithm
5b6 best first search
5b7 heuristic search
5b8 memory based heuristic
5b9 pattern database
5c soccer referee
5c1 event parser
5d Karel the robot →3a9
6 Anchoring
6a text to animation
6a1 Keyframe
6a2 interactive computer animation →5a6
6b text adventure →4c2
6b1 Text adventure parser
6b2 game engine
7 idea
7a sensory projections to categorize objects
7b language to meaning relationship
7c abstraction mechanism
7c1 state space reduction →3b3
7d in opposition
7d1 computationalism
7d2 algorithms
7d3 spatial map
7e programming exercise
5b Inventing a Domain specific language (DSL)
In the past, domain-specific languages were often explained from the implementation perspective. The
programmer was trying to formalize the specification in a BNF grammar and a parser was programmed.
Such a technical perspective prevents understanding what the real purpose of a domain-specific
language is. A DSL is first and foremost not a computer program but a communication device. It allows
communicating back and forth between a speaker and a hearer. So it's part of a dialogue game.
In contrast, the technical implementation including the BNF grammar has to be called trivial. Even
without programming a dedicated parser it's possible to implement a DSL in a computer program. In
the easiest case it's simply a Python class with some methods. Each method is executed from the
outside, so there is no dedicated DSL at all. The harder-to-grasp element of a DSL is its role during
the interactive control of a robot.
Let me give an example. There is a warehouse robot which can do simple pick&place tasks and
there is a human speaker. The speaker sends commands to the robot and receives feedback. For example
the speaker says “moveto B3”, “moveto B7”, “pick object”, “moveto A1” and so on. Possible answers
from the robot might be “Yes”, “obstacleahead”, “Yes”. The language utterances between speaker and
hearer are formulated in a domain-specific language, or to be more precise, it's a shared vocabulary.
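The "Python class with some methods" form of such a DSL could look like the sketch below. The cell labels and the "Yes" replies follow the example above; the attribute names are illustrative.

```python
# A DSL as a plain Python class: each vocabulary word becomes a method,
# so no dedicated parser is needed. Replies mirror the warehouse example.
class WarehouseRobot:
    def __init__(self):
        self.position = "A1"
        self.carrying = False

    def moveto(self, cell):
        self.position = cell
        return "Yes"

    def pick(self):
        self.carrying = True
        return "Yes"

    def place(self):
        self.carrying = False
        return "Yes"

robot = WarehouseRobot()
replies = [robot.moveto("B3"), robot.pick(), robot.moveto("A1"), robot.place()]
```

The method names are the shared vocabulary; the speaker script above is already a robot program, even though no grammar or parser was written.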
So we can say that a DSL plays the role of a communication interface. It enables a speaker and
a hearer to understand each other. A task is never solved by the robot or the DSL alone; the mentioned
warehouse task is solved with interactive communication. The reason why the communication is needed
is complexity reduction. The overall task is split into two subtasks which are realized by a
speaker and a hearer. This server/client model allows solving very complex problems from robotics.
5c Interactive robot control
Artificial Intelligence in the past was mostly treated as an algorithm-centric search in a well-defined
problem space. A typical example is a path finding algorithm or a chess engine, which are both
searching the state space with a highly efficient strategy. Unfortunately this perspective prevents
solving more advanced problems from robotics. To address this issue a different paradigm is needed,
which is interactive control.
In short, interactive control is a grounded communication between man and machine. Natural-language
instructions are used to send commands to the machine. A working example is a classical text
adventure from the early 1980s in which a human operator types in a command and the game engine
executes the command. Such an interaction isn't described as Artificial Intelligence; it is perceived
as a normal video game. The interface in this game can be realized with a mouse, a textual parser, a
keyboard or a joystick. This interface allows a grounded communication.
It's a seldom discussed fact that the implementation of teleoperated control has a huge impact on
the state space. It's possible to use a joystick to control a robot arm, or it's possible to use a vocabulary
list. The goal is to discretize the state space with the help of natural-language words. If the robot can be
controlled with a list of less than 20 words, it's pretty easy to automate the teleoperated control into
fully autonomous control.
Let me give a simple example of working interactive control of a clock. There is a speaker who
gives a command like “Its 2:30 o'clock” and the hearer has to put the little and the big hand into the
correct position. In the case of a videogame, the speaker is the human operator and the hearer is the
game engine. The task for the game engine is to take a clock time as input and render the hands as
output.
The surprising situation is that rendering a clock on a computer display is nothing new, and it doesn't
even belong to Artificial Intelligence in the classical sense; it is a normal programming task. But at the
same time it has a lot to do with AI, because it's a practical example of man-to-machine communication.
Let me explain it the other way around. The main bottleneck in AI is missing man-to-machine
communication. If a machine understands the commands of a human better, it's pretty easy to program
a robot.

Figure 12: Speaker Hearer interaction for clock animation (Speaker: “Its 2:30 o'clock”)

There are many examples available for interactive control like the mentioned clock setting animation,
an action adventure or the interactive animation of a biped robot. In all these cases the human is asked
to enter commands or press a button and this will affect the rendering on the computer screen. The user
interface gets adapted to a certain domain and this is equal to grounding the domain. That means the
communication with the user interface is realized.
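The clock example above can be grounded in a few lines: the utterance is parsed into a time and the time is converted into the two hand angles the game engine has to render. The accepted phrasing is deliberately simplified.

```python
# Ground an utterance like "Its 2:30 o'clock" into the two hand angles
# the game engine has to render. Angles are degrees, clockwise from 12.
def clock_angles(hour, minute):
    hour_angle = (hour % 12) * 30 + minute * 0.5   # hour hand creeps between hours
    minute_angle = minute * 6
    return hour_angle, minute_angle

def parse_time(utterance):
    """Extract 'H:MM' from a simple sentence; only this one pattern is handled."""
    for token in utterance.split():
        if ":" in token:
            h, m = token.split(":")
            return int(h), int(m)
    raise ValueError("no time found")

angles = clock_angles(*parse_time("Its 2:30 o'clock"))
```

The rendering itself remains a normal programming task; the AI-relevant part is only this mapping from a sentence to numbers.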
5c1 Interactive animation with datasets
Since the advent of interactive animation [Licklider1976], lots of attempts have been made to create
computer-generated images. The problem with most of these projects, like SIMBICON and other
parametrized animation systems, is that the underlying software is very complex and it's difficult to
reproduce a result. A promising attempt to overcome the complexity is to postpone the implementation
of such software and focus only on the creation of a dataset which specifies the desired outcome.
A dataset for interactive animation has much in common with a question answering dataset, except
that the domain is related to computer animation. The speaker in the dataset formulates some
requests like “can you draw a circle in the middle”, “please wave the left arm”, “jump”, while the role
of the hearer is to show a certain animated behavior. In the easiest case, the dataset has the same
format as a serious comic book: it consists of motion primitives together with their textual descriptions.
There is no need to convert the dataset into a computer program or even into a neural network;
the dataset itself is the project goal. After creating such a dataset no further algorithm or library is
needed. This allows focusing on the content and capturing the expert knowledge in a long table. It's up
to machine learning experts how to translate a dataset into a computer model.
Let us focus on the dataset itself. The structure contains some fixed assumptions. At first there is
a speaker-hearer role model, so the animation is always generated in a dialogue. The reason is that a
dialogue allows capturing high-level and low-level knowledge at the same time. The speaker knows
about the key elements, like waving the hand or moving a rigid body, while the hearer has detailed
knowledge about how to draw a certain request with vector graphics on the screen. The second fixed
element of the dataset is a multimodal information storage, which means that natural language and
pictures are both available. This allows solving the symbol grounding problem, which is basically a
translation from language into graphics.
In general an interactive animation dataset is a long table which contains columns for a speaker
and a hearer, while the content consists of both words and drawings. Such a dataset encodes all the
knowledge about animating characters.

Figure 13: Dataset for stick figure animation (Speaker commands: Stand, Run, Sit down)

Modern Artificial Intelligence works with a dataset in the background. At first, the domain
knowledge is captured in a long structured table and secondly this table is
converted into a mathematical model. Such kind of workflow makes it easy to bugfix possible errors
because it creates two different projects which are working independent from each other. In case
of computer animation the follow up problem after the dataset was created is how to reproduce the
information automatically. For example, a human user types in a command like “draw a circle in the
middle” and the AI has to execute the task. The reason why its possible for the computer to do so is
because the same request-response entry was given in the dataset, so the computer takes a look into the
correct row and extracts the information. In other words, its a retrieval task in a database.
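The retrieval step can be sketched in a few lines of Python. The table below is a hypothetical miniature dataset; in a real dataset each command would be paired with a drawing or motion primitive rather than a placeholder string.

```python
# Minimal sketch of the retrieval task, assuming a hypothetical
# request-response table. The action strings are placeholders for
# the drawings or motion primitives stored in a real dataset.
dataset = {
    "draw a circle in the middle": "circle(x=50, y=50, r=10)",
    "wave the left arm": "wave(joint='left_arm')",
    "jump": "jump(height=1.0)",
}

def execute(command: str) -> str:
    """Look up the command row in the dataset and return the stored action."""
    return dataset.get(command, "unknown command")

print(execute("jump"))   # -> jump(height=1.0)
print(execute("dance"))  # -> unknown command
```

No search or planning is involved; the system only looks into the correct row, which is exactly what makes the approach fast.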
5c2 Stick figure dataset to animation
The main reason why Artificial Intelligence was complicated in the past is its computer science perspective. The implicit assumption was that Artificial Intelligence and robotics have something to do with mathematics, algorithms and convex optimization. All these topics are interesting, but they can't explain what Artificial Intelligence is about. The more promising approach to develop an AI system is to describe a certain domain from a non computer perspective and then, in a second step, think about how to convert this description into a computer program.
The first step in developing an AI animation system is to create a dataset which contains possible stick figure images together with a textual description. A single stick figure can run, stand, sit or jump, and each entry gets annotated with one to two words. So the overall dataset is a long table written in the MS Word .doc format. This table forms the basis for the AI system; it can be compared to a manually created form which is used in database design as an input to derive the SQL tables.
Even if a MS Word table is the opposite of a machine readable format, it defines very clearly what a domain is about. Suppose there are 10 different stick figures available in the dataset, then it is possible to retrieve one of the entries. The retrieval process is some sort of dialogue game in which one person says a word like “walk” and the other person has to draw the stick figure from the table. Such kind of interaction can be simulated with a computer too, which results in the second step of an AI animation system. Such a system looks different from a robot control system, but at the same time it comes very close to a robot. The software won't need an AI algorithm; it is a simple text to drawing converter. A human user can map the words to a keyboard, and this allows to interact with the system by pressing buttons. After pressing 0 on the keypad, the stick figure will stand up, after pressing 1 it will move, and button 2 will make it jump. So it is some sort of animated character in a video game.
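The button interaction can be sketched as a small text to drawing converter. The ASCII pose strings are an assumption here; they stand in for the actual stick figure drawings in the dataset.

```python
# Sketch of the keyboard-driven animation system. The ASCII strings
# are invented placeholders for the stick figure drawings that the
# dataset would really contain.
poses = {
    "0": ("stand", " o \n/|\\\n/ \\"),
    "1": ("walk",  " o \n/|\\\n/ >"),
    "2": ("jump",  "\\o/\n | \n/ \\"),
}

def press(button: str) -> str:
    """Return the pose name for a pressed button; unknown keys default to stand."""
    name, drawing = poses.get(button, poses["0"])
    print(drawing)  # the system draws the requested stick figure
    return name
```

Pressing "2" prints the jumping figure and returns "jump"; any unmapped key falls back to the standing pose, so the character is always in a defined state.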
From a technical perspective, the dataset with the stick figures is a sprite sheet, while the animation system is the user interface of a game engine. The interesting situation is that no further programming effort is needed; the robot control system is ready to deploy. The logic of such a system is not located in the written software but in the dataset which was defined in step 1 as a MS Word table. The entries in the dataset specify what the robot can do and what not. If the robot animation looks unrealistic, more entries in the dataset are needed.
5c3 Interface for character animation
In an old paper from the year 2000, a simple but effective example for interactive character animation was presented.[Laszlo2000] For the domain of a hopping lamp, called Luxo, the movement was controlled with keystrokes like k, l, o, i, u and so forth. The human operator was invited to press a single button on the keyboard, and this activated a predefined pattern like “medium hop”. This gives the human operator full control over the character. It is an example for high level interaction.
Somebody may ask what the innovation is, because it is simply a mapping from keystrokes to animation scripts. And exactly here is the source of a misunderstanding. This simple mapping is the key element in the symbol grounding problem. Natural language commands are mapped to motion primitives, and this reduces the state space. Instead of figuring out which angle is needed for each joint of the robot, there is only a need to decide for one of 26 letters on the keyboard, and this will activate the movements. It is pretty easy to imagine a longer sequence of these actions which results in a realistic movement.
One important element is still missing from such an animation control system, which is a dataset. A dataset allows to specify the mapping in a more elaborate way. Instead of explaining which key is mapped to a certain behavior, the idea is to list all the movements in a table, and it is up to the programmer to map the movements to textual commands, a menu or a keystroke.
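The separation between the movement table and the key mapping can be illustrated as follows. The motion primitive names are hypothetical and only mimic the Luxo example; the point is that the table exists first and the keystroke assignment is a separate, later decision.

```python
# Step 1: the dataset lists the available movements. It contains no
# information about keys; the primitive names are hypothetical.
movement_table = ["small hop", "medium hop", "large hop", "lean forward"]

# Step 2: the programmer separately maps the table onto an interface,
# here a keystroke mapping in the style of the Luxo controller.
keymap = {key: move for key, move in zip("kloiu", movement_table)}

def on_keystroke(key: str) -> str:
    """Activate the predefined motion primitive behind a single key."""
    return keymap.get(key, "no action")
```

The same `movement_table` could instead be wired to a menu or to textual commands without touching the dataset itself.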
5c4 Early parameterized computer animation
Until the year 2000 it was very popular in computer graphics to use interactive and parameterized animation systems.[Laszlo2000] The idea was to simplify the process of creating moving pictures and to give the human operator sliders and mouse gestures to influence the output. In contrast to a purely mathematical understanding of calculating inverse kinematics and rigid body forces, it was an improvement. At the same time there is a weakness in parameterized animation which has to do with creating new animation systems.
Any parameterized animation system works with a mathematical formula which gets fine-tuned in an interactive manner. The human operator can move a slider, and this will affect the jump height of a character on the screen. But the same equation is useless if the character should run or take an object, so there is a need to increase the abstraction level further. This can be realized with a natural language dialogue. The idea here is that no mathematical background at all is needed; the animation is defined in a dataset which contains language annotations.
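A minimal sketch of such a parameterized formula is shown below. The sinusoidal arc is an assumption for illustration; the `height` and `period` parameters are what the sliders of a parameterized animation system would control. The same equation obviously cannot express a different behavior like grasping an object, which is the weakness discussed above.

```python
import math

def hop_height(t: float, height: float = 1.0, period: float = 1.0) -> float:
    """Vertical position of a hopping character at time t.

    'height' and 'period' are the slider-controlled parameters of the
    (assumed) animation formula; the arc peaks at the middle of a hop.
    """
    phase = (t % period) / period              # normalized time within one hop
    return height * math.sin(math.pi * phase)  # simple arc, zero at lift-off
```

Moving the height slider rescales every hop, but no slider setting will ever make the character run, which is why the abstraction level has to be raised.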
Such a dataset works with a speaker hearer role description. One human gives a command like “jump upwards”, while the other human executes the motion with a puppet. The overall dialogue including the demonstrated motion gets recorded with a computer. The main advantage over former parameterized animation systems is the absence of computer code and mathematical equations; it is a simple table. Such a table gets converted into a model in the second step, which is a completely different problem and can be realized by hand coded software as well as by neural networks.
The more important step is the first one, which defines the motion vocabulary. Instead of figuring out how a computer is useful to animate virtual characters, the idea is to describe only the problem in a semi structured format. The dataset is some sort of game or challenge which contains possible movements and word based annotations.
Let us go into the details of how to create such a dataset from scratch. Suppose there is a puppet show in which one person controls the puppet with a cross brace, while the speaker observes the situation and gives the commands. The speaker may say “move a bit to the left, now wave the hand, let the puppet jump”. The hearer, who is in control of the puppet, has to translate the orders into actions and gives feedback like “sure, nice idea”. Such a language guided dialogue is the core element of a speaker hearer dataset. It generates the discourse space in which action and language take place. During repetition, the same corpus of commands and behavior is shown again. That means it is only allowed to reference existing interactions, but not to invent new ones.
The interesting situation is that after the creation of a speaker hearer dataset, a new sort of AI problem is available. The new problem is how to convert the table into a computer model. The computer has to imitate the roles of the speaker and the hearer both. That means the software has to generate possible commands, and it also has to convert the commands into actions. But such a task can be realized much easier than generating the animation from scratch, because it was defined clearly what the goal is. There is a clear command like “move to left” which has to be converted into a precise movement of the character, which can be either right or wrong. So the process can of course be done by a computer program.
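The two imitation tasks can be sketched against a small speaker hearer table. The recorded dialogue entries below are invented for illustration; the key property is that both roles only reference the corpus and never invent new interactions.

```python
import random

# Recorded speaker hearer dialogue (step 1); the entries are invented
# placeholders for a real table of commands and puppet movements.
table = {
    "move to left": "translate(dx=-1)",
    "wave the hand": "rotate(joint='hand')",
    "jump upwards": "translate(dy=+1)",
}

def speaker() -> str:
    """Imitate the speaker: generate a possible command from the corpus."""
    return random.choice(list(table))

def hearer(command: str) -> str:
    """Imitate the hearer: convert a command into the recorded action."""
    return table.get(command, "no such interaction")  # no new inventions

# Because the goal is defined in the table, the hearer's response can
# be checked as right or wrong by comparing it against the dataset.
cmd = speaker()
assert hearer(cmd) == table[cmd]
```

Judging correctness is a simple table comparison, which is exactly why this task is so much easier than generating animation from scratch.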
References
[Cirik2015] Cirik, Volkan, Louis-Philippe Morency, and Eduard Hovy. "Chess Q&A: Question
answering on chess games." Reasoning, Attention, Memory (RAM) Workshop, Neural Information
Processing Systems. Vol. 7. No. 5. 2015.
[Culberson1998] Culberson, Joseph C., and Jonathan Schaeffer. "Pattern databases." Computational
Intelligence 14.3 (1998): 318-334.
[Edelkamp2002] Edelkamp, Stefan. "Symbolic Pattern Databases in Heuristic Search Planning."
AIPS. 2002.
[Farkhatdinov2008] Farkhatdinov, Ildar, Jee-Hwan Ryu, and Jury Poduraev. "Control strategies
and feedback information in mobile robot teleoperation." IFAC Proceedings Volumes 41.2 (2008):
14681-14686.
[Holte1999] Holte, Robert C., and István T. Hernádvölgyi. "A space-time tradeoff for memory-based heuristics." AAAI/IAAI. 1999.
[Howard2022] Howard, Thomas, et al. "An intelligence architecture for grounded language
communication with field robots." (2022).
[Korf1997] Korf, Richard E. "Finding optimal solutions to Rubik’s Cube using pattern databases."
AAAI/IAAI. 1997.
[Laszlo2000] Laszlo, Joseph, Michiel van de Panne, and Eugene Fiume. "Interactive control for
physically-based animation." Proceedings of the 27th annual conference on Computer graphics and
interactive techniques. 2000.
[Licklider1976] Licklider, Joseph CR. "User-oriented interactive computer graphics." Proceedings
of the ACM/SIGGRAPH Workshop on User-oriented Design of Interactive Graphics Systems. 1976.
[Matuszek2012] Matuszek, Cynthia, et al. Learning to parse natural language to a robot execution
system. Technical Report UW-CSE-12-01-01, University of Washington, 2012.
[Rumelhart1993] Rumelhart, David E., and Peter M. Todd. "Learning and connectionist representations." Attention and performance XIV: Synergies in experimental psychology, artificial intelligence, and cognitive neuroscience 2 (1993): 3-30.
[Samadi2008] Samadi, Mehdi, et al. "Compressing pattern databases with learning." ECAI 2008.
IOS Press, 2008. 495-499.
[Stuart1995] Russell, Stuart, and Peter Norvig. "Artificial intelligence: a modern approach." (1995).
[Tellex2007] Tellex, Stefanie, and Deb Roy. "Grounding language in spatial routines." AAAI
Spring Symposium: Control Mechanisms for Spatial Knowledge Processing in Cognitive/Intelligent
Systems. 2007.