Lecture Notes in Artificial Intelligence
Subseries of Lecture Notes in Computer Science
Edited by J. G. Carbonell and J. Siekmann
Lecture Notes in Computer Science
Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
1778
Berlin
Heidelberg
New York
Barcelona
Hong Kong
London
Milan
Paris
Singapore
Tokyo
Stefan Wermter, Ron Sun (Eds.)
Hybrid Neural Systems
Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Volume Editors
Stefan Wermter
University of Sunderland
Centre of Informatics, SCET
St Peters Way, Sunderland, SR6 0DD, UK
E-mail: stefan.wermter@sunderland.ac.uk
Ron Sun
University of Missouri-Columbia
CECS Department
201 Engineering Building West, Columbia, MO 65211-2060, USA
E-mail: rsun@cecs.missouri.edu
Cataloging-in-Publication Data applied for
Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Hybrid neural systems / Stefan Wermter ; Ron Sun (ed.). - Berlin ;
Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ;
Paris ; Singapore ; Tokyo : Springer, 2000
(Lecture notes in computer science ; Vol. 1778 : Lecture notes in
artificial intelligence)
ISBN 3-540-67305-9
CR Subject Classification (1991): I.2.6, F.1, C.1.3, I.2
ISBN 3-540-67305-9 Springer-Verlag Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law.
Springer is a company in the BertelsmannSpringer publishing group.
c Springer-Verlag Berlin Heidelberg 2000
Printed in Germany
Typesetting: Camera-ready by author; data conversion by PTP Berlin, Stefan Sossna
Printed on acid-free paper
SPIN: 10719871
06/3142
543210
Preface
The aim of this book is to present a broad spectrum of current research in
hybrid neural systems, and advance the state of the art in neural networks and
artificial intelligence. Hybrid neural systems are computational systems which
are based mainly on artificial neural networks but which also allow a symbolic
interpretation or interaction with symbolic components.
This book focuses on the following issues related to different types of representation: How does neural representation contribute to the success of hybrid
systems? How does symbolic representation supplement neural representation?
How can these types of representation be combined? How can we utilize their
interaction and synergy? How can we develop neural and hybrid systems for new
domains? What are the strengths and weaknesses of hybrid neural techniques?
Are current principles and methodologies in hybrid neural systems useful? How
can they be extended? What will be the impact of hybrid and neural techniques
in the future?
In order to bring together new and different approaches, we organized an
international workshop. This workshop on hybrid neural systems, organized by
Stefan Wermter and Ron Sun, was held during December 4–5, 1998 in Denver.
In this well-attended workshop, 27 papers were presented. Overall, the workshop
was wide-ranging in scope, covering the essential aspects and strands of hybrid
neural systems research, and successfully addressed many important issues of
hybrid neural systems research. The best and most appropriate contributions were selected and revised twice; this book contains these revised papers, some presented as state-of-the-art surveys, to cover the various research areas of the collection.
This selection of contributions is a representative snapshot of the state of the
art in current approaches to hybrid neural systems. This is an extremely active
area of research that is growing in interest and popularity. We hope that this
collection will be stimulating and useful for all those interested in the area of
hybrid neural systems.
We would like to thank Garen Arevian, Mark Elshaw, Steve Womble and
in particular Christo Panchev, from the Hybrid Intelligent Systems Group of
the University of Sunderland for their important help and assistance during the
preparations of the book. We would like to thank Alfred Hofmann from Springer
for his cooperation. Finally, and most importantly, we thank the contributors to
this book.
January 2000
Stefan Wermter
Ron Sun
Table of Contents
An Overview of Hybrid Neural Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .  1
Stefan Wermter and Ron Sun
Structured Connectionism and Rule Representation
Layered Hybrid Connectionist Models for Cognitive Science . . . . . . . . . . . . . 14
Jerome Feldman and David Bailey
Types and Quantifiers in SHRUTI: A Connectionist Model of Rapid
Reasoning and Relational Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
Lokendra Shastri
A Recursive Neural Network for Reflexive Reasoning . . . . . . . . . . . . . . . . . . . 46
Steffen Hölldobler, Yvonne Kalinke and Jörg Wunderlich
A Novel Modular Neural Architecture for Rule-Based and Similarity-Based
Reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Rafal Bogacz and Christophe Giraud-Carrier
Addressing Knowledge-Representation Issues in Connectionist Symbolic
Rule Encoding for General Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Nam Seog Park
Towards a Hybrid Model of First-Order Theory Refinement . . . . . . . . . . . . . 92
Nelson A. Hallack, Gerson Zaverucha, and Valmir C. Barbosa
Distributed Neural Architectures and Language Processing
Dynamical Recurrent Networks for Sequential Data Processing . . . . . . . . . . 107
Stefan C. Kremer and John F. Kolen
Fuzzy Knowledge and Recurrent Neural Networks: A Dynamical Systems
Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
Christian W. Omlin, Lee Giles, and Karvel K. Thornber
Combining Maps and Distributed Representations for Shift-Reduce
Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
Marshall R. Mayberry and Risto Miikkulainen
Towards Hybrid Neural Learning Internet Agents . . . . . . . . . . . . . . . . . . . . . . 158
Stefan Wermter, Garen Arevian, and Christo Panchev
A Connectionist Simulation of the Empirical Acquisition of Grammatical
Relations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
William C. Morris, Garrison W. Cottrell, and Jeffrey Elman
Large Patterns Make Great Symbols: An Example of Learning from
Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
Pentti Kanerva
Context Vectors: A Step Toward a “Grand Unified Representation” . . . . . . 204
Stephen I. Gallant
Integration of Graphical Rules with Adaptive Learning of Structured
Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
Paolo Frasconi, Marco Gori, and Alessandro Sperduti
Transformation and Explanation
Lessons from Past, Current Issues, and Future Research Directions in
Extracting the Knowledge Embedded in Artificial Neural Networks . . . . . . . 226
Alan B. Tickle, Frederic Maire, Guido Bologna, Robert Andrews, and
Joachim Diederich
Symbolic Rule Extraction from the DIMLP Neural Network . . . . . . . . . . . . . 240
Guido Bologna
Understanding State Space Organization in Recurrent Neural Networks
with Iterative Function Systems Dynamics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Peter Tiňo, Georg Dorffner, and Christian Schittenkopf
Direct Explanations and Knowledge Extraction from a Multilayer
Perceptron Network that Performs Low Back Pain Classification . . . . . . . . . 270
Marilyn L. Vaughn, Steven J. Cavill, Stewart J. Taylor,
Michael A. Foy, and Anthony J.B. Fogg
High Order Eigentensors as Symbolic Rules in Competitive Learning . . . . . 286
Hod Lipson and Hava T. Siegelmann
Holistic Symbol Processing and the Sequential RAAM: An Evaluation . . . . 298
James A. Hammerton and Barry L. Kalman
Robotics, Vision and Cognitive Approaches
Life, Mind, and Robots: The Ins and Outs of Embodied Cognition . . . . . . . 313
Noel Sharkey and Tom Ziemke
Supplementing Neural Reinforcement Learning with Symbolic Methods . . . 333
Ron Sun
Self-Organizing Maps in Symbol Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
Timo Honkela
Evolution of Symbolization: Signposts to a Bridge Between Connectionist
and Symbolic Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
Ronan G. Reilly
A Cellular Neural Associative Array for Symbolic Vision . . . . . . . . . . . . . . . . 372
Christos Orovas and James Austin
Application of Neurosymbolic Integration for Environment Modelling in
Mobile Robots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
Gerhard Kraetzschmar, Stefan Sablatnög, Stefan Enderle, and
Günther Palm
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
An Overview of Hybrid Neural Systems
Stefan Wermter¹ and Ron Sun²
¹ University of Sunderland, Centre for Informatics, SCET
St. Peter’s Way, Sunderland, SR6 0DD, UK
² University of Missouri, CECS Department
Columbia, MO, 65211-2060, USA
Abstract. This chapter provides an introduction to the field of hybrid
neural systems. Hybrid neural systems are computational systems which
are based mainly on artificial neural networks but also allow a symbolic
interpretation or interaction with symbolic components. In this overview,
we will describe recent results of hybrid neural systems. We will give
a brief overview of the main methods used, outline the work that is
presented here, and provide additional references. We will also highlight
some important general issues and trends.
1 Introduction
In recent years, the research area of hybrid and neural processing has seen a
remarkably active development [62,50,21,4,48,87,75,76,25,49,94,13,74,91]. Furthermore, there has been an enormous increase in the successful use of hybrid
intelligent systems in many diverse areas such as speech/natural language understanding, robotics, medical diagnosis, fault diagnosis of industrial equipment
and financial applications. Looking at this research area, the motivation for examining hybrid neural models is based on different viewpoints.
First, from the point of view of cognitive science and neuroscience, a purely neural representation may be most attractive but symbolic interpretation
of a neural architecture is also desirable, since the brain has not only a neuronal
structure but also the capability to perform symbolic reasoning. This leads to the question of how different processing mechanisms can bridge the large gap between,
for instance, acoustic or visual input signals and symbolic reasoning. The brain
uses specialization of different structures. Although a lot of the functionality of
the brain is not yet known in detail, its architecture is highly specialized and
organized at various levels of neurons, networks, nodes, cortex areas and their
respective connections [10]. Furthermore, different cognitive processes are not
homogeneous and it is to be expected that they are based on different representations [73]. Therefore, there is evidence from cognitive science and neuroscience
that multiple architectural representations are involved in human processing.
Second, from the point of view of knowledge-based systems, hybrid symbolic/neural representations have some advantages, since different, mutually complementary properties can be combined. Symbolic representations have advantages of easy interpretation, explicit control, fast initial coding, dynamic variable
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 1–13, 2000.
c Springer-Verlag Berlin Heidelberg 2000
binding and knowledge abstraction. On the other hand, neural representations
show advantages for gradual analog plausibility, learning, robust fault-tolerant
processing, and generalization. Since these advantages are mutually complementary, a hybrid symbolic neural architecture can be useful if different processing
strategies have to be supported. While from a neuroscience or cognitive science
point of view it is most desirable to explore exclusively neural network representations, for knowledge engineering in complex real-world systems, hybrid symbolic/neural systems may be very useful.
2 Various Forms of Hybrid Neural Architectures
Various classification schemes of hybrid systems have been proposed [77,76,89,47].
Other characterizations of architectures covered specific neural architectures, for
instance recurrent networks [38,52], or they covered expert systems/knowledge-based systems [49,29,75]. Essentially, a continuum of hybrid neural architectures
emerges which contains neural and symbolic knowledge to various degrees. However, as a first introduction to the field, we present a simplified taxonomy here:
unified neural architectures, transformation architectures, and hybrid modular
architectures.
2.1 Unified Neural Architectures
Unified neural architectures are a type of hybrid neural system. They have also
been referred to as unified hybrid systems [47]. They rely solely on connectionist
representations but symbolic interpretations of nodes or links are possible. Often,
specific knowledge of the task is built into a unified neural architecture.
Much early research on unified neural architectures can be traced back to
work by Feldman and Ballard, who provided a general framework of structured connectionism [16]. This framework was extended in many different directions including, for instance, parsing [14], explanation [12], and logic reasoning
[30,40,70,71,72]. Recent work along these lines focuses also on the so-called NTL,
Neural Theory of Language, which attempts to bridge the large gap between neurons and cognitive behavior [17,65].
A question that naturally arises is: why should we use neural models for
symbol processing, instead of symbolic models? Possible reasons may include:
neural models are a more apt framework for capturing a variety of cognitive
processes, as is argued in [15,66,86,72]. Some inherent processing characteristics
of neural models, such as similarity-based processing [72,6], make them more
suitable for certain tasks such as cognitive modeling. Learning processes may
be more easily developed in neural models, such as gradient descent [63] and its
various approximations, Expectation-Maximization, and even Inductive Logic
Programming methods [26].
There can be two types of representations [77]: Localist connectionist architectures contain one distinct node for representing each concept [42,71,67,3,58,31,66].
Distributed neural architectures comprise a set of non-exclusive, overlapping
nodes for representing each concept [60,50,27].
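The contrast between these two representation types can be made concrete in a small sketch. The concepts, unit counts, and activity patterns below are invented purely for illustration and are not taken from any of the cited architectures:

```python
import math

# Localist: one-hot vectors -- unit i stands for concept i and nothing else.
concepts = ["cat", "dog", "car"]
localist = {c: [1.0 if i == j else 0.0 for j in range(len(concepts))]
            for i, c in enumerate(concepts)}

# Distributed: non-exclusive, overlapping patterns over 8 shared units;
# related concepts are given deliberately overlapping patterns.
distributed = {
    "cat": [1, 1, 0, 1, 0, 0, 1, 0],  # shares 3 active units with "dog"
    "dog": [1, 1, 0, 1, 0, 1, 0, 0],
    "car": [0, 0, 1, 0, 1, 1, 0, 1],  # no active unit in common with "cat"
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Localist units share nothing, so distinct concepts are orthogonal;
# distributed patterns directly support similarity-based processing.
print(cosine(localist["cat"], localist["dog"]))        # 0.0
print(cosine(distributed["cat"], distributed["dog"]))  # 0.75
print(cosine(distributed["cat"], distributed["car"]))  # 0.0
```

This also illustrates the trade-off noted below: the localist scheme is trivially interpretable (one node per concept), while the distributed scheme encodes graded similarity between concepts.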
The work of researchers such as Feldman [16,17], Ajjanagadde and Shastri
[67], Sun [72], and Smolensky [69] has demonstrated why localist connectionist
networks are suitable for implementing symbolic processes usually associated
with higher cognitive functions. On the other hand, “radical connectionism”
[13] is a distributed neural approach to modeling intelligence. Usually, it is easier to incorporate prior knowledge into localist models since their structures
can be made to correspond directly to the structure of symbolic knowledge [19]. On the other hand, neural learning usually leads to distributed representations. Furthermore, there has been work on integrating localist and distributed representations
[28,72,87].
2.2 Transformation Architectures
Hybrid transformation architectures transform symbolic representations into neural representations or vice versa. The main processing is performed by neural
representations but there are automatic procedures for transferring neural representations to symbolic representations or vice versa. Using a transformation
architecture it is possible to insert or extract symbolic knowledge into or from a
neural architecture. Hybrid transformation architectures differ from unified neural architectures by the automatic transfer. While certain units in unified neural
architectures may be interpreted symbolically by an observer, hybrid transformation architectures actually allow the knowledge transfer into symbolic rules,
symbolic automata, grammars, etc.
Examples of such transformation architectures include the work on activation-based automata extraction from recurrent networks [54,90]. Alternatively, a
weight-based transformation between symbolic rules and feedforward networks
has been extensively examined in knowledge-based artificial neural networks
[68,20].
The most common transformation architectures are rule extraction architectures where symbolic rules are extracted from neural networks [19,1]. These
architectures have received a lot of attention since rule extraction discovers the
hyperplane positions of units in neural networks and transforms them into if-then-else rules. Rule extraction has been performed mostly with multi-layer perceptron networks [79,5,8,11], Kohonen networks, radial basis functions [2,33] and recurrent networks [53,90]. Extraction of symbolic knowledge from neural networks also plays an important role in the current volume, e.g. [81,7,84].
Furthermore, insertion of symbolic knowledge can be either gradual through
practice [23] or one-shot.
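The core idea behind such rule extraction can be sketched for a single unit with binary inputs: the hyperplane w·x + b = 0 partitions the input space, and each input pattern on the positive side yields an if-then rule. The helper below and its toy weights are hypothetical, a minimal sketch rather than any specific published algorithm:

```python
from itertools import product

def extract_rules(weights, bias, names):
    """Enumerate binary input patterns; emit a rule for each firing pattern."""
    rules = []
    for pattern in product([0, 1], repeat=len(weights)):
        activation = sum(w * x for w, x in zip(weights, pattern)) + bias
        if activation > 0:  # the unit fires on this side of the hyperplane
            antecedent = " AND ".join(
                n if x else f"NOT {n}" for n, x in zip(names, pattern)
            )
            rules.append(f"IF {antecedent} THEN fire")
    return rules

# A unit computing roughly "A AND B": it fires only when both inputs are on.
rules = extract_rules(weights=[1.0, 1.0], bias=-1.5, names=["A", "B"])
print(rules)  # ['IF A AND B THEN fire']
```

Practical extraction methods differ mainly in how they avoid this exhaustive enumeration, which is exponential in the number of inputs, and in how they handle continuous activations across several layers.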
2.3 Hybrid Modular Architectures
Hybrid modular architectures contain both symbolic and neural modules appropriate to the task. Here, symbolic representations are not just initial or final
representations as in a transformation architecture. Rather, they are combined
and integrated with neural representations in many different ways. Examples in
this class, for instance, contain CONSYDERR [72], SCREEN [95] or robot navigators where sensors and neural processing are fused with symbolic top-down
expectations [37]. A variety of distinctions can be made. Neural and symbolic
modules in hybrid modular architectures can be loosely coupled, tightly coupled
or completely integrated [48].
Loosely Coupled Architectures. A loosely coupled hybrid architecture has
separate symbolic and neural modules. The control flow is sequential in the sense
that processing has to be finished in one module before the next module can
begin. Only one module is active at any time, and the communication between
modules is unidirectional.
There are several loosely coupled hybrid modular architectures for semantic
analysis of database queries [9] or dialog processing [34] or simulated navigation
[78]. Another example of a loosely coupled architecture has been described in a
model for structural parsing [87] combining a chart parser and feedforward networks. Other examples of loose coupling, which is sometimes also called passive
coupling, include [45,36].
In general, this loose coupling enables various loose forms of cooperation
among modules [73]. One form of coupling is in terms of pre/postprocessing
vs. main processing: while one or more modules take care of pre/postprocessing,
such as transforming input data or rectifying output data, a main module focuses
on the main part of the processing task. Commonly, while pre/post processing
is done using a neural network, the main task is accomplished through the use
of symbolic methods. Another form of cooperation is through a master-slave
relationship: while one module maintains control of the task at hand, it can
signal other modules to handle some specific aspects of the task. Yet another
form of cooperation is the equal partnership of multiple modules.
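This sequential, unidirectional control flow can be sketched in a few lines. The two modules and the toy alert task below are invented for illustration and do not correspond to any of the cited systems:

```python
def neural_preprocessor(signal):
    """Stand-in for a trained network mapping raw input to symbol labels."""
    # Here the "network" is replaced by a simple threshold per value.
    return ["HIGH" if x > 0.5 else "LOW" for x in signal]

def symbolic_main_module(symbols):
    """Rule-based main processing over the symbols produced upstream."""
    if symbols.count("HIGH") > len(symbols) / 2:
        return "ALERT"
    return "NORMAL"

# Loose coupling: preprocessing runs to completion, then hands its output,
# one way, to the symbolic main module -- only one module is active at a time.
symbols = neural_preprocessor([0.9, 0.7, 0.2, 0.8])
print(symbolic_main_module(symbols))  # ALERT
```

In a tightly coupled architecture, by contrast, the two functions would share data structures and pass control back and forth before either finishes its global processing.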
Tightly Coupled Architectures. A tightly coupled hybrid architecture contains separate symbolic and neural modules where control and communication
are via common shared internal data structures in each module. The main difference between loosely and tightly coupled hybrid architectures is the common data structures, which allow bidirectional exchanges of knowledge between two
or more modules. This makes communication faster and more active but also
more difficult to control. Therefore, tightly coupled hybrid architectures have
also been referred to as actively coupled hybrid architectures [47].
As examples of tightly coupled architectures, systems for neural deterministic parsing [41] and inferencing [28] have been built where the control changes
between symbolic marker passing and neural similarity determination. Furthermore, a hybrid system developed by Tirri [83] consists of a rule base, a fact base
and a neural network of several trained radial basis function networks [57,59].
In general, a tightly coupled hybrid architecture allows multiple exchanges of
knowledge between two or more modules. The result of a neural module can have
a direct influence on a symbolic module or vice versa before it finishes its global
processing. For instance, CDP is a system for deterministic parsing [41], and SCAN
contains a tightly coupled component for structural processing and semantic
classification [87]. While the neural network chooses which action to perform,
the symbolic module carries out the action. During the process of parsing, control
is switched back and forth between these modules. Other tightly coupled hybrid
architectures for structural processing have been described in more detail in [89].
CLARION is also a system that couples symbolic and neural representations to
explore their synergy.
Fully Integrated Architectures. In a fully integrated hybrid architecture there
is no discernible external difference between symbolic and neural modules, since
the modules have the same interface and they are embedded in the same architecture. The control flow may be parallel. Communication may be bidirectional
between many modules, although not all possible communication channels have
to be used.
One example of an integrated hybrid architecture is SCREEN, which was
developed for exploring integrated hybrid processing for spontaneous language
analysis [95,92]. In fully integrated and interleaved systems, the constituent modules interact through multiple channels (e.g., various possible function calls),
or may even have node-to-node connections across two modules, such as CONSYDERR [72] in which each node in one module is connected to a corresponding
node in the other module. Another hybrid system designed by Lees et al. [43]
interleaves case-based reasoning modules with several neural network modules.
3 Directions for Hybrid Neural Systems
In Feldman and Bailey’s paper, it was proposed that there are the following
distinct levels [15]: cognitive linguistic level, computational level, structured
connectionist level, computational biology level and biological level. A condition for this vertical hybridization is that it should be possible to bridge the
different levels, and the higher levels should be reduced to, or grounded in, lower levels. A top-down research methodology is advocated and examined for
concepts towards a neural theory of language.
Although the particulars of this approach are not universally agreed upon,
researchers generally accept the overall idea of multiple levels of neural cognitive
modeling. In this view, models should be constructed entirely of neural components; both symbolic and subsymbolic processes should be implemented in
neural networks.
Another view, horizontal hybridization, argues that it may be beneficial, and
sometimes crucial, to “mix” levels so that we can make better progress on understanding cognition. This latter view is based on realistic assessment of the state of
the art of neural model development, and the need to focus on the essential issues
(such as the synergy between symbolic and subsymbolic processes [78]) rather
than nonessential details of implementation. Horizontal approaches have been
used successfully for real-world hybrid systems, for instance in speech/language
analysis [95]. Purely neural systems in vertical hybridization are more attractive
for neuroscience but hybrid systems of horizontal hybridization are currently
also a tractable way of building large-scale hybrid neural systems.
Representation, learning, and their interaction are among the major issues for developing symbol-processing neural networks. Neural networks designed for symbolic processing often involve complex internal structures consisting
of multiple components and several different representations [67,71,3]. Thus learning is made more difficult. There is a need to address the problems of what type
of representation to adopt, how the representational structure in such systems is
built up, how the learning processes involved affect the representation acquired
and how the representational constraints may facilitate or hamper learning.
In terms of what is being learned in hybrid neural systems, we can have (1)
learning contents for a fixed architecture, (2) learning architectures for given
contents, or (3) we can learn both contents and architecture at the same time.
Although most hybrid neural learning systems fall within the first two categories,
e.g. [18,46], there are some hybrid models that belong to the third category, e.g.
[50,92].
Furthermore, there is some current work on parallel neural and symbolic
learning, which includes using (1) two separate neural/symbolic algorithms applied simultaneously [78], (2) two separate algorithms applied in succession, (3)
integrated neural/symbolic learning [80,35], and (4) purely neural learning of
symbolic knowledge, e.g. [46,51].
The issues described above are important for making progress in theories and
applications of hybrid systems. Currently, there is not yet a theory of “hybrid
systems”. There has been preliminary work towards a theoretical framework for neural/symbolic representations, but to date there is still a lack of an
overall theoretical framework that abstracts away from the details of particular
applications, tasks and domains. One step towards such a direction may be the
research into the relationship between automata theory and neural representations [39,24,88].
Processing natural language has been and will continue to be a very important test area for exploring hybrid neural architectures. It has been argued that
“language is the quintessential feature of human intelligence” [85]. While certain
learning mechanisms and architectures in humans may be innate, most researchers in neural
networks argue for the importance of development and environment during language learning [87,94]. For instance, it was argued [51] that syntax is not innate
and that it is a process rather than a representation, and that abstract categories, like
subject, can be learned bottom-up.
The dynamics of learning natural language is also important for designing
parsers using techniques like SRN and RAAM. SARDSRN and SARDRAAM
were presented in the context of shift-reduce parsing [46] to avoid the problem
associated with SRN and RAAM (that is, losing constituent information). Interestingly, it has been argued that compositionality and systematicity in neural
networks arise from an associationistic substrate [61] based on principles from
evolution.
Also, research into improving WWW use by using neural networks may be
promising [93]. While currently most search engines only employ fairly traditional search strategies, machine learning and neural networks could improve
processing of heterogeneous unstructured multimedia data.
Another important promising research area is knowledge extraction from
neural networks in order to support text mining and information retrieval [81].
Inductive learning techniques from neural networks and symbolic machine learning algorithms could be combined to analyze the underlying rules for such data.
A crucial task for applying neural systems, especially distributed learning systems, is the design of appropriate vector representations for scaling up to real-world tasks. Large context vectors are also essential for learning
document retrieval [22]. Due to the size of the data, only linear computations
are useful for full-scale information retrieval. However, vector representations are still often restricted to co-occurrences, rather than capturing syntax, discourse, logic and so on [22]. Nevertheless, complex representations may be formed and analyzed using fractal approaches [82].
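The style of co-occurrence-based vector retrieval discussed above can be sketched minimally: documents become bag-of-words count vectors, and a query is ranked against them with one linear pass of dot products, the kind of linear computation that remains feasible at full retrieval scale. The toy documents are invented; this is not a reimplementation of the cited context-vector work:

```python
import math
from collections import Counter

docs = {
    "d1": "neural networks learn distributed representations",
    "d2": "symbolic rules support explicit knowledge representation",
    "d3": "hybrid systems combine neural and symbolic representations",
}

def vectorize(text):
    """Bag-of-words count vector: co-occurrence of terms within a document."""
    return Counter(text.split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)  # Counter returns 0 for missing terms
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

doc_vecs = {name: vectorize(text) for name, text in docs.items()}
query = vectorize("neural symbolic hybrid")

# Linear scan: one similarity computation per document.
ranked = sorted(doc_vecs, key=lambda d: cosine(query, doc_vecs[d]), reverse=True)
print(ranked[0])  # d3 -- it shares 'hybrid', 'neural', and 'symbolic'
```

As the text notes, such vectors capture only term co-occurrence; syntax, discourse structure, and logical relations are invisible to this representation.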
Hard real-world applications are important. A system was built for foreign
exchange rate prediction that uses a SOM for reduction and that generates a
symbolic representation as input for a recurrent network which can produce
rules [55]. Another self-organizing approach for symbol processing was described
for classifying Usenet texts and presenting the classification as a hierarchical
two-dimensional map [32]. Related neural classification work for text routing
has been described [93]. Neural network representations have also been used for
important parts of vision and association [56].
Finally, there is promising progress in neuroscience. Computational neuroscience is still in its infancy but it may be very relevant to the long-term progress
of hybrid symbolic neural systems. Related to that, more complex high order
neurons may be one possibility for building more powerful functionality [44].
Another way would be to focus more on global brain architectures, for instance
for building biologically inspired robots with rooted cognition [64].
It was argued [85] that in 20 years computer power will be sufficient to match
human capabilities, at least in principle. But meaning and deep understanding
are still lacking. Other important issues are perception, situation assessment and
action [78], although perceptual pattern recognition is still in a very primitive
state. Rich perception also requires links with rich sets of actions. Furthermore,
it has been argued that language is the “quintessential feature” of human intelligence [85] since it is involved in many intelligent cognitive processes.
4 Concluding Remarks
In summary, further work towards a theory and fundamental principles of hybrid
neural systems is needed. First of all, there is promising work towards relating
automata theory with neural networks, or logics with such networks. Furthermore, the issue of representation needs more focus. In order to tackle larger real
world tasks using neural networks, for instance in information retrieval, learning
internet agents, or large-scale classification, further research on the underlying
vector representations for neural networks is important. Vertical forms of neural/symbolic hybridization are widely used in cognitive processing, logic
representation and language processing. Horizontal forms of neural/symbolic hybridization exist for larger tasks, such as speech/language integration, knowledge
engineering, intelligent agents or condition monitoring. Furthermore, it will be
interesting to see in the future to what extent computational neuroscience will
offer further ideas and constraints for building more sophisticated forms of neural
systems.
References
1. R. Andrews, J. Diederich, and A. B. Tickle. A survey and critique of techniques
for extracting rules from trained artificial neural networks. Technical report, Queensland
University of Technology, 1995.
2. R. Andrews and S. Geva. Rules and local function networks. In Proceedings of
the Rule Extraction From Trained Artificial Neural Networks Workshop, Artificial
Intelligence and Simulation of Behaviour, Brighton UK, 1996.
3. J. Barnden. Complex symbol-processing in Conposit. In R. Sun and L. Bookman,
editors, Architectures incorporating neural and symbolic processes. Kluwer, Boston,
1994.
4. J. A. Barnden and K. J. Holyoak, editors. Advances in connectionist and neural
computation theory, volume 3. Ablex Publishing Corporation, 1994.
5. J. M. Benítez, J. L. Castro, and I. Requena. Are artificial neural networks black boxes?
IEEE Transactions on Neural Networks, 8(5):1156–1164, 1997.
6. R. Bogacz and C. Giraud-Carrier. A novel modular neural architecture for rule-based and similarity-based reasoning. In Hybrid Neural Systems (this volume).
Springer-Verlag, 2000.
7. G. Bologna. Symbolic rule extraction from the DIMLP neural network. In Hybrid
Neural Systems (this volume). Springer-Verlag, 2000.
8. G. Bologna and C. Pellegrini. Accurate decomposition of standard MLP classification responses into symbolic rules. In International Work Conference on Artificial
and Natural Neural Networks, IWANN’97, pages 616–627, Lanzarote, Canary Islands,
1997.
9. Y. Cheng, P. Fortier, and Y. Normandin. A system integrating connectionist and
symbolic approaches for spoken language understanding. In Proceedings of the
International Conference on Spoken Language Processing, pages 1511–1514, Yokohama, 1994.
10. P. S. Churchland and T. J. Sejnowski. The Computational Brain. MIT Press,
Cambridge, MA, 1992.
11. T. Corbett-Clarke and L. Tarassenko. A principled framework and technique for
rule extraction from multi-layer perceptrons. In Proceedings of the 5th International
Conference on Artificial Neural Networks, pages 233–238, Cambridge, England,
July 1997.
12. J. Diederich and D. L. Long. Efficient question answering in a hybrid system. In
Proceedings of the International Joint Conference on Neural Networks, Singapore,
1992.
13. G. Dorffner. Neural Networks and a New AI. Chapman and Hall, London, UK,
1997.
An Overview of Hybrid Neural Systems
9
14. M. A. Fanty. Learning in structured connectionist networks. Technical Report 252,
University of Rochester, Rochester, NY, 1988.
15. J. Feldman and D. Bailey. Layered hybrid connectionist models for cognitive
science. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
16. J. A. Feldman and D. H. Ballard. Connectionist models and their properties.
Cognitive Science, 6:205–254, 1982.
17. J. A. Feldman, G. Lakoff, D. R. Bailey, S. Narayanan, T. Regier, and A. Stolcke.
L0 - the first five years of an automated language acquisition project. AI Review,
8, 1996.
18. P. Frasconi, M. Gori, and A. Sperduti. Integration of graphical rules with adaptive learning of structured information. In Hybrid Neural Systems (this volume).
Springer-Verlag, 2000.
19. L.M. Fu. Rule learning by searching on adapted nets. In Proceedings of the National
Conference on Artificial Intelligence, pages 590–595, 1991.
20. L.M. Fu. Neural Networks in Computer Intelligence. McGraw-Hill, Inc., New York,
NY, 1994.
21. S. I. Gallant. Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA, 1993.
22. S. I. Gallant. Context vectors: a step toward a grand unified representation. In
Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
23. J. Gelfand, D. Handleman, and S. Lane. Integrating knowledge-based systems
and neural networks for robotic skill. In Proceedings of the International Joint
Conference on Artificial Intelligence, pages 193–198, San Mateo, CA., 1989.
24. L. Giles and C. W. Omlin. Extraction, insertion and refinement of symbolic rules
in dynamically driven recurrent neural networks. Connection Science, 5:307–337,
1993.
25. S. Goonatilake and S. Khebbal. Intelligent Hybrid Systems. Wiley, Chichester,
1995.
26. N. A. Hallack, G. Zaverucha, and V. C. Barbosa. Towards a hybrid model of first-order theory refinement. In Hybrid Neural Systems (this volume). Springer-Verlag,
2000.
27. J. A. Hammerton and B. L. Kalman. Holistic symbol computation and the sequential RAAM: An evaluation. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
28. J. Hendler. Developing hybrid symbolic/connectionist models. In J. A. Barnden
and J. B. Pollack, editors, Advances in Connectionist and Neural Computation
Theory, Vol.1: High Level Connectionist Models, pages 165–179. Ablex Publishing
Corporation, Norwood, NJ, 1991.
29. M. Hilario. An overview of strategies for neurosymbolic integration. In Proceedings
of the Workshop on Connectionist-Symbolic Integration: From Unified to Hybrid
Approaches, pages 1–6, Montreal, 1995.
30. S. Hölldobler. A structured connectionist unification algorithm. In Proceedings of
the National Conference of the American Association on Artificial Intelligence 90,
pages 587–593, Boston, MA, 1990.
31. S. Hölldobler, Y. Kalinke, and J. Wunderlich. A recursive neural network for
reflexive reasoning. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
32. T. Honkela. Self-organizing maps in symbol processing. In Hybrid Neural Systems
(this volume). Springer-Verlag, 2000.
33. J. S. R. Jang and C. T. Sun. Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Transactions Neural Networks,
4(1):156–159, 1993.
34. D. Jurafsky, C. Wooters, G. Tajchman, J. Segal, A. Stolcke, E. Fosler, and N. Morgan. The Berkeley Restaurant Project. In Proceedings of the International Conference on Speech and Language Processing, pages 2139–2142, Yokohama, 1994.
35. P. Kanerva. Large patterns make great symbols: an example of learning from
example. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
36. C. Kirkham and T. Harris. Development of a hybrid neural network/expert system for machine health monitoring. In R. Rao, editor, Proceedings of the 8th
International Congress on Condition Monitoring and Engineering Management,
COMADEM95, pages 55–60, 1995.
37. G.K. Kraetzschmar, S. Sablatnoeg, S. Enderle, and G. Palm. Application of neurosymbolic integration for environment modelling in mobile robots. In Hybrid Neural
Systems (this volume). Springer-Verlag, 2000.
38. S. C. Kremer. A theory of grammatical induction in the connectionist paradigm.
Technical Report PhD dissertation, Dept. of Computing Science, University of
Alberta, Edmonton, 1996.
39. S.C. Kremer and J. Kolen. Dynamical recurrent networks for sequential data
processing. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
40. F. Kurfeß. Unification on a connectionist simulator. In T. Kohonen, K. Mäkisara,
O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 471–476.
North-Holland, 1991.
41. S. C. Kwasny and K. A. Faisal. Connectionism and determinism in a syntactic
parser. In N. Sharkey, editor, Connectionist natural language processing, pages
119–162. Lawrence Erlbaum, Hillsdale, NJ, 1992.
42. T. Lange and M. Dyer. High-level inferencing in a connectionist network. Connection Science, 1:181–217, 1989.
43. B. Lees, B. Kumar, A. Mathew, J. Corchado, B. Sinha, and R. Pedreschi. A hybrid
case-based neural network approach to scientific and engineering data analysis.
In Proceedings of the Eighteenth Annual International Conference of the British
Computer Society Specialist Group on Expert Systems, pages 245–260, Cambridge,
1998.
44. H. Lipson and H.T. Siegelmann. High order eigentensors as symbolic rules in
competitive learning. In Hybrid Neural Systems (this volume). Springer-Verlag,
2000.
45. J. MacIntyre and P. Smith. Application of hybrid systems in the power industry.
In L. Medsker, editor, Intelligent Hybrid Systems, pages 57–74. Kluwer Academic
Press, 1995.
46. M. R. Mayberry and R. Miikkulainen. Combining maps and distributed representations for shift-reduce parsing. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
47. K. McGarry, S. Wermter, and J. MacIntyre. Hybrid neural systems: from simple
coupling to fully integrated neural networks. Neural Computing Surveys, 2:62–94,
1999.
48. L. R. Medsker. Hybrid Neural Network and Expert Systems. Kluwer Academic
Publishers, Boston, 1994.
49. L. R. Medsker. Hybrid Intelligent Systems. Kluwer Academic Publishers, Boston,
1995.
50. R. Miikkulainen. Subsymbolic Natural Language Processing. MIT Press, Cambridge, MA, 1993.
51. W. C. Morris, G. W. Cottrell, and J. L. Elman. A connectionist simulation of
the empirical acquisition of grammatical relations. In Hybrid Neural Systems (this
volume). Springer-Verlag, 2000.
52. M. C. Mozer. Neural net architectures for temporal sequence processing. In A. Weigend and N. Gershenfeld, editors, Time series prediction: Forecasting the future and
understanding the past, pages 243–264. Addison-Wesley, Redwood City, CA, 1993.
53. C. W. Omlin and C. L. Giles. Extraction and insertion of symbolic information
in recurrent neural networks. In V. Honavar and L. Uhr, editors, Artificial Intelligence and Neural Networks: Steps Towards Principled Integration, pages 271–299.
Academic Press, San Diego, 1994.
54. C. W. Omlin and C. L. Giles. Extraction of rules from discrete-time recurrent
neural networks. Neural Networks, 9(1):41–52, 1996.
55. C.W. Omlin, L. Giles, and K. K. Thornber. Fuzzy knowledge and recurrent neural networks: A dynamical systems perspective. In Hybrid Neural Systems (this
volume). Springer-Verlag, 2000.
56. C. Orovas and J. Austin. A cellular neural associative array for symbolic vision.
In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
57. J. Park and I. W. Sandberg. Universal approximation using radial basis function
networks. Neural Computation, 3:246–257, 1991.
58. N. S. Park. Addressing knowledge representation issues in connectionist symbolic rule encoding for general inference. In Hybrid Neural Systems (this volume).
Springer-Verlag, 2000.
59. T. Peterson and R. Sun. An RBF network alternative for a hybrid architecture.
In International Joint Conference on Neural Networks, Anchorage, AK, May 1998.
60. J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46:77–
105, 1990.
61. R. Reilly. Evolution of symbolisation: Signposts to a bridge between connectionist
and symbolic systems. In Hybrid Neural Systems (this volume). Springer-Verlag,
2000.
62. R. G. Reilly and N. E. Sharkey. Connectionist Approaches to Natural Language
Processing. Lawrence Erlbaum Associates, Hillsdale, NJ, 1992.
63. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors,
Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, Cambridge,
MA, 1986.
64. N. Sharkey and N. T. Ziemke. Life, mind and robots: The ins and outs of embodied
cognition. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
65. L. Shastri. A model of rapid memory formation in the hippocampal system. In Proceedings of the Meeting of the Cognitive Science Society, pages 680–685, Stanford,
1997.
66. L. Shastri. Types and quantifiers in SHRUTI: a connectionist model of rapid reasoning and relational processing. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
67. L. Shastri and V. Ajjanagadde. From simple associations to systematic reasoning:
A connectionist representation of rules, variables and dynamic bindings. Behavioral
and Brain Sciences, 16(3):417–494, 1993.
68. J. Shavlik. A framework for combining symbolic and neural learning. In V. Honavar
and L. Uhr, editors, Artificial Intelligence and Neural Networks: Steps Towards
Principled Integration, pages 561–580. Academic Press, San Diego, 1994.
69. P. Smolensky. On the proper treatment of connectionism. Behavioral and Brain
Sciences, 11(1):1–74, March 1988.
70. A. Sperduti, A. Starita, and C. Goller. Learning distributed representations for
the classifications of terms. In Proceedings of the International Joint Conference
on Artificial Intelligence, pages 494–515, Montreal, 1995.
71. R. Sun. On variable binding in connectionist networks. Connection Science,
4(2):93–124, 1992.
72. R. Sun. Integrating Rules and Connectionism for Robust Commonsense Reasoning.
Wiley, New York, 1994.
73. R. Sun. Hybrid connectionist-symbolic models: A report from the IJCAI95 workshop on connectionist-symbolic integration. Artificial Intelligence Magazine, 1996.
74. R. Sun. Supplementing neural reinforcement learning with symbolic methods. In
Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
75. R. Sun and F. Alexandre. Proceedings of the Workshop on Connectionist-Symbolic
Integration: From Unified to Hybrid Approaches. McGraw-Hill, Inc., Montreal,
1995.
76. R. Sun and F. Alexandre. Connectionist Symbolic Integration. Lawrence Erlbaum
Associates, Hillsdale, NJ, 1997.
77. R. Sun and L.A. Bookman. Computational Architectures Integrating Neural and
Symbolic Processes. Kluwer Academic Publishers, Boston, MA, 1995.
78. R. Sun and T. Peterson. Autonomous learning of sequential tasks: experiments
and analyses. IEEE Transactions on Neural Networks, 9(6):1217–1234, 1998.
79. S. Thrun. Extracting rules from artificial neural networks with distributed representations. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural
Information Processing Systems 7. MIT Press, San Mateo, CA, 1995.
80. S. Thrun. Explanation-Based Neural Network Learning. Kluwer, Boston, 1996.
81. A. Tickle, F. Maire, G. Bologna, R. Andrews, and J. Diederich. Lessons from
past, current issues and future research directions in extracting the knowledge
embedded in artificial neural networks. In Hybrid Neural Systems (this volume).
Springer-Verlag, 2000.
82. P. Tino, G. Dorffner, and C. Schittenkopf. Understanding state space organization
in recurrent neural networks with iterative function systems dynamics. In Hybrid
Neural Systems (this volume). Springer-Verlag, 2000.
83. H. Tirri. Replacing the pattern matcher of an expert system with a neural network.
In S. Goonatilake and S. Khebbal, editors, Intelligent Hybrid Systems, pages 47–62.
John Wiley and Sons, 1995.
84. M.L. Vaughn, S.J. Cavill, S.J. Taylor, M.A. Foy, and A.J.B. Fogg. Direct knowledge
extraction and interpretation from a multilayer perceptron network that performs
low back pain classification. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
85. D. Waltz. The importance of importance. In Presentation at Workshop on Hybrid
Neural Symbolic Integration, Breckenridge, CO., 1998.
86. D. L. Waltz and J. A. Feldman. Connectionist Models and their Implications.
Ablex, 1988.
87. S. Wermter. Hybrid Connectionist Natural Language Processing. Chapman and
Hall, Thomson International, London, UK, 1995.
88. S. Wermter. Preference Moore machines for neural fuzzy integration. In Proceedings
of the International Joint Conference on Artificial Intelligence, pages 840–845,
Stockholm, 1999.
89. S. Wermter. The hybrid approach to artificial neural network-based language
processing. In R. Dale, H. Moisl, and H. Somers, editors, A Handbook of Natural
Language Processing. Marcel Dekker, 2000.
90. S. Wermter. Knowledge extraction from transducer neural networks. Applied
Intelligence: The International Journal of Artificial Intelligence, Neural Networks,
and Complex Problem-Solving Techniques, 12:27–42, 2000.
91. S. Wermter, G. Arevian, and C. Panchev. Towards hybrid neural learning internet
agents. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
92. S. Wermter and M. Meurer. Building lexical representations dynamically using
artificial neural networks. In Proceedings of the International Conference of the
Cognitive Science Society, pages 802–807, Stanford, 1997.
93. S. Wermter, C. Panchev, and G. Arevian. Hybrid neural plausibility networks for
news agents. In Proceedings of the National Conference on Artificial Intelligence,
pages 93–98, Orlando, USA, 1999.
94. S. Wermter, E. Riloff, and G. Scheler. Connectionist, Statistical and Symbolic
Approaches to Learning for Natural Language Processing. Springer, Berlin, 1996.
95. S. Wermter and V. Weber. SCREEN: Learning a flat syntactic and semantic spoken
language analysis using artificial neural networks. Journal of Artificial Intelligence
Research, 6(1):35–85, 1997.
Layered Hybrid Connectionist Models for
Cognitive Science
Jerome Feldman and David Bailey
International Computer Science Institute,
Berkeley CA 94704
Abstract. Direct connectionist modeling of higher cognitive functions,
such as language understanding, is impractical. This chapter describes
a principled multi-layer architecture that supports AI style computational modeling while preserving the biological plausibility of structured
connectionist models. As an example, the connectionist realization of
Bayesian model merging as recruitment learning is presented.
1 Hybrid Models in Cognitive Science
Almost no one believes that connectionist models will suffice for the full range of
tasks in creating and modeling intelligent systems. People whose goals are primarily performance programs have no compunction about deploying hybrid systems, and rightly so. But many connectionists are primarily interested in modeling human and other animal intelligence, and it is not as clear what methodology is
most appropriate in this enterprise. This paper provides one answer that has
been useful to our group in our decade of effort in modeling language acquisition
in the NTL (originally L0) project.
Connectionists of all persuasions agree that intelligence will best be explained in terms of its neural foundations using computational models with simple
notions of spreading activation and experience-based weight change. But no one
claims that a model containing millions (much less billions) of units is itself a
scientific description of some phenomenon, such as vision or language understanding. Beyond this basic agreement there is a major bifurcation into two
approaches. The (larger) PDP community believes that progress is best made
by training large back propagation networks (and more recently Elman style
recurrent nets) to perform specific functions and then examining the learned
weights for patterns and insight. There is a lot of good current work on extracting higher-level representations from learned weights, and this is discussed in
other chapters. But there is no evidence that PDP networks can simply learn the
full range of intelligent behavior, and they will not be discussed in this chapter.
The other main approach to connectionist modeling is usually called structured, because varying amounts of specifically designed computational mechanism
are built into the model. This work is almost always localist in character, because
it is much more natural to postulate that the pre-wired computational mechanisms are realized by localized circuits, especially if one is actually building the model.

S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 14–27, 2000.
© Springer-Verlag Berlin Heidelberg 2000

In principle, structured connectionist models (SCM) could capture the
exact brain structure underlying some behavior and, in fact, models of this sort
are common in computational neuroscience. But for the complex behaviors underlying intelligence, a complete SCM would not be any simpler to understand
than the neural circuitry itself. And, of course, we don’t know nearly enough
about either the brain or cognition to even approximate such iconic models.
One suggestion that came up in the early years was to use conventional symbolic models for so-called higher functions and restrict connectionist
modeling to simple sensory-motor behaviors. This defeatist stance was never
very popular because it left higher cognition without the parallel, fault tolerant,
evidential computational mechanisms that are the heart of connectionism. Something like this is now more feasible because the conventional AI approach has
become a probabilistic belief network and thus more likely to be mappable to
connectionist models. Even so, if one is serious about cognitive modeling, there
are good reasons to restrict choices to computational mechanisms that are at
least arguably within the size, speed and learnability constraints of the brain.
For example, in general it takes exponential time for an arbitrary belief network to settle and thus specializations would be needed for a plausible cognitive
model.
Although the idea was implicit in earlier structured connectionist work, it is
only recently that we have enunciated a systematic philosophy on how to build
hybrid connectionist cognitive models. The central idea is hierarchical reducibility: any construct posited at a higher modeling level must have a computationally and biologically plausible reduction to the level below. The table below
depicts our current five level structure. We view this approach as a perfectly
ordinary instance of the scientific method as routinely practiced in the physical
and life sciences. But it does seem to provide us with a good working style providing tractability while maintaining connectionist principles and the potential
for direct experimental testing.
cognitive:      words, concepts
computational:  f-structs, x-schemas (see below)
connectionist:  structured models, learning rules
comp. neuro.:   detailed neural models
neural:         [implicit]
Our computational level is analogous to Marr’s and comprises a mixture
of familiar notions like feature structures and a novel representation, executing
schemas, described below. Apart from providing a valuable scientific language
for specifying proposed structures and mechanisms, these representational formalisms can be implemented in simulations to allow us to test our hypotheses.
They also support computational learning algorithms so we can use them in experiments on acquisition. Importantly, these computational mechanisms are all
reducible to structured connectionist models so that embodiment can be realized.
It is not necessarily easy to carry out these reductions and a great deal of effort has gone into understanding them. Perhaps the most challenging problem is
the connectionist representation of variable binding. This has been addressed for
over a decade by Shastri and his students [13,12] and also by a number of other
groups [9,15]. This body of work has shown that connectionist models can indeed
encode a large body of systematic knowledge and perform interesting inferences in parallel via spreading activation. A recent extension of these techniques
[7] supports the connectionist realization of the X-schema formalism discussed
below. For contrast, we focus here on a rather different kind of problem: the
mapping of the Bayesian model merging technique [14] to the standard structured connectionist mechanism of recruitment learning [6]. The particular context
is David Bailey’s dissertation [2], a model of how children learn the words
for simple actions of their hand.
2 Verblearn and Its Connectionist Reduction
Bailey’s Verblearn system has three major subparts, as depicted at the computational level in Figure 1. The bottom section of Figure 1 depicts the underlying
actions, encoded as X-schemas. The top third depicts various possible word senses associated with an action, encoded as feature structures. The task is to
learn the best set of word senses for describing the variations expressed in the
training language. The crucial linking mechanism (also encoded as a feature
structure) is shown in the center of Figure 1. As will be described below, the
system learns an appropriate set of word senses and can demonstrate its knowledge by labeling novel actions or carrying out commands specified with newly
learned words.
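In data-structure terms, each word sense in the upper third of Figure 1 is a feature structure whose slots carry learned strengths. The sketch below encodes the first push sense with the weights shown in the figure; the dictionary layout and the scoring helper are our own illustrative choices, not Bailey’s implementation:

```python
# One word sense from Figure 1, encoded as a feature structure: each
# linking feature maps to a weighted distribution over its values.
# (The weights are those shown for the first sense of "push".)
push1 = {
    "schema":    {"slide": 1.0},
    "posture":   {"palm": 0.7},
    "elbow jnt": {"extend": 0.9},
    "aspect":    {"once": 0.8},
    "object":    {"cube": 0.8},
}

def applicability(sense, situation):
    """Score how well a sense matches the currently active feature
    values (situation: one active value per feature)."""
    return sum(sense[f].get(v, 0.0) for f, v in situation.items() if f in sense)

situation = {"schema": "slide", "posture": "palm", "aspect": "once"}
print(round(applicability(push1, situation), 2))  # 2.5
```

Labelling a novel action then amounts to computing this score for every stored sense and reporting the verb of the best-matching one.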
The goal of this paper is to show how Bailey’s computational level account
can be reduced to the structured connectionist level in a principled way. The
two main data structures we need to map are feature structures and executing
schemas (X-schemas). As mentioned above, X-schemas can be modeled at the
connectionist level as an extension of Shruti [7].
The modeling of individual feature values follows the standard structured
connectionist strategy (and a biologically plausible one as well) of using place
coding [6]. That is, for each feature, there is a dedicated connectionist unit (i.e.
coding [6]. That is, for each feature, there is a dedicated connectionist unit (i.e.
a separate place) for each possible value of the feature. Within the network
representing the possible values of a feature, we desire a winner-take-all (WTA)
behavior. That is, only one value unit should be active at a time, once the network
has settled, again in the standard way [6]. Representing the link between entities,
feature names and feature values is done, again as usual, with triangle nodes.
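As a rough illustration of place coding with winner-take-all settling, the sketch below dedicates one unit per possible value of a feature and lets mutual inhibition drive the subnetwork toward a single active value. The inhibition weight and update rule are arbitrary choices for demonstration, not the circuit actually specified in [6]:

```python
# Winner-take-all over the place-coded value units of one feature:
# each unit keeps its own activation but is suppressed by its rivals,
# until only the strongest unit remains active.
def winner_take_all(activations, inhibition=0.5, steps=50):
    a = list(activations)
    for _ in range(steps):
        total = sum(a)
        # subtract inhibition proportional to the rivals' total activity
        a = [max(0.0, x - inhibition * (total - x)) for x in a]
    return a

# Place coding for the feature "posture": one unit per possible value.
posture_units = {"grasp": 0.2, "palm": 0.9, "index": 0.4}
settled = winner_take_all(list(posture_units.values()))
winner = max(zip(posture_units, settled), key=lambda kv: kv[1])[0]
print(winner)  # palm
```

After settling, only the "palm" unit is active, which is exactly the one-value-per-feature behaviour the text asks for.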
2.1 Triangle Units
The essential building block is the triangle unit [6,4], shown in Figure 2(a).
A triangle unit is an abstraction of a neural circuit which effects a three-way
binding. In the figure, the units A, B and C represent arbitrary “concepts”
which are bound by the triangle unit. All connections shown are bidirectional
and excitatory. The activation function of a triangle unit is such that activation
Layered Hybrid Connectionist Models for Cognitive Science
17
Fig. 1. An overview of the verb-learner at the computational level, showing details of
the Slide x-schema, some linking features, and two verbs: push (with two senses) and
shove (with one sense).
Fig. 2. (a) A simple triangle unit which binds A, B and C. (b) One possible neural
realization.
on any two of its incoming connections causes an excitatory signal to be sent
out over all three outgoing connections. Consequently, the triangle unit allows
activation of A and B to trigger C, or activation of A and C to trigger B, etc.
Triangle nodes will be used here as abstract building blocks, but Figure 2(b)
illustrates one possible connectionist realization. A single unit is employed to
implement the binding, and each concept unit projects onto it. Concept units
are assumed to fire at a uniform high rate when active and all weights into
the main unit are equal. As a result, each input site of the triangle unit can
be thought of as producing a single 0-or-1 value (shown as lower-case a, b and
c) indicating whether its corresponding input unit is active. The body of the
binding unit then just compares the sum of these three values to the threshold
of 2. If the threshold is met, the unit fires. Its axon projects to all three concept
units, and the connections are strong enough to activate all of the concept units,
even those receiving no external input.
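The 2-of-3 firing rule of Figure 2(b) is easy to simulate; the toy step function below is our own sketch of that behaviour:

```python
# A triangle unit binding three concept units: it fires when at least
# two of A, B, C are active, and its output then re-activates all three.
def triangle_step(active):
    """active: dict mapping concept name to bool. Returns updated dict."""
    a, b, c = active["A"], active["B"], active["C"]
    fires = (a + b + c) >= 2               # threshold of 2, as in Fig. 2(b)
    if fires:
        return {k: True for k in active}   # excite all three concepts
    return dict(active)                    # below threshold: no change

# Activation of A and B triggers C:
print(triangle_step({"A": True, "B": True, "C": False}))
# {'A': True, 'B': True, 'C': True}
```

By symmetry, activating A and C triggers B, and a single active concept leaves the binding silent.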
Fig. 3. Using a triangle unit to represent the value (palm) of a feature (posture) for
an entity ("push").
A particularly useful type of three-way binding consists of an entity, a feature,
and a value for the feature, as shown in Figure 3. With this arrangement, if
posture and palm are active, then "push" will be activated—a primitive version
of the labelling process. Alternatively, if "push" and posture are active, then
palm will be activated—a primitive version of obeying. The full story [1] requires
a more complex form of triangle units, but the basic ideas can be conveyed with
just the simple form.
2.2 Connectionist Level Network Architecture
This section describes a network architecture which implements (approximately)
the multiple-sense verb representation and its associated algorithms for labelling
and obeying. The architecture is shown in Figure 4, whose layout is intended to
be reminiscent of the upper half of Figure 1.
Fig. 4. A connectionist version of the model, using a collection of triangle units for
each word sense.
On the top is a “vocabulary” subnetwork containing a unit for each known
verb. Each verb is associated with a collection of phonological and morphological
details, whose connectionist representation is not considered here but is indicated
by the topmost “blob” in the figure. Each verb unit can be thought of as a binding
unit which ties together such information. The verb units are connected in a
winner-take-all fashion to facilitate choosing the best verb for a given situation.
On the bottom is a collection of subnetworks, one for each linking feature.
The collection is divided into two groups. One group—the motor-parameter
features—is bidirectionally connected to the motor control system, shown here
as a blob for simplicity. The other group—the world-state features—receives
connections from the perceptual system, which is not modelled here and is indicated by the bottom-right blob. Each feature subnetwork consists of one unit
for each possible value. Within each feature subnetwork, units are connected in
a winner-take-all fashion. A separate unit also represents each feature name.
The most interesting part of the architecture is the circuitry connecting the
verb units to the feature units. In the central portion of Figure 4, the connectionist
representations of two senses of push are shown, each demarcated by a box. Each
sense requires several triangle units with specialized functions.
One triangle unit for each sense can be thought of as primary; these are drawn
larger and labelled “push1” and “push2”. These units are of the soft conjunctive
type and serve to integrate information across the features which the sense is
concerned about. Their left side connects to the associated verb unit. Their right
side has multiple connections to a set of subsidiary triangle units, one for each
world-state feature (although only one is shown in the figure). The lower side of
the primary triangle unit works similarly, but for the motor-parameter features
(two are shown in the figure). Note also that the primary triangle units are
connected into a lexicon-wide winner-take-all network.
2.3 Labelling and Obeying
We can now illustrate how the network performs labelling and obeying. Essentially, these processes involve providing strong input to two of the three sides of
some word sense’s primary triangle unit, resulting in activation of the third side.
For labelling, the process begins when x-schema execution and the perceptual
system activate the appropriate feature and value units in the lower portion of
Figure 4. In response—and in parallel—every subsidiary triangle unit connected
to an active feature unit weighs the suitability of the currently active value unit
according to its learned connection strengths. In turn, these graded responses
are delivered to the lower and right-hand sides of each word sense’s primary
triangle unit. The triangle units become active to varying degrees, depending
on the number of activated subsidiary units and their degrees of activation. The
winner-take-all mechanism ensures that only one primary unit dominates, and
when that occurs the winning primary unit turns on its associated verb unit.
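The labelling pass can be sketched in a few lines. All names, features, and weights below are illustrative assumptions, not the authors' implementation: each word sense's primary triangle unit sums the graded responses of its subsidiary units, one per feature, and a winner-take-all step leaves a single sense active, which would then turn on its verb unit.

```python
# Hedged sketch of labelling: subsidiary units score observed feature values
# against learned strengths; primary units integrate; WTA picks one sense.

def sense_activation(weights, observed):
    """Sum the subsidiary units' learned strengths for the observed values."""
    return sum(weights[f].get(v, 0.0) for f, v in observed.items() if f in weights)

def winner_take_all(activations):
    """Return the label of the single dominating primary unit."""
    return max(activations, key=activations.get)

senses = {
    "push1": {"force": {"low": 0.9, "med": 0.1}, "size": {"large": 0.8}},
    "push2": {"force": {"high": 0.9}, "size": {"small": 0.7}},
}

# Feature values activated by x-schema execution and the perceptual system
observed = {"force": "high", "size": "small"}
acts = {name: sense_activation(w, observed) for name, w in senses.items()}
print(winner_take_all(acts))  # push2
```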
For obeying, we assume one verb unit has been activated (say, by the auditory system) and the appropriate world-state feature and value units have been
activated (by the perceptual system). As a result, the only primary triangle units
receiving activation on more than one side will be those connected to the command verb. This precipitates a competition amongst those senses to see which
has the most strongly active world-state subsidiary triangle units—that is, which
sense is most applicable to the current situation. The winner-take-all mechanism
boosts the winner and suppresses the others. When the winner’s activation peaks, it sends activation to its motor-parameter subsidiary triangle units. These,
in turn, will activate the motor-parameter value units in accordance with the
learned connection strengths. Commonly this will result in partial activation on
multiple values for some features. The winner-take-all mechanism within each
feature subnetwork chooses a winner. (Alternatively, we might prefer to preserve
the distributed activation pattern for use by smarter x-schemas which can reason
with probabilistic specification of parameters. E.g., if all the force value units are
weakly active, the x-schema knows it should choose a suitable amount of force.)
3 Learning - Connectionist Account
The ultimate goal of the system is to learn the right word senses from labeled
experience. The verb learning model assumes that the agent has already acquired
various x-schemas for the actions of one hand manipulating an object on a table
and that an informant labels actions that the agent is performing. The algorithm
starts by assuming that each instance (e.g. of a word sense) is a new category
Layered Hybrid Connectionist Models for Cognitive Science
21
and then proceeds to merge these until a total information criterion no longer
improves.
More technically, the learning task is an optimization problem, in that we
seek, amongst all possible lexicons, the “best” one given the training set. We
seek the lexicon model m that is most probable given the training data t.
argmax_m P(m | t)    (1)
The probability being maximized is the a posteriori probability of the model, and
our algorithm is a “maximum a posteriori” (MAP) estimator. The fundamental
insight of Bayesian learning is that this quantity can be decomposed, using Bayes’
rule, into components which separate the fit to the training data and an a priori
preference for certain models over others.
P(m | t) ∝ P(m) P(t | m)    (2)
Here, as usual, the prior term P(m) penalizes model complexity (simpler models receive higher prior probability), and the likelihood term P(t | m) is a measure of how well the model fits
the data, in this case the labeled actions. The goal is to adjust the model to
optimize the overall fit; model merging is one algorithm for doing so.
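As a toy rendering of Equations (1) and (2), candidate lexicons can be scored by log posterior, log P(m | t) = log P(m) + log P(t | m) + const. The complexity-penalizing prior and the likelihood numbers below are our illustrative assumptions.

```python
# Score candidate lexicon models by log posterior (Equation 2 in log space).
# The prior penalizes the number of word senses; likelihoods are made up.

def log_posterior(num_senses, log_likelihood, alpha=1.0):
    log_prior = -alpha * num_senses   # simpler lexicons preferred a priori
    return log_prior + log_likelihood

# A 3-sense lexicon fits slightly better, but the 2-sense lexicon wins
# once the prior's complexity penalty is included.
score_3 = log_posterior(3, log_likelihood=-10.0)
score_2 = log_posterior(2, log_likelihood=-10.5)
print(score_2 > score_3)  # True
```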
In general terms, the algorithm is:
Model merging algorithm:
1. Create a simple model for each example in the training set.
2. Repeat the following until the posterior probability decreases:
   a) Find the best candidate pair of models to merge.
   b) Merge the two models to form a possibly more complex model, and remove the original models.
In our case, “model” in the name “model merging” refers to an individual
word sense f-struct. The learning algorithm creates a separate word sense for
every occurrence of a word, and then merges these word sense f-structs so long
as the reduction in the number of word senses outweighs the loss of training-set
likelihood resulting from the merge.
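The loop above can be sketched directly. This is a greedy, illustrative rendering; the dict encoding of a sense and the MAP-style score are stand-ins of ours, not the thesis implementation.

```python
import math

# Greedy model merging: start with one "sense" per example (feature -> value
# counts), merge the best-scoring pair while the stand-in posterior improves.

def merge(a, b):
    """Merge two senses by pooling their feature-value counts."""
    out = {f: dict(a.get(f, {})) for f in set(a) | set(b)}
    for f, vals in b.items():
        for v, c in vals.items():
            out[f][v] = out[f].get(v, 0) + c
    return out

def log_likelihood(sense):
    """Log-likelihood of a sense's own counts under its empirical distribution."""
    ll = 0.0
    for vals in sense.values():
        total = sum(vals.values())
        for c in vals.values():
            ll += c * math.log(c / total)
    return ll

def score(senses, alpha=0.5):
    # Stand-in MAP score: data fit minus a complexity penalty per sense.
    return sum(log_likelihood(s) for s in senses) - alpha * len(senses)

def model_merging(examples):
    senses = [{f: dict(v) for f, v in e.items()} for e in examples]  # step 1
    while len(senses) > 1:
        best = None
        for i in range(len(senses)):
            for j in range(i + 1, len(senses)):
                cand = ([s for k, s in enumerate(senses) if k not in (i, j)]
                        + [merge(senses[i], senses[j])])
                if best is None or score(cand) > score(best):
                    best = cand
        if score(best) <= score(senses):   # step 2: stop when the score drops
            break
        senses = best
    return senses

# Two identical "low force" examples merge; the "high force" one stays apart.
result = model_merging([{"force": {"low": 1}},
                        {"force": {"low": 1}},
                        {"force": {"high": 1}}])
print(len(result))  # 2
```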
A major advantage of the model merging algorithm is that it is one-shot.
After a single training example for a new verb, the system is capable of using
the verb in a meaningful, albeit limited, way. Model merging is also relatively
efficient since it does not backtrack. Yet it often successfully avoids poor local minima because its bottom-up rather than top-down strategy is less likely
to make premature irreversible commitments. We now consider how the model
merging algorithm can be realized in a connectionist manner, so that we will
have a unified connectionist story for the entire system.
At first glance, the model merging algorithm does not appear particularly
connectionist. Two properties cause trouble. First, the algorithm is constructivist. That is, new pieces of representation (word senses) need to be built, as
opposed to merely gradually changing existing structures. Second, the criterion
for merging is a global one, rather than depending on local properties of word
senses. Nevertheless, we have a proposed connectionist solution employing a
learning technique known as recruitment learning.
3.1 Recruitment Learning
Recruitment learning [5,11] assumes a localist representation of bindings such as
the triangle unit described in §2.1, and provides a rapid-weight-change algorithm
for forming such “effective circuits” from previously unused connectionist units.
Figure 5 illustrates recruitment with an example. Recall that a set of triangle
nodes is usually connected in a winner-take-all (WTA) fashion to ensure that
only one binding reaches an activation level sufficiently high to excite its third
member. For recruitment learning, we further posit that there is a pool of “free”
triangle units which also take part in the WTA competition. The units are free
in that they have low, random weights to the various “concept units” amongst
which bindings can occur. Crucially, though, they do have connections to these
concept units. But the low weights prevent these free units from playing an active
role in representing existing bindings.
Fig. 5. Recruitment of triangle unit T3 to represent the binding E–F–G.
This architecture supports the learning of new bindings as follows. Suppose,
as in Figure 5, several triangle units already represent several bindings, such as
T1, which represents the binding of A, C and F. (The bindings for T2 are not
shown.) Suppose further that concept units E, F and G are currently active, and
the WTA network of triangle units is instructed (e.g. by a chemical mechanism)
that this binding must be represented. If there already exists a triangle unit
representing the binding, it will be activated by the firing of E, F and G, and
that will be that. But if none of the already-recruited triangle units represents
the binding, then it becomes possible for one of the free triangle units (e.g.
T3)—whose low, random weights happen to slightly bias it toward this new
binding—to become weakly active. The WTA mechanism selects this unit and
increases its activation, which then serves as a signal to the unit to rapidly
strengthen its connections to the active concept units.1 It thereby joins the pool
of recruited triangle units.
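A minimal sketch of this recruitment step follows; the class name, pool size, and weight ranges are our assumptions. Free units carry low random weights to every concept unit; the unit whose random weights happen to favour the active concepts wins the WTA, and its activation triggers a rapid, lasting strengthening of exactly those connections.

```python
import random

# Sketch of recruitment: free triangle units hold low, random weights to all
# concept units; a novel binding recruits the best-biased free unit, which
# clamps its connections to the active concepts high (LTP-like change).

random.seed(0)
CONCEPTS = ["A", "B", "C", "D", "E", "F", "G"]

class TriangleUnit:
    def __init__(self):
        self.w = {c: random.uniform(0.0, 0.1) for c in CONCEPTS}  # low, random
        self.recruited = False

    def activation(self, active):
        return sum(self.w[c] for c in active)

def recruit(pool, active_concepts, strong=1.0):
    # An already-recruited unit representing the binding would win outright;
    # otherwise the best-biased free unit wins and is recruited.
    winner = max(pool, key=lambda u: u.activation(active_concepts))
    if not winner.recruited:
        for c in active_concepts:
            winner.w[c] = strong          # rapid, permanent weight change
        winner.recruited = True
    return winner

pool = [TriangleUnit() for _ in range(5)]
t3 = recruit(pool, {"E", "F", "G"})
# t3 now responds strongly to E-F-G and only weakly to other conjunctions
```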
As described, the technique seems to require full connectivity and enough
unrecruited triangle units for all possible conjunctions. Often, though, the overall
architecture of a neural system provides constraints which greatly reduce the
number of possible bindings, compared to the number possible if the pool of
concept units is considered as an undifferentiated whole. For example, in our
connectionist word sense architecture, it is reasonable to assume that the initial
neural wiring is predisposed toward binding words to features—not words to
words, or feature units to value units of a different feature. The view that the
brain starts out with appropriate connectivity between regions on a coarse level
is bolstered by the imaging studies of [3] which show, for example, different
localization patterns for motor verbs (nearer the motor areas) vs. other kinds of
verbs.
Still, the number of potential bindings and connections may be daunting.
It turns out, though, that sparse random connection patterns can alleviate this
apparent problem [5]. The key idea is to use a multi-layered scheme for representing bindings, in which each binding is represented by paths amongst the
to-be-bound units rather than direct connections. The existence of such paths
can be shown to have high probability even in sparse networks, for reasonable
problem sizes [16].
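A rough simulation illustrates why sparse random connectivity suffices; the layer size and fan-out below are our illustrative choices, not figures from [5] or [16]. Two units each wired to a random 10% of an intermediate layer almost always share a common neighbour, i.e., a two-step binding path exists.

```python
import random

# Estimate how often two sparsely connected units share an intermediate
# neighbour (a 2-step path) in a random wiring.

random.seed(1)

def share_neighbour(n_mid, fan_out):
    a = set(random.sample(range(n_mid), fan_out))
    b = set(random.sample(range(n_mid), fan_out))
    return bool(a & b)

trials = 1000
hits = sum(share_neighbour(n_mid=1000, fan_out=100) for _ in range(trials))
print(hits / trials)  # essentially 1.0: a shared neighbour almost always exists
```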
3.2 Merging Via Recruitment
The techniques of recruitment learning can be put to use to create the word
sense circuitry shown earlier in Figure 4. The connectionist learning procedure
does not exactly mimic the algorithm given above but captures the main ideas.
To illustrate our connectionist learning procedure, we will assume that the two
senses of push shown in Figure 4 have already been learned, and a new training
example has just occurred. That is, the “push” unit has just become active, as
have some of the feature value units reflecting the just-executed action.
1. This kind of rapid and permanent weight change, often called long-term potentiation or LTP, has been documented in the nervous system. It is a characteristic of the NMDA receptor, but may not be exclusive to it. It is hypothesized to be implicated in memory formation. See [10] for details on the neurobiology, or [12] for a more detailed connectionist model of LTP in memory formation.
The first key observation is that when a training example occurs, external activation arrives at a verb unit, motor-parameter feature value units, and
world-state feature value units. This three-way input is the local cue to the various triangle units that adaptation should occur—labelling and obeying never
produce such three-way external input to the triangle units. Depending on the
circumstances, there are three possible courses of action the net may take:
– Case 1: The training example’s features closely match those of an
existing word sense. This case is detected by activation of the primary
triangle unit of the matching sense—strong enough activation to dominate
the winner-take-all competition.
In this case, an abbreviated version of merging occurs. Rather than create a
full-fledged initial word sense for the new example, only to merge it into the
winning sense, the network simply “tweaks” the winning sense to accommodate the current example’s features. Conveniently, the winning sense’s
primary triangle unit can detect this situation using locally available information, namely: (1) it is highly active; and (2) it is receiving activation on
all three sides. The tweaking itself is a version of Hebb’s Rule [8]: the weights
on connections to active value units are incrementally strengthened. With
an appropriate weight update rule, this strategy can mimic the probability
distributions learned by the model merging algorithm.
– Case 2: The training example’s features do not closely match any
existing sense. This case is detected by failure of the winner-take-all mechanism to elevate any word sense above a threshold level.
In this case, standard recruitment learning is employed. Pools of unrecruited
triangle units are assumed to exist, pre-wired to function as either primary
or subsidiary units in future word senses. After the winner-take-all process
fails to produce a winner from the previously-recruited set of triangle units,
recruitment of a single new primary triangle unit and a set of new subsidiary
units occurs. The choice will depend on the connectivity and initial weights
of the subsidiary units to the feature value units, but will also depend on the
connections amongst the new units which are needed for the new sense to
cohere. Once chosen, these units’ weights are set to reflect the currently active
linking feature values, thereby forming a new word sense which essentially is
a copy of the training example.
– Case 3: The training example’s features are a moderate match
to two (or more) existing word senses. This case is detected by a
protracted competition between the two partially active senses which cannot
be resolved by the winner-take-all mechanism. Figure 6 depicts this case. As
indicated by the darkened ovals, the training example is labelled “push”
but involved medium force applied to a small size object—a combination
which doesn’t quite match either existing sense.
This case triggers recruitment of triangle units to form a new sense as described for case 2, but with an interesting twist. The difference is that the
weights of the new subsidiary triangle units will reflect not only the linking
features of the current training example, but also the distribution of values
Fig. 6. Connectionist merging of two word senses via recruitment of a new triangle unit circuit.
represented in the partially active senses. Thus, the newly recruited sense
will be a true merge of the two existing senses (as well as the new training
example). Figure 6 illustrates this outcome by the varying thicknesses on
the connections to the value units. If you inspect these closely you will see
that the new sense “push12” encodes broader correlations with the force
and size features than those of the previous senses “push1” and “push2”. In
other words, “push12” basically codes for dir = away, force not high.
How can this transfer of information be accomplished, since there are no
connections from the partially active senses to the newly recruited sense?
The trick is to use indirect activation via the feature value units. The partially active senses, due to their partial activation, will deliver some activation
to the value units—in proportion to their outgoing weights. Each value unit
adds any such input from the various senses which connect to it. Consequently, each feature subnetwork will exhibit a distributed activation pattern
reflecting an average of the distributions in the two partially active senses
(plus extra activation for the value associated with the current action). This
distribution will then be effectively copied into the weights in the newly
recruited triangle units, using the usual weight update rule for those units.
A final detail for case 3: to properly implement merging, the two original
senses must be removed from the network and returned to the pool of unrecruited units. If they were not removed, the network would quickly accumulate an implausible number of word senses. After all, part of the purpose
of merging is to produce a compact model of each verb’s semantics. But there
is another reason to remove the original senses. The new sense will typically
be more general than its predecessors. If the original senses were kept, they
would tend to “block” the new sense by virtue of their greater specificity
(i.e. more peaked distributions). The new sense would rarely get a chance
to become active, and its weights would weaken until it slipped back into
unrecruited status. So to force the model to use the new generalization, the
original senses must be removed. Fortunately, the cue for removal is available locally to these senses’ triangle units: the protracted period of partial
activation, so useful for synthesizing the new sense, can serve double duty
as a signal to these triangle units to greatly weaken their own weights, thus
returning them to the unrecruited pool.
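The case-1 and case-3 updates above can be sketched as two operations on value distributions. This is a simplification under a dict encoding of our own (case 2 would simply copy the training example into a freshly recruited sense), not the network's actual weight dynamics.

```python
# Case 1: Hebbian tweak of the winning sense. Case 3: recruit a new sense
# whose per-feature distributions average the two partially active senses,
# with extra weight on the currently observed values.

def hebbian_tweak(sense, observed, lr=0.1):
    """Case 1: strengthen connections to active value units, renormalize."""
    for feature, value in observed.items():
        vals = sense.setdefault(feature, {})
        vals[value] = vals.get(value, 0.0) + lr
        total = sum(vals.values())
        for v in vals:
            vals[v] /= total

def merge_senses(sense_a, sense_b, observed, boost=0.5):
    """Case 3: average the two senses' distributions plus the observation."""
    merged = {}
    for feature in set(sense_a) | set(sense_b):
        vals = {}
        for src in (sense_a.get(feature, {}), sense_b.get(feature, {})):
            for v, w in src.items():
                vals[v] = vals.get(v, 0.0) + 0.5 * w
        if feature in observed:
            v = observed[feature]
            vals[v] = vals.get(v, 0.0) + boost
        total = sum(vals.values())
        merged[feature] = {v: w / total for v, w in vals.items()}
    return merged

push1 = {"force": {"low": 1.0}, "dir": {"away": 1.0}}
push2 = {"force": {"high": 1.0}, "dir": {"away": 1.0}}
push12 = merge_senses(push1, push2, {"force": "med", "dir": "away"})
# push12 keeps dir=away sharply peaked while broadening the force
# distribution relative to either predecessor, as in the Fig. 6 example
```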
The foregoing description is only a sketch, and activation functions have not
been fully worked out. It is possible, for example, that the threshold distinguishing case 2 from case 3 could prove too delicate to set reliably for different
languages. These issues are left for future work.
Nonetheless, several consequences of this particular connectionist realization
of a model-merging-like algorithm are apparent. First, the strategy requires presentation of an intermediate example to trigger merging of two existing senses.
The architecture does not suddenly “notice” that two existing senses are similar
and merge them.
Another consequence of the architecture is that it never performs a series
of merges as a “batch” as happens in model merging. On the other hand, the
architecture does, in principle, allow each merge operation to combine more
than two existing senses at a time. Indeed, technically speaking, the example
illustrated in Figure 6 is a three-way merge of “push1,” “push2” and the current
training example. The question of the relative merits of these two strategies is
another good question to pursue.
4 Conclusion
We have shown that the two seemingly connectionist-unfriendly aspects of model
merging—its constructiveness and its use of a global optimization criterion—can
be overcome by using recruitment learning and a modified winner-take-all mechanism. This, hopefully, elucidates the general point of this chapter. Of the
many ways of constructing hybrid connectionist models, one seems particularly
well suited for cognitive science. For both computational and explanatory purposes, it is convenient to do some (sometimes all) of our modeling at a computational level that is not explicitly connectionist. By requiring a biologically
and computationally plausible reduction of all computational level primitives to
the (structured) connectionist level, we retain the best features of connectionist
models and promote the development of an integrated Cognitive Science. And
it is a lot of fun.
References
1. David R. Bailey. When Push Comes to Shove: A Computational Model of the Role
of Motor Control in the Acquisition of Action Verbs. PhD thesis, Computer Science
Division, EECS Department, University of California at Berkeley, 1997.
2. David R. Bailey, Jerome A. Feldman, Srini Narayanan, and George Lakoff. Modeling embodied lexical development. In Proceedings of the 19th Cognitive Science
Society Conference, pages 19–24, 1997.
3. Antonio R. Damasio and Daniel Tranel. Nouns and verbs are retrieved with differently distributed neural systems. Proceedings of the National Academy of Sciences,
90:4757–4760, 1993.
4. Joachim Diederich. Knowledge-intensive recruitment learning. Technical Report
TR-88-010, International Computer Science Institute, Berkeley, CA, 1988.
5. Jerome A. Feldman. Dynamic connections in neural networks. Biological Cybernetics, 46:27–39, 1982.
6. Jerome A. Feldman and Dana Ballard. Connectionist models and their properties.
Cognitive Science, 6:205–254, 1982.
7. Dean J. Grannes, Lokendra Shastri, Srini Narayanan, and Jerome A. Feldman. A
connectionist encoding of schemas and reactive plans. Poster presented at 19th
Cognitive Science Society Conference, 1997.
8. Donald O. Hebb. The Organization of Behavior. Wiley, New York, NY, 1949.
9. J.E. Hummel and I. Biederman. Dynamic binding in a neural network for shape
recognition. Psychological Review, 99:480–517, 1992.
10. Gary Lynch and Richard Granger. Variations in synaptic plasticity and types
of memory in corticohippocampal networks. Journal of Cognitive Neuroscience,
4(3):189–199, 1992.
11. Lokendra Shastri. Semantic Networks: An evidential formalization and its connectionist realization. Morgan Kaufmann, Los Altos, CA, 1988.
12. Lokendra Shastri. A model of rapid memory formation in the hippocampal system.
In Proceedings of the 19th Cognitive Science Society Conference, pages 680–685,
1997.
13. V. Ajjanagadde and L. Shastri. Rules and variables in neural nets. Neural Computation, 3:121–134, 1991.
14. Andreas Stolcke and Stephen Omohundro. Best-first model merging for hidden
Markov model induction. Technical Report TR-94-003, International Computer
Science Institute, Berkeley, CA, January 1994.
15. Ron Sun. On variable binding in connectionist networks. Connection Science,
4:93–124, 1992.
16. Leslie Valiant. Circuits of the mind. Oxford University Press, New York, 1994.
Types and Quantifiers in shruti —
A Connectionist Model of Rapid Reasoning and
Relational Processing
Lokendra Shastri
International Computer Science Institute,
Berkeley CA 94704, USA,
shastri@icsi.berkeley.edu,
WWW home page: http://www.icsi.berkeley.edu/~shastri
Abstract. In order to understand language, a hearer must draw inferences to establish referential and causal coherence. Hence our ability to
understand language suggests that we are capable of performing a wide
range of inferences rapidly and spontaneously. This poses a challenge
for cognitive science: How can a system of slow neuron-like elements encode a large body of knowledge and perform inferences with such speed?
shruti attempts to answer this question by demonstrating how a neurally plausible network can encode a large body of semantic and episodic
facts, and systematic rule-like knowledge, and yet perform a range of inferences within a few hundred milliseconds. This paper describes a novel
representation of types and instances in shruti that supports the encoding of rules and facts involving types and quantifiers, enables shruti
to distinguish between hypothesized and asserted entities, and facilitates
the dynamic instantiation and unification of entities during inference.
1 Introduction
In order to understand language, a hearer must draw inferences to establish referential and causal coherence, generate expectations, make predictions, and recognize the speaker’s intent. Hence our ability to understand language suggests
that we are capable of performing a wide range of inferences rapidly, spontaneously and without conscious effort — as though they are a reflex response
of our cognitive apparatus. In view of this, such reasoning has been described
as reflexive reasoning [22]. This remarkable human ability poses a challenge for
cognitive science and computational neuroscience: How can a system of slow
neuron-like elements encode a large body of systematic knowledge and perform
a wide range of inferences with such speed?
The neurally plausible (connectionist) model shruti attempts to address the
above challenge. It demonstrates how a network of neuron-like elements could
encode a large body of structured knowledge and perform a variety of inferences
within a few hundred milliseconds [3][22][14][23][20].
shruti suggests that the encoding of relational information (frames, predicates, etc.) is mediated by neural circuits composed of focal-clusters and the
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 28–45, 2000.
c Springer-Verlag Berlin Heidelberg 2000
dynamic representation and communication of relational instances involves the
transient propagation of rhythmic activity across these clusters. A role-entity
binding is represented within this rhythmic activity by the synchronous firing
of appropriate cells. Systematic mappings — and other rule-like knowledge —
are encoded by high-efficacy links that enable the propagation of rhythmic activity across focal-clusters, and a fact in long-term memory is a temporal pattern
matcher circuit.
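Binding-by-synchrony can be caricatured in a few lines (ours, not shruti's circuitry): divide each rhythmic cycle into phase slots and read a role-entity binding off co-occupancy of a slot.

```python
# Toy rendering of binding-by-synchrony: a role is bound to whichever
# entity fires in the same phase slot of the rhythmic cycle.

def bindings_from_phases(role_phase, entity_phase):
    """Pair each role with the entity firing in the same phase slot."""
    by_phase = {phase: entity for entity, phase in entity_phase.items()}
    return {role: by_phase.get(phase) for role, phase in role_phase.items()}

# give(John, Mary, Book-17): three phases per cycle, one per role filler
role_phase = {"giver": 0, "recip": 1, "g-obj": 2}
entity_phase = {"John": 0, "Mary": 1, "Book-17": 2}
print(bindings_from_phases(role_phase, entity_phase))
# {'giver': 'John', 'recip': 'Mary', 'g-obj': 'Book-17'}
```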
The possible role of synchronous activity in dynamic neural representations
has been suggested by other researchers (e.g., [28]), but shruti offers a detailed
computational account of how synchronous activity can be harnessed to solve
problems in the representation and processing of high-level conceptual knowledge. A rich body of neurophysiological evidence has emerged suggesting that
synchronous activity might indeed play an important role in neural computation [26] and several models using synchrony to solve the binding problem during
inference have been developed (e.g., [9]).1
As an illustration of shruti’s inferential ability consider the following narrative: “John fell in the hallway. Tom had cleaned it. He got hurt.” Upon being
presented with the above narrative2 shruti reflexively infers the following:3 Tom
had mopped the floor. The floor was wet. John was walking in the hallway. John
slipped and fell because the floor was wet. John got hurt because he fell.
Notice that shruti draws inferences required to establish referential and
causal coherence. It explains John’s fall by making the plausible inference that
John was walking in the hallway and he slipped because the floor was wet. It
also infers that John got hurt because of the fall. Moreover, it determines that
“it” in the second sentence refers to the hallway, and that “He” in the third
sentence refers to John, and not to Tom.
The representational and inferential machinery developed in shruti can be
applied to other problems involving relational structures, systematic but context-sensitive mappings between such structures, and rapid interactions between persistent and dynamic structures. The shruti model meshes with the “Neural
Theory of Language” project [4] on language acquisition and provides neurally
plausible solutions to several representational and computational requirements
arising in the project. The model also offers a plausible framework for realizing
the “Interpretation as Abduction” approach to language understanding described in [8]. Moreover, shruti’s representational machinery has been extended to
realize control and coordination mechanisms required for modeling actions and
reactive plans [24].
This paper describes a novel representation of types and instances in shruti.
This representation supports the encoding of rules and facts involving types and
1. For other solutions to the binding problem within a structured connectionist framework see [11][5][27].
2. Each sentence in the narrative is conveyed to shruti as a set of dynamic bindings (see Section 4). The sentences are presented in the order of their occurrence in the narrative. After each sentence is presented, the network is allowed to propagate activity for a fixed number of cycles.
3. A detailed discussion of this example appears in [25].
quantifiers, and at the same time allows shruti to distinguish between hypothesized entities and asserted entities. This in turn facilitates the dynamic instantiation and unification of entities and relational instances during inference. For
a detailed description of various aspects of shruti’s representational machinery
refer to [22][20][25].
The rest of the chapter is organized as follows: Section 2 provides an overview of how relational knowledge is encoded in shruti. Section 3 discusses the
representation of types and instances. Section 4 describes the representation of
dynamic bindings. Section 5 explains how phase separation between incompatible entities is enforced in the type hierarchy via inhibitory mechanisms, and
how phases are merged to unify entities. Section 6 describes the associative potentiation of links in the type hierarchy. Next, Section 7 reviews the encoding
of facts, and Section 8 outlines the encoding of rules (or mappings) between
relational structures. A simple illustrative example is presented in Section 9.
Fig. 1. An overview of shruti’s representational machinery.
2 An Overview of shruti’s Representational Machinery
All long-term (persistent) knowledge is encoded in shruti via structured networks of nodes and links. Such long-term knowledge includes generic relations,
instances, types, general rules, and specific facts. In contrast, dynamic aspects of
knowledge are represented via the activity of nodes, the propagation of activity
along excitatory and inhibitory links, and the integration of incident activity at
nodes. Such dynamic knowledge includes active (dynamic) facts and bindings,
propagation of bindings, fusion of evidence, competition among incompatible
entities, and the development of coherence.
Figure 1 provides an overview of some of the key elements of shruti’s representational machinery. The network fragment shown in the figure depicts a
partial encoding of the following rules, facts, instances, and types:
1. ∀(x:Agent y:Agent z:Thing) give(x,y,z) ⇒ own(y,z) [800,800];
2. ∀(x:Agent y:Thing) buy(x,y) ⇒ own(x,y) [900,980];
3. EF: give(John, Mary, Book-17) [1000];
4. TF: ∀(x:Human y:Book) buy(x,y) [50];
5. is-a(John, Human);
6. is-a(Mary, Human);
7. is-a(Human, Agent);
8. is-a(Book-17, Book).
Item (1) is a rule which captures a systematic relationship between giving and
owning. It states that when an entity x of type Agent, gives an entity z of type
Thing, to an entity y of type Agent, then the latter comes to own it. Similarly,
item (2) is a rule which states that whenever any entity of the type Agent buys
something, it comes to own it. The pair of weights [a,b] associated with a rule
have the following interpretation: a indicates the degree of evidential support for
the antecedent being the probable cause (or explanation) of the consequent, and
b indicates the degree of evidential support for the consequent being a probable
effect of the antecedent.4 Item (3) corresponds to a long-term “episodic” fact
(or E-fact) which states that John gave Mary a specific book (Book-17). Item
(4) is a long-term “taxon” fact (or T-fact) which states that the prior evidential
support for a given (random) human buying a given (random) book is 50. Item
(5) states that John is a human. Similarly, items (6–8).
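Items (5)–(8) form a small type hierarchy, and the transitive lookup that licenses an inference such as is-a(Mary, Agent) can be sketched as follows. The plain-dict encoding is ours for illustration; shruti instead propagates activity through focal-clusters rather than performing symbolic search.

```python
# The is-a facts as a parent map; is_a follows the chain upward.

IS_A = {
    "John": "Human",
    "Mary": "Human",
    "Human": "Agent",
    "Book-17": "Book",
}

def is_a(entity, target):
    """Follow parent links upward until target is found or the chain ends."""
    while entity is not None:
        if entity == target:
            return True
        entity = IS_A.get(entity)
    return False

print(is_a("Mary", "Agent"))    # True: Mary -> Human -> Agent
print(is_a("Book-17", "Agent")) # False
```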
Given the above knowledge, shruti can rapidly draw inferences of the following sort within a few hundred milliseconds5 (numbers in [] indicate strength
of inference):
4. Weights in shruti lie in the interval [0,1000]. The mapping of probabilities and evidential supports to weights in shruti is non-linear and loosely defined. The initial weights can be set approximately, and subsequently fine-tuned to model a given domain via learning.
5. The time required for drawing an inference is estimated by c ∗ π, where c is the number of cycles of rhythmic activity it takes shruti to draw an inference (see Section 9), and π is the period of rhythmicity. A plausible value of π is 25 milliseconds [22].
1. own(Mary, Book-17) [784];
Mary owns a particular book (referred to as Book-17).
2. ∃x:Book own(Mary,x) [784];
Mary owns a book.
3. ∃(x:Agent y:Thing) own(x,y) [784];
Some agent owns something.
4. buy(Mary,Book-1) [41];
Mary bought a particular book (referred to as Book-1).
5. is-a(Mary, Agent);
Mary is an agent.
Figure 2 depicts a schematized response of the shruti network shown in
Figure 1 to the query “Does Mary own a book?” (∃ x:Book own(Mary, x)?). We
will revisit this activation trace in Section 9 after we have reviewed shruti’s
representational machinery, and discussed the encoding of instances and types
in more detail. For now it suffices to observe that the query is conveyed to
the network by activating appropriate “?” nodes (?:own, ?:Mary and ?e:Book)
and appropriate role nodes (owner and o-obj). This leads to a propagation of
activity in the network which eventually causes the activation of the nodes +:own
and +:Book-17. This signals an affirmative answer (Yes, Mary owns Book-17).
Note that bindings between roles and entities are expressed by the synchronous
activation of bound role and entity nodes.
2.1 Different Node Types and Their Computational Behavior
Nodes in shruti are computational abstractions and correspond to small ensembles of cells. Moreover, a connection from a node A to a node B corresponds
to several connections from cells in the A ensemble to cells in the B ensemble.
shruti makes use of four node types: m-ρ nodes, τ-and nodes, τ-or nodes of type 1, and τ-or nodes of type 2. This classification is based on the computational properties of nodes, and not on their functional or representational role. In
particular, nodes serving different representational functions can be of the same
computational type. The computational behavior of m-ρ-nodes and τ -and nodes
is described below:
m-ρ nodes: An m-ρ node with threshold n becomes active and fires upon receiving n synchronous inputs. Here synchrony is defined relative to a window
of temporal integration ω: all inputs arriving at a node with a lead/lag of no more than ω are deemed to be synchronous. Hence an m-ρ node A receiving
above-threshold periodic inputs from m-ρ nodes B and C (where B and C may
be firing in different phases) will respond by firing in phase with both B and C.
A similar node type has been described in [15].
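This windowed-synchrony criterion can be sketched in Python. The sketch treats a node as a function over spike arrival times, whereas in shruti a node is an ensemble of cells; the function names and the values of ω are illustrative only.

```python
def synchronous_inputs(spike_times, omega):
    """Size of the largest group of input spikes lying within a window
    of width omega; two inputs are 'synchronous' if their lead/lag is
    at most omega (the window of temporal integration)."""
    times = sorted(spike_times)
    best = 0
    for i, t in enumerate(times):
        # count all spikes arriving in the window [t, t + omega]
        group = [u for u in times[i:] if u - t <= omega]
        best = max(best, len(group))
    return best

def m_rho_fires(spike_times, threshold_n, omega):
    """An m-rho node with threshold n fires upon receiving n synchronous
    inputs."""
    return synchronous_inputs(spike_times, omega) >= threshold_n
```

With ω = 3 ms, inputs arriving at 10, 11, and 12 ms count as three synchronous inputs and fire a node with threshold 3, while inputs at 10, 20, and 30 ms do not.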
A scalar level (strength) of activity is associated with the response of an m-ρ
node.6 This level of activity is computed by the activation combination function (ECF) associated with the node. Some ECFs used in the past are sum, max, and sigmoid. Other combination functions are under investigation [25].

Footnote 6: The response-level of an m-ρ node in a phase can be governed by the number of cells in the node’s cluster firing in that phase.

[Figure 2 appears here: a schematized activation trace, over time, of the nodes +:own, ?:own, owner, o-obj, +:med1, ?:med1, r1, r2, r3, ?:med2, s1, s2, +:give, ?:give, giver, recip, g-obj, F1, and of the entities +:John, ?:John, +:Mary, ?:Mary, +:Book-17, ?:Book-17, and ?e:book, firing in phases ρ1, ρ2, and ρ3.]
Fig. 2. A schematized activation trace of selected nodes for the query own(Mary, Book-17)?.
τ-and nodes: A τ-and node becomes active on receiving an uninterrupted and above-threshold input over an interval ≥ πmax, where πmax is a system parameter. Computationally, this sort of input can be idealized as a pulse whose amplitude exceeds the threshold, and whose duration is greater than or equal to πmax. Physiologically, such an input may be identified with a high-frequency burst of spikes. Thus a τ-and node behaves like a temporal and node and becomes active upon receiving adequate and uninterrupted inputs over an interval ≥ πmax. Upon becoming active, such a node produces an output pulse of width ≥ πmax. The level of output activation is determined by the ECF associated with the node for combining the weighted inputs arriving at the node.
The model also makes use of inhibitory modifiers that can block the flow of
activation along a link. This blocking is phasic and lasts only for a duration ω.
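The temporal-integration behavior of a τ-and node can be sketched over a sampled input trace; the sampling step and the names are illustrative assumptions of the sketch, not part of the model.

```python
def tau_and_fires(trace, threshold, pi_max, dt=1.0):
    """A tau-and node fires if some uninterrupted run of above-threshold
    samples spans an interval >= pi_max.

    trace: input amplitude sampled every dt ms.
    """
    run = 0.0
    for amplitude in trace:
        if amplitude > threshold:
            run += dt
            if run >= pi_max:       # sustained long enough: node fires
                return True
        else:
            run = 0.0               # any interruption resets the count
    return False
```

A 30 ms above-threshold pulse activates a node with πmax = 25 ms, whereas two 20 ms pulses separated by a gap do not, however strong they are.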
2.2 Encoding of Relational Structures
Each relation (in general, a frame or a predicate) is represented by a focal-cluster which serves as an anchor for the complete encoding of the relation. Such focal-clusters
are depicted as dotted ellipses in Figure 1. The focal-cluster for the relation
give is depicted toward the top and the left of Figure 1. For the purpose of this
example, it is assumed that give has only three roles: giver, recipient, and give-object. Each of these roles is encoded by a separate node labeled giver, recip, and
g-obj, respectively. The focal-cluster of give also includes an enabler node labeled
? and two collector nodes labeled + and –. The positive and negative collectors
are mutually inhibitory (inhibitory links are depicted by filled blobs). In general,
the focal-cluster for an n-place relation contains n role nodes, one enabler node,
one positive collector node and one negative collector node. We will refer to the
enabler, the positive collector, and the negative collector of a relation P as ?:P,
+:P, and –:P, respectively. The collector and enabler nodes of relations behave
like τ -and nodes. Role nodes and the collector and enabler nodes of instances
behave like m-ρ nodes.
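The layout of a relational focal-cluster described above can be sketched as a small data structure. The labels follow the text's ?:P, +:P, –:P convention (with an ASCII minus in code); the function name itself is illustrative.

```python
def relation_focal_cluster(pred, roles):
    """Node labels in the focal-cluster of an n-place relation `pred`:
    n role nodes, one enabler ?:pred, and two collectors +:pred and -:pred.
    The two collectors are mutually inhibitory."""
    return {
        "roles": list(roles),                      # n role nodes
        "enabler": f"?:{pred}",                    # seeks an explanation
        "pos": f"+:{pred}",                        # affirms the instance
        "neg": f"-:{pred}",                        # affirms its negation
        "inhibitory_pairs": [(f"+:{pred}", f"-:{pred}")],
    }

# The three-place relation give from Figure 1:
give = relation_focal_cluster("give", ["giver", "recip", "g-obj"])
```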
Semantic Import of Enabler and Collector Nodes. Assume that the roles
of a relation P have been dynamically bound to some fillers and thereby represent
an active instance of P (we will see how this is done, shortly). The activation
of the enabler ?:P means that the system is seeking an explanation for the
active instance of P. In contrast, the activation of the collector +:P means that
the system is affirming the active instance of P. Similarly, the activation of
the collector -:P means that the system is affirming the negation of the active
instance of P. The activation levels of ?:P, +:P, and -:P signify the strength with which information about P is being sought, believed, or disbelieved, respectively.
For example, if the roles giver, recipient and object are dynamically bound
to John, Mary, and a book, respectively, then the activation of ?:give means
that the system is asking whether “John gave Mary a book” matches a fact
in memory, or whether it can be inferred from what is known. In contrast, the
activation of +:P with the same role bindings means that the system is asserting
“John gave Mary a book”.
Degrees of Belief: Support, No Information, and Contradiction. The
levels of activation of the positive and negative collectors of a relation measure
the effective degree of support offered by the system to the currently active relational instance. Thus the activation levels of the collectors +:P and -:P encode
a graded belief ranging continuously from no on the one extreme (only -:P is
active), to yes on the other (only +:P is active), and don’t know in between
(neither collector is very active). If both the collectors receive comparable and
strong activation then a contradiction is indicated.
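The graded reading of the two collectors can be sketched as a decision function. The numeric thresholds here are illustrative assumptions; the text specifies only the qualitative extremes and the continuum in between.

```python
def interpret_collectors(pos, neg, active=0.6, weak=0.2):
    """Map activation levels of +:P and -:P (each in [0, 1]) to an answer.

    Thresholds `active` and `weak` are illustrative, not from the model.
    """
    if pos >= active and neg >= active:
        return "contradiction"       # both collectors strongly active
    if pos >= active and neg < weak:
        return "yes"                 # only +:P active
    if neg >= active and pos < weak:
        return "no"                  # only -:P active
    return "don't know"              # neither collector is very active
```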
Significance of Collector to Enabler Connections. Links from the collector
nodes to the enabler node of a relation convert a dynamic assertion of a relational
instance into a query about the assertion. Thus the system continually seeks an
explanation for active assertions. The weight on the link from +:P (or -:P) to ?:P
is a sum of two terms. The first term is proportional to the system’s propensity
for seeking explanations — the more skeptical the system, the higher the weight.
The second term is inversely proportional to the probability of occurrence of a
positive (or negative) instance of P — the more unlikely a fact, the more intense
the search for an explanation.
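The two-term weight just described can be written as a one-line function. The constants and the linear form are illustrative assumptions; the text specifies only the two proportionalities.

```python
def collector_to_enabler_weight(skepticism, prior_prob, k1=1.0, k2=0.1):
    """Weight on the link from +:P (or -:P) to ?:P: one term proportional
    to the system's propensity for seeking explanations, plus one term
    inversely proportional to the prior probability of a positive (or
    negative) instance of P. k1 and k2 are illustrative constants."""
    return k1 * skepticism + k2 / prior_prob
```

Rarer facts and more skeptical systems both yield heavier collector-to-enabler links, hence a more intense search for an explanation.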
The links from the collectors of a relation to its enabler also create positive
feedback loops of activation and thereby create stable coalitions of active cells
under appropriate circumstances. If the system seeks an explanation for an instance of P and finds support for this instance, then a stable coalition of activity
arises consisting of ?:P, other ensembles participating in the explanation, +:P
and finally ?:P. Such activity leads to priming (see Section 6), and the formation
of episodic memories (see [19,21]).
3 Encoding Instances and Types
The encoding of types and instances is illustrated in Figure 3. The focal-cluster
of each entity consists of a ? and a + node. In contrast, the focal-cluster of each
type consists of a pair of ? nodes (?e and ?v) and a pair of + nodes (+e and
+v). While the nodes +v and ?v participate in the expression of knowledge (facts
and attributes) involving the whole type, the nodes +e and ?e participate in the
encoding of knowledge involving particular instances of the type. Thus nodes
v and e signify universal and existential quantification, respectively. All nodes
participating in the representation of types are m-ρ nodes.
3.1 Interconnections within Focal-Clusters of Instances and Types
The interconnections shown in Figure 3 among nodes within the focal-cluster
of an instance and among nodes within the focal-cluster of a type lead to the
following functionality (I refers to an instance, T1 refers to a type):
36
L. Shastri
– Because of the link from +:I to ?:I, any assertion about an instance leads to a query or a search for a possible explanation of the assertion.
– Because of the link from +v:T1 to +e:T1, any assertion about the type leads to the same assertion being made about an unspecified member of the type (e.g., “Humans are mortal” leads to “there exists a mortal human”).7
– Because of the link from +v:T1 to ?v:T1, any assertion about the whole type leads to a query or search for a possible explanation of the assertion (e.g., the assertion “Humans are mortal” leads to the query “Are humans mortal?”).
– Because of the link from +e:T1 to ?e:T1, any assertion about an instance of the type leads to a query or search for a possible instance that would verify the assertion (e.g., the assertion “There is a human who is mortal” leads to the query “Is there a human who is mortal?”).
– Because of the link from ?e:T1 to ?v:T1, any query or search for an explanation about a member of the type leads to a query about the whole type (one way of determining whether “A human is mortal” is to find out whether “Humans are mortal”).
– Moreover, paths formed by the above links lead to other behaviors. For example, given the path from +v:T1 to ?e:T1, any assertion about the whole type leads to a query or search for an explanation of the assertion applied to a given subtype/member of the type (e.g., “Humans are mortal” leads to the query “Is there a human who is mortal?”).
Note that the closure between the “?” and “+” nodes is provided by the
matching of facts (see Section 7).
3.2 The Interconnections Between Focal-Clusters of Instances and Types
The interconnections between nodes in the focal-clusters of instances and types
lead to the following functionality:
– Because of the link from +v:T1 to +:I, any assertion about the type T1 leads to the same assertion about the instance I (“Humans are mortal” leads to “John is mortal”).
– Because of the link from +:I to +e:T1, any assertion about I leads to the same assertion about a member of T1 (“John is mortal” leads to “A human is mortal”).
– Because of the link from ?:I to ?v:T1, a query about I leads to a query about T1 as a whole (one way of determining whether “John is mortal” is to determine whether “Humans are mortal”).
– Because of the link from ?e:T1 to ?:I, a query about a member of T1 leads to a query about I (one way of determining whether “A human is mortal” is to determine whether “John is mortal”).
Footnote 7: shruti infers the existence of a mortal human given that all humans are mortal, though this is not entailed in classical logic.
Similarly, interconnections between sub- and supertypes lead to the following
functionality.
– Because of the link from +v:T2 to +v:T1, any assertion about the supertype T2 leads to the same assertion about the subtype T1 (“Agents can cause change” leads to “Humans can cause change”).
– Because of the link from +e:T1 to +e:T2, any assertion about a member of T1 leads to the same assertion about a member of T2 (“Humans are mortal” leads to “mortal agents exist”).
– Because of the link from ?v:T1 to ?v:T2, a query about T1 as a whole leads to a query about T2 as a whole (one way of determining whether “Humans are mortal” is to determine whether “Agents are mortal”).
– Because of the link from ?e:T2 to ?e:T1, a query about a member of T2 leads to a query about a member of T1 (one way of determining whether “an Agent is mortal” is to determine whether “a Human is mortal”).
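The intra-cluster, instance-type, and sub/supertype links enumerated above can be collected into a small directed graph, and the resulting spread of activation checked by reachability. This ignores phases, weights, and the fact-matching closure, so it is only a connectivity sketch; the node labels follow the text.

```python
from collections import deque

# Directed links among the focal-clusters of an instance I (John), its type
# T1 (Human), and a supertype T2 (Agent), as enumerated in Sections 3.1-3.2.
LINKS = [
    ("+:I", "?:I"),                                    # within the instance
    ("+v:T1", "+e:T1"), ("+v:T1", "?v:T1"),            # within the type
    ("+e:T1", "?e:T1"), ("?e:T1", "?v:T1"),
    ("+v:T1", "+:I"), ("+:I", "+e:T1"),                # instance <-> type
    ("?:I", "?v:T1"), ("?e:T1", "?:I"),
    ("+v:T2", "+v:T1"), ("+e:T1", "+e:T2"),            # sub <-> supertype
    ("?v:T1", "?v:T2"), ("?e:T2", "?e:T1"),
]

def reachable(start):
    """All nodes that receive activation, transitively, once `start` fires."""
    graph = {}
    for src, dst in LINKS:
        graph.setdefault(src, []).append(dst)
    seen, frontier = {start}, deque([start])
    while frontier:
        for nxt in graph.get(frontier.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen
```

For example, an assertion about the supertype (+v:T2) reaches the instance's collector +:I, and a query about a member of T2 (?e:T2) reaches the query node ?:I of the instance.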
[Figure 3 appears here: the focal-cluster of the instance I (John), with nodes + and ?; the focal-clusters of the type T1 (Human) and of its supertype T2 (Agent), each with nodes +e, +v, ?v, and ?e; each cluster has links to and from T-facts and E-facts.]
Fig. 3. The encoding of types and (specific) instances. See text for details.
[Figure 4 appears here: spike trains of the nodes +:give, giver, recip, g-obj, +:John, +:Mary, and +e:Book, with each role firing in phase with its filler.]
Fig. 4. The rhythmic activity representing the dynamic bindings give(John, Mary,
a-Book). Bindings are expressed by the synchronous activity of bound role and entity
nodes.
4 Encoding of Dynamic Bindings
The dynamic encoding of a relational instance corresponds to a rhythmic pattern of activity wherein bindings between roles and entities are represented by
the synchronous firing of appropriate role and entity nodes. With reference to
Figure 1, the rhythmic pattern of activity shown in Figure 4 is the dynamic
representation of the relational instance (give: ⟨giver=John⟩, ⟨recipient=Mary⟩, ⟨give-object=a-Book⟩) (i.e., “John gave Mary a book”). Observe that the collector ensembles +:John, +:Mary and +e:Book are firing in distinct phases, but
in phase with the roles giver, recip, and g-obj, respectively. Since +:give is also
firing, the system is making an assertion. The dynamic representation of the
query “Did John give Mary a book?” would be similar except that the enabler
node would be active and not the collector node.
The rhythmic activity underlying the dynamic representation of relational
instances is expected to be highly variable, but it is assumed that over short
durations — ranging from a few hundred milliseconds to about a second —
such activity may be viewed as being composed of k interleaved quasi-periodic
activities where k equals the number of distinct entities filling roles in active
relational instances. The period of this transient activity is at least k ∗ ωint
where ωint is the window of synchrony, i.e., the amount by which two spikes can
lead/lag and still be treated as being synchronous. As speculated in [22], the
activity of role and entity cells engaged in dynamic bindings might correspond
to γ band activity (∼ 40 Hz).
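The phase-coded binding scheme can be sketched by tagging each node with the index of the phase in which it fires; a role is then bound to whichever entity fires synchronously with it. The '+' prefix test for entity nodes is an illustrative convention of this sketch.

```python
def bindings_from_phases(firing_phase):
    """Recover role-entity bindings from a phase assignment.

    firing_phase maps a node label to the phase (0..k-1) in which it
    fires; a role is bound to the entities firing in the same phase.
    """
    entities = {n: p for n, p in firing_phase.items() if n.startswith("+")}
    roles = {n: p for n, p in firing_phase.items() if not n.startswith("+")}
    return {role: [e for e, q in entities.items() if q == p]
            for role, p in roles.items()}

# give(John, Mary, a-Book): each filler occupies a distinct phase, and
# each role node fires in the phase of its filler (cf. Figure 4).
phases = {"+:John": 0, "+:Mary": 1, "+e:Book": 2,
          "giver": 0, "recip": 1, "g-obj": 2}
```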
5 Mutual Exclusion and Collapsing of Phases
Instances in the type hierarchy can be part of a phase-level mutual exclusion
cluster (ρ-mex cluster). The + node of every entity in a ρ-mex cluster sends
inhibitory links to, and receives inhibitory links from, the + node of all other
entities in the cluster. As a result of this mutual inhibition, only the most active
entity within a ρ-mex cluster can remain active in any given phase. A similar
ρ-mex cluster can be formed by +e: nodes of mutually exclusive types as well as
+v: nodes of mutually exclusive types.
Another form of inhibitory interaction between siblings in the type hierarchy
leads to an “explaining away” phenomenon in shruti. Let us illustrate this
inhibitory interaction with reference to the type hierarchy shown in Figure 1.
The link from +:John to +e:Human sends an inhibitory modifier to the link
from ?e:Human to ?:Mary. Similarly, the link from +:Mary to +e:Human sends
an inhibitory modifier to the link from ?e:Human to ?:John (such modifiers are
not shown in the figure). Analogous connections exist between all siblings in the
type hierarchy. As a result of such inhibitory modifiers, if ?e:Human propagates
activity to ?:John and ?:Mary in phase ρ1, then the strong activation of +:John
in phase ρ1 attenuates the activity arriving from ?e:Human into ?:Mary. In
essence, the success of the query “Is it John?” in the context of the query “Is
it human?” makes the query “Is it Mary?” unimportant. This use of inhibitory
connections for explaining away is motivated by [2].
As discussed in Section 8, shruti supports the introduction of “new” phases
during inference. In addition, shruti also allows multiple phases to coalesce
into a single phase during inference. In the current implementation, such phase
unification can occur under two circumstances. First, phase collapsing can occur
whenever a single entity dominates multiple phases (for example, if the same
entity comes to be the answer of multiple queries). Second, phase collapsing can
occur if two unifiable instantiations of a relation arise within a focal-cluster.
For example, an assertion own(Mary, Book-17) alongside the query ∃ x:Book own(Mary,x)? (“Does Mary own a book?”) will result in a merging of the two phases for “a book” and “Book-17”. Note that the type hierarchy will map the
query ∃ x:Book own(Mary,x)? into own(Mary,Book-17)?, and hence, lead to a
direct match between own(Mary,Book-17) and own(Mary,Book-17)?.
6 Priming: Associative Short-Term Potentiation of Weights
Let I be an instance of T1. If ?:I receives activity from ?e:T1 and concurrent activity from +:I, then the weight of the link from ?e:T1 to ?:I increases (i.e., gets potentiated) for a short duration.8 Let T2 be a supertype of T1. If ?e:T1 receives activity from ?e:T2, and concurrent activity from +e:T1, then the weight of the link from ?e:T2 to ?e:T1 also increases for a short duration. Analogous
weight increases can occur along the link from ?v:T1 to ?v:T2 if ?v:T2 receives
concurrent activity from +v:T2 and ?v:T1. Similarly, the weight of the link
from ?:I to ?v:T1 can undergo a short-term increase if ?v:T1 receives concurrent
activity from +v:T1 and ?:I.9
Footnote 8: This is modeled after the biological phenomenon of short-term potentiation (STP) [6].
Footnote 9: In principle, short-term weight increases can occur along the link from +e:T1 to +e:T2 if +e:T2 receives concurrent activity from +v:T2 and +e:T1. Similarly, the weight of the link from +:I to +e:T1 can undergo a short-term increase if +e:T1 receives concurrent activity from +v:T1 and +:I.
The potentiation of link weights can affect the system’s response time as
well as the response itself. Let us refer to an entity whose incoming links are
potentiated as a “primed” entity. Since a primed entity would become active
sooner than an unprimed entity, a query whose answer is a primed entity would
be answered faster (all else being equal). Furthermore, all else being equal, a
primed entity would dominate an unprimed entity in a ρ-mex cluster, and hence,
if a primed and an unprimed entity compete to be the filler of a role, the primed
entity would emerge as the role-filler.
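The effect of short-term potentiation on response time can be sketched as follows. The decay rule and the inverse relation between incoming weight and response time are illustrative assumptions; the text claims only that a primed entity becomes active sooner.

```python
class Link:
    """A link whose weight carries a transient, decaying potentiation term."""

    def __init__(self, weight):
        self.base = weight
        self.stp = 0.0                  # short-term potentiation

    def prime(self, amount=0.3):
        """Concurrent pre- and postsynaptic activity potentiates the link."""
        self.stp += amount

    def decay(self, rate=0.5):
        """The potentiation fades over time; rate=1.0 removes it entirely."""
        self.stp *= (1.0 - rate)

    @property
    def weight(self):
        return self.base + self.stp

def response_time(link, k=10.0):
    """Illustrative: a more strongly driven node becomes active sooner."""
    return k / link.weight
```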
7 Facts in Long-Term Memory: E-Facts and T-Facts
Currently shruti encodes two types of relational instances (i.e., facts) in its
long-term memory (LTM): episodic facts (E-facts) and taxon facts (T-facts).
While an E-fact corresponds to a specific instance of a relation, a T-fact corresponds to a distillation or statistical summary of various instances of a relation
(e.g., “Days tend to be hot in June”). An E-fact E1 associated with a relation
P becomes active whenever all the dynamic bindings specified in the currently
active instantiation of P match those encoded in E1 . Thus an E-fact is sensitive to any mismatch between the bindings it encodes and the currently active
dynamic bindings. In contrast, a T-fact is sensitive only to matches between its
bindings and the currently active dynamic bindings. Note that both E- and T-facts tolerate missing bindings, and hence, respond to partial cues. The encoding
of E-facts is described below – the encoding of T-facts is described in [20].
Figure 5 illustrates the encoding of E-facts love(John, Mary) and ¬love(Tom,
Susan). Each E-fact is encoded using a distinct fact node (these are labeled F1
and F2 in Figure 5). A fact node sends a link to the + or – collector of the relation
depending on whether the fact encodes a positive or a negative assertion.
Given the query love(John,Mary)? the E-fact node F1 will become active
and activate +:love, +:John and +:Mary nodes indicating a “yes” answer to the
question. Similarly, given the query love(Tom,Susan)?, the E-fact node F2 will
become active and activate –:love, +:Tom and +:Susan nodes indicating a “no”
answer to the query. Finally, given the query love(John,Susan)?, neither +:love
nor –:love would become active, indicating that the system can neither affirm
nor deny whether John loves Susan (the nodes +:John and +:Susan will also
not receive any activation).
Types can also serve as role-fillers in E-facts (e.g., Dog in “Dogs chase cats”)
and so can unspecified instances of a type (e.g., a dog in “a dog bit John”).
Such E-facts are encoded by using the appropriate nodes in the focal-cluster for
“Dog”. In general, if an existing instance, I, is a role-filler in a fact, then ?:I
provides the input to the fact cluster and +:I receives inputs from the binder
node in the fact cluster. If the whole type T is a role-filler in a fact, then ?v:T
provides the input to the fact cluster and +v:T receives inputs from the binder
node in the fact cluster. If an unspecified instance of type T is a role-filler in a
long-term fact, then a new instance of type T is created and its “?” and “+”
nodes are used to encode the fact.
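The match semantics of E-facts, blocked by any mismatch but tolerant of missing bindings, can be sketched as dictionary matching; fact nodes, binder nodes, and phase timing are abstracted away in this sketch.

```python
def e_fact_matches(fact_bindings, dynamic_bindings):
    """An E-fact is blocked by any mismatch between a currently active role
    binding and the binding it encodes, but a missing binding (a partial
    cue) does not block it."""
    return all(dynamic_bindings.get(role) in (None, filler)
               for role, filler in fact_bindings.items())

def query(facts, dynamic_bindings):
    """Answer a query from a list of E-facts, each a (sign, bindings) pair."""
    for sign, bindings in facts:
        if e_fact_matches(bindings, dynamic_bindings):
            return "yes" if sign == "+" else "no"
    return "don't know"

# The two E-facts of Figure 5:
FACTS = [("+", {"lover": "John", "lovee": "Mary"}),
         ("-", {"lover": "Tom", "lovee": "Susan"})]
```

As in the text, love(John,Mary)? yields yes, love(Tom,Susan)? yields no, and love(John,Susan)? matches neither fact, so the system can neither affirm nor deny it.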
[Figure 5 appears here: panel (a) shows the love focal-cluster (roles lover and lovee, enabler ?, collectors + and –) with pentagon-shaped fact nodes F1 and F2, weights α1 and α2, and inputs from ?:John, ?:Mary, ?:Tom, and ?:Susan; panel (b) shows the links from fact node F1 back to the role-fillers +:John and +:Mary.]
Fig. 5. (a) The encoding of E-facts: love(John,Mary) and ¬love(Tom,Susan). The pentagon-shaped nodes are “fact” nodes and are of type τ-and. The dark blobs denote
inhibitory modifiers. The firing of a role node without the synchronous firing of the
associated filler node blocks the activation of the fact node. Consequently, the E-fact
is blocked whenever there is a mismatch between the dynamic binding of a role and
its binding specified in the E-fact. (b) Links from the fact node back to role-fillers are
shown only for the fact love(John,Mary) to avoid clutter. The circular nodes are m-ρ
nodes with a high threshold which is satisfied only when both the role node and the
fact node are firing. Consequently, a binder node fires in phase with the associated role
node, if the fact node is firing. Weights α1 and α2 indicate strengths of belief.
8 Encoding of Rules
A rule is encoded via a mediator focal-cluster that mediates the flow of activity
and bindings between antecedent and consequent clusters (mediators are depicted as parallelograms in Figure 1). A mediator consists of a single collector (+),
an enabler (?), and as many role-instantiation nodes as there are distinct variables in the rule. A mediator establishes links between nodes in the antecedent
and consequent clusters as follows: (i) The roles of the consequent and antecedent
relation(s) are linked via appropriate role-instantiation nodes in the mediator.
This linking reflects the correspondence between antecedent and consequent roles specified in the rule. (ii) The enabler of the consequent is connected to the
enabler of the antecedent via the enabler of the mediator. (iii) The appropriate
(+/–) collector of the antecedent relation is linked to the appropriate (+/–)
collector of the consequent relation via the collector of the mediator. A collector
to collector link originates at the + (–) collector of an antecedent relation if the
relation appears in its positive (negated) form in the antecedent. The link terminates at the + (–) collector of the consequent relation if the relation appears
in a positive (negated) form in the consequent.10
Footnote 10: The design of the mediator was motivated, in part, by discussions the author had with Jerry Hobbs.
Consider the encoding of the following rule in Figure 1:
∀ x:agent y:agent z:thing give(x,y,z) ⇒ own(y,z) [800,800]
This rule is encoded via the mediator, med1, containing three role-instantiation
nodes r1, r2, and r3. The weight on the link from ?:med1 to ?:give indicates the
degree of evidential support for give being the probable cause (or explanation)
of own. The weight on the link from +:med1 to +:own indicates the degree of
evidential support for own being a probable effect of give. These strengths are
defined on a non-linear scale ranging from 0 to 1000.
A role-instantiation node is an abstraction of a neural circuit with the following functionality. If a role-instantiation node receives activation from the
mediator enabler and one or more consequent role nodes, it simply propagates
the activity onward to the connected antecedent role nodes. If on the other hand,
the role-instantiation node receives activity only from the mediator enabler, it
sends activity to the ?e node of the type specified in the rule as the type restriction for this role. This causes the ?e node of this type to become active in an
unoccupied phase.11 The ?e node of the type conveys activity in this phase to
the role-instantiation node which in turn propagates this activity to connected
antecedent role nodes. The links between role-instantiation nodes and nodes in the type hierarchy have not been shown in Figure 1.
shruti can encode rules involving multiple antecedents and consequents (see
[20]). Furthermore, shruti allows a bounded number of instantiations of the
same predicate to be simultaneously active during inference [14].
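The role correspondence enforced by a mediator can be sketched for the give ⇒ own rule of Figure 1. Here `backward_bindings` is an illustrative stand-in for the role-instantiation nodes r1, r2, and r3, and the string '?:agent' stands for activating the ?e node of the role's type restriction in a fresh phase.

```python
# Role correspondence of the rule from Figure 1,
#   forall x:agent y:agent z:thing  give(x,y,z) => own(y,z),
# as enforced by the mediator med1 (role-instantiation nodes r1, r2, r3).
RULE = {
    "antecedent": "give",
    "consequent": "own",
    "roles": {              # antecedent role -> corresponding consequent role
        "giver": None,      # no counterpart: filled from the type restriction
        "recip": "owner",
        "g-obj": "o-obj",
    },
}

def backward_bindings(query_bindings, rule):
    """Propagate a query on the consequent back onto the antecedent's roles.
    A role with no consequent counterpart is bound to its type restriction,
    which shruti activates in a free phase ('?:agent' is a placeholder)."""
    out = {}
    for ante_role, cons_role in rule["roles"].items():
        out[ante_role] = ("?:agent" if cons_role is None
                          else query_bindings.get(cons_role))
    return out
```

Posing own(Mary, a-Book)? thus yields the antecedent query give(an agent, Mary, a-Book)?, as in the inference example of Section 9.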
9 An Example of Inference
Figure 2 depicts a schematized response of the shruti network shown in Figure 1
to the query “Does Mary own a book?” (∃ x:Book own(Mary, x)?). This query
is posed by activating the ?:Mary and ?e:book nodes, the role nodes owner and o-obj, and the enabler ?:own, as shown in Figure 2. We will refer to the phases
of activation of ?:Mary and ?e:book as ρ1 and ρ2, respectively. Activation from
the focal-cluster for own reaches the mediator structure of rules (1) and (2).
Consequently, nodes r2 and r3 in the mediator med1 become active in phases ρ1
and ρ2, respectively. Similarly, nodes s1 and s2 in the mediator med2 become
active in phases ρ1 and ρ2, respectively. At the same time, the activation from
?:own activates the enablers ?:med1 and ?:med2 in the two mediators. Since r1
does not receive activation from any of the roles in its consequent’s focal-cluster
(own), it activates the node ?e:Agent in the type hierarchy in a free phase (say
ρ3).
The activation from nodes r1, r2 and r3 reach the roles giver, recip and
g-obj in the give focal-cluster, respectively. Similarly, activation from nodes s1
and s2 reach the roles buyer and b-obj in the buy focal-cluster, respectively. In
essence, the system has created new bindings for give and buy wherein giver is
bound to an undetermined agent, recipient is bound to Mary, g-obj is bound to a book, buyer is bound to Mary, and b-obj is bound to a book. These bindings together with the activation of the enabler nodes ?:give and ?:own encode two new queries: “Did some agent give Mary a book?” and “Did Mary buy a book?”.

Footnote 11: A similar phase-allocation mechanism is used in [1] for realizing function terms. Currently, an unoccupied phase is assigned in software, but eventually this will result from inhibitory interactions between nodes in the type hierarchy.
At the same time, activation travels in the type hierarchy and thereby maps
the query to a large number of related queries such as “Did a human give Mary
a book?”, “Did John give Mary Book-17?”, “Did Mary buy all books?”, etc.
The E-fact give(John, Mary, Book-17) now becomes active as a result of
matching the query give(John, Mary, Book-17)? and causes +:give to become
active. This in turn causes +:med1 to become active and transmit activity to
+:own. This results in an affirmative answer to the query and creates a reverberant loop of activity involving the clusters own, med1, give, the fact node F1,
and the entities John, Mary, and Book-17.
10 Conclusion
The type structure described above, together with other enhancements such as
support for negation, priming, and evidential rules, allow shruti to support a
rich set of inferential behaviors, and perhaps, shed some light on the nature of
symbolic neural representations.
shruti identifies a number of constraints on the representation and processing of relational knowledge and predicts the capacity of the active (working)
memory underlying reflexive reasoning [17,22]. First, on the basis of neurophysiological data pertaining to the occurrence of synchronous activity in the γ
band, shruti leads to the prediction that a large number of facts (relational
instances) can be active simultaneously and a large number of rules can fire in
parallel during an episode of reflexive reasoning. However, the number of distinct
entities participating as role-fillers in these active facts and rules must remain
very small (≈ 7). Recent experimental findings as well as computational models lend support to this prediction (e.g., [12,13]). Second, since the quality of
synchronization degrades as activity propagates along a chain of cell clusters,
shruti predicts that as the depth of inference increases, binding information
is gradually lost and systematic inference reduces to a mere spreading of activation. Thus shruti predicts that reflexive reasoning has a limited inferential
horizon. Third, shruti predicts that only a small number of instances of any
given relation can be active simultaneously.
A number of issues remain open. These include the encoding of rules and
facts involving complex nesting of quantifiers. While the current implementation
supports multiple existential and universal quantifiers, it does not support the
occurrence of existential quantifiers within the scope of a universal quantifier.
Also, the current implementation does not support the encoding of complex
types such as radial categories [10]. Another open issue is the learning of new
relations and rules (mappings). In [18] it is shown that a recurrent network can
learn rules involving variables and semantic restrictions using gradient-descent
learning. While this work serves as a proof of concept, it does not address issues of
scaling and catastrophic interference. Several researchers are pursuing solutions
to the problem of learning in the context of language acquisition (e.g., [4,7,16]).
In collaboration with M. Cohen, B. Thompson, and C. Wendelken, the author is also augmenting shruti to integrate the propagation of belief with the
propagation of utility. The integrated system will be capable of seeking explanations, making predictions, instantiating goals, constructing reactive plans, and
triggering actions that maximize the system’s expected future utility.
Acknowledgment
This work was partially funded by grants NSF SBR-9720398 and ONR N00014-93-1-1149, and subcontracts from Cognitive Technologies Inc. related to contracts ONR N00014-95-C-0182 and ARI DASW01-97-C-0038. Thanks to M.
Cohen, J. Feldman, D. Grannes, J. Hobbs, D.R. Mani, B. Thompson, and C.
Wendelken.
References
1. Ajjanagadde, V.: Reasoning with function symbols in a connectionist network. In
the Proceedings of the 12th Conference of the Cognitive Science Society, Cambridge,
MA. (1990) 285–292.
2. Ajjanagadde, V.: Abductive reasoning in connectionist networks: Incorporating variables, background knowledge, and structured explananda. Technical Report WSI 91-6, Wilhelm-Schickard Institute, University of Tübingen, Germany (1991).
3. Ajjanagadde, V., Shastri, L.: Efficient inference with multi-place predicates and
variables in a connectionist network. In the Proceedings of the 11th Conference of
the Cognitive Science Society, Ann-Arbor, MI (1989) 396–403.
4. Bailey, D., Chang, N., Feldman, J., Narayanan, S.: Extending Embodied Lexical
Development. In the Proceedings of the 20th Conference of the Cognitive Science
Society, Madison, WI. (1998) 84–89.
5. Barnden, J., Srinivas, K.: Encoding Techniques for Complex Information Structures in Connectionist Systems. Connection Science, 3, 3 (1991) 269–315.
6. Bliss, T.V.P., Collingridge, G.L.: A synaptic model of memory: long-term potentiation in the hippocampus. Nature 361, (1993) 31–39.
7. Gasser, M., Colunga, E.: Where Do Relations Come From? Indiana University
Cognitive Science Program, Technical Report 221, (1998).
8. Hobbs, J.R., Stickel, M., Appelt, D., Martin, P.: Interpretation as Abduction, Artificial Intelligence, 63, 1-2, (1993) 69–142.
9. Hummel, J. E., Holyoak, K.J.: Distributed representations of structure: a theory
of analogical access and mapping. Psychological Review, 104, (1997) 427–466.
10. Lakoff, G.: Women, Fire, and Dangerous Things — What categories reveal about
the mind, University of Chicago Press, Chicago (1987).
11. Lange, T. E., Dyer, M. G.: High-level Inferencing in a Connectionist Network.
Connection Science, 1, 2 (1989) 181–217.
12. Lisman, J. E., Idiart, M. A. P.: Storage of 7 ± 2 Short-Term Memories in Oscillatory
Subcycles. Science, 267 (1995) 1512–1515.
13. Luck, S. J., Vogel, E. K.: The capacity of visual working memory for features and
conjunctions. Nature 390 (1997) 279–281.
14. Mani, D.R., Shastri, L.: Reflexive Reasoning with Multiple-Instantiation in a
Connectionist Reasoning System with a Typed Hierarchy, Connection Science, 5,
3&4, (1993) 205–242.
15. Park, N.S., Robertson, D., Stenning, K.: An extension of the temporal synchrony approach to dynamic variable binding in a connectionist inference system.
Knowledge-Based Systems, 8, 6 (1995) 345–358.
16. Regier, T.: The Human Semantic Potential: Spatial Language and Constrained
Connectionism, MIT Press, Cambridge, MA, (1996).
17. Shastri, L.: Neurally motivated constraints on the working memory capacity of a
production system for parallel processing. In the Proceedings the 14th Conference
of the Cognitive Science Society, Bloomington, IN (1992) 159–164.
18. Shastri, L.: Exploiting temporal binding to learn relational rules within a connectionist network. TR-97-003, International Computer Science Institute, Berkeley,
CA, (1997).
19. Shastri, L.: A Model of Rapid Memory Formation in the Hippocampal System, In
the Proceedings of the 19th Annual Conference of the Cognitive Science Society,
Stanford University, CA, (1997) 680–685.
20. Shastri, L.: Advances in shruti — A neurally motivated model of relational knowledge representation and rapid inference using temporal synchrony. Applied Intelligence, 11 (1999) 79–108.
21. Shastri, L.: Recruitment of binding and binding-error detector circuits via longterm potentiation. Neurocomputing, 26-27 (1999) 865–874.
22. Shastri, L., Ajjanagadde V.: From simple associations to systematic reasoning: A
connectionist encoding of rules, variables and dynamic bindings using temporal
synchrony. Behavioral and Brain Sciences, 16:3 (1993) 417–494.
23. Shastri, L., Grannes, D.J.: A connectionist treatment of negation and inconsistency.
In the Proceedings of the 18th Conference of the Cognitive Science Society, San
Diego, CA, (1996).
24. Shastri, L., Grannes, D.J., Narayanan, S., Feldman, J.A.: A Connectionist Encoding of Schemas and Reactive Plans. In Hybrid Information Processing in Adaptive
Autonomous vehicles, G.K. Kraetzschmar and G. Palm (Eds.), Lecture Notes in
Computer Science, Springer-Verlag, Berlin (To appear).
25. Shastri, L., Wendelken, C.: Knowledge Fusion in the Large – taking a cue from the
brain. In the Proceedings of the Second International Conference on Information
Fusion, FUSION’99, Sunnyvale, CA, July (1999) 1262–1269.
26. Singer, W.: Synchronization of cortical activity and its putative role in information
processing and learning. Annual Review of Physiology 55 (1993) 349–74.
27. Sun, R.: On variable binding in connectionist networks. Connection Science, 4, 2
(1992) 93–124.
28. von der Malsburg, C.: Am I thinking assemblies? In Brain Theory, ed. G. Palm &
A. Aertsen. Springer-Verlag (1986).
A Recursive Neural Network for Reflexive
Reasoning
Steffen Hölldobler¹, Yvonne Kalinke²⋆, and Jörg Wunderlich³⋆⋆
¹ Dresden University of Technology, Dresden, Germany
² Queensland University of Technology, Brisbane, Australia
³ Neurotec Hochtechnologie GmbH, Friedrichshafen, Germany
Abstract. We formally specify a connectionist system for generating
the least model of a datalogic program which uses linear time and space.
The system is shown to be sound and complete if only unary relation
symbols are involved, and complete but unsound otherwise. For the latter
case a criterion is defined which guarantees correctness. Finally, we
compare our system to the forward reasoning version of Shruti.
1 Introduction
Connectionist systems exhibit many desirable properties of intelligent systems
like, for example, being massively parallel, context–sensitive, adaptable and robust (see eg. [10]). It is strongly believed that intelligent systems must also be
able to represent and reason about structured objects and structure–sensitive
processes (see eg. [12,25]). Unfortunately, we are unaware of any connectionist
system which can handle structured objects and structure–sensitive processes
in a satisfying way. Logic systems were designed to cope with such objects and
processes and, consequently, it is a long–standing research goal to combine the
advantages of connectionist and logic systems in a single system.
There have been many results on such a combination which involves propositional logic (cf. [24,26]). In [15] we have shown that a three–layered feed forward
network of binary threshold units can be used to compute the meaning function
of a logic program. The input and output layer of such a network consists of a
vector of units, each representing a propositional letter. The activation patterns
of these layers represent an interpretation I with the understanding that the
unit representing the propositional letter p is active iff p is true under I .
For certain classes of logic programs it is well–known that they admit a least
model, and that this model can be computed as the least fixed point of the program’s meaning function applied to an arbitrary initial interpretation [1,11]. To
⋆ The author acknowledges support from the German Academic Exchange Service (DAAD) under grant no. D/97/29570.
⋆⋆ The results reported in this paper were achieved while the author was at the Dresden University of Technology.
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 46–62, 2000.
© Springer-Verlag Berlin Heidelberg 2000
compute the least fixed point in such cases, the feed forward network mentioned
in the previous paragraph is turned into a recurrent one by connecting each unit
in the output layer to the corresponding unit in the input layer. We were able
to show — among other results — that such so-called Rnns, ie. recursive neural
networks with a feed forward kernel , converge to a stable state which represents
the least model of the corresponding logic program. Moreover, in [6] it was shown
that the binary threshold units in the hidden layer of the kernel can be replaced
by units with sigmoidal activation function. Consequently, the networks can be
trained by backpropagation and after training new refined program clauses (or
rules) can be extracted using, for example, the techniques presented in [31]. Altogether, this is a good example of how the properties of logic systems can be
combined with the inherent properties of connectionist systems.
Unfortunately, this does not solve the aforementioned problem because structured objects and structure–sensitive processes cannot be modeled within propositional logic but only within first- and higher–order logics. For these logics, however, similar results combining connectionist and logic systems are not known.
One of the problems is that as soon as the underlying alphabet contains a single
non–nullary function symbol and a single constant, then there are infinitely many
ground atoms which cannot be represented locally in a connectionist network.
In [17,18] we have shown that for certain classes of first–order logic programs
interpretations can be mapped onto real numbers such that the program’s meaning function can be encoded as a continuous function on the real numbers.
Applying a result from [13] we conclude that three–layered feed forward networks with sigmoidal activation function for the units occurring in the hidden
layer and linear activation function for the units occurring in the input and output layer can approximate the meaning function of first–order logic programs
arbitrarily well. Moreover, turning this feed forward kernel into an Rnn, the
Rnn computes an approximation of the least fixed point, i.e. the least model, of
a given logic program. The notion of an approximation is based on a distance
function between interpretations such that — loosely speaking — the distance
is inversely proportional to the number of atoms on which both interpretations
agree.
Unfortunately, the result reported in [17,18] is purely theoretical and we have
not yet developed a real connectionist system which uses it. One of the main
obstacles for doing so is that we need to find a connectionist representation for
terms. There are various alternatives:
• We may use a structured connectionist network as in [14]: In this case the
network is completely local, all computations like, for example, unification
can be performed, but it is not obvious at all how such networks can be
learned. The structure is far too complex for current learning algorithms
based on the recruitment paradigm [9].
• We may use a vector of fixed length to represent terms as in the recursive
auto–associative memory [28], the labeling recursive auto–associative memory [30] or in the memory based on holographic reduced representations
[27]. Unfortunately, in extensive tests none of these proposals has led to
satisfying results: The systems could not safely store and recall terms of depth
larger than five [20].
• We may use hybrid systems, where terms are represented and manipulated
in a conventional way. But this is not a kind of integration that we were
hoping for because in this case results from connectionist systems cannot be
applied to the conventional part.
• We may use connectionist encodings of conventional data structures like
counters and stacks [16,21], but currently the models are still too simple.
• We may use a phase–coding to bind constants to terms as suggested in
Shruti [29]: In this case we restrict our first–order language to contain only
constants and multi–place relation symbols.
Considering this current state of the art in connectionist term representations
we propose in this paper to extend our connectionist model developed in [15]
to handle constants and multi–place relations by turning the units into phase–
sensitive ones and solving the variable binding problem as suggested in Shruti.
Because our system generates models for logic programs in a forward reasoning
manner, it is necessary to consider the version of Shruti where forward reasoning
is performed. There are three main difficulties with such a Shruti system:
• In almost any derivation more than one copy of the rules is needed, which
leads to sequential processing.
• The structure of the system is quite complex: there are many different types
of units with a complex connection structure, and it is not clear at all how
such structures can be learned.
• The logical foundation of the system has not been developed yet.
In [2] we have developed a logical calculus for the backward reasoning version
of Shruti by showing that reflexive reasoning as performed by Shruti is nothing
but reasoning by reductions in a conventional, but parallel logic system based on
the connection method [3]. In this paper we develop a logic system for forward
reasoning in a first–order calculus with constants and multi–place relations and
specify a recurrent neural network implementing this system. The logic system
is again based on the connection method using reduction techniques which link
logic systems to database systems. We define a calculus called Bur (for bottom–
up reductions) which has the following properties:
• For unary relation symbols the calculus is sound and complete.
• For relation symbols with an arity larger than one the calculus is complete
but not necessarily sound.
• We develop a criterion which guarantees that the results achieved in the case
where the relation symbols have an arity larger than one are sound.
• Computations require only linear parallel time and linear parallel space.
Furthermore, we extend the feed forward neural networks developed in [15] by
turning the units into phase–sensitive ones. We formally show that the Bur calculus can be implemented in these networks. Compared to Shruti our networks
consist only of two types of phase–sensitive units. The connection structure is
an Rnn with a four–layered feed forward neural network as kernel and, thus,
is considerably simpler than the connection structure of Shruti. Besides giving
a rigorous formal treatment of the logic underlying Shruti if run in a forward
reasoning manner, this line of research may also lead to networks for reflexive
reasoning, which can be trained using standard techniques like backpropagation.
The paper is organized as follows: In the following Section 2 we repeat some
basic notions, notations and results concerning logic programming and reflexive
reasoning. The Bur calculus is formally defined in Section 3. Its connectionist
implementation is developed in Section 4. The properties of the implementation and its relation to the Shruti system are discussed in Sections 5 and 6
respectively. In the final Section 7 we discuss our results and point out future
research.
2 Logic Programs and Reflexive Reasoning
We assume the reader to have some background in logic programming and deduction systems (see eg. [23,5]) as well as in connectionist systems and, in particular,
in the Shruti system [29]. Thus, in this section we will just briefly repeat the
basic notions, notations and results.
A (definite) logic program is a set of clauses, ie. universally closed formulas¹
of the form A ← A1 ∧ . . . ∧ An , where A, Ai , 1 ≤ i ≤ n , are first–order atoms.
A and A1 ∧ . . . ∧ An are called head and body respectively. A clause of a logic
program is said to be a fact if its body is empty and its head does not contain
any occurrences of a variable; otherwise it is called a rule. A logic program is
said to be a datalogic program if all function symbols occurring in the program
are nullary, ie. if all function symbols are constants. For example, the database
in Shruti is a datalogic program.²
Definite logic programs enjoy many nice properties, among which is the one
that each program P admits a least model. This model contains precisely all the
logical consequences of the program. Moreover, it can be computed iteratively
as the least fixed point of a so–called meaning function TP which is defined on
interpretations I as
TP (I) = {A | there exists a ground instance A ← A1 ∧ . . . ∧ An
of a clause in P such that {A1 , . . . , An } ⊆ I},
where an interpretation is a set of ground atoms. In case of a datalogic program
P over a finite set of constants, the least fixed point of TP can be computed
in finite, albeit exponential time (in the worst case) with respect to the size of
P . Following the argumentation in [29], datalogic programs are thus unsuitable
to model reflexive reasoning. Only by imposing additional conditions on the
syntactic structure of datalogic programs as well as on their runtime behavior,
it was possible to show that a backward reasoning version of Shruti is able to
answer questions in linear time.
¹ Ie. all variables are assumed to be universally quantified.
² To be precise, existentially bound variables in a Shruti database must be replaced by new constants (see [2]).
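The iterative computation of the least model via TP can be made concrete in a few lines. The following Python sketch uses our own encoding (atoms as tuples, variables as capitalized strings); it is illustrative only and is neither part of the formal development nor the connectionist implementation developed below. Note that naive ground instantiation is exponential in the worst case, matching the complexity remark above.

```python
from itertools import product

def is_var(t):
    # Our convention: variables start with an upper-case letter.
    return t[0].isupper()

def t_p(program, constants, interp):
    """One application of the meaning function T_P to the interpretation `interp`."""
    result = set()
    for head, body in program:
        vs = sorted({t for atom in [head] + body for t in atom[1:] if is_var(t)})
        for binding in product(constants, repeat=len(vs)):
            theta = dict(zip(vs, binding))
            ground = lambda a: (a[0],) + tuple(theta.get(t, t) for t in a[1:])
            if all(ground(a) in interp for a in body):
                result.add(ground(head))
    return result

def least_model(program, constants):
    """Iterate T_P from the empty interpretation up to the least fixed point."""
    interp = set()
    while True:
        nxt = t_p(program, constants, interp)
        if nxt == interp:
            return interp
        interp = nxt

# The datalogic program P1 used in Section 3; facts have an empty body.
p1 = [(('p', 'a', 'b'), []), (('p', 'c', 'd'), []),
      (('q', 'a', 'c'), []), (('q', 'b', 'c'), []),
      (('r', 'X', 'Y', 'Z'), [('p', 'X', 'Y'), ('q', 'Y', 'Z')])]
model = least_model(p1, ['a', 'b', 'c', 'd'])
```

Here `model` contains the four facts together with the derived atom r(a, b, c) and nothing else.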
3 Bottom–Up Reductions: The BUR Calculus
In this section we develop a new calculus called Bur based on the idea of applying
reduction techniques to a given knowledge base in a bottom–up manner, whereby
the reduction techniques can be efficiently applied in parallel. We are particularly
interested in reduction techniques, which can be applied in linear time and space.
Let C be a finite set of constant symbols and R a finite set of relation
symbols.³ A Bur knowledge base P is a finite set of formulas each of which is
either a fact or a rule. Thus, a Bur knowledge base is simply a datalogic program.
Before turning to the definition of the reduction techniques we consider a
knowledge base P1 with the facts

p(a, b) and p(c, d)   (1)

as well as

q(a, c) and q(b, c)   (2)

and a single rule

r(X, Y, Z) ← p(X, Y ) ∧ q(Y, Z),   (3)

where C = {a, b, c, d} is the set of constants, R = {p, q, r} the set of relation symbols and X, Y, Z are variables. Using a technique known as database (or DB) reduction in the connection method (see [4]) the facts in (1) can be equivalently replaced by

p(X, Y ) ← (X, Y ) ∈ {(a, b), (c, d)}   (4)

and, likewise, the facts in (2) can be equivalently replaced by

q(X, Y ) ← (X, Y ) ∈ {(a, c), (b, c)}.   (5)
Although the transformations seem to be straightforward they have the desired side–effect that there is now only one possibility to satisfy the conditions
p(X, Y ) and q(Y, Z) in the body of rule (3), viz. by using (4) and (5) respectively. Technically speaking, there is an isolated connection between p(X, Y )
occurring in the body of (3) and the head of (4) and, likewise, between q(Y, Z)
occurring in the body of (3) and the head of (5) [4]. Such isolated connections
can be evaluated. Applying the corresponding reduction technique yields
r(X, Y, Z) ← (X, Y, Z) ∈ π1,2,4 (p ✶p/2=q/1 q),   (6)

where ✶ denotes the (natural equi-)join of the relations p and q , p/2 = q/1 denotes the constraint that the second argument of the relation p should be identical to the first argument of q and π1,2,4 (s) denotes the projection of the relation s to the first, second and fourth argument. Evaluating the database operations occurring in equation (6) leads to the reduced expression

r(X, Y, Z) ← (X, Y, Z) ∈ {(a, b, c)}.   (7)

³ Throughout the paper we will make use of the following notational conventions: a, b, . . . denote constants, p, q, . . . relation symbols and X, Y, . . . variables.
A Recursive Neural Network for Reflexive Reasoning
51
In general, after applying database reductions to facts the evaluation of isolated connections between the reduced facts and the atoms occurring in the body
of rules leads to expressions containing the database operations union ( ∪ ), intersection ( ∩ ), projection ( π ), Cartesian product ( ⊗ ) and join ( ✶ ). These are
the standard operations of a relational database (see eg. [32]). The most costly
operation is the join, which in the worst case requires exponential space and
time with respect to the number of arguments of the involved relations and the
number of atoms occurring in the body of a rule. Because it is our goal to set
up a calculus which allows reasoning within linear time and space boundaries,
we must avoid the join operation.
This can be achieved if we replace database reductions by so–called pointwise
database reductions: In our example, the facts in (1) and (2) are replaced by

p(X, Y ) ← X ∈ {a, c} ∧ Y ∈ {b, d}   (8)

and

q(X, Y ) ← X ∈ {a, b} ∧ Y ∈ {c},   (9)

respectively. After evaluating isolated connections (3) now becomes

r(X, Y, Z) ← X ∈ π1 (p) ∧ Y ∈ π2 (p) ∩ π1 (q) ∧ Z ∈ π2 (q),   (10)

which can be further evaluated to

r(X, Y, Z) ← X ∈ {a, c} ∧ Y ∈ {b} ∧ Z ∈ {c}.   (11)
In general, the use of pointwise database reductions instead of database reductions leads to expressions involving only the database operations union, intersection, projection and Cartesian product, all of which can be computed in
linear time and space using an appropriate representation. The drawback of
this approach is that now so–called spurious tuples may occur in relations. For
example, according to (11) not only (a, b, c) is in relation r (as in (7)) but
also (c, b, c) . r(a, b, c) is a logical consequence of the example knowledge base,
whereas r(c, b, c) is not. It is easy to see that spurious tuples may occur only
if multiplace relation symbols are involved. We will come back to this problem
later in this section.
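The loss of inter-argument dependencies can be demonstrated in a few lines of Python. The snippet below contrasts the exact join-based evaluation of rule (3) with the pointwise evaluation of (10) for the example knowledge base; the variable names are our own.

```python
# Relations of the example knowledge base P1 as sets of tuples.
p = {('a', 'b'), ('c', 'd')}
q = {('a', 'c'), ('b', 'c')}

# DB reduction, rule (6): join on p/2 = q/1, projected to arguments 1, 2 and 4.
r_exact = {(x, y, z) for (x, y) in p for (y2, z) in q if y == y2}

# Pointwise DB reduction, rule (10): each argument is constrained only by a
# set of constants, so correlations between arguments are lost.
col = lambda rel, i: {t[i] for t in rel}   # projection to a single column
xs = col(p, 0)                             # π1(p) = {a, c}
ys = col(p, 1) & col(q, 0)                 # π2(p) ∩ π1(q) = {b}
zs = col(q, 1)                             # π2(q) = {c}
r_pointwise = {(x, y, z) for x in xs for y in ys for z in zs}
```

The exact evaluation yields only (a, b, c), while the pointwise evaluation additionally yields the spurious tuple (c, b, c), exactly as in (11).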
After this introductory example we can now formally define the reduction
rules of the Bur calculus. One should keep in mind that these rules are used to
compute the least fixed point of TP for a given Bur database P . Without loss
of generality we may assume that the head of each rule contains only variable
occurrences: Any occurrence of a constant c in the head of a rule may be
replaced by a new variable X if the condition X ∈ {c} is added to the body of
the rule.⁴ A similar transformation can be applied to facts, ie. each fact of the form p(c1 , . . . , cn ) can be replaced by

p(X1 , . . . , Xn ) ← X1 ∈ {c1 } ∧ . . . ∧ Xn ∈ {cn }.

We will call such expressions generalized facts.

⁴ This is called the homogeneous form in [8].
The Bur calculus contains the following two rules:

• Pointwise DB reduction: Let

p(X1 , . . . , Xn ) ← X1 ∈ C1 ∧ . . . ∧ Xn ∈ Cn and p(X1 , . . . , Xn ) ← X1 ∈ D1 ∧ . . . ∧ Xn ∈ Dn

be two generalized facts in P . Replace these facts by

p(X1 , . . . , Xn ) ← X1 ∈ C1 ∪ D1 ∧ . . . ∧ Xn ∈ Cn ∪ Dn .

• Evaluation of isolated connections: Let C be the set of constants,

p(X1 , . . . , Xm ) ← p1 (t11 , . . . , t1k1 ) ∧ . . . ∧ pn (tn1 , . . . , tnkn )   (12)

be a rule in P such that there are also generalized facts of the form

pi (Yi1 , . . . , Yiki ) ← Yi1 ∈ Ci1 ∧ . . . ∧ Yiki ∈ Ciki ,   1 ≤ i ≤ n,

in P . Let Dj = ∩{Cil | Xj occurs at the l th position in pi (ti1 , . . . , tiki )} for each variable Xj occurring in (12). If Dj ≠ ∅ for each j , then add the following generalized fact to P :

p(X1 , . . . , Xm ) ← X1 ∈ C ∩ D1 ∧ . . . ∧ Xm ∈ C ∩ Dm .
One should observe that by evaluating isolated connections generalized facts
will be added to a Bur knowledge base P . Because pointwise DB reductions
do not decrease the number of facts and C is finite, this process will eventually
terminate in that no new facts are added. Let M denote the largest set of facts
obtained from P by applying the Bur reduction rules.
Theorem 1.
1. The Bur calculus is sound and complete if all relation symbols occurring in
P are unary, ie. M is precisely the least model of P .
2. The Bur calculus is complete but not necessarily sound if there are multiplace
relation symbols in P , ie. M is a superset of the least model of P .
The proof of this theorem can be found in [33]. The first part of Theorem 1
confirms the fact that considering unary relation symbols and a finite set of
constants neither extends the expressive power of a Bur knowledge base
compared to propositional Horn logic nor affects the time and space requirements for computing the minimal model of a program (see [7]). Because
the reduction techniques in the Bur calculus can be applied in linear time and
space, the minimal model of such a Bur knowledge base can be computed in
linear time and space as well.
The second part of Theorem 1 confirms the fact that considering multiplace
relation symbols and a finite set of constants does not change the expressive
power of a Bur knowledge base compared to propositional Horn logic but does
affect the time and space requirements. Turning a Bur knowledge base into an
equivalent propositional logic program may lead to exponentially more rules and
facts. Hence, the best we can hope for if we apply reduction techniques bottom–up
and in linear time and space is a pruning of the search space. In the worst
case the pruning can be neglected. In some cases, however, the application of the
reduction techniques may lead to considerable savings. Such a beneficial case is
characterized in the following theorem.
Theorem 2. If after d applications of the reduction techniques all relations
have at most one argument for which more than one binding is generated, then
all facts derived so far are logical consequences of the knowledge base.
In other words, the precondition of this theorem defines a correctness criterion
in that the Bur calculus is also sound for multiplace relations if the criterion is
met in the limit. The proof of the theorem can again be found in [33].
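The precondition of Theorem 2 is easy to check mechanically on the pointwise binding sets. The following Python sketch (function and variable names are our own) tests whether every relation has at most one argument for which more than one binding was generated:

```python
def criterion_holds(bindings):
    """Soundness criterion of Theorem 2: every relation may have at most one
    argument for which more than one constant binding has been generated.
    `bindings` maps a relation name to its list of per-argument binding sets."""
    return all(sum(len(s) > 1 for s in args) <= 1 for args in bindings.values())

# Bindings generated for the example knowledge base P1, cf. (8), (9) and (11):
ok = criterion_holds({'p': [{'a', 'c'}, {'b', 'd'}],
                      'q': [{'a', 'b'}, {'c'}],
                      'r': [{'a', 'c'}, {'b'}, {'c'}]})
```

Here `ok` is False, because both arguments of p (and the first argument of r) carry more than one binding; this is consistent with the spurious tuple (c, b, c) derived for P1 above.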
4 A Connectionist Implementation of the BUR Calculus
The connectionist implementation of the Bur calculus is based on two main
ideas: (1) use the kernel of the Rnn model to encode the logical structure of the
Bur knowledge base and its recursive part to encode successive applications of
the reduction techniques and (2) use the temporal synchronous activation model
of Shruti to solve the dynamic binding problem.
In the Bur model two types of phase–sensitive binary threshold units are
used, which are called btu–p–units and btu–c–units, respectively. They have the
same functionality as the ρ–btu and τ –and–units in the Shruti model, ie. the
output of a btu–p- and a btu–c–unit in a phase πc in a cycle ω are
obtu–p (πc ) = 1 if ibtu–p (πc ) ≥ θbtu–p , and 0 else,

and

obtu–c (πc ) = 1 if ∃πc′ . [πc′ ∈ ω ∧ ibtu–c (πc′ ) ≥ θbtu–c ] , and 0 else,
respectively, where θ denotes the threshold and i(πc ) the input of the unit in
the phase πc . Because all connections in the network will be defined as weighted
with 1 the input i(πc ) of a btu–p and btu–c–unit equals the sum of the outputs
of all units that are connected to that unit in the phase πc . The number of phases
in a cycle ω is determined by the number of constant symbols occurring in a
given Bur knowledge base.
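As a sketch, the two unit types can be modelled as functions from per-phase inputs to per-phase outputs. This is our own rendering (a phase is simply identified with its constant), not the formal network definition:

```python
def btu_p_output(inputs, threshold):
    """A btu-p unit is phase-selective: it fires in exactly those phases of the
    cycle in which its summed input reaches the threshold."""
    return {phase: int(i >= threshold) for phase, i in inputs.items()}

def btu_c_output(inputs, threshold):
    """A btu-c unit fires in every phase of the cycle as soon as its input
    reaches the threshold in at least one phase."""
    fires = any(i >= threshold for i in inputs.values())
    return {phase: int(fires) for phase in inputs}

cycle = {'a': 2, 'b': 0}   # summed inputs in the phases pi_a and pi_b
```

With threshold 2 and the inputs above, a btu–p–unit fires only in phase πa, whereas a btu–c–unit fires in both phases of the cycle.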
54
S. Hölldobler, Y. Kalinke and J. Wunderlich
For a given Bur knowledge base we construct a four–layered feed forward
network according to the following algorithm. To shorten the notation the superscripts I , O , 1 and 2 indicate whether the unit belongs to the input, output,
first or second hidden layer of the network respectively.
Definition 3. The network corresponding to a Bur knowledge base P is an
Rnn with an input, two hidden and an output layer constructed as follows:
1 For each constant c occurring in P add a unit btu–pIc with threshold 1.
2 For each relation symbol p with arity k occurring in P add units btu–pIp[1] , . . . , btu–pIp[k] and btu–pOp[1] , . . . , btu–pOp[k] , each with threshold 1.
3 For each formula F of the form p(. . .) ← p1 (. . .) ∧ . . . ∧ pn (. . .) in P do:
3.1 For each variable X occurring in the body of F add a unit btu–c1X . Draw connections from each unit btu–pIp[j] to this unit iff relation p occurs in the body of F and its j th argument is X . Set the threshold of the new unit btu–c1X to the number of incoming connections.
3.2 For each constant c occurring in F add a unit btu–c1c . Draw a connection from unit btu–pIc to this unit and connections from each unit btu–pIp[j] iff relation p occurs in the body of F and its j th argument is c . Set the threshold of the new unit btu–c1c to the number of incoming connections.
3.3 For each unit btu–c1X that was added in step 3.1 add a companion unit btu–p1X iff variable X occurs in the head of F . For each unit btu–c1c that was added in step 3.2 add a companion unit btu–p1c iff constant c occurs in the head of F . Draw connections from the input layer to the companion units such that these units receive the same input as their companion units btu–c1X and btu–c1c , and assign the same threshold.
3.4 If k is the arity of the relation p(. . .) occurring in the head of F then add units btu–p2p[1] , . . . , btu–p2p[k] . Draw a connection from each btu–c1 unit added in steps 3.1 to 3.3 to each of these units.
3.5 Draw a connection from btu–p1X to btu–p2p[j] iff variable X occurs at position j in p(. . .) . Draw a connection from btu–p1c to btu–p2p[j] iff constant c occurs at position j in p(. . .) . Set the threshold of the btu–p2 units to the number of incoming connections.
3.6 For each 1 ≤ j ≤ k draw a connection from unit btu–p2p[j] to unit btu–pOp[j] .
4 For each relation p with arity k occurring in P and for each 1 ≤ j ≤ k draw a connection from unit btu–pOp[j] to unit btu–pIp[j] .
5 Set the weights of all connections in the network to 1.
The network is a recursive network with a feed forward kernel. This kernel
is constructed in steps (1) to (3.6). It is extended in step (4) to an Rnn. As an
example consider the following Bur knowledge base P2 :
p(a, b)
q(a, b) ← p(a, b)
p(Y, X) ← q(X, Y )
r(X) ← p(X, Y ) ∧ q(Y, X)
Fig. 1. The Bur network for a simple knowledge base. btu–p–units are depicted as squares and btu–c–units as circles. For presentation clarity we have dropped the recurrent connections between corresponding units in the output and input layer.
Fig. 1 shows the corresponding Bur network. Because this knowledge base contains just the two constants a and b , the cycle ω is defined by {πa , πb } . The
inference process is initiated by presenting the only fact p(a, b) to the input
layer of the network. More precisely, the units labelled p[1] and a in the input
layer of Fig. 1 are clamped in phase πa , whereas the units labelled p[2] and b
are clamped in phase πb . The external activation is maintained throughout the
inference process. Fig. 2 shows the activation of the units during the computation. After five cycles (equals 10 time steps) the spreading of activation reaches
a stable state and the generated model can be read off as will be explained in
the following paragraph.
Analogous to the Rnn model the input and the output layer of a Bur network
represent interpretations for the knowledge base. The instantiation of a variable
X (occurring as an argument of some relation) by a constant c is realized by
activating the unit that represents X in phase πc representing c . A ground
atom p(c1 , . . . , ck ) is an element of the interpretation encoded in the activation
patterns of the input (and output) layer in cycle ω iff for each 1 ≤ j ≤ k the
units btu–pIp[j] (and btu–pOp[j] ) are activated in phases πcj ∈ ω .⁵
Because all facts of a Bur knowledge base are ground, the set of facts represents an interpretation for the given knowledge base. A computation is initialized
⁵ Because each relation is represented only once in the input and the output layer and each argument p[j] may be activated in several phases during one cycle, these activation patterns represent several bindings and, thus, several instances of the relation p simultaneously. This may lead to crosstalk as will be discussed later.
Fig. 2. The activation of the units within the computation in the Bur network shown in Fig. 1. Each cycle ω consists of two time steps, where the first one corresponds to the phase πa and the second one to the phase πb . The bold line after two cycles marks the point in time up to which the condition in Theorem 2 is fulfilled, ie. each relation argument is bound to one constant only. During the third cycle the arguments p[1] and p[2] are both bound to both constants a and b .
by presenting the facts to the network. This is done by clamping all units representing constants and all units representing the arguments of these facts.
Thereafter the activation is propagated through the network. We refer to the
propagation of activation from the input layer to the output layer as a comd
putation step. Let IBU
R denote the interpretation represented by the output
layer of a Bur network after d ≥ 1 computation steps. The computation in
the Bur network terminates if the network reaches a stable state, ie. if for two
interpretations computed in successive computation steps d and d + 1 we find
d+1
d
IBU
R = IBU R . Such a stable state will always be reached in finite time because
all rules in a Bur knowledge base are definite clauses and there are only finitely
many constants and no other function symbols.
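The phase-coded computation of a Bur network can be emulated symbolically by tracking, for every relation argument, the set of phases (constants) in which the corresponding unit is active. The following Python sketch uses our own encoding and is not the unit-level network; it reproduces for P2 the behaviour described in Fig. 2:

```python
# Knowledge base P2; variables are upper-case strings, constants lower-case.
rules = [
    (('q', 'a', 'b'), [('p', 'a', 'b')]),
    (('p', 'Y', 'X'), [('q', 'X', 'Y')]),
    (('r', 'X'),      [('p', 'X', 'Y'), ('q', 'Y', 'X')]),
]
facts = [('p', 'a', 'b')]
constants = {'a', 'b'}
is_var = lambda t: t[0].isupper()

# state[rel][i] is the set of phases (constants) in which argument i is active.
state = {'p': [set(), set()], 'q': [set(), set()], 'r': [set()]}
for rel, *args in facts:                        # clamp the facts
    for i, c in enumerate(args):
        state[rel][i].add(c)

changed = True
while changed:                                  # propagate to a stable state
    changed = False
    for (hrel, *hargs), body in rules:
        # Intersect, per term, the binding sets of all its body occurrences.
        dom = {}
        for brel, *bargs in body:
            for i, t in enumerate(bargs):
                allowed = state[brel][i] if is_var(t) else state[brel][i] & {t}
                dom[t] = dom.get(t, constants) & allowed
        if any(not d for d in dom.values()):
            continue                            # some body condition fails
        for i, t in enumerate(hargs):
            new = dom[t] if is_var(t) else {t}
            if not new <= state[hrel][i]:
                state[hrel][i] |= new
                changed = True
```

In the stable state both arguments of p are bound to both constants (the crosstalk of the instances p(a, b) and p(b, a)), while r is bound to b only; r(b) is indeed a logical consequence of P2.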
5 Properties of the BUR Network
It is straightforward to verify that a Bur network encodes the reduction rules of
the Bur calculus. A condition X ∈ C for some argument j of a relation p is
encoded by activating the unit p[j] occurring in the input (and output) layer in
all phases corresponding to the constants occurring in C . This basically covers
pointwise DB reductions. The hidden layers and their connections are constructed such that they precisely realize the evaluation of isolated connections. This
is in fact an enrichment of a McCulloch–Pitts network [24] by a phase–coding
of bindings.
Let us now first consider the case where the Bur knowledge base P contains
only unary relation symbols. In this case the computation of the Bur network
with respect to an interpretation I within one computation step equals the
computation of the meaning function TP (I) for the logic program P . One
should observe that TP (∅) is precisely the set of all facts occurring in P and,
thus, corresponds precisely to the activation pattern presented as external input
to initialize the Bur network. Hence, it is not too difficult to show by induction
on the number of computation steps that the following proposition holds.
Proposition 4. Let TP be the meaning function for a Bur knowledge base P
and d be the number of computation steps. If P contains only unary relation
symbols, then I^d_BUR = TP^{d+1}(∅) .
Because for a Bur knowledge base P the least fixed point of TP exists
and can be computed in finite time, Proposition 4 ensures that in the case of
unary relation symbols the Bur network computes the least model of the Bur
knowledge base. Moreover, by Theorem 1(1) we learn that the Bur network is
a sound and complete implementation of the Bur calculus in this case.
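To make the role of the meaning function TP concrete, here is an illustrative sketch (not the paper's implementation) of iterating TP for a datalog program from the empty interpretation up to its least fixed point. The encoding of atoms as tuples with uppercase letters as variables, and the example program, are our assumptions.

```python
# Illustrative sketch: iterate the meaning function T_P of a datalog program
# from the empty interpretation until a stable state (the least fixed point).
from itertools import product

def t_p(rules, constants, interpretation):
    """One application of T_P: all ground heads whose ground body holds."""
    result = set()
    for head, body in rules:
        # uppercase argument letters are variables (our convention here)
        variables = sorted({a for atom in [head] + body
                            for a in atom[1:] if a.isupper()})
        for binding in product(constants, repeat=len(variables)):
            env = dict(zip(variables, binding))
            ground = lambda atom: (atom[0],) + tuple(env.get(a, a) for a in atom[1:])
            if all(ground(b) in interpretation for b in body):
                result.add(ground(head))
    return result

def least_model(rules, constants):
    """Iterate T_P from the empty interpretation until T_P(I) = I."""
    interp = set()
    while True:
        nxt = t_p(rules, constants, interp)
        if nxt == interp:
            return interp
        interp = nxt

# Hypothetical program in the spirit of P2: facts p(a,b), p(b,a), q(b,a)
# and the rule r(X) <- p(X,Y) AND q(Y,X).
rules = [(('p', 'a', 'b'), []), (('p', 'b', 'a'), []), (('q', 'b', 'a'), []),
         (('r', 'X'), [('p', 'X', 'Y'), ('q', 'Y', 'X')])]
model = least_model(rules, ['a', 'b'])    # contains ('r', 'a')
```

Because all clauses are definite and the constant set is finite, the iteration is monotone and terminates, mirroring the finite-time argument above.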
The result is restricted to a knowledge base with unary relation symbols
only, because in the general case of multi–place relation symbols the so–called
crosstalk problem may occur. If several instances of a multi–place relation are
encoded in an interpretation the relation arguments can each be bound to several
constants. Because it is not encoded which argument binding belongs to which
instance, instances that are not actually elements of the interpretation may
mistakenly be taken to be. This problem corresponds precisely to the problem of
whether spurious tuples are computed in the Bur calculus.
Reconsider P2 for which Fig. 2 shows the input to the Bur network and
the activation of the output units during the first five computation steps of
the computation. During the second computation step the units btu–p^O_{p[1]} and
btu–p^O_{p[2]} are both activated in the phases π_a and π_b , because they have to
represent the bindings p[1] = a ∧ p[2] = b of the instance p(a, b) and p[1] =
b ∧ p[2] = a of the instance p(b, a) of the computed interpretation. But these
activations also represent the instances p(a, a) and p(b, b) .
We can show, however, that despite of the instances that are erroneously
represented as a result of the crosstalk problem, all instances that result from
an application of the meaning function to a given interpretation are computed
correctly.
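The crosstalk effect just described can be made concrete with a small sketch (our illustration, not the paper's code): storing only per-argument binding sets and decoding by cartesian product yields exactly the spurious instances p(a, a) and p(b, b).

```python
# Illustrative sketch of crosstalk: phase-coding stores, per argument position
# of a relation, only the set of constants (phases) in which the unit fires.
# Which binding belongs to which instance is lost, so decoding an activation
# pattern amounts to taking the cartesian product of the binding sets.
from itertools import product

def encode(instances):
    """Per-argument binding (phase) sets of one relation."""
    arity = len(instances[0])
    return [{inst[i] for inst in instances} for i in range(arity)]

def decode(bindings):
    """Every instance the activation pattern represents, incl. crosstalk."""
    return set(product(*bindings))

stored  = [('a', 'b'), ('b', 'a')]   # the instances p(a,b) and p(b,a)
pattern = encode(stored)             # [{'a','b'}, {'a','b'}]
decoded = decode(pattern)            # also contains ('a','a') and ('b','b')
```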
Proposition 5. Let P be a Bur knowledge base, TP the meaning function for
P , and d be the number of computation steps. Then I_BUR^{d+1} ⊇ TP^d(∅) .
One can be even more precise by showing that the Bur network is again a
sound and complete implementation of the Bur calculus in that the stable state
of the network precisely represents M (see Theorem 1(2)).
As a consequence of the one–to–one correspondence between the Bur calculus and its connectionist implementation we can now apply the precondition
of Theorem 2 as a criterion that, if met, ensures that all instances represented
by an interpretation in the Bur network belong to the least model of the Bur
knowledge base. More precisely, because the computation in the Bur network
yields all ground atoms that actually are logical consequences of P , we can use
the criterion stated in Theorem 2 to determine the computation step d in
yields all ground atoms that actually are logical consequences of P and using
the criterion stated in Theorem 2 we can determine the computation step d in
58
S. Hölldobler, Y. Kalinke and J. Wunderlich
which ground atoms that are not logical consequences of P are computed for
the first time. Let I<d denote the set of ground atoms that are computed within
computation steps less than d and I≥d the ones computed within computation
steps equal to or higher than d . The elements of I<d are logical consequences of
the knowledge base by Theorem 2, whereas the elements in I≥d \ I<d may or
may not be logical consequences of the knowledge base. This problem can be
decided by presenting these elements to a sound and complete backward reasoning system for datalogic. One should observe, however, that in the worst case
the set I≥d \ I<d may contain exponentially many elements with respect to the
size of the knowledge base.
Finally, we will analyze the size of a Bur network and the time it takes for
a network to settle down in a stable state. From Definition 3 we learn that the
number of units in the Bur network grows linearly with the size n of the Bur
knowledge base P .6 The time needed for one computation step is 4×|ω| , where
|ω| denotes the length of the cycle ω . An element A of the least model of P
is computed in 4 × |ω| × (l − 1) time, where l is the length of the shortest
derivation of A with respect to TP . In other words, the time is linear with
respect to the shortest derivation. In the worst case, however, l = 2^{|C|} , where
|C| denotes the number of constants occurring in the alphabet underlying a Bur
knowledge base. This time complexity comes as no surprise as the computation
of the least model of a datalogic program is not in the class NC and thus is
unlikely to be parallelizable in an optimal or efficient way (see e.g. [22]).
6 The BUR vs. the Forward Reasoning SHRUTI System
Note: In this section Shruti always refers to the forward reasoning Shruti
system, unless indicated otherwise.
The Bur system was invented to show to what extent the Rnn model of
[15] can be extended to multi–place relation symbols by using the phase–coding
model of Shruti. It was not intended to be a logical reconstruction of Shruti.
But the systems are quite close and, consequently, should be compared in detail.
Expressiveness: Both the Bur and the Shruti7 system can cope with datalogic
programs. However, the programs in Shruti are syntactically restricted and
certain conditions have to be met during the computation. There are no such
restrictions in Bur and, consequently, the expressive power of the Bur system
is larger than that of the Shruti system.
Soundness: A detailed analysis of the Shruti system showed that computing
in Shruti is nothing but computing with reductions in the connection method
[33]. Because these reductions are sound, Shruti is sound as well.8 As shown in
6 n is determined by the number of clauses, the number of relation and constant
symbols occurring in the knowledge base, the average arity of the relation symbols,
and the average number of variables and constant symbols in each clause body.
7 After replacing existentially bound variables by new constants.
8 This should not be confused with the analysis done in [2], which was concerned with
the backward reasoning version of Shruti.
Theorem 1(2), Bur may be unsound if relation symbols with arity larger than
one are involved.
Completeness: Consider P2 as the knowledge base for a Shruti network. An
inspection of this knowledge base shows that its least model contains two different instances of the relation p . If we extend this example by adding more facts
concerning p and q , then many more copies (in fact, exponentially many) of
the rule r(X) ← p(X, Y ) ∧ q(Y, X) are needed. This example is not specifically
chosen; in a forward reasoning system it is almost always the case that
multiple instances occur. Such a problem can only be solved in the Shruti system using the so–called multiple instantiation switches, which are able to store
a certain number of different instances of a relation (see [29]). The connectionist
network encoding of these switches is quite complicated and its size depends on
the number of needed copies. Hence, an ideal Shruti network requires exponential space. Because such networks cannot be realized, the number of copies
is restricted to a certain fixed number. But now Shruti is no longer complete
in that some logical consequences of the knowledge base cannot be computed
anymore. In contrast, as shown in Theorem 1(2), Bur is complete.
Space: As already mentioned, an ideal Shruti system requires exponential space
whereas Bur requires only linear space (see Section 5). Due to lack of space we
cannot depict the Shruti network for P2 . But the interested reader can easily
verify that the Shruti network is much more complicated than the Bur network
shown in Fig. 1. On the other hand, Bur has to face the crosstalk problem and
expensive postprocessing may be needed to separate the logical consequences
of a knowledge base from those facts which are computed but are no logical
consequences. In other words, Bur trades space for time.
Time: The time to show that a certain fact is a logical consequence of the
knowledge base is in both systems linear with respect to the shortest possible
derivation of this fact.
Learning: The Bur network has a simple recurrent structure with a feed forward
kernel. For feed forward networks backpropagation and its derivatives are well
established learning techniques. There seems to be no major hurdle to extending
these techniques to cope with phase–codings, although this has to be shown in
the future. Hence, we believe that the Bur model is well suited for inductive
learning tasks using sets of input/output patterns. The Shruti model and its extension using multiple instantiations and a special type hierarchy9 are encoded in
a connectionist setting using many different and complicated unit types. Hence,
learning in Shruti networks is hardly imaginable using standard algorithms.
Facts: A Bur network encodes only the rules of a knowledge base, whereas the
facts are presented (clamped) to the input layer as an initial activation pattern.
In other words, the same set of rules can be used with different facts without
changing the structure of the network. In contrast, a Shruti network encodes
9
Such a special hierarchy is not needed in Bur because types can be represented by
unary relation symbols and Bur is sound and complete for unary relations (see
Theorem 1(1)).
the rules and the facts and, consequently, a change in the facts requires a change
in the network structure.
Summing up, Shruti is undoubtedly a powerful and quite successful tool for
reflexive backward reasoning. It is not so obvious that it is the best choice for
reflexive forward reasoning. The Bur calculus presented in this paper has some
advantages concerning expressive power, required space, simpler connectionist
structure and the handling of facts. On the other hand, the systems differ as
far as soundness and completeness issues are concerned: Bur is complete but
unsound, whereas Shruti is sound and incomplete.
7 Discussion
In this paper we have presented a new calculus together with a connectionist
implementation: the Bur system. It is a rigorous design starting from a first–
order logic (datalogic with the usual logical consequence relation), developing
a calculus (bottom–up reductions in the connection method using data base
technologies) and specifying a connectionist implementation (recurrent neural
networks with a feed forward kernel). The system is sound and complete for
unary relation symbols and complete but unsound for relation symbols with
arity larger than one. A correctness criterion is given, which provides a test for
soundness in the latter case. The connectionist implementation requires linear
space with respect to the size of the knowledge base. Furthermore, if a certain
atom A is a logical consequence of the knowledge base, then this can be shown
in time linear with respect to the shortest derivation of A . In general, however,
due to the unsoundness, we may need exponential time to decide whether A
is a logical consequence of the knowledge base. This is not a bad design but
rather a consequence of the fact that this problem is in N P . Finally, Bur has a
simple connectionist structure, which should make it possible to adapt standard
learning techniques to the Bur networks.
In many cases a Bur network can be minimized by eliminating redundant
units and connections. For example, the network shown in Fig. 1 contains several
redundant units like the rightmost unit in the area marked as “clause 3” and
shortcuts may be introduced. It is important, however, that along such shortcuts
the activation is propagated in the same time as is required when the additional
units are present. In other words, the shortcuts must not lead to speedups,
because otherwise the logical structure is not maintained.
In contrast to the very successful backward reasoning version of Shruti, Bur
is a forward reasoning system for reflexive reasoning. There is some evidence that
humans reason in a forward direction by building partial models and basing their
decisions on these models (see [19]). It remains to be tested whether the Bur
system is a valid model for these findings.
Finally, let us come back to the problem mentioned at the beginning of this
paper. The Bur calculus is not a solution to the problem of how to represent
structured objects and structure sensitive processes in connectionist systems.
But we believe that it is another step in the right direction because it gives us
a better understanding of how logic systems and connectionist systems can be
amalgamated.
References
1. K. R. Apt and M. H. van Emden. Contributions to the theory of logic programming. Journal of the ACM, 29:841–862, 1982.
2. A. Beringer and S. Hölldobler. On the adequateness of the connection method.
In Proceedings of the AAAI National Conference on Artificial Intelligence, pages
9–14, 1993.
3. W. Bibel. On matrices with connections. Journal of the ACM, 28:633–645, 1981.
4. W. Bibel. Advanced topics in automated deduction. In R. Nossum, editor, Fundamentals of Artificial Intelligence II, pages 41–59. Springer, LNCS 345, 1988.
5. W. Bibel. Deduction. Academic Press, London, San Diego, New York, 1993.
6. A.S. d’Avila Garcez, G. Zaverucha, and L.A.V. de Carvalho. Logic programming
and inductive learning in artificial neural networks. In Ch. Herrmann, F. Reine,
and A. Strohmaier, editors, Knowledge Representation in Neural Networks, pages
33–46, Berlin, 1997. Logos Verlag.
7. W. F. Dowling and J. H. Gallier. Linear-time algorithms for testing the satisfiability
of propositional Horn formulae. Journal of Logic Programming, 1(3):267–284, 1984.
8. E. W. Elcock and P. Hoddinott. Comments on Kornfeld's equality for Prolog:
E-unification as a mechanism for augmenting the Prolog search strategy. In Proceedings of the AAAI National Conference on Artificial Intelligence, pages 766–774,
1986.
9. J. A. Feldman. Memory and change in connection networks. Technical Report
TR96, Computer Science Department, University of Rochester, 1981.
10. J. A. Feldman and D. H. Ballard. Connectionist models and their properties.
Cognitive Science, 6(3):205–254, 1982.
11. M. Fitting. Metric methods – three examples and a theorem. Journal of Logic
Programming, 21(3):113–127, 1994.
12. J. A. Fodor and Z. W. Pylyshyn. Connectionism and cognitive architecture: A
critical analysis. In Pinker and Mehler, editors, Connections and Symbols, pages
3–71. MIT Press, 1988.
13. K.-I. Funahashi. On the approximate realization of continuous mappings by neural
networks. Neural Networks, 2:183–192, 1989.
14. S. Hölldobler. A structured connectionist unification algorithm. In Proceedings of
the AAAI National Conference on Artificial Intelligence, pages 587–593, 1990.
15. S. Hölldobler and Y. Kalinke. Towards a massively parallel computational model
for logic programming. In Proceedings of the ECAI94 Workshop on Combining
Symbolic and Connectionist Processing, pages 68–77. ECCAI, 1994.
16. S. Hölldobler, Y. Kalinke, and H. Lehmann. Designing a counter: Another case
study of dynamics and activation landscapes in recurrent networks. In Proceedings
of the KI97: Advances in Artificial Intelligence, volume 1303 of Lecture Notes in
Artificial Intelligence, pages 313–324. Springer, 1997.
17. S. Hölldobler, Y. Kalinke, and H.-P. Störr. Recurrent neural networks to approximate the semantics of acceptable logic programs. In G. Antoniou and J. Slaney, editors, Advanced Topics in Artificial Intelligence, volume 1502 of LNAI,
Berlin/Heidelberg, 1998. Proceedings of the 11th Australian Joint Conference on
Artificial Intelligence (AI’98), Springer–Verlag.
18. S. Hölldobler, Y. Kalinke, and H.-P. Störr. Approximating the semantics of logic
programs by recurrent neural networks. Applied Intelligence, 11:45–59, 1999.
19. P. N. Johnson-Laird and R. M. J. Byrne. Deduction. Lawrence Erlbaum Associates,
Hove and London (UK), 1991.
20. Y. Kalinke. Using connectionist term representation for first–order deduction –
a critical view. In F. Maire, R. Hayward, and J. Diederich, editors, Connectionist Systems for Knowledge Representation Deduction. Queensland University of
Technology, 1997. CADE–14 Workshop, Townsville, Australia.
21. Y. Kalinke and H. Lehmann. Computations in recurrent neural networks: From
counters to iterated function systems. In G. Antoniou and J. Slaney, editors, Advanced Topics in Artificial Intelligence, volume 1502 of LNAI, Berlin/Heidelberg,
1998. Proceedings of the 11th Australian Joint Conference on Artificial Intelligence
(AI’98), Springer–Verlag.
22. R. M. Karp and V. Ramachandran. Parallel algorithms for shared-memory machines. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science,
chapter 17, pages 869–941. Elsevier Science Publishers B.V., New York, 1990.
23. J. W. Lloyd. Foundations of Logic Programming. Springer, Berlin, Heidelberg,
1987.
24. W. S. McCulloch and W. Pitts. A logical calculus and the ideas immanent in
nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
25. A. Newell. Physical symbol systems. Cognitive Science, 4:135–183, 1980.
26. G. Pinkas. Symmetric neural networks and logic satisfiability. Neural Computation,
3:282–291, 1991.
27. T. A. Plate. Holographic reduced representations. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 30–35, 1991.
28. J. B. Pollack. Recursive auto-associative memory: Devising compositional distributed representations. In Proceedings of the Annual Conference of the Cognitive
Science Society, pages 33–39, 1988.
29. L. Shastri and V. Ajjanagadde. From associations to systematic reasoning: A
connectionist representation of rules, variables and dynamic bindings using temporal synchrony. Behavioural and Brain Sciences, 16(3):417–494, September 1993.
30. A. Sperduti. Labeling RAAM. Technical Report TR-93-029, International Computer Science Institute, Berkeley, CA, 1993.
31. G.G. Towell and J.W. Shavlik. Extracting refined rules from knowledge–based
neural networks. Machine Learning, 13:71–101, 1993.
32. J. D. Ullman. Principles of Database Systems. Computer Science Press, Rockville,
Maryland, USA, second edition, 1985.
33. J. Wunderlich. Erweiterung des RNN–Modells um SHRUTI–Konzepte: Vom aussagenlogischen zum Schließen über prädikatenlogischen Programmen. Master’s
thesis, TU Dresden, Fakultät Informatik, 1998.
A Novel Modular Neural Architecture for
Rule-Based and Similarity-Based Reasoning
Rafal Bogacz and Christophe Giraud-Carrier
Department of Computer Science, University of Bristol
Merchant Venturers Building, Woodland Rd
Bristol BS8 1UB, UK
{bogacz,cgc}@cs.bris.ac.uk
Abstract. Hybrid connectionist symbolic systems have been the subject
of much recent research in AI. By focusing on the implementation of high-level
human cognitive processes (e.g., rule-based inference) on low-level,
brain-like structures (e.g., neural networks), hybrid systems inherit both
the efficiency of connectionism and the comprehensibility of symbolism.
This paper presents the Basic Reasoning Applicator Implemented as a
Neural Network (BRAINN). Inspired by the columnar organisation of the
human neocortex, BRAINN’s architecture consists of a large hexagonal
network of Hopfield nets, which encodes and processes knowledge from
both rules and relations. BRAINN supports both rule-based reasoning
and similarity-based reasoning. Empirical results demonstrate promise.
1 Introduction
Over the past few years, the mainly historical, and arguably unproductive, division between psychological and biological plausibility has narrowed significantly
through the design and implementation of successful hybrid connectionist symbolic systems. Rather than committing to a single philosophy, such systems draw on
the strengths of both biology and psychology, by implementing high-level human
cognitive processes (e.g., rule-based inference) within low-level, brain-like structures (e.g., neural networks). Hence, hybrid systems inherit the characteristics
of both traditional symbolic systems (e.g., expert systems) and connectionist
architectures, including:
– Complex reasoning
– Learning and generalisation from experience
– Efficiency through massive parallelism
This paper presents the Basic Reasoning Applicator Implemented as a Neural
Network (BRAINN). The original description of BRAINN is in [1]. The architecture of BRAINN mimics the columnar organisation of the human neocortex.
It consists of a large hexagonal network of Hopfield nets [7] in which both rules
and relations can be encoded in a distributed fashion. Each relation is stored
in a single Hopfield net, whilst each rule is stored in a set of adjacent Hopfield nets. Through systematically orchestrated relaxations, BRAINN combines
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 63–77, 2000.
c Springer-Verlag Berlin Heidelberg 2000
64
R. Bogacz and C. Giraud-Carrier
rule-based reasoning typical of expert systems with similarity-based reasoning
typical of neural networks. Hence, BRAINN supports both monotonic reasoning
and several forms of common-sense (non-monotonic) reasoning [10].
The paper is organised as follows. Section 2 details the BRAINN architecture
and algorithms. Section 3 reports the results of a number of experiments with
BRAINN. Section 4 reviews related work, and section 5 concludes the paper and
outlines directions for future work.
2 BRAINN
BRAINN’s architecture is inspired by the columnar organisation of the human
neocortex [3]. The human neocortex is divided into minicolumns (Ø0.03mm),
i.e., groups of neurons gathered around dendrite bundles. Minicolumns create
a hexagonal network, and are, in turn, organised in hexagonal macrocolumns
(Ø0.5mm). The axons of some pyramidal neurons have an unusually high number of synapses within about 0.5mm of their soma. Simultaneous activation of
neurons, in a 0.5mm radius, is thus sometimes observed. These neurons excite
one another and can in turn recruit additional neurons, since the adjacent neurons receive activation from two or more neurons, as shown in Figure 1.
Fig. 1. Neuron Recruitment
Calvin [3] speculates further that newly recruited neurons can subsequently
recruit others, so that the pattern of activation of one or two macrocolumns
can spread through the whole network. He argues that this process is especially
relevant to short term memory phenomena.
2.1 Knowledge Implementation
BRAINN’s underlying neural network architecture supports the encoding of both
relations and if-then rules, as detailed in the following sections.
Relations The relations considered here can be represented as triples of the
form <Object, Attribute, Value>. In BRAINN, each component of the triple is
represented by a unique pattern. The pattern is a sequence of bits of constant
A Novel Modular Neural Architecture
65
length N , where each bit may have value -1 or +1. Hence, each relation can be
stored in a Hopfield network with 3N units. Figure 2 shows one such network
for N = 4. The activations of units x1 to xN correspond to the binary representation of the object, the activations of units xN+1 to x2N correspond to the
binary representation of the attribute, and the activations of units x2N+1 to x3N
correspond to the binary representation of the value.
Fig. 2. Relation Encoding (the triple <cat, drinks, milk> encoded by the patterns
1 -1 -1 1, -1 -1 1 1, and 1 -1 1 -1)
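As a concrete illustration, the encoding of <cat, drinks, milk> with N = 4 can be sketched as follows. The bit patterns are the ones shown in Fig. 2; the helper name encode_triple is ours, not the paper's.

```python
# Sketch: each component of a triple <Object, Attribute, Value> has a fixed
# N-bit pattern over {-1, +1}; concatenating the three patterns gives the
# state of one 3N-unit assembly.
N = 4
CODE = {                        # patterns as shown in Fig. 2
    'cat':    [ 1, -1, -1,  1],
    'drinks': [-1, -1,  1,  1],
    'milk':   [ 1, -1,  1, -1],
}

def encode_triple(obj, attr, value):
    """Concatenate the three N-bit patterns into one 3N-unit state."""
    return CODE[obj] + CODE[attr] + CODE[value]

state = encode_triple('cat', 'drinks', 'milk')   # 3N = 12 activations
```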
For simplicity, Hopfield networks encoding relations are referred to as assemblies, which work as associative memories. After delivering two components of
a relation to an assembly, the third one may be retrieved. The weights of the
network are set using either the Hebb rule or the Perceptron rule [5], as detailed
below.
Weights Setting with the Hebb Rule. Upon creation of the network, all weights
are initialised to 0. For each new triple to remember, the binary representations
of its components are delivered to the corresponding neurons in the assembly
as shown in Figure 2. Then, the weights wij between units i and j (i ≠ j) are
updated according to equation 1.
wij ← wij + xi xj    (1)
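A minimal sketch of this update, assuming patterns are Python lists over {-1, +1} (the function name hebb_store is ours, not the paper's):

```python
def hebb_store(weights, x):
    """Equation 1: w_ij <- w_ij + x_i * x_j for all i != j."""
    n = len(x)
    for i in range(n):
        for j in range(n):
            if i != j:                     # no self-connections
                weights[i][j] += x[i] * x[j]
    return weights

# store one triple pattern (3N = 12 activations) in a fresh assembly
pattern = [1, -1, -1, 1, -1, -1, 1, 1, 1, -1, 1, -1]
W = hebb_store([[0] * 12 for _ in range(12)], pattern)
```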
Weights Setting with the Perceptron Rule. The Perceptron rule is similar
to the Hebb rule, but it increases storage capacity and reduces susceptibility
to pattern correlation [5]. With the Perceptron rule, the weights wij to unit i
are modified according to equation 1 only if the output x_i^{t+1} of unit i (as per
equation 2 below) is different from the bit value in the triple's representation x_i^t .
In addition, learning is iterative. Patterns are presented repeatedly until they are
stored correctly or the learning time exceeds a pre-defined limit. The learning
algorithm is shown in Figure 3.
The network functions as an associative memory, using the principle of relaxation to retrieve relations. A pattern is delivered to the network and, during
relaxation, unit i changes its state xi according to equation 2 until the network
reaches a stable state.

x_i^{t+1} = sgn( Σ_{j=1}^{3N} wij x_j^t )    (2)
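The relaxation dynamics of equation 2 can be sketched as follows. This is an illustrative synchronous update (with a step limit as a safeguard), so treat it as a simplification rather than the paper's exact procedure.

```python
def sgn(v):
    return 1 if v >= 0 else -1

def relax(weights, x, max_steps=100):
    """Equation 2: update every unit until a stable state is reached."""
    x = list(x)
    for _ in range(max_steps):
        new_x = [sgn(sum(weights[i][j] * x[j] for j in range(len(x))))
                 for i in range(len(x))]
        if new_x == x:                     # stable state
            return x
        x = new_x
    return x

# a query such as <cat, drinks, ?> would clamp the two known component
# patterns and set the N units of the unknown component to 0 before relaxing
```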
66
R. Bogacz and C. Giraud-Carrier
Initialise all weights to 0
Repeat
– For each triple to remember
  1. Deliver triple's elements to the network's units xi
  2. For each unit i
     (a) Compute activation x_i^{t+1}
     (b) If x_i^{t+1} ≠ x_i^t Then update weights: wij ← wij + x_i^t x_j^t (j ≠ i)
Until no updates are made or time is out
Fig. 3. Weight Setting with Perceptron Rule
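The Figure 3 algorithm can be sketched in code as well. Again this is a sketch: the name perceptron_store and the epoch limit are our choices, not the paper's.

```python
def sgn(v):
    return 1 if v >= 0 else -1

def perceptron_store(weights, patterns, max_epochs=50):
    """Figure 3: update weights only for units whose recomputed activation
    disagrees with the clamped pattern; repeat until stored or time is out."""
    n = len(weights)
    for _ in range(max_epochs):
        changed = False
        for x in patterns:                  # deliver triple's elements
            for i in range(n):
                out = sgn(sum(weights[i][j] * x[j] for j in range(n)))
                if out != x[i]:             # mismatch: Hebb-style update
                    changed = True
                    for j in range(n):
                        if j != i:
                            weights[i][j] += x[i] * x[j]
        if not changed:                     # all patterns stored correctly
            return weights
    return weights
```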
The Hopfield network stabilises on the stored pattern most similar to the
delivered one or sometimes in a random state called a spurious attractor. In
BRAINN, all of the questions asked by the user take the form of a triple <Object,
Attribute, Value>, where one component is replaced by a question mark (e.g.,
<mouse, eats, ?>). The network then retrieves the triple’s missing component
based on knowledge of the other two. The units corresponding to the unknown
component are set to 0. If the question is delivered to an assembly remembering the expected relation, then, after relaxation, the units corresponding to the
unknown component are equal to the binary representation of the relation’s missing element. If the network does not store the triple, it stabilises in a spurious
attractor or another remembered triple. Examples are in section 2.3.
Given the above mapping of relations to triples, it is possible that an object
may have more than one value for a single attribute, e.g., <cat, eats, whiskas>
and <cat, eats, mouse>. To solve this problem, such triples are stored in distinct
assemblies. Details are described in section 2.4.
If-Then Rules The rules that BRAINN uses are traditional if-then rules, where
the left-hand side (LHS) consists of a conjunction of conditions and the right-hand side (RHS) is a single condition, for example:
IF <soil, is, sandy> AND <soil, humus_level, high>
THEN <soil, compaction, high>
As with relations, conditions are represented by triples of the form <Object,
Attribute, Value> and subsequently stored in assemblies of the form described
in section 2.1.1. The various assemblies representing the conditions of a rule can
then be connected into a network of assemblies that encodes the rule. Figure
4 shows such a network for the above rule. In the network, there are connections between each unit from the LHS assemblies and each unit from the RHS
assembly. The weights between assemblies are set according to the Hebb rule.
Therefore, if one knows one side of the rule, one can retrieve the other.
To accommodate rules with varying numbers of conditions in LHS and to
provide a uniform network topology for both rules and relations (rather than a
set of disconnected networks), assemblies are organised into a large hexagonal
Fig. 4. Rule Encoding in Network (for simplicity, only one neuron per element
of a relation is shown)
network as shown in Figure 5. With such an architecture, each rule may have a
maximum of 6 conditions in its LHS. Larger numbers of conditions in LHS can,
of course, be handled by chaining rules appropriately, using new variables (e.g.,
IF A AND B AND C THEN D can be rewritten as IF A AND B THEN E, and
IF E AND C THEN D).
Fig. 5. Hexagonal Network of Assemblies (each line between assemblies denotes
connections between all units from the assemblies)
When reasoning with rules, BRAINN implements backward chaining. Hence,
the network retrieves the LHS of a rule upon delivery of its RHS. The following,
along with Figure 6, details how this takes place in the network. For the sake of
argument, assume that a rule consists of four conditions in its LHS. The RHS
is stored in one assembly and the four conditions from the LHS are stored in
adjacent assemblies. Upon activation of the assembly corresponding to the RHS,
the network must retrieve the four conditions of the LHS.
Fig. 6. Rule Retrieval: a) Storage; b) Retrieval - i) RHS sent to all assemblies,
ii) Relaxation, iii) LHS sent to adjacent assemblies and iv) Relaxation
First, the RHS is delivered to all the assemblies in the hexagonal network.
Once all of the Hopfield networks have relaxed, only the assembly storing the
RHS has stabilised on the delivered pattern, since for that assembly, delivered
and stored patterns are the same. Then, the assembly storing the RHS sends
its pattern (vector of activation) to all six adjacent assemblies. Each adjacent
assembly receives the pattern vector of the RHS assembly multiplied by the
matrix of weights between the RHS assembly and itself. All of the adjacent
assemblies relax after receiving the vector. In the case of the four LHS assemblies,
the vector received is one of the patterns remembered in the local Hopfield
network, so these assemblies will be stable. The other two assemblies will not
be stable and will thus change their state. Moreover, the four LHS assemblies
now send their patterns back to the RHS assembly. The RHS assembly receives
these patterns multiplied by the matrix of weights between the LHS assemblies
and itself. Thus, the pattern received by the RHS assembly is equal to its own
pattern of activation. A kind of resonance is achieved, allowing the retrieval of
the correct LHS.
Note that LHS assemblies recruited by the aforementioned process are implicitly conjoined, i.e., the left-hand side of the rule is the conjunction of the
conditions found in all of the LHS assemblies retrieved. To avoid confusion during backward chaining when several rules have the same right-hand sides (e.g.,
IF A AND B THEN C, and IF D THEN C), the right-hand sides are stored in
different assemblies.
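The propagation between adjacent assemblies can be sketched for a single RHS-LHS pair. This is a drastic simplification of the resonance process described above, using made-up patterns; it only shows why the Hebbian inter-assembly weights map the RHS pattern to the stored LHS pattern.

```python
# Simplified sketch: inter-assembly Hebbian weights are the outer product
# lhs x rhs, so multiplying the weight matrix by the RHS pattern and taking
# signs reproduces the stored LHS pattern.
def sgn(v):
    return 1 if v >= 0 else -1

def between_weights(lhs, rhs):
    """Hebbian weights from the RHS assembly to one LHS assembly."""
    return [[lhs[i] * rhs[j] for j in range(len(rhs))] for i in range(len(lhs))]

def send(weights, pattern):
    """Pattern received by an adjacent assembly: weight matrix times vector."""
    return [sgn(sum(row[j] * pattern[j] for j in range(len(pattern))))
            for row in weights]

rhs = [1, -1, 1, -1]      # made-up code for <soil, compaction, high>
lhs = [1, 1, -1, -1]      # made-up code for <soil, is, sandy>
W = between_weights(lhs, rhs)
retrieved = send(W, rhs)  # equals lhs: the RHS pattern recruits the LHS
```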
Rules with Variables The rules discussed so far are essentially propositional.
It is often useful, and even necessary, to encode and use more general rules, which
include variables. To reason in the presence of such rules, an effective way of
binding variables is required. In BRAINN, variable binding is achieved by using
special weight values between LHS and RHS assemblies, as shown in Figure 7
for the rule IF <&someone, drinks, milk> THEN <&someone, is, strong>. Let
&X be the variable. Then, the weights between the units representing &X in
LHS and the units representing &X in RHS are equal to 1, whilst the weights
between the units representing &X and all other units are equal to 0.
Fig. 7. Weight Setting for a Simple Rule (legend: lines denote weights equal to 1;
other weights are set up according to the Hebb rule; if there is no line between two
units, the weight is equal to 0)
With such a set of weights, the pattern for the variable is sent between assemblies without any modification of, or interaction with, the rest of the information
in the assembly. The weights inside the LHS assemblies and the RHS assembly
must also satisfy similar conditions. That is, the weight of self-connection for
all units representing a variable is equal to 1, whilst the weight between each
unit representing a variable and any other unit is equal to 0. These latter conditions guarantee the stability of the assembly, which is critical to the reasoning
algorithm.
2.2 Functional Overview
Although BRAINN’s knowledge implementation is inspired by biological considerations, its information processing mechanisms are not biologically plausible.
A high-level view of BRAINN’s overall architecture is shown in Figure 8.
Short Term Memory
Reasoning
Goal
Control
Process
Long Term Memory
(Hexagonal network)
Fig. 8. BRAINN’s Architecture
The system’s knowledge (i.e., rules and relations) is stored in the Long Term
Memory (LTM). Temporary, run-time information is stored in the Short Term
Memory (STM) and the reasoning goal is stored in a dedicated variable. Reasoning is effected by a form of backward chaining. The following sections detail
the reasoning mechanisms implemented by the Control Process.
2.3 Rule-Based Reasoning
To facilitate reasoning, BRAINN’s assemblies are labelled with the type of information they store: SN for a (semantic net’s) relation, LHS for a rule’s left-hand
side, and RHS for a rule’s right-hand side. The label is represented by a unique
sequence of 4 bits, stored in a few additional units in each assembly. Hence, each
assembly actually consists of 3N + 4 units.
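As a rough illustration of how a single assembly might store a relation and answer a question by relaxation, consider a standard Hopfield network over 3N + 4 bipolar units. The specific codes, the value of N, the SN tag bits, and the synchronous, clamped update schedule below are all assumptions; the excerpt does not specify them.

```python
import numpy as np

N = 16                       # units per information field (assumed)
SIZE = 3 * N + 4             # object + attribute + value + 4-bit label

SN = np.array([1., 1., -1., -1.])   # hypothetical 4-bit tag for relations

def code(seed):
    """Deterministic bipolar code for a symbol (illustrative only)."""
    return np.random.default_rng(seed).choice([-1., 1.], N)

garfield, drinks, milk = code(10), code(11), code(12)
pattern = np.concatenate([garfield, drinks, milk, SN])

# Hebbian storage (one pattern here; several triples per assembly in general)
W = np.outer(pattern, pattern) / SIZE
np.fill_diagonal(W, 0.0)

# Question <Garfield, drinks, ?>: clamp object and attribute, start the
# value and label fields at zero, and relax synchronously.
state = np.concatenate([garfield, drinks, np.zeros(N), np.zeros(4)])
clamp = np.concatenate([np.ones(2 * N), np.zeros(N + 4)]).astype(bool)
for _ in range(10):
    new = np.sign(W @ state)
    new[new == 0] = 1.0
    new[clamp] = state[clamp]
    state = new

assert np.array_equal(state[2 * N:3 * N], milk)   # value field settles to milk
assert np.array_equal(state[3 * N:], SN)          # label identifies an SN assembly
```

With only a partial cue, relaxation completes the stored triple, which is exactly the retrieval step the reasoning algorithm relies on.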
As previously stated, BRAINN’s rule-based reasoning engine implements a
form of backward chaining. The pseudocode for the algorithm is described in
Figure 9.
If more than one rule can be used, the rules are sorted by ascending number of conditions in their LHS. The algorithm checks that an LHS condition is
satisfied by (recursively) asking the network to produce its value. For example,
70
R. Bogacz and C. Giraud-Carrier
ApplyRule(question)
1. Deliver question to all assemblies
2. Relax the network
3. If there is an SN assembly containing question Then return
corresponding answer
4. Else
(a) For all RHS assemblies containing question
i. Retrieve LHS of rule
ii. Sort rules by ascending number of LHS conditions
(b) For all rules in above order
i. Load rule to STM (both RHS and LHS assemblies)
ii. For each LHS condition of rule
– If LHS.value ≠ ApplyRule(<LHS.object, LHS.attribute, ?>)
Then try next rule
iii. Give the answer from RHS of rule
Fig. 9. BRAINN’s Backward Chaining Algorithm
the algorithm checks the condition sky has colour blue by asking the question
<sky, has colour, ?>. Although adequate for single-valued attributes, this may
cause problems for multi-valued attributes.
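The control flow of Figure 9 can be paraphrased in ordinary set-based code. This paraphrase drops the neural machinery entirely (no relaxation), handles only single-valued attributes and one variable per rule, and uses a hypothetical triple store standing in for the SN, LHS, and RHS assemblies.

```python
# Facts play the role of SN assemblies, rules the role of LHS/RHS
# assembly pairs. All names here are invented for the example.
facts = {("Garfield", "drinks"): "milk",
         ("Garfield", "colour"): "orange"}
rules = [  # (list of LHS conditions, RHS conclusion)
    ([("&x", "drinks", "milk")], ("&x", "is", "strong")),
]

def apply_rule(obj, attr):
    """Backward chaining in the spirit of Fig. 9 (set-based, not neural)."""
    if (obj, attr) in facts:                      # an SN assembly answers
        return facts[(obj, attr)]
    # Collect rules whose RHS matches the question, fewest conditions first
    candidates = [(lhs, rhs) for lhs, rhs in rules if rhs[1] == attr]
    candidates.sort(key=lambda r: len(r[0]))
    for lhs, rhs in candidates:
        # Substitute the query object for the variable, then check each
        # LHS condition recursively
        bound = [(obj if o == "&x" else o, a, v) for o, a, v in lhs]
        if all(apply_rule(o, a) == v for o, a, v in bound):
            return rhs[2]                         # answer from the RHS
    return None

assert apply_rule("Garfield", "drinks") == "milk"   # direct retrieval
assert apply_rule("Garfield", "is") == "strong"     # via the rule
```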
The following illustrates the working of the rule application algorithm on a
simple reasoning task. Assume that BRAINN’s knowledge base consists of the
following relation and rule:
<Garfield, drinks, milk>
IF <&someone, drinks, milk> THEN <&someone, is, strong>
For simplicity, also assume that the hexagonal network consists of only 3 assemblies, organised as shown in Figure 10. The divisions in the assemblies represent
subsets of units, one for each element of information (i.e., object, attribute, value
and label). Also assume that the relation is stored in the upper assembly and
the rule in lower and right assemblies as shown in Figure 10.
Fig. 10. Simple Hexagonal Network
The simplest question that the user can ask, is about what Garfield drinks,
i.e.,
<Garfield, drinks, ?>
The algorithm delivers the question to all the assemblies. The network, after
relaxation, is shown in Figure 11. A label over a division denotes that the activation of units in that division is equal to the binary representation of that
label. If there is no label over a division, the network is in a spurious attractor.
Fig. 11. Network after Relaxation: <Garfield, drinks, ?>
After relaxation, the bottom assembly is in a spurious attractor, the right
assembly has settled to one of the patterns stored in that assembly, and the top
assembly has settled to the relation (tag is SN) containing the question. Hence,
the system gives the answer from this assembly, i.e., milk.
The following question, which asks what Garfield is, causes BRAINN's rule-based reasoning mechanisms to be applied.
<Garfield, is, ?>
As before, the question is delivered to all the assemblies. The network, after
relaxation, is shown in Figure 12.
Fig. 12. Network after Relaxation: <Garfield, is, ?>
Two assemblies are empty, because the network has settled in spurious attractors. Sequences of bits in those assemblies have no meaning. The lower assembly
stores the RHS condition, which contains the question. Neighbours of that assembly receive its pattern of activation multiplied by the matrices of weights
between assemblies. The resulting network is shown in Figure 13.
Fig. 13. Network after Retrieving LHS of Rule
The upper assembly is clear because the weights between the upper and lower
assemblies are equal to zero (no rule is stored). In the right assembly, the LHS
has been retrieved. The rule, IF <Garfield, drinks, milk> THEN <Garfield, is,
strong>, is written to STM and the question, <Garfield, drinks, ?>, is delivered
to the network. The behaviour of the network for this question is as described
above. The value returned is the same as the value in the LHS of the retrieved
rule, hence the answer for the question, <Garfield, is, ?>, is given from the RHS
of the rule, i.e., strong.
Currently, the system cannot answer questions involving variables (e.g., <?,
is, strong>) since, after relaxation, the assembly which stores the RHS has one
part (i.e., the object) empty or without meaning. Further work is necessary to
overcome this limitation.
2.4 Similarity-Based Reasoning
In addition to encoding relations and rules as described in section 2.1, BRAINN
learns and reasons from similarity, as shown below. Consider the knowledge
database shown in Figure 14.
Fig. 14. Sample Knowledge Base
The database contains some information about cars, planes and lorries. The
user may ask “What is a lorry used for travelling on?” The database does not
contain the answer explicitly. However, lorries and cars have more attributes in
common than lorries and planes. In this sense, a lorry is more similar to a car
than to a plane. Hence, the system can guess that a lorry is for travelling on the
ground like a car.
To increase the capacity of the network, BRAINN generally stores new information in the assembly where the most similar information is already present.
Two mechanisms are then available for similarity-based reasoning, one using a voting algorithm and the other relying on Pavlov-like connections.
Voting Algorithm The voting algorithm assumes that all the relations with
the same object are stored in the same assembly. The algorithm to retrieve the
value of attribute Aquery for object Oquery is described in Figure 15.
All the relations with object Oquery are retrieved from memory
(one-by-one from the same assembly, using relaxation)
For each retrieved relation <Oquery, Aretrieved, Vretrieved>
1. Aretrieved and Vretrieved are delivered to each assembly
2. For each assembly C
– If ∃Osimilar s.t. C stores a relation <Osimilar, Aretrieved,
Vretrieved> Then
• If ∃Vsimilar s.t. C stores a relation <Osimilar, Aquery,
Vsimilar> Then vote for Vsimilar
Choose the value with the largest number of votes as the answer
Fig. 15. Voting Algorithm
For example, assume BRAINN implements the knowledge database presented in Figure 14. If asked <lorry, is for travelling, ?>, BRAINN would assert
on ground since lorry shares two properties with car (made of metal + has
wheels = 2 votes for on ground) and only one with plane (made of metal =
1 vote for in air).
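A literal, non-neural transcription of the voting algorithm of Figure 15, run on a hand-coded version of the Figure 14 knowledge base, reproduces this count. The dictionaries below are assumptions standing in for assemblies; in BRAINN the retrieval steps would use relaxation.

```python
from collections import Counter

# Hypothetical per-object assemblies for the Fig. 14 knowledge base
assemblies = [
    {("car", "made of"): "metal", ("car", "has"): "wheels",
     ("car", "is for travelling"): "on ground"},
    {("plane", "made of"): "metal", ("plane", "has"): "wings",
     ("plane", "is for travelling"): "in air"},
]
lorry = {("lorry", "made of"): "metal", ("lorry", "has"): "wheels"}

def vote(query_relations, a_query):
    """Fig. 15: vote for the queried attribute's value in every assembly
    holding an object that shares a retrieved property with the query."""
    votes = Counter()
    for (_, a_retr), v_retr in query_relations.items():
        for assembly in assemblies:
            for (o_sim, a), v in assembly.items():
                if a == a_retr and v == v_retr:      # shared property
                    answer = assembly.get((o_sim, a_query))
                    if answer is not None:
                        votes[answer] += 1
    return votes.most_common(1)[0][0] if votes else None

# car shares two properties with lorry, plane only one
assert vote(lorry, "is for travelling") == "on ground"
```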
Pavlov-like Connections The algorithm based on Pavlov-like connections assumes that all the relations with the same attribute and the same value are
stored in the same assembly, whilst relations with the same attribute but different values are stored in different assemblies. The hexagonal network is overlaid
with a fully connected mesh. These additional connections between all the assemblies capture co-occurring features (e.g., if some values of some attributes
occur together for one object, then some of the assemblies are active together).
The strengths of these Pavlov’s connections represent how often assemblies are
active together. When a new relation is learnt then:
1. The object from this relation is sent to all the assemblies
2. The assemblies are relaxed (only the assemblies that remember any information about the object are in “resonance”)
3. The strength of all the Pavlov’s connections between the assembly where the
new relation is stored and the assemblies in resonance is increased
Figure 16 shows the Pavlov’s connections for the knowledge database of Figure 14. Only non-zero connections are shown. Line thickness is proportional to
strength.
Fig. 16. Pavlov-like Connections
The algorithm to answer the question <Oquery, Aquery, ?> is in Figure 17.
The pattern of the object Oquery is sent to all the assemblies and all
the assemblies that remember any information about the object Oquery are
activated
Pavlov’s connections are used to determine which value of the attribute
Aquery usually occurs with the set of features of the object Oquery
Fig. 17. Pavlov Algorithm
For example, consider the behaviour of the system for the question: <lorry,
is for travelling, ?>. The object lorry is sent to all the assemblies and two assemblies storing information about the lorry are activated (see Figure 16). The
assembly remembering the relation <?, is for travelling, on the ground> receives activation from two assemblies and the assembly remembering the relation
<?, is for travelling, in the air> from one assembly only, hence the system will
guess that the answer is: on the ground.
The advantage of the voting algorithm is its simplicity and relatively low
computational cost (computations are strongly parallel). The Pavlov-like algorithm is slightly more involved, but has a very interesting feature: connections
between features are remembered even if the particular cases are forgotten (also
a feature of human learning). Therefore, such connections could be used for rule
extraction.
3 Empirical Results
BRAINN is implemented in C++ under Windows, with a GUI displaying traces
of the network’s behaviour (see http://www.cs.bris.ac.uk/~bogacz/brainn.html).
Results of preliminary experiments with BRAINN follow.
3.1 Classical Reasoning Protocols
Several tasks from the set of Benchmark Problems for Formal Nonmonotonic
Reasoning [8] were presented to BRAINN. The system incorporates the premises and correctly derives the conclusions for problems A1, A2, A3, A4, B1
and B2, which include default reasoning, linear inheritance and cancellation of
inheritance.
3.2 Sample Knowledge Base
BRAINN was also tested with a more realistic knowledge base in the domain
of soil science. This knowledge base consists of 20 rules with up to 2 conditions
in the LHS (e.g., IF <&soil, is, clay> AND <&soil, humus level, high> THEN
<&soil, compaction, low>) and chains of inference of length 3 at most.
The following is an example of BRAINN’s reasoning after “learning” the soil
science knowledge base.
User inputs: <My_soil, colour, brown>
<My_soil, weight, heavy>
User query: <My_soil, compaction, ?>
The question is sent to all the assemblies. After relaxation, no SN assembly is
found with the answer, but there is an RHS assembly that contains an answer.
This RHS assembly belongs to the rule, IF <&soil, Fe level, high> AND <&soil,
weight, heavy> THEN <&soil, compaction, high>. The RHS assembly sends its
weight-multiplied pattern to all of its neighbours. After relaxation the two LHS
neighbours are found. The rule is retrieved and written to STM. Then, the
first condition, <My soil, Fe level, high>, is sent to all the assemblies. Again,
no SN assembly is found, but the RHS assembly of the rule, IF <&soil, colour,
brown> THEN <&soil, Fe level, high>, is activated. This second rule is retrieved
(as above) and written to STM. The condition of the rule, <My soil, colour,
brown>, is then sent to all the assemblies. After relaxation, one of the assemblies
still contains the condition so that <My soil, Fe level, high> is confirmed. The
system sends the second condition, <My soil, weight, heavy>, of the first rule
from STM to all the assemblies. After relaxation, one of the assemblies contains the
condition. Both conditions of the first rule are now confirmed and the system
produces the answer, high, from the RHS of the first rule. Although BRAINN
behaves as expected, the network is large (36 assemblies of 29 units each).
4 Related Work
The BRAINN system is inspired by some of Hinton’s early work [6]. In Hinton’s
system, as in BRAINN, relations are implemented in a neural network in a
distributed fashion. However, a different network architecture is used.
Many hybrid symbolic connectionist systems have been proposed. One of
the first such systems is described in [14]. That system imitates the structure
of a production system and is made up of several separate modules (working
memory, production rules and facts). With its distributed representation in all
of the modules, the system can match variables against data in the working
memory module by using a winner-take-all algorithm. The system has complex
structures and is computationally costly. Moreover, it is restricted to performing
sequential rule-based reasoning.
CONSYDERR [10] is a connectionist model for concept representation and
commonsense reasoning. It consists of a two-level architecture that naturally
captures the dichotomy between concepts and the features used to describe them.
However, it does not address learning (how such a skill could be incorporated is
also unclear) and is limited to reasoning from concepts.
CLARION [13], like CONSYDERR, uses two modules of information processing. One module encodes declarative knowledge in a localist network where
the nodes that represent a rule’s conditions are connected to the node representing that rule’s conclusion. The other module encodes procedural knowledge in a
layered sub-symbolic network. Given input data, decisions are reached through
processing in and interaction between both modules. CLARION also allows rule
extraction from the procedural to the declarative knowledge module.
ScNets [4] aim at offering an alternative to knowledge acquisition from experts. Known rules may be pre-encoded and new rules can be learned inductively
from examples. The representation lends itself to rule generation but the constructed networks are complex. Finally, ASOCS [9] are dynamic, self-organizing
networks that learn incrementally, from both examples and rules. ASOCS are
massively parallel networks, but are restricted to binary classification.
A number of other relevant systems are described in [12]. A thorough review
of the literature on hybrid symbolic connectionist models is in [11] and a dynamic
list of related papers is in [2].
5 Conclusion
This paper presents a hybrid connectionist symbolic system, called BRAINN
(Basic Reasoning Applicator Implemented as a Neural Network). In BRAINN,
a hexagonal network of Hopfield networks is used to store both relations and
rules. Through systematically orchestrated relaxations, BRAINN supports both
rule-based and similarity-based reasoning, thus allowing traditional (monotonic)
reasoning, as well as several forms of common-sense (non-monotonic) reasoning.
Preliminary experiments demonstrate promise.
Future work will focus on developing the Pavlov-like similarity-based algorithm by implementing rule extraction and integrating similarity-based and rule-based
reasoning into one algorithm. In addition, biological plausibility will be improved
by incorporating Goal and STM in the hexagonal network and using local rules
to control the behaviour of each assembly and determine the global reasoning
process.
Acknowledgements
This work is supported in part by an ORS grant held by the first author. The
soil science knowledge base was donated by Adam Bogacz.
References
1. Bogacz, R. and Giraud-Carrier, C. (1998). BRAINN: A Connectionist Approach to
Symbolic Reasoning. In Proceedings of the First International ICSC Symposium on
Neural Computation (NC’98), 907-913.
2. Boz, O. (1997). Bibliography on Integration of Symbolism with Connectionism, and Rule Integration and Extraction in Neural Networks. Online at
http://www.lehigh.edu/~ob00/integrated/references-new.html.
3. Calvin, W. (1996). The Cerebral Code. MIT Press.
4. Hall, L.O. and Romaniuk, S.G. (1990). A Hybrid Connectionist, Symbolic Learning System. In Proceedings of the National Conference on Artificial Intelligence
(AAAI’90), 783-788.
5. Hertz, J., Krogh, A. and Palmer, R. (1991). Introduction to the Theory of Neural
Computation. Addison-Wesley.
6. Hinton, G. (1981). Implementing Semantic Networks in Parallel Hardware. In Hinton, G. and Anderson, J. (Eds.), Parallel Models of Associative Memory. Lawrence
Erlbaum Associates, Inc.
7. Hopfield, J. and Tank, D. (1985). Neural Computation of Decisions in Optimization
Problems. Biological Cybernetics, 52:141-152.
8. Lifschitz, V. (1988). Benchmark Problems for Formal Nonmonotonic Reasoning. In
Proceedings of the Second International Workshop on Non-Monotonic Reasoning,
LNCS 346:202-219.
9. Martinez, T.R. (1986). Adaptive Self-Organizing Networks. Ph.D. Thesis (Tech.
Rep. CSD 860093), University of California, Los Angeles.
10. Sun, R. (1992). A Connectionist Model for Common Sense Reasoning Incorporating
Rules and Similarities. Knowledge Acquisition, 4:293-331.
11. Sun, R. (1994). Bibliography on Connectionist Symbolic Integration. In Sun, R.
(Ed.), Computational Architectures Integrating Symbolic and Connectionist Processing, Kluwer Academic Publishers.
12. Sun, R. and Alexandre, F. (Eds.) (1995). Working Notes of IJCAI’95 Workshop
on Connectionist-Symbolic Integration.
13. Sun, R. (1997). Learning, Action and Consciousness: A Hybrid Approach Toward
Modeling Consciousness. Neural Networks, 10(7):1317-1331.
14. Touretzky, D. and Hinton, G. (1985). Symbols Among the Neurons: Details of
a Connectionist Inference Architecture. In Proceedings of the International Joint
Conference on Artificial Intelligence (IJCAI’85), 238-243.
Addressing Knowledge-Representation Issues
in Connectionist Symbolic Rule Encoding
for General Inference
Nam Seog Park
Information Technology Laboratory,
GE Corporate Research and Development
One Research Circle, Niskayuna, NY 12309
Abstract. This chapter describes one method for addressing knowledge
representation issues that arise when a connectionist system replicates
a standard symbolic style of inference for general inference. Symbolic
rules are encoded into networks, called structured predicate networks
(SPNs), using neuron-like elements. Knowledge-representation issues such
as unification and consistency checking between two groups of unifying
arguments arise when a chain of inference is formed over the networks
encoding a special type of symbolic rule. These issues are addressed by
connectionist sub-mechanisms embedded into the networks. As a result,
the proposed SPN architecture is able to translate a significant subset of
first-order Horn Clause expressions into a connectionist representation
that may be executed very efficiently.
1 Introduction
Connectionist symbol processing attempts to replicate symbol processing functionality using connectionist components. Until now, many connectionist systems
have demonstrated their abilities to represent dynamic bindings in connectionist
styles [1-7]. However, only a few connectionist systems were able to provide additional connectionist mechanisms to deal with knowledge representation issues
such as unification and consistency checking within and across different groups
of unifying arguments, which are required to support general inference [6,7]. Encoding symbolic knowledge in a connectionist style requires not only encoding
symbolic expressions into corresponding networks but also implementing some
symbolic constraints that the syntax of expression imposes. For instance, if a
connectionist mechanism has to encode a symbolic rule such as p(X, X) → q(X)
that requires repeated arguments of the p predicate to get bound to the same
constant filler or free variable fillers during inference, an additional connectionist
mechanism, other than a basic dynamic binding mechanism, is needed to enforce
this condition. This additional connectionist mechanism should be capable of
detecting a consistency violation when two different constant fillers are assigned
to different occurrences of the same variable argument, X’s in this case.
This chapter describes how these issues can be addressed using a generalized
phase-locking mechanism [8-10]. The phase-locking mechanism was originally
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 78–91, 2000.
© Springer-Verlag Berlin Heidelberg 2000
proposed by Shastri and Ajjanagadde [7] and extended by Park, Robertson,
and Stenning [8] to overcome some of the fundamental limitations of the original
mechanism in replicating standard symbolic inference.
2 A Structured Predicate Network
A structured predicate network (SPN) is a knowledge encoding scheme that the
generalized phase-locking mechanism employs. This scheme maps a symbolic rule
in a first-order Horn Clause expression [10] to a corresponding localist network.
When encoded, each symbolic rule is mapped to a corresponding SPN that is
composed of three parts, {Pa, M, Pc}. As illustrated in Figure 1, Pa is a predicate
assembly representing the antecedent of the rule, Pc represents the consequent,
and M is the intermediate mechanism that connects them together. Encoding a
symbolic rule in this architecture is considered as finding a one-to-one mapping
between the given rule and the SPN.
[Figure: the antecedent predicate assembly, connected through the intermediate mechanism to the consequent predicate assembly.]
Fig. 1. The structure of an SPN
The main focus of this chapter is how to establish the required mapping
by providing proper connectionist mechanisms to build predicate assemblies
and the intermediate mechanism. Since each symbolic rule needs to be encoded
into an SPN with a unique intermediate mechanism to support a target form of
symbolic inference, an automatic rule-to-SPN mapping mechanism is necessary.
3 Basic Building Blocks
The basic components of the SPN are three neuron-like elements: a π-btu, a τ-or, and a multiphase τ-or element. Unlike ordinary nodes found in conventional neural network models, these elements are hypothesized to have special temporal behaviors. They sample their inputs over several phases of an oscillation
cycle and determine their output patterns, depending on the input patterns
sampled during this time period. A phase is a minimum time interval in which
a neuron element performs its basic computations – sampling its inputs and
thresholding – and an oscillation cycle is a window of time in which neuron
elements show their oscillatory behaviors. On becoming active, these elements
continually produce the output (oscillating) for the duration of inference [7]. For
instance, a π-btu element becomes active on receiving one or more spikes in
different phases of any oscillation cycle. On becoming active, a π-btu element
produces an oscillatory spike that is in-phase with the driving inputs. A τ-or
element, on the other hand, becomes active on receiving one or more spikes within a period of oscillation. Once activated, this element produces an oscillatory
pulse train whose pulse width is comparable to the period of an oscillation cycle.
A multiphase τ-or element becomes active when it receives input pulses in more than one distinct phase within a period of oscillation, and it produces an oscillatory pulse train whose pulse width is comparable to the period of an oscillation
cycle. A threshold, n, associated with these elements indicates that the elements
will fire only if they receive n or more spike inputs in the same phase. Figure 2
demonstrates their behavior graphically.
[Figure: spike trains over the six phases of an oscillation cycle for a π-btu element, a τ-or element, and a multiphase τ-or element.]
Fig. 2. Temporal behavior of the neuron-like elements.
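The three temporal behaviors can be sketched discretely, treating a cycle as six phases and an input as a per-phase spike count. This is a caricature of the hypothesized dynamics described above, not a published specification; the six-phase cycle and unit threshold are assumptions.

```python
PHASES = 6  # phases per oscillation cycle (assumed)

def pi_btu(inputs, threshold=1):
    # Fires only in the phases where it receives >= threshold spikes,
    # i.e. its output is in-phase with the driving inputs.
    return [1 if n >= threshold else 0 for n in inputs]

def tau_or(inputs):
    # One spike anywhere in the cycle -> output for the whole cycle.
    return [1] * PHASES if any(inputs) else [0] * PHASES

def multiphase_tau_or(inputs):
    # Needs inputs in more than one distinct phase of the cycle.
    distinct = sum(1 for n in inputs if n > 0)
    return [1] * PHASES if distinct > 1 else [0] * PHASES

spikes = [1, 0, 1, 0, 0, 0]                      # inputs in phases 1 and 3
assert pi_btu(spikes) == [1, 0, 1, 0, 0, 0]      # in-phase output
assert tau_or([0, 0, 1, 0, 0, 0]) == [1] * 6     # pulse spans the cycle
assert multiphase_tau_or([0, 0, 1, 0, 0, 0]) == [0] * 6  # one phase: silent
assert multiphase_tau_or(spikes) == [1] * 6      # two phases: active
```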
When a mapping is carried out between a given symbolic rule and its corresponding SPN, the antecedent and consequent of the rule are translated into the
corresponding predicate assemblies. An n-ary predicate, p(arg1 , arg2 , . . ., argn ),
in the antecedent or consequent, is mapped to a predicate assembly consisting
of n entity nodes as can be seen in Figure 3.
[Figure: a predicate assembly for p with n entity nodes, one per argument; in symbolic notation, p{arg1([0],[0]), arg2([0],[0]), ..., argn([0],[0])}.]
Fig. 3. The structure of a predicate assembly and corresponding symbolic notation.
An entity node is a pair of π-btu elements. When it is used to represent
an argument of the predicate, the left element is used to represent a variable
role of the argument and the right element a constant role. Either or both its
elements may become active during a chain of inference. If an entity node is
used to represent a constant, only its right element becomes active, whereas
when the entity node is used to represent a variable, only its left element becomes
active during a chain of inference. In the symbolic notation of the entity node,
argi ([0],[0]), the symbol “0” denotes the state in which the left and the right
elements of the entity node are inactive.
As the basic building blocks, an entity node, a τ-or element, and a multiphase τ-or element are used not only to build predicate assemblies but also to build connectionist sub-mechanisms to be used for the intermediate mechanism of the given symbolic rule. Readers are invited to refer to [9,10] for an in-depth description of the behavior of the neuron elements and the structure of a predicate
assembly, as well as a dynamic binding mechanism built on these components.
4 Knowledge-Representation Issues
To identify the necessary connectionist sub-mechanisms to build the SPN for a
given symbolic rule, let us consider the following rule:
p(X, X, Y ) → q(X, Y ).
A standard type of inference expected with this rule is obtaining the conclusion q(a,a) when the fact p(a,U,U) is known. If a symbolic inference system is used for this inference, it would perform the unification first to acquire
the bindings, {a/X,U/X,U/Y}, from which the system can get the information,
{a/X,a/Y}, systematically. This information is then used to substitute the variables in the consequent so that the conclusion q(a,a) can be reached. Mapping
the given symbolic rule to the SPN means finding a corresponding connectionist
mechanism to carry out similar inference. This implies finding both mechanisms
to encode rules and mechanisms to replicate an important symbolic inference
procedure [9].
When the above rule is mapped to the corresponding SPN, two predicate
assemblies corresponding to the p(X,X,Y) and q(X,Y) predicates will be built
using an assembly of entity nodes as follows:
p{X1 ([0],[0]),X2 ([0],[0]),Y([0],[0])},
q{X([0],[0]),Y([0],[0])}.
The subscript numbers attached to the argument names of the p predicate indicate the order in which the repeated argument appears in the predicate. This is to differentiate repeated arguments in different argument positions and does not
affect the meaning of the original argument name.
When the fact p(a,U,U) is presented to the SPN, the initial bindings, {a/X,
U/X, U/Y}, are represented in a connectionist manner by introducing two entity
nodes (called filler nodes) corresponding to a and U, respectively, and by activating
their neuron elements in the following fashion:
– activate both the right elements of the a filler node and p:X1 argument node
in the same phase, say the first phase, which results in a[0,1] and p:X1 [0,1],
where the number 1 stands for the first phase;
– activate the right element of the U filler node and those of the p:X2 and p:Y
argument nodes in a different phase, the second phase for example, which
results in U[2,0], p:X2 [2,0], and p:Y[2,0], where the number 2 indicates the
second phase.
This activation leads to a situation in which in-phase activation between
the a filler node and the p:X1 argument node represent the binding, {a/X}, and
similar in-phase activation among U, p:X2 , and p:Y nodes represent the bindings,
{U/X,U/Y}:
a([0],[1]), U([2],[0]),
p{X1 ([0],[1]),X2 ([2],[0]),Y([2],[0])},
q{X([0],[0]),Y([0],[0])}.
If the intermediate mechanism that will be built between the p predicate
assembly and the q predicate assembly (see Figure 1) propagates these initial
bindings to the q predicate assembly in such a way that arguments of the q
predicate assembly are activated in
q{X([2],[1]),Y([2],[1])},
the result of inference can be obtained from the in-phase activation among the
filler nodes, a and U, and the argument nodes, q:X and q:Y. From in-phase
activation between the a filler node and the right elements of the two argument
nodes, the bindings {a/X,a/Y}, and from in-phase activation between the U filler
node and the left elements of the two argument nodes, the bindings {U/X,U/Y}
are obtained. These together provide the conclusion of the inference q(a,a) with
the intermediate result of the unification, {a/U}.
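The phase bookkeeping of this example can be caricatured in a few lines. Entity nodes are (variable-phase, constant-phase) pairs, with 0 meaning inactive, and the "unification" step is hard-coded for this one rule — a toy illustration of the bindings above, not the SPN mechanism itself.

```python
# Phase 1 carries the filler a, phase 2 the free variable U (assumed).

def infer(fact):
    """Propagate p(X, X, Y) -> q(X, Y) bindings for a fact like p(a, U, U)."""
    x1, x2, y = fact                     # argument nodes of p
    # Unification within the repeated argument group {X1, X2}: the two
    # nodes merge their active phases (a gets X, U gets X).
    x = (x1[0] or x2[0], x1[1] or x2[1])
    # Y keeps its own phases and inherits the phase unified into X.
    y = (y[0] or x[0], y[1] or x[1])
    return x, y                          # argument nodes of q

a, U = 1, 2                              # phases of the two filler nodes
p = ((0, a), (U, 0), (U, 0))             # p{X1([0],[1]), X2([2],[0]), Y([2],[0])}
qX, qY = infer(p)

# Both q arguments end up in-phase with a (constant role) and U (variable
# role): the conclusion q(a,a) with the intermediate unification {a/U},
# i.e. q{X([2],[1]), Y([2],[1])}.
assert qX == (U, a) and qY == (U, a)
```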
As this example illustrates, finding a mapping between the symbolic rule and
the corresponding SPN first requires building the p and q predicate assemblies for
the rule and a sub-mechanism that will be used to propagate the initial bindings,
in the form of active phases, from the p predicate assembly to the q predicate
assembly as inference initiates. Furthermore, additional sub-mechanisms are also
necessary to ensure the consistency conditions that the syntax of the rule imposes
and to perform unification within and across groups of unifying arguments. The
collection of these sub-mechanisms is called the intermediate mechanism in SPN
architecture.
In summary, the intermediate mechanism handles knowledge-representation
issues during the inference on the SPN. The more detailed tasks that this mechanism is expected to perform are:
– keep the consistency of the initial bindings on the antecedent predicate assembly;
– provide a path for binding propagation from the antecedent predicate assembly to the consequent predicate assembly;
– perform binding interaction (unification) within and across unifying groups
of arguments;
– maintain consistency of the bindings in the consequent after binding propagation.
These issues were initially articulated in [6], and readers can refer to that article for some other relevant knowledge-representation issues.
5 Building an Intermediate Mechanism
How to build an intermediate mechanism for a given symbolic rule is a key issue
in the SPN architecture. This section describes the structure of the intermediate mechanism in more detail.
5.1 Sub-mechanisms for Consistency Checking and Binding Propagation
Two types of consistency checking are required when a symbolic rule is given to
be encoded into the SPN: constant consistency checking for a constant argument
and variable consistency checking for repeated variable arguments. Constant
consistency checking forces the condition that all constant arguments in the
antecedent must get bound to the same constant filler or free variable fillers
during inference. Variable consistency checking, on the other hand, forces all the
repeated variable arguments in the antecedent to get bound to the same constant
filler or free variable fillers. The rules needing this treatment are determined
by checking the type of arguments and whether they have repeated arguments
in the antecedent. In the case of the example rule p(X, X, Y ) → q(X, Y ), variable
consistency checking is required to make sure that the repeated arguments, the X's, are
always bound to the same constant filler during inference. A connectionist sub-mechanism that enforces this consistency is called the consistency checking sub-mechanism.
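These two checks can be sketched symbolically in Python (an illustration of the conditions only, not the phase-synchronized connectionist implementation; all names and the constant set are our own assumptions):

```python
CONSTANTS = {"a", "b", "c"}        # assumed constant symbols; U, V, W act as variable fillers

def check_consistency(antecedent, fillers):
    """Check constant and variable consistency of the initial bindings.

    antecedent: tuple of argument names, e.g. ("X", "X", "Y") or ("a", "X")
    fillers:    tuple of fillers bound to the arguments, position by position
    """
    bound = {}
    for arg, filler in zip(antecedent, fillers):
        # Constant consistency: a constant argument may only be bound
        # to the same constant or to a free variable filler.
        if arg in CONSTANTS:
            if filler in CONSTANTS and filler != arg:
                return False
        # Variable consistency: repeated variable arguments must not be
        # bound to two different constant fillers.
        elif filler in CONSTANTS:
            if bound.setdefault(arg, filler) != filler:
                return False
    return True
```

For p(X, X, Y ), `check_consistency(("X", "X", "Y"), ("a", "b", "c"))` fails because the repeated X's receive two different constant fillers.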
After consistency checking, the initial bindings set between the filler nodes
and the antecedent predicate assembly need to propagate to the consequent
predicate assembly to complete the inference. Therefore, an additional sub-mechanism is needed to provide paths from the argument nodes in the antecedent predicate assembly to the corresponding argument nodes in the consequent
predicate assembly. The rules that need this sub-mechanism are determined by
checking the argument matching between the arguments in the antecedent and
those in the consequent of the given rule. The example rule has both arguments,
X and Y, appearing in the antecedent and the consequent at the same time;
therefore, the example rule needs the binding propagation sub-mechanism.
Fig. 4. Consistency checking and binding propagation sub-mechanisms for the rule
p(X, X, Y ) → q(X, Y )
Figure 4 shows the consistency checking sub-mechanism and the binding
propagation sub-mechanism required for the example rule.
The figure illustrates two predicate assemblies at the top and the bottom of
the SPN. Between them are sub-mechanisms for consistency checking and binding propagation. Since the repeated arguments of the antecedent must be bound to the same constant filler
or free variable fillers during inference, any bindings generated for the repeated
arguments must be collected for consistency checking. The node with the label b1
is added for this purpose. This node is called a binding node. When the repeated
argument nodes (p:X’s) get bound to filler nodes by being activated at the same
phase, the initial bindings are propagated to the b1 binding node automatically.
The left element of the binding node then represents variable bindings of the
repeated argument nodes and the right element constant bindings by becoming
active at the corresponding phases. The consistency of these intermediate bindings is then checked by the separate consistency checking sub-mechanism. The
mto1 node, which is a multi-phase τ -or element, is inserted for this purpose.
Whenever the right elements of the repeated argument nodes are bound to two
different constant filler nodes, the right element of the b1 node will become active at more than one phase. The mto1 node receiving direct input from the b1
node will detect this situation and project an inhibitory signal to stop the flow
of activation from the antecedent predicate assembly to the consequent
predicate assembly. Note the dashed link from mto1 to one black dot near the
q predicate assembly, which is in turn connected to three other black dots by a
dashed line. This abbreviates the representation of a full connection from mto1
to each black dot. On becoming active, the mto1 node sends the same inhibitory
signal to all of these black dots at the same time.
In Figure 4, the repeated argument nodes, p:X1 and p:X2 , in the p predicate assembly are connected to the corresponding argument node, q:X, in the
q predicate assembly via the b1 binding node. These links serve as the binding
propagation sub-mechanism between the X arguments. In the same way, the links
between the p:Y argument node and the q:Y argument node are used for the
binding propagation sub-mechanism between the Y arguments. Whenever the argument nodes of the p predicate assembly become active at specific phases,
this activation automatically propagates to the corresponding argument nodes
of the consequent predicate assembly through these binding propagation sub-mechanisms.
5.2 A Sub-mechanism for Binding Interaction
The antecedent of a symbolic rule usually has several arguments, and these arguments can be categorized into several groups based on their symbolic name. The
example rule p(X, X, Y ) → q(X, Y ), for instance, has two groups of arguments,
{p:X1 ,p:X2 } and {p:Y}. Because the arguments in each group share the same argument
name, they must all be bound to the same constant filler or free variable fillers during inference to maintain
consistency. For this reason, such an argument group is called a Unifying Argument
Group (UAG).
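Extracting the UAGs of a rule is a purely syntactic operation; it can be sketched in a few lines of Python (function and variable names are our own):

```python
def unifying_argument_groups(antecedent):
    """Partition the antecedent's argument positions by argument name;
    each group of positions sharing a name is one UAG."""
    groups = {}
    for position, name in enumerate(antecedent):
        groups.setdefault(name, []).append(position)
    return groups

# p(X, X, Y) yields the two UAGs {p:X1, p:X2} and {p:Y}:
unifying_argument_groups(("X", "X", "Y"))   # {'X': [0, 1], 'Y': [2]}
```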
At the beginning of inference, if different types of fillers are assigned to
arguments belonging to the same UAG, unification occurs among them. For
example, presenting the fact, p(a,U,U), to the antecedent of the example rule
will generate two sets of bindings, {a/X,U/X} and {U/Y}, to start inference.
Since the first set of bindings is obtained from the same argument name, X,
unification occurs. As a result, the new intermediate binding, {a/U}, will be
obtained.
In addition, the example rule also requires unification between these two
UAGs. In the above situation, the same variable filler U is assigned to the two
arguments, p:X2 and p:Y, which belong to two different UAGs. Therefore, the
unification result obtained within the first UAG has to interact with that obtained within the second UAG to produce the desirable result of inference. This
binding interaction between UAGs ensures that the intermediate binding,
{a/U}, produced by the first UAG is used for the second UAG, so that the
third argument of the p predicate is ultimately bound to the constant filler a
rather than remaining unified only with the variable filler U throughout the inference.
In general, binding interaction between variable UAGs occurs when at least
one of the UAGs has repeated variable arguments. This type of binding interaction
is not necessary between two UAGs having a single variable argument because
no consistency checking is required in this case.
The connectionist sub-mechanism which performs this task in an SPN architecture is called the binding interaction sub-mechanism. The symbolic rules
which require this sub-mechanism are identified by checking whether the rule has
two or more variable UAGs in the antecedent and whether one of them has repeated
variable arguments. Figure 5 visualizes the binding interaction sub-mechanism
built for the example rule.
Fig. 5. Binding interaction sub-mechanism for the rule p(X, X, Y ) → q(X, Y )
The additional components added to this network, compared to the network
shown in Figure 4, are the to1, b2, and mto2 nodes. The first two nodes, to1 and
b2, are inserted as the binding interaction sub-mechanism, and the mto2 node for
consistency checking after binding interaction. The role of the to1 node is
to block the direct binding propagation from the argument nodes of the antecedent predicate assembly to the corresponding argument nodes of the consequent
predicate assembly. When the same variable filler is assigned to two arguments
belonging to different UAGs at the same time, the to1 node becomes active and projects an inhibitory signal to the binding propagation sub-mechanism between
the antecedent predicate assembly and the consequent predicate assembly. As
can be seen in the figure, the two links coming from the left element of the p:Y
argument node and that of the b1 binding node provide inputs to the to1 node.
On receiving these signals, the to1 node determines whether it will project the
inhibitory signal or not. Since the threshold of the to1 node is 2, it will be activated only when the two input signals are active in the same phase; that is,
they are in synchrony with the same variable filler node, U in this case.
Activation of the to1 node allows the initial bindings to be propagated to
the q predicate assembly only through the b2 binding node, where the cross-UAG
binding interaction occurs. Before the activation of the b2 binding node
propagates further, its consistency is checked by the mto2 node to ensure the
consistent result at the end of inference.
The activation flow on the SPN proceeds in several steps over the course of
inference. When the network is initialized with the fact p(a,U,U), the
fillers and the antecedent predicate assembly will be activated in the following
phases:
a([0],[1]), U([2],[0]),
p{X1 ([0],[1]),X2 ([2],[0]),Y([2],[0])}.
In the next oscillation cycle, the b1 binding node is activated in
b1([2],[1]).
After checking the consistency, this activation propagates to the b2 binding
node automatically. The b2 binding node also receives activation from the third
argument node, p:Y, at the same time and becomes active in
b2([2],[1]).
During this binding propagation, the to1 node becomes active and blocks the
direct binding propagation from the argument nodes in the p predicate assembly to those in the q predicate assembly. Consequently, the argument nodes in
the q predicate assembly receive inputs only from the b2 binding node and are
activated in the following phases:
q{X([2],[1]),Y([2],[1])}.
This situation is finally interpreted as the conclusion of the inference, q(a,a),
with the binding {a/U} as the desired result of unification.
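The net effect of these steps can be mirrored by a purely symbolic sketch: plain unification over the fact's fillers. This illustrates the result the SPN computes, not its phase-coded mechanism; the function and its names are our own, and it assumes every consequent variable also appears in the antecedent:

```python
def infer(antecedent, consequent, fact, constants):
    """Match `fact` against `antecedent`, unifying within and across UAGs,
    and return the consequent's bindings (or None on inconsistency)."""
    subst = {}                                   # variable filler -> its value

    def resolve(f):                              # follow substitution chains
        while f in subst:
            f = subst[f]
        return f

    def unify(x, y):
        x, y = resolve(x), resolve(y)
        if x == y:
            return True
        if x in constants and y in constants:
            return False                         # two different constants clash
        if x in constants:
            subst[y] = x                         # bind variable filler to constant
        else:
            subst[x] = y
        return True

    binding = {}                                 # rule argument -> representative filler
    for arg, filler in zip(antecedent, fact):
        if arg in constants:                     # constant argument
            if not unify(arg, filler):
                return None
        elif arg in binding:                     # repeated argument: within-UAG unification
            if not unify(binding[arg], filler):
                return None
        else:
            binding[arg] = filler
    return tuple(arg if arg in constants else resolve(binding[arg])
                 for arg in consequent)
```

With the example rule and fact, `infer(("X","X","Y"), ("X","Y"), ("a","U","U"), {"a"})` reproduces the conclusion q(a,a).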
5.3 Additional Sub-mechanisms
Until now only one example rule has been used to explain the rule-SPN mapping
procedure. In order to build more comprehensive SPNs which replicate various
standard symbolic styles of inference, additional sub-mechanisms are needed.
First of all, the SPN architecture needs an extra sub-mechanism for consistency checking of the constant arguments in the antecedent
of a rule. This sub-mechanism forces the condition that all constant arguments
in the antecedent must get bound to the same constant fillers specified as the
argument names or free variable fillers during inference. This means that the first
argument of the rule p(a, X) → q(X), for example, should be bound only to the constant
a to initiate the inference. Otherwise, a consistency violation occurs and the
inference has to stop.
Secondly, depending on the syntax of a rule, the binding interaction sub-mechanism is needed not only between variable UAGs, as already explained,
but also between the following pairs of UAGs:
– a pair of a constant UAG and a variable UAG,
– a pair of two constant UAGs.
For any pair of a constant and a variable UAG, if the same variable filler
is assigned to their arguments at the same time, the binding obtained between
the variable filler and the arguments in the constant UAG must migrate to the
arguments in the variable UAG. This situation can be seen when the fact p(U,U)
is presented to the rule p(c, X) → q(X). Presenting the fact first generates
the bindings between the variable fillers (U’s) and the two arguments of the
antecedent, which results in the set of bindings {U/c,U/X}. Since the arguments
p:c and p:X are bound to the same variable filler, binding interaction between
these two arguments generates the intermediate binding, {c/X}, which is then
used to produce the desirable result q(c) from the consequent of the rule. Thus,
a separate binding interaction sub-mechanism with a different internal structure
is needed.
Binding interaction between constant UAGs refers to the situation where
the same variable filler is assigned to two different constant argument nodes as
exemplified by the rule p(a, b) → q(b, c). Presenting the fact p(U,U) should cause
the matching between the fact and the antecedent of the rule to fail because
the same variable filler U is bound to two different constant arguments a and
b at the same time. This indicates that for any two different constant UAGs in
the antecedent, we need a binding interaction sub-mechanism which detects this
situation and prevents the rule from firing during inference.
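This failure condition is easy to state symbolically (an illustrative Python sketch of the condition; the connectionist sub-mechanism detects the same situation via phase synchrony, and the names here are ours):

```python
def constant_uag_conflict(antecedent, fact, constants):
    """True if the same variable filler is bound to two different
    constant arguments, in which case the rule must not fire."""
    bound = {}                       # variable filler -> constant argument
    for arg, filler in zip(antecedent, fact):
        if arg in constants and filler not in constants:
            if bound.setdefault(filler, arg) != arg:
                return True          # one filler, two distinct constants
    return False

# p(U,U) against the antecedent of p(a, b) -> q(b, c):
constant_uag_conflict(("a", "b"), ("U", "U"), {"a", "b", "c"})   # True: matching fails
```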
Due to space limitations, the details of these additional sub-mechanisms
are not described in this chapter. However, readers can refer to [11] for more
details.
5.4 A Mapping Algorithm
The following summarizes how to map a given symbolic rule to the corresponding
SPN:
RULE-SPN MAPPING PROCEDURE
STEP1: For a given symbolic rule, build the corresponding
antecedent predicate assembly and the consequent
predicate assembly.
STEP2: Build the intermediate mechanism in such a way that:
1. Among the argument nodes in the same unifying argument
group (UAG),
a. Introduce a binding node for any repeated argument nodes
b. Build the consistency checking sub-mechanism for
constant argument nodes and repeated variable argument
nodes
c. Build the binding propagation sub-mechanism between
the argument nodes in the antecedent predicate assembly
and their corresponding argument nodes in the
consequent predicate assembly which share the same
argument name
2. For each of the following pairs of UAGs in the antecedent,
build a binding interaction sub-mechanism between them:
- a pair of variable UAGs, of which at least one has
repeated variable arguments
- a pair of a variable UAG and a constant UAG
- a pair of constant UAGs
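The decisions in STEP2 amount to a syntactic analysis of the rule. A rough Python sketch of this analysis (simplified, with names of our own choosing):

```python
from collections import Counter

def required_submechanisms(antecedent, consequent, constants):
    """Decide which intermediate sub-mechanisms a rule needs."""
    counts = Counter(antecedent)
    var_uags = [a for a in counts if a not in constants]
    const_uags = [a for a in counts if a in constants]
    return {
        # constant arguments, or repeated variable arguments, need checking
        "consistency_checking": bool(const_uags)
            or any(counts[a] > 1 for a in var_uags),
        # argument names shared by antecedent and consequent need a path
        "binding_propagation": any(a in consequent for a in var_uags),
        # the pairs of UAGs listed in STEP2.2
        "binding_interaction": bool(
            (len(var_uags) >= 2 and any(counts[a] > 1 for a in var_uags))
            or (const_uags and var_uags)
            or len(const_uags) >= 2),
    }

required_submechanisms(("X", "X", "Y"), ("X", "Y"), {"a", "b", "c"})
# all three sub-mechanisms are required for p(X, X, Y) -> q(X, Y)
```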
Apart from the example rule used throughout this chapter, the SPN architecture was used to encode various other types of symbolic rules and successfully
demonstrated symbolic-style inference. The following rules are some of them:
– p(a) → q(a),
– p(a, b) → q(a, b),
– p(X, X) → q(X),
– p(a, X, X) → q(a, X),
– p(X, X, Y, Y ) → q(X, Y ).
However, further experiments showed that the current SPN architecture has
a limitation in processing unification across more than two UAGs. This limitation was detected when the fact p(U,V,W,U,V) is presented to the SPN
encoding the following rule:
p(a, X, X, Y, Y ) → q(a, X, Y ).
This rule has three UAGs – {a}, {p:X1 ,p:X2 }, and {p:Y1 ,p:Y2 } – in the
antecedent. When the predicate p(U,V,W,U,V) is presented to the SPN to initiate
inference, arguments in each UAG carry out unification to obtain three sets of
bindings: {U/a}, {V/X,W/X}, and {U/Y,V/Y}. These bindings in turn interact
with each other to produce the following sets of intermediate bindings:
– { } from interaction between {U/a} and {V/X,W/X}
– {a/Y} from interaction between {U/a} and {U/Y,V/Y}
– {X/Y} from interaction between {V/X,W/X} and {U/Y,V/Y}
These results, as they stand, only give the intermediate bindings {a/Y} and
{X/Y}; they do not provide the further intermediate binding {a/X}, which is needed
to produce the desirable result, q(a,a,a). In order to obtain this binding, one
more step of unification is required based on the above intermediate bindings
produced during the first stage of binding interaction between UAGs.
The current SPN architecture, however, is designed to perform only single-stage unification between each pair of UAGs; it therefore cannot deal with inference which requires more than one step of unification. In order to achieve the
additional unification step, the SPN needs another set of binding interaction sub-mechanisms between the first layer of the binding interaction sub-mechanisms
and the q predicate assembly. In general, if there are m UAGs in the antecedent
of a rule, the SPN requires m − 1 layers of binding interaction sub-mechanisms
to perform full unification among all the UAGs involved. It is worth noting, however, that the number of binding interaction sub-mechanisms required is only
proportional to the number of unique UAGs of a rule, not to the total number
of the arguments of the rule.
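In effect, the m − 1 layers compute a transitive closure of the pairwise binding interactions. Symbolically, this closure can be sketched as a union-find over fillers and arguments (an illustration under our own naming, treating lowercase symbols as constants and ignoring constant-constant clashes):

```python
def close_bindings(binding_sets):
    """Merge the per-UAG binding sets until nothing new can be derived.

    binding_sets: one set of (filler, target) pairs per UAG, e.g. the
    three sets produced for p(a, X, X, Y, Y) by the fact p(U,V,W,U,V).
    Returns a function mapping each symbol to its final value.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            x = parent[x]
        return x

    def union(x, y):
        rx, ry = find(x), find(y)
        if rx != ry:
            if ry.islower():                 # keep constants as representatives
                rx, ry = ry, rx
            parent[ry] = rx

    for bindings in binding_sets:
        for filler, target in bindings:
            union(filler, target)
    return find

resolve = close_bindings([{("U", "a")},
                          {("V", "X"), ("W", "X")},
                          {("U", "Y"), ("V", "Y")}])
# resolve("X") and resolve("Y") both yield "a", giving q(a,a,a)
```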
6 Related Work
CONSYDERR [12] addresses all the knowledge-representation issues and proposes solutions to them. It performs unification and deals with consistency
checking. However, CONSYDERR handles these issues not by a uniform connectionist mechanism but by different types of complex network elements which are
hypothesized to have such a function.
CHCL [5] can also deal with consistency checking and unification implicitly
by adopting a connectionist unification algorithm [13]. Its representation scheme
of terms using a set of position-label pairs can easily represent function terms
in the initial stages of inference, as well as any variable bindings generated
during inference.
SHRUTI [7], built on the basis of the original phase-locking mechanism,
performs limited consistency checking and unification due to a limitation in
representing full variable bindings.
DCPS [1] and TPPS [2] have built-in unification mechanisms only for restricted forms of rules, due to the distributed representations they employ. However,
it is not clear how these connectionist models deal with consistency checking.
7 Conclusion
Any connectionist architecture which aspires to replicate symbolic inference
should be able to deal with the knowledge-representation issues as well as represent dynamic bindings. The structured predicate network (SPN) architecture is
a step in this direction.
The rule encoding architecture proposed in this chapter provides a way of
representing not only a group of unifying arguments but also many groups of
such arguments with appropriate consistency checking between groups. Although
what is described here is a forward chaining style inference, the basic concept
of the proposed connectionist architecture with only slight adjustments can be
used to encode facts and rules to support backward chaining style inference.
The proposed SPN architecture is able to translate a significant subset of
first-order Horn Clause expressions into a connectionist representation that may
be executed very efficiently. However, in order to have the full expressive power
of Horn clause FOPL, we need to add the ability to represent structured terms
as arguments of a predicate and recursion in rules. Currently, no connectionist
system has provided convincing solutions to all these problems.
References
1. Touretzky, D. S., Hinton, G. E.: A distributed connectionist production system.
Cognitive Science 12(3) (1988) 423 – 466
2. Dolan, C. P., Smolensky, P.: Tensor product production system: A modular architecture and representation. Connection Science 1(1) (1989) 53 – 68
3. Lange, T. E., Dyer, M. G.: High-level inferencing in a connectionist network.
Connection Science 1(2) (1989) 181 – 217
4. Barnden, J.: Neural-net implementation of complex symbol-processing in a mental
model approach to syllogistic reasoning. Proceedings of The 11th International
Joint Conference on Artificial Intelligence, San Mateo, CA, Morgan Kaufmann
Publishers Inc. (1989) 568 – 573
5. Hölldobler, S. H., Kurfeß, F.: CHCL - A connectionist inference system, In B.
Fronhöfer & G. Wrightson (Eds.). Parallelization in Inference Systems, Lecture
Notes in Computer Science, Berlin, Springer-Verlag (1991) 318 – 342
6. Sun, R.: On variable binding in connectionist networks. Connection Science 4(2)
(1992) 93–124
7. Shastri, L., Ajjanagadde, V.: From simple associations to systematic reasoning: A
connectionist representation of rules, variables, and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences 16(3) (1993) 417 – 451
8. Park, N. S., Robertson, D., Stenning, K.: An extension of the temporal synchrony approach to dynamic variable binding in a connectionist inference system.
Knowledge-Based Systems 8(6) (1995) 345 – 357
9. Park, N. S., Robertson, D.: A connectionist representation of symbolic components,
dynamic bindings and basic inference operations. Proceedings of the ECAI-96 Workshop
on Neural Networks and Structured Knowledge (1996) 33 – 39
10. Park, N. S., Robertson, D.: A localist network architecture for logical inference.
In R. Sun, F. Alexandre (Eds.), Connectionist-Symbolic Integration, Hillsdale, NJ,
Lawrence Erlbaum Associates Publishers (1997)
11. Park, N. S.: A Connectionist Representation of First-Order Formulae with Dynamic Variable Binding, Ph.D. dissertation, Department of Artificial Intelligence,
University of Edinburgh (1997)
12. Sun, R.: Integrating Rules and Connectionism for Robust Commonsense Reasoning. John Wiley and Sons, New York, NY. (1994)
13. Hölldobler, S. H.: A structured connectionist unification algorithm. Proceedings of
the National Conference on Artificial Intelligence, Menlo Park, CA, AAAI Press
(1990) 587 – 593
Towards a Hybrid Model of First-Order Theory
Refinement
Nelson A. Hallack, Gerson Zaverucha, and Valmir C. Barbosa
Programa de Engenharia de Sistemas e Computação—COPPE
Universidade Federal do Rio de Janeiro
Caixa Postal 68511, 21945-970 Rio de Janeiro, RJ, Brasil
{hallack, gerson, valmir}@cos.ufrj.br
Abstract. The representation and learning of a first-order theory using
neural networks is still an open problem. We define a propositional theory
refinement system which uses min and max as its activation functions,
and extend it to the first-order case. In this extension, the basic computational element of the network is a node capable of performing complex
symbolic processing. Some issues related to learning in this hybrid model
are discussed.
1 Introduction
In recent years, systems that combine analytical and inductive learning using
neural networks, like KBANN [8], CIL2 P [15,16], and RAPTURE [7], have been
shown to outperform purely analytical or inductive systems in the task of propositional theory refinement. This process involves mapping an initial domain
theory, which can be incorrect and/or incomplete, onto a neural network, which
is then trained with examples. Finally, the revised knowledge is extracted from
the network back into a logical theory. However, a similar mapping for first-order
theories involves two well-known and unsolved problems on neural networks—the
representation of structured terms and their unification.
To overcome these problems, some models have been proposed: [17] uses symmetric networks, SHRUTI [18] employs a model which performs a restricted form
of unification (actually, this system only propagates bindings), and CHCL [19]
realizes complete unification. Unfortunately, these approaches have in common
their inability to perform learning, which is our main goal. In [1], a restricted
form of learning is performed. Kalinke [20] criticizes several approaches for representing structured terms using distributed representations. The use of hybrid
systems seems a promising idea, in the sense that it in principle allows us to
take advantage of the characteristics of both models. We describe the current
state of a system that we are developing towards this goal. Basically, we replace the AND and OR usually achieved through the setting of weights and
thresholds with the fuzzy AND (min) and OR (max) functions, and use as the
computational element of our network a neuron with extended capabilities.
The remainder of the paper is organized as follows. Section 2
discusses some problems that arise in the standard model of propositional theory
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 92–106, 2000.
© Springer-Verlag Berlin Heidelberg 2000
refinement with neural networks and describes the MMKBANN network with
min/max activation functions. Section 3 presents some experimental results on
this model. In Sect. 4, we provide an extension of the MMKBANN model to the
first-order case, assuming a node with extended capabilities. In Sect. 5, a brief
example of relational theory refinement is shown and some issues of learning are
discussed.
2 MinMax Neural Networks
The task of propositional theory refinement is, given a knowledge base (KB)
and a set of examples E—both describing the dependencies among some set of
propositional symbols—to modify KB to cover the positive examples and not to
cover the negative ones. When KB is expressed as a normal logic program,1 this
problem can be regarded as an instance of the problem of learning a mapping
between input and output neurons of a standard feedforward neural network or
a partially recurrent network. In the knowledge-based neural networks approach,
propositional symbols appearing only in the antecedent of KB clauses (we call
these features to distinguish them from the propositional symbols that appear
in the consequent of some clause) constitute the network’s input. Propositional
symbols that are not features may be intermediate conclusions (represented as
intermediate neurons), or final conclusions, that is, concepts which are being
refined (represented as output neurons).
A propositional symbol is considered true or false when its output value is
within a predetermined range. For instance, in the CIL2 P model with activations
between 0 and 1, the range corresponding to true is [Amin , 1], while [0, Amax ]
corresponds to false, with Amax < 0.5 < Amin . When the output value is in
(Amax , Amin ), the truth-value is unknown. The logical AND and OR operations
are implemented by a procedure that sets bias and weight values appropriately,
as depicted in Fig. 1. The values shown in this figure comply with the typical
activation function given by
yj (e, w) = 1 / (1 + e^(−vj (e,w))) ,   (1)
where
vj (e, w) = Σi∈I(j) [wji yi (e, w)] + θj .   (2)
In (1) and (2), e ∈ E is the running example, w is the network’s weight vector,
I(j) is the set of connections incoming to neuron j, wji is the weight of the
connection from neuron i to neuron j, θj is neuron j’s threshold, and yj (e, w)
its output.
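A minimal sketch of this forward computation and the truth-range interpretation (the Amin/Amax values and function names below are illustrative assumptions, not the values derived from any particular KB):

```python
import math

A_MIN, A_MAX = 0.8, 0.2            # illustrative thresholds, Amax < 0.5 < Amin

def neuron_output(inputs, weights, theta):
    """Equations (1)-(2): sigmoid of the weighted input sum plus threshold."""
    v = sum(w * y for w, y in zip(weights, inputs)) + theta
    return 1.0 / (1.0 + math.exp(-v))

def truth_value(y):
    """Map an activation in [0, 1] to true, false, or unknown (None)."""
    if y >= A_MIN:
        return True
    if y <= A_MAX:
        return False
    return None

# An AND-like unit: its output is high only when both inputs are near 1
y = neuron_output([1.0, 1.0], [4.0, 4.0], -6.0)
```

Here `truth_value(y)` is true; dropping either input to 0 pushes the unit's output below Amax, i.e. false.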
Our standard knowledge-based neural network, designed to perform propositional theory refinement and henceforth called UCIL2 P (for Unfolded-CIL2 P),
basically follows the KBANN topology: each propositional symbol is located in a
1 A set of definite Horn clauses allowing the negation of antecedents.
Fig. 1. The use of weights and thresholds to represent (a) D = A ∧ B ∧ ¬C and (b)
D = A ∨ B ∨ C
specific layer according to the initial KB, and near-zero weights are added connecting neurons of one layer to those of the next. These connections are designed to
allow the discovery of dependencies between the neurons that are not expressed
in KB. Unlike KBANN, we choose to always represent the AND of a clause by
a neuron, so in our network neurons in the same layer always compute the same
logical function (AND or OR). Unlike CIL2 P, we do not have a Jordan-like network architecture [5]. Therefore, a UCIL2 P network (determined by KB) may
have an arbitrary number of hidden layers. The Amin and Amax values are the
same as in CIL2 P, as described previously. The initial weights and thresholds are
set in such a manner as to yield values similar to those of Fig. 1. For instance,
KB = {A → D, A ∧ B → E, ¬C → E} leads to the values shown in Fig. 2. We
call an antecedent of a clause a negative or positive literal, respectively, if it is
negated or not. For instance, ¬C is a negative literal (or antecedent) of clause
¬C → E and B is a positive literal (or antecedent) of clause A ∧ B → E.
Fig. 2. A UCIL2 P network with dashed lines to represent connections of near-zero
weights
Some drawbacks arise from this architecture, and also from similar architectures like KBANN and CIL2 P, as follows. The initial value of W , which is calculated according to KB and the Amin and Amax values, grows with max(m, n),
where m is the maximum number of clauses with the same consequent and n is
the maximum number of positive literals in a clause. Because high initial weights
can disturb training, if this problem occurs in KB the calculated value of W is
disregarded. This is not actually a serious problem, as seen from KBANN and
CIL2 P’s results, but in these cases there is no formal guarantee of the correspondence between KB and the resulting network. Another drawback is that,
after training, the initial topology, with some nodes standing for OR and others
for the AND operator, is lost and we can no longer ensure the correspondence
with a symbolic set of rules. This problem has only partially been addressed
via extraction algorithms [4] and the introduction of a penalty function in the
backpropagation algorithm that forces the network to maintain a correspondence
with a symbolic set of rules [25].
To overcome these problems, we propose an architecture with the fuzzy AND
and OR operators, min and max, respectively.2 There are two types of dependency between an OR node j and some other node i ∈ I(j) in the network that
we would like to represent:
1. Node i is one of the arguments of the OR node j;
2. There are no dependencies between node i and the OR node j.
Similarly, there are three types of dependency between an AND node j and some
other node i ∈ I(j) in the network that we would like to represent:
1. Node i is a positive antecedent of the AND node j;
2. Node i is a negative antecedent of the AND node j;
3. There are no dependencies between node i and the AND node j.
Considering the range of activation as [0, 1], a natural choice to represent
these dependencies in the OR and AND nodes would be to pick
yj (e, w) = max i∈I(j) wji yi (e, w)   (3)
and
yj (e, w) = min i∈I(j) [wji yi (e, w) − wji/2 + 1/2] ,   (4)
respectively, as their activation functions. Using (3), we can represent the first
type of OR-dependency simply with wji = 1 and the second type with wji = 0.
Using (4), we can represent the first type of AND-dependency with wji = 1, the
second type with wji = −1, and the third type with i 6∈ I(j).
But since our intention is to let the weight-learning procedure learn which
antecedents should be in a clause, (4) turns out not to be a good choice. Unlike
UCIL2 P, we cannot delete an antecedent i from a clause represented by neuron
j by simply changing the value of wji; nor can we use near-zero weight values to
connect candidate antecedents to j, as in the standard model.

2 Although not differentiable at all points, such functions did in preliminary tests
allow faster convergence during training than differentiable approximations of them.
Standard gradient descent can no longer be applied, though.

96    N.A. Hallack, G. Zaverucha, and V.C. Barbosa

So we choose
  yj(e, w) = min_{i∈I(j)} [wji g(yi(e, w)) − wji + 1]               (5)

as the activation function for AND nodes, where

  g(yi(e, w)) = yi(e, w) if i is a positive antecedent,
                1 − yi(e, w) if i is a negative antecedent.         (6)
The motivation for (5) lies in the binary variables that are used to model
constraints in integer programming. When wji = 1, wji g(yi ) − wji + 1 becomes
g(yi ); when wji = 0, it becomes 1. Since min(A, B, 1) = min(A, B) if A, B ≤ 1,
a weight-learning procedure is capable of adding and deleting antecedents in a
clause if (5) is used.
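A small sketch of eq. (5) (our illustration; the function name is ours) shows how a weight of 0 removes an antecedent while a weight of 1 keeps it:

```python
# Sketch (our illustration, not the paper's code): the AND activation of
# eq. (5), y_j = min_i [w_ji * g(y_i) - w_ji + 1], over the [0, 1] range.
# With w_ji = 1 the term reduces to g(y_i); with w_ji = 0 it becomes 1,
# which min() ignores whenever another term is <= 1 -- so a zero weight
# effectively deletes antecedent i from the clause.

def and_node(weights, g_values):
    """weights: w_ji in [0, 1]; g_values: g(y_i) per eq. (6)."""
    return min(w * g - w + 1 for w, g in zip(weights, g_values))

active = and_node([1.0, 1.0], [0.9, 0.7])  # both antecedents present, ~0.7
pruned = and_node([1.0, 0.0], [0.9, 0.7])  # second term becomes 1, ~0.9
print(round(active, 6), round(pruned, 6))
```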
If we use [−1, 1] instead of [0, 1] as the range of activation, then (3) and (5)
become

  yj(e, w) = max_{i∈I(j)} [wji yi(e, w) + wji − 1]                  (7)

and

  yj(e, w) = min_{i∈I(j)} [wji g(yi(e, w)) − wji + 1],              (8)

respectively, where

  g(yi(e, w)) = yi(e, w) if i is a positive antecedent,
                −yi(e, w) if i is a negative antecedent.            (9)
Although it is not yet quite clear whether we should use [−1, 1] or [0, 1] as the
range of activation, in the experiments that we describe later [−1, 1] is employed.
The network that employs (8) and (9) as activation functions is called MMKBANN (for MinMax KBANN) and has the same topology as UCIL2 P. The differences between the two networks are the activation functions and the weights,
which are initially 1 or 0 in the former network. Neurons in one layer are connected to the neurons of the next layer with weights 1 or 0, depending on whether
the corresponding relation is in KB or not. Because we do not know in advance
whether an antecedent should be added to a clause as a positive or a negative
literal, there must be two connections between the corresponding neurons, one
for each case. However, this is a domain-dependent design decision. Notice that
the use of min/max functions ensures the correspondence of the initial KB and
the resulting MMKBANN network. It also guarantees a closer correspondence
between the trained network and a Horn-clause set. It is also worth noticing the
resemblance between MMKBANN and the combinatorial neural model (CNM)
[12], which also uses min/max as the activation functions of its neurons. In the
CNM model, however, the network has an input layer, a hidden layer, and a
single-neuron output layer, with weights between the hidden and output layers
only.
Towards a Hybrid Model of First-Order Theory Refinement
97
The functions min and max are not differentiable, so we apply a subgradient
search method similar to the method used in [14] to minimize the mean square
error given by
  mse(w) = (1/|E|) Σ_{e∈E} Σ_{o∈O} [yo(e, w) − ŷo(e)]²             (10)
where w is the vector having as components the weights of the network, E is the
training set of examples, O the set of output units, yo(e, w) the output value of
unit o when example e is presented, and ŷo(e) the desired value for this unit.
As in gradient-descent minimization, we can let

  Δw = −α s(w)/|s(w)|                                              (11)
where s(w) is one of the subgradients of mse(w) at w and α is the learning rate
parameter. The ji-component of one of such subgradients, corresponding to one
of the subgradients of mse(w) with respect to a weight wji , is given by
  sji(w) = (2/|E|) Σ_{e∈E} Σ_{o∈O} [yo(e, w) − ŷo(e)] (∂yo(e, w)/∂yj(e, w)) (∂yj(e, w)/∂wji).    (12)
The values of ∂yo (e, w)/∂yj (e, w) and ∂yj (e, w)/∂wji can be computed in the
same way as done in standard backpropagation, by using
  ∂max(x1, x2, …, xn)/∂xi = 1 if xi = max(x1, x2, …, xn), 0 otherwise,    (13)

and

  ∂min(x1, x2, …, xn)/∂xi = 1 if xi = min(x1, x2, …, xn), 0 otherwise,    (14)
where 1 ≤ i ≤ n. Equivalently, we can use
  ∂max(x1, x2, …, xn)/∂xi = 1 if |xi − max(x1, x2, …, xn)| ≤ θ, 0 otherwise,    (15)

and

  ∂min(x1, x2, …, xn)/∂xi = 1 if |xi − min(x1, x2, …, xn)| ≤ θ, 0 otherwise,    (16)
to compute ∂yo (e, w)/∂yj (e, w) and ∂yj (e, w)/∂wji , where θ is a user-defined
parameter of training. The intuition behind this modification is that all values
close enough to the max (min) value should be assigned credit equally. In our
experiments, the use of (15) and (16) with θ = 0.2 instead of (13) and (14) led
to better learning of the training set.
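The relaxed credit-assignment rule of (15) and (16) can be sketched as follows (an assumed implementation, not the authors' code):

```python
# Sketch (assumed form, not the authors' implementation) of the relaxed
# subgradients in eqs. (15)-(16): every input within theta of the extremum
# receives credit, instead of only the exact maximizer/minimizer as in
# eqs. (13)-(14).

def d_max(xs, theta=0.2):
    m = max(xs)
    return [1.0 if abs(x - m) <= theta else 0.0 for x in xs]

def d_min(xs, theta=0.2):
    m = min(xs)
    return [1.0 if abs(x - m) <= theta else 0.0 for x in xs]

xs = [0.95, 0.80, 0.30]
print(d_max(xs))             # [1.0, 1.0, 0.0]: 0.80 is within 0.2 of the max
print(d_max(xs, theta=0.0))  # [1.0, 0.0, 0.0]: the strict rule of eq. (13)
```

Setting θ = 0 recovers the strict rules (13) and (14); θ = 0.2 was the value used in the experiments reported here.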
3 Experimental Results on Propositional Domains
In order to investigate the learning capabilities of MMKBANN, we will compare its results to those obtained by UCIL2 P . But first, aiming at establishing
UCIL2 P’s robustness, we will compare this system with one of the most successful
models in the task of propositional theory refinement, namely the RAPTURE
framework. The domains used in the comparison are the Promoter domain with
106 examples and the splice-junction recognition problem [8,9], both originating
from the Human Genome Project. The algorithm used to train UCIL2 P was the
Resilient Propagation (RProp) [10], a variation of standard backpropagation
with the desirable properties of:
– faster learning;
– robustness in the choice of parameters;
– greater appropriateness to networks with many hidden layers.
The RAPTURE results used in our comparison were obtained from [6]. Results of UCIL2 P are averaged over 10 trials. In all domains we trained the network until it reached 100% accuracy in the set of examples or until 100 epochs
were completed. The RProp parameters were η + = 1.2, η − = 0.5, ∆max = 0.25,
∆min = 10−6 , and ∆0 = 0.1. The network topologies were exactly those defined
by KB.
With Amin = 0.975 and Amax = 0.025, in the Promoter domain the calculated W was 48.85 and the network always reached 100% accuracy in the training
set. In the splice-junction problem, the calculated W was 16.65 and the network always reached 100% accuracy in the training sets with 50, 100, and 200
examples.
The error function was given by (10), since the cross-entropy error, which
is said to be better suited for problems of binary classification [11], performed
slightly worse in our preliminary tests. We obtained better results than those
reported here with the UCIL2 P model when we used initial W values smaller
than the calculated ones, and also when we added some new neurons to the initial
topology. This shows the deleterious effect of high initial weights in training and
indicates that algorithms allowing the dynamic addition of neurons can obtain
better results in these domains.
As we see in Fig. 3, the results obtained by UCIL2 P were almost equal to
those from RAPTURE in the Promoter domain, but clearly worse in the splice-junction problem. This is not disappointing, since RAPTURE relies not only on
a weight-learning procedure but also on algorithms for the dynamic insertion of
neurons and connections.
Now we turn our attention to the MMKBANN model. The question we would
like to answer is, for what domains is the MMKBANN model well-suited and
for what domains is it not? As long as the knowledge that can be encoded into
MMKBANN resembles a normal logic program, we can expect good performance
when the concept to be learned is expressible in this formalism with a number of
clauses that is less than or equal to the number of hidden neurons in the network.
Fig. 3. Percentage of error versus number of training examples in (a) the Promoter
domain and (b) the splice-junction domain. MMKBANN results are shown in black,
UCIL2 P results in white, and RAPTURE results in shaded bars
Bad performance is expected otherwise. Aiming at confirming this expectation,
we ran experiments in two domains: the already mentioned domain of Promoter
and an artificial domain derived from the game of chess, which is a modification
of a domain used in [2]. In neither domain have we created connections corresponding to negated literals. While in the artificial domain the target concept
to be learned does not require such literals, in the Promoter domain all we have
is evidence of this, since no negated literals appear in KB.
The results on both models were averaged over 10 trials. For MMKBANN we
also used the topology defined by KB and the RProp training method. Results
for the Promoter domain, shown in Fig. 3, tend to confirm our expectation. As
discussed in [6], this domain requires the evidence-summing property—that is,
the property that small pieces of evidence sum up to form significant evidence.
While present in UCIL2 P, this property does not exist in MMKBANN, due
to the use of the min and max functions.3 We credit the poor performance of
MMKBANN to this. In most of the training sets, MMKBANN could not reach
100% accuracy on the examples.
The other domain is a 4 × 5 chess board (see Fig. 4), and the problem is
to find out the board configurations in which the king placed at position c1
cannot move to the empty position c2. Other pieces that can be on the board
are a queen, a rook, a bishop, and a knight, for both players. Each example is
generated by randomly placing a subset of these pieces on the other empty board
positions (for a detailed description, see [2]). Each board position is represented
by six attributes, indicating whether one of the four types of pieces is located at
that position, whether the piece is an enemy, and whether the position is empty.
The correct definition of the concept comprises 33 definite Horn clauses. Since
we did not delete or add clauses to the correct KB, this is the number of hidden
neurons of the network.
3 For instance, max(10, 0, 0) = max(10, 9, 9).
Fig. 4. The 4 × 5 chess board used in our experiment
We corrupted the correct knowledge in two ways: we added antecedents to
existing clauses and we deleted antecedents from clauses. The results obtained
by UCIL2 P on these domains were quite disappointing. Despite being able to
learn the training sets with 100% accuracy, the network generalized poorly, as
we see in Fig. 5, indicating that the network overfitted the training examples.
The calculated W was 18.19, with Amin = 0.975 and Amax = 0.025. MMKBANN
could not learn with 100% accuracy most of the training sets, but its results are
better than those from UCIL2 P, confirming to some extent our expectation.
Fig. 5. Percentage error versus number of training examples in the chess domain with
(a) inserted antecedents and (b) deleted antecedents. MMKBANN results are shown
in black, and UCIL2 P results in white bars
4 A Hybrid Framework for First-Order Theory Refinement
Having described a model that makes inference and inductive refinement over a
propositional theory while maintaining a close relation with symbolic processing,
our goal is to extend this model to the case of first-order theory. Sun [21] describes the “discrete neuron,” intended to be a description tool hiding unnecessary
implementation details. He argues that the discrete neuron can be implemented
with normal neurons. Inspired by this neuron, and aiming at handling the problems that arise in the first-order case, we make a similar automaton-theoretic
description of a neuron.
Definition 1. Let Σ be an alphabet (for instance, Σ = {a, b, …, z}) and n a
positive integer. Then U(Σ, n) = { (x, y) | x ∈ Σⁿ, y ∈ ℝ }.
For example, if Σ = {a, b, . . . , z}, then (a, 1) ∈ U (Σ, 1) and (b, c, 0.3) ∈
U (Σ, 2).4 We represent the ground facts P (a) and P (b), relative to some unary
predicate P , by associating a set {(a, 1), (b, 1)} ⊂ U (Σ, 1) with predicate P .
In general, the real number r in a tuple (a1 , . . . , at , r) ∈ U (Σ, t) represents
our confidence about the truth of P (a1 , . . . , at ). We choose to represent this
confidence value in [−1, 1].
Definition 2. A node j is a 5-tuple ⟨Σ, InputArity, arity, I, y⟩, where:
– Σ is an alphabet;
– InputArity ∈ ℕⁿ for n ≥ 0;
– arity ∈ ℕ;
– I is a vector of components I1, …, In such that Ik ⊆ U(Σ, InputArityk) for 0 ≤ k ≤ n;
– y : U(Σ, InputArity1) × … × U(Σ, InputArityn) → U(Σ, arity).
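Definition 2 can be rendered as a small data structure; the sketch below is our own hypothetical encoding, not the paper's, with tuples in U(Σ, t) modelled as (symbol_1, …, symbol_t, confidence):

```python
# Sketch (hypothetical rendering of Definition 2, not the paper's code):
# a node as the 5-tuple <Sigma, InputArity, arity, I, y>.

from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

@dataclass
class Node:
    sigma: Set[str]                      # the alphabet Sigma
    input_arity: Tuple[int, ...]         # InputArity in N^n
    arity: int                           # arity of the node's output tuples
    inputs: List[Set[tuple]] = field(default_factory=list)  # I_1, ..., I_n
    y: Callable[..., Set[tuple]] = None  # maps the inputs to output tuples

# A node holding the ground facts P(a) and P(b) with full confidence:
p = Node(sigma={"a", "b"}, input_arity=(), arity=1,
         y=lambda: {("a", 1.0), ("b", 1.0)})
print(p.y())
```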
We call this element a node instead of a neuron only to emphasize that we
are not worried about its implementation with standard neurons. I denotes the
set of incoming connections to the node and y its output. Our model for first-order theory refinement, called MMCILP,5 is based on this node. Unlike the
MMKBANN model, where each node (predicate) stands at a specific layer in a
feedforward network, and similarly to CIL2 P, which uses a Jordan-like architecture, we have a three-layer network with a one-to-one mapping between predicate
symbols and nodes in the input and output layers. The input-layer nodes compute identity, the hidden-layer nodes (one for each encoded clause) compute the
fuzzy AND, and the output-layer nodes compute the fuzzy OR. There are recurrent connections between the corresponding predicate symbols in input and
output layers. The justification for this three-layer architecture with recurrent
connections, instead of the feedforward model of MMKBANN, is that there are
concepts in the language of first-order logic whose definition is recurrent. While
in propositional logic there is no sense in a clause with the same propositional
symbol appearing in both its consequent and one of its antecedents (for instance
A ∧ B → A), such is not true in the first-order case. For instance, one definition
for the concept descendant is given by the two-clause logic program
4 We do, for simplicity, use a set's elements to represent the set. So b, c is used in
place of {b, c}.
5 MMCILP stands for MinMax Connectionist Inductive Logic Programming.
c1: descendant(A,B) :- parent(B,A).
c2: descendant(A,C) :- descendant(A,B), parent(C,B).
where descendant(A,B) means that A descends from B and parent(A,B) means
that A is B’s parent.
There are three types of connection in MMCILP, excluding those that represent recurrence:
– OR connections, from hidden to output nodes;
– AND+ connections, from input to hidden nodes, standing for positive antecedents;
– AND− connections, from input to hidden nodes, standing for negative antecedents.
Each connection has an associated weight w and can receive complex messages
to be conveyed, like M = {(a, 1), (b, 0.5)}. In this case, the outgoing message is
w(M ) = {(a, f (1, w)), (b, f (0.5, w))}, where the f function is given by
  f(x, w) = w x + w − 1,  for OR connections;
            w x − w + 1,  for AND+ connections;
            1 − w x − w,  for AND− connections.                    (17)
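A sketch of how a weighted connection transforms a message under (17); this is our reading of the equation, and the function names are hypothetical:

```python
# Sketch (our reading of eq. (17), not the authors' code): how a weighted
# connection transforms the confidence values in a message, for the three
# connection types of MMCILP. Confidence values live in [-1, 1].

def f(x, w, kind):
    if kind == "OR":
        return w * x + w - 1
    if kind == "AND+":
        return w * x - w + 1
    if kind == "AND-":
        return 1 - w * x - w
    raise ValueError(kind)

def convey(message, w, kind):
    """message: dict mapping ground tuples to confidence values."""
    return {t: f(c, w, kind) for t, c in message.items()}

M = {("a",): 1.0, ("b",): 0.5}
print(convey(M, 1.0, "AND+"))  # weight 1 passes confidences through unchanged
print(convey(M, 0.0, "AND+"))  # weight 0 maps every confidence to 1
```

With weight 1 each type reduces to the corresponding g of equation (9); with weight 0 the connection contributes the neutral element of its node's min/max.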
As an example, consider the definite Horn clause program shown in Fig. 6,
incorrectly defining the concept grandfather (this example was taken from [3]),
with two training examples. In Fig. 7, we can see the corresponding “network.”
Consider the alphabet Σ = {walter, carol, lee, jim} for all nodes.6 The recurrent
connections from output-layer nodes to input-layer nodes are represented by
setting, for each predicate N, yNi := yNo, where Ni and No denote the nodes
corresponding to N at the input and output layers, respectively.
The hidden-layer nodes corresponding to ground facts will have fixed output7
—for instance, yc5 = {(walter, carol, 1)}. Nodes corresponding to clauses with
antecedents are more complex—for instance, the hidden-layer node c4 has the
vector (2, 1) as InputArityc4 , that is the arity of the inputs I1 = w1 (ymother ) and
I2 = w2 (yfemale ), where w1 denotes the weight of the AND+ connection from
the input-layer node mother to c4, and w2 denotes the weight of the AND+
connection from the input-layer node female to c4. The hidden-layer node c4 has
arityc4 = 2 and y function given by y(I) = { (A, B, r) | ∃ r1, r2 : (A, B, r1) ∈
I1 ∧ (B, r2) ∈ I2 ∧ r = min(r1, r2) }. Output-layer nodes are simpler: for instance,
parento has inputs I1 = w3 (yc3 ) and I2 = w4 (yc4 ), where w3 and w4 are the
weights of the incoming OR connections to parento from c3 and c4, respectively.
Its y function is defined as the union of its inputs, choosing the tuple of maximum
confidence value when the same tuple, disregarding the confidence value, appears
in two or more inputs. Given P, a function-free acceptable logic program8 —
6 Notice that the predicate names are not in the alphabet since they are not necessary
in the operation of the network. They appear only in the translation of KB into the
network.
7 The weights that correspond to facts are fixed, since they are supposed to be true.
c1: grandfather(A,B) :- father(A,C), parent(C,B).
c2: grandfather(A,B) :- father(A,B).
c3: parent(A,B) :- father(A,B).
c4: parent(A,B) :- mother(A,B), female(B).
c5: father(walter,carol).
c6: father(lee,jim).
c7: mother(carol,jim).
c8: male(walter).
c9: male(lee).
c10: male(jim).
c11: female(carol).
positive: grandfather(walter,jim).
negative: grandfather(lee,jim).
Fig. 6. An incorrect description of the concept grandfather and two examples
Fig. 7. The network obtained from the grandfather program. All connections have
weight 1. Recurrent connections have been omitted. Node grandf represents concept
grandfather
i.e., a subset of normal logic programs, it can be shown that the corresponding
network obtained from P will compute the logical operator TP, that is, the stable
model of P [22].
8 This restriction is to guarantee that the stable model of P is finite. It occurs also in
ILP [23], where h-bounded and flattening techniques are applied to programs with
function symbols.
5 Learning
Our learning scheme here is the same as for MMKBANN. Given a collection E
of examples about some predicates whose definition we would like to learn, we
minimize the mean square error function mse(w) given by
  mse(w) = (1/|E|) Σ_{e∈E} [y(e, w) − ŷ(e)]²                       (18)
where ŷ(e) is the desired confidence value of the example e and y(e, w) is the
confidence value obtained for this example in the corresponding output-layer
node. Values of y(e, w) can be obtained by computing TP .
Since knowledge about the program is encoded not only in the connections
between nodes but also in their activation functions, it is necessary to look at
these functions when errors are backpropagated during training. For example,
while backpropagating the error associated with the uncovered positive example
grandfather(walter,jim) from the output predicate grandfather, when this
error arrives at node c1 it is necessary to find the tuples of the antecedents
father and parent that originated the tuple (walter, jim) at c1’s output. Suppose
these tuples are (walter, carol) for father and (carol, jim) for parent. Then these
tuples receive an associated error and the backpropagation process starts again
at the output-layer nodes father and parent. Learning in this way and guided by
the two examples, our system was able to correct the given definition of concept
grandfather. The corrections were a decrease in the weight connecting c2 to
grandfather and a decrease in the weight connecting female to c4.
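The tuple-level credit assignment just described can be sketched roughly as follows; this is a hypothetical simplification of the paper's procedure, for a two-input AND node whose y function joins tuples (A, B, r1) from the first input with (B, r2) from the second, as node c4 does:

```python
# Sketch (hypothetical, simplifying the paper's description): when the
# error for an uncovered tuple reaches an AND node, we look for the input
# tuples whose min produced it, so that backpropagation can continue at
# the corresponding output-layer predicates.

def responsible_tuples(i1, i2, target):
    """i1: {(A, B): conf}; i2: {(B,): conf}; target: the (A, B) tuple
    whose output confidence was min(conf1, conf2)."""
    a, b = target
    for (x, y), c1 in i1.items():
        if (x, y) == (a, b) and (y,) in i2:
            return ((x, y), c1), ((y,), i2[(y,)])
    return None

i1 = {("walter", "carol"): 0.9}   # e.g. tuples arriving at the node's first input
i2 = {("carol",): 0.4}
print(responsible_tuples(i1, i2, ("walter", "carol")))
```

Each returned tuple would then receive a share of the error, and backpropagation restarts at the output-layer nodes that produced it.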
Now recall that in the MMKBANN model we assumed that nodes in consecutive layers were fully interconnected, aiming at the addition of new antecedents
and clauses. A similar assumption in MMCILP would be:
1. A node in the hidden layer is connected to all nodes in the output layer with
the same arity. If this connection is stated in KB its weight is 1, otherwise
it is 0;
2. A node in the hidden layer has all possible antecedents in its activation function. If the antecedent is stated in KB its associated weight is 1, otherwise
it is 0.
Whereas 1 seems reasonable, 2 is unrealistic because the number of possible
antecedents grows exponentially with the arity of predicates in the input layer.
One possible solution, which is currently being studied, is to keep for each clause
a pool of candidate antecedents and let the weight learning procedure choose the
best ones. Entropy-gain heuristics, like those encountered in [13,24], will select
the pool members.
6 Conclusions and Future Work
The framework we have described can be regarded as a starting point in the direction of a connectionist-symbolic first-order refinement system. We have used
supervised learning algorithms on a hybrid system that is capable of performing
complex symbolic operations, in the expectation that the benefits usually associated with neural networks have been retained in MMCILP. More experiments
are needed to clarify whether this holds.
As we evaluate our framework, some important issues will have to be considered. First is the high computational cost due to the use of subgradient descent
methods, which typically require hundreds of iterations. The use of more clever
ways to compute TP in each of the iterations instead of the naïve way we have
described here can soften the demand for computational power, but surely some
considerable fraction of such a demand will still remain. There is also the combinatorial explosion in the search space for possible clauses, but this is inherent
to all first-order theory refinement systems. The adoption of syntactical restrictions in the allowed clauses, as those that usually appear in ILP [23], must be
considered.
The use of min/max functions instead of the normal setting encountered in
KBANN and CIL2 P, with weights and thresholds, ensures the AND/OR structure, but also makes the learning of the training set more difficult, as indicated
by the MMKBANN results. The use of techniques such as feature selection [13]
and the dynamic addition of nodes [2] can decrease this problem in both the
propositional and the first-order cases.
Acknowledgements
The authors are partially supported by CNPq. This work is part of the ICOM
project, also funded by CNPq/ProTeM-CC.
References
1. Botta, M., Giordana, A., Piola, R.: FONN: Combining first order logic with
connectionist learning. in Proceedings of the International Conference on Machine
Learning-97. (1997) 48–56
2. Opitz, D.W., Shavlik, J.W.: Dynamically adding symbolically meaningful nodes to
knowledge-based neural networks. Knowledge-Based Systems. 8 (1995) 301–311
3. Wogulis, J.: Revising relational domain theories. in Proceedings of the Eighth International Workshop on Machine Learning. (1991) 462–466
4. Towell, G., Shavlik, J.W.: Extracting refined rules from knowledge-based neural
networks. Machine Learning. 13 (1993) 71–101
5. Jordan, M.I.: Attractor dynamics and parallelism in a connectionist sequential
machine. in Proceedings of the Eighth Annual Conference of the Cognitive Science
Society. (1986) 531–546
6. Mahoney, J.J.: Combining symbolic and connectionist learning methods to refine
certainty-factor rule-bases. Ph.D. Thesis. University of Texas at Austin. (1996)
7. Mahoney, J.J., Mooney, R.J.: Combining connectionist and symbolic learning methods to refine certainty-factor rule-bases. Connection Science. 5 (special issue on
architectures for integrating neural and symbolic processing) (1993) 339–364
8. Towell, G., Shavlik, J.: Knowledge-based artificial neural networks. Artificial Intelligence. 69 (1994) 119–165
9. Towell, G.: Symbolic knowledge and neural networks: Insertion, refinement and
extraction. Ph.D. Thesis. Computer Science Department, University of Wisconsin,
Madison. (1992)
10. Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation
learning: The RProp algorithm. in Proceedings of the International Conference on
Neural Networks. (1993) 586–591
11. Rumelhart, D.E., Durbin, R., Golden, R., Chauvin, Y.: Backpropagation: The
basic theory. In: Backpropagation: Theory, Architectures, and Applications. Rumelhart, D.E., Chauvin, Y. (eds.) Hillsdale NJ: Lawrence Erlbaum Associates.
(1995) 1–34
12. Machado, R.J., Rocha, A.F.: The combinatorial neural network: A connectionist
model for knowledge-based systems. In: Uncertainty in Knowledge Bases. Bouchon,
B., Zadeh, L., Yager, R. (eds.) Berlin Germany: Springer-Verlag. (1991) 578–587
13. Setiono, R., Liu, H.: Improving backpropagation learning with feature selection.
Applied Intelligence. 6, no.2 (1996) 129–140
14. Machado, R.J., Barbosa, V.C., Neves, P.A.: Learning in the combinatorial neural
model. IEEE Transactions on Neural Networks. 9, no.5 (1998) 831–847
15. Garcez, A., Zaverucha, G., Carvalho, L.A.: Logic programming and inductive learning in artificial neural networks. Workshop on Knowledge Representation in Neural Networks (KI’96). Budapest. (1996) 9–18
16. Garcez, A., Zaverucha, G.: The connectionist inductive learning and logic programming system. Applied Intelligence Journal (special issue on neural networks and
structured knowledge: representation and reasoning). 11, no.1 (1999) 59–77
17. Pinkas, G.: Logical inference in symmetric connectionist networks. Doctoral thesis.
Sever Institute of Technology, Washington University. (1992)
18. Shastri, L., Ajjanagadde, V.: From simple associations to systematic reasoning.
Behavioral and Brain Sciences. 16 no. 3 (1993) 417–494
19. Hölldobler, S.: Automated inferencing and connectionist models. Postdoctoral thesis. Intellektik, Informatik, TH Darmstadt. (1993)
20. Kalinke, Y.: Using connectionist term representations for first-order deduction—
a critical view. CADE-14, Workshop on Connectionist Systems for Knowledge
Representation and Deduction. Townsville, Australia. (1997) http://pikas.inf.tu-dresden.de/~yve/publ.html
21. Sun, R.: Robust reasoning: integrating rule-based and similarity-based reasoning.
Artificial Intelligence. 75 (1995) 241–295
22. Gelfond, M., Lifschitz, V.: Classical negation in logic programs and disjunctive
databases. New Generation Computing. 9 (1991) 365–385
23. Lavrac, N., Dzeroski, S.: Inductive Logic Programming: techniques and applications. Ellis Horwood series in Artificial Intelligence. 44 (1994)
24. Quinlan, J.R.: Learning logical definitions from relations. Machine Learning. 5
(1990) 239–266
25. Menezes, R., Zaverucha, G., Barbosa, V.C.: A penalty-function approach to rule extraction from knowledge-based neural networks. International Conference on Neural Information Processing (ICONIP98), Kitakyushu, Japan. (1998) 1497–1500
Dynamical Recurrent Networks for Sequential Data Processing
Stefan C. Kremer1 and John F. Kolen2

1 Guelph Natural Computation Group, Dept. of Computing and Information Science,
University of Guelph, Guelph, ON, N1G 4E1, CANADA
skremer@uoguelph.ca
2 Dept. of Computer Science & Institute for Human and Machine Cognition,
University of West Florida, Pensacola, FL 32514, USA
jkolen@typhoon.coginst.uwf.edu
Abstract. All symbol processing tasks can be viewed as instances of symbol-to-symbol transduction (SST). SST generalizes many familiar symbolic problem classes, including language identification and sequence generation. One method of performing SST is via dynamical recurrent networks employed as symbol-to-symbol transducers. We construct these transducers by adding symbol-to-vector preprocessing and vector-to-symbol postprocessing to the vector-to-vector mapping provided by neural networks. This chapter surveys the capabilities and limitations of these mechanisms from both top-down (task-dependent) and bottom-up (implementation-dependent) perspectives.
1 The Problem
This chapter focuses on dynamical recurrent network solutions to symbol processing problems. All symbol processing tasks can be shown to be special cases
of symbol-to-symbol transduction (SST). In SST, the transduction mechanism
maps streams of input symbols selected from an input alphabet, Σ, to streams
of output symbols from an output alphabet, Γ . A transducer can compute functions from a language of strings selected from Σ ∗ to another language of strings
selected from Γ∗:

f : Σ∗ → Γ∗.    (1)
One implementation of this mapping is a multi-tape Turing machine (TM)
with these properties: there are three tapes, a read-only input tape, a working
tape, and a write-only output tape. The TM must always advance the head
on the input tape with every transition. Once the head passes the end of the
given input string, it reads blanks. In addition to moving its input head, the TM
always advances the output tape head one cell each time it writes a symbol to
the tape. Even under these constraints, the transducer is capable of computing
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 107–122, 2000.
© Springer-Verlag Berlin Heidelberg 2000
any computable function. We call this the transduction task. There are a number of interesting, special-case scenarios that can be imposed on the string-to-string transduction problem.
1.1 Output Symbol Generation Scenarios
An important question regarding transducers is “What is the relationship between input and output string symbols?” All SSTs compute output symbols,
Γ ∗ , from input symbols, Σ ∗ , and internal state, S. Under this assumption, we
consider the following scenarios:
One-to-one. One output symbol is generated for each input symbol: S × Σ → S × Γ.
n-to-one. An output symbol is generated after every nth input symbol (one-to-one is the special case with n = 1): S × Σⁿ → S × Γ.
Literal event. Outputs are generated on a one-to-one basis in response to a special "generate output" symbol, γ ∈ Σ, presented on the input tape: S × (Σ)∗γ → S × Γ.
Internal selection. This is the most general case, where the output symbols are generated according to some arbitrary rule mapping strings of input symbols to strings of output symbols. The rule is not required to produce output symbols at a particular ratio to input symbols, nor is it required to produce output symbols relative to a particular trigger symbol ("generate output") [7]. These two types of rules, however, are not precluded, and therefore internal selection represents a completely general case: Σ∗ → Γ∗.
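The one-to-one scenario, S × Σ → S × Γ, can be sketched as a Mealy-style step function driven over an input stream. This is an illustrative sketch of the formalism, not code from the chapter; all names here are our own.

```python
# A minimal sketch of the one-to-one scenario, S x Sigma -> S x Gamma:
# the transducer is a step function (state, in_symbol) -> (state, out_symbol).

def run_transducer(step, s0, inputs):
    """Apply `step` over an input stream, collecting one output per input."""
    state, outputs = s0, []
    for sym in inputs:
        state, out = step(state, sym)
        outputs.append(out)
    return outputs

# Example step: emit 'E' if the number of 1's seen so far is even, else 'O'.
def parity_step(state, sym):
    state = state ^ (sym == '1')
    return state, 'O' if state else 'E'

print(''.join(run_transducer(parity_step, False, '1101')))   # -> OEEO
```

The n-to-one and literal-event scenarios differ only in when `step` chooses to emit a symbol.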
While some would argue that these scenarios are fundamentally different, each scenario can be cast in terms of the others. For example, one-to-one is obviously a special case of n-to-one, which in turn is a special case of internal selection. A literal event transducer can be constructed from a one-to-one mechanism by outputting a "don't care" symbol which can be interpreted as an empty string (ε). Since the transducers can compute recursively enumerable sets, they are closed under generalized sequential machine (GSM) mappings [14]. GSM mappings are performed by finite state transducers with the ability to emit εs.
The scenarios above assume that sequences of inputs produce sequences of
outputs. These sequences, however, may be of length one in a degenerate case
where either the input tape or the output tape is limited to a single symbol. In
the former case we are dealing with a sequence generation task where a sequence
of output symbols is generated in response to a single input signal: f : Σ → Γ ∗ .
In the latter case we have a classification task where a stream of input symbols
is classified into a finite number of categories represented by the single output
symbols: f : Σ ∗ → Γ . If the output alphabet is further limited to the symbols
accept and reject, then the function accepts strings from a specific language. This
accept scenario links formal languages to their formal computational mechanisms.
For most tasks, the transducer is constructed by hand from a behavioral
specification. Constructing transducers is difficult from full specifications, yet
we find ourselves in situations where the specification is incomplete and defined
only in terms of examples. Construction under these constraints is an induction
problem similar to the classic grammar induction problem [12] and is usually
the main justification for resorting to neural network solutions.
1.2 Application/Motivation
SST is easily associated with human language processing tasks such as phonetic
to acoustic mapping, text to phoneme mapping, grammar induction, and translation. However, many other problems can be cast into this paradigm as well.
This was perhaps best exploited by King Sun Fu, one of the foremost researchers
in the applications of grammatical induction. He used the term “syntactic pattern recognition” to describe this field [9]. Fu used grammars to describe many
different things, only some of which would be considered languages in the lay
sense.
Specifically, symbol stream transduction has been applied to: modeling natural language learning, process control, signal processing, phonetic to acoustic
mapping, speech generation, robot navigation, nonlinear temporal prediction,
system identification, learning complex behaviors, motion planning, prediction
of time ordered medical parameters, and speech recognition, to name but a few.
In fact, grammatical induction can be used to induce anything that can be represented by some language. In this sense, it truly represents the archetypical
form of induction.
2 Generalized Architecture
We are interested in connectionist approaches to solving SST problems such as
those described above. Connectionist mechanisms, however, are vector-to-vector
transducers (VVT). They take a finite-length input vector and produce a finite-length output vector. Connectionist mechanisms, as concisely pointed out in [6],
can not be directly applied to SST. Two additional processes must be added: a
symbol-to-vector encoding which takes symbols from the input tape and converts
them into vectors for presentation to the networks, and a converse mechanism
to perform vector-to-symbol decoding. A hybrid, connectionist-symbolic architecture for SST is illustrated in Figure 1.
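The three-stage pipeline (encode, vector-to-vector transduce, decode) can be sketched as below. This is an illustrative stand-in, not the chapter's implementation: the "network" is a fixed tanh update rather than a trained recurrent net, and all names (`encode`, `vvt`, `decode`) and weights are our own assumptions.

```python
import math

# Sketch of the hybrid SST pipeline: symbol-to-vector encoding, a stateful
# vector-to-vector transducer (a fixed stub standing in for a trained
# recurrent network), and vector-to-symbol decoding.

SIGMA = ['a', 'b']          # input alphabet
GAMMA = ['x', 'y']          # output alphabet

def encode(sym):            # one-hot symbol-to-vector encoding
    return [1.0 if s == sym else 0.0 for s in SIGMA]

def vvt(vec, state):        # stand-in vector-to-vector transducer with state
    state = [math.tanh(v + h) for v, h in zip(vec, state)]
    return state, state     # output vector = new state, for simplicity

def decode(vec):            # vector-to-symbol: largest component wins
    return GAMMA[max(range(len(vec)), key=lambda i: vec[i])]

def transduce(symbols):
    state, out = [0.0] * len(SIGMA), []
    for sym in symbols:
        state, vec = vvt(encode(sym), state)
        out.append(decode(vec))
    return ''.join(out)

print(transduce('abba'))
```

Swapping the stub `vvt` for a real recurrent network leaves the surrounding symbolic machinery unchanged, which is the point of the hybrid architecture.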
We describe below a coherent picture of the emergence of computationally
complex behavior of connectionist networks employed as SSTs. Three different
sources cooperate in this emergence: input modulation, internal dynamics of the
connectionist network, and observation [17]. Input channel effects generalize the
effects that input changes have on the dynamics of the network. In essence, the
input symbol indexes a particular internal dynamic. Internal dynamics refers to
the laws of motion that guide the trajectory of the system. These equations of
motions are taken as the definition of behavior of dynamic recurrent networks.
Finally, observation is the transduction of the output vector to the output symbol. This observation mechanism partitions the output vector space into finite
sets labeled with our output symbols. These three sources collaborate to spawn
the emergence of computation of the SST.
Fig. 1. Symbol-to-symbol transducer implemented with a recurrent connectionist network. (The figure shows a symbol sequence passing through an encoding stage into vectors, through the recurrent network, and through a discretization stage back out to a symbol sequence.)
2.1 Input Modulation
The symbol-to-vector encoding is most often implemented via table lookup. The
symbol a_i is mapped to the vector x^(i), for example. Often a one-hot, or 1-in-n, encoding is used for vectors, such that the components of the vector x^(i) are defined as:

x^(i) = (x^(i)_1, x^(i)_2, x^(i)_3, ..., x^(i)_n),    (2)

where the j-th component, x^(i)_j, of this vector is defined

x^(i)_j = 1 if i = j, and 0 otherwise.    (3)
A variation on this approach involves the presentation of several consecutive
input symbols at a time. Under this approach each symbol is mapped to a vector.
An extended vector, composed by concatenating the affine representations of the
individual symbols, is created for presentation to the network. This approach was
pioneered in the NetTalk project [36].
Under these approaches, the encoder takes a finite set of symbols, or strings
of symbols, and maps them to a finite set of vectors. The same symbol will always
encode to the same vector. From the perspective of the recurrent network, the
vector is a constant whenever the same input is presented. This constant vector
will reduce the internal dynamics of the network to a state-to-state mapping
[18]. In other words, the input symbol selects one mapping from the finite set of
all state-to-state mappings. For example, consider f (x, y) = x + y. If symbol a
is mapped to 0 and symbol b is mapped to 1, then we get two indexed functions
fa (y) = y and fb (y) = 1 + y. The input symbol sequence selects a function that
is the composition of the fa and fb functions. The result of this selection on state
spaces is similar to iterated function systems [17].
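The f(x, y) = x + y example above can be sketched directly: each input symbol indexes a state-to-state map, and the input sequence selects a composition of those maps.

```python
from functools import reduce

# Input symbols index state-to-state maps; a symbol sequence selects their
# composition (the iterated-function-systems view of input modulation).

f_maps = {'a': lambda y: y,        # a is mapped to 0: f_a(y) = 0 + y
          'b': lambda y: 1 + y}    # b is mapped to 1: f_b(y) = 1 + y

def apply_sequence(seq, y0=0):
    return reduce(lambda y, sym: f_maps[sym](y), seq, y0)

print(apply_sequence('abba'))   # composes f_a . f_b . f_b . f_a: counts b's -> 2
```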
Other approaches include varying the mapping from symbol to vector over
time. For instance, the encoding vector could be scaled by αᵗ, where α is some constant between 0 and 1 and t is the position of the symbol in the input stream. One could interpret this as a bias toward the initial symbols of the sequence [27].
2.2 The Internal Dynamics
While conventional feedforward networks with multiple input symbols presented at one time, as in [36], can compute SSTs, the computational power of such systems is limited. For example, such a system could not compute the parity of arbitrary-length strings.
For this reason, we consider networks whose internals are not static. In dynamical systems, this change is governed by an internal dynamic. Many scientists
expect to find the roots of complex behavior within the internal dynamics. It is
a very enticing view as it is the internal processing, or dynamics, that we have
most control over when we build models of cognitive processes. Committing to
this approach frees researchers to focus their energies on issues of representation,
organization, and communication within the context of the dynamic.
Many formal computing machines offer explicit control mechanisms. Identifying aspects of their design crucial to the generation of computationally complex behavior is fairly straightforward. Finite state automata, push-down automata, linear-bounded automata, and Turing machines all share finite state control but exploit different storage mechanisms which, in turn, produce differences in their behaviors. Alternatively, the same automata can be derived from a Turing machine by assuming restrictions on the control program: FSA by assuming unidirectional head movement, PDA by assuming that heads move normally in one direction, but always write a blank on moving in the other direction, and LBA by assuming that they never move past the ends of the input string. The same types of restrictions can be imposed on grammars to convert more general grammatical frameworks to ones with limited representational powers.
restrictions can be imposed on grammars to convert more general grammatical
frameworks to ones with limited representational powers.
2.3 The Act of Observation
We have outlined the effects of internal dynamics and input modulation on the computational abilities of systems. The vector-to-symbol encoding is also important. Essentially, the output vector space is discretized into regions that map to discrete symbols. One possible mapping function is nearest neighbor encoding (a prototype vector). A list of symbol/vector pairs is given (often based on a one-hot encoding scheme); the nearest vector to the output vector is selected, and its corresponding symbol is produced as output. This section addresses the effects of observation via the vector-to-symbol encoding.
Kolen and Pollack [20] examined the vector-to-symbol encoding, or observation, effects on the induction of grammatical competence models of physical
systems. The centerpiece of this work was a variable speed rotator that had
interpretations both as a context-free generator and as a context-sensitive generator. Believing that physical systems possess an intrinsic computational model
leads one to the Observers’ Paradox: the resulting interpretation depends upon
measurement granularity.
Because the variable speed rotator has two disjoint interpretations, computational complexity classes emerge from the interaction of system state dynamics
and observer-established measurement. The complexity class of a system is an
aspect of the property commonly referred to as computational complexity, a
property we define as the union of observed complexity classes over all observation methods. While this definition undermines the traditional notion of system
complexity, namely that systems have unique well-defined computational complexities, it accounts for our intuitions regarding the limitations of identifying
computation in physical systems. This argument shows how observing the output of a system, such as a recurrent network, can lead to multiple hypotheses
of internal complexities. The flip side of this statement implies that observed
complexity can be tuned via selection of an observation mechanism.
2.4 Synthesis
We described, above, the three sources of emergent computation in recurrent
network SST. These sources include a sensitivity to input in constructing virtual state transformations, the ability to constrain internal dynamics capable of
supporting complex behavior, and finally the fact that output is an observation
of the internal states of the network. These three sources collaborate to provide proper conditions for the emergence of computation within the context of
recurrent neural networks.
The preceding arguments show conclusively that these three conditions are
sufficient for emergence, but are they necessary? Clearly, input dependency is
important because most of the time we identify computational systems as transducers of information. Even if a computational system ignored its input, as
in the case of an enumerative generator, it still could be portrayed as utilizing
input. Likewise, internal dynamics cannot be excluded either. They provide the
underlying mechanism for system change. Finally, the role of observation is inseparable from the notion of emergent computation in that the system must
be observed doing something. Emergent computation arises from the interaction
between the environment, the system, and the observer. Thus, a computational
model unaccompanied by its input encodings and observation mechanisms is
useless.
3 Specific Architectures
In the previous section, we identified the three necessary and sufficient conditions
for computationally complex behavior in recurrent network based SST. We now
focus on the internal dynamics and consider three broad classes of dynamical
recurrent network architectures. These are: feedforward networks, output feedback networks, and hidden feedback networks. These are illustrated in Figure 2.
In the subsections below, we consider each in turn.
3.1 Feedforward Networks
One of the simplest methods for dealing with input data that varies over time as
well as space is to use not only the current input symbol to drive the network, but
also previous input symbols (Figure 2(a)). In these types of systems, temporal
delays can be introduced to provide the processing units in the network with a
history of input values.
The simplest approach is the Window in Time (WIT) network made famous
by the NetTalk system of Sejnowski and Rosenberg [36]. In this architecture,
the current and previous input vectors are presented as inputs to a (typically)
one-hidden layer feedforward network.
The WIT approach constrains the machines which can be induced by using a
fixed memory in the sense that it is always formed by the vector concatenation
of the current input symbol and the previous n − 1 input symbols. Here, n
represents the size of the temporal window on previous inputs. This form of
memory is very restrictive in the sense that the network itself cannot influence
what is or is not stored in memory.
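The WIT memory can be sketched as a sliding window over the input: at each step the network sees the current symbol concatenated with the previous n − 1 symbols. This is an illustrative sketch; the padding symbol and function name are our own.

```python
# Window-in-Time input construction: at time t the network's input is the
# current symbol plus the previous n-1 symbols (padded at the start).

def wit_windows(seq, n, pad='#'):
    padded = [pad] * (n - 1) + list(seq)
    return [padded[t:t + n] for t in range(len(seq))]

print(wit_windows('abc', 3))
# -> [['#', '#', 'a'], ['#', 'a', 'b'], ['a', 'b', 'c']]
```

Each window would then be one-hot encoded, concatenated, and fed to an ordinary feedforward network.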
A number of variations on the simple WIT approach have been suggested.
The most notable of these is Waibel et al.’s Time-Delay Neural Networks (TDNNs) [39]. A general review of other time delay networks can also be found in
[3].
3.2 Output Feedback Networks
A second approach to dealing with sequential data is to use not only a finite history of previous input values to determine the current output, but also a finite history of previous output values (Figure 2(b)). Because these networks feed back their outputs, they bear a certain similarity to infinite impulse response (IIR) filters in communications theory [28, 29]. Some [26] have chosen
Fig. 2. Three classes of dynamical recurrent network architectures: (a) feedforward networks, (b) output feedback networks, in which output activations are copied to input units, (c) hidden feedback networks, in which hidden activations are copied to input units.
to describe this class of networks as Nonlinear Auto-Regressive with eXogenous
inputs (NARX) networks because of their similarity to NARX models in dynamical systems theory.
The NARX approach also uses a fixed memory since it is always formed by
the vector concatenation of the current input symbol and the previous n−1 input
symbols, plus the previous m output symbols. Here, m and n represent the size
of the temporal windows on previous outputs and inputs respectively. This form
of memory is also restrictive in the sense that the weights of the network cannot
be adjusted to influence what is stored in memory.
Clearly, the WIT networks described above are a degenerate case of the
output feedback networks where m = 0.
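The NARX regressor can be sketched as a concatenation of the last n inputs with the last m fed-back outputs. An illustrative sketch only; the padding symbol and names are our own assumptions.

```python
# NARX input construction: at time t, concatenate the last n input symbols
# with the last m output symbols (positions before the start are padded).

def narx_input(inputs, outputs, t, n, m, pad='#'):
    ins = [inputs[i] if i >= 0 else pad for i in range(t - n + 1, t + 1)]
    outs = [outputs[i] if i >= 0 else pad for i in range(t - m, t)]
    return ins + outs

# At t = 2 with n = 2, m = 2: current and previous input, two previous outputs.
print(narx_input('abcd', 'xy', 2, 2, 2))   # -> ['b', 'c', 'x', 'y']
```

With m = 0 the output history disappears and this reduces exactly to the WIT window above.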
3.3 Hidden Feedback Networks
The final alternative for dealing with sequences is to use a memory system
which does not explicitly store previous inputs or outputs. In such a system,
the memory or state is invisible to an outside observer. This is not unlike the
operation of a Hidden Markov Model (HMM). The advantage of using a memory
that does not merely store previous inputs and outputs is that it can compute
salient properties in an input pattern that may not be explicitly represented in
a finite history of input or output symbols.
In these networks, memory is computed in a set of hidden units (Figure 2(c)).
Another analogy can be drawn between hidden feedback networks and Mealy and Moore machines [14]. These two computation models are described in terms of output and next-state mappings. The activation values of these hidden units are computed based on the activations of the input units and a layer of special units
called context units. By convention (see [5]), the activations of the context units
are initially set to 0.5 (or 0.0), and subsequently set to the activation values of
the hidden units at the previous time step (the number of context and hidden
units must be equal). Specifically, the hidden unit activation values are copied
to the context units at each time step. Thus, at any given time, the hidden
unit activations represent the current state, while the context unit activations
represent the previous state.
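One hidden feedback step in the style of [5] can be sketched as follows. The weights here are illustrative placeholders, not values from the chapter, and the function name is our own.

```python
import math

# Sketch of an Elman-style hidden feedback step: the context units hold the
# previous hidden activations, and the new hidden activations depend on the
# current (one-hot) input and the context.

def elman_step(x, context, W_in, W_ctx, b):
    """h_j = sigmoid(sum_i W_in[j][i]*x[i] + sum_k W_ctx[j][k]*context[k] + b[j])."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    return [sig(sum(wi * xi for wi, xi in zip(W_in[j], x)) +
                sum(wc * ck for wc, ck in zip(W_ctx[j], context)) + b[j])
            for j in range(len(b))]

context = [0.5, 0.5]                     # conventional initial context
W_in = [[1.0, -1.0], [0.5, 0.5]]         # placeholder weights
W_ctx = [[0.0, 1.0], [1.0, 0.0]]
b = [0.0, 0.0]
for x in ([1, 0], [0, 1]):               # one-hot inputs for symbols a, b
    context = elman_step(x, context, W_in, W_ctx, b)   # copy h into context
print([round(h, 3) for h in context])
```

The copy of hidden activations into the context at each step is the only recurrence; everything else is an ordinary feedforward pass.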
A number of variations on this approach have been explored. The three most
significant involve: (1) using a set of first-order connections between the input
and hidden units in a massively parallel connection scheme, (2) using a set
of second-order connections between the input and hidden units in a massively
parallel connection scheme, and (3) using a set of first-order connections between
the input and hidden units in a one-to-one connection scheme.
The first of these has been studied extensively in [5, 4] and represents the most straightforward approach to implementing a general hidden feedback network. The second-order connection scheme lends itself to encoding labelled state
transitions, since each second-order connection combines one context unit (or
previous state) and one input unit and transmits a signal to one hidden unit (or
next state) [11, 32]. The one-to-one connection scheme represents a simplified
method for implementing memory [8]. In this system, memory is represented
locally in the sense that a given hidden unit is used to compute its own future
value, but not the future activation values of any other units. This offers significant computational advantages when calculating a gradient with respect to
error.
4 Training Methods
The primary method for adapting recurrent networks to perform specific desired
computations involves computing a gradient in error space with respect to the
weights of the network. By adjusting the weights in the direction of decreasing
error, the network can be adapted to perform as desired. While the standard
backpropagation rule [34] can be used to adapt the weights of a feedforward
network, like those used in feedforward networks with delays, a different algorithm must be used to compute gradients in networks with recurrent or feedback
connections.
Three main gradient descent algorithms for networks with recurrent connections have been proposed. All three compute an approximation to the error
gradient in weight space, but they differ in how this computation is performed.
4.1 Backpropagation Through Time
The most intuitive approach is to temporally unfold the network and backpropagate the error through time (BPTT). That is, treat the hidden and input units
at time t as two new virtual layers, the hidden and input units at time t − 1 as
two more virtual layers, and so on back to the very first time-step. Since there is really only one set of input and hidden units, and thus only one set of weights originating from them, weight values between the virtual layers must be shared. Weight changes to shared weights can be computed by using the standard backpropagation algorithm to calculate changes as if the shared weights were individual, then summing the changes for all virtual incarnations of the same weight, and applying the summed change to the actual weight. This approach gives a very intuitive way of implementing a gradient descent algorithm in a network with recurrent connections. The time complexity of the BPTT algorithm is Θ(n² · t), where n represents the number of hidden units in the network and t represents the length of the sequence. Analysis of the space complexity of this weight adaptation algorithm reveals that the memory requirements grow proportionally to the length of the input sequence (Θ(n · t)). This makes the approach generally unsuitable for learning tasks involving very long sequences.
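The shared-weight summation can be made concrete on a toy case. The sketch below uses a scalar linear recurrence s_t = w·s_{t−1} + x_t with loss only at the final step; this is our own minimal illustration of BPTT's bookkeeping, not the general algorithm of [34].

```python
# Hand-worked BPTT on a linear recurrence s_t = w*s_{t-1} + x_t with loss
# E = 0.5*(s_T - target)^2: the weight w is shared across the unrolled
# copies, so its gradient is the sum of per-time-step contributions.

def bptt_grad(w, xs, target):
    # forward pass, storing every state (the Theta(n*t) memory cost)
    states = [0.0]
    for x in xs:
        states.append(w * states[-1] + x)
    # backward pass: delta flows backward through the shared weight
    delta = states[-1] - target          # dE/ds_T
    grad = 0.0
    for t in range(len(xs), 0, -1):
        grad += delta * states[t - 1]    # contribution of copy t of w
        delta *= w                       # propagate one step further back
    return states[-1], grad

s, g = bptt_grad(0.5, [1.0, 1.0], target=2.0)
print(s, g)   # s_2 = 0.5*1 + 1 = 1.5; analytically dE/dw = (w + 1 - 2) = -0.5
```

Truncated BPTT simply stops the backward loop after a fixed number of iterations, trading gradient accuracy for constant memory.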
One solution to this problem is to truncate the gradient calculation and
unravel the network for only a finite number of steps. In this situation, the space
complexity is proportional to the selected number of steps of unraveling. In the
most extreme case, the network is only unraveled for one step [5, 32]. Empirical
evidence has shown that this truncation can make certain kinds of problems
unlearnable [2, 22].
4.2 Real-Time Recurrent Learning
An alternative to BPTT which overcomes the limitation of requiring memory
proportional to string length without truncating the gradient has been proposed
in [40]. Rather than use virtual layers, this approach describes the propagation
problem in terms of recursive equations. The solution of these equations results
in the Real-time Recurrent Learning (RTRL) method. This approach computes
the gradient by storing additional information about the interactions between
processing units. The computation of these interactions requires additional time,
but allows the memory required to remain constant regardless of sequence length.
Specifically, the time and space complexity is Θ(n³). While the time and space requirements of this method are bounded by the architecture, it may be more efficient to use BPTT if the sequence length is not much larger than the number of hidden units.
4.3 Schmidhuber's Algorithm
The tradeoff between time and space offered by BPTT and RTRL can be avoided
using a clever algorithm which effectively combines the two approaches. Developed by Schmidhuber [35], the algorithm operates in two modes: one
corresponding to BPTT and another corresponding to RTRL. By performing
only a finite number of steps in BPTT mode, the memory required for this stage
of the algorithm remains finite, yet the computational advantages are realized.
This algorithm also has Θ(n³) time complexity.
4.4 Teacher Forcing
In networks where output is fed back, the user is not forced to choose between
regular backpropagation and recurrent gradient descent algorithms. The activation values of the output units can be treated as trainable values which can be
adapted using a recurrent gradient descent algorithm that adapts the weights
of the network. It is also possible, however, to use regular backpropagation if
the desired output values are known for all points in the sequence. Using this
approach, the activation values of the output units which are fed back into the
network are not the actual values computed by the output units, but rather the
values which these units should assume once training is complete (i.e. the target
values). Obviously, this type of approach can only be used in a system where
output values are known throughout the training sequence. Otherwise, the output values mid-sequence are essentially hidden and thus must be trained using
a recurrent gradient descent algorithm.
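The difference between feeding back predictions and feeding back targets can be sketched in a few lines. The "network" below is a trivial stand-in of our own devising; only the feedback switch illustrates teacher forcing.

```python
# Teacher-forcing sketch: during training, the value fed back into the input
# is the *target* output, not the network's own prediction (this assumes the
# targets are known at every step of the sequence).

def run(xs, targets, predict, teacher_forcing):
    fed_back, preds = 0.0, []
    for x, tgt in zip(xs, targets):
        y = predict(x, fed_back)
        preds.append(y)
        fed_back = tgt if teacher_forcing else y   # the only difference
    return preds

predict = lambda x, prev: 0.5 * (x + prev)   # illustrative stand-in "network"
xs, targets = [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]
print(run(xs, targets, predict, teacher_forcing=True))    # -> [0.5, 1.0, 1.0]
print(run(xs, targets, predict, teacher_forcing=False))   # -> [0.5, 0.75, 0.875]
```

With teacher forcing, each step's input no longer depends on earlier predictions, so ordinary backpropagation suffices.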
5 General Learning Limitations
In the previous section, we presented brief summaries of several network architectures and their training algorithms¹. It is easy to think that these techniques are universal and will work under any circumstance.

¹ For a more complete review, the reader is referred to [21].
Before tackling a new problem, however, it is important to explore the a priori restrictions that exist due to the computational characteristics of the problem and
not the mechanism solving the problem. It may be the case that the problem
in question is not solvable with the given resources (time or space). Much work
has been done on the problem of learning SSTs in the form of automata and
grammars.
The simplest learning problem for SSTs is language decidability. An input
sequence is transformed into a single symbol, either accept or reject. Gold [12]
examined the resources necessary for solving this problem under a variety of
circumstances.
He identified two basic methods of presenting strings to a language learner: "text" and "informant". A text was defined as a sequence of legal strings containing every string of the language at least once. Typically, texts are presented
one symbol after another, one string after another. Since most interesting languages have an infinite number of strings, the process of string presentation never
terminates.
By contrast, Gold defined an informant as a device which can tell the learner
whether any particular string is in the language. Typically the informant presents
one symbol at a time, and upon a string’s termination supplies a grammaticality
judgment: grammatical or non-grammatical.
Gold coined the term "identifiable in the limit" to answer the question: which classes of languages are learnable with respect to the above two methods of information presentation? A language is identifiable in the limit if there exists a
point in time at which no further string presentations alter the categorization of
strings, and all categorizations are correct.
Gold showed that identifiability in the limit is a high ideal to strive for.
When he assumed a text information presentation, Gold showed that only finite
cardinality languages could be learned. Finite cardinality languages consist of
a finite number of legal strings, and are a small subset of the regular sets (the
smallest set in the Chomsky hierarchy). In other words, none of the language
classes typically studied in language theory are text learnable.
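Why finite cardinality languages are text learnable can be sketched concretely: the learner that guesses "exactly the strings seen so far" converges once every string of the finite language has appeared, and no later presentation changes the hypothesis. A minimal illustration of Gold's positive result, with names of our own choosing.

```python
# A text learner for finite-cardinality languages: the hypothesis is the set
# of strings presented so far. For a finite language this identifies the
# language in the limit; for an infinite language it never stabilizes.

def learn_from_text(text_stream):
    hypothesis, history = set(), []
    for string in text_stream:
        hypothesis.add(string)
        history.append(frozenset(hypothesis))
    return history

text = ['ab', 'ba', 'ab', 'ab', 'ba']      # a text for L = {ab, ba}
history = learn_from_text(text)
print(history[-1] == {'ab', 'ba'})         # -> True: identified in the limit
```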
Gold went on to show that under informant learning, only two kinds of language are identifiable in the limit: regular sets (which are those languages having
only transition rules of the form A → wB, where A and B are variables and
w is a (possibly empty) string of terminals), and context-free languages (which
are those languages having only transition rules of the form A → β, where A
is a single variable). Other classes of language, like the recursively enumerable
languages (those having transition rules, α → β, where α and β are arbitrary
strings of terminals and non-terminals), were shown to be unlearnable.
6 Representational Limitations of Specific Architectures
It is possible to evaluate the representational capacities of specific architectures.
This is done by considering what kinds of sequence transduction or classification
problems can be solved for some set of weights. By computing the union of these
problems over all possible weights, we can evaluate the representational capacity
of the network.
The representational capacity is important for two reasons. The first, and most obvious, is that if a network cannot represent a problem, then it cannot possibly be trained to solve that problem. The second reason is that selecting an architecture with an inappropriately large representational capacity means that the learning process has too many degrees of freedom, resulting in increased training time and the possibility of more local minima. Thus, it is critical to
select an architecture whose representational capacity is neither too large, nor
too small.
Giles, Horne and Lin [10] were the first to recognize that Kohavi [16] had previously identified a class of automata called definite machines. The representational capabilities of these machines exactly match those of feedforward networks with delays when operating on input and output sequences over finite input-output alphabets. These networks, and their formal machine counterparts, are able to represent only certain kinds of sequence transduction and recognition problems. They are able to recognize only a subset of the regular languages. In particular, they are unable to differentiate strings over the alphabet Σ = {0, 1} that contain an even number of 1's from those containing an odd number of 1's.
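The parity limitation can be demonstrated directly: any two strings that agree on their last n symbols produce the same fixed window, yet they may differ in parity, so no window-based (definite-machine) network can separate them. A small illustration of our own construction.

```python
# Parity defeats fixed-window memory: two strings that agree on their last n
# symbols are indistinguishable to any window of size n, but can differ in
# parity. One bit of recurrent state, by contrast, suffices for parity.

def last_window(s, n):
    return s[-n:]

def parity(s):
    return s.count('1') % 2

u, v = '1' + '0' * 5, '0' * 5      # differ only outside any window of size 5
print(last_window(u, 5) == last_window(v, 5))   # -> True (same window)
print(parity(u) != parity(v))                   # -> True (different parity)
```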
Output feedback networks are much more versatile in terms of the types of problems that they can represent. In fact, it has been shown that these types of networks are Turing machine equivalent, assuming the appropriate information is available on the system's output [37].
Hidden feedback networks generally have significant computational power,
and have been known for some time to have the computational capabilities of
Turing machines (assuming enough hidden units are available) [30, 33, 31, 15, 38].
There are some hidden feedback networks, however, whose computational power
has been shown to be limited [23, 24, 25].
7 Discussion and Conclusion
We have already examined the work of Gold, which showed that identifying languages in the limit can be an intractable problem. This fact holds regardless
of implementation substrate (i.e., it applies equally to the symbolic-connectionist
hybrids described above). Gold made no assumptions about the mechanism for
learning, only about the data presented and the problem classes to be addressed.
This means that recurrent networks cannot perform the difficult cases of language
learning in the limit either.
On the other hand, the learning mechanisms of recurrent connectionist networks differ from their symbolic counterparts in that they can consider
more than one potential solution at a time. If we consider the weights of such
a network to represent a hypothetical language, then the error gradient of that
hypothetical language provides us with information about neighboring languages. In this sense, the differences between symbolic and network approaches to
120
S.C. Kremer and J.F. Kolen
SST are similar to the differences between interior and exterior methods in linear
programming.
One way in which the difficulties of language learning manifest themselves
under this paradigm is via a phenomenon known as “shrinking gradients”. Hochreiter [13] and Bengio, Simard and Frasconi [1] independently discovered that
it is very difficult for recurrent networks to learn to latch information for long
periods of time.
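The shrinking-gradient effect can be sketched numerically. In the minimal one-unit sketch below (parameters are assumed for illustration), the derivative of the state at time T with respect to the initial state is a product of T chain-rule factors w·σ′(·), each bounded by w/4 for the logistic function, so for moderate w the gradient vanishes exponentially with the time lag:

```python
import numpy as np

# Minimal numerical sketch of shrinking gradients (assumed parameters).
# In a one-unit sigmoid network s_t = sigma(w * s_{t-1}), the derivative
# ds_T/ds_0 is a product of T factors w * sigma'(pre-activation), each of
# magnitude at most w/4, hence exponential decay for moderate w.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w, s, grad = 1.5, 0.3, 1.0   # assumed recurrent weight and initial state
grads = []
for _ in range(50):
    s = sigmoid(w * s)
    grad *= w * s * (1.0 - s)   # chain-rule factor: w * sigma'(.)
    grads.append(abs(grad))

print(f"|ds/ds_0| after 1 step: {grads[0]:.3g}, after 50 steps: {grads[-1]:.3g}")
```

With these values the factor is roughly 0.3 per step, so after 50 steps the gradient is far too small to latch information over that span.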
This paper has examined the topic of symbol processing in dynamical recurrent networks. By focusing on a general sequence-to-sequence transduction
problem, we identified the three components of any such system: input modulation, internal dynamics, and output observation. We also discussed the difficulties
of this task in the context of the special case of the grammar induction problem, as well
as recurrent network solutions to this problem. The potential benefits of applying
neural networks to symbol processing must be tempered by the recognition that
the theoretical restrictions of the task apply to any implementational substrate.
A more detailed review of dynamical recurrent networks for symbol processing
and other tasks can be found in [19].
Acknowledgements
Stefan C. Kremer was supported by a grant from the Natural Sciences and
Engineering Research Council of Canada.
John F. Kolen was supported by subcontract number NCC2-1026 from the
National Aeronautics and Space Administration.
References
[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with
gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–
166, 1994.
[2] A. Cleeremans, D. Servan-Schreiber, and J. L. McClelland. Finite state automata
and simple recurrent networks. Neural Computation, 1(3):372–381, 1989.
[3] B. de Vries and J. M. Principe. A theory for neural networks with time delays. In
Richard P. Lippmann, John E. Moody, and David S. Touretzky, editors, Advances
in Neural Information Processing 3, pages 162–168, San Mateo, CA, 1991. Morgan
Kaufmann Publishers, Inc.
[4] J. L. Elman. Distributed representations, simple recurrent networks and grammatical structure. Machine Learning, 7(2/3):195–226, 1991.
[5] J.L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.
[6] J. A. Fodor and Z. W. Pylyshyn. Connectionism and cognitive architecture: A
critical analysis. Cognition, 28:3–71, 1988.
[7] M. L. Forcada and R. P. Ñeco. Recursive hetero-associative memories for translation. In Proceedings of the International Workshop on Artificial Neural Networks
IWANN’97 (Lanzarote, Spain, June, 1997), pages 453–462, 1997.
[8] P. Frasconi, M. Gori, and G. Soda. Recurrent networks for continuous speech
recognition. In Computational Intelligence 90, pages 45–56. Elsevier, September
1990.
Dynamical Recurrent Networks for Sequential Data Processing
121
[9] K.-S. Fu. Syntactic Pattern Recognition and Applications. Prentice-Hall, Inc.,
Englewood Cliffs, NJ, 1982.
[10] C. L. Giles, B.G. Horne, and T. Lin. Learning a class of large finite state machines
with a recurrent neural network. Neural Networks, 8(9):1359–1365, 1995.
[11] C.L. Giles, G.Z. Sun, H.H. Chen, Y.C. Lee, and D. Chen. Higher order recurrent
networks & grammatical inference. In D. S. Touretzky, editor, Advances in Neural
Information Processing Systems 2, pages 380–387, San Mateo, CA, 1990. Morgan
Kaufmann Publishers.
[12] E. M. Gold. Language identification in the limit. Information and Control, 10:447–
474, 1967.
[13] J. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master’s
thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität
München, 1991.
[14] J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages,
and Computation. Addison–Wesley, 1979.
[15] J. Kilian and H.T. Siegelmann. On the power of sigmoid neural networks. In Proceedings of the Sixth ACM Workshop on Computational Learning Theory, pages
137–143. ACM Press, 1993.
[16] Z. Kohavi. Switching and Finite Automata Theory. McGraw-Hill, Inc., New York,
NY, second edition, 1978.
[17] J. F. Kolen. Exploring the Computational Capabilities of Recurrent Neural Networks. PhD thesis, Ohio State University, 1994.
[18] J. F. Kolen. The origin of clusters in recurrent network state space. In Proceedings
of the Sixteenth Annual Conference of the Cognitive Science Society, pages 508–
513, Hillsdale, NJ, 1994. Erlbaum.
[19] J. F. Kolen and S. C. Kremer, editors. A Field Guide to Dynamical Recurrent
Networks. IEEE Press, 2000.
[20] J.F. Kolen and J.B. Pollack. The paradox of observation and the observation of
paradox. Journal of Experimental and Theoretical Artificial Intelligence, 7:275–
277, 1995.
[21] S. C. Kremer. Spatio-temporal connectionist networks: A taxonomy and review.
submitted.
[22] S. C. Kremer. On the computational power of Elman-style recurrent networks.
IEEE Transactions on Neural Networks, 6(4):1000–1004, 1995.
[23] S. C. Kremer. Comments on ‘constructive learning of recurrent neural networks:
Limitations of recurrent cascade correlation and a simple solution’. IEEE Transactions on Neural Networks, 7(4):1049–1051, July 1996.
[24] S. C. Kremer. Finite state automata that recurrent cascade-correlation cannot
represent. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo,
editors, Advances in Neural Information Processing Systems 8, pages 612–618.
MIT Press, 1996.
[25] S. C. Kremer. Identification of a specific limitation on local-feedback recurrent
networks acting as Mealy-Moore machines. IEEE Transactions on Neural Networks, 10(2):433–438, March 1999.
[26] T. Lin, B. G. Horne, P. Tiño, and C. L. Giles. Learning long-term dependencies
in NARX recurrent neural networks. IEEE Transactions on Neural Networks,
7(6):1329–1338, 1996.
[27] M. C. Mozer. Neural net architectures for temporal sequence processing. In A.S.
Weigend and N.A. Gershenfeld, editors, Time Series Prediction, pages 243–264.
Addison–Wesley, 1994.
[28] K. S. Narendra and K. Parthasarathy. Identification and control of dynamical
systems using neural networks. IEEE Transactions on Neural Networks, 1(1):4–
27, 1990.
[29] K. S. Narendra and K. Parthasarathy. Gradient methods for the optimization
of dynamical systems containing neural networks. IEEE Transactions on Neural
Networks, 2:252–262, March 1991.
[30] J. B. Pollack. On Connectionist Models of Natural Language Processing. PhD
thesis, Computer Science Department of the University of Illinois at Urbana-Champaign, Urbana, Illinois, 1987.
[31] J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46:77–
105, 1990.
[32] J. B. Pollack. The induction of dynamical recognizers. Machine Learning, 7:227–
252, 1991.
[33] J.B. Pollack. Implications of recursive distributed representations. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 527–
536, San Mateo, CA, 1989. Morgan Kaufmann.
[34] D. Rumelhart, G. Hinton, and R. Williams. Learning internal representations by
error propagation. In J. L. McClelland, D. E. Rumelhart, and the PDP Research
Group, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1: Foundations, pages 318–364. MIT Press, Cambridge,
MA, 1986.
[35] J. H. Schmidhuber. A fixed size storage O(n³) time complexity learning algorithm
for fully recurrent continually running networks. Neural Computation, 4(2):243–
248, 1992.
[36] T. J. Sejnowski and C. R. Rosenberg. NETtalk: a parallel network that learns
to read aloud. In J.A. Anderson and E. Rosenfeld, editors, Neurocomputing:
Foundations of Research, pages 663–672. MIT Press, 1988.
[37] H. T. Siegelmann, B. G. Horne, and C. L. Giles. Computational capabilities of recurrent NARX neural networks. IEEE Transactions on Systems, Man and Cybernetics,
1997. In press.
[38] H.T. Siegelmann and E.D. Sontag. On the computational power of neural nets.
Journal of Computer and System Sciences, 50(1):132–150, 1995.
[39] A. Waibel. Consonant recognition by modular construction of large phonemic
time-delay neural networks. In D.Z. Anderson, editor, Neural Information Processing Systems, pages 215–223, New York, NY, 1988. American Institute of Physics.
[40] R. J. Williams and D. Zipser. A learning algorithm for continually running fully
recurrent neural networks. Neural Computation, 1(2):270–280, 1989.
Fuzzy Knowledge and Recurrent Neural
Networks: A Dynamical Systems Perspective⋆
Christian W. Omlin¹, Lee Giles²,³, and K.K. Thornber²
¹ Department of Computer Science, University of Stellenbosch, South Africa
² NEC Research Institute, Princeton, NJ 08540
³ UMIACS, U. of Maryland, College Park, MD 20742
E-mail: omlin@cs.sun.ac.za, {giles,karvel}@research.nj.nec.com
Abstract. Hybrid neuro-fuzzy systems - the combination of artificial
neural networks with fuzzy logic - are becoming increasingly popular.
However, neuro-fuzzy systems need to be extended for applications which
require context (e.g., speech, handwriting, control). Some of these applications can be modeled in the form of finite-state automata. This chapter presents a synthesis method for mapping fuzzy finite-state automata
(FFAs) into recurrent neural networks. The synthesis method requires
FFAs to undergo a transformation prior to being mapped into recurrent
networks. Their neurons have a slightly enriched functionality in order to
accommodate a fuzzy representation of FFA states. This allows fuzzy parameters of FFAs to be directly represented as parameters of the neural
network. We present a proof of the stability of the fuzzy finite-state dynamics
of constructed neural networks, and give empirical validation of the proofs
through simulations.
1 Introduction
1.1 Preliminaries
We all use fuzzy concepts in our everyday lives. We say a car is traveling ‘very
slowly’ or ‘fast’ without specifying the exact speed in miles per hour. Fuzzy set
theory makes it possible to define such linguistic quantities mathematically, and to carry
out computations with vague information, such as fuzzy reasoning. It was introduced as an alternative to traditional set theory to provide a calculus for
reasoning under uncertainty [1]. Whereas the latter specifies crisp set membership, i.e. an object either is or is not an element of a set, the former allows for
a graded membership function which can be interpreted as vague or uncertain
information. Fuzzy logic has been very successful in a variety of applications
[2,3,4,5,6,7,8,9,10].
⋆ This chapter contains material reprinted from IEEE Transactions on Fuzzy Systems,
Vol. 6, No. 1, p. 76-89, © 1998, Institute of Electrical and Electronics Engineers, and
from Proceedings of the IEEE, Vol. 87, No. 9, p. 1623-1640, © 1999, Institute of
Electrical and Electronics Engineers, by permission.
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 123–143, 2000.
© Springer-Verlag Berlin Heidelberg 2000
Fig. 1. Numbers as Fuzzy Sets: Numbers are defined as fuzzy sets with a triangular
membership function µ(x). The fuzzy sets are Zero (ZE), small negative (SN), small
positive (SP), medium negative (MN), medium positive (MP), large negative (LN),
and large positive (LP). The membership function µ(x) defines to what degree points
on the x-axis belong to these fuzzy sets.
An example of fuzzy sets for real numbers is shown in Fig. 1. It is impossible
to say that the number one is a small positive number but that the number two
is not. However, we can define a degree to which numbers belong to the set of
small positive numbers. Since the number two is larger than the number one, it is
plausible to define the number two to belong to the set of small positive numbers
to a lesser degree than the number one. Fuzzy logic has been particularly successful in applications where linguistic variables often have clear physical meaning,
thus making it possible to incorporate rule-based and linguistic information in
a systematic way. Then, powerful algorithms for training neural networks can
be applied to refine the parameters of fuzzy systems, thus resulting in adaptive
fuzzy systems.
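The graded membership just described is easy to make concrete. The sketch below implements a triangular membership function µ(x) as in Fig. 1; the breakpoints for "small positive" are our assumptions, chosen so that, as the text argues, the number two belongs to the set to a lesser degree than the number one.

```python
# Sketch of a triangular membership function mu(x) in the style of Fig. 1.
# The breakpoints are assumed for illustration, with the peak placed near
# zero so that larger numbers are "small positive" to a lesser degree.

def triangular(x, left, peak, right):
    """Membership rising linearly from `left` to `peak`, falling to `right`."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

small_positive = lambda x: triangular(x, 0.0, 1.0, 10.0)   # assumed breakpoints

print(small_positive(1.0))    # full membership at the assumed peak
print(small_positive(2.0))    # positive, but a lesser degree than for 1
print(small_positive(-3.0))   # not a small positive number at all
```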
Fuzzy systems and feedforward neural networks share many similarities.
Among other common characteristics, they are computationally equivalent. It
has been shown in [11] that fuzzy systems of the form
IF x is Ai AND y is Bi THEN z is Ci
where Ai , Bi , and Ci are fuzzy sets with triangular membership functions, are
universal approximators:
Theorem 1. For any given real-valued continuous function Γ defined on a compact set, there exists a fuzzy logic system whose output approximates Γ with
arbitrary accuracy.
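Theorem 1 can be illustrated with a one-input construction of our own (a sketch, not the proof in [11]): triangular sets placed on a grid with singleton consequents at the grid values make the weighted-average defuzzified output a piecewise-linear interpolant of Γ, so the approximation error shrinks with the grid spacing.

```python
import math

# Our one-input illustration of Theorem 1 (not the construction in [11]).
# Rule i reads "IF x is A_i THEN z is c_i", with triangular set A_i centred
# at grid point x_i and singleton consequent c_i = f(x_i). Weighted-average
# defuzzification then interpolates f between grid points.

def fuzzy_approximator(f, n):
    centers = [i / n for i in range(n + 1)]
    consequents = [f(c) for c in centers]
    h = 1.0 / n
    def tri(x, c):                         # triangular membership, width h
        return max(0.0, 1.0 - abs(x - c) / h)
    def system(x):
        w = [tri(x, c) for c in centers]   # rule firing strengths
        return sum(wi * zi for wi, zi in zip(w, consequents)) / sum(w)
    return system

target = lambda x: math.sin(3.0 * x)
approx = fuzzy_approximator(target, 100)
err = max(abs(target(i / 500) - approx(i / 500)) for i in range(501))
print(f"max error with 101 rules on [0, 1]: {err:.2e}")
```

Refining the grid (larger n) drives the error toward zero, which is the content of the theorem for this one-dimensional case.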
It has been shown in [12] that recurrent neural networks are computationally at
least as powerful as Turing machines; whether or not these results also apply to
recurrent fuzzy systems remains an open question.
While the methodologies underlying fuzzy systems and neural networks are quite
different, their functional forms are often similar. The development of learning
algorithms for neural networks has been beneficial to the field of fuzzy systems
which adopted some learning algorithms; there exist backpropagation training
algorithms for fuzzy logic systems which are similar to those for neural networks
[13,14]. However, most of the proposed combined architectures are only able to
process static input-output relationships; they are not able to process temporal
input sequences of arbitrary length.
Fig. 2. Fuzzy Feedforward Neural Network: An example of initializing a feedforward neural network with fuzzy linguistic rules (solid directed connections). Additional weights may be added for knowledge refinement or revision through
training (dashed directed connections). They may be useful for unlearning incorrect
prior knowledge or learning unknown rule dependencies. Typically, radial basis functions of the form e^{−((x−a)/b)²} are used to represent fuzzy sets (layers 2 and 4); sigmoid
discriminant functions of the form 1/(1 + e^{−x}) are appropriate for computing the conjunction of antecedents of a rule (layer 3).
In some cases, neural networks can be structured based on the principles of
fuzzy logic [15,16]. Neural network representations of fuzzy logic interpolation
have also been used within the context of reinforcement learning [17]. Consider
the following set of linguistic fuzzy rules:
IF (x1 is A1 ) AND (x2 is A3 ) THEN C1 .
IF ((x1 is A2 ) AND (x2 is A4 )) OR
((x1 is A1 ) AND (x2 is A3 )) THEN C2 .
IF (x1 is A2 ) AND (x2 is A4 ) THEN C3 .
where Ai and Cj are fuzzy sets and xk are linguistic input variables. A possible mapping of such rules into a feedforward neural network is shown in Fig. 2.
These mappings are typical for control applications. The network has an input
layer (layer 1) consisting of real-valued input variables x1 and x2 (e.g., linguistic
variables), a fuzzification layer (layer 2) which maps input values xi to fuzzy sets
Ai , an interpolation layer (layer 3) which computes the conjunction of all antecedent conditions in a rule (e.g., a differentiable softmin operation), a layer which
combines the contributions from all fuzzy control rules in the rule base (layer 4),
126
C.W. Omlin, L. Giles, and K.K. Thornber
and a defuzzification layer (layer 5) which computes the final output (e.g. mean
of maximum method). Thus, fuzzy neural networks play the role of fuzzy logic
interpolation engines. 1 The rules are then tuned using a training algorithm for
feedforward neural networks.
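The five-layer forward pass described above can be sketched as follows. This is a hedged illustration: the membership centres and widths, the rule table, and the consequent centres are invented for this example, and a simple weighted average stands in for the defuzzification step; a real system would initialize these from its linguistic rules and then refine them by training.

```python
import math

# Hedged sketch of the five-layer forward pass around Fig. 2. All numeric
# parameters and the rule table are assumptions made for illustration only.

def rbf(x, a, b):                        # layers 2/4: fuzzy set e^{-((x-a)/b)^2}
    return math.exp(-((x - a) / b) ** 2)

def softmin(values, k=10.0):             # layer 3: differentiable conjunction
    num = sum(v * math.exp(-k * v) for v in values)
    return num / sum(math.exp(-k * v) for v in values)

sets = {"A1": (0.0, 1.0), "A2": (1.0, 1.0), "A3": (0.0, 1.0), "A4": (1.0, 1.0)}
rules = [(("A1", "A3"), 0.0),            # IF x1 is A1 AND x2 is A3 THEN C1
         (("A2", "A4"), 1.0)]            # IF x1 is A2 AND x2 is A4 THEN C3

def forward(x1, x2):
    strengths = [softmin([rbf(x1, *sets[l1]), rbf(x2, *sets[l2])])
                 for (l1, l2), _ in rules]
    total = sum(strengths)               # layer 4: combine rule contributions
    return sum(s * c for s, (_, c) in zip(strengths, rules)) / total  # layer 5

print(forward(0.0, 0.0), forward(1.0, 1.0))
```

Inputs near (0, 0) fire the first rule most strongly and pull the output toward its consequent, while inputs near (1, 1) favour the second rule.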
2 Fuzzy Finite-State Automata
2.1 Motivation
There exist applications where the variables of linguistic rules are recursive, i.e.,
the rules are of the form
IF (x(t − 1) is α) AND (u(t − 1) is β) THEN x(t) is γ
where u(t − 1) and x(t − 1) represent previous input and state variables, respectively. The value of the state variable x(t) depends on both the previous
input u(t − 1) and the previous state x(t − 1). Clearly, feedforward neural networks generally do not have the computational capabilities to represent such
recursive rules when the depth of the recursion is not known a priori. Recurrent
neural networks, on the other hand, are capable of representing recursive rules.
Recurrent neural networks have the ability to store information over indefinite
periods of time, can develop ‘hidden’ states through learning, and are thus potentially useful for representing recursive linguistic rules. Since a large class of
problems where the current state depends on both the current input and the
previous state can be modeled by finite-state automata, it is reasonable to investigate whether recurrent neural networks can also represent fuzzy finite-state
automata (FFAs).
2.2 Significance of Fuzzy Automata
Fuzzy finite-state automata (FFAs) have a long history [20,21]; the fundamentals of FFAs have been discussed in [22] without presenting a systematic machine
synthesis method. Their potential as design tools for modeling a variety of systems is beginning to be exploited. Such systems have two major characteristics:
(1) the current state of the system depends on past states and current inputs,
and (2) the knowledge about the system’s current state is vague or uncertain.
Fuzzy automata have been found to be useful in a variety of applications such
as in the analysis of X-rays [23], in digital circuit design [24], and in the design of
intelligent human-computer interfaces [25]. Neural network implementations of
¹ The term fuzzy inference is also used to describe the function of a fuzzy neural
network. We choose the term fuzzy logic interpolation in order to distinguish between
the function of fuzzy neural networks and fuzzy logic inference where the objective
is to obtain some properties of fuzzy sets B1 , B2 , . . . from properties of fuzzy sets
A1 , A2 , . . . with the help of an inference scheme A1 , A2 , . . . → B1 , B2 , . . . which is
governed by a set of rules [18,19].
fuzzy automata have been proposed in the literature [26,27,28,29]. The synthesis method proposed in [26] uses digital design technology to implement fuzzy
representations of states and outputs. In [29], the implementation of a Moore
machine with fuzzy inputs and states is realized by training a feedforward network explicitly on the state transition table using a modified backpropagation
algorithm. From a control perspective, fuzzy finite-state automata have been
shown to be useful for modeling fuzzy dynamical systems, often in conjunction
with recurrent neural networks [30,31,32,33]. There has been a lot of interest in
learning, synthesis, and extraction of finite-state automata in recurrent neural
networks [34,35,36,37,38,39,40,41]. In contrast to deterministic finite-state automata (DFAs), a set of FFA states can be occupied to varying degrees at any point
in time; this fuzzification of states generally reduces the size of the model, and
the dynamics of the system being modeled is often more accessible to a direct
interpretation.
We have previously shown how fuzzy finite-state automata (FFAs) can be
mapped into recurrent neural networks with second-order weights using a crisp
representation of FFA states [42]. That encoding required a transformation of a
FFA into a DFA which computes the membership functions for strings; it is only
applicable to a restricted class of FFAs which have final states. The transformation of a fuzzy automaton into an equivalent deterministic acceptor generally
increases the size of the automaton and thus the network size. Furthermore, the
fuzzy transition memberships of the original FFA undergo modifications in the
transformation of the original FFA into an equivalent DFA which is suitable for
implementation in a second-order recurrent neural network. Thus, the direct correspondence between system and network parameters is lost, which may obscure
the natural fuzzy description of systems being modeled.
Here, we present a method for encoding FFAs using a fuzzy representation
of states. The objectives of the FFA encoding algorithm are (1) ease of encoding
FFAs into recurrent networks, (2) the direct representation of “fuzziness”, i.e.
the fuzzy memberships of individual transitions in FFAs are also parameters in
the recurrent networks, and (3) achieving a fuzzy representation by making only
minimal changes to the underlying architecture used for encoding DFAs (and
crisp FFA representations).
Representation of FFAs in recurrent networks requires that the internal representation of FFA states and state transitions be stable for indefinite periods
of time. The proofs of representational properties of AI and machine learning
structures are important for a number of reasons. Many users of a model want
guarantees about what it can theoretically do, i.e. its performance and capabilities; others need this for use justification and acceptance. The capability of
representing DFAs in recurrent networks can be viewed as a foundation for the
problem of learning DFAs from examples (if a network cannot represent DFAs,
then it certainly will have difficulty in learning them). A stable encoding of knowledge means that the model will give the correct answer (string membership
in this case) independent of when the system is used or how long it is used. This
can lead to robustness that is noise independent.
2.3 Formal Definition
In this section, we give a formal definition of FFAs [43] and illustrate the definition with an example.
Definition 1. A fuzzy finite-state automaton (FFA) M is defined by an alphabet Σ = {a1 , . . . , am }, a set of states Q = {q1 , . . . , qn }, a fuzzy start state
R ∈ Q², a finite output alphabet Z, a fuzzy transition map δ : Σ × Q × [0, 1] → Q,
and an output map ω : Q → Z.
Weights θijk ∈ [0, 1] define the ‘fuzziness’ of state transitions, i.e. a FFA can
simultaneously be in different states with different degrees of certainty. The
particular output mapping depends on the nature of the application. Since our
goal is to construct a fuzzy representation of FFA states and their stability over
time, we will ignore the output mapping ω for the remainder of this discussion,
and not concern ourselves with the language L(M ) defined by M . For a possible
definition, see [43]. An example of a FFA over the input alphabet {0, 1} is shown
in Fig. 3.
Fig. 3. Example of a Fuzzy Finite-State Automaton: A fuzzy finite-state automaton is shown with weighted state transitions. State 1 is the automaton’s start state.
A transition from state qj to qi on input symbol ak with weight θ is represented as a
directed arc from qj to qi labeled ak /θ. Note that transitions from states 1 and 4 on
input symbols ‘0’ are fuzzy (δ(1, 0, .) = {2, 3} and δ(4, 0, .) = {2, 3}).
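A small data-structure sketch can make the fuzzy transition map tangible. The fuzzy targets {2, 3} on input '0' from states 1 and 4 follow the caption of Fig. 3, but the individual weights below are our assumptions, and the max-min propagation rule is one common choice; the chapter itself defers the string semantics of FFAs to [43].

```python
# Toy FFA as a dictionary-based transition table. Fuzzy targets {2, 3} on
# input '0' from states 1 and 4 follow the Fig. 3 caption; the weights and
# the max-min propagation rule are assumptions made for illustration.

transitions = {
    # (source state, input symbol): [(target state, weight theta), ...]
    (1, "0"): [(2, 0.5), (3, 0.7)],   # fuzzy: two targets at once
    (2, "1"): [(1, 0.2)],
    (3, "0"): [(4, 0.1)],
    (4, "0"): [(2, 0.9), (3, 0.6)],   # fuzzy: two targets at once
}

def step(memberships, symbol):
    """New membership of q: max over incoming transitions of
    min(source membership, transition weight)."""
    nxt = {}
    for q, mu in memberships.items():
        for q2, theta in transitions.get((q, symbol), []):
            nxt[q2] = max(nxt.get(q2, 0.0), min(mu, theta))
    return nxt

state = {1: 1.0}                      # crisp start state q1
for symbol in "00":
    state = step(state, symbol)
print(state)
```

After reading "00", states 2 and 3 are first occupied to degrees 0.5 and 0.7, and then only state 4 remains occupied, to degree 0.1.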
² In general, the start state of a FFA is fuzzy, i.e. it consists of a set of states that
are occupied with varying memberships. It has been shown that a restricted class of
FFAs whose initial state is a single crisp state is equivalent to the class of FFAs
described in Definition 1 [43]. The distinction between the two classes of FFAs is
irrelevant in the context of this chapter.
3 Representation of Fuzzy States
3.1 Preliminaries
The current fuzzy state of a FFA M is a collection of states {qi } of M which
are occupied with different degrees of fuzzy membership. A fuzzy representation of FFA states thus requires knowledge about the membership with which
each state qi is occupied. This requirement dictates the representation of the
current fuzzy state in a recurrent neural network. Since the method for encoding FFAs in recurrent neural networks is a generalization of the method for
encoding DFAs, we will briefly discuss the DFA encoding algorithm. For DFA
encodings, we used discrete-time, second-order recurrent neural networks with
sigmoidal discriminant functions which update their current state according to
the following equations:
S_i^{(t+1)} = g(\alpha_i(t)) = \frac{1}{1 + e^{-\alpha_i(t)}}, \qquad \alpha_i(t) = b_i + \sum_{j,k} W_{ijk} S_j^{(t)} I_k^{(t)}    (1)

where b_i is the bias associated with hidden recurrent state neuron S_i, W_{ijk} is a
second-order weight, and I_k denotes the input neuron for symbol a_k. The indices
i, j, and k run over all state and input neurons, respectively. The product S_j^{(t)} I_k^{(t)}
directly corresponds to the state transition δ(q_j, a_k) = q_i.
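The second-order update of Eq. (1) is compact to implement. In the sketch below the network sizes and weight values are arbitrary placeholders; a constructed network would instead program W and b as described in Section 3.3.

```python
import numpy as np

# Sketch of the second-order update in Eq. (1): neuron i receives b_i plus
# the sum over all (j, k) of W_ijk * S_j * I_k. Sizes and random values are
# placeholders; a constructed network programs W with +/-H and b with -H/2.

rng = np.random.default_rng(0)
n_states, n_symbols = 4, 2
W = rng.normal(size=(n_states, n_states, n_symbols))   # W[i, j, k]
b = rng.normal(size=n_states)                          # biases b_i

def update(S, I):
    alpha = b + np.einsum("ijk,j,k->i", W, S, I)       # alpha_i(t)
    return 1.0 / (1.0 + np.exp(-alpha))                # sigmoid g

S = np.full(n_states, 0.1)     # initial state-neuron activations
I = np.array([1.0, 0.0])       # one-hot input: symbol a_1
S = update(S, I)
print(S)
```

With a one-hot input vector, the einsum reduces to selecting the weight matrix W[:, :, k] for the current symbol, which is exactly how the product S_j I_k picks out the transition on a_k.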
DFAs can be encoded in discrete-time, second-order recurrent neural networks with sigmoidal discriminant functions such that the DFA and constructed
network accept the same regular language [44]. The desired finite-state dynamics are encoded into a network by programming a small subset of all available
weights to values +H and −H; this leads to a nearly orthonormal internal DFA
state representation for sufficiently large values of H, i.e. a one-to-one correspondence between current DFA states and recurrent neurons with a high output.
Since the magnitudes of all weights in a constructed network are equal to H, the
equation governing the dynamics of a constructed network is of the special form

S_i^{(t+1)} = g(x, H) = \frac{1}{1 + e^{H(1-2x)/2}}    (2)

where x is the input to neuron S_i.
3.2 Recurrent State Neurons with Variable Output Range
We extend the functionality of recurrent state neurons in order to represent fuzzy
states. The main difference between the neuron discriminant function for DFAs
and FFAs is that the neuron now receives as inputs the weight strength H, the
signal x which represents the collective input from all other neurons, and the
transition weight θijk where δ(qj , ak , θijk ) = qi :
(t+1)
Si
= g̃(x, H, θijk ) =
θijk
1 + eH(θijk −2x)/2θijk
(3)
The value of θijk is different for each of the states that collectively make up the
current fuzzy network state. This is consistent with the definition of FFAs.
130
C.W. Omlin, L. Giles, and K.K. Thornber
Compared to the discriminant function g(.) for the encoding of DFAs, the
weight H which programs the network state transitions is strengthened by a
factor 1/θijk (0 < θijk ≤ 1); the range of the function g̃(.) is squashed to the
interval [0, θijk ], and it has been shifted towards the origin. Setting θijk = 1
reduces the function (3) to the sigmoidal discriminant function (2) used for DFA
encoding. More formally, the function g̃(x, H, θ) has the following important
invariant property which will later simplify the analysis:
Lemma 1. g̃(θx, H, θ) = θ g̃(x, H, 1) = θ g(x, H).
Thus, g̃(x, H, θ) can be obtained by scaling g̃(x, H, 1) = g(x, H) uniformly in the x- and
y-directions by a factor θ. This invariant property of g̃ allows a stability analysis
of the internal FFA state representation similar to the analysis of the stability
of the internal DFA state representation to be carried out.
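Lemma 1 is straightforward to verify numerically. The check below uses g and g̃ as defined in Eqs. (2) and (3); the particular values of x, H, and θ are arbitrary sample points.

```python
import math

# Numerical check of Lemma 1: g~(theta*x, H, theta) == theta * g(x, H),
# with g and g~ as in Eqs. (2) and (3). Sample points are arbitrary.

def g(x, H):
    return 1.0 / (1.0 + math.exp(H * (1.0 - 2.0 * x) / 2.0))

def g_tilde(x, H, theta):
    return theta / (1.0 + math.exp(H * (theta - 2.0 * x) / (2.0 * theta)))

H = 8.0
for theta in (0.2, 0.5, 1.0):
    for x in (-0.5, 0.0, 0.3, 1.0):
        assert abs(g_tilde(theta * x, H, theta) - theta * g(x, H)) < 1e-12
print("Lemma 1 holds on all sample points")
```

Algebraically this is immediate: substituting θx into g̃ cancels θ in the exponent, H(θ − 2θx)/(2θ) = H(1 − 2x)/2, leaving θ times g(x, H).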
3.3 Programming Fuzzy State Transitions
Consider state qj of FFA M and the fuzzy state transition δ(qj , ak , {θijk }) =
{qi1 , . . . , qir }. We assign recurrent state neuron Sj to FFA state qj and neurons
Si1 . . . Sir to FFA states qi1 . . . qir . The basic idea is as follows: The activation
of recurrent state neuron Si represents the certainty θijk with which some state
transition δ(qj , ak , θijk ) = qi is carried out, i.e. Sit+1 ≃ θijk . If qi is not reached
at time t+1, then we have Sit+1 ≃ 0. Thus, we program the second-order weights
Wijk as follows:
W_{ijk} = \begin{cases} +H & \text{if } q_i \in \delta(q_j, a_k, \theta_{ijk}) \\ 0 & \text{otherwise} \end{cases}    (4)

W_{jjk} = \begin{cases} +H & \text{if } q_j \in \delta(q_j, a_k, \theta_{jjk}) \\ -H & \text{otherwise} \end{cases}    (5)

b_i = -H/2 \text{ if } q_i \in M.    (6)
Setting Wijk to a large positive value will ensure that Si(t+1) will be arbitrarily
close to θijk, and setting Wjjk to a large negative value will guarantee that the
output Sj(t+1) will be arbitrarily close to 0. This is the same technique used for
programming DFA state transitions in recurrent networks [44] and for encoding
partial prior knowledge of a DFA for rule refinement [45].
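The claim that outputs can be made arbitrarily close to θijk (or 0) can be sketched numerically with the discriminant g̃ of Eq. (3): as H grows, a neuron receiving a full-strength signal x = θ saturates toward θ, while a neuron receiving no signal (x = 0) decays toward 0. The values of θ and H below are arbitrary.

```python
import math

# Numerical sketch of the claim above (theta and H are arbitrary): with the
# discriminant g~ of Eq. (3), large H drives a target neuron toward theta_ijk
# (input x = theta) and an undriven neuron toward 0 (input x = 0).

def g_tilde(x, H, theta):
    return theta / (1.0 + math.exp(H * (theta - 2.0 * x) / (2.0 * theta)))

theta = 0.7
for H in (5.0, 10.0, 20.0, 40.0):
    on = g_tilde(theta, H, theta)    # neuron receiving a full-strength signal
    off = g_tilde(0.0, H, theta)     # neuron receiving no supporting signal
    print(f"H={H:>5}: on={on:.6f} (target {theta}), off={off:.6f}")
```

At x = θ the exponent collapses to −H/2 regardless of θ, so the gap to θ shrinks like e^{−H/2}; symmetrically, at x = 0 the output shrinks like e^{−H/2} toward zero.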
4 Automata Transformation
4.1 Preliminaries
The above encoding algorithm leaves open the possibility of ambiguities when
a FFA is encoded in a recurrent network, as follows: Consider two FFA states qj
and ql with transitions δ(qj , ak , θijk ) = δ(ql , ak , θilk ) = qi where qi is one of all
successor states reached from qj and ql , respectively, on input symbol ak . Further
assume that qj and ql are members of the set of current FFA states (i.e., these
states are occupied with some fuzzy membership). Then, the state transition
δ(qj , ak , θijk ) = qi requires that recurrent state neuron Si have dynamic range
[0, θijk ] while state transition δ(ql , ak , θilk ) = qi requires that state neuron Si
asymptotically approach θilk. For θijk ≠ θilk, we have an ambiguity in the output
range of neuron Si:
Definition 2. We say an ambiguity occurs at state qi if there exist two states
qj and ql with δ(qj , ak , θijk ) = δ(ql , ak , θilk ) = qi and θijk ≠ θilk. A FFA M is
called ambiguous if an ambiguity occurs for any state qi ∈ M .
That ambiguity could be resolved by testing all possible paths through the FFA
and identifying those states for which the above-described ambiguity can occur.
However, such an endeavor is computationally expensive. Instead, we propose to
resolve the ambiguity by transforming any FFA M .
4.2 Transformation Algorithm
Before we state the transformation theorem, and give the algorithm, it will be
useful to define the concept of equivalent FFAs:
Definition 3. Consider a FFA M that is processing some string s = σ1 σ2 . . . σL
with σi ∈ Σ. As M reads each symbol σi , it makes simultaneous weighted state
transitions Σ ×Q×[0, 1] according to the fuzzy transition map δ(qj , ak , θijk ) = qi .
The set of distinct weights {θijk } of the fuzzy transition map at time t is called
the active weight set.
Note that the active weight set can change with each symbol σi processed by
M . We will define what it means for two FFAs to be equivalent:
Definition 4. Two FFAs M and M ′ with alphabet Σ are called equivalent if
their active weight sets are at all times identical for any string s ∈ Σ ∗ .
We have proven the following theorem [46]:
Theorem 2. Any FFA M can be transformed into an equivalent, unambiguous
FFA M ′ .
The trade-off for making the resolution of ambiguities computationally feasible
is an increase in the number of FFA states. We have proven the correctness of
the algorithm in [46]. Here, we illustrate the algorithm with an example.
4.3 Example of FFA Transformation
Consider the FFA shown in Fig. 4a with four states and input alphabet Σ =
{0, 1}; state q1 is the start state.³ The algorithm initializes the variable ‘list’
³ The FFA shown in Fig. 4a is a special case in that it does not contain any fuzzy
transitions. Since the objective of the transformation algorithm is to resolve ambiguities for states qi with δ({qj1 , . . . , qjr }, ak , {θij1 k , . . . , θijr k }) = qi , fuzziness is of
no relevance; therefore, we omitted it for reasons of simplicity.
with all FFA states, i.e., list = {q1 , q2 , q3 , q4 }. First, we notice that no ambiguity exists for input symbol ‘0’ at state q1 since there are no state transitions δ(., 0, .) = q1 . There exist two state transitions which have state q1
as their target, i.e. δ(q2 , 1, 0.2) = δ(q3 , 1, 0.7) = q1 . Thus, we set the variable
visit = {q2 , q3 }. According to Definition 2, an ambiguity exists since θ121 ≠
θ131 . We resolve that ambiguity by introducing a new state q5 and setting
δ(q3 , 1, 0.7) = q5 . Since δ(q3 , 1, 0.7) no longer leads to state q1 , we need to introduce new state transitions leading from state q5 to the target states {q} of
all possible state transitions: δ(q1 , ., .) = {q2 , q3 }. Thus, we set δ(q5 , 0, θ250 ) = q2
and δ(q5 , 1, θ351 ) = q3 with θ250 = θ210 and θ351 = θ311 . One iteration through
the outer loop thus results in the FFA shown in Fig. 4b. Consider Fig. 4d which
shows the FFA after 3 iterations. State q4 is the only state left which has incoming transitions δ(., ak , θ4.k ) = q4 where not all values θ4.k are identical. We
have δ(q2 , 0, 0.9) = δ(q6 , 0, 0.9) = q4 ; since these two state transitions do not
cause an ambiguity for input symbol ‘0’, we leave these state transitions as they
are. However, we also have δ(q2 , 0, θ420 ) = δ(q3 , 0, θ430 ) = δ(q7 , 0, θ470 ) = q4
with θ430 = θ470 ≠ θ420 = 0.9. Instead of creating new states for both state
transitions δ(q3 , 0, θ430 ) and δ(q7 , 0, θ470 ), it suffices to create one new state q8
and to set δ(q3 , 0, 0.1) = δ(q7 , 0, 0.1) = q8 . States q6 and q7 are the only possible successor states on input symbols ‘0’ and ‘1’, respectively. Thus, we set
δ(q8 , 0, 0.6) = q6 and δ(q8 , 1, 0.4) = q7 . There exist no more ambiguities and the
algorithm terminates (Fig. 4e).
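The resolution procedure illustrated above can be sketched in a few lines of code. This is an illustrative reconstruction, not the authors' implementation: the function name `resolve_ambiguities`, the tuple encoding of transitions, and the splitting policy (keep one weight class on the original state, give every other weight class a fresh clone) are our own simplifications of the algorithm described in [46].

```python
def resolve_ambiguities(transitions, states):
    # An FFA is a list of transitions (source, symbol, weight, target)
    # over integer state labels.  A target state is "ambiguous" for a
    # symbol if its incoming transitions on that symbol carry different
    # weights; we split the target by weight, cloning its outgoing
    # transitions onto each new state.
    transitions, states = list(transitions), list(states)
    changed = True
    while changed:
        changed = False
        for target in list(states):
            for symbol in {s for (_, s, _, t) in transitions if t == target}:
                incoming = [tr for tr in transitions
                            if tr[3] == target and tr[1] == symbol]
                weights = sorted({tr[2] for tr in incoming})
                if len(weights) <= 1:
                    continue  # no ambiguity for this (state, symbol) pair
                for w in weights[1:]:
                    new_state = max(states) + 1
                    states.append(new_state)
                    # Redirect the offending incoming transitions.
                    for i, tr in enumerate(transitions):
                        if tr[3] == target and tr[1] == symbol and tr[2] == w:
                            transitions[i] = (tr[0], tr[1], tr[2], new_state)
                    # Clone the outgoing transitions of the split state.
                    for (src, s, wt, t) in list(transitions):
                        if src == target:
                            transitions.append((new_state, s, wt, t))
                changed = True
    return transitions, states
```

On a toy three-state FFA in which state 3 receives weight 0.2 from state 1 and weight 0.7 from state 2 on the same symbol, the sketch introduces one new state, after which every (state, symbol) pair has a unique incoming weight, mirroring the trace through Fig. 4.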
5
Network Architecture
The architecture for representing FFAs is similar to the architecture for DFAs,
except that each neuron Si of the state transition module has a dynamical output range [0, θijk ], where θijk is the rule weight in the FFA state transition
δ(qj , ak , θijk ) = qi . Notice that each neuron Si is only connected to pairs (Sj , Ik )
for which θijk = θij ′ k , since we assume that M is transformed into an equivalent,
unambiguous FFA M ′ prior to the network construction. The weights Wijk are
programmed as described in Section 3.3. Each recurrent state neuron receives
as inputs the value Sjt and an output range value θijk ; it computes its output
according to Equation (3).
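The update just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `g_tilde` and `ffa_network_step`, the bias −H/2, and the toy weights are our assumptions based on the construction summarized here and in Theorem 3 below; Equation (3) itself is given earlier in the chapter.

```python
import math

def g_tilde(u, H, theta):
    # Sigmoidal discriminant with variable output range [0, theta],
    # consistent with the iterated map used in the stability analysis.
    return theta / (1.0 + math.exp(H * (theta - 2.0 * u) / (2.0 * theta)))

def ffa_network_step(S, k, W, H, theta):
    # One second-order update.  S: state-neuron outputs; k: index of the
    # current input symbol; W[i][j][k] in {-H, 0, +H} programmed from the
    # FFA transitions; theta[i][k]: output range of neuron i on symbol k;
    # the bias -H/2 follows the construction summarized in Theorem 3.
    n = len(S)
    return [g_tilde(-H / 2.0 + sum(W[i][j][k] * S[j] for j in range(n)),
                    H, theta[i][k])
            for i in range(n)]

# Toy example: a single transition delta(q1, a, 0.8) = q2 programmed
# with a +H weight; neuron 2 should switch high (near 0.8), neuron 1 low.
H = 10.0
W = [[[0.0], [0.0]],   # neuron 1 receives no positive weight
     [[H], [0.0]]]     # neuron 2 driven by neuron 1 on the single symbol
theta = [[0.5], [0.8]]
S1 = ffa_network_step([1.0, 0.0], 0, W, H, theta)
```

After one step, the neuron for the target state saturates near its output range θ = 0.8 while the source-state neuron falls near 0, which is exactly the fuzzy state representation the construction aims at.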
6
6.1
Network Stability Analysis
Preliminaries
In order to demonstrate how the FFA encoding algorithm achieves stability of
the internal FFA state representation for indefinite periods of time, we need to
understand the dynamics of signals in a constructed recurrent neural network.
We define stability of an internal FFA state representation as follows:
Definition 5. A fuzzy encoding of FFA states with transition weights {θijk }
in a second-order recurrent neural network is called stable if only state neurons
Fuzzy Knowledge and Recurrent Neural Networks
133
Fig. 4. Example of FFA Transformation: Transition weight ambiguities are resolved in a sequence of steps: (a) the original FFA; there exist ambiguities for all four
states; (b) the ambiguity of the transition from state 3 to state 1 on input symbol 1 is
removed by adding a new state 5; (c) the ambiguity of the transition from state 4 to state
2 on input symbol 0 is removed by adding a new state 6; (d) the ambiguity of the transition from state 4 to state 3 on input symbol 1 is removed by adding a new state 7; (e)
the ambiguity of the transitions from states 3 and 7 - both transitions have the same fuzzy
membership - to state 4 is removed by adding a new state 8.
corresponding to the set of current FFA states have an output greater than θijk /2
where θijk is the dynamic range of recurrent state neurons, and all remaining
recurrent neurons have low output signals less than θijk /2 for all possible input
sequences.
It follows from that definition that there exists an upper bound 0 < φ− < θijk /2
for low signals and a lower bound θijk /2 < φ+ < θijk for high signals in networks
that represent stable FFA encodings. The ideal values for low and high signals
are 0 and θijk , respectively.
6.2
Fixed Point Analysis for Sigmoidal Discriminant Function
Here, we summarize without proofs some of the results that we used to demonstrate stability of neural DFA encodings; details of the proofs can be found in
[44].
In order to guarantee that low signals remain low, we have to give a tight upper
bound for low signals which remains valid for an arbitrary number of time steps:
Lemma 2. The low signals are bounded from above by the fixed point [φ⁻_f]_θ of
the function

    f^0 = 0
    f^{t+1} = g̃(r · f^t)    (7)

where [φ⁻_f]_θ represents the fixed point of the discriminant function g̃(·) with
variable output range θ, and r denotes the maximum number of neurons that
contribute to a neuron’s input. For reasons of simplicity, we will write φ⁻_f for
[φ⁻_f]_θ, with the implicit understanding that the location of fixed points depends
on the particular choice of θ. This lemma can easily be proven by induction on t.
It is easy to see that the function to be iterated in Equation (7) is

    f(x, H, θ, r) = θ / (1 + e^{H(θ−2rx)/2θ}).

We will show later in this section that the conditions that guarantee the existence
of one or three fixed points are independent of the parameter θ.
The function f (x, H, θ, r) has some desirable properties:
Lemma 3. For any H > 0, the function f(x, H, θ, r) has at least one fixed point φ⁰_f.
Lemma 4. There exists a value H₀⁻(r) such that for any H > H₀⁻(r), f(x, H, θ, r)
has three fixed points 0 < φ⁻_f < φ⁰_f < φ⁺_f < θ.
+
0
Lemma 5. If f(x, H, θ, r) has three fixed points φ⁻_f, φ⁰_f, and φ⁺_f, then

    lim_{t→∞} f^t = φ⁻_f  if x^0 < φ⁰_f
                    φ⁰_f  if x^0 = φ⁰_f    (8)
                    φ⁺_f  if x^0 > φ⁰_f

where x^0 is an initial value for the iteration of f(·).
Lemma 5 can be proven by defining an appropriate Lyapunov function P and
showing that P has minima at φ⁻_f and φ⁺_f.⁴
The basic idea behind the network stability analysis is to show that neuron
outputs never exceed or fall below some fixed points φ− and φ+ , respectively.
The fixed points φ⁻_f and φ⁺_f are only valid upper and lower bounds on low and
high signals, respectively, if convergence toward these fixed points is monotone.
The following corollary establishes monotone convergence of f^t towards fixed
points:
Corollary 1. Let f^0, f^1, f^2, . . . denote the sequence computed by successive iteration of the function f. Then we have f^0 < f^1 < . . . < φ_f, where φ_f is
one of the stable fixed points of f(x, H, θ, r).
With these properties, we can quantify the value H₀⁻(r) such that for any H >
H₀⁻(r), f(x, H, θ, r) has three fixed points. The low and high fixed points φ⁻_f
and φ⁺_f are the bounds for low and high signals, respectively. The larger r, the
larger H must be chosen in order to guarantee the existence of three fixed points.
If H is not chosen sufficiently large, then f^t converges to a unique fixed point
θ/2 < φ_f < θ. The following lemma expresses a quantitative condition which
guarantees the existence of three fixed points:
Lemma 6. The function f(x, H, θ, r) = θ / (1 + e^{H(θ−2rx)/2θ}) has three fixed points
0 < φ⁻_f < φ⁰_f < φ⁺_f < θ if H is chosen such that

    H > H₀⁻(r) = 2(θ + (θ − x) log((θ − x)/x)) / (θ − x)

where x satisfies the equation

    r = θ² / (2x(θ + (θ − x) log((θ − x)/x)))
Even though the location of fixed points of the function f depends on H, r, and
θ, we will use [φf ]θ as a generic name for any fixed point of a function f .
Similarly, we can quantify high signals in a constructed network:
Lemma 7. The high signals are bounded from below by the fixed point [φ⁺_h]_θ of
the function

    h^0 = 1
    h^{t+1} = g̃(h^t − f^t)    (9)
⁴ Lyapunov functions can be used to prove the stability of dynamical systems. For
a given dynamical system S, if there exists a function P - we can think of P as
an energy function - such that P has at least one minimum, then S has a stable
state. Here, we can choose P(x_i) = (x_i − φ)² where x_i is the value of f(·) after
i iterations and φ is one of the fixed points. It can be shown algebraically that, for
x^0 ≠ φ⁰_f, P(x_i) decreases with every step of the iteration of f(·) until a stable fixed
point is reached.
Notice that the above recurrence relation couples f^t and h^t, which makes it
difficult, if not impossible, to find a function h(x, θ, r) which, when iterated, gives
the same values as h^t. However, we can bound the sequence h^0, h^1, h^2, . . . from
below by a recursively defined function p^t - i.e. ∀t : p^t ≤ h^t - which decouples h^t
from f^t:
Lemma 8. Let [φ_f]_θ denote the fixed point of the recursive function f, i.e.
lim_{t→∞} f^t = [φ_f]_θ. Then the recursively defined function p

    p^0 = 1
    p^{t+1} = g̃(p^t − [φ_f]_θ)    (10)

has the property that ∀t : p^t ≤ h^t.
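Lemma 8 can be checked numerically. The sketch below is our illustration, not part of the proof: g̃ is taken to be the sigmoid with variable output range used throughout this section, and the values H = 12, θ = 1, r = 2 are arbitrary.

```python
import math

def g(u, H, theta):
    # Sigmoidal discriminant with variable output range [0, theta].
    return theta / (1.0 + math.exp(H * (theta - 2.0 * u) / (2.0 * theta)))

H, theta, r = 12.0, 1.0, 2

# Low-signal fixed point [phi_f]_theta, obtained by iterating Eq. (7).
phi_f = 0.0
for _ in range(500):
    phi_f = g(r * phi_f, H, theta)

# Iterate f, h (Eq. 9), and the decoupled bound p (Eq. 10) in parallel,
# checking the Lemma 8 invariant p^t <= h^t at every step.
f_t, h_t, p_t = 0.0, 1.0, 1.0
ok = True
for _ in range(100):
    f_t, h_t, p_t = (g(r * f_t, H, theta),
                     g(h_t - f_t, H, theta),
                     g(p_t - phi_f, H, theta))
    ok = ok and (p_t <= h_t + 1e-12)
```

Because f^t approaches [φ_f]_θ from below, h's argument h^t − f^t always dominates p's argument p^t − [φ_f]_θ, so the bound holds at every step and both sequences settle near the high fixed point.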
Then, we have the following sufficient condition for the existence of two stable
fixed points of the function defined in Equation (9):
Lemma 9. Let the iterative function p^t have two stable fixed points and ∀t :
p^t ≤ h^t. Then the function h^t also has two stable fixed points.
The above lemma has the following corollary:
Corollary 2. A constructed network’s high signals remain stable if the sequence p^0, p^1, p^2, . . . converges towards the fixed point θ/2 < [φ⁺_p]_θ < θ.
Since we have decoupled the iterated function h^t from the iterated function f^t
by introducing the iterated function p^t, we can apply the same technique to p^t
for finding conditions for the existence of fixed points as in the case of f^t. In fact,
the function that, when iterated, generates the sequence p^0, p^1, p^2, . . . is defined
by

    p(x) = θ / (1 + e^{H(θ−2(x−[φ⁻_f]_θ))/2θ}) = θ / (1 + e^{H′(θ−2r′x)/2θ})    (11)

with

    H′ = H(1 + 2[φ⁻_f]_θ),   r′ = 1 / (1 + 2[φ⁻_f]_θ)    (12)
We can iteratively compute the value of [φp ]θ for given parameters H and r.
This results in the following lemma for high signals:
Lemma 10. The function p(x, [φ⁻_f]_θ) = θ / (1 + e^{H(θ−2(x−[φ⁻_f]_θ))/2θ}) has three fixed
points 0 < [φ⁻_p]_θ < [φ⁰_p]_θ < [φ⁺_p]_θ < θ if H is chosen such that

    H > H₀⁺(r) = 2(θ + (θ − x) log((θ − x)/x)) / ((1 + 2[φ⁻_f]_θ)(θ − x))

where x satisfies the equation

    1 / (1 + 2[φ⁻_f]_θ) = θ² / (2x(θ + (θ − x) log((θ − x)/x)))
Since there is a collection of fuzzy transition memberships θijk involved in the
algorithm for constructing fuzzy representations of FFAs, we need to determine
whether the conditions of Lemmas 6 and 10 hold true for all rule weights θijk.
The following corollary establishes a useful invariant property of the function
H₀(x, r, θ):
Corollary 3. The value of the minimum of H(x, r, θ) only depends on the value of
r and is independent of the particular values of θ:

    inf_x H(x, r, θ) = inf_x 2θ log((θ − x)/x) / (θ − 2rx) = H₀(r)    (13)
The relevance of the above corollary is that there is no need to test conditions
for all possible values of θ in order to guarantee the existence of fixed points.
We can now proceed to prove stability of low and high signals, and thus
stability of the fuzzy representation of FFA states, in a constructed recurrent
neural network.
6.3
Network Stability
The existence of two stable fixed points of the discriminant function is only a
necessary condition for network stability. We also need to establish conditions
under which these fixed points are upper and lower bounds of stable low and
high signals, respectively.
Before we define and derive the conditions for network stability, it is convenient to apply the result of Lemma 1 of Section 3.2 to the fixed points of the
sigmoidal discriminant function:
Corollary 4. For any value θ with 0 < θ ≤ 1, the fixed points [φ]_θ of the discriminant function

    θ / (1 + e^{H(θ−2rx)/2θ})

have the following invariant relationship:

    [φ]_θ = θ [φ]₁
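This scaling property can be verified directly by iteration. The sketch below is our illustration; the values H = 12, r = 2, and the tolerance are arbitrary choices.

```python
import math

def fixed_point(H, theta, r, steps=500):
    # Iterate x -> theta / (1 + e^(H(theta - 2rx)/2theta)) from x = 0,
    # which converges to the low fixed point [phi]_theta.
    x = 0.0
    for _ in range(steps):
        x = theta / (1.0 + math.exp(H * (theta - 2.0 * r * x) / (2.0 * theta)))
    return x

H, r = 12.0, 2
phi_1 = fixed_point(H, 1.0, r)      # [phi]_1
phi_half = fixed_point(H, 0.5, r)   # [phi]_0.5
# Corollary 4 predicts [phi]_0.5 = 0.5 * [phi]_1.
```

Substituting x → θx into the discriminant shows why: the exponent H(θ − 2rx)/2θ is invariant under the simultaneous scaling of θ and x, so the whole fixed-point structure simply scales with θ.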
Thus, we do not have to consider the conditions separately for all values of
θ that occur in a given FFA. We now redefine stability of recurrent networks
constructed from DFAs in terms of fixed points:
Definition 6. An encoding of DFA states in a second-order recurrent neural
network is called stable if all the low signals are less than [φ⁰_f]_θi, and all the high
signals are greater than [φ⁰_h]_θi, for all θi of all state neurons Si.
We have simplified θi.. to θi because the output of each neuron Si has a fixed
upper limit θ for a given input symbol, regardless of which neurons Sj contribute
residual inputs. We note that this new definition is stricter than the one we gave in
Definition 5. In order for the low signal to remain stable, the following condition
has to be satisfied:

    −H/2 + H r [φ⁻_f]_θj < [φ⁰_f]_θj    (14)

Similarly, the following inequality must be satisfied for stable high signals:

    −H/2 + H [φ⁺_h]_θj − H [φ⁻_f]_θi > [φ⁰_h]_θi    (15)
The above two inequalities must be satisfied for all neurons at all times. Instead
of testing for all values θijk separately, we can simplify the set of inequalities as
follows:
Lemma 11. Let θmin and θmax denote the minimum and maximum, respectively, of all fuzzy transition memberships θijk of a given FFA M. Then, inequalities (14) and (15) are satisfied for all transition weights θijk if the inequalities

    −H/2 + H r [φ⁻_f]_θmax < [φ⁰_f]_θmin    (16)

    −H/2 + H [φ⁺_h]_θmin − H [φ⁻_f]_θmax > [φ⁰_h]_θmax    (17)

are satisfied.
We can rewrite inequalities (16) and (17) as

    −H/2 + H r θmax [φ⁻_f]₁ < θmin [φ⁰_f]₁    (18)

and

    −H/2 + H θmin [φ⁺_h]₁ − H θmax [φ⁻_f]₁ > θmax [φ⁰_h]₁    (19)

Solving inequalities (18) and (19) for [φ⁻_f]₁ and [φ⁺_h]₁, respectively, we obtain
conditions under which a constructed recurrent network implements a given
FFA. These conditions are expressed in the following theorem:
Theorem 3. For some given unambiguous FFA M with n states and m input
symbols, let r denote the maximum number of transitions to any state over all
input symbols of M. Furthermore, let θmin and θmax denote the minimum and
maximum, respectively, of all transition weights θijk in M. Then, a sparse recurrent neural network with n state and m input neurons can be constructed
from M such that the internal state representation remains stable if

    (1) [φ⁻_f]₁ < (1 / (r θmax)) (θmin/2 + [φ⁰_f]₁ / H)

    (2) [φ⁺_h]₁ > (1 / θmin) (θmax ([φ⁻_f]₁ + 1/2) + [φ⁰_h]₁ / H)

    (3) H > max(H₀⁻(r), H₀⁺(r)).
Furthermore, the constructed network has at most 3mn second-order weights
with alphabet Σw = {−H, 0, +H}, n + 1 biases with alphabet Σb = {−H/2}, and
maximum fan-out 3m.
For θmin = θmax = 1, conditions (1)-(3) of the above theorem reduce to those
found for stable DFA encodings [44]. This is consistent with a crisp representation
of DFA states.
7
Simulations
In order to validate our theory, we constructed a fuzzy encoding of a randomly
generated FFA with 100 states (after the execution of the FFA transformation
algorithm) over the input alphabet {0, 1}. We randomly assigned weights in
the range [0, 1] to all transitions in increments of 0.1. The maximum indegree
was Din (M ) = r = 5. We then tested the stability of the fuzzy internal state
representation on 100 randomly generated strings of length 100 by comparing,
at each time step, the output signal of each recurrent state neuron with its ideal
output signal (since each recurrent state neuron Si corresponds to a FFA state
qi , we know the degree to which qi is occupied after input symbol ak has been
read: either 0 or θijk ). A histogram of the differences between the ideal and the
observed signal of state neurons for selected values of the weight strength H
over all state neurons and all tested strings is shown in Fig. 5. As expected,
the error decreases for increasing values of H. We observe that the number
of discrepancies between the desired and the actual neuron output decreases
‘smoothly’ for the shown values of H (almost no change can be observed for
values up to H = 6). The most significant change can be observed by comparing
the histograms for H = 9.7 and H = 9.75: The existence of significant neuron
output errors for H = 9.7 suggests that the internal FFA representation is highly
unstable. For H ≥ 9.75, the internal FFA state representation becomes stable.
This discontinuous change can be explained by observing that there exists a
critical value H₀(r) such that the number of stable fixed points also changes
discontinuously from one to two for H < H₀(r) and H > H₀(r), respectively
(see Fig. 5). The ‘smooth’ transition from large output errors to very small errors
for most recurrent state neurons (Fig. 5a-e) can be explained by observing that
not all recurrent state neurons receive the same number of residual inputs; some
neurons may not receive any residual input for some given input symbol ak at
time step t; in that case, the low signals of those neurons are strengthened to
g̃(0, H, θi.k ) ≃ 0.
8
Conclusions
Theoretical work that proves representational relationships between different
computational paradigms is important because it establishes the equivalences of
those models. Previously it has been shown that it is possible to deterministically
Fig. 5. Stability of FFA State Encoding: The histograms show the absolute neuron
output error of a network with 100 neurons that implements a randomly generated
FFA and reads 100 randomly generated strings of length 100 for different values of
the weight strength H. The error was computed by comparing, at each time step, the
actual with the desired output of each state neuron. The distributions of neuron output
signal errors are for weight strengths (a) H = 6.0, (b) H = 9.0, (c) H = 9.60, (d)
H = 9.65, (e) H = 9.70, and (f) H = 9.75.
encode fuzzy finite-state automata (FFA) in recurrent neural networks by transforming any given FFA into a deterministic acceptor which assigns string memberships [42]. In such a deterministic encoding, only the network’s classification
of strings is fuzzy, whereas the representation of states is crisp. The correspondence between FFA and network parameters - i.e. fuzzy transition memberships
and network weights, respectively - is lost in the transformation.
Here, we have demonstrated analytically and empirically that it is possible
to encode FFAs in recurrent networks without transforming them into deterministic acceptors. The constructed network directly represents FFA states with the
desired fuzziness. That representation requires (1) a slightly increased functionality of sigmoidal discriminant functions (it only requires the discriminants to
accommodate variable output range), and (2) a transformation of a given FFA
into an equivalent FFA with a larger number of states. In the proposed mapping FFA → recurrent network, the correspondence between FFA and network
parameters remains intact; this can be significant if the physical properties of
some unknown dynamic, nonlinear system are to be derived from a trained network modeling that system. Furthermore, the analysis tools and methods used
to demonstrate the stability of the crisp internal representation of DFAs carried
over and generalized to show stability of the internal FFA representation. We
speculate that other encoding methods are possible and that it remains an open
question as to which encoding methods are better.
Acknowledgments
We would like to acknowledge useful discussions with K. Bollacker, D. Handscomb, and B.G. Horne.
References
1. L. Zadeh, “Fuzzy sets,” Information and Control, vol. 8, pp. 338–353, 1965.
2. P.P. Bonissone, V. Badami, K.H. Chiang, P.S. Khedkar, K.W. Marcelle, and M.J.
Schutten, “Industrial applications of fuzzy logic at General Electric,” Proceedings
of the IEEE, vol. 83, no. 3, pp. 450–465, 1995.
3. S. Chiu, S. Chand, D. Moore, and A. Chaudhary, “Fuzzy logic for control of roll
and moment for a flexible wing aircraft,” IEEE Control Systems Magazine, vol.
11, no. 4, pp. 42–48, 1991.
4. J. Corbin, “A fuzzy logic-based financial transaction system,” Embedded Systems
Programming, vol. 7, no. 12, pp. 24, 1994.
5. L.G. Franquelo and J. Chavez, “Fasy: A fuzzy-logic based tool for analog synthesis,” IEEE Transactions on Computer-Aided Design of Integrated Circuits, vol. 15,
no. 7, pp. 705, 1996.
6. T. L. Hardy, “Multi-objective decision-making under uncertainty fuzzy logic methods,” Tech. Rep. TM 106796, NASA, Washington, D.C., 1994.
7. W. J. M. Kickert and H.R. van Nauta Lemke, “Application of a fuzzy controller
in a warm water plant,” Automatica, vol. 12, no. 4, pp. 301–308, 1976.
8. C.C. Lee, “Fuzzy logic in control systems: fuzzy logic controller,” IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-20, no. 2, pp. 404–435, 1990.
9. C.P. Pappis and E.H. Mamdani, “A fuzzy logic controller for a traffic junction,”
IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-7, no. 10, pp.
707–717, 1977.
10. X.M. Yang and G.J. Kalambur, “Design for machining using expert system and
fuzzy logic approach,” Journal of Materials Engineering and Performance, vol. 4,
no. 5, pp. 599, 1995.
11. L.-X. Wang, Adaptive Fuzzy Systems and Control: Design and Stability Analysis,
Prentice-Hall, Englewood Cliffs, NJ, 1994.
12. H.T. Siegelmann and E.D. Sontag, “On the computational power of neural nets,”
Journal of Computer and System Sciences, vol. 50, no. 1, pp. 132–150, 1995.
13. V. Gorrini and H. Bersini, “Recurrent fuzzy systems,” in Proceedings of the Third
IEEE Conference on Fuzzy Systems, 1994, vol. I, pp. 193–198.
14. L.-X. Wang, “Fuzzy systems are universal approximators,” in Proceedings of the
First International Conference on Fuzzy Systems, 1992, pp. 1163–1170.
15. P.V. Goode and M. Chow, “A hybrid fuzzy/neural systems used to extract heuristic
knowledge from a fault detection problem,” in Proceedings of the Third IEEE
Conference on Fuzzy Systems, 1994, vol. III, pp. 1731–1736.
16. C. Perneel, J.-M. Renders, J.-M. Themlin, and M. Acheroy, “Fuzzy reasoning
and neural networks for decision making problems in uncertain environments,” in
Proceedings of the Third IEEE Conference on Fuzzy Systems, 1994, vol. II, pp.
1111–1125.
17. H.R. Berenji and P. Khedkar, “Learning and fine tuning fuzzy logic controllers
through reinforcement,” IEEE Transactions on Neural Networks, vol. 3, no. 5, pp.
724–740, 1992.
18. K.K. Thornber, “The fidelity of fuzzy-logic inference,” IEEE Transactions on
Fuzzy Systems, vol. 1, no. 4, pp. 288–297, 1993.
19. K.K. Thornber, “A key to fuzzy-logic inference,” International Journal of Approximate Reasoning, vol. 8, pp. 105–121, 1993.
20. E.S. Santos, “Maximin automata,” Information and Control, vol. 13, pp. 363–377,
1968.
21. L. Zadeh, “Fuzzy languages and their relation to human and machine intelligence,”
Tech. Rep. ERL-M302, Electronics Research Laboratory, University of California,
Berkeley, 1971.
22. B. Gaines and L. Kohout, “The logic of automata,” International Journal of
General Systems, vol. 2, pp. 191–208, 1976.
23. A. Pathak and S.K. Pal, “Fuzzy grammars in syntactic recognition of skeletal
maturity from x-rays,” IEEE Transactions on Systems, Man, and Cybernetics,
vol. 16, no. 5, pp. 657–667, 1986.
24. S.I. Mensch and H.M. Lipp, “Fuzzy specification of finite state machines,” in
Proceedings of the European Design Automation Conference, 1990, pp. 622–626.
25. H. Senay, “Fuzzy command grammars for intelligent interface design,” IEEE
Transactions on Systems, Man, and Cybernetics, vol. 22, no. 5, pp. 1124–1131,
1992.
26. J. Grantner and M.J. Patyra, “Synthesis and analysis of fuzzy logic finite state
machine models,” in Proc. of the Third IEEE Conf. on Fuzzy Systems, 1994, vol. I,
pp. 205–210.
27. J. Grantner and M.J. Patyra, “VLSI implementations of fuzzy logic finite state
machines,” in Proceedings of the Fifth IFSA Congress, 1993, pp. 781–784.
28. S.C. Lee and E.T. Lee, “Fuzzy neural networks,” Mathematical Biosciences, vol.
23, pp. 151–177, 1975.
29. F.A. Unal and E. Khan, “A fuzzy finite state machine implementation based on
a neural fuzzy system,” in Proceedings of the Third International Conference on
Fuzzy Systems, 1994, vol. 3, pp. 1749–1754.
30. F.E. Cellier and Y.D. Pan, “Fuzzy adaptive recurrent counterpropagation neural
networks: A tool for efficient implementation of qualitative models of dynamic
processes,” J. Systems Engineering, vol. 5, no. 4, pp. 207–222, 1995.
31. E.B. Kosmatopoulos and M.A. Christodoulou, “Neural networks for identification
of fuzzy dynamical systems: an application to identification of vehicle highway
systems,” Tech. Rep., Dept. of Electronic and Computer Engineering, Technical
U. of Crete, 1995.
32. E.B. Kosmatopoulos and M.A. Christodoulou, “Structural properties of gradient recurrent high-order neural networks,” IEEE Transactions on Circuits and
Systems, 1995.
33. E.B. Kosmatopoulos, M.M. Polycarpou, M.A. Christodoulou, and P.A. Ioannou,
“High-order neural networks for identification of dynamical systems,” IEEE Transactions on Neural Networks, vol. 6, no. 2, pp. 422–431, 1995.
34. M.P. Casey, “The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction,” Neural Computation,
vol. 8, no. 6, pp. 1135–1178, 1996.
35. A. Cleeremans, D. Servan-Schreiber, and J. McClelland, “Finite state automata
and simple recurrent neural networks,” Neural Computation, vol. 1, no. 3, pp.
372–381, 1989.
36. J.L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, pp. 179–211,
1990.
37. P. Frasconi, M. Gori, M. Maggini, and G. Soda, “Representation of finite state
automata in recurrent radial basis function networks,” Machine Learning, vol. 23,
pp. 5–32, 1996.
38. C.L. Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun, and Y.C. Lee, “Learning
and extracting finite state automata with second-order recurrent neural networks,”
Neural Computation, vol. 4, no. 3, pp. 393–405, 1992.
39. J.B. Pollack, “The induction of dynamical recognizers,” Machine Learning, vol. 7,
no. 2/3, pp. 227–252, 1991.
40. R.L. Watrous and G.M. Kuhn, “Induction of finite-state languages using second-order recurrent networks,” Neural Computation, vol. 4, no. 3, pp. 406, 1992.
41. Z. Zeng, R.M. Goodman, and P. Smyth, “Learning finite state machines with self-clustering recurrent networks,” Neural Computation, vol. 5, no. 6, pp. 976–990,
1993.
42. C.W. Omlin, K.K. Thornber, and C.L. Giles, “Fuzzy finite-state automata can be
deterministically encoded into recurrent neural networks,” IEEE Transactions on
Fuzzy Systems, vol. 6, no. 1, pp. 76–89, 1998.
43. D. Dubois and H. Prade, Fuzzy sets and systems: theory and applications, vol. 144
of Mathematics in Science and Engineering, pp. 220–226, Academic Press, 1980.
44. C.W. Omlin and C.L. Giles, “Constructing deterministic finite-state automata in
recurrent neural networks,” Journal of the ACM, vol. 43, no. 6, pp. 937–972, 1996.
45. C.W. Omlin and C.L. Giles, “Rule revision with recurrent neural networks,” IEEE
Transactions on Knowledge and Data Engineering, vol. 8, no. 1, pp. 183–188, 1996.
46. C. Lee Giles, C.W. Omlin, and K.K. Thornber, “Equivalence in knowledge representation: Automata, recurrent neural networks, and dynamical fuzzy systems,”
Proceedings of the IEEE, vol. 87, no. 9, pp. 1623–1640, 1999.
Combining Maps and Distributed
Representations for Shift-Reduce Parsing
Marshall R. Mayberry, III and Risto Miikkulainen
Department of Computer Sciences
The University of Texas at Austin
Austin, TX, 78712
martym,risto@cs.utexas.edu
Abstract. Simple Recurrent Networks (Srns) have been widely used
in natural language processing tasks. However, their ability to handle
long-term dependencies between sentence constituents is rather limited.
Narx networks have recently been shown to outperform Srns by preserving past information in explicit delays from the network’s prior output. Determining the number of delays, however, is problematic in itself.
In this study on a shift-reduce parsing task, we demonstrate a hybrid
localist-distributed approach that yields comparable performance in a
more concise manner. A SardNet self-organizing map is used to represent the details of the input sequence in addition to the recurrent distributed representations of the Srn and Narx networks. The resulting
architectures can represent arbitrarily long sequences and are cognitively
more plausible.
1
Introduction
The subsymbolic approach (i.e. neural networks with distributed representations) to processing language is attractive for several reasons. First, it is inherently robust: the distributed representations display graceful degradation of
performance in the presence of noise, damage, and incomplete or conflicting
input [18,31]. Second, because computation in these networks is constraint-based,
the subsymbolic approach naturally combines syntactic, semantic, and thematic constraints on the interpretation of linguistic data [17]. Third, subsymbolic
systems can be lesioned in various ways and the resulting behavior is often strikingly similar to human impairments [18,19,23]. These properties of subsymbolic
systems have attracted many researchers in the hope of accounting for interesting
cognitive phenomena, such as role-binding and lexical errors resulting from memory interference and overloading, aphasic and dyslexic impairments resulting
from physical damage, and biases, defaults and expectations emerging from training history [20,19,18,24].
Since its introduction in 1990, the simple recurrent network (Srn) [7] has
become a mainstay in connectionist natural language processing tasks such as
lexical disambiguation, prepositional phrase attachment, active-passive transformation, anaphora resolution, and translation [1,4,21,34]. However, this promising
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 144–157, 2000.
c Springer-Verlag Berlin Heidelberg 2000
Combining Maps and Distributed Representations for Shift-Reduce Parsing
145
line of research has been hampered by the Srn’s inability to handle long-term
dependencies, which abound in natural language tasks, because the Srn’s hidden layer gradually loses track of earlier words in the input sentence through
memory degradation.
Another class of recurrent neural networks called Nonlinear AutoRegressive
models with eXogenous (Narx) inputs [5,22,15] has been proposed as an alternative to Srns that can better deal with long-term dependencies by latching
information earlier in a sequence. They have proven to be good at dealing with
the type of long-term dependencies that often arise in nonlinear systems such
as system identification [5], time series prediction [6], and grammatical inference [10], but have not yet been demonstrated on Nlp tasks, which typically
involve complex representations.
This paper describes a hybrid method of extending distributed recurrent networks such as the Srn and Narx with a localist representation of an input sentence. The approach is based on SardNet [11], a self-organizing map algorithm
designed to represent sequences. SardNet permits the sequence information to
remain explicit, yet generalizable in the sense that similar sequences result in
similar patterns on the map. When SardNet is coupled with Srn or Narx,
the resulting networks perform better in the shift-reduce parsing task taken up
in this study. Even with no recurrency and no explicit delays, such as in a feedforward network to which SardNet has been added, the performance is almost
as good. Moreover, we believe that SardNet is a biologically and cognitively
plausible way of representing sequences because it is a topological map and does
not impose hard memory limits. These results show that SardNet can be used
as an effective, concise, and elegant sequence memory in distributed natural
language processing architectures.
2 The Task: Shift-Reduce Parsing
The task taken up in this study, shift-reduce (SR) parsing, is one of the simplest
approaches to sentence processing that nevertheless has the potential to handle a
substantial subset of English [33]. Its basic formulation is based on the pushdown
automata for parsing context-free grammars, but it can be extended to context-sensitive grammars as well.
The parser consists of two data structures: the input buffer stores the sequence of words remaining to be read, and the partial parse results are kept
on the stack (Fig. 1). Initially the stack is empty and the entire sentence is in
the input buffer. At each step, the parser has to decide whether to shift a word
from the buffer to the stack, or to reduce one or more of the top elements of
the stack into a new element representing their combination. For example, if
the top two elements are currently NP and VP, the parser reduces them into S,
corresponding to the grammar rule S → NP VP (step 17 in Fig. 1). The process
stops when the elements in the stack have been reduced to S, and no more words
remain in the input. The reduce actions performed by the parser in this process
constitute the parse result, such as the syntactic parse tree (line 18 in Fig. 1).
146
M.R. Mayberry and R. Miikkulainen
Step | Stack | Input | Action
1 | (empty) | the | Shift
2 | the | boy | Shift
3 | the boy | who | Reduce
4 | NP[the,boy] | who | Shift
5 | NP[the,boy] who | liked | Shift
6 | NP[the,boy] who liked | the | Shift
7 | NP[the,boy] who liked the | girl | Shift
8 | NP[the,boy] who liked the girl | chased | Reduce
9 | NP[the,boy] who liked NP[the,girl] | chased | Reduce
10 | NP[the,boy] who VP[liked,NP[the,girl]] | chased | Reduce
11 | NP[the,boy] RC[who,VP[liked,NP[the,girl]]] | chased | Reduce
12 | NP[NP[the,boy],RC[who,VP[liked,NP[the,girl]]]] | chased | Shift
13 | NP[NP[the,boy],RC[who,VP[liked,NP[the,girl]]]] chased | the | Shift
14 | NP[NP[the,boy],RC[who,VP[liked,NP[the,girl]]]] chased the | cat | Shift
15 | NP[NP[the,boy],RC[who,VP[liked,NP[the,girl]]]] chased the cat | . | Reduce
16 | NP[NP[the,boy],RC[who,VP[liked,NP[the,girl]]]] chased NP[the,cat] | . | Reduce
17 | NP[NP[the,boy],RC[who,VP[liked,NP[the,girl]]]] VP[chased,NP[the,cat]] | . | Reduce
18 | S[NP[NP[the,boy],RC[who,VP[liked,NP[the,girl]]]],VP[chased,NP[the,cat]]] | | Stop
Fig. 1. Shift-Reduce Parsing a Sentence. Each step in the parse is represented by
a line from top to bottom. The current stack is at left, the input buffer in the middle,
and the parsing decision in the current situation at right. At each step, the parser either
shifts the input word onto the stack, or reduces the top two elements of the stack into
a higher-level representation, such as the boy → [the,boy] (step 3). (Phrase labels
such as “NP” and “RC” are only used in this figure to make the process clear; they
are not explicitly made a part of the actual distributed compressed representation of
the partial parse results.)
The sequential scanning process, with its incremental formation of partial representations, is a plausible cognitive model for language understanding. SR parsing is
also very efficient, and lends itself to many extensions. For example, the parse
rules can be made more context sensitive by taking more of the stack and the
input buffer into account. Also, the partial parse results may consist of syntactic
or semantic structures.
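To make the procedure concrete, a symbolic shift-reduce parser for this kind of grammar can be sketched in a few lines. This is an illustrative sketch only: the category tags and the eager reduce-whenever-possible control strategy are assumptions for the toy grammar, not the neural networks studied in this chapter.

```python
# Minimal symbolic shift-reduce parser sketch for the chapter's toy grammar.
LEXICON = {"boy": "N", "girl": "N", "dog": "N", "cat": "N",
           "liked": "V", "saw": "V", "chased": "V", "bit": "V",
           "the": "the", "who": "who"}

# (category of stack[-2], category of stack[-1]) -> reduced category
RULES = {("the", "N"): "NP", ("V", "NP"): "VP",
         ("who", "VP"): "RC", ("NP", "RC"): "NP", ("NP", "VP"): "S"}

def parse(words):
    stack, actions = [], []            # stack holds (category, subtree) pairs
    for w in words:
        stack.append((LEXICON[w], w))  # Shift the next word onto the stack
        actions.append("Shift")
        # Reduce the top two stack elements as long as some rule applies
        while len(stack) >= 2 and (stack[-2][0], stack[-1][0]) in RULES:
            (c1, t1), (c2, t2) = stack[-2], stack[-1]
            stack[-2:] = [(RULES[(c1, c2)], [t1, t2])]
            actions.append("Reduce")
    return stack, actions

stack, actions = parse("the boy who liked the girl chased the cat".split())
print(stack[0][0])   # 'S': the whole sentence reduced to one parse tree
```

On the example sentence of Fig. 1 this sketch produces the same nine shifts and eight reductions as the figure's trace.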
The general SR model can be implemented in many ways. A set of symbolic shift-reduce rules can be written by hand or learned from input examples [9,29,35]. It is also possible to train a neural network to make parsing decisions based on the current stack and the input buffer. If trained properly, the
neural network can generalize well to new sentences [30]. Whatever correlations
there exist between the word representations and the appropriate shift/reduce
decisions, the network will learn to utilize them.
Another important extension is to implement the stack as a neural network.
This way the parser can have access to the entire stack at once, and interesting
cognitive phenomena in processing complex sentences can be modeled. The Spec
system [19] was a first step in this direction. The stack was represented as a
compressed distributed representation, formed by a Raam (Recursive Auto-Associative Memory) auto-encoding network [25]. The resulting system was able
to parse complex relative clause structures. When the stack representation was
artificially lesioned by adding noise, the parser exhibited very plausible cognitive
performance. Shallow center embeddings were easier to process, as were sentences with strong semantic constraints in the role bindings. When the parser made
errors, it usually switched the roles of two words in the sentence, which is what
people also do in similar situations. A symbolic representation of the stack would
make modeling such behavior very difficult.
The Spec architecture, however, was not a complete implementation of SR
parsing; it was designed specifically for embedded relative clauses. For general
parsing, the stack needs to be encoded with neural networks to make it possible
to parse more varied linguistic structures. We believe that the generalization and
robustness of subsymbolic neural networks will result in powerful, cognitively
valid performance. However, the main problem of limited memory accuracy of
the Srn parsing network must first be solved.
3 Parser Architecture
A subsymbolic parser is based on a recurrent network such as Srn or Narx.
The network reads a sequence of input word representations into output patterns
representing the parse results, such as syntactic or case-role assignments for the
words. At each time step, a copy of the hidden layer (Srn) or prior outputs
(Narx) is saved and used as input during the next step, together with the next
word. In this way each new word is interpreted in the context of the entire
sequence so far, and the parse result is gradually formed at the output.
Recurrent neural networks can be used to implement a shift-reduce parser
in the following way: the network is trained to step through the parse (such as
that in Fig. 1), generating a compressed distributed representation of the top
element of the stack at each step (formed by a Raam network: section 4.1).
The network reads the sequence of words one word at a time, and each time
either shifts the word onto the stack (by passing it through the network, e.g.
step 1), or performs one or more reduce operations (by generating a sequence of
compressed representations corresponding to the top element of the stack: e.g.
steps 8-11). After the whole sequence is input, the final stack representation is
decoded into a parse result such as a parse tree. Such an architecture is powerful
for two reasons: (1) During the parse, the network does not have to guess what
is coming up later in the sentence, as it would if it always had to shoot for the
final parse result; its only task is to build a representation of the current stack
in its hidden layer and the top element in its output. (2) Instead of having to
generate a large number of different stack states at the output, it only needs to
output representations for a relatively small number of common substructures.
Both of these features make learning and generalization easier.
3.1 Srn
In the simple recurrent network, the hidden layer is saved and fed back into the
network at each step during sentence processing. The network is normally trained using the standard backpropagation algorithm [27]. A well-known problem
with the Srn model is its low memory accuracy. It is difficult for it to remember
items that occurred several steps earlier in the input sequence, especially if the
network is not required to produce them in the output layer during the intervening steps [32,19]. The intervening items are superimposed in the hidden layer,
obscuring the traces of earlier items. As a result, parsing with an Srn has been
limited to relatively simple sentences with shallow structure.
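The copy-back step of the Srn can be sketched in a few lines of numpy. This is an illustrative sketch, not the experimental setup: the tanh hidden units, sigmoid outputs, and layer sizes are arbitrary stand-ins.

```python
import numpy as np

def srn_step(x_t, h_prev, W_in, W_rec, W_out):
    """One forward step of a simple recurrent (Elman) network: the previous
    hidden layer h_prev is copied back in as context for the current input."""
    h = np.tanh(W_in @ x_t + W_rec @ h_prev)     # new hidden layer
    y = 1.0 / (1.0 + np.exp(-(W_out @ h)))       # sigmoid output layer
    return y, h

# Process a word sequence by threading the hidden layer through time
rng = np.random.default_rng(0)
W_in = rng.normal(size=(8, 4))
W_rec = rng.normal(size=(8, 8))
W_out = rng.normal(size=(6, 8))
h = np.zeros(8)                                  # empty context at start
for x in rng.normal(size=(5, 4)):                # five input "words"
    y, h = srn_step(x, h, W_in, W_rec, W_out)
```

Because each new input is superimposed on the same hidden vector, traces of early words fade over the loop, which is precisely the memory-accuracy problem described above.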
3.2 Narx
Nonlinear AutoRegressive models with eXogenous inputs have been proposed
as an alternative to Srns. In Narx networks, previous sequence items are explicitly represented in a predetermined number of input and output delays to
help reduce the forgetting behavior of recurrent networks due to the loss of gradient information from much earlier input in a sequence [2,15]. The network is
typically trained via Backpropagation Through Time (Bptt) [27], which allows earlier information to be captured and to influence later outputs by propagating gradient
information through “jump-ahead” connections that develop when the network
is unfolded in time. The performance of Narx networks is strongly dependent
on the number of input/output delays, and determining that number is itself a
nontrivial question [16].
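The role of the output delays can be sketched as follows. This is again an illustrative numpy sketch: the delay buffer, layer sizes, and activation functions are assumptions, and a real Narx network would be trained with Bptt rather than run as a bare forward pass.

```python
import numpy as np
from collections import deque

def narx_step(x_t, y_delays, W_h, W_out):
    """One forward step of a NARX-style network: the hidden layer sees the
    current input together with the last d output vectors as explicit context."""
    z = np.concatenate([x_t, *y_delays])         # input + delayed outputs
    h = np.tanh(W_h @ z)
    return 1.0 / (1.0 + np.exp(-(W_out @ h)))

d, dim_x, dim_y, dim_h = 3, 4, 6, 8              # three output delays
rng = np.random.default_rng(1)
W_h = rng.normal(size=(dim_h, dim_x + d * dim_y))
W_out = rng.normal(size=(dim_y, dim_h))
delays = deque([np.zeros(dim_y)] * d, maxlen=d)  # oldest delay is dropped
for x in rng.normal(size=(5, dim_x)):
    y = narx_step(x, list(delays), W_h, W_out)
    delays.appendleft(y)                         # most recent output first
```

The fixed-length deque makes the design issue visible: the number of delays d must be chosen in advance, and anything older than d steps is gone.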
3.3 SardNet
The solution described in this paper is to use an explicit representation of the
input sequence as additional input to the hidden layer. This representation provides more accurate information about the sequence, such as the relative ordering
of the incoming words, and can be combined with the distributed hidden layer
to generate accurate output that nevertheless retains all the advantages of distributed representations. The sequence representation must be explicit enough
to allow such cleanup, but it must also be compact and generalize well to new
sequences.
The SardNet (Sequential Activation Retention and Decay Network) [11]
self-organizing map for sequences has exactly these properties. SardNet is based on the Self-Organizing Map neural network [12,13], and is organized to represent the space of all possible word representations. As in a conventional self-organizing map network, each input word is mapped onto a particular map node
called the maximally-responding unit, or winner. The weights of the winning
unit and all the nodes in its neighborhood are updated according to the standard adaptation rule to better approximate the current input. The size of the
neighborhood is set at the beginning of the training and reduced as the map
becomes more organized.
In SardNet, the sentence is represented as a pattern of activation on the
map (Fig. 2). For each word, the maximally responding unit is activated to a
maximum value of 1.0, and the activations of units representing previous words
are decayed according to a specified decay rate (e.g. 0.9). Once a unit is activated,
it is removed from competition and cannot represent later words in the sequence.
Each unit may then represent different words depending on the context, which
[Figure 2: network diagram. The current input word (chased) and the SardNet pattern for "the boy who who liked the girl chased chased chased" feed into the hidden layer together with the Context layer (Srn or Narx feedback); the output is the compressed Raam representation [[the,boy],[who,[liked,[the,girl]]]].]
Fig. 2. The Parser Network. This snapshot shows the network during step 11 of
parsing the sentence from Fig. 1. The representation for the current input word, chased, is shown at top left. Each word is input to the SardNet map, which builds a
representation for the sequence word by word. In the Srn implementation of the parser, the previous activation of the hidden layer is copied (as indicated by the dotted
line labelled Srn) to the Context assembly at each step. In the Narx implementation,
a predetermined number of previous output representations constitutes the Context
(indicated by the dotted line labelled Narx). The Context, together with the current
input word and the current SardNet pattern, is propagated to the hidden layer of
the network. As output, the network generates the compressed Raam representation
of the top element in the shift-reduce stack at this state of the parse (in this case,
line 12 in Fig. 1). SardNet is a map of word representations, and is trained through
the Self-Organizing Map (SOM) algorithm [13,12]. All other connections are trained
through backpropagation (for Srn) or Bptt (for Narx) [27].
allows for an efficient representation of sequences, and also generalizes well to
new sequences.
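Forming the SardNet pattern for a sequence can be sketched as follows. This is a minimal sketch assuming an already-trained map; the random weight vectors below are stand-ins for trained map vectors.

```python
import numpy as np

def sardnet_pattern(word_vecs, map_weights, decay=0.9):
    """Activation pattern for a word sequence on a (pre)trained SardNet map:
    the winner for each word is set to 1.0, earlier winners are decayed,
    and used units are removed from further competition."""
    n_units = len(map_weights)
    act = np.zeros(n_units)
    used = np.zeros(n_units, dtype=bool)
    for v in word_vecs:
        dist = np.linalg.norm(map_weights - v, axis=1)
        dist[used] = np.inf              # exclude units already assigned
        winner = int(np.argmin(dist))
        act *= decay                     # decay previous words (e.g. 0.9)
        act[winner] = 1.0                # newest word at full activation
        used[winner] = True
    return act

rng = np.random.default_rng(2)
weights = rng.random((100, 64))          # 100-unit map, 64-unit word vectors
pattern = sardnet_pattern(rng.random((3, 64)), weights)
```

For a three-word sequence the pattern holds exactly three active units with activations 1.0, 0.9, and 0.81, so both the identities and the relative order of the words remain recoverable.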
In this parsing task, a localist SardNet representation of the input sentence
is formed at the same time as the distributed network hidden layer representation. The map is used along with the next input word and with the previous
hidden layer (in the case of the Srn) or previous outputs (in the form of some prespecified number of delays in Narx) (Fig. 2) as extra context that is input into
the hidden layer. This architecture allows these networks to perform the shift-reduce parsing task with significantly less memory degradation. The sequence
information remains accessible in SardNet, so that the network is able to focus
on capturing correlations related to the structure of sentence constituents during
parsing rather than having to maintain precise constituent information in the
hidden layer.
150
M.R. Mayberry and R. Miikkulainen
Rule schemata:
  S → NP(n) VP(n,m)
  NP(n) → the Noun(n)
  NP(n) → the Noun(n) RC(n)
  VP(n,m) → Verb(n,m) NP(m)
  RC(n) → who VP(n,m)
  RC(n) → whom NP(m) Verb(m,n)
Nouns:
  Noun(0) → boy    Noun(1) → girl    Noun(2) → dog    Noun(3) → cat
Verbs:
  Verb(0,0) → liked, saw    Verb(0,1) → liked, saw    Verb(0,2) → liked
  Verb(0,3) → chased        Verb(1,0) → liked, saw    Verb(1,1) → liked, saw
  Verb(1,2) → liked         Verb(1,3) → chased        Verb(2,0) → bit
  Verb(2,1) → bit           Verb(2,2) → bit           Verb(2,3) → bit, chased
  Verb(3,0) → saw           Verb(3,1) → saw           Verb(3,3) → chased
Fig. 3. Grammar. This phrase structure grammar generates sentences with subject- and object-extracted relative clauses. The rule schemata with noun and verb restrictions ensure agreement between subject and object depending on the verb in the clause. Lexicon items are given in bold face.
4 Experiments
4.1 Input Data, Training, and System Parameters
The data used to train and test the parser networks was generated from the
phrase structure grammar in Fig. 3, adapted from a grammar that has become
common in the literature [8,19]. Since our focus was on shift-reduce parsing,
and not processing relative clauses per se, sentence structure was limited to one
relative clause per sentence. From this grammar, training targets corresponding
to each step in the parsing process were obtained. For shifts, the target is simply the current input. In these cases, the network is trained to auto-associate,
which these networks are good at. For reductions, however, the targets consist
of representations of the partial parse trees that result from applying a grammatical rule. For example, the reduction of the sentence fragment who liked the
girl would produce the partial parse result [who,[liked,[the,girl]]]. Two issues
arise: how should the parse trees be represented, and how should reductions be
processed during sentence parsing?
The approach taken in this paper is the same as in Spec (Sec. 2), as well as
in other connectionist parsing systems [19,3,28]. Compressed representations of
all the syntactic parse trees are built up through auto-association of the constituents using the Raam neural network. This training is performed beforehand
separately from the parsing task. Once formed, the compressed representations
can be decoded into their constituents using just the decoder portion of the
Raam architecture.
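The encoding half of this process can be sketched as follows. This shows the forward pass only: a real Raam learns its weights by auto-associating constituents against a matching decoder, so the random weights below are untrained stand-ins.

```python
import numpy as np

def raam_encode(tree, W_enc, b_enc):
    """Recursively compress a binary tree of leaf vectors into one fixed-width
    vector, Raam-style: encode the children, then squash the concatenated pair."""
    if isinstance(tree, np.ndarray):     # a leaf: already a vector
        return tree
    left = raam_encode(tree[0], W_enc, b_enc)
    right = raam_encode(tree[1], W_enc, b_enc)
    pair = np.concatenate([left, right])
    return 1.0 / (1.0 + np.exp(-(W_enc @ pair + b_enc)))   # sigmoid hidden

dim = 8
rng = np.random.default_rng(3)
W_enc = rng.normal(size=(dim, 2 * dim))
b_enc = rng.normal(size=dim)
a, b, c = (rng.random(dim) for _ in range(3))
code = raam_encode([[a, b], c], W_enc, b_enc)    # same width as each leaf
```

The key property is that the result of compressing a pair has the same width as a leaf, so trees of any depth collapse to one fixed-size vector that the decoder half can later unpack.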
In shift-reduce parsing, the input buffer after each “Reduce” action is unchanged; rather, the reduction occurs on the stack. Therefore, if we want to
perform the reductions one step at a time, the current word must be maintained in the input buffer until the next "Shift" action. Thus, the input to
the network consists of the sequence of words that make up the sentence with
the input word repeated for each reduce action, and the target consists of the
representation of the top element of the stack (as shown in Fig. 1).
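The construction of this input/target sequence can be sketched as follows. The helper is hypothetical: the action trace and the stack-top targets would come from the symbolic parse, and the final Stop step is omitted for simplicity.

```python
def training_pairs(words, actions, stack_tops):
    """Pair each parse step with its network input and target: the pending
    word is repeated for every Reduce, and the target is the representation
    of the top stack element produced at that step."""
    pairs, i = [], 0
    for action, top in zip(actions, stack_tops):
        pairs.append((words[i], top))
        if action == "Shift":
            i += 1               # the input buffer only advances on a Shift
    return pairs

steps = training_pairs(["the", "boy", "who"],
                       ["Shift", "Shift", "Reduce"],
                       ["the", "boy", "[the,boy]"])
```

On this fragment, the word "who" is presented during the Reduce step while [the,boy] is the target, mirroring the repeated inputs visible in the middle column of Fig. 1.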
the    10000000   who  01010000   whom  01100000   .    11111111
boy    00101000   dog  00100010   girl  00100100   cat  00100001
chased 00011000   saw  00010010   liked 00010100   bit  00010001
Fig. 4. Lexicon. Each word representation is put together from a part-of-speech identifier (first four components) and a unique ID tag (last four). This encoding is then
repeated eight times to form a 64-unit word representation. Such redundancy makes it
easier to identify the word.
network | delays | hidden | weights        network  | delays | hidden | map size | weights
Ffn     |   –    |  500   |  64000         SardFfn  |   –    |  201   |   144    |  63888
Srn     |   –    |  197   |  64025         SardSrn  |   –    |  134   |   144    |  64016
Narx    |   0    |  500   |  64000         SardNarx |   0    |  252   |   100    |  63856
Narx    |   3    |  200   |  64000         SardNarx |   3    |  137   |   100    |  63940
Narx    |   6    |  125   |  64000         SardNarx |   6    |   94   |   100    |  63928
Fig. 5. Network Parameters. In order to keep the network size as consistent as possible, the size of the hidden layer was varied according to the size of the inputs. Because the Sard networks included a 100-unit map (144 units in the SardFfn and SardSrn) that was connected to both the input and hidden layers, the size of the hidden layer was made proportionally smaller.
Word representations were hand-coded to provide basic part-of-speech information together with a unique ID tag that identified the word within the
syntactic category (Fig. 4). The basic encoding of eight units was repeated eight
times to fill out a 64-unit representation. The 64-unit representation length was
needed to encode all of the partial parse results formed by Raam. The redundancy in the lexical items facilitates learning in general.
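The encoding can be reproduced directly from the bit strings listed in Fig. 4 (a small sketch):

```python
def word_rep(code):
    """Expand an 8-bit lexicon code into the redundant 64-unit representation
    by repeating the basic encoding eight times."""
    return [float(bit) for bit in code] * 8

boy = word_rep("00101000")   # part of speech: first four bits; ID: last four
```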
4.2 System Parameters and Training
SardNet maps were added to an Srn and to a Narx network with zero, three,
and six delays (common in other comparisons of Narx and Srn [10,15,14]) to
yield SardSrn and SardNarx parsing architectures, respectively. Additionally,
a SardNet map was added to a feedforward network (Ffn). This SardFfn
network provides a baseline for evaluating the map itself in the parsing task. The
performances of all the architectures were compared in the shift-reduce parsing
task. The size of the hidden layer for each network was determined so that the
total number of weights was as close to 64,000 as the topology would permit
(Fig. 5).
Four data sets of 20%, 40%, 60%, and 80% of the 436 sentences generated
by the grammar were randomly selected and each parser was trained on each
dataset four times. Training on all 160 runs was stopped when the error on a 22-sentence (5%) validation set began to level off. The same validation set was used
for all the simulations and was randomly drawn from a pool of sentences that
did not appear in any of the training sets. Testing was then performed on the
remaining sentences that were neither in the training set nor in the validation
set. All networks were trained with a learning rate of 0.1, and the maps had
a decay rate of 0.9. A map of 100 units was pretrained with a learning rate of
0.6, and then used for all of the SardNarx networks. A slightly larger map
with 144 units was used for the SardFfn and SardSrn networks since these
networks had otherwise much fewer weights and because a larger hidden layer
did not help performance, whereas a larger map did. (This would also be true
of the SardNarx networks, but making the map larger constrained the size of
the hidden layer to the point where it hurt performance.) Training took about
one day on a 400 MHz dual-processor Pentium II workstation for each network.
4.3 Results
The average mismatches performance measure reports the average number of leaf
representations per sentence that are not correctly identified from the lexicon by
nearest match in Euclidean distance. As an example, if the target reduction
is [who,[liked,[the,girl]]] (step 11 of Fig. 1), but the output corresponds
to [who,[saw,[the,girl]]], then a mismatch would occur at the leaf labelled
saw once the Raam representation was decoded. Average mismatches provide
a measure of the correctness of the information in the Raam representation. It
is a true measure of the utility of the network and was, therefore, used in our
experiments.
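The measure can be sketched as follows. This is an illustrative implementation: the function name and the toy two-dimensional lexicon are assumptions, not the exact evaluation code.

```python
import numpy as np

def average_mismatches(output_leaves, target_leaves, lexicon):
    """Average number of leaf vectors per sentence whose nearest lexicon
    entry (by Euclidean distance) differs from that of the target leaf."""
    words = sorted(lexicon)
    mat = np.array([lexicon[w] for w in words])
    def nearest(v):
        return words[int(np.argmin(np.linalg.norm(mat - v, axis=1)))]
    total = sum(nearest(o) != nearest(t)
                for sent_o, sent_t in zip(output_leaves, target_leaves)
                for o, t in zip(sent_o, sent_t))
    return total / len(output_leaves)

lex = {"saw": np.array([1.0, 0.0]), "liked": np.array([0.0, 1.0])}
outs = [[np.array([0.9, 0.2]), np.array([0.1, 0.8])]]   # decoded leaves
tgts = [[np.array([0.0, 1.0]), np.array([0.0, 1.0])]]   # targets: liked, liked
rate = average_mismatches(outs, tgts, lex)
```

Here the first decoded leaf snaps to "saw" while its target is "liked", so the one-sentence corpus scores one mismatch per sentence.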
Most of the input sequences in the training and test datasets were seventeen steps long because the same number of steps was needed to reduce both subject- and object-extracted relative clauses. The longest long-term dependency the
networks had to overcome was at step three in the parsing process where the
first reduction occurred, which was part of the final compressed Raam parse
representation for the complete sentence. It was in decoding this final parse
representation that even the best networks made errors.
The results are summarized in Fig. 6. The main result is that virtually all
Sard networks performed better than the purely distributed networks. The
only exception was on the 20% dataset, where Narx-6 beat SardFfn and
SardNarx-0 and was comparable to the SardSrn and SardNarx-3. These
results demonstrate that adding SardNet to a distributed network results in a
significant performance gain.
Compared to their performance on the larger datasets, the Sard networks
were weaker on the 20% dataset. On closer inspection it turned out that the map
was not smooth enough to allow as good generalization as in the larger datasets,
where there was sufficient data to overcome the map irregularities. It is also
interesting to note that adding even a single delay to these networks completely
eliminated this problem, bringing the performance in line with the others. There
are a number of ways that this problem can potentially be eliminated, including
richer lexical representations and tuning of the SardNet parameters. We are
studying such methods to improve generalization. The trade-off between the
sizes of the map and hidden layer should also be borne in mind. There is a
certain point in a distributed network at which decreasing the hidden layer size results
in measurable performance degradation, whereas increasing it beyond another
Fig. 6. Summary of Parsing Performance. Averages over four simulations each
for the six types of network tested using the stricter average mismatches per sentence
measure on the test data. (The Ffn and Narx-0 results are all above 0.7 and are not shown in this figure, to allow the results among the remaining networks to be more
easily discernible.) With the exception of the Narx-6 network on the 20% dataset, the
networks to which SardNet was added clearly outperformed the purely distributed
networks. Even the SardNet networks with no recurrency (SardFfn and SardNarx-0) performed significantly better than the Srn, Narx-3, and Narx-6 networks on the rest of the test datasets. The better performance of SardSrn, SardNarx-3, and SardNarx-6, however, demonstrates the benefits of recurrency.
point yields diminishing returns. We have found that the same holds for the
map. Quantifying these observations will be addressed in future work.
4.4 Example Parse
The different roles of the distributed representations in the hidden layer and the
localist representation on the SardNet map can be seen clearly by contrasting
the performances of SardSrn and the Srn on a typical sentence, such as the one
in Fig. 1. Neither SardSrn nor Srn had any trouble with the shift targets. Not
surprisingly, early in training the networks would master all the shift targets in
the sentence before they would get any of the reductions correct. The first reduction ([the,boy] in our example) also poses no problem for either network. Nor, in
general, does the second, [the,girl], because the constituent information is still
fresh in memory. However, the ability of the Srn to generate the later reductions
accurately degrades rapidly because the information about earlier constituents
is smothered by the later steps of the parse. Interestingly, the structural information survives much longer. For example, instead of [who,[liked,[the,girl]]],
the Srn might produce [who,[bit,[the,dog]]]. The structure of this representation is correct; what is lost are the particular instantiations of the parse tree.
This is where SardNet makes a difference. The lost constituent information
remains accessible in the feature map. As a result, SardSrn is able to capture
each constituent even through the final reductions.
5 Discussion and Future Work
These results demonstrate a practicable solution to the memory degradation problem of distributed networks. When prior constituents are explicitly represented
at the input, the network does not have to maintain specific information about
the sequence, and can instead focus on what it is best at: capturing structure. Although the sentences used in these experiments are still relatively uncomplicated,
they do exhibit enough structure to suggest that much more complex sentences
could be tackled with distributed networks augmented with SardNet.
These results also show that networks with SardNet can perform as well
as Narx networks with many delays. Why is this a useful result? The point
is that, in the general case, it will be unclear how many delays are needed in
a Narx network, whereas SardNet can accommodate sequences of indefinite
length (limited only by the number of nodes in the map, if units are not reactivated).
Indeed, SardNet functions much in the same way as the delays in the Narx
networks do, but with the explicit representation of the input tokens replaced
by single-unit activations on the map. This relieves the designer from having to
specify, by trial and error, the appropriate number of delays. It should also lead
to more graceful degradation with unexpectedly long sequences, and therefore
would allow the system to scale up better and exhibit more plausible cognitive
behavior.
The SardNet idea is not just a way to improve the performance of subsymbolic networks; it is an explicit implementation of the idea that humans can
keep track of identities of elements, not just their statistical properties [18]. The
subsymbolic networks are very good with statistical associations, but cannot
distinguish between representations that have similar statistical properties. People can; whether they use a map-like representation or explicit delays (and how
many) is an open question, but we believe the SardNet representation suggests
an elegant way to capture a lot of the resulting behavior. SardNet is a plausible cognitive approach, and useful for building powerful subsymbolic language
understanding systems. SardNet is also in line with the general neurological
evidence for topographical representations in the brain.
It is also worth noting that the operation of the recurrent networks on the
shift-reduce parsing task is a nice demonstration of holistic computation. The
network is able to learn how to generate each Raam parse representation during the course of sentence processing without ever having to decompose and
recompose the constituent representations. Partial parse results can be built
up incrementally into increasingly complicated structures, which suggests that
training could be performed incrementally. Such a training scheme is especially
attractive given that training in general is still relatively costly.
An extension of the hybrid approach taken in this study, currently being investigated by our group, is an architecture where SardNet is combined with a
Raam network. Raam, although having many desirable properties for a purely
connectionist approach to parsing, has long been a bottleneck during training.
Its operation is very similar to the Srn, and it suffers from the same memory
accuracy problem: with deep structures the superimposition of higher-level representations gradually obscures the traces of low-level items, and the decoding
becomes inaccurate. This degradation makes it difficult to use Raam to encode/decode parse results of realistic language. Preliminary results indicate that
the explicit representation of a compressed structure formed on a SardNet
feature map, coupled with the distributed representations of Raam, yields an
architecture able to encode richer linguistic structure. This approach should
readily lend itself to encoding the feature-value matrices used in the lexicalist,
constraint-based grammar formalisms of contemporary linguistic theory, such
as Hpsg [26], needed to handle realistic natural language.
6 Conclusion
We have shown how explicit representation of constituents on a self-organizing
map allows recurrent networks to process sequences more effectively. We demonstrated that neural networks equipped with SardNet sequence memory achieve
much better performance than distributed networks alone, and comparable performance to Narx networks with several delays, on a nontrivial shift-reduce parsing task. SardNet, however, provides a more elegant and cognitively plausible
solution to the problem of long-term dependencies in that it does not impose
hard limits on the length of the sequences it can process. In future work, we
will extend the method to other recurrent neural network architectures such as
Raam to allow much richer linguistic structures to be represented in connectionist architectures.
Acknowledgments
This research was supported in part by the Texas Higher Education Coordinating
Board under grant ARP-444.
SardSrn demo: http://www.cs.utexas.edu/users/nn/pages/research/nlp.html.
References
1. R. B. Allen. Several studies on natural language and back-propagation. In Proceedings of the IEEE First International Conference on Neural Networks (San Diego,
CA), volume II, pages 335–341. Piscataway, NJ: IEEE, 1987.
2. Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
3. G. Berg. A connectionist parser with recursive sentence structure and lexical disambiguation. In W. Swartout, editor, Proceedings of the Tenth National Conference
on Artificial Intelligence, pages 32–37. Cambridge, MA: MIT Press, 1992.
4. D. J. Chalmers. Syntactic transformations on distributed representations. Connection Science, 2:53–62, 1990.
5. S. Chen, S. Billings, and P. Grant. Non-linear system identification using neural networks. International Journal of Control, 51:1191–1214, 1990.
6. J. Connor, L. Atlas, and D. Martin. Recurrent networks and NARMA modeling.
Advances in Neural Information Processing Systems, 4:301–308, 1992.
7. J. L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.
8. J. L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7:195–225, 1991.
9. U. Hermjakob. Learning Parse and Translation Decisions from Examples with Rich
Context. PhD thesis, Department of Computer Sciences, The University of Texas
at Austin, Austin, TX, 1997. Technical Report UT-AI97-261.
10. B. Horne and C. Giles. An experimental comparison of recurrent neural networks.
Advances in Neural Information Processing Systems, 7:697–704, 1995.
11. D. L. James and R. Miikkulainen. SARDNET: A self-organizing feature map for
sequences. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in
Neural Information Processing Systems 7, pages 577–584. Cambridge, MA: MIT
Press, 1995.
12. T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78:1464–1480,
1990.
13. T. Kohonen. Self-Organizing Maps. Springer, Berlin; New York, 1995.
14. T. Lin, B. G. Horne, and C. L. Giles. How embedded memory in recurrent neural
network architectures helps learning long-term temporal dependencies. Neural
Networks, 11(5):861–868, 1998.
15. T. Lin, B. G. Horne, and C. L. Giles. Learning long-term dependencies in narx
recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–
1338, 1996.
16. T. Lin, C. L. Giles, B. G. Horne, and S. Y. Kung. A Delay Damage Model Selection
Algorithm for NARX Neural Networks. IEEE Transactions on Signal Processing,
45(11):2719-2730, 1997.
17. J. L. McClelland and A. H. Kawamoto. Mechanisms of sentence processing: Assigning roles to constituents. In J. L. McClelland and D. E. Rumelhart, editors,
Parallel Distributed Processing: Explorations in the Microstructure of Cognition,
Volume 2: Psychological and Biological Models, pages 272–325. MIT Press, Cambridge, MA, 1986.
18. R. Miikkulainen. Subsymbolic Natural Language Processing: An Integrated Model
of Scripts, Lexicon, and Memory. MIT Press, Cambridge, MA, 1993.
19. R. Miikkulainen. Subsymbolic case-role analysis of sentences with embedded clauses. Cognitive Science, 20:47–73, 1996.
20. R. Miikkulainen. Dyslexic and category-specific impairments in a self-organizing
feature map model of the lexicon. Brain and Language, 59:334–366, 1997.
21. P. Munro, C. Cosic, and M. Tabasko. A network for encoding, decoding and
translating locative prepositions. Connection Science, 3:225–240, 1991.
22. K. S. Narendra and K. Parthasarathy. Identification and control of dynamical
systems using neural networks. IEEE Transactions on Neural Networks, 1:4–27,
1990.
Combining Maps and Distributed Representations for Shift-Reduce Parsing
157
23. D. C. Plaut. Connectionist Neuropsychology: The Breakdown and Recovery of Behavior in Lesioned Attractor Networks. PhD thesis, Computer Science Department,
Carnegie Mellon University, Pittsburgh, PA, 1991. Technical Report CMU-CS-91185.
24. D. C. Plaut and T. Shallice. Perseverative and semantic influences on visual object naming errors in optic aphasia: A connectionist account. Technical Report
PDP.CNS.92.1, Parallel Distributed Processing and Cognitive Neuroscience, Department of Psychology, Carnegie Mellon University, Pittsburgh, PA, 1992.
25. J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46:77–
105, 1990.
26. C. Pollard and I. A. Sag. Head-Driven Phrase Structure Grammar. University of
Chicago Press, Chicago, IL, 1994.
27. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors,
Parallel Distributed Processing: Explorations in the Microstructure of Cognition,
Volume 1: Foundations, pages 318–362. MIT Press, Cambridge, MA, 1986.
28. N. E. Sharkey and A. J. C. Sharkey. A modular design for connectionist parsing. In
M. F. J. Drossaers and A. Nijholt, editors, Twente Workshop on Language Technology 3: Connectionism and Natural Language Processing, pages 87–96, Enschede,
the Netherlands, 1992. Department of Computer Science, University of Twente.
29. R. F. Simmons and Y.-H. Yu. The acquisition and application of context sensitive
grammar for English. In Proceedings of the 29th Annual Meeting of the ACL.
Morristown, NJ: Association for Computational Linguistics, 1991.
30. R. F. Simmons and Y.-H. Yu. The acquisition and use of context dependent
grammars for English. Computational Linguistics, 18:391–418, 1992.
31. M. F. St. John and J. L. McClelland. Learning and applying contextual constraints
in sentence comprehension. Artificial Intelligence, 46:217–258, 1990.
32. A. Stolcke. Learning feature-based semantics with simple recurrent networks. Technical Report TR-90-015, International Computer Science Institute, Berkeley, CA,
1990.
33. M. Tomita. Efficient Parsing for Natural Language. Kluwer, Dordrecht; Boston,
1986.
34. D. S. Touretzky. Connectionism and compositional semantics. In J. A. Barnden
and J. B. Pollack, editors, High-Level Connectionist Models, volume 1 of Advances
in Connectionist and Neural Computation Theory, Barnden, J. A., series editor,
pages 17–31. Ablex, Norwood, NJ, 1991.
35. J. M. Zelle and R. J. Mooney. Comparative results on using inductive logic programming for corpus-based parser construction. In S. Wermter, E. Riloff, and
G. Scheler, editors, Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, pages 355–369. Springer, Berlin; New York,
1996.
Towards Hybrid Neural Learning Internet Agents
Stefan Wermter, Garen Arevian, and Christo Panchev
Hybrid Intelligent Systems Group
University of Sunderland, Centre for Informatics, SCET
St Peter’s Way, Sunderland, SR6 0DD, UK
stefan.wermter@sunderland.ac.uk
http://www.his.sunderland.ac.uk/
Abstract. This chapter explores learning internet agents. In recent years, with the massive increase in the amount of information available on the Internet, a need has arisen to organize and access that data in a meaningful and directed way. Many well-explored techniques from the fields of AI and machine learning have been applied in this context. In this chapter, special emphasis is placed on neural network approaches to implementing a learning agent. First, various important approaches are summarized. Then, an approach for neural learning internet agents is presented, one that uses recurrent neural networks to learn to classify a textual stream of information. Experimental results are presented showing that a neural network model based on a recurrent plausibility network can act as a scalable, robust and useful news routing agent. The concluding section examines the need for a hybrid integration of various techniques to achieve optimal results in the specified problem domain, in particular exploring the hybrid integration of Preference Moore machines and recurrent networks to extract symbolic knowledge.
1 Introduction
The exponential expansion of Internet information has been very apparent; however, there is still a great deal that can be done to improve the classification and subsequent access of the data that is potentially available. The motivation for trying various techniques from the field of machine learning arises from the fact that there is a great deal of unstructured data. Much time is spent searching for information, filtering information down to essential data, reducing the search space for specific domains, classifying text and so on. Various machine learning techniques are examined for automating the learning of these processes, and tested to address the problem of an expanding and dynamic Internet [26].
So-called “internet agents” have been implemented to address some of these problems. The simplest definition of an agent is that it is a software system, autonomous to some degree, that is designed to perform or learn a specific task [2,30] using either one algorithm or a combination of several. Agents can be designed to perform various tasks including textual classification [34,10], information retrieval and extraction [5,9], routing of information such as email and news [3,18,6,46], automating web browsing [1], organization [36,4], personal assistance [39,20,17,14] and learning for web-agents [28,46].

S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 158–174, 2000.
© Springer-Verlag Berlin Heidelberg 2000
Despite a great deal of work on internet agents, most current systems do not have learning capabilities. In the context of this chapter, a learning agent is taken to be an algorithmic approach to a classification problem that is dynamic, robust and able to handle noisy data, operating autonomously to a degree while improving its performance through repeated experience [44]. Of course, learning internet agents can be defined in a variety of ways as well; the emphasis in this context is on autonomously functioning systems that can classify or route information of a textual nature. In particular, after a summary of various approaches, the HyNeT recurrent neural network architecture will be described and shown to be a robust and scalable text routing agent for the Internet.
2 Different Approaches to Learning in Agents
The field of machine learning is concerned with the construction of computer programs that automatically improve their performance with experience [33]. A few examples of machine learning approaches currently applied to learning agents are decision trees [37], Bayesian statistical approaches [31], Kohonen networks [24,22] and Support Vector Machines (SVMs) [19]. The following summary, however, examines the potential use of neural networks.
2.1 Neural Network Approaches
Many internet-related problems are neither discrete nor have known distributions, due to the dynamics of the medium. Internet agents can therefore be made more powerful by employing learning algorithms from neural networks. Neural networks have several properties which make them very useful for the Internet. Their information processing is non-linear, allowing the learning of real-valued, discrete-valued and vector-valued examples; they are adaptable and dynamic in nature, and hence can cope with a varying operating environment. Contextual information and knowledge are represented by the structure and weights of the system, allowing interesting mappings to be extracted from the problem environment. Most importantly, neural networks are fault-tolerant and robust, able to learn from noisy or incomplete data thanks to their distributed representations.
There are many different neural network algorithms; however, bearing in mind the context of agents and learning, several types of neural network are more suitable than others for the required task. For a dynamic system like the Internet, an online agent needs to be as robust as possible, essentially to be left to the task of routing, classifying and organizing textual data in an autonomous and self-maintaining way by being able to generalize, and to be fault-tolerant and adaptive. The three approaches so far shown to be most suitable are recurrent networks [46], Kohonen self-organizing maps (SOMs) [24,22] and reinforcement learning [42,43]. All these neural network approaches have properties which are briefly discussed and illustrated below.
Supervised Recurrent Networks Recurrent neural networks have shown great promise in many tasks. For example, certain natural language processing approaches require that context and time be incorporated as part of the model [8,7]; hence, recent work has focused on developing networks that create contextual representations of textual data, taking into account the implicit representation of time, temporal sequencing and the context arising from the internal representation that is created. These properties of recurrent neural networks can be useful for creating an agent that derives information from text-based, noisy Internet input. In particular, recurrent plausibility networks have been found useful [45,46].
Also, NARX (Nonlinear AutoRegressive with eXogenous inputs) models have been shown to be very effective in learning many problems, such as those that involve long-term dependencies [29]. NARX networks are formalized by [38]:

y(t) = f(x(t − nx), ..., x(t − 1), x(t), y(t − ny), ..., y(t − 1))

where x(t) and y(t) are the input and output of the network at time t; nx and ny represent the orders of the input and output, and the function f is the mapping performed by the multi-layer perceptron.
In some cases, NARX and RNN (recurrent neural network) models have been shown to be equivalent [40]: under conditions where the neuron transfer function is similar to the NARX transfer function, one may be transformed into the other and vice versa. The benefit is that if the output dimension of a NARX model is larger than the number of hidden units, training an equivalent RNN will be faster; pruning is also easier in an equivalent NARX, whose stability behavior can be analyzed more readily.
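The tapped-delay structure of the NARX recurrence above can be sketched in a few lines of Python. This is an illustrative toy only: the stand-in f below is a hypothetical linear map, whereas a real NARX model would use a trained multi-layer perceptron for f.

```python
from collections import deque

def narx_predict(f, xs, nx, ny, y0=0.0):
    """Run the NARX recurrence y(t) = f(x(t-nx), ..., x(t), y(t-ny), ..., y(t-1)).

    f      -- the learned mapping (here any callable taking two tuples)
    xs     -- input sequence x(0), x(1), ...
    nx, ny -- orders of the input and output delay lines
    """
    x_hist = deque([0.0] * (nx + 1), maxlen=nx + 1)  # x(t-nx), ..., x(t)
    y_hist = deque([y0] * ny, maxlen=ny)             # y(t-ny), ..., y(t-1)
    outputs = []
    for x in xs:
        x_hist.append(x)                  # shift the input delay line
        y = f(tuple(x_hist), tuple(y_hist))
        y_hist.append(y)                  # shift the output delay line
        outputs.append(y)
    return outputs

# Hypothetical stand-in for the trained MLP: a linear combination of delays.
f = lambda xh, yh: 0.5 * sum(xh) + 0.25 * sum(yh)
print(narx_predict(f, [1, 0, 0, 0], nx=1, ny=1))
# [0.5, 0.625, 0.15625, 0.0390625]
```

The impulse response decays geometrically here because the fed-back output term acts as the model's memory, which is exactly the mechanism that lets NARX networks carry long-term dependencies.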
Unsupervised Models Recently, applications of Kohonen nets have been extended to the realm of text processing [25,16] to create browsable mappings of Internet-related hypertext data. A self-organizing map (SOM) forms a nonlinear projection from a high-dimensional data manifold onto a low-dimensional grid [24]. The SOM algorithm computes an optimal collection of models that approximates the data by applying a specified error criterion, and takes into account the similarities and hence the relations between the models; this allows the ordering of the reduced-dimensionality data onto a grid.
The SOM algorithm [23,24] is formalized as follows. There is an initialization step, where random values for the initial weight vectors wj(0) are set; if the total number of neurons in the lattice is N, wj(0) must be different for j = 1, 2, ..., N. The magnitudes of the weights should be kept small for optimal performance.
There is a sampling step, where example vectors x from the input distribution are taken that represent the sensory signal. The optimally matched 'winning' neuron i(x) at discrete time t is found using the minimum-distance Euclidean criterion in a process called similarity matching:

i(x) = arg min_j ||x(t) − wj(t)||,  for j = 1, 2, ..., N
The synaptic weight vectors of all the neurons are adjusted and updated according to:

wj(t + 1) = wj(t) + µ(t)[x(t) − wj(t)]   for j ∈ Λi(x)(t)
wj(t + 1) = wj(t)                        otherwise
The learning rate is µ(t), and Λi(x) (t) is the neighborhood function centered
around the winning neuron i(x); both µ(t) and Λi(x) (t) are continuously varied.
The sampling, matching and update are repeated until no further changes are
observed in the mappings.
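The initialization, sampling, matching and update steps above can be sketched as a minimal one-dimensional SOM. All sizes, decay schedules and data below are hypothetical illustrations; the WEBSOM work discussed next uses a two-dimensional lattice over word-histogram document vectors.

```python
import numpy as np

def train_som(data, n_units, epochs=20, mu0=0.5, radius0=2.0, seed=0):
    """Minimal 1-D SOM implementing the init / sample / match / update steps."""
    rng = np.random.default_rng(seed)
    # Initialization: small, distinct random weight vectors w_j(0).
    w = rng.uniform(-0.1, 0.1, size=(n_units, data.shape[1]))
    for t in range(epochs):
        mu = mu0 * (1 - t / epochs)                 # decaying learning rate mu(t)
        radius = max(radius0 * (1 - t / epochs), 0.5)
        for x in data:                               # sampling step
            # Similarity matching: winner i(x) by minimum Euclidean distance.
            i = np.argmin(np.linalg.norm(x - w, axis=1))
            # Update every unit inside the neighbourhood Lambda_i(x)(t).
            for j in range(n_units):
                if abs(j - i) <= radius:
                    w[j] += mu * (x - w[j])
    return w

# Two toy clusters; after training, nearby units model nearby data.
data = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
weights = train_som(data, n_units=4)
```

Each weight update is a convex step toward the sampled vector, so the trained weights remain inside the range spanned by the initial weights and the data, which is one way of seeing why the map converges to a collection of models approximating the input distribution.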
In this way, the WEBSOM agent [25] can represent web documents statistically by their word frequency histograms, or by some reduced form of the data, as vectors. The SOM here acts as a similarity graph of the data. A simple graphical user interface is used to present the ordered data for navigation. This approach has been shown to be appropriate for the task of learning newsgroup classification.
Reinforcement Learning Approaches This is the on-line learning of input-output mappings through a process of exploration of a problem space. Agents that use reinforcement learning rely on training data that evaluates the actions taken. There is active exploration, with an explicit trial-and-error search for the desired behavior [43,12]; evaluative feedback, specifically characteristic of this type of learning, indicates how good an action was, but not whether it was the best or worst possible. All reinforcement learning approaches have explicit goals, and interact with and influence their environments.
Reinforcement learning aims to find a policy that selects a sequence of actions which is statistically optimal. The probability that a specific environment makes a transition from a state x(t) to y at time t + 1, given that it was previously in states x(0), x(1), ..., and that the corresponding actions a(0), a(1), ... were taken, depends entirely on the current state x(t) and action a(t), as shown by:

Γ{x(t + 1) = y | x(0), a(0); x(1), a(1); ...; x(t), a(t)} = Γ{x(t + 1) = y | x(t), a(t)}

where Γ(·) is the transition probability or change of state.
If the environment is in a state x(0) = x, the evaluation function [43,12] is given by:

H(x) = E[ Σ_{k=0}^{∞} γ^k r(k + 1) | x(0) = x ]
Here, E is the expectation operator, taken with respect to the policy used by the agent to select actions. The summation is termed the cumulative discounted reinforcement, and r(k + 1) is the reinforcement received from the environment after action a(k) is taken by the agent. The reinforcement feedback can have a positive value (regarded as a 'reward' signal), a negative value (regarded as 'punishment'), or be neutral. γ is called the discount-rate parameter and lies in the range 0 ≤ γ < 1: as γ → 0 the reinforcement emphasizes the short term, and as γ → 1 the cumulative actions matter over the longer term. Learning the evaluation function H(x) allows the cumulative discounted reinforcement to be used later on.
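The cumulative discounted reinforcement inside H(x) can be estimated from a finite sampled reward sequence, as this small sketch shows (the function name and data are hypothetical illustrations, not part of any cited system):

```python
def discounted_return(rewards, gamma):
    """Sum_{k=0..} gamma^k * r(k+1) over a finite sampled reward sequence.

    rewards[k] holds r(k+1), the reinforcement received after action a(k);
    gamma is the discount-rate parameter, 0 <= gamma < 1.
    """
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# gamma near 0 weights immediate reward; gamma near 1 weights the long term.
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

Averaging such returns over many sampled trajectories starting from x(0) = x gives a Monte Carlo estimate of H(x) under the current policy.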
This approach, though not yet fully explored for sequential tasks on the Internet, holds promise for the design of a learning agent system that fulfills the necessary criteria: one that is autonomous, adaptive, robust, and able to handle noise and sequential decisions.
3 Analysis and Discussion of a Specific Learning Internet Agent: HyNeT
A more detailed description of one particular learning agent will now be presented. A great deal of recent work on neural networks has shifted from the processing of strictly numerical data towards the processing of various corpora and the huge body of the Internet [35,26,5,19]. Indeed, it has been an important goal to study the more fundamental issues of connectionist systems: the way in which knowledge is encoded in neural networks and how knowledge can be derived from them [13,32,11,41,15]. A useful example, applicable as a real-world task, is the routing and classification of newswire titles, and this will now be described.
3.1 Recurrent Plausibility Networks
In this section, a detailed analysis of one such agent called HyNeT (Hybrid
Neural/symbolic agents for Text routing on the internet), which uses a recurrent
neural network, is presented and experimental results are discussed.
The specific neural network explored here is a more developed version of the simple recurrent neural network, namely a recurrent plausibility network [45,46]. Recurrent neural networks are able to map both previous internal states and input to a desired output, essentially acting as short-term incremental memories that take time and context into consideration.
Fully recurrent networks process all information and feed it back into a single layer, but they are limited for the purposes of maintaining contextual memory when processing inputs of arbitrary length. However, partially recurrent networks have recurrent connections between the hidden and context layers [7], or, in Jordan networks, between the output and context layers [21]; these allow previous states to be kept within the network structure.
Simple recurrent networks have a rapid rate of decay of information about states. For many classification tasks, recent events are more important, but some information can also be gained from the longer term. With sequential textual processing, context within a specific processing time-frame is important, and two kinds of short-term memory can be useful: one that is more dynamic and varies over time, keeping more recent information, and a more stable memory whose information is allowed to decay more slowly, keeping information about previous events over a longer time period. In earlier research [45], different decay memories were introduced by using distributed recurrent delays over separate context layers representing the contexts at different time steps. At a given time step, a network with n hidden layers processes the current input as well as the incremental contexts from the n − 1 previous time steps. Figure 1 shows the general structure of the recurrent plausibility network.
[Figure: feedforward connections lead from the input layer I0(t) through hidden layers Hn−1(t) and Hn(t) to the output layer On(t); recurrent connections link the hidden layers to the context layers Cn−2(t−1) and Cn−1(t−1).]
Fig. 1. General Representation of a Recurrent Plausibility Network.
The input to a hidden layer Hn is constrained by the underlying layer Hn−1 as well as the incremental context layer Cn−1. The activation of a unit Hni(t) at time t is computed on the basis of the weighted activations of the units in the previous layer H(n−1)k(t) and of the units in the current context of this layer C(n−1)l(t). In a particular case, the following is used:

Hni(t) = f( Σ_k wki H(n−1)k(t) + Σ_l wli C(n−1)l(t) )
The units in the context layers are computed at each time step as follows:

Cni(t) = (1 − ϕn) H(n+1)i(t − 1) + ϕn Cni(t − 1)

where Cni(t) is the activation of a unit in the context layer at time t. The self-recurrency of the context is controlled by the hysteresis value ϕn. The hysteresis value of the context layer Cn−1 is lower than the hysteresis value of the next context layer Cn; this ensures that the context layers closer to the input layer act as memory representing a more dynamic context over small time periods.
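The two recurrences above can be sketched as a single forward step over a title. This is only a toy reading of the equations, not the HyNeT implementation: the layer sizes, random weights and logistic activation are all hypothetical choices, while the hysteresis values (0.2, 0.8) anticipate the setting reported for the experiments.

```python
import numpy as np

def rpn_step(x, contexts, weights, phis):
    """One forward step of a recurrent plausibility network (sketch).

    contexts -- context-layer activations carried over from the previous step
    weights  -- one (W_in, W_ctx) pair per hidden layer (sizes hypothetical)
    phis     -- hysteresis values per context layer, e.g. (0.2, 0.8)
    """
    f = lambda a: 1.0 / (1.0 + np.exp(-a))   # logistic activation
    h = x
    new_contexts = []
    for (w_in, w_ctx), c, phi in zip(weights, contexts, phis):
        # H_n(t) = f(W_in H_{n-1}(t) + W_ctx C_{n-1}(t))
        h = f(w_in @ h + w_ctx @ c)
        # C(t+1) blends the new hidden state with the old context:
        # a large phi gives a slowly decaying, more stable memory.
        new_contexts.append((1 - phi) * h + phi * c)
    return h, new_contexts

rng = np.random.default_rng(0)
weights = [(0.1 * rng.standard_normal((6, 8)), 0.1 * rng.standard_normal((6, 6))),
           (0.1 * rng.standard_normal((4, 6)), 0.1 * rng.standard_normal((4, 4)))]
contexts = [np.zeros(6), np.zeros(4)]
for word_vec in rng.standard_normal((3, 8)):   # a hypothetical three-word title
    out, contexts = rpn_step(word_vec, contexts, weights, phis=(0.2, 0.8))
```

Because the first context layer uses the smaller hysteresis, it tracks the current word closely, while the second layer accumulates a slower summary of the title, mirroring the two kinds of short-term memory described above.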
3.2 Reuters-21578 Text Categorization Test Collection
The Reuters News Corpus is a collection of news articles that appeared on the
Reuters Newswire; all the documents have been categorized by Reuters into
several specific categories. Further formatting of the corpus [27] has produced the
so-called ModApte Split; some examples of the news titles are given in Table 1.
Table 1. Example titles from the Reuters corpus.

Semantic Category     Example Titles
money-fx              Bundesbank sets new re-purchase tender
shipping              US Navy said increasing presence near gulf
interest              Bank of Japan determined to keep easy money policy
economic              Miyazawa sees eventual lower US trade deficit
corporate             Oxford Financial buys Clancy Systems
commodity             Cattle being placed on feed lighter than normal
energy                Malaysia to cut oil output further traders say
shipping & energy     Soviet tankers set to carry Kuwaiti oil
money-fx & currency   Bank of Japan intervenes shortly after Tokyo opens
All the news titles belong to one or more of eight main categories: Money and Foreign Exchange (money-fx, MFX), Shipping (ship, SHP), Interest Rates (interest, INT), Economic Indicators (economic, ECN), Currency
(currency, CRC), Corporate (corporate, CRP), Commodity (commodity,
CMD), Energy (energy, ENG).
3.3 Various Experiments Conducted
In order to compare performance, several experiments were conducted using different vector representations of the words in the Reuters corpus as part of the preprocessing; the variously derived vector representations were fed into the input layer of recurrent networks, the output being the desired semantic routing category. The preprocessing strategies are briefly outlined below. The recall/precision results for each experiment are presented later in Table 2.
Simple Recurrent Network and Significance Vectors In the initial experiment, words were represented using significance vectors; these were obtained by determining the frequency of a word in the different semantic categories using the following operation:

v(w, xi) = (Frequency of w in xi) / (Σ_j Frequency of w in xj),  for j ∈ {1, ..., n}

If a vector (x1 x2 ... xn) represents each word w, and xi is a specific semantic category, then v(w, xi) is calculated for each dimension of the word vector as the frequency of the word w in semantic category xi divided by the number of times the word w appears in the corpus. The computed values are then presented at the input of a simple recurrent network [8] in the form (v(w, x1), v(w, x2), ..., v(w, xn)).
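A small sketch of the significance-vector computation; the word, the category subset and the counts below are invented for illustration and are not taken from the Reuters corpus.

```python
from collections import Counter

def significance_vector(word, counts, categories):
    """v(w, x_i) = (frequency of w in x_i) / (sum_j frequency of w in x_j).

    counts[(word, category)] holds raw corpus frequencies.
    """
    total = sum(counts[(word, c)] for c in categories)
    return [counts[(word, c)] / total if total else 0.0 for c in categories]

categories = ["money-fx", "ship", "interest"]       # hypothetical subset
counts = Counter({("bank", "money-fx"): 6,
                  ("bank", "ship"): 1,
                  ("bank", "interest"): 3})
print(significance_vector("bank", counts, categories))  # [0.6, 0.1, 0.3]
```

By construction the components sum to one, so each vector expresses how the occurrences of a word are distributed across the semantic categories.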
Simple Recurrent Network and Semantic Vectors An alternative preprocessing strategy was to represent words as vectors of the plausibility of a specific word occurring in a particular semantic category, the main advantage being that these are independent of the number of examples present in each category:

v(w, xi) = (Normalized frequency of w in xi) / (Σ_j Normalized frequency of w in xj),  j ∈ {1, ..., n}

where:

Normalized frequency of w in xi = (Frequency of w in xi) / (Number of titles in xi)

The normalized frequency of appearance of a word w in a semantic category xi (i.e. the normalized category frequency) was thus divided by the sum of the normalized frequencies of the word w over all categories (i.e. the normalized corpus frequency) to give the value v(w, xi) for each element of the semantic vector.
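The semantic-vector normalization can be sketched in the same way; the frequencies and title counts below are again hypothetical. Note how the word dominated by the smaller category wins despite its lower raw count in that illustration:

```python
def semantic_vector(word, freq, titles, categories):
    """v(w, x_i) from normalized category frequencies.

    freq[(word, c)] -- frequency of w in category c
    titles[c]       -- number of titles in category c
    """
    # Normalized category frequency: raw count scaled by category size.
    norm = {c: freq.get((word, c), 0) / titles[c] for c in categories}
    total = sum(norm.values())             # normalized corpus frequency
    return [norm[c] / total if total else 0.0 for c in categories]

categories = ["money-fx", "ship"]
freq = {("tanker", "ship"): 4, ("tanker", "money-fx"): 2}
titles = {"money-fx": 400, "ship": 100}    # unbalanced category sizes
print(semantic_vector("tanker", freq, titles, categories))
```

Per title, "tanker" is eight times more frequent in "ship" than in "money-fx" here, so the vector is roughly (1/9, 8/9) even though the raw counts differ only by a factor of two; this is the sense in which semantic vectors are independent of category size.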
Recurrent Plausibility Network and Semantic Vectors In the final experiment, a recurrent plausibility network, as shown in Figure 1, was used; the actual architecture had two hidden and two context layers. After empirically testing various combinations of hysteresis values for the activation functions of the context layers, it was found that the network performed optimally with a value of 0.2 for the first context layer and 0.8 for the second.
Table 2. Best recall/precision results from the various experiments.

                                                       Training set       Test set
Type of Vector Representation Used in Experiment       recall precision   recall precision
Significance Vectors and Simple Recurrent Network      85.15  86.99       91.23  90.73
Semantic Vectors and Simple Recurrent Network          88.57  88.59       92.47  91.61
Semantic Vectors with Recurrent Plausibility Network   89.05  90.24       93.05  92.29
“Bag of Words” with Recurrent Plausibility Network       -      -         86.60  83.10

3.4 Results of Experiments
The results in Table 2 show a clear improvement in the overall recall/precision values from the first experiment using the significance vectors to the last using the plausibility network. The experiment with the semantic vector representation showed an improvement over the first, and the best performance was obtained with the plausibility network.

In comparison, a bag-of-words approach, used to test performance on sequences without order, reached 86.6% recall and 83.1% precision; this indicates that the order of significant words, and hence the context, is an important source of information which the recurrent neural network learns, allowing better classification performance.
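Since the exact evaluation protocol is not spelled out here, the following is only an assumed micro-averaged reading of how recall and precision can be computed for multi-label title routing (the category sets are invented examples):

```python
def recall_precision(gold, predicted):
    """Micro-averaged recall and precision over multi-label assignments.

    gold, predicted -- lists of category sets, one per title.
    """
    tp = sum(len(g & p) for g, p in zip(gold, predicted))  # correct labels
    fn = sum(len(g - p) for g, p in zip(gold, predicted))  # missed labels
    fp = sum(len(p - g) for g, p in zip(gold, predicted))  # spurious labels
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

gold = [{"ship", "energy"}, {"economic"}, {"interest"}]
pred = [{"ship"}, {"economic"}, {"interest", "money-fx"}]
print(recall_precision(gold, pred))  # (0.75, 0.75)
```

Under this reading, recall penalizes missed category assignments while precision penalizes spurious ones, which is why the two figures can diverge for the bag-of-words run above.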
These results demonstrate that a carefully developed neural network agent architecture can deal with significantly large test and training sets. In some previous work [45], recall/precision accuracies of 95% were reached, but the library titles used there were much less ambiguous than the Reuters corpus (which has a few main categories, and news titles that can easily be misclassified due to inherent ambiguity), and only 1,000 test titles were used in that approach, while the plausibility network scaled to 10,000 corrupted and ambiguous titles.
For general comparison with other approaches, interesting work on text categorization on the Reuters corpus has been done using whole documents [19] rather than titles. Taking the ten most frequently occurring categories, it has been shown that the recall/precision break-even point was 86% for Support Vector Machines, 82% for k-Nearest Neighbor and 72% for Naive Bayes. Though a different set of categories and whole documents were used, and the results may therefore not be directly comparable to those shown in Table 2, they do however give some indication of document classification performance on this corpus. Especially for medium-sized text data sets, or when only titles are available, the HyNeT agent compares favorably with the other machine learning techniques that have been tested on this corpus.

[Figure: plot of sum-squared error against training epoch (0–900) for each word of the title.]
Fig. 2. The error surface of the title “Miyazawa Sees Eventual Lower US Trade Deficit”.
3.5 Analysis of the Output Representations
For a clear presentation of the network's behavior, the results are illustrated and analyzed below; the error surfaces plot the sum-squared error of the output preferences against the number of training epochs and each word of a title.
Figure 2 shows the error surface for the title “Miyazawa Sees Eventual Lower US Trade Deficit”, which is classified in the Reuters corpus under the “economic” category; as can be seen, the network does learn the correct category classification. The first two words, “Miyazawa” and “sees”, initially give several possible preferences for other categories, and the errors are high early in training. However, the subsequent words “eventual”, “lower”, etc. cause the network to increasingly favor the correct classification; at the end, the trained network has a very strong preference (shown by the low error value) for the incremental context of the desired category.
The second example, shown in Figure 3, is the title “Bank of Japan Determined To Keep Easy Money Policy”, belonging to the “interest” category. This example shows more complicated behavior in the contextual learning, in contrast to the previous one. The opening words “Bank of Japan” are ambiguous and could be classified under different categories such as “money/foreign exchange” and “currency”, and indeed the network shows some confused behavior; again, however, the context of the later words such as “easy money policy” eventually allows the network to learn the correct classification.

[Figure: plot of sum-squared error against training epoch (0–900) for each word of the title.]
Fig. 3. The error surface of the title “Bank of Japan Determined To Keep Easy Money Policy”.
3.6 Context Building in Plausibility Neural Networks
Figures 5 and 7 present cluster dendrograms based on the internal context representations at the ends of titles. The test includes 5 representative titles for each category; each title belongs to only one category, and all titles are correctly classified by the network. The first observation that can be made from these figures is that the dendrogram based on the activations of the second context layer (closer to the output layer) provides a better distinction between the classes. In other words, the second context layer is more representative of the title classification than the first one. This analysis aims to explore how these contexts are built and what the difference is between the two contexts along a title.

Using the data for Figures 5 and 7, the class-activity of a particular context unit is defined with respect to a given category as the activation of this unit when a title from this category has been presented to the network. For example, at the end of a title from the category “economic”, the units with higher activation will be classified as being more class-active with respect to the “economic” category, and the units with lower activation as less class-active.
For the analysis of the context building in the plausibility network, the activations of the context units were recorded while processing the title “Assets of money market mutual funds fell 35.3 mln dlrs in latest week to 237.43 billion”.
[Figure: unit activations (0–1) in the first context layer for each word of the title, with units ordered 1–6.]
Fig. 4. The activation of the units in the first context layer. The order of the units is changed according to the class-activity.
[Figure: dendrogram leaves labeled with category and title ID; titles from different categories are partly interleaved.]
Fig. 5. The cluster dendrogram and internal context representations of the first context layer for 40 representative titles.
S. Wermter, G. Arevian, and C. Panchev

[Figure: a 3-D surface of unit activation (0–1) over the title’s word sequence and the six units of the second context layer; axes labeled “Word Sequence”, “Neuron Unit”, and “Activation”.]
Fig. 6. The activation of the units in the second context layer. The units are ordered according to their class-activity.
[Dendrogram leaf labels of the form “category - title ID”; in this second-layer clustering, the titles group cleanly by category (energy, ship, commodity, currency, money-fx, economic, interest, corporate).]
Fig. 7. The cluster dendrogram and internal context representations of the second
context layer for 40 representative titles.
This title belongs to the “economic” category, and the data were sorted by the class-activity of the neurons with respect to this category. The results are shown in Figures 4 and 6.
The most class-active unit for the class “economic” is shown as unit 1 in the figure, and the unit with the lowest class-activity as unit 6. Thus, for the title to be classified into the correct category, the ideal curve at a given word step is a monotonically decreasing function from the units with the highest class-activity to the units with lower class-activity. As can be seen, most of the units
in the first context layer (closer to the input) are more dynamic. They are highly
dependent on the current word. Therefore the first context layer does not build a
representative context for the required category at the end of the title. It rather
responds to the incoming words, building a short dynamic context. However, the
second context layer is incrementally building its context representation for the
particular category. It is the context layer which is most responsible for a stable
output and does not fluctuate so much with the different incoming words.
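This qualitative contrast can be made measurable: a layer that “responds to the incoming words” shows large step-to-step activation changes, while a layer that builds context incrementally shows small ones. The following sketch uses synthetic activation traces (shapes, seeds, and names are illustrative assumptions, not the recorded data):

```python
import numpy as np

# Synthetic activation traces over the 16 word steps of the example title:
# rows = word steps, columns = the 6 context units.
rng = np.random.default_rng(1)
layer1 = rng.random((16, 6))                   # word-driven, fluctuating
ramp = np.linspace(0.0, 1.0, 16)[:, None]
layer2 = ramp * np.sort(rng.random(6))[::-1]   # builds up incrementally

def mean_step_change(acts):
    """Average absolute activation change between consecutive word steps."""
    return np.abs(np.diff(acts, axis=0)).mean()

# The first (dynamic) layer changes far more per word than the second.
print(mean_step_change(layer1) > mean_step_change(layer2))
```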
4 Conclusions
A variety of neural network learning techniques were presented that are relevant to the specific problem of classifying Internet texts. A new
recurrent network architecture, HyNeT, was presented that is able to route news
headlines. Similar to incremental language processing, plausibility networks also
process news titles using previous context as extra information. At the beginning
of a title, the network might predict an incorrect category which usually changes
to the correct one later on when more contextual information is available.
Furthermore, the error of the network was also carefully examined at each
epoch and for each word of the training headlines. These surface error figures allow a clear, comprehensive evaluation of training time, word sequence, and overall
classification error. In addition, this approach may be quite useful for any other
learning technique involving sequences. Then, an analysis of the context layers
was presented showing that the layers do indeed learn to use the information
derived from context.
Until now, recurrent neural networks had not been developed for a task of such size and scale as the design of title-routing agents. HyNeT is robust,
classifies noisy arbitrary real-world titles, processes titles incrementally from
left to right, and shows better classification reliability towards the end of titles
based on the learned context. Plausibility neural network architectures hold a lot
of potential for building robust neural architectures for semantic news routing
agents on the Internet.
References
1. M. Balabanovic and Y. Shoham. Learning information retrieval agents: Experiments with automated web browsing. In Proceedings of the 1995 AAAI Spring
Symposium on Information Gathering from Heterogeneous, Distributed Environments, Stanford, CA, 1995.
2. M. Balabanovic, Y. Shoham, and Y. Yun. An adaptive agent for automated web
browsing. Technical Report CS-TN-97-52, Stanford University, 1997.
3. W. Cohen. Learning rules that classify e-mail. In AAAI Spring Symposium on
Machine Learning in Information Access, Stanford, CA, 1996.
4. R. Cooley, B. Mobasher, and J. Srivastava. Web mining: Information and pattern
discovery on the world wide web. In International Conference on Tools for Artificial
Intelligence, Newport Beach, CA, November 1997.
5. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and
S. Slattery. Learning to extract symbolic knowledge from the world wide web. In
Proceedings of the 15th National Conference on Artificial Intelligence, Madison,
WI, 1998.
6. P. Edwards, D. Bayer, C.L. Green, and T.R. Payne. Experience with learning
agents which manage internet-based information. In AAAI Spring Symposium on
Machine Learning in Information Access, pages 31–40, Stanford, CA, 1996.
7. J. L. Elman. Finding structure in time. Technical Report CRL 8901, University
of California, San Diego, CA, 1988.
8. J. L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7:195–226, 1991.
9. D. Freitag. Information extraction from html: Application of a general machine
learning approach. In National Conference on Artificial Intelligence, pages 517–
523, Madison, Wisconsin, 1998.
10. J. Fuernkranz, T. Mitchell, and E. Riloff. A case study in using linguistic phrases
for text categorization on the WWW. In Proceedings of the AAAI-98 Workshop
on Learning for Text Categorisation, Madison, WI, 1998.
11. L. Giles and C. W. Omlin. Extraction, insertion and refinement of symbolic rules
in dynamically driven recurrent neural networks. Connection Science, 5:307–337,
1993.
12. S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan College
Publishing Company, New York, 1994.
13. J. Hendler. Developing hybrid symbolic/connectionist models. In J. A. Barnden
and J. B. Pollack, editors, Advances in Connectionist and Neural Computation
Theory, Vol.1: High Level Connectionist Models, pages 165–179. Ablex Publishing
Corporation, Norwood, NJ, 1991.
14. R. Holte and C. Drummond. A learning apprentice for browsing. In AAAI Spring
Symposium on Software Agents, Stanford, CA, 1994.
15. V. Honavar. Symbolic artificial intelligence and numeric artificial neural networks:
towards a resolution of the dichotomy. In R. Sun and L. A. Bookman, editors,
Computational Architectures integrating Neural and Symbolic Processes, pages 351–
388. Kluwer, Boston, 1995.
16. S. Honkela. Self-organizing maps in symbol processing. In Hybrid Neural Systems
(this volume). Springer-Verlag, 2000.
17. M.A. Hoyle and C. Lueg. Open SESAME: A look at personal assistants. In Proceedings of the International Conference on the Practical Applications of Intelligent Agents and Multi-Agent Technology, pages 51–56, London, 1997.
18. D. Hull, J. Pedersen, and H. Schutze. Document routing as statistical classification.
In AAAI Spring Symposium on Machine Learning in Information Access, Stanford,
CA, 1996.
19. T. Joachims. Text categorization with support vector machines: learning with
many relevant features. In Proceedings of the European Conference on Machine
Learning, Chemnitz, Germany, 1998.
20. T. Joachims, D. Freitag, and T. Mitchell. Webwatcher: A tour guide for the world
wide web. In Fifteenth International Joint Conference on Artificial Intelligence,
Nagoya, Japan, 1997.
21. M. I. Jordan. Attractor dynamics and parallelism in a connectionist sequential
machine. In Proceedings of the Eighth Conference of the Cognitive Science Society,
pages 531–546, Amherst, MA, 1986.
22. S. Kaski, T. Honkela, K. Lagus, and T. Kohonen. WEBSOM - self-organizing maps
of document collections. Neurocomputing, 21:101–117, 1998.
23. T. Kohonen. Self-Organization and Associative Memory. Springer, Berlin, third
edition, 1989.
24. T. Kohonen. Self-Organizing Maps. Springer, Berlin, 1995.
25. T. Kohonen. Self-organisation of very large document collections: State of the art.
In Proceedings of the International Conference on Artificial Neural Networks, pages
65–74, Skovde, Sweden, 1998.
26. S. Lawrence and C. L. Giles. Searching the world wide web. Science, 280:98–100,
1998.
27. D. D. Lewis. Reuters-21578 text categorization test collection, 1997. http://www.research.att.com/˜lewis.
28. R. Liere and P. Tadepalli. The use of active learning in text categorisation. In
AAAI Spring Symposium on Machine Learning in Information Access, Stanford,
CA, 1996.
29. T. Lin, B. G. Horne, P. Tino, and C. L. Giles. Learning long-term dependencies
in NARX recurrent neural networks. IEEE Transactions on Neural Networks,
7(6):1329–1338, November 1996.
30. F. Menczer, R. Belew, and W. Willuhn. Artificial life applied to adaptive information agents. In Proceedings of the 1995 AAAI Spring Symposium on Information
Gathering from Heterogeneous, Distributed Environments, 1995.
31. D. Michie, D. J. Spiegelhalter, and C. C. Taylor, editors. Machine Learning, Neural
and Statistical Classification. Ellis Horwood, New York, 1994.
32. R. Miikkulainen. Subsymbolic Natural Language Processing. MIT Press, Cambridge, MA, 1993.
33. T. M. Mitchell. Machine Learning. WCB/McGraw-Hill, New York, 1997.
34. K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Learning to classify text from
labeled and unlabeled documents. In Proceedings of the National Conference on
Artificial Intelligence, Madison, WI, 1998.
35. R. Papka, J. P. Callan, and A. G. Barto. Text-based information retrieval using
exponentiated gradient descent. In Advances in Neural Information Processing
Systems, volume 9, Denver, CO, 1997. MIT Press.
36. M. Perkowitz and O. Etzioni. Adaptive web sites: an AI challenge. In International
Joint Conference on Artificial Intelligence, Nagoya, Japan, 1997.
37. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
38. H. T. Siegelmann, B. G. Horne, and C. L. Giles. Computational capabilities of
recurrent NARX neural networks. Technical Report CS-TR-3408, University of
Maryland, College Park, 1995.
39. M. Spiliopoulou, L. C. Faulstich, and K. Winkler. A data miner analyzing the
navigational behavior of web users. In ACAI-99 Workshop on Machine Learning
in User Modeling, Crete, July 1999.
40. J.P.F. Sum, W.K. Kan, and G.H. Young. A note on the equivalence of NARX and
RNN. Neural Computing and Applications, 8:33–39, 1999.
41. R. Sun. Integrating Rules and Connectionism for Robust Commonsense Reasoning.
Wiley, New York, 1994.
42. R. Sun and T. Peterson. Multi-agent reinforcement learning: Weighting and partitioning. Neural Networks, 1999.
43. R. S. Sutton and A. G. Barto. Reinforcement Learning: an Introduction. MIT
Press, Cambridge, MA, 1998.
44. G. Tecuci. Building Intelligent Agents: An Apprenticeship Multistrategy Learning
Theory, Methodology, Tool and Case Studies. Academic Press, San Diego, 1998.
45. S. Wermter. Hybrid Connectionist Natural Language Processing. Chapman and
Hall, Thomson International, London, UK, 1995.
46. S. Wermter, C. Panchev, and G. Arevian. Hybrid neural plausibility networks for
news agents. In Proceedings of the National Conference on Artificial Intelligence,
pages 93–98, Orlando, USA, 1999.
A Connectionist Simulation of the Empirical
Acquisition of Grammatical Relations
William C. Morris1, Garrison W. Cottrell1, and Jeffrey Elman2

1 Computer Science and Engineering Department, University of California, San Diego, 9500 Gilman Dr., La Jolla CA 92093-0114 USA
2 Center for Research in Language, Department of Cognitive Science, University of California, San Diego, 9500 Gilman Dr., La Jolla CA 92093-0114 USA
Abstract. This paper proposes an account of the acquisition of grammatical relations using the basic concepts of connectionism and a construction-based theory of grammar. Many previous accounts of first-language acquisition assume that grammatical relations (e.g., the grammatical subject and object of a sentence) and linking rules are universal
and innate; this is necessary to provide a first set of assumptions in the
target language to allow deductive processes to test hypotheses and/or
set parameters.
In contrast to this approach, we propose that grammatical relations
emerge rather late in the language-learning process. Our theoretical proposal is based on two observations. First, early production of childhood
speech is formulaic and becomes systematic in a progressive fashion.
Second, grammatical relations themselves are family-resemblance categories that cannot be described by a single parameter. This leads to the
notion that grammatical relations are learned in a bottom-up fashion.
Combining this theoretical position with the notion that the main purpose of language is communication, we demonstrate the emergence of the
notion of “subject” in a simple recurrent network that learns to map from
sentences to semantic roles. We analyze the hidden layer representations
of the emergent subject, and demonstrate that these representations correspond to a radially structured category. We also claim that the pattern
of generalization and undergeneralization demonstrated by the network
conforms to what we expect from the data on children’s generalizations.
1 Introduction
Grammatical relations are frequently a problem for language acquisition systems.
In one sense they represent the most abstract aspect of language; subjects transcend all semantic restrictions – virtually any semantic role can be a subject.
While semantics is seen as being related to world-knowledge, syntax is seen as
existing on a distinct plane. For this reason there are language theories in which
grammatical relations are considered the most fundamental aspect of language.
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 175–193, 2000.
c Springer-Verlag Berlin Heidelberg 2000
One approach to learning syntax has been to relegate grammatical relations
and their behaviors to the “innate endowment” that each child is born with.
There are a number of theories of language acquisition (e.g., [4, 27, 46, 47]) that
start with the assumption that syntax is a separate component of language, and
that the acquisition of syntax is largely independent of semantic considerations.
Accordingly, in these theories there is an innate, skeletal syntactic system present from the very beginning of multiword speech. Acquiring syntax consists of
modifying and elaborating the skeletal system to match the target language.
This assumption of innate syntax inevitably leads to a problem, sometimes
referred to as the “bootstrapping problem”. How does one start this “purely
syntactic” analysis? How does one start making initial assignments of words to
grammatical relations (i.e., subject, object, etc.)? A commonly proposed mechanism involves the child tentatively assigning nominals to grammatical relations
based on their semantic content by linking rules1 (e.g., Pinker [46, 47]). This implies that these grammatical relations and linking rules are present at the very
beginning of the learning process.
One problem with this approach is that cross-linguistically the behaviors of
grammatical relations differ too much to be accommodated by a single system.
Proposals have been put forward [36, 47] that a single parameter with a binary
value (“accusative” or “ergative”) is sufficient to account for the extant grammatical systems. This has been shown to be inadequate [29, 40, 51] because there
are languages that have neither strictly accusative nor strictly ergative syntax.
We propose a language acquisition system that does not rely on innate linguistic knowledge [40]. The proposal is based on Construction Grammar [24, 25]
and on the learning mechanisms of PDP-style connectionism [50]. We have hypothesized that abstractions such as “subject” emerge through rote learning of
particular constructions, followed by the merging of these “mini-grammars”. The
claim is that in using this sort of a language acquisition system it is possible for a
child to learn grammatical relations over time, and in the process accommodate
to whatever language-specific behaviors his target language exhibits.
Here we present a preliminary study showing that a neural net that is trained
with the task of assigning semantic roles to sentence constituents can acquire
grammatical relations. We have demonstrated this in two ways: by showing that
this network associates particular subjecthood properties with the appropriate
verb arguments, and by showing that the network has gone some distance toward
abstracting this nominal away from its semantic content.
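The network referred to here is a simple recurrent (Elman-style) network. The forward pass below sketches the general mechanism of carrying context across words while producing role activations at each step; the sizes, weights, and one-hot sentence encoding are placeholders, not the actual simulation:

```python
import numpy as np

# Minimal Elman-style simple recurrent network (forward pass only): words
# go in one at a time and the hidden/context layer carries state forward.
rng = np.random.default_rng(2)
n_words, n_hidden, n_roles = 10, 8, 3          # vocabulary, hidden, roles

W_in = rng.normal(scale=0.1, size=(n_hidden, n_words))
W_ctx = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
W_out = rng.normal(scale=0.1, size=(n_roles, n_hidden))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_sentence(word_ids):
    """Feed a sentence word by word; return role activations at each step."""
    context = np.zeros(n_hidden)               # context starts empty
    outputs = []
    for w in word_ids:
        x = np.zeros(n_words)
        x[w] = 1.0                             # one-hot word input
        context = sigmoid(W_in @ x + W_ctx @ context)
        outputs.append(sigmoid(W_out @ context))
    return np.array(outputs)

roles = run_sentence([3, 1, 7])                # hypothetical 3-word sentence
print(roles.shape)                             # per-word role activations
```

Training such a network on sentence/role pairs (e.g., by backpropagation through time) is what allows the hidden layer to develop the abstractions analyzed in the paper.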
In the following, we first review the ways in which the grammatical relation “subject” appears in several languages. This gives rise to the notion that grammatical relations do not exhibit, for example, only two possible patterns of control (in the linguistic sense) over other categories. Rather, grammatical relations exhibit a variety of patterns of control over syntactic properties. This suggests it would be difficult for the subject relation to be described by a binary innate parameter. Next, we review relevant developmental data on the acquisition of syntax. The evidence we review suggests (1) that syntax is acquired in a bottom-up, data-driven fashion, and (2) that there are specific patterns of over- and under-generalization that reflect the nature of the linguistic input to the child. We then review the theory proposed by Morris [40] based on this data. Finally, we present a connectionist simulation of one stage of the theory, and demonstrate that the system acquires a notion of “subject” without any innate bias to do so.

1 Linking rules are heuristics (or algorithms, depending on the theory) for making provisional assignments of verb arguments to grammatical relations. The criteria for the assignments are semantic. Because virtually any semantic role can be a subject, the algorithmic variants of these theories are quite complicated. For a recent treatment of linking rules, see Dowty [21].
2 The Shape of Grammatical Relations
While a number of theorists have explored the real complexity of grammatical
relations (e.g., [19, 20, 23, 29, 51]), there remains a perception among some
theorists (e.g., [34, 35, 36]) that grammatical relations are essentially a binary
phenomenon: grammatical relations are deemed to be either accusative or ergative, and hence an “ergative parameter” determines the behaviors. This has
been the prevailing view in a number of language acquisition theories [47].
A first-order approximation of the difference between accusative and ergative
grammatical relations is that the subject of a syntactically accusative language
is typically the agent of an action, while in a syntactically ergative language the
“subject”,2 or subject-like grammatical relation, is typically the patient of an
action. One potential distinguishing property (indicative, though not decisive)
would be which nominal in a sentence controls clause coordination. Thus in the
sentence, Max hit Larry and ran away, who ran away? In a strongly syntactically
accusative language, it is Max that ran away; in a strongly syntactically ergative
language, it is Larry that ran away.
For those who regard the accusative/ergative split as being simply binary,
the problem becomes merely identifying the subject. If the subject is the agent,
then the language is accusative, if it is the patient, it is ergative. But the problem
is not that simple. It is not merely the identity of the subject that is the issue,
but what properties do the various grammatical relations control? In some sense,
the question is what “shape” do the grammatical relations in a language take
on?
We have examined the literature to find the syntactic properties that are associated with subjects cross-linguistically. Perhaps the definitive work in this area
is Keenan [29], from which we have extracted a set of six properties that are capable of being associated with subjects (and quasi-subjects) cross-linguistically:
1. Addressee of imperatives.
2. Control of reflexivization. E.g., Max shaved himself. (The controller of the reflexive is the subject.)
3. Control of coordination. E.g., Max pinched Lola and fled. (The deleted argument of the second clause is coreferential with the subject of the first clause.)
4. Target of equi-NP deletion. E.g., Max convinced Lola to be examined by the doctor. Max convinced the doctor to examine Lola. (The deleted argument of the embedded clause is the subject.)
5. Ability to launch floating quantifiers. E.g., The boys could all hear the mosquitoes. (The quantifier all refers to the subject, i.e., boys, rather than to the object, i.e., mosquitoes.)
6. Target of relativization deletion. E.g., I know the man who saw Max. I know the man who Max saw.

2 Because of the associations with accusative phenomena carried by the term “subject” in a number of theoretical approaches, one might wish to call the primary grammatical relation in syntactically ergative languages something else. The term “pivot” has been used.
In English the last item is a free property; any nominal that is coreferential
with the relativized matrix nominal can be deleted in the embedded clause in
relativization. The examples demonstrate two of the cases.
The grammatical relations of various languages control various combinations
of these (and other) properties. This is what we mean by the “shape” of grammatical relations. We have analyzed these syntactic properties in English and in two
other languages, Dyirbal (Australian) [18, 20] and Kapampangan (Philippine),
which have rather different constellations of properties from those of English,
as well as from each other [40]. Grammatical relations in these languages have
shown interesting patterns of behavior. For example, in English the first five
of these properties are controlled by the subject, the last is a “free property”,
not controlled by any grammatical relation. In Dyirbal, properties 3, 4, & 6 are
controlled by an “ergative subject”, or “pivot” [18, 20]. In Kapampangan, one
grammatical relation (which tends to be the agent) controls properties 1, 2, &
3, while another (which ranges over all semantic roles) controls properties 5 &
6. Property 4 can be controlled by either of the grammatical relations.
Hence English is a highly syntactically-accusative language, Dyirbal is a highly syntactically-ergative language, and Kapampangan appears to be a split
language, neither highly ergative nor highly accusative in syntax. This is discussed at some length in [40], but as these languages do not bear directly on the
present simulation, we will simply note that this issue is addressed in both the
theoretical proposal and in our long-term goals.
Our purpose for raising the issue here is to argue that for a language acquisition system to be “universal”, i.e., capable of learning any human language, it must be
able to accommodate a variety of language types. Simply settling on the identity
of the subject is not sufficient. Rather, the various control patterns (“shapes”)
described above must be accommodated. Our proposal involves a system that
can learn a variety of shapes.
3 Review of Data from Psycholinguistic Studies
There are several avenues of psycholinguistic data that we have explored. One of
these is the issue of early abstraction vs. rote behavior. There have been a number
of studies that have indicated that children’s earliest multiword utterances have
been largely rote or semi-rote behaviors [1, 2, 11, 12, 13, 44, 45, 53]. In a pair
of studies Tomasello and Olguin showed an asymmetry between the relative
facility with which two-year-old children can manipulate nouns, both in terms
of morphology and syntax, and the relative difficulty with which they handle
verbs. Tomasello & Olguin [55] demonstrated their productivity with nouns,
while Olguin & Tomasello [43] showed their relative nonproductivity with verbs.
It appears that the control that children have over verbs very early in the multiword stage is largely rote; there is no systematic relationship between them.
That is, there is little or no transfer from knowledge of one verb to the next.
There have been a number of studies [26, 41, 42, 56, 57] that have been
interpreted as providing evidence of early abstraction. There are several problems with the interpretations of these studies, however. Some of these have
interpreted arguably rote behaviors as representing abstraction [54], and others
have interpreted small-scale systematic behavior as large-scale systematic behavior [3, 15, 44, 45]. That is, it was found that certain systematic behaviors were
limited to semantically similar predicates.
Despite the fact that an individual child’s developing grammar is a quickly
moving target, the issues of systematic and non-systematic behaviors can in
certain instances be teased out. Indications of systematic behaviors can be seen
in overgeneralization, and indications of the limits of systematic behaviors can
be seen in undergeneralization.
In numerous studies, Bowerman [5, 6, 7, 8, 9, 10] has investigated instances
of overgeneralization in child speech; overgeneralization is the phenomenon of
extending rules inappropriately. For example, children exposed to English learn
the “lexical causative” alternation, as in the ball rolled∼Larry rolled the ball,
and the vase broke∼Max broke the vase. Children inappropriately extend this
alternation to verbs such as giggle or sweat to produce such sentences as Don’t
giggle me or It always sweats me [9]. Overgeneralizations of this sort are evidence
that the child has developed the notion of a class of verbs, such as roll, float,
break, sweat, giggle, and disappear, which share a semantic role (patient) in
their intransitive forms, and that the child is willing to treat them the same
syntactically. The fact that this is inappropriate for the word sweat means that
the child is extremely unlikely to have heard this usage before, therefore the child
has used systematic behavior to produce this utterance. Another of Bowerman’s
studies [10] involved the overgeneralization of linking rules. Children rearranged
verb-argument structures in accordance with a linking rule generalization rather
than in accordance with some presumed verb-class alternation (e.g., I saw a
picture which enjoyed me.).
Of particular note here is the timing of these, and other, overgeneralizations.
Most of the overgeneralizations that Bowerman has studied, including the lexical causative overgeneralization discussed above, appear starting between two
and a half and three and a half years of age. The linking rule overgeneralizations started appearing after the age of 6. The former overgeneralizations are
presumably learned behaviors—the child must learn what sorts of verb classes
exist in a language and what alternations are associated with them before these
overgeneralizations can occur. On the other hand, according to many nativist
theories, linking rules are innate [46, 47]. Furthermore, linking rules must be active very early in multi-word speech in order for the first tentative assignments
of nouns to grammatical relations to be made, a necessary step in breaking into
the syntactic system. Yet the overgeneralizations ascribable to linking rules do
not appear until the age of six years or later.
If we can judge by overgeneralization, it would appear that linking rules are
not innate; at the very least it appears that they are not active at a time when
they are most needed, i.e., early in multi-word speech. The alternative is that
they are not necessary precursors to multiword speech. Rather, they are highly
abstract generalizations that first give evidence of existence after a large portion
of the grammar of a language has been mastered.
Undergeneralization, too, has a role to play in determining the nature of
the learning mechanisms. A number of studies have been conducted showing an
interesting asymmetry in the learning of the passive construction in English. A
study by Maratsos, Kuczaj, Fox, & Chalkley [38] showed that four- and five-year-old children could understand both the active and passive voices of action
verbs (e.g., drop, hold, shake, wash), but had difficulty understanding the passive
voices of psychological or perceptual verbs (e.g., watch, know, like, remember).
Maratsos, Fox, Becker, & Chalkley [37] showed that this difficulty appeared to
extend until the age of 10. Another study by de Villiers et al. [17] confirmed
the comprehension asymmetry between the two types of verbs, while a study by
Pinker et al. [48] showed a similar asymmetry in production. In a preliminary
study Maratsos et al. [37] also showed that parental input to children was limited
in a similar way: parents used few, if any, experiential verbs in the passive voice.3
This study is particularly interesting because a common notion of the passive
is that its relationship to the active voice is defined in terms of subjects and
objects. Whether or not this is true in an adult, it appears that this is not the
way that children learn this alternation. It seems that children first acquire this
systematic alternation in a semantically-limited arena, in which the active-voice
patient is promoted to the passive “subject”. Only later do they extend it to a
more “semantically abstract” arena in which it is the active-voice object that is
promoted to the subject position.
3 The few experiential verbs that they did find in the passive voice in parental
input were of the percept-experiencer type (e.g., frighten, surprise) rather than the
experiencer-percept type (e.g., fear, like). Maratsos et al. did not test the children
for their comprehension of percept-experiencer verbs.
4 A Theoretical Proposal
We wish to test a proposal put forward in Morris [40], which describes an approach to learning grammatical relations without recourse to innate, domain-specific linguistic knowledge. This model is based on (i) the Goldberg variation
of Construction Grammar [24, 25], and (ii) the learning mechanisms of connectionism [50], inter alia. The proposal is that the acquisition of grammatical
relations occurs as a three-stage process.
In the first stage a child learns verb argument structures as separate, individual “mini-grammars”. This term is used to emphasize that there are no
overarching abstractions that link these individual argument structures to other
argument structures. Each argument structure is a separate grammar unto itself.
In the second stage the child develops correspondences between the separate
mini-grammars; initially the correspondences are based on both semantic and
syntactic similarity, later the correspondences are established on purely syntactic
criteria. The transition is gradual, with the role that semantics plays decreasing
slowly.
For example, the verbs eat and drink are quite similar to each other, and
will “merge” quickly into a larger grammar. Similarly, the verbs hit and kick will
merge early, since their semantics and syntax are similar. While all four of these
verbs have agents and patients as verb arguments, there are many semantic differences between the verbs of ingestion and the verbs of physical assault; the merge between these two verb groups will therefore occur later in development.
Ultimately, these agent-patient verbs will merge with experiencer-percept
verbs (e.g., like, fear, see, remember), percept-experiencer verbs (e.g., please,
frighten, surprise), and others, yielding a prototypical transitive construction
with an extremely abstract argument structure. The verb-arguments in these
abstract argument structures can be identified as “A”, the transitive actor, and
“O”, the transitive patient (or “object”). In addition, there is a prototypical intransitive argument structure with a single argument, “S”, the intransitive “subject”.
(This schematic description is due to Dixon [19].)
In the third stage, the child begins to associate the abstract arguments of the
abstract transitive and intransitive constructions with the coindexing constructions that instantiate the properties of, for example, clause coordination, control structures, and reflexivization. So, for example, an intransitive-to-transitive
coindexing construction will associate the S of an intransitive first clause with
the deleted co-referent A of a transitive second clause. This will enable the understanding of a sentence like Max arrived and hugged everyone. Similarly, a
transitive-to-intransitive coindexing construction will associate the A of an initial transitive clause with the S of a following intransitive clause; this will enable
the understanding of a sentence like Max hugged Annie and left.
Since this association takes place relatively late in the process, necessarily
building on layers of abstraction and guided by input, the grammatical relations
(of which S, A, and O are the raw material) “grow” naturally into the language-appropriate molds.
W.C. Morris, G.W. Cottrell, and J. Elman
From beginning to end this is a usage-based acquisition system. It starts with
rote-acquisition of verb-argument structures, and by finding commonalities, it
slowly builds levels of abstraction. Through this bottom-up process, it accommodates to the target language. (For other accounts of usage-based systems, see
also Bybee [14] and Langacker [31, 32, 33].)
5 A Connectionist Simulation
In this section we present a connectionist simulation to test whether a network
could build abstract relationships corresponding to “subjects” and “objects”
given an English-like language with a variety of grammatical constructions. This
was done in such a way that there is no “innate” knowledge of language in the
network. In particular, there are no architectural features that correspond to
“syntactic elements”, i.e., no grammatical relations, no features that facilitate
word displacement, and so forth. The main assumptions are that the system
can process sequential data, and that it is trying to map sequences of words to
semantic roles.
The motivation behind the network is the notion that merely the drive to
map input words to output semantics is sufficient to induce the necessary internal
abstractions to facilitate the mapping. To test this hypothesis, a Simple Recurrent Network [22] was created and tested using the Stuttgart Neural Network
Simulator (SNNS). The network is shown in Figure 1.
The network takes in a sequence of patterns representing sentences generated
from a grammar. At each time step, a word or end of sentence marker is presented. After each sentence, an input representing “reset” is presented, for which
the network is supposed to zero out the outputs. The output patterns represent
semantic roles in a slot-based representation. The teaching signals for the roles
are given as targets starting from the first presentation of the corresponding
filler word, and then held constant throughout the rest of the presentation of
the sentence.
Fig. 1. Network architecture.
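The architecture just described is an Elman-style network: at each time step the hidden layer receives the current word code together with a copy of its own previous activations. Below is a minimal forward-pass sketch of such a network, not the SNNS implementation used in the study; the layer sizes (10-bit word input, 120 hidden units, 6 × 10 output units) come from the text, while the class and helper names are our own:

```python
import numpy as np

# Elman-style Simple Recurrent Network: the hidden state is fed back
# as "context" input on the next time step.
class SRN:
    def __init__(self, n_in=10, n_hid=120, n_out=60, seed=0):
        rng = np.random.default_rng(seed)
        # small random initial weights (illustrative range)
        self.W_in = rng.uniform(-1.0, 1.0, (n_hid, n_in))
        self.W_ctx = rng.uniform(-1.0, 1.0, (n_hid, n_hid))
        self.W_out = rng.uniform(-1.0, 1.0, (n_out, n_hid))
        self.context = np.zeros(n_hid)

    def step(self, x):
        # hidden activation depends on the current input and the saved context
        h = 1.0 / (1.0 + np.exp(-(self.W_in @ x + self.W_ctx @ self.context)))
        self.context = h                  # save for the next time step
        y = 1.0 / (1.0 + np.exp(-(self.W_out @ h)))
        return h, y

    def reset(self):
        # the "reset" input in the simulation zeroes the network's state
        self.context[:] = 0.0

net = SRN()
for word in np.eye(10)[:3]:               # a toy three-word "sentence"
    hidden, outputs = net.step(word)
print(outputs.shape)                      # 60 outputs: 6 role slots x 10 units
```

Presenting one word per step and calling `reset()` between sentences mirrors the regimen described above.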
The input vocabulary consists of 56 words (plus end of sentence and reset),
represented as 10-bit patterns, with 5 bits on and 5 bits off. Of these 56 words,
25 are verbs, 25 are nouns, and the remaining 6 are a variety of function words. All
of the nouns are proper names. Of the verbs, 5 are unergative (intransitive, with
agents as the sole arguments, e.g., run, sing), 5 are unaccusative (intransitive,
with patient arguments, e.g., fall, roll), 10 are “action” transitives (with agent
& patient arguments, e.g., hit, kick, tickle), and 5 are “experiential” transitives
(with experiencer & percept arguments, e.g., see, like, remember). In addition
there is a “matrix verb”, persuade, which is used for embedded sentence structures. The 5 remaining words are who, was, by, and, and self.
The output layer is divided into 6 slots that are 10 units wide. The first slot is
the verb identifier, the second through the fifth are the identifiers for the agent,
the patient, the experiencer, and the percept. (Note that at most two of
these four slots should be filled at one time.) The sixth slot is the “matrix agent”
slot, which will be explained below. The representation of the fillers is unrelated
to the representation of the words – the slot fillers only have 2 bits set out of 10.
Hence the network cannot just copy the inputs to the slots.
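One way to generate codes with these bit-count constraints is to sample them at random; a sketch (the specific codes drawn here are illustrative, not those used in the simulation):

```python
import itertools
import random

random.seed(1)

def make_codes(n_words, n_bits, bits_on):
    """Draw n_words distinct binary codes of length n_bits with
    exactly bits_on bits set."""
    all_codes = [c for c in itertools.product([0, 1], repeat=n_bits)
                 if sum(c) == bits_on]
    return random.sample(all_codes, n_words)

# 56 input words: 10-bit patterns with 5 bits on and 5 bits off
word_codes = make_codes(56, 10, 5)

# slot fillers, e.g. for the 25 proper names: only 2 of 10 bits set,
# so structurally unrelated to the input word codes
filler_codes = make_codes(25, 10, 2)

print(len(word_codes), sum(word_codes[0]), sum(filler_codes[0]))
```

Because input and filler codes have different bit counts, no simple copy of the input pattern can serve as a correct slot filler.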
Using the back-propagation learning procedure [49] the network was taught
to assign the proper noun identifier(s) to the appropriate role(s) for a number of
sentence structures. Thus for the sentence, Sandy persuaded Kim to kiss Larry,
the matrix agent role is filled by Sandy, the agent role is filled by Kim, and the
patient role is filled by Larry. In the sentence, Who did Larry see, the experiencer
role is filled by Larry and the percept role is filled by who. Training was conducted
for 50 epochs, with 10,000 sentences in each epoch. The learning rate was 0.2, and the initial weights were set within a range of 1.0. There was no momentum.
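The weight change applied by back-propagation at the output layer is the familiar delta rule; the following sketch shows one such update with the stated learning rate of 0.2 and no momentum. Sigmoid units, squared error, and the small initial weight range are our assumptions for the sketch, not details taken from the simulation:

```python
import numpy as np

rng = np.random.default_rng(0)

lr = 0.2                                   # learning rate from the text
# output weights: 120 hidden units -> 60 output units;
# a small initial range keeps the sketch away from sigmoid saturation
W = rng.uniform(-0.1, 0.1, (60, 120))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update(W, h, target):
    """One delta-rule step on the output layer (no momentum)."""
    y = sigmoid(W @ h)
    delta = (target - y) * y * (1.0 - y)   # error times sigmoid derivative
    return W + lr * np.outer(delta, h), y

h = rng.uniform(0, 1, 120)                 # a stand-in hidden-state vector
target = np.zeros(60)
target[:2] = 1.0                           # a 2-bits-on slot-filler target
for _ in range(200):
    W, y = update(W, h, target)
```

Repeated updates drive the output toward the slot-filler target, which is the per-step behavior the full training regimen composes over 50 epochs.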
Examples of the types of sentences and their percentage in the training set
are listed below:
1. Simple declarative intransitives (18%). E.g., Sandy jumped (agent role)
and Sandy fell (patient role).
2. Simple declarative transitives (26%). E.g., Sandy kissed Kim (agent and
patient roles) and Sandy saw Kim (experiencer and percept roles).
3. Simple declarative passives (6%). E.g., Sandy was kissed (patient role).
4. Questions (20%). E.g., Who did Sandy kiss? (agent and patient roles, object
is questioned), Who kissed Sandy? (agent and patient roles, subject is questioned), Who did Sandy see? (experiencer and percept roles, object is questioned),
and Who saw Sandy? (experiencer and percept roles, subject is questioned).
5. Control (equi-NP) sentences (25%). E.g., Sandy persuaded Kim to run
(matrix agent and agent roles), Sandy persuaded Kim to fall (matrix agent and
patient roles), Sandy persuaded Kim to kiss Max (matrix agent, agent, and patient roles) and Sandy persuaded Kim to see Max (matrix agent, experiencer, and
percept roles).
6. Control (equi-NP) sentences with questions (6%). E.g., Who did Sandy
persuade to run/fall? (questioning embedded subject, whether agent or patient,
of an intransitive verb), Who persuaded Sandy to run/fall? (questioning matrix
agent; note embedded intransitive verb), Who persuaded Sandy to kiss/see Max?
(questioning matrix agent; note embedded transitive verb), and Who did Sandy
persuade to kiss Max? (questioning embedded agent).
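The pairing of sentence templates with role targets can be illustrated with a toy generator covering just the simple transitives of group 2 (the names, verbs, and equal split are illustrative only, not the actual training proportions):

```python
import random

random.seed(0)

NAMES = ["Sandy", "Kim", "Larry", "Max"]
ACTION_VERBS = ["kissed", "hit"]           # take agent & patient roles
EXPER_VERBS = ["saw", "liked"]             # take experiencer & percept roles

def sample_sentence():
    """Return (words, roles): a sentence plus the slot targets the
    network is trained to produce for it."""
    a, b = random.sample(NAMES, 2)
    if random.random() < 0.5:
        v = random.choice(ACTION_VERBS)
        roles = {"verb": v, "agent": a, "patient": b}
    else:
        v = random.choice(EXPER_VERBS)
        roles = {"verb": v, "experiencer": a, "percept": b}
    return [a, v, b], roles

words, roles = sample_sentence()
print(words, roles)
```

Extending such templates with passives, questions, and control clauses (and withholding selected combinations) reproduces the kind of training set, with systematic gaps, described below.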
The generalization test involved two systematic gaps in the data presented
to the network; both involved experiential verbs. The first was passive sentences with experiential verbs, e.g., Sandy was seen by Max. The second involved
questioning embedded subjects in transitive clauses with experiential verbs, e.g.,
Who did Sandy persuade to see Max? Neither of these sentence types occurred
with experiential verbs in the training set. The test involved probing these gaps.
The network was not expected to generalize over these two systematic gaps in
the same way. The questioning-of-embedded-subject-sentences gap is part of an
interlocking group of constructions which “conspire” to compensate for the gap.
The “members of the conspiracy” are the transitive sentences (group 2 above),
the questions (group 4), and the control sentences (group 5). These sentences
are related to each other, and they should cause the network to treat the agents
of action verbs and the experiencers of experiential verbs the same. Thus we
believe that this gap, which is unattested in parental input, should show some
generalization. Our account in terms of construction conspiracies would then be the basis for explaining many of the overgeneralizations that occur in children.
Meanwhile, the passive gap has no such compensating group of constructions.
Only the transitive sentences (group 2) provide support for the passive generalization. This gap corresponds to one that actually exists in parental input. If
our model is a good one, we would expect that it should not bridge this gap.
6 Results
In Table 1 we show the results of testing a variety of constructions, some forms of which were trained and two of which were not. Five hundred sentences of each listed type were tested. The results were computed using Euclidean distance decisions: each field in the output vector was compared with all possible field values (including the all-zeroes vector), and each field was assigned the nearest value.
For a sentence to be “correct” all of the output fields had to be correct. The two
salient lines are for simple passive clauses with experiential verbs, which had a
6.2% success rate, and questioning embedded subjects with experiential verbs,
which had a 67.4% success rate. The near complete failure of generalization for
simple passive clauses with experiential verbs showed that the nonappearance of
experiential verbs in the passive voice in the training set caused the network to
learn the passive voice as a semantically narrow alternation. This is similar to
the undergeneralization found by Maratsos et al. [37, 38], discussed above. This
gap, as mentioned earlier, has been shown [37] to be one that actually exists in
parental input to children.
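The Euclidean distance decision procedure amounts to nearest-neighbour decoding applied field by field; a sketch with a toy two-entry codebook (not the actual filler codes):

```python
import numpy as np

def decode_fields(output, codebook, field_width=10):
    """Split the output vector into fields and snap each field to the
    nearest legal filler code; the all-zeroes vector marks an empty slot."""
    candidates = np.vstack([np.zeros(field_width), codebook])
    decoded = []
    for i in range(0, len(output), field_width):
        field = output[i:i + field_width]
        dists = np.linalg.norm(candidates - field, axis=1)
        decoded.append(candidates[np.argmin(dists)])
    return decoded

# toy codebook of two 10-bit filler codes
codebook = np.array([[1, 1] + [0] * 8, [0] * 8 + [1, 1]], dtype=float)

# a noisy output: first field near code 0, second field near all-zeroes
noisy = np.concatenate([codebook[0] + 0.1, np.zeros(10) + 0.05])
fields = decode_fields(noisy, codebook)
print([f.tolist() for f in fields])
```

A sentence is then scored correct only if every decoded field matches its target, as described above.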
On the other hand, the questioning of embedded subjects with experiential
verbs, which likewise did not appear in the training set, showed much greater generalization, in all likelihood because there is a “conspiracy of syntactic
constructions” surrounding this gap. As mentioned above, the simple transitive
Table 1. Sentence comprehension using Euclidean distance decisions.

Sentence description                                 Percent correct
Simple active clauses, action verbs                            97.6%
Simple active clauses, experiential verbs                      97.6%
Simple passive clauses, action verbs                           91.8%
Simple passive clauses, experiential verbs                      6.2%
Control (equi-NP) structures                                   83.6%
Questioning embedded subjects, action verbs                    91.4%
Questioning embedded subjects, experiential verbs              67.4%
clauses, questioned simple clauses, and control sentences were the prime “conspirators”.
Simple transitive clauses established the argument structures for both the
agent-patient verbs and the experiencer-percept verbs:
– Roger kissed Susie. (agent–patient argument structure)
– Linda saw Pete. (experiencer–percept argument structure)
Questioned simple clauses established the ability to question the subjects of both
argument structures:
– Who pinched Sandy? (questioned agent)
– Who remembered Max? (questioned experiencer)
Control sentences established embedded clauses for both argument structures:
– Fred persuaded Ian to tickle Lynn. (embedded agent–patient argument structure)
– Fred persuaded Sam to hate Terry. (embedded experiencer–percept argument
structure)
Questioning embedded agents established the relevant pattern, including the
fronting of the embedded, questioned constituent:
– Who did Raul persuade to tickle Sally? (embedded questioned agent)
The interlocking patterns above led to extension of this last pattern to experiencer–
percept verbs.
The passive gap has no such compensating group of constructions. Only the
transitive sentences (group 2) provided support for the passive generalization;
as we shall see, these were insufficient to bridge the gap.
Simple transitive clauses established the similarity of argument structures:
– Sally tickled Jack. (agent–patient argument structure)
– Jack liked Sally. (experiencer–percept argument structure)
Simple intransitive clauses established patients as subjects:
– Susie fell. (patient–only argument structure)
Passive sentences, which only occurred with agent–patient verbs, established an alternation between active–voice agent–patient argument structures and
passive–voice patient–only argument structures with the same verbs:
– Jack was tickled. (patient–only argument structure with a verb that is seen
in the active voice)
The gap of the questioned–embedded–experiencer was overcome because
there was a sufficient number of overlapping constructions and there was a well-established precedent of experiencer subjects. As a result we are seeing a level of
abstraction, with the network able to “define”, in some sense, the gap in terms
of the embedded subject rather than merely an embedded agent.
In order for the gap of the passive-voice for experiencer–percept verbs to be
overcome there would have to have been an established precedent of percept–
subjects. There were none. There were no percept–only verbs in the data set;
indeed, there are arguably no percept–only verbs in English. The gap of the
passive-voice for experiencer–percept verbs was not overcome because there was
an insufficient number of overlapping constructions, and because there was no
precedent of percept–subjects in the data set.
6.1 Analysis of Representations in the Hidden Layer
We wanted to probe the way that the network represented subjects internally,
i.e., in the hidden layer. This was done by creating and comparing “subject-variance vectors” for combinations of verb classes and syntactic constructions.
Subject–variance vectors are vectors representing the variance of the hidden
layer units when only the subject is varied. This should show where the subject
is being encoded in the hidden layer. Creating the variance vectors is a three-step
process.
To construct these vectors, we presented the network with 25 sentences varying only in their subject. We saved the 120 hidden unit activations at the end of
each presentation, and computed the variance on a per unit basis. The variances
so computed should then represent “where” the subject is being encoded for that
verb/construction combination.
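The first two steps of this procedure can be sketched as follows, with random activations standing in for the hidden states saved from the network:

```python
import numpy as np

rng = np.random.default_rng(0)

# stand-in for the saved hidden states: 25 sentences (one per subject
# noun) x 120 hidden units, for one verb/construction combination
hidden_states = rng.uniform(0, 1, (25, 120))

# step 1: per-unit variance over the 25 subjects; high variance marks
# the units in which the subject's identity is being encoded
subject_variance = hidden_states.var(axis=0)

# step 2: the prototype for a verb class is the mean of the
# subject-variance vectors of its members (two toy members here)
class_vectors = np.stack([subject_variance, subject_variance * 0.9])
prototype = class_vectors.mean(axis=0)

# tightness of the class: average Euclidean distance from the
# prototype to each member's subject-variance vector
tightness = np.mean([np.linalg.norm(v - prototype) for v in class_vectors])
print(subject_variance.shape, float(tightness))
```

The third step, comparing prototypes across classes, is just `np.linalg.norm(proto_a - proto_b)` applied pairwise, which yields the distances reported in Table 2.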
Next we compared the subject-variance vectors within a verb class to each
other. An average subject-variance vector was computed for each verb class (for
a given construction); this represented the “prototype” subject representation
for the verb class.
To test how tightly associated the representations of the subjects of these
verb-construction classes were we computed the average Euclidean distance from
the prototype to each of the members of the class. For unaccusative (patient-only) and unergative (agent-only) verbs in simple clauses and in embedded clauses, the average distances were about 0.5. For transitive verbs, both agent-patient
and experiencer-percept verbs, in simple clauses and in embedded clauses, the
averages were about 0.3. For passive-voice agent-patient verbs the average was
about 0.4. (The fact that intransitive verb-construction combinations have “less disciplined”, i.e., less tightly associated, subject representations may be explained by the fact that these verbs have only a single verb argument. The network need not “remember” two verb arguments simultaneously; it can therefore be profligate in the manner of the storage of the subject’s identity.)

Table 2. Euclidean distances between prototype subject variance vectors.

Simple clauses                   Other simple clauses             Distance
Transitive Agent                 Intransitive Agent                   0.69
Transitive Agent                 Transitive Experiencer               0.69
Intransitive Agent               Transitive Experiencer               0.82
Transitive Experiencer           Intransitive Patient                 1.15
Intransitive Agent               Intransitive Patient                 1.18
Transitive Agent                 Intransitive Patient                 1.40

Simple clauses                   Embedded counterparts            Distance
Intransitive Patient             Embedded Intransitive Patient        0.70
Intransitive Agent               Embedded Intransitive Agent          0.73
Transitive Experiencer           Embedded Transitive Experiencer      0.77
Transitive Agent                 Embedded Transitive Agent            0.81

Embedded clauses                 Other embedded clauses           Distance
Embedded Transitive Agent        Embedded Transitive Experiencer      0.57
Embedded Transitive Agent        Embedded Intransitive Agent          0.67
Embedded Transitive Experiencer  Embedded Intransitive Agent          0.78
Embedded Transitive Experiencer  Embedded Intransitive Patient        1.07
Embedded Intransitive Agent      Embedded Intransitive Patient        1.07
Embedded Transitive Agent        Embedded Intransitive Patient        1.27

Active voice                     Passive voice                    Distance
Intransitive Patient             Passive Voice Patient                0.72
Transitive Experiencer           Passive Voice Patient                1.27
Intransitive Agent               Passive Voice Patient                1.37
Transitive Agent                 Passive Voice Patient                1.51
Passive Voice Percept            Passive Voice Patient                0.38
The third step involved looking at the distances between the prototypes. This
allowed us to see how similar prototypes were. The results of our comparisons are
shown in Table 2, where we use “Intransitive Agent” for unergative subjects (e.g.,
Sandy jumped) and “Intransitive Patient” for unaccusative subjects (e.g. Sandy
fell). In general, one can think of distances less than 1.0 as “close” (although
none are as close as the within-class distances mentioned above) and distances
greater than 1.0 as “far”. With this in mind, Table 2 shows that there are
interesting relationships between the instantiations of subjects in various verb-and-construction groups (recall that all the entries in the Table correspond to
subjects).
First, considering only the simple clauses, we see that the entries divide into
two distinct groups. The intransitive patients are relatively far from the other
classes, while the agents and experiencers tend to pattern together. To understand why, consider that in transitive constructions with agent-patient verbs,
both agents and patients must be present. Therefore the two semantic roles
must be stored simultaneously, and thus their representations must be in somewhat different units. Agents, whether transitive or intransitive, will most likely
be represented by the same set of units. Note that experiencers never need to
be stored simultaneously with agents. Therefore their representation can overlap agent-subjects much more than can the representations of patient-subjects.
Then the question is why experiencers pattern more with agents than patients.
We believe this is because the agent-subjects are simply the most frequent in the
training set, and thus have a primacy in “carving out” the location for subjects
in the hidden layer. This is also consistent with many linguistic theories where
agents are considered the prototypical subjects.
Second, the distances between the matrix clause subjects and their embedded
clause counterparts are also close, and in the same range as the distances between
“non-antagonistic” subject types.
Third, the embedded clauses essentially replicate the pattern seen in the
simple clauses, with agent and experiencer subjects patterning together, and
patient subjects at a distance.
Fourth, the passive voice patient subjects are far from active voice subjects,
with the (not unexpected) exception of active voice intransitive patients. Clearly,
the network has drawn a major distinction between patient-subjects and non-patient subjects. Again, we hypothesize that the network did this simply because
of the necessity of storing agents and patients simultaneously.
Finally, we see that passive voice patient subjects are very close (within the
range of a within-class distance) to passive voice subjects of experiential verbs
(percepts). Recall that the network was never trained on experiential verbs in the
passive voice and never trained with percept-subjects; the network has basically
stored such subjects in the same location as passive voice patient subjects. This
is consistent with the failure of the network to correctly process these novel
constructions.
We conclude from this analysis of the subject-variance vectors that within
a syntactically defined class of verbs, the subjects are stored in very nearly the
same set of units. These subject patterns are more similar to each other than
they are to the subject patterns for the same class of verbs in other constructions, or to the subject patterns of other classes of verbs. Most importantly,
though, the representation of “subject” in the network is controlled by two main
factors. First, if the subjects of two sentences must fill the same thematic role,
they will be stored similarly. Second, representations are pushed apart according
to whether the processing requirements force them to compete for representational resources. In the case of our set of sentence types, the effect is that agents
and patients are stored separately because they can appear together, and experiencers are stored very close to agents, since they never appear together. The
result is that the instantiation of “subject” in the network amounts to a radial
category in the manner of Lakoff’s Women, Fire, and Dangerous Things [30].
These relationships are largely in accord with the predictions of the theoretical
model sketched out in this paper.
7 Discussion and Conclusions
This simulation was intended to demonstrate that the most abstract aspects
of language are learnable. There are two broad areas in which this is explored:
control of “subjecthood” properties and demonstration of relative abstraction.
In the area of control of properties, this simulation demonstrated that the
network was capable of learning to process equi-NP deletion sentences (also
known as “control constructions”). This is shown in the ability of the network
to correctly process sentences such as Sandy persuaded Kim to run (these are
shown in groups 5 & 6, in section 5 above). As was seen above, the network was
able to correctly understand these sentences at a rate of 84%.
The network’s ability to abstract from semantics was shown in the ability
of the network to partially bridge the artificial gap in the training set, that of
the questioned embedded subject of experiential verbs. The network was able
to define the position in that syntactic construction in terms of a semantically abstract entity, that is, a subject rather than an agent. Consistent with developmental data, the network also did not generalize when it should not have. In
particular, it did not process passive sentences with perceptual subjects. We have
hypothesized that this pattern of generalization and lack of generalization can
be explained as a conspiracy of constructions that bootstrap the processing required for a new construction. Without this scaffolding, the network assimilates
the new construction into a known one.
As is clear from the examination of the hidden layer, we can see how the
network stores a partially-abstract representation of the subject. We can also
see the limitations of abstraction; the network’s representation of the subject of
a given sentence is also partially specified in semantically loaded units. And, as
we have seen in the Maratsos [37] study, this appears to be appropriate to the
way that humans learn language. This result is also consistent with Goldberg’s
theoretical analysis [25] that predicts this semantically-limited scope to certain
syntactic constructions.
Of course, we have been preceded by many others in the use of recurrent
networks for language comprehension [16, 22, 28, 39, 52]. Most of these previous
works impose a great deal of structure on the networks that, in some sense,
parallels a preconceived notion of what sentence processing should look like.
The previous work to which we owe the greatest debt is that of Elman [22], who
developed Simple Recurrent Networks, and St. John & McClelland [52], who
applied them to the problem of mapping from sequences of words to semantic
representations. There are two main differences between this work and that of St.
John & McClelland. In terms of networks, ours is simpler, because we specify in
advance an output representation for semantics. While our semantics is simpler,
the syntactic constructions used in training are more complex. Indeed, the fact
that we focus upon the notion of a grammatical relation and how it could be
learned is what differentiates this work from much of the previous work. Such
a notion, as shown in the list of characteristic properties, requires a fairly large
array of sentence types. Our analysis of the network’s representation of this
notion also is novel.
One obvious drawback of our work is the impoverished semantics. All of our
nouns were glossed as proper names, but they were just simple bit patterns
with no inherent structure. The only difference in verb “meanings”, aside from a
particular bit pattern for a signature, was the set of thematic roles they licensed.
A richer semantics would presumably be required to model the earlier stages of
the theory, where verbs with similar meanings merge into larger categories. On
the bright side, preliminary studies for future work, as well as similar studies by
Van Everbroeck [58], indicate that this sort of network can be scaled up in the
size of the vocabulary.
In the context of this book, this work demonstrates that a “radical” connectionist approach, that is, one without any additional bells and whistles to force it
to be “symbolic”, is indeed able to form categories usually reserved for symbolic
approaches to linguistic analysis. Indeed, we believe that this sort of approach
will eventually show that syntax as a separate entity from semantic processing
is an unnecessary assumption. Rather, what we see in our network is that “syntax”, in the usual understanding of that term, is part and parcel of the processing
required to map from a sequence of input words to a set of semantic roles.
Acknowledgments
The authors would like to thank Adele Goldberg and Ezra Van Everbroeck for
their comments and suggestions in the course of this study.
References
[1] N. Akhtar. Characterizing English-speaking children’s understanding of SVO
word order. To appear.
[2] L. Bloom. One Word at a Time: The Use of Single Word Utterances Before
Syntax. Mouton de Gruyter, The Hague, 1973.
[3] L. Bloom, K. Lifter, and J. Hafitz. Semantics of verbs and the development of
verb inflection in child language. Language, 56:386–412, 1980.
[4] H. Borer and K. Wexler. The maturation of syntax. In T. Roeper and E. Williams,
editors, Parameter Setting. D. Reidel Publishing Company, Dordrecht, 1987.
[5] M. Bowerman. Learning the structure of causative verbs: A study in the relationship of cognitive, semantic and syntactic development. In Papers and Reports
on Child Language Development, volume 8, pages 142–178. Department of Linguistics, Stanford University, 1974.
[6] M. Bowerman. Semantic factors in the acquisition of rules for word use and
sentence construction. In D. M. Morehead and A. E. Morehead, editors, Normal
and deficient child language. University Park Press, Baltimore, 1976.
[7] M. Bowerman. Evaluating competing linguistic models with language acquisition
data: Implications of developmental errors with causative verbs. Quaderni di
semantica, 3:5–66, 1982.
[8] M. Bowerman. Reorganizational processes in lexical and syntactic development.
In E. Wanner and L. R. Gleitman, editors, Language acquisition: The state of the
art, pages 319–346. Cambridge University Press, Cambridge, 1982.
[9] M. Bowerman. The “no negative evidence” problem: How do children avoid an
overgeneral grammar? In John A. Hawkins, editor, Explaining Language Universals, pages 73–101. Basil Blackwell, Oxford (UK), 1988.
[10] M. Bowerman. Mapping thematic roles onto syntactic functions: Are children
helped by innate linking rules? Linguistics, 28:1253–1289, 1990.
[11] M. D. S. Braine. On learning the grammatical order of words. Psychological
Review, 70:323–348, 1963.
[12] M. D. S. Braine. Children’s first word combinations, volume 41 of Monographs
of the Society for Research in Child Development. University of Chicago Press,
Chicago, 1976.
[13] R. Brown. A First Language: The Early Stages. Harvard University Press, Cambridge MA, 1973.
[14] J. Bybee. Morphology: A Study of the Relation between Meaning and Form. John
Benjamins, Amsterdam, 1985.
[15] E. V. Clark. Early verbs, event-types, and inflections. In C. E. Johnson and
J. H. V. Gilbert, editors, Children’s Language, volume 9, pages 61–73. Lawrence
Erlbaum Associates, Mahwah, NJ, 1996.
[16] Garrison W. Cottrell. A Connectionist Approach to Word Sense Disambiguation.
Research Notes in Artificial Intelligence. Morgan Kaufmann, San Mateo, 1989.
[17] J. G. de Villiers, M. Phinney, and A. Avery. Understanding passives with nonaction verbs. Paper presented at the Seventh Annual Boston University Conference on Language Development, October 8-10, 1982.
[18] R. M. W. Dixon. The Dyirbal Language of North Queensland. Cambridge University Press, Cambridge UK, 1972.
[19] R. M. W. Dixon. Ergativity. Language, 55:59–138, 1979.
[20] R. M. W. Dixon. Ergativity. Cambridge University Press, Cambridge, 1994.
[21] D. Dowty. Thematic proto-roles and argument selection. Language, 67(3):547–619,
1991.
[22] J. L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.
[23] W. A. Foley and R. D. Van Valin Jr. Functional Syntax and Universal Grammar.
Cambridge University Press, Cambridge, 1984.
[24] A. E. Goldberg. Argument Structure Constructions. PhD thesis, University of
California, Berkeley, 1992.
[25] A. E. Goldberg. A Construction Grammar Approach to Argument Structure.
University of Chicago Press, Chicago, 1995.
[26] K. Hirsh-Pasek and R. M. Golinkoff. The Origins of Grammar: Evidence from
Early Language Comprehension. MIT Press, Cambridge MA, 1996.
[27] N. M. Hyams. Language acquisition and the theory of parameters. D. Reidel
Publishing Company, Dordrecht, 1986.
[28] Ajay N. Jain. A connectionist architecture for sequential symbolic domains. Technical Report CMU-CS-89-187, Carnegie Mellon University, 1989.
[29] E. L. Keenan. Towards a universal definition of “Subject”. In C. Li, editor, Subject
and Topic, pages 303–334. Academic Press, New York, 1976.
[30] G. Lakoff. Women, Fire, and Dangerous Things. University of Chicago Press, Chicago, 1987.
[31] R. W. Langacker. Theoretical Prerequisites, volume 1 of Foundations of Cognitive
Grammar. Stanford University Press, Stanford, 1987.
[32] Ronald W. Langacker. Concept, Image, and Symbol: The Cognitive Basis of Grammar. Mouton de Gruyter, Berlin, 1991.
[33] Ronald W. Langacker. Descriptive Application, volume 2 of Foundations of Cognitive Grammar. Stanford University Press, Stanford, 1991.
[34] B. Levin. On the Nature of Ergativity. PhD thesis, MIT, 1983.
[35] C. D. Manning. Ergativity: Argument Structure and Grammatical Relations. PhD
thesis, Stanford, 1994.
[36] A. P. Marantz. On the nature of grammatical relations. MIT Press, Cambridge
MA, 1984.
[37] M. Maratsos, D. E. C. Fox, J. A. Becker, and M. A. Chalkley. Semantic restrictions
on children’s passives. Cognition, 19:167–191, 1985.
[38] M. Maratsos, S. A. Kuczaj II, D. E. C. Fox, and M. A. Chalkley. Some empirical
studies in the acquisition of transformational relations: Passives, negatives, and
the past tense. In W. A. Collins, editor, Children’s Language and Communication:
The Minnesota Symposia on Child Psychology, volume 12, pages 1–45. Lawrence
Erlbaum Associates, Hillsdale NJ, 1979.
[39] Risto Miikkulainen. Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon, and Memory. MIT Press, Cambridge, MA, 1993.
[40] W. C. Morris. Emergent Grammatical Relations: An Inductive Learning System.
PhD thesis, University of California, San Diego, 1998.
[41] L. Naigles. Children use syntax to learn verb meanings. Journal of Child Language,
17:357–374, 1990.
[42] L. Naigles, K. Hirsh-Pasek, R. Golinkoff, L. R. Gleitman, and H. Gleitman. From
linguistic form to meaning: Evidence for syntactic bootstrapping in the two-yearold. Paper presented at the Twelfth Annual Boston University Child Language
Conference, Boston MA, 1987.
[43] R. Olguin and M. Tomasello. Twenty-five-month-old children do not have a grammatical category of verb. Cognitive Development, 8:245–272, 1993.
[44] J. M. Pine, E. V. M. Lieven, and C. F. Rowland. Comparing different models of
the development of the english verb category. MS, Forthcoming.
[45] J. M. Pine and H. Martindale. Syntactic categories in the speech of young children:
The case of the determiner. Journal of Child Language, 23:369–395, 1996.
[46] S. Pinker. Language Learnability and Language Development. Harvard University
Press, Cambridge MA, 1984.
[47] S. Pinker. Learnability and Cognition: The Acquisition of Argument Structure.
MIT Press, Cambridge MA, 1989.
[48] S. Pinker, D. S. LeBeaux, and L. A. Frost. Productivity and constraints in the
acquisition of the passive. Cognition, 26:195–267, 1987.
[49] D. E. Rumelhart and J. L. McClelland. On learning the past tenses of English
verbs. In J. L. McClelland and D. E. Rumelhart, editors, Parallel Distributed
Processing: Explorations in the Microstructure of Cognition, volume 2, pages 216–
271. The MIT Press, Cambridge MA, 1986.
[50] D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1. The MIT Press, Cambridge
MA, 1986.
[51] P. Schachter. The subject in Philippine languages: Topic, Actor, Actor-Topic, or
none of the above? In C. Li, editor, Subject and Topic, pages 491–518. Academic
Press, New York, 1976.
[52] Mark F. St. John and James L. McClelland. Learning and applying contextual
constraints in sentence comprehension. Artificial Intelligence, 46:217–257, 1990.
[53] M. Tomasello. First verbs: A case study of early grammatical development. Cambridge University Press, Cambridge, 1992.
[54] M. Tomasello and P. J. Brooks. Early syntactic development: A construction
grammar approach. In M. Barrett, editor, The Development of Language. UCL
Press, London, in press.
A Connectionist Simulation of the Empirical Acquisition
193
[55] M. Tomasello and R. Olguin. Twenty-three-month-old children do have a grammatical category of noun. Cognitive Development, 8:451–464, 1993.
[56] V. Valian. Syntactic categories in the speech of young children. Developmental
Psychology, 22:562–579, 1986.
[57] V. Valian. Syntactic subjects in the early speech of American and Italian children.
Cognition, 40:21–81, 1991.
[58] E. Van Everbroeck. Language type frequency and learnability: A connectionist
appraisal. In M. Hahn and S. C. Stoness, editors, The Proceedings of the Twenty
First Annual Conference of the Cognitive Science Society, pages 755–760, Mahwah
NJ, 1999. Lawrence Erlbaum Associates.
Large Patterns Make Great Symbols:
An Example of Learning from Example
Pentti Kanerva
RWCP Theoretical Foundation SICS Laboratory
Real World Computing Partnership, Swedish Institute of Computer Science
SICS, Box 1263, SE-164 29 Kista, Sweden
kanerva@sics.se
Abstract. We look at a distributed representation of structure with variable binding that is natural for neural nets and that allows traditional symbolic representation and processing. The representation supports learning from example. This is demonstrated by taking several instances of the mother-of relation implying the parent-of relation, encoding them into a mapping vector, and showing that the mapping vector maps new instances of mother-of into parent-of. Possible implications for AI are considered.
1 Introduction
Distributed representation is used commonly with neural nets, as well as in ordinary
computers, to encode a large number of attributes or things with a much smaller number of variables or units. In this paper we assume that the units are binary so that the
encodings of things are bit patterns or bit vectors (‘pattern’ and ‘vector’ will be used
interchangeably in this paper). In normal symbolic processing the bit patterns are arbitrary identifiers or pointers, and it matters only whether two patterns are identical or
different, whereas bit patterns in a neural net are somewhat like an error-correcting
code: similar patterns (highly correlated, small Hamming distance) mean the same or
similar things—which is why neural nets are used as classifiers and as low-level sensory processors.
Computer modeling of high-level mental functions, such as language, led to the
development of symbolic processing, and for a long time artificial intelligence (AI)
and symbolic processing (e.g., Lisp) were nearly synonymous. However, the resulting
systems have been rigid and brittle rather than lifelike. It is natural to look to neural
nets for a remedy. Their error-correcting properties should make the systems more forgiving. Consequently, considerable research now goes into combining symbolic and
neural approaches. One way of doing this is to use both symbolic and neural processing according to what each does best. Wermter [15] reviews such approaches for language processing. Another is to encode (symbolic) structure with a distributed representation that is suitable for neural nets. The present paper explores this second direction, possibly providing insight into how symbolic processes are realized in the brain.
Much has been written and debated about the encoding of structure for neural nets
(see, e.g., Sharkey [13]) without reaching a clear consensus. Hinton [4] has discussed
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 194–203, 2000.
© Springer-Verlag Berlin Heidelberg 2000
it in depth and has introduced the idea of reduced representation. The idea is that both
a composed structure and its components are represented by vectors of the same
dimensionality, akin to the fixed-width pointers of a symbolic expression, except that
the vector for the composed structure is built of the vectors for its components; it is a
function of the component vectors. This idea is realized in Pollack’s Recursive Auto-Associative Memory (RAAM) [11], in Kussul, Rachkovskij, and Baidyk’s Associative-Projective Neural Network [8, 12], in Plate’s Holographic Reduced Representation (HRR) [9, 10], and in my binary Spatter Code [6]. I will use the last of these here because it allows for particularly simple examples. The research on representation by Shastri and Ajjanagadde [14] and by Hummel and Holyoak [5] demonstrates yet other solutions to problems we share.
Most of this paper is devoted to demonstrating a symbol-processing mechanism for
neural nets. The point, however, is not to develop a neural-net Lisp but to see what
additional properties a system based on distributed representation might have, and how
these properties could lead to improved modeling of high-level mental functions.
These issues are touched upon at the end of the paper.
2 Binary Spatter-Coding of Structure
The binary Spatter Code is a form of Holographic Reduced Representation. It is summarized below in terms applicable to all HRRs, and in traditional symbolic terms using
a two-place relation r(x, y) as an example.
2.1 Space of Representations
HRRs work with large random patterns, or very-high-dimensional random vectors. All
things—variables, values, composed structures, mappings between structures—are
points of a common space: they are very-high-dimensional random vectors with independent, identically distributed components. The dimensionality of the space, denoted
by N, is usually in the thousands (e.g., N = 10,000). The Spatter Code uses dense
binary vectors (i.e., 0s and 1s are equally probable). The vectors are written in boldface, and when a single letter stands for a vector it is also italicized so that, for example, x is the N-dimensional vector (‘N-vector’ for short) that represents the variable or
role x, and c is the N-vector that represents the value or filler c.
2.2 Item Memory or Clean-up Memory
Some operations produce approximate vectors that need to be cleaned up (i.e., identified with their exact counterparts). This is done with an item memory that stores all valid
vectors known to the system, and retrieves the best-matching vector when cued with a
noisy vector, or retrieves nothing if the best match is no better than what results from
random chance. The item memory performs a function that, at least in principle, is performed by an autoassociative neural memory.
2.3 Binding
Binding is the first level of composition. It combines things that are very closely associated with each other, as when one is used to name or identify the other. Thus a variable is bound to a value with a binding operator that combines the N-vectors for the
variable and the value into a single N-vector for the bound pair. The Spatter Code
binds with coordinatewise (bitwise) Boolean Exclusive-OR (XOR, ⊗), so that the variable x having the value c (i.e., x = c) is encoded by the N-vector x⊗c whose nth bit is
xn ⊗cn (xn and cn are the nth bits of x and c, respectively). An important property of all
HRRs is that binding of two random vectors produces a random vector that resembles
neither of the two.
2.4 “Unbinding”
The inverse of the binding operator decomposes a bound pair into its constituents: it
finds the filler if the role is given, or the role if the filler is given. The XOR is its own
inverse function, so that, for example, x⊗(x⊗c) = c finds the vector to which x is
bound in x⊗c (i.e., what’s the value of x?).
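As a concrete illustration, the binding and unbinding operations above can be sketched in NumPy. This is my own sketch, not the author's implementation; the helper names are invented, and only the choice N = 10,000 follows the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000  # dimensionality, as suggested in Sec. 2.1

def random_vector():
    """Dense random binary N-vector (0s and 1s equally probable)."""
    return rng.integers(0, 2, N, dtype=np.uint8)

def bind(a, b):
    """Bind with bitwise XOR; XOR is its own inverse."""
    return np.bitwise_xor(a, b)

x = random_vector()   # role vector
c = random_vector()   # filler vector
xc = bind(x, c)       # encodes the binding x = c

# Unbinding recovers the filler exactly: x XOR (x XOR c) == c.
assert np.array_equal(bind(x, xc), c)

# The bound pair resembles neither constituent: its Hamming
# distance to x and to c is near the chance level N/2.
print(np.count_nonzero(xc != x), np.count_nonzero(xc != c))
```

The two printed distances hover around N/2 = 5,000, illustrating that the bound pair looks random with respect to both of its constituents.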
2.5 Merging
Merging is the second level of composition in which identifiers and bound pairs are
combined into a single entity. It has also been called ‘super(im)posing’, ‘bundling’,
and ‘chunking’. It is done by taking a normalized vector sum, and the merging of G and H is
written as 〈G + H〉, where 〈…〉 stands for normalization. The relation r(A, B) can be
represented by merging the representations for r, ‘r1 = A’, and ‘r2 = B’, where r1 and r2
are the first and second roles of the relation r. It is encoded by
rAB = 〈r + r1 ⊗A + r2 ⊗B〉 .   (1)
The normalized sum of dense binary vectors is given by bitwise majority rule, with ties
broken at random. An important property of all HRRs is that merging of two or more
random vectors produces a random vector that resembles each of the merged vectors
(i.e., rAB is similar to r, r1 ⊗A, and r2 ⊗B).
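The merging operation and its similarity property can be checked numerically. The sketch below is mine (all function names are illustrative); it encodes r(A, B) as in eqn. (1) and measures Hamming distances.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000

def random_vector():
    return rng.integers(0, 2, N, dtype=np.uint8)

def bind(a, b):
    return np.bitwise_xor(a, b)

def merge(*vectors):
    """Normalized sum: bitwise majority rule, ties broken at random."""
    total = np.sum(vectors, axis=0, dtype=np.int64)
    out = (2 * total > len(vectors)).astype(np.uint8)
    ties = 2 * total == len(vectors)  # possible only for an even count
    out[ties] = rng.integers(0, 2, np.count_nonzero(ties))
    return out

def hamming(a, b):
    return int(np.count_nonzero(a != b))

# Encode r(A, B) as in eqn. (1): rAB = <r + r1*A + r2*B>.
r, r1, r2, A, B = (random_vector() for _ in range(5))
rAB = merge(r, bind(r1, A), bind(r2, B))

# rAB resembles each merged vector: with three vectors, each bit of
# rAB agrees with a given constituent with probability 3/4, so the
# expected Hamming distance is N/4, well below the chance level N/2.
print(hamming(rAB, r), hamming(rAB, bind(r1, A)), hamming(rAB, bind(r2, B)))
```

All three printed distances come out near N/4 = 2,500, which is the similarity property stated above.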
2.6 Distributivity
In all HRRs, the binding and unbinding operators distribute over the merging operator.
For example,
x⊗〈G + H + I 〉 = 〈x⊗G + x⊗H + x⊗I 〉 .   (2)
Distributivity is a key to analyzing HRRs.
2.7 Probing
To find out what is bound to r1 in rAB (i.e., what’s the value of r1 in r(A, B)?), we
probe rAB (of eqn. 1) with r1 using the unbinding operator. It yields a vector A′ that is
recognizable as A (A′ will retrieve A from the item memory). The analysis is as follows:

A′ = r1 ⊗rAB = r1 ⊗〈r + r1 ⊗A + r2 ⊗B〉 ,   (3)

which, by distributivity, becomes

A′ = 〈r1 ⊗r + r1 ⊗(r1 ⊗A) + r1 ⊗(r2 ⊗B)〉   (4)

and simplifies to

A′ = 〈r1 ⊗r + A + r1 ⊗r2 ⊗B〉 .   (5)
Thus A′ is similar to A; it is also similar to r1 ⊗r and to r1 ⊗r2 ⊗B (see Merging, sec.
2.5), but they are not stored in the item memory and thus act as random noise.
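Probing and clean-up can be sketched the same way. The item memory here is simply a nearest-neighbor lookup by Hamming distance, which is my own stand-in for the autoassociative memory of Sec. 2.2; the names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10_000

def random_vector():
    return rng.integers(0, 2, N, dtype=np.uint8)

def bind(a, b):
    return np.bitwise_xor(a, b)

def merge(*vectors):
    total = np.sum(vectors, axis=0, dtype=np.int64)
    out = (2 * total > len(vectors)).astype(np.uint8)
    ties = 2 * total == len(vectors)
    out[ties] = rng.integers(0, 2, np.count_nonzero(ties))
    return out

r, r1, r2, A, B = (random_vector() for _ in range(5))
rAB = merge(r, bind(r1, A), bind(r2, B))

# Probe rAB with r1 (eqns. 3-5): the result A' is a noisy version of A.
A_noisy = bind(r1, rAB)

# Item memory: store all valid vectors; clean up by nearest neighbor.
items = {"r": r, "r1": r1, "r2": r2, "A": A, "B": B}
best = min(items, key=lambda k: np.count_nonzero(items[k] != A_noisy))
print(best)  # retrieves "A": its distance is ~N/4 versus ~N/2 for the rest
```

With N = 10,000 the gap between N/4 and N/2 is enormous relative to its statistical spread, so the clean-up essentially never fails.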
2.8 Holistic Mapping and Simple Analogical Retrieval
The functions described so far are sufficient for traditional symbol processing; for
example, for realizing a Lisp-like list-processing system. Holistic mapping is a parallel
alternative to sequential search and substitution of traditional symbolic processing.
Probing is the simplest form of holistic mapping: it approximately maps a composed
pattern into one of its bound constituents, as seen above (eqns. 3–5). However, much
more than that can be done in a single mapping operation. For example, it is possible
to do several substitutions at once, by constructing a mapping vector from individual
substitutions (each substitution appears as a bound pair, which then are merged; the
kernel map M* discussed at the end of the next section is an example). I have demonstrated this kind of mapping between things that share structure (same roles, different
objects) elsewhere [7]. In the next section we do the reverse: we map between structures that share objects (same objects in two different relations). The mappings are
constructed from examples, so that this is a demonstration of analogical retrieval or
inference.
3 Learning from Example
We will look at two relations, one of which implies the other: ‘If x is the mother of y,
then x is the parent of y’, represented symbolically by m(x, y) → p(x, y). We take a specific example m(A, B) of the mother-of relation and compare it to the corresponding
parent-of relation p(A, B), to get a mapping M1 between the two. We then use this
mapping on another pair (U, V ) for which the mother-of relation holds, to see whether
M1 maps m(U, V ) into the corresponding parent-of relation p(U, V ).
We encode ‘A is the mother of B’, or m(A, B), with the random N-vector mAB =
〈m + m1 ⊗A + m2 ⊗B〉, where m encodes (it names) the relation and m1 and m2 encode
its two roles. Similarly, we encode ‘A is the parent of B’, or p(A, B), with pAB = 〈p +
p1 ⊗A + p2 ⊗B〉. Then
M1 = MAB = mAB⊗pAB   (6)
maps a specific instance of the mother-of relation into the corresponding instance of
the parent-of relation, because mAB⊗MAB = mAB⊗(mAB⊗pAB) = pAB.
The mapping MAB is based on one example; is it possible to generalize based on
only one example? When the mapping is applied to another instance m(U, V ) of the
mother-of relation, which is encoded by mUV = 〈m + m1 ⊗U + m2 ⊗V 〉, we get the
vector W:
W = mUV⊗MAB .   (7)
Does W resemble pUV?
We will measure the similarity of vectors by their correlation ρ (i.e., by normalized
covariance; −1 ≤ ρ ≤ 1). The correlations reported in this paper are exact: they are mathematical mean values or expectations. They have been calculated from composed vectors that are complete in the following sense: they include all possible bit combinations of their component vectors. For example, if the composed vectors involve a total of b “base” vectors, all vectors will be 2^b bits long.
If we start with randomly selected (base) vectors m, m1, m2, p, p1, p2, A, B, …, U, V
that are pairwise uncorrelated (ρ = 0), we observe first that mAB and pAB are uncorrelated but mAB and mUV are correlated because they both include m in their composition; in fact, ρ(mAB, mUV) = 0.25 and, similarly, ρ(pAB, pUV) = 0.25. When W is
compared to pUV and to other vectors, there is a tie for the best match: ρ(W, pUV) =
ρ(W, pAB) = 0.25. All other correlations with W are lower: with the related (reversed)
parent-of relations pBA and pVU it is 0.125, with an unrelated parent-of relation pXY
it is 0.0625, and with A, B, …, U, V, mAB, and mUV it is 0. So based on only one
example, m(A, B) → p(A, B), it cannot be decided whether m(U, V ) should be mapped
to the original “answer” p(A, B) or should generalize to p(U, V ).
Let us now look at generalization based on three examples of mother implying parent: What is m(U, V ) mapped to by M3 that is based on m(A, B) → p(A, B), m(B, C) →
p(B, C), and m(C, D) → p(C, D)? This time we will use a mapping vector M3 that is
the sum of three binary vectors,
M3 = MAB + MBC + MCD ,   (8)
where MAB is as above, and MBC and MCD are defined similarly. Since M3 itself is not
binary, mapping mAB or mUV with M3 cannot be done with an XOR. However, we
can use an equivalent system in which binary vectors are bipolar, by replacing 0s and
1s with 1s and −1s, and bitwise XOR (⊗) with coordinatewise multiplication (×). Then
the mapping can be done with multiplication, vectors can be compared with correlation, and the results obtained with M1 still hold. Notice that now MAB = mAB×pAB,
for example (cf. eqn. 6).
Mapping with M3 gives the following results: To check that it works at all, consider
WAB = mAB×M3; it is most similar to pAB (ρ = 0.71) as expected because M3 contains MAB . Its other significant correlations are with mAB (0.41) and with pUV and
pVU (0.18). Thus the mapping M3 strongly supports m(A, B) → p(A, B). It also supports the generalization m(U, V ) → p(U, V ) unambiguously, as seen by comparing
WUV = mUV×M3 with pUV. The correlation is ρ(WUV , pUV) = 0.35; the other significant correlations of WUV are with pAB and pVU (0.18) and with pBA (0.15) because
they all include the vector p (parent-of).
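The experiment of this section can be reproduced approximately with random (rather than complete) base vectors in the bipolar system just described; with N = 10,000 the empirical correlations fall within roughly ±0.01 of the paper's exact values. This is my own sketch, and all the helper names are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10_000

def rand_bipolar():
    return rng.choice([-1, 1], N)

def merge(*vectors):
    """Normalized sum of bipolar vectors: the sign of the sum
    (no ties can occur when merging an odd number of vectors)."""
    return np.sign(np.sum(vectors, axis=0))

def corr(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

m, m1, m2 = (rand_bipolar() for _ in range(3))   # mother-of and its roles
p, p1, p2 = (rand_bipolar() for _ in range(3))   # parent-of and its roles
A, B, C, D, U, V = (rand_bipolar() for _ in range(6))

def m_rel(x, y):  # encode m(x, y)
    return merge(m, m1 * x, m2 * y)

def p_rel(x, y):  # encode p(x, y)
    return merge(p, p1 * x, p2 * y)

# One example (eqns. 6-7): generalization is ambiguous,
# both correlations come out near 0.25.
M1 = m_rel(A, B) * p_rel(A, B)
W = m_rel(U, V) * M1
print(corr(W, p_rel(U, V)), corr(W, p_rel(A, B)))

# Three examples (eqn. 8): generalization wins unambiguously,
# with corr(WUV, pUV) near 0.35 and all rivals lower.
M3 = (m_rel(A, B) * p_rel(A, B) + m_rel(B, C) * p_rel(B, C)
      + m_rel(C, D) * p_rel(C, D))
WUV = m_rel(U, V) * M3
print(corr(WUV, p_rel(U, V)), corr(WUV, p_rel(A, B)))
```

Note that binding and mapping are plain coordinatewise multiplication here, so the XOR identities of the binary system carry over unchanged.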
To track the trend further, we look at generalization based on five examples,
m(A, B) → p(A, B), m(B, C) → p(B, C), …, m(E, F) → p(E, F), giving the mapping
vector
M5 = MAB + MBC + MCD + MDE + MEF .   (9)
Applying it to mAB yields ρ(mAB×M5, pAB) = 0.63 (the other correlations are
lower, as they were for M3), and applying it to mUV yields ρ(mUV×M5, pUV) = 0.40
(again the other correlations are lower).
When the individual mappings MXY are analyzed, each is seen to contain the kernel
vectors m×p, m1 ×p1, and m2 ×p2, plus other vectors that act as noise and average out
as more and more of the mappings MXY are added together. The analysis is based on
the distributivity of the (un)binding operator over the merging operator (see sec. 2.6)
and it is as follows:
MXY = mXY×pXY
= 〈m + m1 ×X + m2 ×Y 〉×〈p + p1 ×X + p2 ×Y 〉
= 〈〈m + m1 ×X + m2 ×Y 〉×p
+ 〈m + m1 ×X + m2 ×Y 〉×p1 ×X
+ 〈m + m1 ×X + m2 ×Y〉×p2 ×Y 〉
= 〈〈m×p + m1 ×X×p + m2 ×Y×p〉
+ 〈m×p1 ×X + m1 ×X×p1 ×X + m2 ×Y×p1 ×X 〉
+ 〈m×p2 ×Y + m1 ×X×p2 ×Y + m2 ×Y×p2 ×Y 〉〉
= 〈〈m×p + m1 ×X×p + m2 ×Y×p〉
+ 〈m×p1 ×X + m1 ×p1 + m2 ×Y×p1 ×X 〉
+ 〈m×p2 ×Y + m1 ×X×p2 ×Y + m2 ×p2〉〉
= 〈〈m×p + noise〉 + 〈m1 ×p1 + noise〉 + 〈m2 ×p2 + noise〉〉 .   (10)
In eqn. (10), terms of the form X×X cancel to 1 (e.g., m1 ×X×p1 ×X = m1 ×p1), leaving the kernel vectors m×p, m1 ×p1, and m2 ×p2 among the noise.
The three kernel vectors are responsible for the generalization, and from them we
can construct a kernel mapping from mother-of to parent-of:
M* = m×p + m1 ×p1 + m2 ×p2 .   (11)
When mUV is mapped with it, we get a maximum correlation with pUV, as expected,
namely, ρ(mUV×M*, pUV) = 0.43; correlations with other parent-of relations are
ρ(mUV×M*, pXY) = 0.14 (X ≠ U, Y ≠ V ) and 0 with everything else.
The results are summarized in Figure 1, which relates the amount of data to the strength of inference and generalization. The data are examples or instances of mother-of implying parent-of, m(x, y) → p(x, y), and the task is to map either an old (Fig. 1a)
or a new (Fig. 1b) instance of mother-of into parent-of. The data are taken into account
by encoding them into the mapping vector Mk .
Figure 1a shows the effect of new data on old examples. Adding examples into the
mapping makes it less specific, and consequently the correlation for old inferences
(i.e., with pAB) decreases, but it decreases also for all incorrect alternatives. Figure 1b
shows the effect of data on generalization. When the mapping is based on only one
example, generalization is inconclusive (mUV×M1 is equally close to pAB and pUV),
but when it is based on three examples, generalization is clear, as M3 maps mUV
much closer to pUV than to any of the others. Finally, the kernel mapping M* represents a very large number of examples (the limit as the number of examples approaches infinity), and then the correct inference is the clear winner.
4 Discussion
We have used a simple form of Holographic Reduced Representation to demonstrate
mapping between two information structures, which is a task that has traditionally
been in the domain of symbolic processing. Similar demonstrations have been made
by Chalmers [2] and by others (e.g., Bodén & Niklasson [1]) using Pollack’s Recursive Auto-Associative Memory (RAAM) [11] and by Plate using real-vector HRR [9].
[Figure 1: two panels, (a) mAB × Mk and (b) mUV × Mk, plotting the correlation ρ of the mapped result with pAB, pBA, pUV, and pVU for the mappings M1, M3, M5, and M*.]

Figure 1. Learning from example: The mappings Mk map the mother-of relation to the parent-of relation. They are computed from k specific instances or examples of mXY being mapped to pXY. Each map Mk includes the instance ‘mAB maps to pAB’ and excludes the instance ‘mUV maps to pUV’. The mappings are tested by mapping an “old” instance mAB (a) and a “new” instance mUV (b) of mother-of to the corresponding parent-of. The graphs show the closeness of the mapped result (its correlation ρ) to different alternatives for parent-of. The mapping M* is a kernel map that corresponds to an infinite number of examples. Graph b shows good generalization (mUV is mapped closest to pUV) and discrimination (it is mapped much further from, e.g., pAB) based on only three examples (M3). The thickness of the heavy lines corresponds to slightly more than ±1 standard deviation around the expected correlation ρ when N = 10,000.

The lesson from such demonstrations is that certain kinds of representations and operations on them make it possible to do symbolic tasks with distributed representations
suitable for neural nets. Furthermore, when patterns are used as if they were symbols
(Gayler & Wales [3]), we do not need to configure different neural nets for different
data structures. A general-purpose neural net that operates with such patterns is then a
little like a general-purpose computer that runs programs for a variety of tasks.
The examples above were encoded with the binary Spatter Code. It uses very simple
mathematics and yet demonstrates the key properties of Holographic Reduced Representation. However, the thresholding of the vector sum, which is a part of the Spatter
Code’s merging operation, discards a considerable amount of information. Therefore the
sums themselves rather than thresholded sums were used for the mappings Mk and M*
(eqns. 8, 9, and 11). This means that the mappings are “outside the system” (they are
not binary vectors; in HRRs, all things are points of the same space), and the results
have to be measured by correlation rather than by Hamming distance. In fact, the system we ended up with—binding and mapping with coordinatewise multiplication,
merging with vector addition—has been used by Ross Gayler (personal communication) to study distributed representation of structure. Plate’s real-vector HRR [9] also
uses the vector sum for merging, but it uses circular convolution for binding. Circular
convolution involves more computation than does coordinatewise binding but the principles are no more complicated.
The binary system has its peculiarities because it binds and maps with XOR, which
is its own inverse function and is also commutative. When we form the mapping from
mother-of to parent-of, which is a valid inference, it is the same mapping as from parent-of to mother-of, which is valid as evidence but is not a valid inference. The binary
system with XOR obviously is not the final answer to the distributed representation of
structure, although by being very simple it makes for a good introduction. Another
way to think about the invalid inference from parent-of to mother-of is that these simple mechanisms don’t even have to be fully reliable, that they are allowed to, and perhaps should, make mistakes that resemble mistakes of judgment that people make.
Correlations obtained with HRRs tend to be low. Even the highest correlation for
generalization in this paper is only 0.43 and it corresponds to the kernel map (Fig. 1b),
and we usually have to distinguish between two quite low correlations, such as 0.3 and
0.25. Such fine discrimination is impossible if the vectors have only a few components, but when they have several thousand, even small correlations and small differences in them are statistically significant. This explains the need for very-high-dimensional vectors. For example, with N = 10,000, correlations have a standard deviation no greater than ±0.01, so that the difference between 0.3 and 0.25 is highly significant (it’s more than 5 standard deviations). These statistics suggest that N = 10,000
allows very large systems to work reliably and that N = 100,000 would usually be
unnecessarily large.
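The claimed standard deviation is easy to check empirically. The quick simulation below (my own, not from the paper) estimates the spread of the correlation between two unrelated random N-vectors.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 10_000

# Correlation between two unrelated random bipolar N-vectors,
# sampled repeatedly; its standard deviation should be about
# 1/sqrt(N) = 0.01, matching the figure quoted in the text.
samples = [float(np.mean(rng.choice([-1, 1], N) * rng.choice([-1, 1], N)))
           for _ in range(200)]
print(float(np.std(samples)))  # close to 0.01
```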
4.1 Toward Natural Intelligence
Our example of learning from example uses a traditional symbolic setting with roles
and fillers, with an operation for binding the two, and with another operation for combining bound pairs (and singular identifiers) into representations for new, higher level
compound entities such as relations (eqn. 1). The same binding and merging opera-
tions are used also to encode mappings between structures, such as the kernel map
between two relations (eqn. 11). Furthermore, individual instances or examples of the
mapping M are encoded with the binding operator—by binding corresponding
instances of the two relations (eqn. 6)—and several such examples are merged into
an estimate of the kernel map by simply adding them together or averaging (eqns. 8
and 9).
That the averaging of vectors for structured entities should be meaningful is a consequence of the kind of representation used. It has no counterpart in traditional symbolic representation and is, in fact, one of the most exciting and tantalizing aspects of distributed representation. Another is the “holistic” mapping of structures with mapping vectors such as M. They suggest the possibility of learning structured representations from examples without explicitly encoding roles and fillers or relying on high-level rules. This would be somewhat like how people learn. For example, children can
learn to use language grammatically without explicit instruction in grammar rules, just
as they pick up the sound patterns of their local dialect without being told about vowel
shifts and other such things. In modeling that kind of learning we would still use the
binding and merging operators, and possibly one or two additional operators, but
instead of binding the objects of a new instance to abstract roles, as was done in the
examples of this paper, we would bind them to the objects of an old instance—an
example—so that the example rather than an abstract frame would serve as the template. Averaging over many such examples could then produce representations from
which the abstract notions of role and filler and other high-level concepts could be
inferred. In essence, some traditions of AI and computational linguistics would be
turned upside down. Instead of basing our systems on abstract structures such as grammars, we would base them on examples and would discover grammars in the representations that the system produces by virtue of its mechanisms. The rules of grammar
would then reflect underlying mental mechanisms the way that the rules of heredity
reflect underlying genetic mechanisms, and it is those mechanisms that interest us and
that we want to understand.
To reach such a goal, we need to discover the right kinds of representations, ones
that allow meaningful new “symbols” to be created by simple operations on existing
patterns, operations such as XOR and averaging for binding and merging. The point of
this paper is to suggest that the goal may indeed be realistic and is a prime motive for
the study of distributed representations.
Acknowledgment
The support of this research by Japan’s Ministry of International Trade and Industry
through the Real World Computing Partnership is gratefully acknowledged.
References
1. Bodén, M.B., Niklasson, L.F.: Features of distributed representation for tree structures: A
study of RAAM. In: Niklasson, L.F., Bodén, M.B. (eds.): Current Trends in Connectionism. Erlbaum, Hillsdale, NJ (1995) 121–139
2. Chalmers, D.J.: Syntactic transformations on distributed representations. Connection Science 2, No. 1–2 (1990) 53–62
3. Gayler, R.W., Wales, R.: Connections, binding, unification and analogical promiscuity. In:
Holyoak, K., Gentner, D., Kokinov, B. (eds.): Advances in Analogy Research: Integration
of Theory and Data from the Cognitive, Computational, and Neural Sciences (Proc. Analogy ’98 workshop, Sofia). NBU Series in Cognitive Science, New Bulgarian University,
Sofia (1998) 181–190
4. Hinton, G.E.: Mapping part–whole hierarchies into connectionist networks. Artificial Intelligence 46, No. 1–2 (1990) 47–75
5. Hummel, J.E., Holyoak, K.J.: Distributed representation of structure: A theory of analogical access and mapping. Psychological Review 104 (1997) 427–466
6. Kanerva, P.: Binary spatter-coding of ordered K-tuples. In: von der Malsburg, C., von
Seelen, W., Vorbrüggen, J.C., Sendhoff, B. (eds.): Artificial Neural Networks (Proc.
ICANN ’96, Bochum, Germany). Springer, Berlin (1996) 869–873
7. Kanerva, P.: Dual role of analogy in the design of a cognitive computer. In: Holyoak, K.,
Gentner, D., Kokinov, B. (eds.): Advances in Analogy Research: Integration of Theory and
Data from the Cognitive, Computational, and Neural Sciences (Proc. Analogy ’98 workshop, Sofia). NBU Series in Cognitive Science, New Bulgarian University, Sofia (1998)
164–170
8. Kussul, E.M., Rachkovskij, D.A., Baidyk, T.N.: Associative-projective neural networks: Architecture, implementation, applications. Proc. Neuro-Nimes ’91, The Fourth Int’l Conference on Neural Networks and their Applications (1991) 463–476
9. Plate, T.A.: Distributed Representations and Nested Compositional Structure. PhD thesis.
Graduate Department of Computer Science, University of Toronto (1994) (Available by ftp
at ftp.cs.utoronto.ca as /pub/tap/plate.thesis.ps.Z)
10. Plate, T.A.: A common framework for distributed representation schemes for compositional
structure. In: Maire, F., Hayward, R., Diederich, J. (eds.): Connectionist Systems for Knowledge Representation and Deduction (Proc. CADE ’97, Townsville, Australia) Queensland
U. of Technology (1997) 15–34
11. Pollack, J.B.: Recursive distributed representations. Artificial Intelligence 46, No. 1–2
(1990) 77–105
12. Rachkovskij, D.A., Kussul, E.M.: Binding and normalization of binary sparse distributed representations by context-dependent thinning. (Manuscript available at http://cogprints.soton.ac.uk/abs/comp/199904008)
13. Sharkey, N.E.: Connectionist representation techniques. AI Review 5, No. 3 (1991) 143–167
14. Shastri, L., Ajjanagadde, V.: From simple associations to systematic reasoning. Behavioral
and Brain Sciences 16, No. 3 (1993) 417–494
15. Wermter, S.: Hybrid Approaches to Neural-Network-Based Language Processing. Report
ICSI TR-97-030, International Computer Science Institute, Berkeley, California (1997)
Context Vectors: A Step Toward a
“Grand Unified Representation”
Stephen I. Gallant
Knowledge Stream Partners, 148 State St., Boston MA 02109, USA,
sgallant@ksp.com
Abstract. Context Vectors are fixed-length vector representations useful for document retrieval and word sense disambiguation. Context vectors were motivated by four goals:
1. Capture “similarity of use” among words (“car” is similar to “auto”,
but not similar to “hippopotamus”).
2. Quickly find constituent objects (e.g., documents that contain specified words).
3. Generate context vectors automatically from an unlabeled corpus.
4. Use context vectors as input to standard learning algorithms.
Context Vectors lack, however, a natural way to represent syntax, discourse, or logic. Accommodating all these capabilities into a “Grand
Unified Representation” is, we maintain, a prerequisite for solving the
most difficult problems in Artificial Intelligence, including natural language understanding.
1 Introduction
We can view many of the fundamental advances in Physics as a series of successful unifications of separate, contradictory theories. For example, Einstein unified wave and particle theories for light, Minkowski unified space and time into “spacetime”, Hawking unified thermodynamics and black holes, and the current ‘holy grail’ is an attempt to unify the contradictory theories of relativistic gravity with quantum theory.
Just as theories play the central role in Physics, representation plays the
central role in machine learning and AI. For example, a fixed-length vector representation is necessary for virtually every neural network learning algorithm.
Getting the right features in such a vector can be much more important than
the choice of a learning algorithm.
The goal of this paper is to review a representation for text and images,
Context Vectors, and to assess the role of Context Vectors as a step toward an
eventual “Grand Unified Representation”. Finding a Grand Unified Representation is a likely prerequisite for solving the most difficult fundamental problems
in Artificial Intelligence, including natural language understanding.
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 204–210, 2000.
© Springer-Verlag Berlin Heidelberg 2000
2 Context Vector Representations
Context Vectors (CV’s) are fixed-length vectors that are used to represent words,
documents and queries [9,10,12,14,15]. Typically the dimension for each vector
is about 300, so we represent each of these objects by a single 300-dimensional
vector of real numbers. Context Vectors are normalized to have Euclidean length
equal to 1.
Context Vectors were created with four goals in mind.
2.1 Similarity of Use
The first goal is to capture “similarity of use” among terms. For example, we
want a representation where “car” and “auto” are similar (i.e., small Euclidean
distance or large dot product), and where “car” and “hippopotamus” are roughly
orthogonal (small dot product).
2.2 Quick Searching
A second goal for Context Vectors is “quick sensitivity to constituent objects”.
For example, we want the dot product of the CV for “car” with CV’s of documents containing “car” to be larger than the dot product of the CV for “car”
with CV’s of documents that do not contain “car”. This can be accomplished
by defining a document Context Vector to be the normalized sum of the CV’s of
those words in the document. (In practice, we use a weighted sum of the CV’s
to emphasize relative importance, i.e., rarity of words, but we will omit many
engineering details from this overview.)
The basic principle that makes possible this sensitivity to constituent objects
is that vector sum (superposition) preserves recognition. Suppose we take a sum,
S, of 10 random, normalized 300-dimensional vectors. Then it is easy to prove
that the expected dot product of a random, normalized vector, V, with S will be
about 1 if V was one of the 10 constituent vectors, and about 0 otherwise (see
[11], p. 41). Note that this property only works with high-dimensional vector
sums: if we have a sum of 10 integers that equals 93, then we have no information
as to whether 16 was one of the 10 constituent integers.
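This Vector Superposition Principle is easy to check numerically. The sketch below (NumPy; the 300-dimensional, 10-vector setup mirrors the text, everything else is illustrative) confirms that constituents of a sum score near 1 and outsiders near 0:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    """Normalize a vector to Euclidean length 1."""
    return v / np.linalg.norm(v)

d, k = 300, 10
members = np.array([unit(rng.standard_normal(d)) for _ in range(k)])
S = members.sum(axis=0)  # superposition of the 10 constituent vectors

# Dot products with S: near 1 for constituents, near 0 for unrelated vectors.
member_mean = float(np.mean(members @ S))
outsiders = np.array([unit(rng.standard_normal(d)) for _ in range(100)])
outsider_mean = float(np.mean(outsiders @ S))
print(member_mean, outsider_mean)
```

The cross terms in each dot product are random and average out, which is exactly why the property fails for low-dimensional sums such as integers.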
We can apply this Vector Superposition Principle to document CV’s, which
are sums of CV’s from the constituent terms in the document. Therefore, a
document Context Vector will tend to have a higher dot product with the CV
of a word in that document (or a word whose CV is similar to a word in that
document) than with the CV of a random, unrelated document.
The task of document retrieval is now quite simple:
1. Form a query Context Vector in the same way as we form document CV’s,
taking the weighted sum of constituent word Context Vectors.
2. Compute the dot product of the query Context Vector with each document
Context Vector (requiring 300 multiplications and additions per document).
3. Select those documents having the largest dot product with the query Context Vector.
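The three retrieval steps above can be sketched as follows. The tiny random vocabulary is purely illustrative (real word CVs are learned, and the sums are weighted by word rarity), but the ranking mechanics are as described:

```python
import numpy as np

rng = np.random.default_rng(1)

def unit(v):
    return v / np.linalg.norm(v)

d = 300
# Hypothetical vocabulary of random word CVs (a real system learns these).
vocab = {w: unit(rng.standard_normal(d))
         for w in ["car", "auto", "road", "river", "hippo", "water"]}

def doc_cv(words):
    """Document CV = normalized (here unweighted) sum of its word CVs."""
    return unit(sum(vocab[w] for w in words))

docs = [doc_cv(["car", "auto", "road"]),     # document 0
        doc_cv(["river", "hippo", "water"])] # document 1

def retrieve(query_words, docs):
    q = unit(sum(vocab[w] for w in query_words))  # step 1: query CV
    scores = np.array([q @ dcv for dcv in docs])  # step 2: one dot product per doc
    return int(np.argmax(scores)), scores         # step 3: best-scoring documents

best, scores = retrieve(["car"], docs)  # document 0 contains "car"
```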
Fig. 1. Learning Context Vectors. The CV for “fox” is modified by adding a fraction
(say 0.1) of the CV’s for “quick”, “brown”, “jumped”, and “over”. This makes the CV
for “fox” more like the CV’s of these surrounding terms.
2.3 Automated Generation of Context Vectors
A third goal for Context Vectors is their automatic generation from an unlabeled
corpus of documents so that the vectors preserve “similarity of use” [13,?].
One simple method for doing this is the following (Figure 1):
1. Initialize a Context Vector for each word to a normalized, random vector.
2. Make several passes through the corpus, changing the Context Vector for
each word to be more like the Context Vectors for its immediate neighbor
word Context Vectors. This generates the Context Vectors for all the words.
3. Generate the Context Vectors for documents by taking the (weighted) sum
of Context Vectors for constituent words.
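A minimal sketch of steps 1–2, assuming a toy corpus and a fixed neighbor window, and omitting the weighting and stabilization details the text skips:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def train_cvs(corpus, dim=300, window=2, rate=0.1, passes=5, seed=2):
    """Step 1: random unit CV per word.  Step 2: repeatedly nudge each
    word's CV toward the CVs of its neighbors (within +/- `window`)."""
    rng = np.random.default_rng(seed)
    cv = {}
    for doc in corpus:
        for w in doc:
            if w not in cv:
                cv[w] = unit(rng.standard_normal(dim))
    for _ in range(passes):
        for doc in corpus:
            for i, w in enumerate(doc):
                for j in range(max(0, i - window), min(len(doc), i + window + 1)):
                    if j != i:
                        cv[w] = unit(cv[w] + rate * cv[doc[j]])
    return cv

# Toy corpus: "car" and "auto" occur in identical contexts.
corpus = 10 * [["the", "car", "drives", "fast"],
               ["the", "auto", "drives", "fast"],
               ["a", "hippopotamus", "swims", "slowly"]]
cv = train_cvs(corpus)
sim_auto = float(cv["car"] @ cv["auto"])
sim_hippo = float(cv["car"] @ cv["hippopotamus"])
```

After training, the CV for “car” ends up much closer to that of “auto” than to that of “hippopotamus”, which is exactly the “similarity of use” property of the first goal.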
Although we have skipped a number of important engineering details, the
above procedure generates a Context Vector system automatically that embodies
“similarity of use” in its vector representations.
This learning algorithm has much the same flavor as Kohonen’s Self-Organizing
Maps (SOM) [17]. In SOM, a fixed (usually small) set of vectors is repeatedly
nudged toward neighbors so that the vectors come to “fill the space of neighboring exemplars”. Similarly, the above learning algorithm moves the CV’s for
words to “fill the space of neighboring word CV’s”. In both cases, the resulting
vectors take on characteristics of their neighbors, because neighbors become part
of the vectors. A difference with CV generation, however, is that the neighbors
are themselves CV’s, and these CV’s are themselves changing. One technique to
help “stabilize” this learning is described in [2].
Note that all words, documents, queries, and phrases are represented by
Context Vectors in exactly the same vector space, where distance between points
is strongly related to distance between objects in terms of “usage” (or perhaps
even “meaning”).
2.4 Learning
The fourth goal for Context Vectors, and a key motivation for creating Context
Vector representations, is that CV’s be usable for neural networks and other
machine learning techniques. Because Context Vectors for all objects are fixed-length vectors (and similar objects have similar representations), this property
naturally follows. Thus we can conveniently apply neural networks to problems
involving words or documents, for example to improve queries based upon user
feedback from retrieved documents.
For example, if we have a set of documents that are judged “relevant” to
query Q, and another set of documents that are “not relevant” to Q, we can
learn a query context vector that is different from Q. We first append “+1” as
an additional feature to CV’s of relevant documents, and “-1” as an additional
feature to CV’s of not relevant documents. We can now use either regression or
other single-cell neural net learning algorithms, such as the Pocket Algorithm
[8,11], to generate a set of weights having the same dimensions as the CV’s for
documents. We can then use these weights directly as a query CV, or modify Q
by adding in a fractional multiple of this set of weights.
Experiments have shown that both methods can give significant improvement
over the original query [15].
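The following sketch illustrates this relevance-feedback step under two simplifications: instead of appending a ±1 feature, it regresses document CVs directly onto ±1 relevance targets (plain least squares standing in for the regression/Pocket options), and it takes the “use these weights directly as a query CV” route. The clustered toy documents are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

def unit(v):
    return v / np.linalg.norm(v)

d = 50  # a small CV dimension keeps the toy example quick

# Hypothetical data: relevant docs cluster near direction a, non-relevant near b.
a, b = unit(rng.standard_normal(d)), unit(rng.standard_normal(d))
relevant = np.array([unit(a + 0.3 * rng.standard_normal(d)) for _ in range(20)])
nonrelevant = np.array([unit(b + 0.3 * rng.standard_normal(d)) for _ in range(20)])

def learn_query_cv(rel, nonrel):
    """Fit weights mapping relevant doc CVs toward +1 and non-relevant
    doc CVs toward -1, and use the normalized weights as a query CV."""
    X = np.vstack([rel, nonrel])
    y = np.concatenate([np.ones(len(rel)), -np.ones(len(nonrel))])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return unit(w)

q = learn_query_cv(relevant, nonrelevant)
s_rel = float(relevant.mean(axis=0) @ q)     # average score of relevant docs
s_non = float(nonrelevant.mean(axis=0) @ q)  # average score of non-relevant docs
```

The learned query scores the relevant cluster above the non-relevant one; blending a fraction of these weights into the original query, as described above, is the other option.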
3 Related Work
There are many similarities between Context Vectors and earlier work on Latent
Semantic Indexing (LSI) by Dumais and Associates [3,4,5,6], as discussed in [1].
In fact, the traditional Salton vector space model [19] also utilizes fixed-length
vectors, but these vectors have (in theory) one component for every word stem
in the corpus under consideration, and therefore no “similarity of use” property.
Context Vectors can also be used for word-sense disambiguation, for example
in differentiating “star in the sky” from “movie star”. The basic idea is to create
a Context Vector for each word sense, and then use surrounding context of a
word to find the closest word sense. Word senses may be taken from a dictionary
(requiring hand-labeling of the training corpus), or generated by clustering the
different contexts in which a word appears in a corpus [15].
There has also been work on “Image Context Vectors”, but these efforts have
so far met with mixed results [16].
Currently, HNC Software has a 50-person division called Aptex that has commercialized Context Vector technology, including Web applications. Although
the author is not aware of the technical details of HNC’s recent web activities, note
that a web page is itself a document. Therefore it is possible to precompute CV’s
for web pages of interest. Each CV is a fixed-length vector, say 300 numbers, and
this representation is a searchable representation for web pages. Furthermore, if
we know which web ads a user “clicks through” and which ads he or she ignores,
we can use regression or single-cell neural net learning algorithms to generate
a query CV, as previously described. Then we can choose which ad, from as yet
unpresented ads, to present to the user for increased likelihood of selection.
Fig. 2. Toward a Grand Unified Representation
All of these operations can be accomplished very quickly, including learning
query CV’s, so this technology is well-suited to “real time” ad selection applications over the web.
4 Toward a “Grand Unified Representation”
Context Vectors have been successful in achieving their four main goals. However,
they are insufficient for representing natural language in several key respects:
– Syntax: There is no representation for syntax. Even worse, most document
retrieval systems ignore key syntactical words, such as prepositions and conjunctions.
– Discourse and Logic: there is no Context Vector mechanism to capture the
temporal order of a story, or to draw logical conclusions. Presumably, it is
necessary to represent syntax as a prerequisite for representing discourse and
logic.
Taking a broader view, we see the situation as in Figure 2.
In order to accomplish some of the most difficult AI tasks, we need advances in representation that unify the capabilities displayed along the bottom of
Figure 2.
How might this be done?
One avenue worth exploring is to expand the dimensionality of Context Vector representations to capture syntactic and logical information. This would be
in the same spirit as the current efforts among the Physics community aimed at
higher dimensional “string theory”. Intuitively, it appears necessary to expand
dimensions, rather than use even more superposition, because storing documents
appears to already take up most of the storage capacity in Context Vectors. We
are less optimistic about several other approaches, such as creating a syntax tree
of Context Vectors.
To speculate further on possible new approaches, we might first perform a
parse of the text in question, and then build a CV representation that incorporates information from this parse. For example, we might emphasize “heads” of
phrases, including verbs and prepositions, to create parse-based CV’s for phrases.
Doing this would generate two separate CV’s for “threw the ball” and “milked
the cow”, so that “George milked the cow and threw the ball” would end up with
a different CV from “George milked the ball and threw the cow”. (In the current
CV representation, these two sentences produce the same document CV.)
Perhaps Plate’s Holographic Reduced Representations [18] might prove useful
for constructing phrase CV’s?
The ultimate representation should contain both the current first order context vectors, as well as parse-based second order phrase CV’s. The representation
could either use different sets of features for the two representations, or superimpose them on the same set of (say 400) features using the Vector Superposition
principle.
One problem is that perfect parsing can be extremely difficult, and can even
depend on the semantics of the text. However, restricting ourselves to phrase
identification is somewhat easier, and a precise parse may not be critical. We
might even consider going one step further, and “learning a parse”. One approach
would be to attempt to replace standard parsing schemes by some Kohonen-style
SOM operation that attempts to learn syntax relations.
Of course, why stop there? As long as we are speculating, we could extend to
third order structures involving relations among second order structures. Higher
order structures would then come to represent discourse, and could be used to
attempt to find similar story plots. We might even begin to approach the problem
of “understanding” the text.
5 Conclusion
We maintain that the key direction for research is representation, not learning
theory. We need to aim for a Grand Unified Representation that can handle
the various performance issues in Figure 2. A representation that merely combines aspects of neural and symbolic representations, without giving increased
computational capabilities, is of limited value.
Rather, we need a series of innovations that result in increasingly powerful representations, each of which adds to representational ability.
Context Vectors are a step in this “Unification Program”. They give a nice
way to represent words, word senses, documents, and queries while capturing
similarity of meaning.
We hope that work with Context Vectors will stimulate the clever ideas
and breakthroughs needed to arrive at a Grand Unified Representation, and
ultimately to tackle the truly difficult tasks in Artificial Intelligence.
References
1. Caid WR, Dumais ST and Gallant SI. (1995) Learned vector-space models for document retrieval. Information Processing and Management, Vol. 31, No. 3, pp. 419-429.
2. Caid WR & Pu O. System and method of context vector generation and retrieval.
United States Patent 5619709, Nov 21, 1995.
3. Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A.
Indexing by latent semantic analysis. Journal of the Society for Information Science,
1990, 41(6), 391-407.
4. Dumais, S. T. Improving the retrieval of information from external sources. Behavior
Research Methods, Instruments and Computers, 1991, 23(2), 229-236.
5. Dumais, S. T. (1993) LSI meets TREC: A status report. In D. Harman (Ed.) The
First Text REtrieval Conference (TREC-1). NIST special publication 500-207, 137-152.
6. Dumais, S.T. (1994) Latent Semantic Indexing (LSI) and TREC-2. In D. Harman
(Ed.) The Second Text REtrieval Conference (TREC-2). NIST special publication
500-215, 105-115.
7. Furnas, G. W., Deerwester, S., Dumais, S. T., Landauer, T. K., Harshman, R. A.,
Streeter, L. A., and Lochbaum, K. E. Information retrieval using a singular value
decomposition model of latent semantic structure. In Proceedings of SIGIR, 1988,
465-480.
8. Gallant, SI (1990) Perceptron-based learning algorithms. IEEE Transactions on
Neural Networks 1(2):179-192.
9. Gallant SI (1991) Context vector representations for document retrieval. AAAI-91
Natural Language Text Retrieval Workshop: Anaheim, CA.
10. Gallant SI (1991) A practical approach for representing context and for performing
word sense disambiguation using neural networks. Neural Computation 3(3):293-309.
11. Gallant, S. I. (1993) Neural Network Learning and Expert Systems. M.I.T. Press.
12. Gallant, S. I. (1994) Method For Document Retrieval and for Word Sense Disambiguation Using Neural Networks. United States Patent 5,317,507, May 31, 1994.
13. Gallant, S. I. (1994) Method For Context Vector Generation for use in Document
Storage and Retrieval. United States Patent 5,325,298, June 28, 1994.
14. Gallant S. I., Caid WR et al (1992) HNC’s MatchPlus system. The First Text
REtrieval Conference: Washington, DC. pp. 107-111.
15. Gallant S. I., Caid WR et al (1993) Feedback and Mixing Experiments With MatchPlus. The Second Text REtrieval Conference (TREC-2). NIST special publication
500-215, 101-104.
16. Gallant, S. I., and Johnston, M. F. Image retrieval using Image Context Vectors:
first results. IS&T/SPIE Symposium on Electronic Imaging: Science & Technology,
San Jose, Ca, Feb. 5-10, 1995. In Niblack and Jain, Eds. Storage and Retrieval for
Image and Video Databases III. SPIE Vol 2420, 82-94.
17. Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1995.
18. Plate, T.A. Distributed Representations and Nested Compositional Structure. University of Toronto, Department of Computer Science Ph.D. Thesis, 1994.
19. Salton, G., and McGill, M.J. Introduction to Modern Information Retrieval.
McGraw-Hill, New York, 1983.
Integration of Graphical Rules with Adaptive
Learning of Structured Information
Paolo Frasconi¹, Marco Gori², and Alessandro Sperduti³
¹ Dipartimento di Ingegneria Elettrica ed Elettronica, Università di Cagliari, Piazza d’Armi, 09123 Cagliari, Italy
² Dipartimento di Ingegneria dell’Informazione, Università di Siena, Via Roma 56, 53100 Siena, Italy
³ Dipartimento di Informatica, Università di Pisa, Corso Italia 40, I-56125 Pisa, Italy
Abstract. We briefly review the basic concepts underpinning the adaptive processing of data structures as outlined in [3]. Then, turning to
practical applications of this framework, we argue that stationarity of
the computational model is not always desirable. For this reason we introduce very briefly our idea on how a priori knowledge on the domain
can be expressed in a graphical form, allowing the formal specification of
perhaps very complex (i.e., non-stationary) requirements for the structured domain to be treated by a neural network or Bayesian approach.
The advantage of the proposed approach is the systematicity in the specification of both the topology and learning propagation of the adopted
computational model (i.e., either neural or probabilistic, or even hybrid
by combining both of them).
1 Introduction
Several disciplines have taken advantage of structured representations. These include, among many others, knowledge representation, language modeling, and pattern recognition. The interest in developing connectionist architectures capable of dealing with these rich representations (as opposed to “flat” or vector-based representations) can be traced back to the end of the 1980s.
Different approaches have been proposed. Touretzky’s BoltzCONS system
[15] is an example of how a Boltzmann machine can handle symbolic structures
using coarse-coded memories [11] as basic representational elements, and LISP’s
car, cdr, and cons functions as basic operations. The RAAM model proposed
by Pollack [10] is based on backpropagation (Backpropagation is only one particular way to implement the concept underlying the RAAM model) to discover
compact recursive distributed representations of trees with a fixed branching
factor. Recursive distributed representations are an instance of the concept of
a reduced descriptor introduced by Hinton [6] to solve the problem of mapping
part-whole hierarchies into connectionist networks.
Also related to the concept of a reduced descriptor are Plate’s holographic
reduced representations [9]. A formal characterization of representations of structures in connectionist systems using the tensor product was developed by Smolensky [13].
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 211–225, 2000.
© Springer-Verlag Berlin Heidelberg 2000
Examples of application domains where structures are extensively used are
medical and technical diagnoses (discovery and manipulation of structured dependencies, constraints, explanations), molecular biology (DNA and protein analysis), chemistry (classification of chemical structures, quantitative structureproperty relationship (QSPR), quantitative structure-activity relationship
(QSAR)), automated reasoning (robust matching, manipulation of logical terms,
proof plans, search space reduction), software engineering (quality testing, modularization of software), geometrical and spatial reasoning (robotics, structured
representation of objects in space, figure animation, layout of objects), speech
and text processing (robust parsing, semantic disambiguation, organizing and
finding structure in texts and speech).
While algorithms that manipulate symbolic information are capable of dealing with highly structured data, adaptive neural networks are mostly regarded
as learning models for domains in which instances are organized into static data
structures, like records or fixed-size arrays. Recurrent neural networks, which generalize feedforward networks to sequences (a particular case of dynamically structured data), are perhaps the best-known exception.
However, during the last few years, neural networks for the representation and
processing of structures have been developed [14]. In particular, recursive neural
networks are a generalization of recurrent networks for processing sequences
(i.e., linear chains from a graphical point of view) to the case of directed acyclic
graphs. These kinds of networks are of paramount importance for both structural
pattern recognition and for the development of hybrid neural-symbolic systems,
since they allow the treatment of structured information very naturally and, in
several cases, very efficiently.
The main motivations for using neural networks for the processing of structures are several: neural networks are universal function approximators, they can perform automatic inference (learning), they have very good classification capabilities, and they can deal with noise and incomplete data. All the
above properties are particularly useful when facing classification problems in a
structured domain.
Let us consider, for example, the use of neural networks for classification
of labeled graphs which represent chemical compounds. The standard approach
with feedforward neural networks consists of encoding each graph as a fixed-size
vector, which is then used for feeding the neural network. Unfortunately, the a
priori definition of the encoding process has several drawbacks. For example, in
chemistry, the encoding is performed through the definition of topological indices (which are properly designed by means of a very expensive trial-and-error approach), and the resulting vectorial representations of graphs may be very difficult to classify. Recursive neural networks face this problem by simultaneously learning both how to represent and how to classify structured patterns. This is a
very desirable property which has already been shown to be useful in practice.
It must be noted that automatic inference can also be obtained by using
a symbolic representation, such as in Inductive Logic Programming [7]. Recursive neural networks, however, have their own specific peculiarity, since they can
approximate functions from a structured domain (possibly with real valued vectors as labels) to the set of reals. To the best of our knowledge, this cannot be
performed by any symbolic system.
In many applications, however, the assumption of stationarity of the computational model, which is typically made when using neural or Bayesian networks,
does not fit with the nature of the problem at hand. This is especially true
when considering complex problems which may involve heterogeneous (i.e., both
symbolic and numerical) and structured data. In this chapter, after a brief review of the basic concepts underpinning the adaptive processing of structures
as outlined in [3], we show by two examples from practical applications that
the assumption of stationarity may not be adequate to the complexity of the
application domains. Following this argument, in this chapter we discuss some
ideas about how to define a graphical formalism for the systematic specification
and processing of a priori knowledge on the application domain.
2 Structured Domains
In this chapter, we consider domains of DOAGs (Directed Ordered Acyclic Graphs) where vertices are marked by labels, i.e. subsets of domain variables. The
meaning of an edge (v, w) is that the variables in the labels attached to v and w
are related. A graph is uniformly labeled if: i) an equivalence relation is defined
on the domain variables; ii) every label contains one and only one variable from
each equivalence class. For example, in a pattern recognition domain, when classifying images containing geometrical patterns, Perimeter, Area, and Texture are
all possible equivalence classes since all geometrical patterns can be represented
using these features.
Let us assume uniformly labeled graphs (with N equivalence classes). Let Y_1, Y_2, ..., Y_N be representative variables for the equivalence classes. Y_i is said to be categorical when its realizations belong to a finite alphabet 𝒴_i, and numerical when its realizations are real numbers (𝒴_i = ℝ). Moreover, the set of label realizations

𝒴 = 𝒴_1 × 𝒴_2 × · · · × 𝒴_N

is called the label space.
The class of DOAGs in which we are interested is formed by directed
acyclic graphs such that, for each vertex v, a total order ≺ is defined on the
edges leaving from v.
E.g.:
(v, w) ≺ (v, u) ≺ (v, t)
A trivial example of structure is a sequence (see Figure 1), which can be
defined as either: i) an external vertex, or ii) an ordered pair (t, h) where the
head h is a vertex and the tail t is a sequence.
Fig. 1. left: Example of sequence; right: Causality for sequential transductions.
2.1 Sequential Transductions
Here we review very briefly the concept of sequential transduction. Let U and Y
be the input and output label spaces. We denote by U ∗ the set of all sequences
with labels¹ in U.
A general transduction T is a subset of U* × Y*. We shall limit our discussion to functions T : U* → Y*. A transduction T(·) is causal if the output at time t does not depend on future inputs (at times t + 1, t + 2, ...; see Figure 1).
A particular form of transduction can be obtained by having the outputs to
depend on hidden state variables (label space X ), i.e., for each time t:
X_t = f(X_{t-1}, U_t, t) and Y_t = g(X_t, U_t, t)
where f : X × U → X is the state transition function and g : X × U → Y is
the output function, and the initial state is associated with the external vertex
(frontier).
A recursive state representation exists only if T (·) is causal (see Figure 2).
Moreover, T (·) is stationary if f (·) and g(·) do not depend on t.
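A causal, stationary transduction with a hidden recursive state can be sketched in a few lines; the running-sum example at the end is illustrative only:

```python
def transduce(inputs, f, g, x0):
    """Causal, stationary sequential transduction:
    X_t = f(X_{t-1}, U_t), Y_t = g(X_t, U_t), with frontier state x0."""
    x, outputs = x0, []
    for u in inputs:        # causal: Y_t never looks at future inputs
        x = f(x, u)
        outputs.append(g(x, u))
    return outputs

# A running sum is a simple causal, stationary transduction.
ys = transduce([1, 2, 3, 4], f=lambda x, u: x + u, g=lambda x, u: x, x0=0)
# ys == [1, 3, 6, 10]
```

Stationarity shows up as the same f and g being reused at every time step; a non-stationary transduction would pass t to them as well.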
2.2 Transductions for Data Structures
Here we generalize causal and stationary transductions to structured domains.
Let U # and Y # be two DOAG spaces. A transduction T is a subset of U # × Y # .
As in the sequential case, we impose the following restrictions: i) T is a function T : U# → Y#; ii) T is IO-isomorph, i.e., skel(T(U)) = skel(U), where skel() is an operator returning the skeleton of a graph (the graph with the same topology but all vertex labels removed); iii) # is an ordered class of DAGs.
Given an input graph U , a recursive state representation for a structural
transduction can be defined so that, for each vertex v:
X_v = f(X_{ch[v]}, U_v) and Y_v = g(X_v, U_v)    (1)

¹ Cf. the classical definition in language theory: U is a set of terminal symbols and U* is the free monoid over U.
Fig. 2. (a): Sequential transduction with hidden recursive state representation; (b): Structural transduction with hidden recursive state representation.
where ch[v] are the (ordered) children of v, f : X^m × U → X is the state transition
function and g : X × U → Y is the output function.
Also for structural transductions a recursive state representation exists only
if T (·) is causal, and T (·) is stationary if f (·) and g(·) do not depend on v (see
Figure 2).
Note that the update of equations (1) may follow any reverse topological sort
of the input graph (see Figure 3). Specifically, some vertices can be updated in
parallel according to the preorder defined by the topology of the input graph.
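This update scheme can be sketched as memoized recursion, which visits each vertex once and implicitly follows a reverse topological sort; the padding convention, the toy f, and the toy DOAG are illustrative assumptions:

```python
def evaluate_states(children, labels, f, frontier, max_degree=2):
    """Apply X_v = f(X_ch[v], U_v) over a DOAG.
    children[v]: ordered list of children of v (empty for external vertices);
    missing children are padded with the frontier state."""
    X = {}
    def state(v):
        if v not in X:
            kids = [state(c) for c in children[v]]
            kids += [frontier] * (max_degree - len(kids))
            X[v] = f(kids, labels[v])
        return X[v]
    for v in children:
        state(v)
    return X

# Toy DOAG in which vertices a and b share child c (c is computed only once).
children = {"a": ["b", "c"], "b": ["c"], "c": []}
labels = {"a": 1, "b": 1, "c": 1}
X = evaluate_states(children, labels,
                    f=lambda kids, u: sum(kids) + u,  # toy f: label plus child states
                    frontier=0)
# X == {"c": 1, "b": 2, "a": 4}
```

Because the memo is keyed by vertex, shared substructure is evaluated only once, which is one source of the efficiency mentioned in the introduction.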
We are particularly interested in supersource transductions, which can be
characterized by the following points: i) the input graph U is a DOAG with
supersource s; ii) the output is a single label Y , which is either a categorical
variable, in case of classification, or a (multivariate) numerical variable, in case
of regression; iii) the hidden state X_v is updated by f(X_{ch[v]}, U_v), while the output label is “emitted” at s: Y = g(X_s).
In order to graphically describe structured transductions, it is convenient
to define generalized shift operators. In the case of sequences, the standard shift operator is defined as q^{-1} Y_t = Y_{t-1} (unitary time delay). This definition can be extended to DOAGs as follows: q_k^{-1} Y_v is the label attached to the k-th child
of vertex v. Notice that the composition of these operators is not commutative (see Figure 3).

Fig. 3. left: Example of reverse topological sort for a DOAG: any total order on the nodes consistent with the numbering of the nodes is admissible (nodes with the same number can be permuted); right: Example showing the non-commutativity of the generalized shift operators.
Having defined generalized shift operators, we can now define a class of graphical models called recursive networks, used to represent structural transductions. We assume that: i) T () admits a recursive state space representation; ii)
input, state, and output DOAGs are uniformly labeled.
The recursive network of T() is a directed graph where: i) vertices are marked with representative variables; ii) edges are marked with generalized shift operators; iii) an edge (A_v, B_v) with label q_k^{-p} means that, for each v in the vertex set of the input structure, B_v is a function of q_k^{-p} A_v.
Given a graph U ∈ U # and a recursive transduction T , it is possible to
compute the output of the transduction T by resorting to encoding networks: the
encoding network associated with U and T is formed by unrolling the recursive
network of T through the input graph U . Note that time-unfolding is a special
case of the above concept for the class of sequences (see Figure 4).
The encoding network is a graphical representation of the functional dependencies between the variables under consideration, given a specific input DOAG.
This encoding network can be implemented either by a neural network or a
Bayesian network [3]. For example, let us consider neural networks. In this case,
both the state transition function f (·) and the output function g(·) are realized
by feedforward neural networks, leading to the parametric representation
X_v = f(X_{ch[v]}, U_v, θ_f) and Y_v = g(X_v, U_v, θ_g),
where θ_f and θ_g are connection weights. In the special case of linear chains,
the above equations exactly correspond to the general state space equations
of recurrent neural networks. Moreover, the above representation is stationary, since both f(·) and g(·) do not depend on the node v.

Fig. 4. Example of generation of encoding networks for sequential and structural transductions.
The encoding network, in this case, results in a feedforward neural network obtained by replicating and connecting the feedforward networks implementing f(·) and g(·) according to the topology of the input DOAG. Standard learning algorithms for neural networks can be used to train the neural encoding network, taking care of the fact that θ_f and θ_g are replicated across the network.
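To make the weight sharing concrete, here is a minimal sketch (our own illustration, not part of the chapter's framework) in which toy linear functions stand in for the feedforward subnetworks f and g, and the same parameters are reused at every node of a binary tree:

```python
# Sketch: unrolling a stationary recursive transduction over a binary tree.
# The same parameters (theta_f, theta_g) are reused at every node, which is
# exactly the weight replication the encoding network imposes. The linear
# f and g here are toy scalar stand-ins for the feedforward subnetworks.

class Node:
    def __init__(self, label, left=None, right=None):
        self.label = label          # U_v, a scalar label for simplicity
        self.left = left            # ordered children (binary tree)
        self.right = right

def encode(node, theta_f):
    """Compute the state X_v = f(X_ch[v], U_v, theta_f) bottom-up."""
    if node is None:
        return 0.0                  # frontier (nil) state
    x_left = encode(node.left, theta_f)
    x_right = encode(node.right, theta_f)
    w_l, w_r, w_u = theta_f
    return w_l * x_left + w_r * x_right + w_u * node.label

def output(node, theta_f, theta_g):
    """Y_v = g(X_v, U_v, theta_g) at the supersource."""
    w_x, w_u = theta_g
    return w_x * encode(node, theta_f) + w_u * node.label

tree = Node(1.0, Node(2.0), Node(3.0, Node(4.0)))
y = output(tree, theta_f=(0.5, 0.5, 1.0), theta_g=(1.0, 0.0))   # -> 4.5
```

Training the replicated parameters then amounts to backpropagation through this unrolled structure, accumulating gradients over all copies of θ_f and θ_g.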
3 Need for Non-stationary Transductions
The basic ingredients of the framework briefly described in the previous section
can be summarized by the following pseudo-equation:
Encoding Network(i) = Recursive Network + Input Graph(i)
This framework assumes stationarity as long as the Recursive Network is stationary. In the following, we argue that in several application cases it is useful
to assume non-stationarity. Non-stationary solutions are also sometimes needed
because of the recursive nature of the model.
Here we briefly discuss two practical cases where non-stationarity clearly improves the efficiency and performance of the model.
P. Frasconi, M. Gori, and A. Sperduti

3.1 A Chemical Application
Chemical compounds are usually represented as undirected graphs. Each node
of the graph is an atom or a group of atoms, while arcs represent bonds between
atoms (see Figure 5).
Fig. 5. A typical chemical compound, naturally represented by an undirected graph.
One fundamental problem in chemistry is the prediction of the biological
activity of chemical compounds. Quantitative Structure-Activity Relationship
(QSAR) is an attempt to address this problem by relying on compounds’ structures.
The biological activity of a drug is fully determined by the micromechanism
of interaction of the active molecules with the bioreceptor. Unfortunately, discovering this micromechanism is very hard and expensive. Hence, under the assumption that there is a direct correlation between activity and structure, the QSAR approach compares the structures of all known active compounds with those of inactive compounds, focusing on their similarities and differences. The aim
is to discover which substructure or which set of substructures characterize the
biomechanism of activity, so as to generalize this knowledge to new compounds.
The main requirement for the use of recursive networks is finding a representation of molecular structures in terms of DOAGs. The candidate representation should retain detailed information about the structure of the compound, atom types, bond multiplicity, and chemical functionalities, and should bear a good similarity to the representations usually adopted in chemistry. The main
representational problems are: how to represent cycles, how to give a direction
to edges, how to define a total order over the edges.
Let us consider a specific problem: the prediction of the non-specific activity
(affinity) towards the Benzodiazepine/GABA_A receptor [1]. An appropriate description of the molecular structures of benzodiazepines can be based on a labeled tree representation. Concerning the first problem, since cycles mainly
constitute the common shared template of benzodiazepine compounds, it is reasonable to represent a cycle (or a set of connected cycles) as a single node whose attached label carries information about its chemical nature. The second problem can be solved by defining a set of rules based on the
I.U.P.A.C. nomenclature system². Finally, the total order over the edges follows
a set of rules mainly based on the size of the sub-compounds. An example of
representation for a Benzodiazepine is shown in Figure 6.
[Figure: the benzodiazepine template (rings A, B, C) with substituent sites R1, R2', R3, R5, R6, R6', R7, R8, R9; a specific compound with R1=H, R2'=F, R3=H, R5=PH, R6=H, R6'=H, R7=COCH3, R8=H, R9=H; and the corresponding labeled tree bdz(h,f,h,ph,h,h,c2(o,c3(h,h,h)),h,h).]
Fig. 6. Example of representation for a benzodiazepine.
This representation of Benzodiazepines implies that, once a given recursive
network has been defined, the sets of parameters corresponding to the pointers are used in different roles: at the level of the root, they are used to implement
the target regression, while at the remaining levels they are used to perform the
encoding of the substituents (which do not have an associated target value). Since there is no strong evidence that these two functionalities should be implemented by the same set of parameters, it would clearly be better to have a different set of parameters for each functionality. Moreover, in this particular case, since the substituents exploit only 3 out of the 9 pointers, only these 3 pointers are “overloaded”, thus introducing a clearly arbitrary bias into the model.
3.2 Logo Recognition
Pattern recognition is another source of applications in which one may be interested in adaptive processing of data structures. This was recognized early with
the introduction of syntactic and structural pattern recognition [4,8,5,12], which are based on the premise that the structure of an entity is very important for
both classification and description.
² The root of a tree representing a benzodiazepine is determined by the common template.
Figure 7 shows a logo with a corresponding structural representation based on
a tree, whose nodes are components properly described in terms of geometrical
features. This representation is invariant with respect to roto-translations and
naturally incorporates both symbolic and numerical information.
[Figure: a logo containing the word “SUM”, represented as a tree whose nodes carry both symbolic and numerical features: a=(square,0.843), b=(triangle,0.018), c=(triangle,0.018), d=(triangle,0.018), e=(triangle,0.018), f=(circle,1.086), g=(letter(S),0.009), h=(letter(U),0.007), i=(letter(M),0.011).]
Fig. 7. A logo with the corresponding representation based on both symbolic and subsymbolic information.
Of course, the extraction of robust representations from patterns is not a
minor problem. A significant amount of noise is likely to significantly affect representations that are strongly based on symbols. Hence, depending on the problem at hand, the derived structured representation should emphasize either the symbolic or the sub-symbolic information. For example, the logo shown in Figure 7 could be so corrupted by noise as to make it unreasonable to recognize the word “SUM”. In that case, one should just treat the whole word as sub-symbolic information collected in a single node.
In general, however, the symbolic information could be exploited as a selector
for different sets of parameters: tree nodes with the same symbolic part can be
processed by using the same set of parameters, while tree nodes with different
symbolic part can be processed by a different set of parameters. This would reduce the variance in the input, allowing the learning algorithm to focus only on the relevant aspects of the problem. Some work in this direction has
already been presented in [2].
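A minimal sketch of this symbol-as-selector idea (all symbols, parameter values, and the scalar state are invented for illustration):

```python
# Sketch: using the symbolic part of a node label to select the parameter
# set, so that nodes tagged e.g. 'triangle' share one set of weights while
# nodes tagged 'circle' use another. All names and values are illustrative.

params = {
    "triangle": (0.5, 1.0),   # (w_child, w_feature) shared by triangle nodes
    "circle":   (0.2, 2.0),   # a separate parameter set for circle nodes
}

def encode(node):
    """node = (symbol, feature, children); the state is a scalar for brevity."""
    symbol, feature, children = node
    w_child, w_feat = params[symbol]              # the symbolic selector
    return w_child * sum(encode(c) for c in children) + w_feat * feature

leaf = ("circle", 1.086, [])
root = ("triangle", 0.018, [leaf])
state = encode(root)    # 0.5 * (2.0 * 1.086) + 1.0 * 0.018 = 1.104
```

Nodes with the same symbol share parameters, so the learner sees lower input variance within each parameter set.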
4 Representing Non-stationary Transductions
In the previous section, we argued for the advantages of using non-stationary transductions. Non-stationary transductions can be defined within
the proposed formalism by implementing f (·) and g(·) through non-stationary
models, e.g., non-stationary neural networks. Very often, however, a priori knowledge about this non-stationarity can conveniently be represented explicitly at the level of the graphical model, i.e., at the level of the recursive network, independently of the implementation mechanism. The basic idea is to define a sort of graphical language for explicitly describing the desired non-stationarity. The advantage of this approach is that it avoids introducing distinct parameters for each vertex of the input structure: the graphical formalism allows the user to specify exactly under which conditions a new set of parameters must be used.
For example, let us consider the case of a transduction for binary trees where
both f (·) and g(·) depend on: i) the distance of the node in the tree from the
frontier; ii) the value associated with a numerical variable (U) within the labels. A
specific instance of this situation could be described by the pseudo-code reported
in Figure 8.
The first statement declares a variable of type sequence, while the second
statement declares a variable of type vertex. The variable of type sequence
is then used to contain the sequence of vertices returned by the statement sort_vertices_by(dist_from(frontier), <). This statement returns one of the admissible topological orders (there may be many) in which the vertices of the input DOAG are sorted by increasing (<) distance from the frontier. The next foreach(v, Seq) { . . . } command iterates the value of v through each element of the sequence Seq, executing the commands contained in the body { . . . } after each assignment.
Within the foreach, there is an if-then-else statement which tests how far
the current vertex is from the frontier. If the distance of the current vertex from
the frontier is less than 3, then an additional test is performed on the associated
numerical variable U : if the value of U is within the range [0.3, 0.55], then the
recursive network corresponding to the second occurring then branch is used,
otherwise the recursive network corresponding to the first occurring else branch
is used. If the distance of the current vertex from the frontier is not less than
3, the recursive network corresponding to the second occurring else branch is
used.
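Structurally, the vertex-wise selection among the three recursive networks can be sketched as follows; the networks themselves are replaced by labels, and the example vertices are chosen to match the behaviour described for Figure 9:

```python
# Sketch of the non-stationary "program" of Figure 8: vertices are visited
# in increasing distance from the frontier, and one of three parameter sets
# (standing in for the three recursive networks) is selected per vertex.
# The concrete networks are elided; only the selection structure is shown.

def select_network(dist_from_frontier, u):
    if dist_from_frontier < 3:
        if 0.3 <= u <= 0.55:
            return "net_1"     # inner then-branch network
        return "net_2"         # inner else-branch network
    return "net_3"             # outer else-branch network

def assign_networks(vertices):
    """vertices: list of (id, dist_from_frontier, U), assumed already
    sorted by increasing distance from the frontier."""
    return {vid: select_network(d, u) for vid, d, u in vertices}

nets = assign_networks([(1, 0, 0.37), (3, 0, 0.11), (8, 3, 0.23)])
# vertex 1: dist < 3 and U in [0.3, 0.55]  -> net_1
# vertex 3: dist < 3, U outside the range  -> net_2
# vertex 8: dist >= 3                      -> net_3
```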
Please note that the three recursive networks which appear in the “program” should be understood as distinct models, i.e., each network has a different set of
parameters associated with its state transition and output functions.
All three recursive networks involved in the above description exploit the same set of variables (i.e., U, X, Y), which ensures the
consistency of the description. Figure 9 reports an example of application of the “program”: the right side shows the encoding network obtained by applying the program to the input tree on the left side. Note that vertices 1, 2, 4, 5, and 6 satisfy both the first and
second if-then-else conditions, while vertices 3 and 7 only satisfy the first
if-then-else condition. Finally, vertex 8 is the only node for which the last
recursive network is applied.
Sequence_of_vertices Seq;
Vertex v;
Seq <- sort_vertices_by( dist_from(frontier), < );
foreach(v, Seq) {
    if (dist_from(frontier) < 3) then {
        if (U in [0.3, 0.55]) then
            [recursive network 1];
        else
            [recursive network 2];
    }
    else
        [recursive network 3];
}
[In the figure, networks 1 and 3 operate on Y, X, U with shift operators q_1^{-1} and q_2^{-1}; network 2 operates on X and U with q_1^{-1}.]
Fig. 8. Example of a non-stationary recursive network.
It must be observed that the resulting encoding network is not connected, since the subnetwork corresponding to vertices 3 and 6 is not connected to the remaining part of the network. As a consequence, neither vertex 3 nor
vertex 6 will contribute to the definition of the state and output of vertices 7
and 8.
The processing flow, in this case, is sequential, since the vertices of the input
graph are sorted according to a given topological order. In general, however,
the partial order defined by the topology of the input DOAG admits a degree of
parallelism. This form of parallelism can be expressed by a data flow model,
where the nodes of the input DOAG are visited in parallel in a bottom-up
fashion, from the frontier to the supersource. For example, the nodes of the
input DOAG of Figure 9 can be processed according to the following sequence
[Figure: left, the input tree with vertices numbered 1-8 carrying numerical labels (e.g. 0.37, 0.4, 0.11, 0.43, 0.5, 0.23); right, the corresponding encoding network with frontier states.]
Fig. 9. Example of generation of the encoding network by using the input on the left side of the figure and the non-stationary recursive network shown in Figure 8.
of sets, S1 ={1,2,3,4}, S2 ={5,6}, S3 ={7}, S4 ={8}, where nodes in the same set
are processed in parallel but set Si must be processed strictly after set Si−1 .
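The level sets can be computed directly from the child lists; the adjacency below is invented so as to reproduce the sets quoted in the text:

```python
# Sketch: computing the level sets S_1, S_2, ... that drive data-flow
# parallel processing of a DOAG. A node's level is one more than the
# maximum level of its children; frontier nodes (no children) form S_1.
# The child lists are invented to reproduce the sets stated in the text.

children = {1: [], 2: [], 3: [], 4: [],
            5: [1, 2], 6: [3], 7: [4, 5], 8: [7]}

def level(v):
    """Distance-from-frontier level: frontier nodes are level 1."""
    if not children[v]:
        return 1
    return 1 + max(level(c) for c in children[v])

def level_sets():
    sets = {}
    for v in children:
        sets.setdefault(level(v), set()).add(v)
    return [sets[k] for k in sorted(sets)]

layers = level_sets()
# layers == [{1, 2, 3, 4}, {5, 6}, {7}, {8}]: sets are processed strictly
# in order, while nodes inside each set can be processed in parallel.
```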
Such parallelism could be expressed by adopting an event-based environment, where an event is understood as a condition satisfied by a given vertex of the input DOAG. This environment could be introduced by the statement deal_with { . . . }, which in general contains a set of (non-contradictory) event(condition): {. . .} statements and a default: {. . .} statement applied when no condition of the event statements is satisfied.
It must be observed that a condition, besides being a hard logical condition on the topology of the DOAG (e.g., whether a node is at a given level of the DOAG or has more than 2 children), can also be defined as a fuzzy condition, implemented, for example, by a neural network. Thus the test on the condition can activate either a symbolic procedure or a pattern recognition subsystem.
One interesting aspect of this approach is that, since the “program”, together
with the input DOAG, is used to define the encoding network, learning is driven
by the “program” as well. In fact, the correspondence between parameters and
connections within the encoding network is established by the “program”. In
general, it may be feasible to define statements concerning the learning modality
for each recursive network involved in the “program”, even considering different
(but “compatible”) learning algorithms for each recursive network, e.g., different gradient descent algorithms.
Figure 10 shows an example of a simple if-then-else graphical rule which solves our chemical problem, as stated in Section 3.1. If the current
vertex is the root of the input DOAG, then all the subgraphs (substituents) are
if (root(vertex) == TRUE) then
    [recursive network over Y, X, U with shift operators q_1^{-1}, q_2^{-1}, ..., q_9^{-1}];
else
    [recursive network over X, U with shift operators q_1^{-1}, q_2^{-1}, q_3^{-1}];
Fig. 10. Example of non-stationary rule for the chemical application.
considered and an output (predicted biological activity) is generated; otherwise only three sets of parameters are used for the encoding of the subgraphs (chemical fragments) and no output (prediction) needs to be generated.
5 Conclusion
We have briefly reviewed the basic concepts underpinning the adaptive processing of data structures as outlined in [3]. Concepts such as causality and stationarity have been adapted to the context of learning data structures. In particular, we have argued that in practical applications stationarity is not always desirable. To explain the role of non-stationarity, we have discussed two practical examples, i.e., predicting the biological activity of chemical compounds and the automatic classification of logos. Finally, we have very briefly introduced our idea of how a priori knowledge about the domain can be expressed in graphical form, allowing the formal specification of possibly very complex (i.e., non-stationary) requirements for the structured domain to be treated by a neural network or a Bayesian approach. The advantage of the proposed approach is the
systematicity in the specification of both the topology and the learning propagation of the adopted computational model (i.e., neural, probabilistic, or even hybrid, combining both). Moreover, the proposed approach would allow the easy definition and reuse of basic modules, as well as the possibility of modifying the computational model with a few changes in the formal specification.
References
1. A.M. Bianucci, A. Micheli, A. Sperduti, and A. Starita. Quantitative structure-activity relationships of benzodiazepines by recursive cascade correlation. In IEEE
International Joint Conference on Neural Networks, pages 117–122, 1998.
2. M. Diligenti, M. Gori, M. Maggini, and E. Martinelli. Adaptive graphical pattern
recognition: the joint role of structure and learning. In Proceedings of the International Conference on Advances in Pattern Recognition, pages 425–432. Springer,
1998.
3. P. Frasconi, M. Gori, and A. Sperduti. A framework for adaptive data structures
processing. IEEE Transactions on Neural Networks, 9(5):768–786, 1998.
4. K. S. Fu. Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ, 1982.
5. R. C. Gonzalez and M. G. Thomason. Syntactic Pattern Recognition. Addison-Wesley, Reading, Massachusetts, 1978.
6. G. E. Hinton. Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46:47–75, 1990.
7. S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods.
Journal of Logic Programming, 19,20:629–679, 1994.
8. T. Pavlidis. Structural Pattern Recognition. Springer-Verlag, Heidelberg, 1977.
9. Tony A. Plate. Holographic reduced representations. IEEE Transactions on Neural
Networks, 6(3):623–641, May 1995.
10. J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46(1-2):77–106, 1990.
11. R. Rosenfeld and D. S. Touretzky. Four capacity models for coarse-coded symbol
memories. Technical Report CMU-CS-87-182, Carnegie Mellon, 1987.
12. R. J. Schalkoff. Pattern Recognition: Statistical, Structural and Neural Approaches.
John Wiley & Sons, 1992.
13. P. Smolensky. Tensor product variable binding and the representation of symbolic
structures in connectionist systems. Artificial Intelligence, 46:159–216, 1990.
14. A. Sperduti and A. Starita. Supervised neural networks for the classification of
structures. IEEE Transactions on Neural Networks, 8(3):714–735, 1997.
15. D. S. Touretzky. Boltzcons: Dynamic symbol structures in a connectionist network.
Artificial Intelligence, 46:5–46, 1990.
Lessons from Past, Current Issues,
and Future Research Directions
in Extracting the Knowledge Embedded
in Artificial Neural Networks
Alan B. Tickle, Frederic Maire, Guido Bologna,
Robert Andrews, and Joachim Diederich
Machine Learning Research Centre
Queensland University of Technology
Box 2434 GPO Brisbane
Queensland 4001, Australia
{ab.tickle, f.maire, g.bologna, r.andrews, j.diederich}@qut.edu.au
Abstract. Active research into processes and techniques for extracting
the knowledge embedded within trained artificial neural networks has
continued unabated for almost ten years. Given the considerable effort
invested to date, what progress has been made? What lessons have been
learned? What direction should the field take from here? This paper
seeks to answer these questions. The focus is primarily on techniques
for extracting rule-based explanations from feed-forward ANNs since, to
date, the preponderance of the effort has been expended in this arena.
However, the paper also briefly reviews the broadening overall agenda for ANN knowledge-elicitation. Finally, the paper identifies some of the key
research questions including the search for criteria for deciding in which
problem domains these techniques are likely to out-perform techniques
such as Inductive Decision Trees.
1 Introduction
Notwithstanding the proven abilities of Artificial Neural Networks (ANNs) in
problem domains such as pattern recognition and function approximation [31],
[46], [47], to an end-user the modus operandi of ANNs remains something of
a numerical enigma. In particular ANNs inherently lack even a rudimentary
capability to explain either the process by which they arrived at a given result
or, in general, the totality of “knowledge” actually embedded therein. After an
investment of research effort spanning an interval which now exceeds ten years,
the point has been reached where this deficiency is all but redressed.
An appreciation of the dimensions of this research effort can be gauged both
from recent surveys [2,3,48,49,50,51,52,53] and by simply counting the number of papers appearing on this topic at the major international conferences.
Apart from revealing a rich and diverse set of approaches to solving the ANNknowledge-extraction problem, such an analysis also highlights how the research
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 226–239, 2000.
© Springer-Verlag Berlin Heidelberg 2000
effort has tended to coalesce around particular facets of the problem, e.g. the variations in ANN types, and in particular:
1. Techniques for extracting and representing the “knowledge” of trained feedforward ANNs as sets of symbolic rules;
2. Techniques for extracting Finite State Machine (FSM) representations from
Recurrent Neural Networks (RNNs); and
3. Techniques for extracting fuzzy rules.
The analysis also highlights another important area of research. This is in developing techniques whereby an existing rule-base or similar a priori knowledge
is used to initialise an ANN, new data is then used to train the ANN, and an
updated set of (refined) rules or knowledge is extracted from the trained ANN
[2]. This is the problem domain of rule-refinement [54].
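As a sketch of the rule-to-network direction of rule refinement, in the spirit of KBANN-style translation (the weight scheme below is one common choice, not prescribed by this paper), a conjunctive rule becomes a unit that fires only when all of its antecedents hold:

```python
# Sketch: initialising a network unit from a symbolic rule, in the spirit
# of KBANN-style rule-to-network translation. The constant omega and the
# bias formula are one common choice; both are illustrative assumptions.

def rule_to_unit(antecedents, omega=4.0):
    """IF all antecedents THEN conclusion  ->  (weights, bias)."""
    weights = {a: omega for a in antecedents}
    bias = -(len(antecedents) - 0.5) * omega   # unit fires only if all hold
    return weights, bias

def fire(weights, bias, assignment):
    """Threshold activation of the unit on a truth assignment (0/1)."""
    net = sum(w * assignment.get(a, 0) for a, w in weights.items()) + bias
    return int(net > 0)

w, b = rule_to_unit(["a", "b"])
# fire(w, b, {"a": 1, "b": 1}) -> 1 ; fire(w, b, {"a": 1, "b": 0}) -> 0
```

After initialisation, training on new data perturbs these weights, and a rule-extraction pass then yields the refined rule set.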
The range and diversity of techniques being developed under the umbrella
of ANN-knowledge-extraction underscores the fact that eliciting the knowledge
embedded within trained ANNs is both an active and evolving discipline. Hence
the purpose of the ensuing discussion is:
1. To review the ADT taxonomy [2,53] for classifying techniques for extracting
the knowledge embedded within ANNs;
2. To identify the important lessons which have been learned so far in the area
of ANN rule-extraction;
3. To utilise the ADT taxonomy as a basis for highlighting the increasingly rich
and diverse range of mechanisms and techniques which have been developed
to deal with the more general problem of knowledge-extraction from ANNs;
and
4. To identify and discuss some of the key research questions which will have
an important bearing on the direction that future development of the field
will take.
2 A Review of the ADT Taxonomy for Classifying ANN-Knowledge-Extraction Techniques
In 1995, Andrews, Diederich, and Tickle proposed a taxonomy (the so-called
ADT taxonomy) for categorising ANN rule-extraction techniques [2]. One of the
primary goals in developing this taxonomy was to provide a uniform basis for
the systematic comparison of the different approaches which had appeared up
to that time. In its original form the ADT taxonomy comprised a total of five
primary classification criteria:
1. The expressive power (or, alternatively, the rule format) of the extracted
rules;
2. The quality of the extracted rules;
3. The translucency of the view taken within the rule extraction technique of
the underlying Artificial Neural Network units;
4. The algorithmic complexity of the rule extraction/rule refinement technique;
and
5. The extent to which the underlying ANN incorporates specialised training
regimes (i.e. a measure of the portability of the rule extraction technique
across various ANN architectures).
The authors used three basic groupings of rule formats: (i) conventional (Boolean, propositional) symbolic rules; (ii) rules based on fuzzy sets and logic; and
(iii) rules expressed in first-order-logic form i.e. rules with quantifiers and variables. They also used a set of four measurements of rule quality: (a) rule accuracy
(i.e. the extent to which the rule set is able to classify a set of previously unseen
examples from the problem domain correctly); (b) rule fidelity (i.e. the extent
to which the rule set mimics the behaviour of the ANN from which it was extracted); (c) rule consistency (i.e. the extent to which, under differing training
sessions, the ANN generates rule sets which produce the same classifications of
unseen examples); and (d) rule comprehensibility (e.g. measuring the size of the
rule set in terms of the number of rules and the number of antecedents per rule).
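Three of these four measures can be written down as simple functions over a labeled test set (the function names and the toy threshold "network" are our own illustration, not an API from the paper; consistency is omitted since it requires comparing rule sets across training sessions):

```python
# Sketch of the ADT rule-quality measures. `rules` and `ann` are any
# callables mapping an example to a class label; everything here is a
# generic illustration with toy stand-ins.

def accuracy(rules, examples, targets):
    """Fraction of unseen examples the rule set classifies correctly."""
    return sum(rules(x) == t for x, t in zip(examples, targets)) / len(examples)

def fidelity(rules, ann, examples):
    """Fraction of examples on which the rules mimic the ANN's output."""
    return sum(rules(x) == ann(x) for x in examples) / len(examples)

def comprehensibility(rule_set):
    """(number of rules, total antecedents) -- smaller is more readable."""
    return len(rule_set), sum(len(antecedents) for antecedents, _ in rule_set)

ann = lambda x: x >= 0.5            # stand-in "trained network"
rules = lambda x: x > 0.5           # extracted rule: IF x > 0.5 THEN 1
xs, ts = [0.2, 0.5, 0.9], [0, 1, 1]
acc = accuracy(rules, xs, ts)       # 2/3: the rule misses the x == 0.5 case
fid = fidelity(rules, ann, xs)      # 2/3: rules and network disagree at 0.5
```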
For the third criterion (i.e. the translucency criterion) the authors sought to
categorise a rule extraction technique based on the granularity of the underlying
ANN which was either explicitly or implicitly assumed within the rule extraction
technique. In a notional spectrum of such perceived degrees of granularity, the
authors used two basic delimiting points to define the extremities of the range
of options i.e. decompositional where rule extraction is done at the level of
individual hidden and output units within the ANN, and pedagogical where the
view of the ANN is at its maximum level of granularity i.e. as a “black box”.
The labelling of this latter extremity as “pedagogical” was based on the work
of Craven and Shavlik who cast their rule extraction technique as a learning
process [9]. However Neumann [32] has commented that this label is perhaps
something of a misnomer because it is arguable that the underlying process is
more about searching than it is about learning. The label “eclectic” was assigned
to the mid-point of this spectrum to accommodate essentially hybrid techniques
which analyse the ANN at the individual unit level but which extract rules at
the global level.
Whilst the ADT taxonomy was conceived in the context of categorising ANN
rule-extraction techniques, the five general notions it embodies viz (a) rule format (b) quality (c) translucency (d) algorithmic complexity and (e) portability
have been applied to other forms of representing the knowledge extracted from
ANNs as well as to other ANN architectures and training regimes. Subsequent
work has shown that not only is this taxonomy applicable to a cross-section of
current techniques for extracting rules from trained feed-forward ANNs but also
that the taxonomy can be adapted and extended to embrace a broader range
of ANN types (e.g. recurrent neural networks) and explanation structures [52].
In particular a distinguishing characteristic of techniques for extracting Deterministic Finite State Automata (DFAs) from recurrent neural networks is that
the underlying analysis is performed at the level of ensembles of neurons rather
than individual neurons. To accommodate such techniques which operate at an
intermediate level of granularity, it was proposed to use the label “compositional” to denote techniques which extract rules from ensembles of neurons rather
than individual neurons.
Related work on classification schemas for ANN rule-extraction techniques
has tended to focus primarily on alternative measures for comparing the quality
of the extracted rule set. For example Krishnan et al. [24] have proposed a set
of three criteria for assessing rule quality. Under their terminology, rules should
be (1) valid i.e. “the rules must hold regardless of the values of the unmentioned
variables in the rules”, (2) maximally general i.e. “if any of the antecedents are
removed, the rule should no longer be valid”, and (3) complete i.e. “all possible
valid and maximally general rules must be extracted”. Similarly Healy [20] also
proposes three measures of rule quality incorporating the notions of validity,
consistency, and completeness. Under his schema, valid rules are those which
are correct for the data i.e. the rule If A then B is valid if A ⇒ B is the correct
inference for all data cases. Similarly, consistency means that the rules do not
allow both B and not B. Completeness means that there are no unreachable conclusions. At an overall level, the absence of a consistent measure of rule
quality has complicated the task of directly comparing the performance of ANN
rule-extraction techniques.
3 Seven “Take Home” Messages from ANN Rule Extraction
Given the widespread use of trained feed-forward ANNs, it is not surprising that this type of ANN is the one which has attracted the preponderance
of attention for the purposes of providing the requisite explanation capability.
Moreover, because of the widespread acceptance of symbolic rules as a vehicle for
knowledge representation, the dominant form of such explanations is also symbolic rules. Hence rule-extraction from trained feed-forward ANNs now encompasses a rich and diverse range of ideas, mechanisms, and procedures. Consequently
it is useful at this juncture to make some general observations and comments
on what has been achieved to date. In particular, it is possible to glean at least
seven main lessons of potential importance both to those who are in the process
of developing knowledge-extraction techniques as well as to a prospective user
of such techniques. These are as follows:
1. In the first instance it is useful to observe that the process of extracting rules
from an ANN which has been trained to classify a data set from a given
problem domain is different from that of the classic induction task where
the corresponding set of symbolic classification-rules is gleaned directly from
the data. This difference arises because it is possible to utilise the ANN to
generate new data about the problem domain, e.g. by using the ANN as a “black box” - a motif adopted in the so-called “pedagogical” class of rule-extraction
techniques. Moreover it is possible to formulate queries of the form “is this
region of the input space totally classified as positive?”. This is not possible
by simply using the raw data.
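The black-box querying motif can be sketched as follows; the oracle here is a toy threshold unit standing in for a trained ANN, and the downstream symbolic learner is left abstract:

```python
# Sketch of the "pedagogical" motif: treat the trained network as an
# oracle, generate fresh inputs, label them with the network, and hand
# the synthetic data set to any symbolic learner. The "network" below is
# a toy threshold unit; the learner step is deliberately omitted.

import random

def query_oracle(net, n_samples, dim, seed=0):
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        x = [rng.random() for _ in range(dim)]
        data.append((x, net(x)))    # the label comes from the ANN, not raw data
    return data

net = lambda x: int(sum(x) > 1.0)   # stand-in trained "black box"
synthetic = query_oracle(net, n_samples=100, dim=2)
# `synthetic` can now be fed to a decision-tree or rule inducer, which never
# needs to inspect the network's internals.
```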
2. There is also a second major difference between the process of rule-extraction
from an ANN and symbolic rule-induction: the former is
more appropriately described as a multi-criteria optimisation problem. Not
only is the objective to extract symbolic rules from the trained ANN, the
agenda is for rules which afford high levels of fidelity, accuracy, consistency,
and comprehensibility. For both the ANN itself and a symbolic-induction
technique applied directly to the data, the learning objective is simply for a
result that generalises well, given a finite training sample.
3. There now exists a substantive body of results which confirms both the
effectiveness and utility of the overall ANN rule-extraction approach as a
tool for rule-learning in a diverse range of problem domains. In particular Fu
[11] has shown that the combination of ANNs and rule-extraction techniques
outperforms established rule-learning techniques in situations in which there
is noise, inconsistent data, and uncertainty in the datasets. Moreover he has
illustrated similar advantages in situations involving multicategory learning.
Recently Fu [12,13] has also shown that a particular ANN-based technique
(CFNet) is better able to learn domain rules from data sets comprising only
a small fraction of domain instances, in comparison with the decision-tree-based rule generator C4.5 [36] (Fu defines domain rules as the set
of rules which are able to explain all the domain instances). While earlier
results from other authors [33,54] have identified problem domains in which
the extracted knowledge representation outperformed the ANN from which
they were extracted, this avenue of research does not appear to have been
followed up in any substantive way.
4. Within ANN rule-extraction, there is a strong and continuing predilection
towards so-called “decompositional” techniques [2,3,14,53] viz the process of
first extracting rules at the individual (hidden and output) unit level within
the ANN solution and then aggregating these rules to form global relationships. This is in contrast to the so-called “pedagogical” techniques which
treat the ANN as a “black box” and extract global relationships between the
inputs and the outputs of the ANN directly, without analysing the detailed
characteristics of the underlying ANN solution. While both approaches rely
heavily on heuristics to achieve tractability, recent experience [23,27,41] continues to suggest that it is easier to contrive an efficient heuristic to accomplish the key step in decompositional rule extraction (viz “weight pruning”
and the systematic elimination of less relevant variables) than the corresponding step in pedagogical rule extraction (viz controlling the manner in which
options in the solution space are generated).
5. No clear difference has yet emerged between the performance of specific-purpose rule-extraction techniques and those techniques with more general portability [3]. Moreover, whilst the utility of specific-purpose techniques is attractive to an end-user, there is an ongoing demand for techniques which
can be applied in situations where an ANN solution already exists.
Extracting the Knowledge Embedded in Artificial Neural Networks

6. Computational complexity of the rule-extraction process may be a limiting factor on what is achievable from such techniques. The complexity depends naturally on the format of the rules (extracting threshold rules is easy). Golea [19] showed that the worst-case computational complexity of extracting
minimum DNF rules from trained ANNs and the complexity of extracting
the same rules directly from the data are both NP-hard. In the same vein,
one potentially promising approach to ANN rule-extraction focuses on extracting, from single perceptrons, the best rule within a given class of rules.
However, extracting the best M-of-N rule from a single-layer network has also
been shown to be NP-hard whilst the task of deciding whether a perceptron
is symmetric with respect to two variables is NP-complete [27]. Hence the
combination of ANN learning and rule-extraction potentially involves significant additional computational cost over direct rule-learning techniques.
7. The possibility exists of significant divergence between the rules that capture the totality of knowledge embedded within the trained ANN and the
set of rules that approximate the behaviour of the network in classifying the
training set [50]. For example, Tickle [51] reported such a situation in extracting rules from a number of ANNs trained in a problem domain (the Edible
Mushroom problem domain) characterised by (a) the availability of only a
relatively small amount of data from the domain and (b) the fact that many
of the attributes appear to be irrelevant in determining the classification of
individual cases. Overall, however, the lesson from this result is that it simply behoves an end-user to validate any extracted rules, either by using further
test data from the problem domain or by having the rules scrutinised by a
domain expert.
4 The Expanding Horizon of Knowledge-Extraction from ANNs
Whilst rule extraction has attracted considerable attention, there have also been
important continuing developments in other facets of eliciting knowledge from
ANNs. For example, Saito and Nakano [38] demonstrated a technique for extracting scientific laws whereas other authors have explored the extraction of
decision trees from ANNs [10] as well as utilising decision trees [22,25], propositional rules [8,35], and rough sets [4] as a means of ANN initialisation. In
addition a number of authors e.g. [16,18,34,39,56] have extended previous work
[14,15,16,17,33] both on the extraction of Deterministic Finite State Automata (DFAs) from recurrent ANNs and on showing how a recurrent network can be constructed such that it acts as a Fuzzy Finite State Automaton (FFA), i.e. a recogniser for a fuzzy regular language.
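The DFA-extraction idea referenced above can be made concrete with a toy sketch. As an assumption for illustration only, the recurrent weights below are hand-picked so that a single second-order-style unit computes the parity of a binary string, rather than learned as in the cited work; the extraction step then discretises the hidden state and probes the resulting transitions:

```python
import math

def rnn_step(h, x):
    """Second-order-style recurrent unit: the input symbol selects the
    weight, so the hidden state flips sign on a 1 and keeps it on a 0."""
    w = -2.0 if x == 1 else 2.0
    return math.tanh(w * h)

def state(h):
    """Discretise the continuous hidden state into a DFA state."""
    return 0 if h > 0 else 1      # 0 = even parity, 1 = odd parity

def run(bits, h0=0.5):
    h = h0
    for x in bits:
        h = rnn_step(h, x)
    return state(h)

# Extract the transition table by probing a representative hidden value
# for each discrete state with each input symbol.
probe = {0: 0.5, 1: -0.5}
dfa = {(s, x): state(rnn_step(probe[s], x)) for s in (0, 1) for x in (0, 1)}
# dfa is exactly the two-state automaton recognising odd parity.
```

Real extraction algorithms must of course cluster states learned by training, but the quantise-then-probe structure is the same.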
In parallel with the development of techniques for extracting Boolean rules
from trained ANNs, corresponding techniques for extracting fuzzy rules continue
to be synthesised. These are the so-called neurofuzzy systems and their adherents argue that they afford significant benefits in dealing with the inherent
uncertainty in many classification problems [26,30].
A.B. Tickle et al.

As well as the extraction of knowledge from ANNs, further progress has also been made on the use of ANNs for rule refinement [35]. In essence, the process of using ANNs for rule refinement involves inserting an initial rule base (i.e. what might be termed prior knowledge) into an ANN, training the ANN on available
datasets, and then extracting the “refined” rules [2,54].
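The insert-train-extract cycle starts by compiling rules into weights. Below is a minimal sketch of that insertion step in the spirit of the knowledge-based networks of [2,54]; the weight magnitude W and the bias formula are illustrative assumptions, not the exact published scheme:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def insert_conjunction(n_antecedents, W=5.0):
    """Encode the rule 'IF a1 AND ... AND an THEN c' as one unit:
    weight W on every antecedent and bias -(n - 0.5) * W, so the unit
    only fires when all n antecedents are true. Training can then
    refine these initial weights against the available data."""
    weights = [W] * n_antecedents
    bias = -(n_antecedents - 0.5) * W
    return weights, bias

def fires(weights, bias, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias) > 0.5

w, b = insert_conjunction(2)          # encodes "IF a AND b THEN c"
```

After training perturbs these weights, a decompositional extractor can read the "refined" rule back off the unit.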
Another potentially important development is the synthesis of knowledge-extraction techniques applicable to ANNs used in reinforcement learning [43,44,45].
In addition to the extraction of explicit, symbolic rules from ANNs designed for
this problem domain, techniques have also been developed for the extraction of
explicit plans (open-loop policies). In particular, the intent is to enable explicit
reasoning of plans, without any a priori domain knowledge.
5 Research Questions and the Future Directions in Extracting Knowledge from ANNs

5.1 A Sample of New Techniques and Ideas in Rule Extraction
One of the important lines of development in the area of ANN rule-extraction is
the synthesis of techniques which are not specific to a particular ANN regime but
can be applied across a broad cross-section of related approaches. As indicated
previously, within this area of the overall field of ANN knowledge-elicitation, the
focus of attention continues to be on the so-called decompositional class of ANN
rule-extraction algorithms. A key criterion for the development of a successful
algorithm in this area is the identification and incorporation of a heuristic for
limiting the size of the solution space which must be searched (and therefore
ensuring tractability).
Recently Maire [27] and Krishnan [23] have suggested similar decompositional
techniques which are applicable to feed-forward networks with Boolean inputs.
In particular, Krishnan [23,24] introduced a technique (COMBO) which offers
an alternative to randomly trying combinations of the weights at the individual
unit level. The initial step in this process is to sort the weights and then generate
combinations of all possible sizes. The combinations for any particular size are
then arranged in descending order of the sum of the weights in the combination.
The authors argue that not only does this technique limit the size of the search space but that it does so in a manner which ensures that the extracted rules satisfy their
three quality criteria of being valid, maximally general, and complete. Similarly,
Maire [27] presented a method to unify M-of-N rules based on a partial order.
(The M-of-N concept essentially is a means of expressing rules in the compact
form: If M of the following N antecedents are true then ...).
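The correspondence between M-of-N rules and threshold units can be made concrete. In this hedged sketch (the names and weights are illustrative, not taken from [27] or [1]), a 2-of-3 rule and a perceptron with unit weights and threshold 2 agree on every Boolean input:

```python
from itertools import product

def m_of_n(m, antecedents):
    """An M-of-N rule fires when at least m of its antecedents are true."""
    return sum(antecedents) >= m

def threshold_unit(inputs, weights, threshold):
    """A linear threshold unit over Boolean inputs."""
    return sum(w * x for w, x in zip(weights, inputs)) >= threshold

# A 2-of-3 rule is exactly a unit with weights (1, 1, 1) and threshold 2.
agree = all(
    m_of_n(2, bits) == threshold_unit(bits, (1, 1, 1), 2)
    for bits in product((0, 1), repeat=3)
)
```

This equivalence is what makes M-of-N expressions such a natural target representation for decompositional extraction from threshold-like units.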
In a similar vein Alexander and Mozer [1] have proposed a template-based
procedure which expedites the process of reformulating an ANN weight-vector as
a symbolic M-of-N expression rather than as normal symbolic rules. (Alexander
and Mozer adopt the convention n-of-m to describe their rule-set.) The authors
introduce the notion of a weight template as a parameterised region of the total
weight space corresponding to a given n-of-m expression, find the template that
best fits the actual weights in the trained ANN, and then extract the rules
directly. Other approaches include that of Setiono et al., which facilitates the
process of rule extraction by pruning redundant connections in the ANN and
clustering the activation values of the hidden nodes [42].
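The activation-clustering step can be sketched as follows; this is an illustrative greedy variant with a tolerance parameter eps, and the actual algorithm in [42] differs in detail:

```python
def cluster_activations(values, eps=0.1):
    """Greedily merge sorted 1-D activation values: a value within eps of
    the last cluster centre is absorbed, otherwise it opens a new cluster."""
    centres = []
    for v in sorted(values):
        if centres and abs(v - centres[-1]) <= eps:
            continue
        centres.append(v)

    def discretise(v):
        # Replace a continuous activation by its nearest cluster centre.
        return min(centres, key=lambda c: abs(c - v))

    return centres, discretise

# Five activations collapse to three discrete levels, so the hidden unit
# behaves like a small symbolic variable during rule extraction.
centres, discretise = cluster_activations([0.02, 0.05, 0.48, 0.51, 0.97])
```

Once each hidden unit takes only a handful of discrete values, enumerating rules over those values becomes tractable.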
McGarry et al. have extended the rule-extraction process to include Radial
Basis Function (RBF) networks [29]. In particular they undertook a comparison
of the quality and complexity of the rule sets extracted from RBF networks with
those extracted from multi-layer perceptron networks trained on the same data
sets. Apart from providing a symbolic representation of the clusters derived from
the data sets, they were also able to show that it was possible to represent accurately the boundaries of these clusters using a compact rule set. This confirms
similar results reported by Andrews et al. [3].
Bologna et al. have experimented with rule extraction from combinations of neural networks [5]. Symbolic rules were extracted from each single neural network and also from the global combination. Rules generated from the global system had the same representation as those generated by each network (i.e. conventional rules). The neural network model used in the experiments was IMLP. This is
a particular Multi-Layer Perceptron architecture from which symbolic rules are
extracted in polynomial time using an eclectic rule extraction technique [7]. It
should be noted that the key step of the rule extraction process is related to
a Boolean minimisation problem for which the goal is to determine whether an
axis-parallel hyper-plane is discriminant in a given region of the input space. As
a result, fidelity of rules can be made as high as desired.
5.2 Extracting Knowledge in Other Forms
Traditionally, people have been looking for rules of the form Ri ⇒ Ro, which reads: if the input of the ANN is in the region Ri, then its output is in the region Ro, where the regions are axis-parallel hypercubes. Recently, a number of rule extraction methods that use polyhedra as regions have been proposed. Setiono [40]
and Bologna [6] called them oblique decision rules. Ideally, a rule extraction method should first compute the best possible approximation of the true reciprocal
image of a region of the output space and then, depending on the application, either use this raw region directly (for software verification) or process the information
to make it more comprehensible to a human user. This inversion problem is the
core problem of rule extraction. Maire [28] introduced an algorithm for inverting the function realised by a feed-forward network based on the inversion of
the vector-valued functions computed by each individual layer. This algorithm
back-propagates polyhedral regions, which are constraints similar to those used in Validity Interval Analysis [47], from the output layer back to the input layer.
These regions are finite unions of polyhedra. A by-product of this backpropagation algorithm is a new rule-extraction method whose fidelity can provably be
made as high as desired. The method can be applied to all feed-forward networks
with continuous values, and it advocates the view that a rule extraction algorithm must first try to find a high-fidelity set of rules (the best approximation to the true reciprocal image) and then perform the necessary post-processing (simplification, translation to other rule formats, projection, and visualisation). This is contrary to most rule extraction methods, which make dramatic approximations from the start; approximations should instead be postponed until the last moment.
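The inversion step can be illustrated for a single sigmoid output unit: the pre-image of an output interval is a slab (a pair of parallel half-spaces), which is the simplest polyhedral region. This is a hedged sketch in the spirit of Validity Interval Analysis [47] with assumed example weights, not the full algorithm:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

def invert_output_interval(w, b, lo, hi):
    """Pre-image of {x : lo <= sigmoid(w.x + b) <= hi} for one unit:
    the slab  logit(lo) - b <= w.x <= logit(hi) - b,
    returned as (w, lower, upper)."""
    return w, logit(lo) - b, logit(hi) - b

# Back-propagate the output constraint 0.6 <= y <= 0.9 through
# y = sigmoid(w.x + b) for assumed weights w and bias b.
w, b = [1.0, 2.0], 0.5
w_out, lower, upper = invert_output_interval(w, b, 0.6, 0.9)

# Any x inside the slab satisfies the original output constraint.
x = (0.5, 0.25)
dot = sum(wi * xi for wi, xi in zip(w, x))
```

Composing such inversions layer by layer yields the finite unions of polyhedra described above.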
Previous surveys of ANN rule-extraction techniques together with the critiques presented above, illustrate that there is an increasingly rich and diverse
range of schemas being employed to represent the knowledge extracted from
ANNs. Nonetheless, there remains considerable scope for further exploration and
development. One such area, for example, is where ANNs are used in system control and management functions. In particular a distinguishing characteristic of
this application of knowledge-extraction techniques is that the output from the
ANN is real-valued (continuous) data as distinct from nominal-valued output
used in classification problems.
A second such example is in situations where an ANN has been trained on
the type of noisy time-series data which characterises the financial markets such
as foreign exchanges. Giles et al. have studied such applications using recurrent neural networks, representing the extracted knowledge in the form of a deterministic finite state automaton [18]. However, a desired goal of the knowledge-extraction process could also be an abstract representation of the market's dynamical behaviour, which can then be used as the basis for decision-making (e.g.
whether or not to buy or sell a particular stock or foreign currency).
In a similar vein, one of the emerging agendas in the general area of Machine Learning is determining how to represent the results of the learning process in a form which best matches the requirements of the end-user. For
example Humphrey et al. [21] discuss the application of knowledge visualisation
techniques in Machine Learning and demonstrate how users’ questions can be
mapped onto visualisation tasks to create new graphical representations that
show the flow of examples through a decision structure.
5.3 A Formal Model of Rule Extraction
Whilst rule extraction from trained feed-forward ANNs is only one facet of the
total problem of extracting the knowledge embedded within ANNs, it is nonetheless an area which continues to attract considerable attention. Golea [19] sees, as one of the main challenges in this particular area, the formulation of a consistent theoretical basis for what has been, until recently, a disparate collection
of empirical results. To this end he has proposed a formal model for analysing
the rule-extraction problem which is an adaptation of the well-known probably
approximately correct (PAC) learning model [55]. Applied in the context of extracting Boolean if-then-else rules, this model provides the basis for proving that
extracting simple rules from trained feed-forward ANNs is computationally an
NP-hard problem. Hence an avenue for further investigation is to determine if this (or a similar) model could be used to analyse the computational complexity of extracting other forms of rules, such as fuzzy rules and probabilistic rules, as well as that of extracting Finite State Machines (FSMs) from trained recurrent neural networks. Support Vector Machines deserve similar attention.
Healy [20] also observes that research into the area of rule extraction from
ANNs could benefit from the availability of a formal model of the semantics of
the rules which expresses the relationship between the application data, the neural network learning model, and the extracted rules. In particular, he offers the
proposition that a valid and complete rule base corresponds to a continuous function. His preliminary findings suggest that “to be truly capable, a rule extraction
architecture must capture the antecedents of rules with conjunctive consequents
as identifiable collections of network nodes and connections”.
Apart from the need for more formal models, there also appears to be a need
for non-experimental ways to measure the fidelity of extracted rules. Ultimately
one of the overall goals of this process is to determine under what conditions, if
any, the knowledge-extraction problem is tractable.
5.4 Predicting ANN Performance
In the Machine Learning research domain, Inductive Decision Trees (IDTs) are
commonly used models for generating symbolic rules from datasets [36]. In this
context Quinlan introduced the C4.5 and C5.0 algorithms, which are currently the “gold standard” in symbolic rule extraction. In many applications the average
predictive accuracy obtained by Artificial Neural Networks and C5.0 is similar,
but C5.0 training time is much shorter [37]. Thus, the reader could believe that
there are no benefits to using ANNs in classification problems. This is not true.
Indeed, there are classification problems in which the IDT average predictive accuracy is clearly worse than that obtained by ANNs [46], and similarly there
are problems for which the converse is true. So, it is important to know in which
context an ANN could participate in generating better quality rules.
Quinlan has analysed and compared the classification mechanism of IDTs
and ANNs [37]. He has pointed out that the inherent classification mechanism
of an ANN is parallel since in computing the value of each output neuron, all
input neurons are used simultaneously. For IDTs, the classification mechanism
is sequential as an instance is classified by following a path in the tree of testing
attributes. Now, suppose that the goal is to learn a classification task in which
each attribute is not relevant at the single level, but in which there are relevant
combinations of several attributes. In this case an ANN during its learning phase
will be more appropriate than an IDT to determine those attribute combinations
in its internal representation. To the contrary, with only one or few relevant attributes, a decision tree will be more suitable because in an ANN the high relevance
of one attribute could be hidden by the low relevance of all other attributes.
In a conjecture, Quinlan refers to classification problems which are unsuitable for ANNs as S-problems, whereas P-problems are those unsuitable for IDTs [37]. One important question that should therefore be answered in the future is how we can estimate whether a given classification task belongs to the P-problem or to the S-problem category.
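The distinction can be made concrete with a tiny XOR-style task (an illustrative construction, not one of Quinlan's benchmarks): no single attribute predicts the class, which defeats an IDT's one-attribute-at-a-time tests at the root, yet a combination of attributes determines the class exactly:

```python
from itertools import product

# Attributes (a, b) with class a XOR b: a "P-problem"-like task.
data = [((a, b), a ^ b) for a, b in product((0, 1), repeat=2)]

def single_attribute_accuracy(i):
    """Best accuracy achievable by predicting from attribute i alone."""
    best = 0.0
    for predict_one_when in (0, 1):
        acc = sum((x[i] == predict_one_when) == bool(y)
                  for x, y in data) / len(data)
        best = max(best, acc)
    return best

# Each attribute alone is no better than chance ...
chance_0 = single_attribute_accuracy(0)
chance_1 = single_attribute_accuracy(1)
# ... yet the pair determines the class exactly.
pair_determines = all((x[0] ^ x[1]) == y for x, y in data)
```

An ANN can represent the required attribute combination in a hidden unit, whereas a tree gains nothing from any single root test on this data.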
6 Summary and Conclusion
The preceding discussion has highlighted the rich diversity of mechanisms and
procedures now available to extract and represent the knowledge embedded
within trained Artificial Neural Networks. Whilst the field of ANN knowledge-extraction is one which continues to attract considerable interest, the preceding discussion has also argued the case for using an established learning model (such as PAC learning) to provide a consistent theoretical basis for what has been, until now, a disparate collection of empirical results. Furthermore, useful work remains to be done to determine under what conditions, if any, the knowledge-extraction problem is tractable, and, indeed, the conditions under which it is worth using Artificial Neural Network models rather than Inductive Decision Trees.
Acknowledgements
The authors would like to acknowledge the constructive criticisms made by two
anonymous reviewers of an earlier version of this paper.
References
1. Alexander, J.A., Mozer, M.C.: Template-Based Procedures for Neural Network
Interpretation. Neural Networks 12 (1999) 479–498.
2. Andrews, R., Diederich J., Tickle, A.B.: A Survey and Critique of Techniques
for Extracting Rules from Trained Artificial Neural Networks. Knowledge-Based
Systems 8(6) (1995) 373–389.
3. Andrews, R., Cable, R., Diederich, J., Geva, S., Golea, M., Hayward, R., Ho-Stuart,
C., Tickle, A.B.: An Evaluation and Comparison of Techniques for Extracting
and Refining Rules from Artificial Neural Networks. Technical report, Queensland
University of Technology, Australia (1996).
4. Banerjee, M, Mitra, S., Pal, S. K.: Rough Fuzzy MLP: Knowledge Encoding and
Classification. IEEE Transactions on Neural Networks 9(6) (1998) 1203–1216.
5. Bologna, G., Pellegrini, C.: Symbolic Rule Extraction from Modular Transparent
Boxes. Proceedings of the Conference of Neural Networks and their Applications
(NEURAP), 393–398 (1998).
6. Bologna, G., Pellegrini, C.: Rule-Extraction from the Oblique Multi-layer Perceptron. Proceedings of the Australian Conference on Neural Networks (ACNN’98),
University of Queensland, Australia (1998) 260–264.
7. Bologna, G.: Symbolic Rule Extraction from the DIMLP Neural Network. Hybrid Neural Systems, S. Wermter and R. Sun (eds.), Springer Verlag (1999).
8. Carpenter, G.A., Tan, A.W.: Rule Extraction: From Neural Architecture to Symbolic Representation. Connection Science 7(1) (1995) 3–27.
9. Craven, M., Shavlik, J.W.: Using Sampling and Queries to Extract Rules From
Trained Neural Networks. Machine Learning: Proceedings of the Eleventh International Conference (1994), Amherst MA, Morgan-Kaufmann, 73–80.
10. Craven, M.: Extracting Comprehensible Models from Trained Neural Networks.
PhD Thesis, University of Wisconsin, Madison Wisconsin (1996).
11. Fu, L.M.: Rule Generation from Neural Networks. IEEE Transactions on Systems,
Man and Cybernetics 28(8) (1994) 1114–1124.
12. Fu, L.M.: A Neural Network Model for Learning Domain Rules Based on its Activation Function Characteristics. IEEE Transactions on Neural Networks, 9(5)
(1998) 787–795.
13. Fu, L.M.: A Neural Network for Learning Domain Rules with Precision. Proceedings of the International Joint Conference on Neural Networks (IJCNN99) (1999)
(To Appear).
14. Giles, C.L., Miller C.B., Chen, D., Chen, H., Sun, Z., Lee, Y.C.: Learning and
Extracting Finite State Automata with Second-order Recurrent Neural Networks.
Neural Computation 4 (1992) 393–405.
15. Giles, C.L., Omlin, C.W.: Rule Refinement with Recurrent Neural Networks. Proceedings of the IEEE International Conference on Neural Networks San Francisco
CA (1993) 801–806.
16. Giles, C.L., Omlin, C.W.: Extraction, Insertion, and Refinement of Symbolic Rules
in Dynamically Driven Recurrent Networks. Connection Science 5(3-4) (1993) 307–
328.
17. Giles, C.L., Omlin, C.W.: Rule Revision with Recurrent Networks. IEEE Transactions on Knowledge and Data Engineering 8(1) (1996) 183.
18. Giles, C.L., Lawrence, S., Tsoi, A.C.: Rule Inference for Financial Prediction using
Recurrent Neural Networks. Proceedings of the IEEE/IAFE Conference on Computational Intelligence for Financial Engineering (CIFEr), IEEE Piscataway NJ
(1997) 253–259.
19. Golea, M.: On the Complexity of Rule Extraction from Neural Networks and Network Querying. Proceedings of the Rule Extraction From Trained Artificial Neural
Networks Workshop, Society For the Study of Artificial Intelligence and Simulation of Behavior Workshop Series (AISB’96) University of Sussex, Brighton, UK
(1996) 51–59.
20. Healy, M.J.: A Topological Semantics for Rule Extraction with Neural Networks.
Connection Science 11(1) (1999) 91–113.
21. Humphrey, M., Cunningham S.J., Witten I.H.: Knowledge Visualization Techniques for Machine Learning. Intelligent Data Analysis 2(4) (1998).
22. Ivanova, I., Kubat, M.: Initialisation of Neural Networks by Means of Decision
Trees. Knowledge Based Systems 8(6) (1995) 333–344.
23. Krishnan, R.: A Systematic Method for Decompositional Rule Extraction From
Neural Networks. Proceedings of the NIPS’96 Rule Extraction From Trained Artificial Neural Networks Workshop (1996), Queensland University of Technology
38–45.
24. Krishnan, R., Sivakumar, G., Battacharya, P.: A Search Technique for Rule Extraction from Trained Neural Networks. Pattern Recognition Letters 20 (1999)
273–280.
25. Kubat, M.: Decision Trees Can Initialize Radial-Basis Function Networks. IEEE
Transactions on Neural Networks 9(5) (1998) 813–821.
26. Lin, C-T., Lee, C.S.G.: Neural Fuzzy Systems: a Neuro-Fuzzy Synergism to Intelligent Systems. Prentice-Hall Upper Saddle River NJ (1996).
27. Maire, F.: A Partial Order for the M-of-N Rule-Extraction Algorithm. IEEE Transactions on Neural Networks 8(6) (1997) 1542–1544.
28. Maire, F.: Rule-Extraction by Backpropagation of Polyhedra. Journées Francophones sur l’Apprentissage Automatique (JFA’98) (May) Arras France (1998).
29. McGarry, K., Wermter, S., MacIntyre, J.: Knowledge Extraction from Radial Basis
Function Networks. Proceedings of the International Joint Conference on Neural
Networks (IJCNN99) (1999) (To Appear).
30. Meneganti, M., Saviello, S., Tagliaferri, R.: Fuzzy Neural Networks for Classification and Detection of Anomalies. IEEE Transactions on Neural Networks 9(5)
(1998) 848–861.
31. Michie, D., Spiegelhalter, D.L., Taylor C.C: Machine Learning, Neural and Statistical Classification. Hertfordshire Ellis Horwood (1994).
32. Neumann, J.: Classification and Evaluation of Algorithms for Rule Extraction From
Artificial Neural Networks: A Review. (Unpublished) Centre for Cognitive Science,
University of Edinburgh (1998).
33. Omlin, C.W., Giles, C.L.: Extraction of Rules from Discrete-Time Recurrent Neural Networks. Connection Science 5(3-4) (1993) 307–336.
34. Omlin, C.W., Thornber, K., Giles, C.L.: Fuzzy Finite State Automata Can be
Deterministically Encoded Into Recurrent Neural Networks. IEEE Transactions
on Fuzzy Systems (To appear).
35. Opitz, D.W., Shavlik, J.W.: Dynamically Adding Symbolically Meaningful Nodes
to Knowledge-Based Neural Networks. Knowledge-Based Systems 8(6) (1995) 301–
311.
36. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, California (1993).
37. Quinlan, J.R.: Comparing Connectionist and Symbolic Learning Methods. In Computational Learning Theory and Natural Learning Systems: Constraints and Prospects, ed. R. Rivest, MIT Press (1994), 445–456.
38. Saito, K., Nakano, R.: Law Discovery Using Neural Networks. Proceedings of
the NIPS’96 Rule Extraction From Trained Artificial Neural Networks Workshop,
Queensland University of Technology (1996) 62–69.
39. Schellhammer, I., Diederich, J., Towsey, M., Brugman, C.: Knowledge Extraction
and Recurrent Neural Networks: an analysis of an Elman network trained on a natural language learning task. Queensland University of Technology, NRC Technical
Report (1997) 97–151.
40. Setiono, R., Huan, L.: NeuroLinear: From Neural Networks to Oblique Decision
Rules. Neurocomputing 17 (1997) 1–24.
41. Setiono, R.: Extracting Rules from Neural Networks by Pruning and Hidden Unit
Splitting. Neural Computation 9 (1997) 205–225.
42. Setiono, R., Thong, J.Y.L., Yap, C.S.: Symbolic Rule Extraction from Neural Networks: An Application to Identifying Organisations Adopting IT. Information and
Management 34(2) (September) (1998) 91–101.
43. Sun, R., Sessions, C.: Extracting Plans from Reinforcement Learners. Proceedings
of the International Symposium on Intelligent Data Engineering and Learning
(IDEAL’98) (October) Springer-Verlag (1998).
44. Sun, R., Peterson, T.: Autonomous Learning of Sequential Tasks: Experiments and
Analyses. IEEE Transactions on Neural Networks, 9(6) (1998) 1217–1234.
45. Sun, R.: Knowledge Extraction from Reinforcement Learning. Proceedings of the
International Joint Conference on Neural Networks (IJCNN99) (1999) (To Appear).
46. Thrun, S.B., Bala, J., Bloedorn, E., Bratko, I., Cestnik, B., Cheng, J., De Jong,
K., Dzeroski, S., Fahlman, S. E., Fisher, D., Hamann, R., Kaufman, K., Keller, S.,
Kononenko, I., Kreuziger, J., Michalski, R.S., Mitchell, T., Pachowicz, P., Reich, Y.,
Vafaie, H., Van de Welde, K., Wenzel, W., Wnek, J. and Zhang, J.: The MONK’s
Problems: a Performance Comparison of Different Learning Algorithms. Carnegie
Mellon University, Technical report CMU-CS-91-197 (1991).
47. Thrun, S.B.: Extracting Provably Correct Rules From Artificial Neural Networks.
Technical Report IAI-TR-93-5, Institut für Informatik III, Universität Bonn (1994).
48. Tickle, A.B., Hayward, R., Diederich, J.: Recent Developments in Techniques for
Extracting Rules from Trained Artificial Neural Networks. Herbstschule Konnektionismus (HeKonn 96) (October) Munster (1996).
49. Tickle, A.B., Andrews, R., Golea, M., Diederich, J.: Rule Extraction from Trained
Artificial Neural Networks. In A. Browne (Ed) Neural Network Analysis, Architectures and Applications Institute of Physics Publishing Bristol UK (1997) 61–99.
50. Tickle, A.B., Golea, M., Hayward, R., Diederich, J.: The Truth is in There: Current Issues in Extracting Rules from Trained Feed-Forward Artificial Neural Networks. Proceedings of the 1997 IEEE International Conference on Neural Networks
(ICNN’97), 4 (1997) 2530–2534.
51. Tickle, A.B.: Machine Learning, Neural Networks and Information Security: Techniques for Extracting Rules from Trained Feed-Forward Artificial Neural Networks
and their Application in an Information Security Problem Domain. PhD Dissertation, Queensland University of Technology (1997).
52. Tickle, A.B., Andrews, R., Golea, M., Diederich, J.: The Truth will come to Light:
Directions and Challenges in Extracting the Knowledge Embedded within Trained
Artificial Neural Networks. IEEE Transactions on Neural Networks 9(6) (1998)
1057–1068.
53. Tickle, A.B., Maire F., Bologna, G., Diederich, J.: Extracting the Knowledge Embedded within Trained Artificial Neural Networks: Defining the Agenda. Proceedings of the Third International ICSC Symposia on Intelligent Industrial Automation (IIA’99), and Soft Computing (SOCO’99) (1999) 732–738.
54. Towell, G., Shavlik, J.: The Extraction of Refined Rules From Knowledge Based
Neural Networks. Machine Learning 13(1) (1993) 71–101.
55. Valiant, L.G.: A theory of the learnable. Communications of the ACM 27(11)
(1984) 1134–1142.
56. Williams, R.J., Zipser, D.: A Learning Algorithm for Continually Running Fully
Recurrent Neural Networks. Neural Computation 1(2) (1989) 270–280.
Symbolic Rule Extraction from the DIMLP
Neural Network
Guido Bologna
Machine Learning Research Centre
Queensland University of Technology
Box 2434 GPO Brisbane
Queensland 4001, Australia
bologna@fit.qut.edu.au
Abstract. The interpretation of neural network responses as symbolic rules is currently a difficult task. Our first approach consists in characterising the discriminant hyper-plane frontiers. More particularly, we point out that the shape of a discriminant frontier built by a standard multi-layer perceptron is related to an equation with two terms: the first is linear, and the second is logarithmic. This equation is not sufficient to easily generate symbolic rules, so we introduce the Discretized Interpretable Multi Layer Perceptron (DIMLP) model, a more constrained multi-layer architecture. From this special network, rules are extracted in polynomial time, and continuous attributes do not need to be transformed into binary ones. We apply DIMLP to three public-domain applications in which it gives better average predictive accuracy than the C4.5 algorithm, and we discuss rule quality.
1 Introduction
The lack of validation tools is often one of the reasons for not using neural systems in practice. For instance, physicians cannot trust a diagnosis system that gives no explanation of its responses. The difficulty of justifying neural network responses is due to their distributed internal representation. More particularly, the overall network decision mechanism is represented over a space of connection weights and activation values which has an exponential size, and so in practice it cannot be entirely explored.
Researchers in the field of symbolic rule extraction from neural networks
have proposed several algorithms. They are concisely described by the taxonomy
proposed by Andrews et al. [1]. Briefly, symbolic rule extraction methods belong
to three categories: pedagogical, decompositional, and eclectic. In the pedagogical
approach symbolic rules are generated by an inductive symbolic algorithm which
globally analyses expressions related to the input and the output layer. For
the decompositional approach, symbolic rules are determined by analysing the
weights at the level of each hidden neuron and each output neuron. Finally,
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 240–254, 2000.
c Springer-Verlag Berlin Heidelberg 2000
Symbolic Rule Extraction from the DIMLP Neural Network
241
the eclectic approach is a combination of the pedagogical and decompositional
strategies.
Inductive Decision Trees (IDTs) such as C4.5 [2] and CART [4] are popular algorithms for data classification and symbolic rule extraction. In the Machine Learning community it is well known that in many applications neural networks and
inductive decision trees reach similar accuracy performances. However, Quinlan
pointed out that a special category of problems is more favourable to neural
networks and another one to decision trees [3].
In this chapter we investigate the shape of the discriminant frontiers built
by a multi-layer perceptron in order to choose a strategy for the extraction
of symbolic rules. As a result, we introduce the Discretized Interpretable Multi
Layer Perceptron (DIMLP) model. In this context, rules are directly generated by
determining discriminant hyper-plane frontiers. Moreover, a noticeable feature
of our rule extraction algorithm is that it scales in polynomial time with respect
to the number of hidden neurons and to the number of examples.
We compare DIMLP, standard MLP networks and C4.5 decision trees in
three public-domain classification problems. The first one is the well-known Monk-2, the second one is related to the Sonar dataset, and the last one
concerns the classification of handwritten digits.
2 Discriminant Frontier Determination in a Standard Multi-layer Perceptron
The problem of discriminant frontier determination in the general case of multi-layer perceptrons is very complex. Hence, we simplify our investigations by first considering only two different classes and at most one hidden layer of neurons. As a notation, the symbols xi, hj, and yk will denote input neurons, hidden neurons, and output neurons, respectively¹. Finally, the activation function of hidden and output neurons is the sigmoid function σ, given as

    σ(x) = 1 / (1 + e^(−x)) .     (1)
Without loss of generality we assume that an instance is classified as class 1 when $y_1 > y_2$ and as class 2 otherwise, with $y_1$ and $y_2$ corresponding to the two output neurons. It follows that the discriminant frontier between the two classes lies in the set of points where $y_1 = y_2$.
2.1 One Hidden Neuron
In this case the outputs of the network are

$$y_1 = \sigma\Big(v_{10} + v_{11}\,\sigma\big(\sum_i w_{1i} x_i\big)\Big) \qquad (2)$$

¹ Bias virtual neurons are also used.
G. Bologna
and

$$y_2 = \sigma\Big(v_{20} + v_{21}\,\sigma\big(\sum_i w_{1i} x_i\big)\Big) \, ; \qquad (3)$$
where $w_{1i}$ are the weights between the input layer and the hidden neuron, and $v_{10}, v_{11}, v_{20}, v_{21}$ are the weights between the hidden neuron and the output layer. As the sigmoid is a monotonic function, the set of points where $y_1 = y_2$ is determined by

$$(v_{10} - v_{20}) + (v_{11} - v_{21})\,\sigma\big(\sum_i w_{1i} x_i\big) = 0 \, . \qquad (4)$$
At this point it is important to note that equation (4) has a solution if and only if

$$0 < -\frac{v_{10} - v_{20}}{v_{11} - v_{21}} < 1 \quad \text{and} \quad v_{11} \neq v_{21} \, . \qquad (5)$$
Equation (4) is equivalent to

$$(v_{11} - v_{21}) = -(v_{10} - v_{20}) \cdot \big(1 + e^{-\sum_i w_{1i} x_i}\big) \, . \qquad (6)$$

Further,

$$\frac{(v_{11} - v_{21}) + (v_{10} - v_{20})}{(v_{10} - v_{20})} = -e^{-\sum_i w_{1i} x_i} \, . \qquad (7)$$

Assuming that (5) holds, we obtain

$$\sum_i w_{1i} x_i + \ln\left(-1 - \frac{v_{11} - v_{21}}{v_{10} - v_{20}}\right) = 0 \, . \qquad (8)$$
In the simple perceptron model a hyper-plane can be shifted with respect to the origin by the bias unit. Here, a discriminant hyper-plane is additionally shifted by the weights between the hidden neuron and the output neurons (the logarithmic term in (8)).
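The role of this logarithmic shift can be checked numerically. The sketch below uses arbitrary illustrative weights (assumptions, not values from the chapter) that satisfy condition (5), and verifies that an input lying on the hyper-plane predicted by (8) indeed yields $y_1 = y_2$:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Illustrative weights (assumptions, not values from the chapter).
v10, v20 = -1.0, 0.0    # output biases
v11, v21 = 4.0, 1.0     # hidden-to-output weights
# Condition (5): 0 < -(v10 - v20)/(v11 - v21) < 1 and v11 != v21.
assert 0 < -(v10 - v20) / (v11 - v21) < 1 and v11 != v21

w = (2.0, -1.0)         # input-to-hidden weights w11, w12

# Equation (8) places the frontier at sum_i w1i*xi = -ln(-1 - (v11-v21)/(v10-v20)).
shift = -math.log(-1.0 - (v11 - v21) / (v10 - v20))

# Take any input whose weighted sum equals `shift` and check y1 = y2 there.
x1 = 0.7
x2 = (shift - w[0] * x1) / w[1]
s = w[0] * x1 + w[1] * x2          # equals `shift` by construction
h = sigmoid(s)
y1 = sigmoid(v10 + v11 * h)
y2 = sigmoid(v20 + v21 * h)
assert abs(y1 - y2) < 1e-12
```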
2.2 Two Hidden Neurons

Let us consider an MLP with two hidden neurons, denoted $h_1$ and $h_2$. In this case $y_1$ and $y_2$ are given as

$$y_1 = \sigma\Big(\sum_i v_{1i} h_i\Big) \qquad (9)$$

and

$$y_2 = \sigma\Big(\sum_i v_{2i} h_i\Big) \, . \qquad (10)$$
Symbolic Rule Extraction from the DIMLP Neural Network
243
We define $\Delta V_i$ as the difference between the weights $v_{1i}$ and $v_{2i}$ ($\Delta V_0$ designates the difference of the weights related to the bias neuron). Hence, $y_1 = y_2$ is equivalent to

$$\Delta V_0 + \Delta V_1 h_1 + \Delta V_2 h_2 = 0 \, . \qquad (11)$$
Equation (11) is equivalent to

$$h_1 = -\frac{\Delta V_0 + \Delta V_2 h_2}{\Delta V_1} \, . \qquad (12)$$

Equation (12) has a solution if and only if

$$0 < -\frac{\Delta V_0 + \Delta V_2 h_2}{\Delta V_1} < 1 \quad \text{and} \quad \Delta V_1 \neq 0 \, . \qquad (13)$$
Rewriting (12) leads to

$$1 + e^{-\sum_i w_{1i} x_i} = -\frac{\Delta V_1}{\Delta V_0 + \Delta V_2 h_2} \, . \qquad (14)$$

Finally, applying the logarithm we obtain

$$\sum_i w_{1i} x_i + \ln\left(-1 - \frac{\Delta V_1}{\Delta V_0 + \Delta V_2 h_2}\right) = 0 \, . \qquad (15)$$
Equation (15) is linear if $h_2$ is constant. So, when $h_2$ is saturated², we again have almost linear hyper-plane frontiers. Otherwise, when the logarithmic term varies, the linear frontier is curved.
2.3 More than Two Hidden Neurons
We suppose that we have one hidden layer with $l$ hidden neurons and two output neurons. The hypothesis $y_1 = y_2$ is equivalent to

$$\sum_{k=0}^{l} \Delta V_k h_k = 0 \, . \qquad (16)$$
After several steps (cf. eq. (15)) we obtain

$$\sum_j w_{ij} x_j + \ln\left(-1 - \frac{\Delta V_i}{\sum_{k=0,\,k \neq i}^{l} \Delta V_k h_k}\right) = 0 \, ; \qquad (17)$$

with conditions

$$0 < -\frac{\sum_{k=0,\,k \neq i}^{l} \Delta V_k h_k}{\Delta V_i} < 1 \quad \text{and} \quad \Delta V_i \neq 0 \, . \qquad (18)$$

We denote the logarithmic term of (17) as the shift-curving term, while the other one is called the hyper-plane term.
² The value of the sigmoid function varies very slowly below −3 or above 3.

2.4 Discussion on Discriminant Frontiers
The weighted sum related to a hidden neuron represents a possible discriminant
hyper-plane frontier that will be shifted and/or curved by the weights between
the hidden layer and the output layer. In other words, a neuron defines a virtual hyper-plane frontier which may turn into a real hyper-plane frontier. The bending of a frontier appears when the shift-curving term varies. A linear approximation of MLP frontiers is given in [5].

At this point we may wonder about the creation of a special multi-layer perceptron model in which there is neither a curving phenomenon nor a shifting factor. This kind of network would have only oblique hyper-plane frontiers. However, when many oblique frontiers are built, rules are not understandable³. So, the question that arises is whether it is possible to define a multi-layer perceptron model with hyper-plane frontiers parallel to the axes of the input variables. This is precisely the subject of the next section.
3 The IMLP Model

3.1 The Architecture
The IMLP network [6] has an input layer, one or two hidden layers and an
output layer (see Figure 1). In this model, each neuron of the first hidden layer
Fig. 1. An IMLP network with two hidden layers. Each neuron of the first hidden layer
is connected to only one input neuron and to the first bias unit. All the other layers
are fully connected.
³ In this case the antecedents of the rules would be linear combinations of input attributes.
is connected to only one input neuron and to the first virtual bias unit. All
the other layers are fully connected. As we will see in Section 3.2, this special
connectivity pattern is the basic idea that will permit us to extract symbolic
rules.
The output $h_i$ of the $i$th neuron of the first hidden layer is given by the threshold function

$$h_i = \begin{cases} 1 & \text{if } \sum_j w_{ij} x_j > 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (19)$$
When the model has two hidden layers, the second hidden layer responses are
not given by the threshold function but rather by the sigmoid function. Hence,
the IMLP architecture differs from the standard MLP architecture in three main
ways:
1. Each neuron in the first hidden layer is connected to only one input neuron.
2. The activation function used by the neurons of the first hidden layer is the
threshold function instead of the sigmoid function.
3. The size of the first hidden layer is determined experimentally as a multiple of the size of the input layer⁴.
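A minimal forward pass reflecting these three points can be sketched as follows (a hypothetical helper, not the author's implementation): the first hidden layer applies the threshold function of eq. (19) to one input each, while the subsequent layers are fully connected sigmoid layers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def imlp_forward(x, w, W2, W3):
    """Forward pass of a small two-hidden-layer IMLP (illustrative sketch).

    w  : per-neuron pairs (w_i0, w_i1); first-hidden neuron i sees only input i.
    W2 : second-hidden-layer weight rows, each with a leading bias weight.
    W3 : output-layer weight rows, each with a leading bias weight.
    """
    # First hidden layer: threshold units, one input each (eq. 19).
    h1 = [1.0 if wi0 + wi1 * xi > 0 else 0.0 for (wi0, wi1), xi in zip(w, x)]
    # Remaining layers are fully connected with sigmoid activations.
    h2 = [sigmoid(r[0] + sum(v * h for v, h in zip(r[1:], h1))) for r in W2]
    y = [sigmoid(r[0] + sum(v * h for v, h in zip(r[1:], h2))) for r in W3]
    return h1, y
```

Because `h1` is binary, every input falling in the same cell of the input-space partition produces exactly the same network response.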
3.2 Symbolic Rule Extraction

The key idea for symbolic rule extraction is hyper-plane determination. As we will see in the next paragraph, each hyper-plane expression containing only one input variable represents a possible antecedent of a symbolic rule.
Examples with One Hyper-Plane. Figure 2 shows an example of an IMLP network with only one hidden layer and only one hidden neuron. The weighted sum of inputs and weights related to a hidden neuron represents a virtual hyper-plane frontier that becomes effective depending on the weights related to successive layers ($v_0$ and $v_1$ in figure 2).

A real hyper-plane frontier will be located at $-w_0/w_1$. In fact, the points where $h_1 = 1$ are defined by $w_1 x_1 + w_0 \ge 0$, which is equivalent to $x_1 \ge -w_0/w_1$. The points where $h_1 = 0$ are defined by $w_1 x_1 + w_0 < 0$, which is equivalent to $x_1 < -w_0/w_1$. The weights above those between the input layer and the first hidden layer determine whether a hyper-plane frontier is virtual or not.
Assuming that we have the class black circle when $y_1 \ge 0.5$ and white square otherwise, we obtain a real hyper-plane frontier when

$$\begin{cases} v_0 + v_1 \ge 0 & (h_1 = 1) \\ v_0 < 0 & (h_1 = 0). \end{cases} \qquad (20)$$

If (20) holds we obtain two symbolic rules:

1. If ($x_1 \ge -w_0/w_1$) =⇒ black circle.
2. If ($x_1 < -w_0/w_1$) =⇒ white square.
⁴ Thus, during the training phase all input neurons have the same opportunity to affect network responses.
Fig. 2. The concept of a virtual hyper-plane (dashed line) and a real hyper-plane (solid line) in an IMLP network. The virtual hyper-plane located at $-w_0/w_1$ is effective when $v_0$ and $v_1$ are in a particular configuration (cf. (20)).

An Example with Two Perpendicular Frontiers. We create two perpendicular hyper-plane frontiers by connecting $h_1$ to $x_1$ and $h_2$ to $x_2$. The IMLP network and the corresponding input space partition are shown in figure 3.
Fig. 3. An example with perpendicular hyper-plane frontiers (solid lines) in an IMLP network with two hidden neurons.
A configuration of the weights $v_0$, $v_1$, and $v_2$ which enables $h_1$ and $h_2$ to define two real hyper-plane frontiers is

$$\begin{cases} v_0 < 0 \\ v_0 + v_1 < 0 \\ v_0 + v_2 < 0 \\ v_0 + v_1 + v_2 \ge 0. \end{cases} \qquad (21)$$
With respect to the hidden neurons $h_1$ and $h_2$ the input space partition is given as

$$\begin{aligned}
\bar{h}_1 \bar{h}_2 &\Longrightarrow \text{white square} \\
\bar{h}_1 h_2 &\Longrightarrow \text{white square} \\
h_1 \bar{h}_2 &\Longrightarrow \text{white square} \\
h_1 h_2 &\Longrightarrow \text{black circle} \; ;
\end{aligned} \qquad (22)$$

where a bar denotes negation and multiplication denotes conjunction. This is equivalent to

$$\begin{aligned}
\bar{h}_1 \bar{h}_2 + \bar{h}_1 h_2 + h_1 \bar{h}_2 &\Longrightarrow \text{white square} \\
h_1 h_2 &\Longrightarrow \text{black circle} \; ;
\end{aligned} \qquad (23)$$

where addition denotes disjunction. Using a Karnaugh map (see figure 4) we simplify (23) into

$$\begin{aligned}
\bar{h}_1 &\Longrightarrow \text{white square} \\
\bar{h}_2 &\Longrightarrow \text{white square} \\
h_1 h_2 &\Longrightarrow \text{black circle.}
\end{aligned} \qquad (24)$$
Finally, transcribing (24) into the input space we obtain three symbolic rules.

1. If ($x_1 < -w_{10}/w_1$) =⇒ white square.
2. If ($x_2 < -w_{20}/w_2$) =⇒ white square.
3. If ($x_1 \ge -w_{10}/w_1$) and ($x_2 \ge -w_{20}/w_2$) =⇒ black circle.
Once again, the real hyper-plane frontiers are not shifted with respect to the virtual hyper-plane frontiers.

In general, in an IMLP network virtual hyper-planes are not shifted, because the threshold functions related to the neurons of the first hidden layer create sub-regions of the input space in which network responses are constant. By contrast, in a standard multi-layer perceptron even two arbitrarily close input examples give slightly different network responses.
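The weight configuration of figure 3 can be verified directly: with $v_0 = -10$, $v_1 = 5$, $v_2 = 6$, the inequalities (21) hold, and the simplified rule set (24) reproduces the network's partition over all four binary hidden states (sketch; the single output is thresholded at 0.5, i.e. at a weighted sum of 0):

```python
from itertools import product

v0, v1, v2 = -10.0, 5.0, 6.0   # weights from figure 3
# Conditions (21): only the cell h1 = h2 = 1 is classified black circle.
assert v0 < 0 and v0 + v1 < 0 and v0 + v2 < 0 and v0 + v1 + v2 >= 0

for h1, h2 in product((0, 1), repeat=2):
    net = "black circle" if v0 + v1 * h1 + v2 * h2 >= 0 else "white square"
    rule = "black circle" if h1 and h2 else "white square"   # rules (24)
    assert net == rule
```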
Fig. 4. A Karnaugh map characterising the input space partition of the IMLP network
illustrated in figure 3.
Rule Extraction in the General Case. The key idea behind rule extraction is the simplification of the boolean expressions formed by the binary values of the first hidden layer and the classification provided by the output layer. With respect to the taxonomy introduced by Andrews et al. [1], our rule extraction algorithm is eclectic.

The weights between the input layer and the first hidden layer achieve a binary transformation of each input attribute. Moreover, at the level of the first hidden layer, input signals are not combined together, so it is as if we had the input layer again. A rule extraction method that analyses the associations between the input layer (here, the first hidden layer) and the output layer is pedagogical. However, as we also use the weights between the input layer and the first hidden layer to determine rule antecedents, we obtain a pedagogical and decompositional rule extraction technique. Hence, it is eclectic.
The general algorithm is given as:
1. Propagate all available examples in the network.
2. Form logical expressions composed of boolean values of the first hidden layer
and the class given by the output layer.
3. Simplify all boolean expressions by a covering algorithm.
4. Rewrite simplified expressions in terms of input attributes.
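The four steps above can be sketched in a few lines. The helper names below (`first_hidden`, `classify`) are hypothetical stand-ins for the trained network's layers, and the covering pass is a deliberately naive greedy simplification, far weaker than Espresso or C4.5:

```python
from collections import defaultdict

def covers(lits, pattern):
    """True if the cube given by literals (index, value) contains the pattern."""
    return all(pattern[i] == b for i, b in lits)

def extract_rules(examples, first_hidden, classify):
    # Steps 1-2: propagate examples, record (boolean pattern -> class) pairs.
    by_class = defaultdict(set)
    for x in examples:
        by_class[classify(x)].add(first_hidden(x))

    # Step 3: greedy covering -- drop literals from each pattern while the
    # resulting cube stays disjoint from the patterns of every other class.
    rules = defaultdict(set)
    for cls, pats in by_class.items():
        others = [p for c, ps in by_class.items() if c != cls for p in ps]
        for pat in pats:
            lits = list(enumerate(pat))
            i = 0
            while i < len(lits):
                trial = lits[:i] + lits[i + 1:]
                if not any(covers(trial, o) for o in others):
                    lits = trial
                else:
                    i += 1
            rules[cls].add(tuple(lits))
    # Step 4 (not shown) rewrites each literal (i, 1) / (i, 0) as the
    # input-space antecedent x_i >= -w_i0/w_i1 or x_i < -w_i0/w_i1.
    return dict(rules)
```

On the Fig. 3 example this recovers exactly the simplified set (24): one two-literal rule for black circle and two single-literal rules for white square.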
In the third step of the general algorithm, Karnaugh maps cannot be used for more than 6 variables. The most costly operation of the rule extraction algorithm is the covering phase. If we search for the minimal covering, this calculation is exponential in the number of neurons of the first hidden layer and the number of examples. However, using heuristic covering algorithms such as Espresso [7] or C4.5 [2] we obtain quasi-optimal solutions in polynomial time. It is always possible to generate rules that exactly mimic IMLP responses; nevertheless, if the number of generated rules is too large, a partial covering may give a more comprehensible rule set.
3.3 Learning

The use of threshold functions in the first hidden layer makes the error function non-differentiable. For the learning phase of IMLP networks we use an adapted back-propagation algorithm [8] and/or a simulated annealing algorithm [9]. The key idea of the former resides in gradually increasing the slope of the sigmoidal functions to obtain a network with threshold functions, whereas for simulated annealing the error function need not be differentiable.
4 The DIMLP Model
The Discretized Interpretable Multi Layer Perceptron (DIMLP) model is a generalisation of the IMLP model. Neurons of the first hidden layer transform
continuous inputs into discrete values instead of binary ones.
4.1 The Architecture
The architectures of DIMLP and IMLP are fundamentally the same. The only difference resides in the activation function of the neurons of the first hidden layer. In DIMLP networks this activation function is a staircase function, as shown in figure 5. So, continuous input values are transformed into several discrete levels. The staircase function is given as

$$S(x) = \begin{cases} \sigma(R_{min}) & \text{if } x < R_{min} \\ \sigma\!\left(R_{min} + \dfrac{R_{max} - R_{min}}{d} \left[ \dfrac{d \, (x - R_{min})}{R_{max} - R_{min}} \right] \right) & \text{if } R_{min} \le x \le R_{max} \\ \sigma(R_{max}) & \text{if } x > R_{max} \end{cases} \qquad (25)$$

where σ is the sigmoid function, $R_{min}$ and $R_{max}$ form a range over which the staircase is adapted to the sigmoid function, $d$ is the number of discretized levels, and "[ ]" denotes the integer part.
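Under this reading of eq. (25), with [·] the integer part, the staircase activation can be sketched as:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def staircase(x, r_min=-5.0, r_max=5.0, d=10):
    """d-step discretization of the sigmoid over [r_min, r_max] (eq. 25)."""
    if x < r_min:
        return sigmoid(r_min)
    if x > r_max:
        return sigmoid(r_max)
    step = (r_max - r_min) / d
    return sigmoid(r_min + step * int(d * (x - r_min) / (r_max - r_min)))
```

With $d = 10$ over $[-5, 5]$, every input maps to one of at most $d + 1$ distinct activation values, which is what makes the virtual hyper-planes countable.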
Fig. 5. Discretization of the sigmoid function.
4.2 Learning
There are two distinct training phases. During the first one, DIMLP has sigmoid
functions in the first hidden layer (and also in the other layers) and the model
is trained with a back-propagation algorithm. During the second training phase,
all weights are frozen and a staircase function is adapted by simulated annealing
to approximate the response of each neuron of the first hidden layer.
4.3 Rule Extraction

From a geometrical point of view, each neuron of the first hidden layer virtually defines a number of parallel hyper-plane frontiers equal to the number of stairs of its staircase function. Let us clarify this situation with the example illustrated in figure 6.

In practice, we show how to transform a DIMLP network into an IMLP network by creating, for each hidden neuron of the former network, as many hidden neurons in the latter as there are stairs. As we can extract
Fig. 6. Transformation of a DIMLP network into an IMLP network. Note that in the IMLP network one more hidden layer is created.

symbolic rules from an IMLP network, we can also extract rules from a DIMLP network. Rule extraction is performed by the same algorithm given in Section 3.2. Nevertheless, this time the covering algorithm is applied to discrete attributes instead of binary ones.
4.4 DIMLP Versus IMLP

The DIMLP architecture is more compact than the IMLP architecture. To make this clear, let us consider an IMLP network with 10 input neurons, 100 neurons in the first hidden layer, 5 neurons in the second hidden layer, and two neurons in the output layer (10-100-5-2). We have thus defined 10 virtual hyper-planes for each input variable. This IMLP network has $100 \cdot 2 + (100+1) \cdot 5 + (5+1) \cdot 2 = 717$ connections.

Now, imagine that we define a 10-10-5-2 DIMLP network with 10 stairs in each staircase function. For each input variable we have again defined 10 virtual hyper-planes. However, the number of connections is smaller: $10 \cdot 2 + (10+1) \cdot 5 + (5+1) \cdot 2 = 87$ connections.
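The two connection counts can be reproduced with a one-line helper (sketch):

```python
def connections(n_h1, n_h2, n_out):
    # 2 incoming connections per first-hidden neuron (one input + one bias);
    # later layers are fully connected, each with a leading bias unit.
    return n_h1 * 2 + (n_h1 + 1) * n_h2 + (n_h2 + 1) * n_out

assert connections(100, 5, 2) == 717   # 10-100-5-2 IMLP
assert connections(10, 5, 2) == 87     # 10-10-5-2 DIMLP with 10 stairs
```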
5 Experiments

5.1 Datasets

The three datasets used for the applications are: 1. Monk-2 [10]; 2. Sonar [11]; 3. Pen Based Handwritten Digits [12]. All datasets are available from the University of California, Irvine repository for Machine Learning (via anonymous ftp from ftp.ics.uci.edu). A summary of these datasets, their representations, and how each dataset is used in the experiments is given below.
Monk-2. The dataset consists of 432 examples belonging to two classes (142
cases for the Monk class and 290 cases for the non-Monk class). Each example
is described by 6 discrete attributes that have been transformed into 17 binary
attributes. The training and testing sets have been fixed and contain 169 and
263 examples, respectively.
The symbolic rules generating the dataset are of the type:

If exactly two of the 6 (discrete) attributes have their first possible value, then the class is Monk. Otherwise, the class is non-Monk.

The positive "compact" rule gives rise to 15 symbolic rules having 6 antecedents.
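The expansion of the compact rule into 15 rules is simply the number of ways of choosing which two of the six attributes take their first value:

```python
from itertools import combinations

# One expanded rule per choice of the two attributes taking their first value.
rules = list(combinations(range(6), 2))
assert len(rules) == 15   # 15 symbolic rules, each with 6 antecedents
```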
Sonar. The dataset contains 208 examples described by 60 continuous attributes
in the range 0.0 to 1.0. Examples belong to the classes cylinder (111) and rock (97).
The training and testing sets have been fixed. Each set contains 104 examples.
Pen Based Handwritten Digits. The dataset contains 10992 examples. Each
example is described by 16 integer attributes in the range 0 to 100. There are 10
classes, each corresponding to one possible digit. The frequency of each class is
between 9% and 11%. The training and testing sets have been fixed and contain
7494 and 3498 examples, respectively.
5.2 Neural Architectures and C4.5 Parameters
Monk-2. DIMLP: 17-17-10-2 (number of stairs in the staircase function: 2); the
training phase is stopped when all training examples are learned. For the MLP
architecture and C4.5 parameters see [10].
Sonar. DIMLP: 60-60-14-2 (number of stairs in the staircase function: 150). MLP: 60-12-2. The training phase of the neural networks is stopped when all training examples are learned. For C4.5, the parameters m and c are set to 1 and 99, respectively⁵.
Pen Based Handwritten Digits. DIMLP: 16-48-40-10 (number of stairs in
the staircase function: 150). MLP: 16-40-10. The training phase of neural networks is stopped when more than 99.6% of training examples are learned. Using
previous C4.5 parameters the average accuracy obtained on training examples
is 99.5%.
5.3 Results of the Monk-2 Problem

Average predictive accuracies are given in table 1. The predictive accuracy obtained by C4.5 is only 64.8% [10]. Thrun reports that the majority of inductive

⁵ With these values the obtained average predictive accuracy is slightly better than that obtained using the default parameters.
Table 1. Average predictive accuracy on 10 trials with fixed training and testing sets.

DIMLP  MLP    C4.5
100%   100%   64.8%
symbolic algorithms give worse predictive accuracies than neural networks on the Monk-2 problem. The average predictive accuracies obtained by algorithms such as ID3, ID5R, AQR, CN2, and Prism are 69.1%, 69.2%, 79.7%, 69.0%, and 72.7%, respectively [10]. The rules extracted from DIMLP for class Monk are exactly those that generate the dataset, whereas the rule set produced by C4.5 is notably larger.
5.4 Results of the Sonar Problem
Average predictive accuracies are shown in table 2. The average number of rules
and the average number of antecedents per rule are given in table 3. The best
average accuracy is given by standard multi-layer perceptrons. Moreover, DIMLP
networks are more accurate than C4.5 decision trees on average. However, fewer
rules and antecedents per rule are generated by C4.5.
Table 2. Average predictive accuracy on 10 trials with fixed training and testing sets.

DIMLP  MLP    C4.5
89.1%  90.5%  76.3%
Table 3. Average number of rules and average number of antecedents per rule on 10 trials with fixed training and testing sets.

       rules  ant./rule
DIMLP  24.7   3.6
C4.5   6.9    2.7

5.5 Results of the Pen Based Handwritten Digits Problem
Table 4 gives the average predictive accuracies. Table 5 shows the average number of rules and the average number of antecedents per rule. As in the Sonar problem, DIMLP networks are more accurate than C4.5 decision trees on average. Again, fewer rules and antecedents per rule are generated by C4.5.
Table 4. Average predictive accuracy on 10 trials with fixed training and testing sets.

DIMLP  MLP    C4.5
96.7%  96.7%  93.2%
Table 5. Average number of rules and average number of antecedents per rule on 10 trials with fixed training and testing sets.

       rules  ant./rule
DIMLP  251.9  6.0
C4.5   115.3  6.2

5.6 Discussion of Results
In the three classification problems DIMLP networks were more accurate than C4.5 decision trees. The reason for this difference could reside in the fact that the inherent classification mechanism of neural networks is parallel, whereas that of decision trees is sequential. Therefore, during the training phase a decision tree may miss rules involving multiple attributes which are weakly predictive separately but become strongly predictive in combination. On the other hand, a neural network may fail to discern a strongly relevant attribute among several irrelevant ones.

As a consequence, we speculate that for these three classification problems, and especially the first and second, the inherent multi-variate search technique of neural networks is better suited than the inherent uni-variate search technique of decision trees.
Let us clarify this conjecture with the Monk-2 dataset. Recall that this classification problem is generated by 15 symbolic rules having 6 antecedents. We would like to estimate the probability of finding one of these rules using a uni-variate search algorithm. Let us define $A_i$ as "find antecedent $a_i$ in a given rule", and $P(A_1, \ldots, A_6)$ as the probability of finding the 6 correct antecedents of a rule. We suppose $P(A_i) < \alpha < 1$, with $\alpha \in \mathbb{R}$. If there is a subset of $k$ antecedents $a_{i_1}, \ldots, a_{i_k}$ for which $A_{i_1}, \ldots, A_{i_k}$ are independent⁶, then $P(A_1, \ldots, A_6) < \alpha^k$. It is worth noting that the larger $k$, the smaller the probability of finding one correct rule. Hence, for a uni-variate search algorithm the probability of finding one correct rule in the Monk-2 problem could be very small if $k$ is close to 6.
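The bound is easy to illustrate numerically; the value of alpha below is an arbitrary illustration, not an estimate from the chapter:

```python
alpha = 0.5   # assumed upper bound on each P(A_i)
bounds = [alpha ** k for k in range(1, 7)]
# The upper bound alpha**k on recovering a full rule shrinks quickly with k.
assert all(b2 < b1 for b1, b2 in zip(bounds, bounds[1:]))
assert bounds[-1] < 0.02   # for k = 6 the bound is already below 2%
```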
6 Conclusion
The use of staircase activation functions in the first hidden layer simplifies the
determination of discriminant hyper-plane expressions. When only one neuron of
⁶ Imagine for instance the black and white classes of a chess board.
the first hidden layer is connected to an input neuron, discriminant hyper-plane
expressions correspond to antecedents of symbolic rules.
On three datasets DIMLP has been shown to come close to the predictive accuracy of standard MLPs, with the advantage of being directly interpretable through symbolic rules. Moreover, on the same datasets DIMLP is more accurate than C4.5, but on the last two applications it also creates more rules and more antecedents per rule.

Finally, DIMLP represents an improvement over IMLP, as it has a more powerful internal representation.
Acknowledgement
This research is funded by a fellowship for young researchers provided by the
Swiss National Science Foundation.
References
1. Andrews R., Diederich J., Tickle A.B.: Survey and Critique of Techniques for Extracting Rules from Trained Artificial Neural Networks. Knowledge-Based Systems,
vol. 8, no. 6, 373–389 (1995).
2. Quinlan J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993).
3. Quinlan J.R.: Comparing Connectionist and Symbolic Learning Methods. In Computational Learning Theory and Natural Learning, R. Rivest (ed.), 445–456 (1994).
4. Breiman L., Friedman J.H., Olshen R.A., Stone C.J.: Classification and Regression Trees. Wadsworth and Brooks, Monterey, California (1984).
5. Maire F.: Rule Extraction by Backpropagation of Polyhedra. Neural Networks (12)
4–5, 717–725 (1999).
6. Bologna G.: Rule Extraction from the IMLP Neural Network: a Comparative
Study. In Proceedings of the Workshop of Rule Extraction from Trained Artificial Neural Networks (after the Neural Information Processing Conference), 13–19
(1996).
7. Rudell R., Sangiovanni-Vincentelli A.: Espresso-MV: Algorithms for Multiple-Valued Logic Minimisation. In Proceedings of the Custom Integrated Circuits Conference (CICC'85), Portland, 230–234 (1985).
8. Corwin E., Logar A., Oldham W.: An Iterative Method for Training Multi-layer
Networks with Threshold Functions. In IEEE Transactions on Neural Networks
Journal, vol 5, no 3, 507–508 (1994).
9. Aarts E.H.L., Laarhoven P.J.M.: Simulated Annealing: Theory and Applications.
Kluwer Academic (1987).
10. Thrun S.B.: The Monk’s Problems: a Performance Comparison of Different Learning Algorithms. Technical Report, Carnegie Mellon University, CMU-CS-91-197
(1991).
11. Gorman R.P., Sejnowski T.J.: Analysis of Hidden Units in a Layered Network
Trained to Classify Sonar Targets. Neural Networks, 1(1) 75–88 (1988).
12. Alimoglu F., Alpaydin E.: Combining Multiple Representations and Classifiers
for Pen-based Handwritten Digit Recognition. In Proceedings of the Fourth International Conference on Document Analysis and Recognition (ICDAR 97), Ulm,
Germany (1997).
Understanding State Space Organization in
Recurrent Neural Networks with Iterative
Function Systems Dynamics
Peter Tiňo¹,², Georg Dorffner¹,³, and Christian Schittenkopf¹

¹ Austrian Research Institute for Artificial Intelligence,
Schottengasse 3, A-1010 Vienna, Austria
{petert, georg, chris}@ai.univie.ac.at
² Department of Computer Science and Engineering, Slovak University of Technology,
Ilkovicova 3, 812 19 Bratislava, Slovakia
³ Dept. of Medical Cybernetics and Artificial Intelligence, University of Vienna,
Freyung 6, A-1010 Vienna, Austria
Abstract. We study a novel recurrent network architecture with dynamics of iterative function systems used in chaos game representations of
DNA sequences [16,11]. We show that such networks code the temporal and statistical structure of input sequences in a strict mathematical
sense: generalized dimensions of network states are in direct correspondence with statistical properties of input sequences expressed via generalized Rényi entropy spectra. We also argue and experimentally illustrate
that the commonly used heuristic of finite state machine extraction by
network state space quantization corresponds in this case to variable
memory length Markov model construction.
1 Introduction

The correspondence between iterative function systems (IFS) [1] and recurrent neural networks (RNNs) has been recognized for some time [13,28]. Because of the non-linear nature of RNN dynamics, a deeper insight into RNN state space structure has been lacking. Also, even though there is strong empirical evidence supporting the usefulness of extracting finite state machines from recurrent networks trained on symbolic sequences [23,7], we do not have a deeper understanding of what the machines actually represent.
We address these issues in the context of a novel recurrent network architecture that we call the iterative function system network¹ (IFSN). The dynamics of IFSNs corresponds to the iterative function systems used in chaos game representations of symbolic sequences [16,11]. Using tools from multifractal theory and statistical
¹ Recently, we discovered that Tabor [28] independently investigated similar types of recurrent networks. However, while we are mainly interested in learning and representational issues in recurrent networks with IFS dynamics, Tabor's view is more general, with an emphasis on metric relations between the network representations of various forms of automata.
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 255–269, 2000.
© Springer-Verlag Berlin Heidelberg 2000
mechanics we establish a rigorous relationship between statistical properties of
sequences driving the network input and scaling behavior of IFSN states. We
also analyse the structure of the IFSN state space and interpret the state space
quantization leading to machine extraction in a Markovian context. We finish
the paper by a detailed study of finite state machine extraction from IFSNs
driven by the chaotic Feigenbaum sequence.
2 Formal Definitions

Consider a finite alphabet $A = \{1, 2, \ldots, A\}$. The sets of all finite² and infinite sequences over $A$ are denoted by $A^+$ and $A^\omega$, respectively. The set of all sequences consisting of a finite or an infinite number of symbols from $A$ is then $A^\infty = A^+ \cup A^\omega$. The set of all sequences over $A$ with exactly $n$ symbols (i.e. of length $n$) is denoted by $A^n$.

Let $S = s_1 s_2 \ldots \in A^\infty$ and $i \le j$. By $S_i^j$ we denote the string $s_i s_{i+1} \ldots s_j$, with $S_i^i = s_i$.
2.1 Geometric Representations of Symbolic Sequence Structure

In this section we describe iterative function systems (IFSs) [1] acting on the $N$-dimensional unit hypercube $X = [0,1]^N$, where³ $N = \lceil \log_2 A \rceil$. To keep the notation simple, we slightly abuse mathematical notation and, depending on the context, regard the symbols $1, 2, \ldots, A$ as integers or as referring to maps on $X$.
The maps $i = 1, 2, \ldots, A$ constituting the IFS are affine contractions

$$i(x) = kx + (1-k)\,t_i, \quad t_i \in \{0,1\}^N, \; t_i \neq t_j \text{ for } i \neq j, \qquad (1)$$

with contraction coefficients $k \in (0, \frac{1}{2}]$.

The attractor of the IFS (1) is the unique set $K \subseteq X$, known as the Sierpinski sponge [12], for which $K = \bigcup_{i=1}^{A} i(K)$ [1].
For a string $u = u_1 u_2 \ldots u_n \in A^n$ and a point $x \in X$, the point

$$u(x) = u_n(u_{n-1}(\ldots(u_2(u_1(x)))\ldots)) = (u_n \circ u_{n-1} \circ \ldots \circ u_2 \circ u_1)(x) \qquad (2)$$

is considered a geometrical representation of the string $u$ under the IFS (1). For a set $Y \subseteq X$, $u(Y)$ is then $\{u(x) \mid x \in Y\}$.
Denote the center $\{\frac{1}{2}\}^N$ of the hypercube $X$ by $x^*$. Given a sequence $S = s_1 s_2 \ldots \in A^\infty$, its (generalized) chaos game representation is formally defined as the sequence of points⁴

$$CGR_k(S) = \big( S_1^i(x^*) \big)_{i \ge 1} \, . \qquad (3)$$

When $k = \frac{1}{2}$ and $A = \{1, 2, 3, 4\}$, we recover the IFS used by Jeffrey and others [11,22,25] to construct the chaos game representation of DNA sequences.
² excluding the empty word
³ for $x \in \mathbb{R}$, $\lceil x \rceil$ is the smallest integer $y$ such that $y \ge x$
⁴ the subscript $k$ in $CGR_k(S)$ identifies the contraction coefficient of the IFS used for the geometric sequence representation
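For the DNA case ($A = 4$, $k = \frac{1}{2}$, $N = 2$) the chaos game representation of eq. (3) can be sketched as follows; the particular assignment of symbols to unit-square corners $t_i$ is an arbitrary choice for illustration:

```python
def cgr(sequence, k=0.5):
    """Chaos game representation (eq. 3) for a 4-symbol alphabet on [0,1]^2."""
    targets = {"1": (0, 0), "2": (0, 1), "3": (1, 0), "4": (1, 1)}
    x = (0.5, 0.5)                    # start at the centre x* of X
    points = []
    for s in sequence:
        t = targets[s]
        # Affine contraction (1): move fraction (1-k) of the way toward t.
        x = tuple(k * xi + (1 - k) * ti for xi, ti in zip(x, t))
        points.append(x)
    return points

pts = cgr("41")
assert pts[0] == (0.75, 0.75)        # half-way from x* toward corner t_4
assert pts[1] == (0.375, 0.375)      # then half-way toward corner t_1
```

Strings sharing a long suffix are mapped to nearby points, which is what ties the geometry of the representation to the n-block statistics discussed next.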
2.2 Statistics on Symbolic Sequences

Let $S = s_1 s_2 \ldots \in A^\infty$ be a sequence generated by a stationary information source. Denote the (empirical) probability of finding an $n$-block $w \in A^n$ in $S$ by $P_n(w)$. A string $w \in A^n$ is said to be an allowed $n$-block in the sequence $S$ if $P_n(w) > 0$. The set of all allowed $n$-blocks in $S$ is denoted by $[S]_n$.
A measure of $n$-block uncertainty (per symbol) in $S$ is given by the entropy rate

$$h_n(S) = -\frac{1}{n} \sum_{w \in [S]_n} P_n(w) \log P_n(w). \qquad (4)$$

If information is measured in bits, then $\log \equiv \log_2$. The limit entropy rate $h(S) = \lim_{n \to \infty} h_n(S)$ quantifies the predictability of an added symbol (independent of block length).
The entropy rates $h_n$ are special cases of the Rényi entropy rates [24]. The $\beta$-order ($\beta \in \mathbb{R}$) Rényi entropy rate

$$h_{\beta,n}(S) = \frac{1}{n(1-\beta)} \log \sum_{w \in [S]_n} P_n^\beta(w) \qquad (5)$$

computed from the $n$-block distribution reduces to the entropy rate $h_n(S)$ when $\beta = 1$ [9]. The formal parameter $\beta$ can be thought of as the inverse temperature in
the statistical mechanics of spin systems [6]. In the infinite temperature regime, $\beta = 0$, the Rényi entropy rate $h_{0,n}(S)$ is just the logarithm of the number of allowed $n$-blocks, divided by $n$. The limit $h^{(0)}(S) = \lim_{n \to \infty} h_{0,n}(S)$ gives the asymptotic exponential growth rate of the number of allowed $n$-blocks as the block length increases.

The entropy rates $h(S) = h^{(1)}(S) = \lim_{n \to \infty} h_{1,n}(S)$ and $h^{(0)}(S)$ are also known as the metric and topological entropies, respectively.

Varying the parameter $\beta$ amounts to scanning the original $n$-block distribution $P_n$: the most probable and the least probable $n$-blocks become dominant in the positive zero ($\beta = \infty$) and the negative zero ($\beta = -\infty$) temperature regimes, respectively. Varying $\beta$ from 0 to $\infty$ shifts the emphasis from all allowed $n$-blocks to the most probable ones, by accentuating ever more probable subsequences. Varying $\beta$ from 0 to $-\infty$ accentuates less and less probable $n$-blocks, down to the extreme of the least probable ones.
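A direct implementation of the Rényi entropy rate estimates (eqs. (4) and (5)) from empirical n-block counts, using base-2 logarithms, can be sketched as:

```python
import math
from collections import Counter

def renyi_entropy_rate(sequence, n, beta):
    """n-block Rényi entropy rate h_{beta,n} (eq. 5), in bits per symbol;
    beta = 1 falls back to the Shannon entropy rate h_n (eq. 4)."""
    counts = Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    if beta == 1:
        return -sum(p * math.log2(p) for p in probs) / n
    return math.log2(sum(p ** beta for p in probs)) / (n * (1 - beta))

# For a sequence with uniform 1-block frequencies over two symbols,
# every 1-block rate equals 1 bit per symbol, independently of beta.
s = "12" * 500
assert abs(renyi_entropy_rate(s, 1, 0) - 1.0) < 1e-9
assert abs(renyi_entropy_rate(s, 1, 1) - 1.0) < 1e-9
assert abs(renyi_entropy_rate(s, 1, 2) - 1.0) < 1e-9
```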
2.3 Scaling Behavior on Multifractals

Loosely speaking, a multifractal is a fractal set supporting a probability measure [2]. The degree of fragmentation of the fractal support $M$ is usually quantified through its fractal dimension $D(M)$ [1]. Denote by $N(\ell)$ the minimal number of hyperboxes of side length $\ell$ needed to cover $M$. The fractal (box-counting) dimension $D(M)$ relates the side length $\ell$ to $N(\ell)$ via the scaling law $N(\ell) \approx \ell^{-D(M)}$.
258
P. Tiňo, G. Dorffner, and C. Schittenkopf
For 0 < k ≤ 1/2, the n-th order approximation D_{n,k}(M) of the fractal dimension D(M) is given by the box-counting technique with boxes of side ℓ = k^n:

N(k^n) = (k^n)^{-D_{n,k}(M)}.
Just as the Rényi entropy spectra describe (non-homogeneous) statistics on
symbolic sequences, generalized Rényi dimensions Dβ capture multifractal probabilistic measures µ [15]. Generalized dimensions Dβ (M ) of an object M describe a measure µ on M through the scaling law
\sum_{B \in B_\ell,\, \mu(B) > 0} \mu^{\beta}(B) \approx \ell^{(\beta-1) D_\beta(M)}, \qquad (6)
where B_ℓ is a minimal set of hyperboxes with sides of length ℓ disjointly⁵ covering M.
In particular, for 0 < k ≤ 1/2, the n-th order approximation D_{β,n,k}(M) of D_β(M) is given by

\sum_{B \in B_\ell,\, \mu(B) > 0} \mu^{\beta}(B) = \ell^{(\beta-1) D_{\beta,n,k}(M)}, \qquad (7)

where ℓ = k^n.
The infinite temperature scaling exponent D0 (M ) is equal to the box-counting
fractal dimension D(M ) of M . Dimensions D1 and D2 are respectively known
as the information and correlation dimensions [2]. Of special importance are the
limit dimensions D∞ and D−∞ describing the scaling behavior of regions where
the probability is most concentrated and rarefied respectively.
2.4 Chaos Game Representation of Single Sequences
In [16] we established a relationship between the Rényi entropy spectra of a
sequence S ∈ A∞ and the generalized dimension spectra of its chaos game
representations.
Theorem 1 [16]: For any sequence S ∈ A^∞, and any n = 1, 2, ..., the n-th order approximations of the generalized dimensions of its chaos game representations are equal (up to a scaling constant log k^{-1}) to the sequence n-block Rényi entropy rate estimates:

D_{\beta,n,k}(CGR_{n,k}(S)) = \frac{h_{\beta,n}(S)}{\log\frac{1}{k}}, \qquad (8)

where CGR_{n,k}(S) is the sequence CGR_k(S) without the first n − 1 points. Furthermore, for each S ∈ A^ω,

D_{\beta,n,k}(CGR_k(S)) = \frac{h_{\beta,n}(S)}{\log\frac{1}{k}}. \qquad (9)

⁵ At most up to Lebesgue-measure-zero borders.
Hence, for infinite sequences S ∈ A^ω, when k = 1/2, the generalized dimension estimates of geometric chaos game representations exactly equal the corresponding sequence Rényi entropy rate estimates. In particular, given an infinite sequence S ∈ A^ω, as n grows, the box-counting fractal dimension and information dimension estimates D_{0,n,1/2} and D_{1,n,1/2} of the original Jeffrey chaos game representation [11,22,25] tend to the sequence topological and metric entropies, respectively.
Another nice property of the chaos game representation CGR_k is that it codes the suffix structure of allowed subsequences in the distribution of subsequence geometric representations (2) [16]. In particular, if v ∈ A^+ is a suffix of length |v| of a string u = rv, r, u ∈ A^+, then u(X) ⊂ v(X), where v(X) is an N-dimensional hypercube of side length k^{|v|}. Hence, the longer the common suffix v shared by two subsequences rv and rvqv of a sequence S = rvqvw, r, q, v ∈ A^+, w ∈ A^∞, the closer the corresponding points rv(x^*) and rvqv(x^*) lie in the chaos game representation of S:

d_E(rv(x^*), rvqv(x^*)) \le k^{|v|} \sqrt{N}. \qquad (10)
Here dE denotes the Euclidean distance.
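The suffix property (10) can be illustrated with a minimal chaos game sketch. The exact IFS (1) is defined earlier in the paper; here we assume the common one-dimensional form x ↦ kx + (1 − k)t_i with targets t_1 = 0, t_2 = 1 and contraction ratio k = 1/2 (an assumption consistent with a binary-alphabet CGR, not a quote of the authors' code):

```python
def cgr(seq, k=0.5, targets={"1": 0.0, "2": 1.0}):
    """Map each prefix of seq to a point in [0, 1]; return the point list.
    Assumed map form: x -> k*x + (1 - k)*t_i, started from x* = 0.5."""
    x, points = 0.5, []
    for sym in seq:
        x = k * x + (1 - k) * targets[sym]
        points.append(x)
    return points

# Eq. (10): representations of histories sharing a suffix v of length |v|
# lie within k^|v| * sqrt(N) of each other (N = 1 here).
a = cgr("12" + "2122")[-1]           # history r v with v = 2122
b = cgr("2212122122" + "2122")[-1]   # history r v q v, same suffix v
assert abs(a - b) <= 0.5 ** 4        # k^|v| with |v| = 4
```

The bound holds because processing the shared suffix applies the same |v| contractions of ratio k to both states, shrinking any initial distance by k^{|v|}.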
3 Recurrent Neural Network
The recurrent neural network (RNN) with adjustable recurrent weights shown in figure 1 has been shown to learn mappings describable by finite state machines [20], and to produce symbolic sequences closely approximating (with respect to information-theoretic entropy and cross-entropy measures) the statistical structure of long chaotic training sequences [21,19].
The network has an input layer I^{(t)} = (I_1^{(t)}, ..., I_A^{(t)}) with A neurons (to which the one-of-A codes of input symbols from the alphabet A = {1, ..., A} are presented, one at a time), a hidden non-recurrent layer H^{(t)}, a hidden recurrent layer R^{(t+1)} (the RNN state space), and an output layer O^{(t)} with the same number of neurons as the input layer. Activations in the recurrent layer are copied with a unit time delay to the context layer R^{(t)}, which forms an additional input.

Using second-order hidden units, at each time step t, the input I^{(t)} and the context R^{(t)} determine the output O^{(t)} and the future context R^{(t+1)} by

H_i^{(t)} = g\Big( \sum_{j,k} Q_{i,j,k}\, I_j^{(t)} R_k^{(t)} + T_i^H \Big), \qquad (11)

O_i^{(t)} = g\Big( \sum_j V_{i,j}\, H_j^{(t)} + T_i^O \Big), \qquad (12)
Fig. 1. Recurrent neural network (RNN) architecture. When the recurrent weights W and thresholds T^R are fixed prior to the training process and the activation functions in the recurrent layer R^{(t+1)} are linear, so that the recurrent part [I^{(t)} + R^{(t)} → R^{(t+1)}] of the network implements the IFS (1), the architecture is referred to as an iterative function system network (IFSN).
R_i^{(t+1)} = g\Big( \sum_{j,k} W_{i,j,k}\, I_j^{(t)} R_k^{(t)} + T_i^R \Big). \qquad (13)
Here, g is the standard logistic sigmoid function; W_{i,j,k}, Q_{i,j,k} are second-order real-valued weights and T_i^R, T_i^H the corresponding thresholds. V_{i,j} and T_i^O are the weights and thresholds, respectively, associated with the hidden-to-output layer connections.
When first-order hidden units are used, eqs. (11) and (13) change to

H_i^{(t)} = g\Big( \sum_j Q_{i,j}^I\, I_j^{(t)} + \sum_k Q_{i,k}^R\, R_k^{(t)} + T_i^H \Big) \qquad (14)

and

R_i^{(t+1)} = g\Big( \sum_j W_{i,j}^I\, I_j^{(t)} + \sum_k W_{i,k}^R\, R_k^{(t)} + T_i^R \Big), \qquad (15)

respectively.
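One forward step of the first-order network (eqs. (14), (15) together with the output eq. (12)) can be sketched as follows. The parameter values below are arbitrary random placeholders, not trained weights, and the layer sizes are chosen only for illustration:

```python
import numpy as np

def rnn_step(I, R, params):
    """One forward step of the first-order RNN: eqs. (14), (12), (15).
    I: one-of-A input code; R: current context (state) vector."""
    g = lambda x: 1.0 / (1.0 + np.exp(-x))                         # logistic sigmoid
    H = g(params["QI"] @ I + params["QR"] @ R + params["TH"])      # eq. (14)
    O = g(params["V"] @ H + params["TO"])                          # eq. (12)
    R_next = g(params["WI"] @ I + params["WR"] @ R + params["TR"]) # eq. (15)
    return O, R_next

rng = np.random.default_rng(0)
A, nH, nR = 2, 3, 4        # alphabet size, hidden and recurrent layer sizes (arbitrary)
params = {"QI": rng.normal(size=(nH, A)), "QR": rng.normal(size=(nH, nR)),
          "TH": rng.normal(size=nH), "V": rng.normal(size=(A, nH)),
          "TO": rng.normal(size=A), "WI": rng.normal(size=(nR, A)),
          "WR": rng.normal(size=(nR, nR)), "TR": rng.normal(size=nR)}
I = np.eye(A)[0]                            # one-of-A code of symbol 1
O, R = rnn_step(I, np.full(nR, 0.5), params)
assert O.shape == (A,) and np.all((O > 0) & (O < 1))
```

Because g maps into (0, 1), both the output and the next context stay in the unit hypercube, which is what makes the state-space (CGR-style) analysis of later sections possible.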
3.1 Previous Work
We briefly describe our previous experiments [21,19] with the RNN introduced
above. The network was trained (via Real Time Recurrent Learning [10]) on
single long chaotic symbolic sequences S = s1 s2 ... to predict, at each point in
time, the next symbol. To start the training, the initial network state R(1) was
randomly generated and the network was reset with R(1) at the beginning of
each training epoch. After the training, the network was seeded with the initial
state R^{(1)} and the code of the first symbol s_1 in S. Then the network acted as a stochastic source. We transformed the RNN output O^{(t)} = (O_1^{(t)}, ..., O_A^{(t)}) into a probability distribution over the symbol ŝ_{t+1} that will appear at the net input at the next time step:

Prob(ŝ_{t+1} = i) = \frac{O_i^{(t)}}{\sum_{j=1}^{A} O_j^{(t)}}, \quad i = 1, 2, ..., A. \qquad (16)
We observed that trained RNNs produced sequences closely mimicking (in the
information theoretic sense) the training sequence.
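The normalization of eq. (16) and the autonomous sampling step are simple to sketch; the helper names below are our own illustration, not the authors' implementation:

```python
import random

def output_to_distribution(O):
    """Eq. (16): normalize output activations into next-symbol probabilities."""
    total = sum(O)
    return [o / total for o in O]

def sample_symbol(O, rng=random.Random(1)):
    """Draw the next input symbol (1..A) from the distribution of eq. (16)."""
    p = output_to_distribution(O)
    return rng.choices(range(1, len(O) + 1), weights=p, k=1)[0]

probs = output_to_distribution([0.9, 0.3])
assert abs(sum(probs) - 1.0) < 1e-12 and abs(probs[0] - 0.75) < 1e-12
```

Feeding each sampled symbol back as the next input turns the trained network into the stochastic source described above.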
Next we extracted from trained RNNs stochastic finite state machines by
identifying clusters in recurrent neurons’ activation space (RNN state space)
with states of the extracted machines. The extracted machines provide a compact
and easy-to-analyse symbolic representation of the knowledge induced in RNNs
during the training. Finite machine extraction from RNNs is a commonly used
heuristic especially among recurrent network researchers applying their models
in grammatical inference tasks [23]. The extracted machines often outperform the
original networks on longer test strings. For an analysis of this phenomenon see
[4,18]. In our experiments we found that these issues carry over to the problem of training RNNs on single long chaotic sequences. With a sufficient number of states, the extracted stochastic machines do indeed replicate the entropy and cross-entropy performance of their mother RNNs.
We considered two principal ways of machine extraction. In the test mode extraction we let the RNN generate a sequence in an autonomous mode and code the transitions between quantized network states driven by the generated symbols as stochastic⁶ machines M_RNN. In the training sequence driven construction we drive the RNN with the training sequence S and code the transitions between quantized RNN states on symbols in S as stochastic machines M_RNN(S). We found that the machines M_RNN(S) achieved considerably better modeling performance than their mother RNNs [19].
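The training-sequence-driven construction can be sketched as follows. Note the assumptions: states are one-dimensional, and a simple grid quantizer stands in for the clustering (dynamic cell structures) actually used in the paper; function and variable names are our own:

```python
from collections import defaultdict

def extract_machine(states, symbols, n_bins=8):
    """Quantize each (1-D, in [0,1]) network state into one of n_bins intervals,
    record transition counts (state, symbol) -> next state, and normalize the
    counts into empirical transition probabilities."""
    q = [min(int(x * n_bins), n_bins - 1) for x in states]
    counts = defaultdict(lambda: defaultdict(int))
    for t in range(len(symbols) - 1):
        counts[(q[t], symbols[t])][q[t + 1]] += 1
    machine = {}
    for key, nxt in counts.items():
        total = sum(nxt.values())
        machine[key] = {s: c / total for s, c in nxt.items()}
    return machine

states = [0.5, 0.25, 0.625, 0.8125, 0.40625]   # e.g. a short CGR-style orbit
machine = extract_machine(states, "12211")
assert all(abs(sum(d.values()) - 1.0) < 1e-12 for d in machine.values())
```

Each quantization cell becomes a machine state, and the empirical counts give the stochastic transition structure described in footnote 6.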
3.2 Recurrent Neural Network with IFS Dynamics
We propose an alternative RNN architecture, which we call the iterative function system network (IFSN). IFSNs share the architecture of the RNNs described above (see figure 1), except that the recurrent neurons' activation function is linear and the weights W (W^I, W^R) and thresholds T^R are fixed, so that the network dynamics in eq. (13) (eq. (15)) is given by (1): given the current state R^{(t)}, i(R^{(t)}) is the next state R^{(t+1)}, provided the input symbol at time t is i. Such dynamics is equivalent to the dynamics of the IFS (1) driven by the
⁶ With each state transition in the machine we associate an empirical probability, obtained by counting how often that transition was invoked during the extraction process.
symbolic sequence appearing at the network input. The trainable parts of IFSNs
form a feed-forward architecture [I (t) + R(t) → H (t) → O(t) ].
In a recent experimental study [17] on two long chaotic sequences with different degrees of "complexity", as measured by Crutchfield's ε-machines [5], we found (interestingly enough) that even though IFSNs have fixed, non-trainable recurrent parts, they achieved performance comparable to that of RNNs with adjustable recurrent weights and thresholds. Moreover, the extracted machines M_IFSN(S) actually outperformed the corresponding machines M_RNN(S).
4 Understanding State Space Organization in IFSNs
Although previous approaches to the analysis of RNN state space organization did point out the correspondence between IFSs and the RNN recurrent part [I^{(t)} + R^{(t)} → R^{(t+1)}] [14,13], due to the nonlinearity of the recurrent neurons' activation function they did not provide a deeper insight into the RNN state space structure (apart from observing apparently fractal clusters corresponding to the nonlinear IFS attractor). Also, despite strong empirical evidence supporting the usefulness of extracting symbolic finite state representations from trained recurrent networks [23,7], a deeper understanding of what the machines actually represent is still lacking.
The results summarized in section 2.4 enable us to formulate the principles behind the coding, in the recurrent part of IFSNs, of the temporal and statistical structure of symbolic sequences appearing at the network input. Theorem 1 tells us that the fractal dimension of the states of an IFSN driven by a sequence S directly corresponds to the allowed subsequence variability in S, expressed through the topological entropy of S. An analogous relationship holds between the information dimension of IFSN states and the metric entropy of S. Generally, multifractal characteristics of IFSN states, measured via spectra of generalized dimensions, directly correspond to the Rényi entropy spectra of the input sequence S.
The input sequence S feeding the IFSN is translated in the network recurrent part into the chaos game representation CGR_k(S). The CGR_k(S) forms clusters of state neuron activations, where (as explained in section 2.4) points lying in a close neighborhood code histories with a long common suffix (i.e., histories that are likely to produce similar continuations), whereas histories with different suffixes (and potentially different continuations) are mapped to activations lying far from each other. When quantizing the IFSN state space to extract the network's finite state representation, densely populated areas (corresponding to contexts with long common suffixes) are given more attention by the vector quantizer. Consequently, more information-processing states of the extracted machines are devoted to these potentially "problematic" contexts. This directly corresponds to the idea of variable memory length Markov models [26,27], where the length of the past history considered in order to predict the future is not fixed, but context dependent.
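The variable-memory idea can be sketched with a toy context table: count next-symbol statistics for every suffix context up to a maximum depth, then predict from the longest stored suffix of the current history. This is an illustrative simplification of VLMMs (the real construction [26,27] also prunes uninformative contexts); all names below are our own:

```python
from collections import Counter, defaultdict

def context_counts(s, max_depth):
    """Empirical next-symbol counts for every suffix context up to max_depth."""
    table = defaultdict(Counter)
    for t in range(len(s) - 1):
        for d in range(1, min(max_depth, t + 1) + 1):
            table[s[t - d + 1:t + 1]][s[t + 1]] += 1
    return table

def predict(table, history):
    """Use the longest stored suffix of the history as the prediction context."""
    for d in range(len(history), 0, -1):
        ctx = history[-d:]
        if ctx in table:
            c = table[ctx]
            total = sum(c.values())
            return ctx, {sym: n / total for sym, n in c.items()}
    return "", {}

table = context_counts("2122212121222122" * 20, max_depth=6)
ctx, dist = predict(table, "2212221")
assert abs(sum(dist.values()) - 1.0) < 1e-12
```

Contexts that occur often get reliable statistics at greater depth, mirroring the way dense CGR regions receive more quantizer states.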
5 Extracting Finite State Stochastic Machines from IFSNs
We report an experiment with a well-known chaotic sequence, the Feigenbaum sequence [8]. The Feigenbaum sequence is a binary sequence generated by the logistic map y_{t+1} = r y_t (1 − y_t), y ∈ [0, 1], with the control parameter r set to the period-doubling accumulation point value⁷ [15]. The iterands y_t are partitioned into two regions⁸, [0, 1/2) and [1/2, 1], corresponding to symbols 1 and 2, respectively. The training sequence S used in this study contained 260,000 symbols.
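The generation procedure can be sketched directly from this definition. The value of r below only approximates the accumulation point, and finite floating-point precision makes long orbits only approximately faithful, so treat this as illustrative rather than a reproduction of the paper's training data:

```python
def feigenbaum_sequence(length, r=3.5699456718, y0=0.3, transient=1000):
    """Generate a symbolic sequence from the logistic map y <- r*y*(1 - y),
    emitting '1' for y in [0, 1/2) and '2' for y in [1/2, 1].  r approximates
    the period-doubling accumulation point."""
    y = y0
    for _ in range(transient):          # discard the transient
        y = r * y * (1.0 - y)
    out = []
    for _ in range(length):
        out.append("1" if y < 0.5 else "2")
        y = r * y * (1.0 - y)
    return "".join(out)

s = feigenbaum_sequence(1000)
assert set(s) <= {"1", "2"} and len(s) == 1000
```

The partition at the critical point 1/2 is what makes the symbolic sequence a faithful (generating-partition) encoding of the dynamics.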
The topological structure of the sequence (i.e. the structure of allowed n-blocks, disregarding their probabilities) can only be described using a context-sensitive tool: a restricted indexed context-free grammar [6]. The metric structure of the Feigenbaum sequence is organized in a self-similar fashion [8]. The transition between the ranked distributions for block lengths 2^g → 2^{g+1}, 3·2^{g−1} → 3·2^g, g ≥ 1, is achieved by rescaling the horizontal and vertical axes by factors of 2 and 1/2, respectively. Plots of the Feigenbaum sequence n-block distributions, n = 1, 2, ..., 8, can be seen in figure 2. Numbers above the plots indicate the corresponding block lengths. The arrows connect distributions with the (2, 1/2)-scaling self-similarity relationship.
Fig. 2. Plots of self-similar rank-ordered block distributions of the Feigenbaum sequence for different block lengths (indicated by the numbers above the plots). The self-similarity relates block distributions for block lengths 2^g → 2^{g+1}, 3·2^{g−1} → 3·2^g, g ≥ 1 (connected by arrows).
We chose to work with the Feigenbaum sequence because Markovian predictive models of this sequence need deep prediction contexts. Classical fixed-order Markov models (MMs) cannot succeed, and the power of admitting a limited number of variable-length contexts can be fully exploited.
First, we built a series of variable memory length Markov models (VLMMs)
of growing size. For construction details see [26,27]. Then, we quantized the
⁷ r = 3.56994567...
⁸ This partition is a generating partition, defined by the critical point.
one-dimensional⁹ IFSN state space¹⁰ using the dynamic cell structures (DCS) technique [3]. This way we obtained a series of extracted machines M_IFSN(S) with an increasing number of machine states.

Each model was used to generate 10 sequences G of length equal to the length of the training sequence. Since the Feigenbaum sequence n-block distributions have just one or two probability levels, we measure the disproportions between the Feigenbaum and model-generated distributions through the L1 distances

d_n(S, G) = \sum_{w \in A^n} |P_{S,n}(w) - P_{G,n}(w)|,

where P_{S,n} and P_{G,n} are the empirical n-block frequencies in the training and model-generated sequences, respectively.

The modeling horizon n(M) of a model M is the longest block length such that, for all 10 sequence generation realizations and for all block lengths n ≤ n(M), d_n(S, G) is below a small threshold ∆. We set ∆ = 0.005, since in this experiment either d_n(S, G) ∈ (0, 0.005] or d_n(S, G) ≫ 0.005.
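The L1 block distance is straightforward to compute from empirical block frequencies; this is an illustrative implementation (names our own):

```python
from collections import Counter

def block_freqs(s, n):
    """Empirical n-block frequencies of a symbolic sequence."""
    c = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    total = sum(c.values())
    return {w: v / total for w, v in c.items()}

def l1_block_distance(s, g, n):
    """d_n(S, G): L1 distance between the empirical n-block distributions."""
    ps, pg = block_freqs(s, n), block_freqs(g, n)
    return sum(abs(ps.get(w, 0.0) - pg.get(w, 0.0)) for w in set(ps) | set(pg))

assert l1_block_distance("121212", "121212", 2) == 0.0
assert abs(l1_block_distance("1111", "2222", 1) - 2.0) < 1e-12
```

The distance ranges from 0 (identical distributions) to 2 (disjoint supports), which is why a threshold as small as ∆ = 0.005 cleanly separates success from failure here.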
Figure 3 illustrates the growing ability of VLMMs and machines M_IFSN(S) to model the metric structure of allowed blocks in the Feigenbaum sequence S.

The classical MM (stars) fails completely in this experiment, since the context length 5 is far too small to enable the MM to mimic the complicated subsequence structure in S. On the other hand, VLMMs (squares) and machines M_IFSN(S)
Fig. 3. Modeling horizons n(M) of models M built on the Feigenbaum sequence, as a function of the number of machine states in M.
⁹ The chaos game representation of binary sequences is defined by a one-dimensional IFS (1). In this experiment we set the contraction ratio k of the maps in the IFS to 0.5.
¹⁰ Note that since the training sequence driven extraction of machines M_IFSN(S) uses only the recurrent part of the IFSN (which is fixed prior to training), no network training is needed and the finite state representations M_IFSN(S) can be readily constructed.
(triangles) quickly learn to explore a limited number of deep prediction contexts
and perform comparatively well.
The jumps in the modeling horizon graph of machines M_IFSN(S) in figure 3 can be understood through their state transition diagrams.

While the machine M_4 in figure 4a can model only blocks of lengths 1, 2, and 3, the introduction of an additional transition state in the machine M_5, shown in figure 4b, enables the latter machine to model blocks of length up to 6.
Only three consecutive 2’s are allowed in the Feigenbaum training sequence
S. The loop on symbol 2 in the state 1 of the machine M4 is capable of producing
blocks of consecutive 2’s of any length. So, the n-block distribution, n ≥ 4, cannot
be properly modeled by the machine M4 . The state 1 in the machine M4 is split
into two machine M5 states 1.a and 1.b. Any number of 4-blocks 2212 can be
followed by any number of 2-blocks 12 and vice versa. This is fine as long as we
study the structure of the 6-block distribution.
Moving to higher block lengths, we find that once the 4-block 2212 is followed
by the 2-block 12, another copy of the 2-block 12 followed by the 4-block 2212
must appear. This 12-block rule is implemented by the machine M8 in figure 5b.
The machine M_8 is created from the machine M_7 in figure 5a by splitting the state 3.a into two states 3.a and 3.c. The machine M_7 with 7 states is equivalent to the machine M_5 (figure 4b) with 5 states: states 2.a, 2.b and 3.a, 3.b in M_7 are equivalent to states 2 and 3, respectively, in M_5.
State splitting responsible for the third jump in the modeling horizon graph
between the extracted machines M22 and M23 with 22 and 23 states, respectively,
is illustrated in figure 6. Symbols A and B stand for the 4-blocks 1212 and 2212,
respectively. The machine M22 is equivalent to the machine M8 . State splitting
in the middle left branch of the machine M22 removes the two lower cycles BAB,
B, and creates a single larger cycle BBAB in the machine M23 . This machine correctly implements the training sequence block distributions for blocks of length
up to 24.
Fig. 4. State transition diagrams of the machines M_IFSN(S). The machines M_4 (a) and M_5 (b) were obtained by quantizing the IFSN state space via dynamic cell structures with 4 and 5 centers, respectively. State transitions are labeled only with the corresponding symbols, since the transition probabilities are uniformly distributed, i.e. for all states i, the probability associated with each arc leaving i is equal to 1/N_i, where N_i is the number of arcs leaving state i.
Fig. 5. Machines M_7 and M_8 extracted from the IFSN driven by the Feigenbaum sequence S. The network state space is quantized into 7 (a) and 8 (b) compartments, respectively. Construction details are as described in the caption of the previous figure.
Fig. 6. Schematic representation of state transition structure in machines MIF SN (S) .
Symbols A and B stand for the 4-blocks 1212 and 2212, respectively. The machine M22 ,
obtained from a codebook with 22 centers, is equivalent to the machine M8 (see also the
previous figure). State splitting in the middle left branch of the machine M22 (dashed
line) removes the two lower cycles BAB, B, and creates a single larger cycle BBAB in the
machine M23 .
Variable memory length Markov models implement the same subsequence constraints as the machines M_IFSN(S). Figures 7a and 7b present the VLMMs N_5 and N_11 with 5 and 11 prediction contexts, respectively. The VLMMs are shown as probabilistic suffix automata with states labeled by the corresponding suffixes. The VLMM N_5 is isomorphic to the machine M_5 in figure 4b, and the VLMM N_11 is equivalent to the machine M_8 in figure 5b. Although not shown here, the VLMM with 23 prediction contexts is isomorphic to the machine M_23 schematically presented in figure 6.

Fig. 7. VLMMs N_5 (a) and N_11 (b) built on the Feigenbaum sequence. The VLMMs are shown as probabilistic suffix automata with states labeled by the corresponding variable-length prediction contexts. As with the machines M_IFSN(S) in this experiment, the state transition probabilities are uniformly distributed.

6 Conclusion

We introduced a novel recurrent network architecture, which we call the iterative function system network (IFSN), with dynamics corresponding to the iterative function systems used in chaos game representations of symbolic sequences [16,11].

In our previous work on modeling long chaotic sequences we empirically compared recurrent networks having adjustable recurrent weights and non-linear sigmoid activations in the recurrent layer with IFSNs, and showed that introducing fixed IFS dynamics into recurrent networks does not degrade network performance. Even more surprisingly, we found that finite state stochastic machines extracted from IFSNs outperform machines extracted from "fully adjustable" RNNs.
In this contribution we formally studied state space organization in IFSNs. It appears that IFSNs reflect in their states the temporal and statistical structure of input sequences in a strict mathematical sense: the generalized dimensions of IFSN states are in direct correspondence with statistical properties of input sequences expressed via generalized Rényi entropy spectra.

We also argued, and experimentally illustrated, that the commonly used heuristic of finite state machine extraction from RNNs by network state space quantization corresponds, in the case of IFSNs, to variable memory length Markov model construction.
Acknowledgements
This work was supported by the Austrian Science Fund (FWF) within the research project “Adaptive Information Systems and Modeling in Economics and
Management Science” (SFB 010). The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry of Science and
Transport.
References
1. M.F. Barnsley. Fractals everywhere. Academic Press, New York, 1988.
2. C. Beck and F. Schlögl. Thermodynamics of chaotic systems. Cambridge University
Press, Cambridge, UK, 1995.
3. J. Bruske and G. Sommer. Dynamic cell structure learns perfectly topology preserving map. Neural Computation, 7(4):845–865, 1995.
4. M.P. Casey. The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation,
8(6):1135–1178, 1996.
5. J.P. Crutchfield and K. Young. Inferring statistical complexity. Physical Review
Letters, 63:105–108, July 1989.
6. J.P. Crutchfield and K. Young. Computation at the onset of chaos. In W.H.
Zurek, editor, Complexity, Entropy, and the physics of Information, SFI Studies
in the Sciences of Complexity, vol 8, pages 223–269. Addison-Wesley, Reading,
Massachusetts, 1990.
7. P. Frasconi, M. Gori, M. Maggini, and G. Soda. Insertion of finite state automata
in recurrent radial basis function networks. Machine Learning, 23:5–32, 1996.
8. J. Freund, W. Ebeling, and K. Rateitschak. Self-similar sequences and universal
scaling of dynamical entropies. Physical Review E, 54(5):5561–5566, 1996.
9. P. Grassberger. Information and complexity measures in dynamical systems. In
H. Atmanspacher and H. Scheingraber, editors, Information Dynamics, pages 15–
33. Plenum Press, New York, 1991.
10. J. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Addison–Wesley, Redwood City, CA, 1991.
11. J. Jeffrey. Chaos game representation of gene structure. Nucleic Acids Research,
18(8):2163–2170, 1990.
12. R. Kenyon and Y. Peres. Measures of full dimension on affine invariant sets. Ergodic
Theory and Dynamical Systems, 16:307–323, 1996.
13. J.F. Kolen. Recurrent networks: state machines or iterated function systems?
In M.C. Mozer, P. Smolensky, D.S. Touretzky, J.L. Elman, and A.S. Weigend,
editors, Proceedings of the 1993 Connectionist Models Summer School, pages 203–
210. Erlbaum Associates, Hillsdale, NJ, 1994.
14. P. Manolios and R. Fanelli. First order recurrent neural networks and deterministic
finite state automata. Neural Computation, 6(6):1155–1173, 1994.
15. J.L. McCauley. Chaos, Dynamics and Fractals: an algorithmic approach to deterministic chaos. Cambridge University Press, 1994.
16. P. Tiňo. Spatial representation of symbolic sequences through iterative function
system. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems
and Humans, 29(4):386–393, 1999.
17. P. Tiňo and G. Dorffner. Recurrent neural networks with iterated function systems
dynamics. In International ICSC/IFAC Symposium on Neural Computation, pages
526–532, 1998.
18. P. Tiňo, B.G. Horne, C.L. Giles, and P.C. Collingwood. Finite state machines and
recurrent neural networks – automata and dynamical systems approaches. In J.E.
Dayhoff and O. Omidvar, editors, Neural Networks and Pattern Recognition, pages
171–220. Academic Press, 1998.
19. P. Tiňo and M. Koteles. Extracting finite state representations from recurrent
neural networks trained on chaotic symbolic sequences. IEEE Transactions on
Neural Networks, 10(2):284–302, 1999.
20. P. Tiňo and J. Sajda. Learning and extracting initial mealy machines with a
modular neural network model. Neural Computation, 7(4):822–844, 1995.
21. P. Tiňo and V. Vojtek. Modeling complex sequences with recurrent neural networks. In G.D. Smith, N.C. Steele, and R.F. Albrecht, editors, Artificial Neural
Networks and Genetic Algorithms, pages 459–463. Springer Verlag Wien New York,
1998.
22. J.L. Oliver, P. Bernaola-Galván, J. Guerrero-García, and R. Román-Roldán. Entropic profiles of DNA sequences through chaos-game-derived images. Journal of Theoretical Biology, (160):457–470, 1993.
23. C.W. Omlin and C.L. Giles. Extraction of rules from discrete-time recurrent neural
networks. Neural Networks, 9(1):41–51, 1996.
24. A. Rényi. On the dimension and entropy of probability distributions. Acta Math.
Hung., (10):193, 1959.
25. R. Roman-Roldan, P. Bernaola-Galvan, and J.L. Oliver. Entropic feature for sequence pattern through iteration function systems. Pattern Recognition Letters,
15:567–573, 1994.
26. D. Ron, Y. Singer, and N. Tishby. The power of amnesia. In Advances in Neural
Information Processing Systems 6, pages 176–183. Morgan Kaufmann, 1994.
27. D. Ron, Y. Singer, and N. Tishby. The power of amnesia. Machine Learning,
25:117–150, 1996.
28. W. Tabor. Dynamical automata. Technical Report TR98-1694, Cornell University,
Computer Science Department, 1998.
Direct Explanations and Knowledge Extraction
from a Multilayer Perceptron Network
that Performs Low Back Pain Classification
Marilyn L. Vaughn¹, Steven J. Cavill¹, Stewart J. Taylor², Michael A. Foy², and Anthony J.B. Fogg²

¹ Knowledge Engineering Research Centre, Cranfield University (RMCS), Shrivenham, Swindon SN6 8LA, UK
M.L.Vaughn@rmcs.cranfield.ac.uk
² Princess Margaret Hospital, Okus Road, Swindon SN1 4JU, UK
Abstract. Using a new method published by the first author, this chapter shows
how knowledge in the form of a ranked data relationship and an induced rule
can be directly extracted from each training case for a Multi-layer Perceptron
(MLP) network with binary inputs. The knowledge extracted from all training
cases can be used to validate the MLP network and the ranked data relationship
for any input case provides direct user explanations. The method is
demonstrated for example training cases from a real-world MLP that classifies
low back pain patients into three diagnostic classes. In using the method to
validate the network a number of test cases apparently mis-classified by the
network were found to have most likely been incorrectly classified by the
clinicians. The method uses a direct approach which does not depend on
combinatorial search and is thus applicable to real-world networks with large
numbers of input features, as demonstrated in this current study.
1 Introduction
Artificial neural networks are being increasingly used as decision support tools in a
wide range of applications [1-4] but are currently undermined by their inability to
explain or justify their output activations. In domains such as medical diagnosis it is
especially important for clinicians to understand and to have confidence in a system’s
prediction [5]. The multilayer perceptron (MLP) network is one of the most widely
used neural networks in the field and most approaches for extracting knowledge from
these networks have used a combinatorial search based approach [6-10]. However, a
limitation of these methods is that obtaining all possible combinations of rules is NP-hard [11,12] and the methods are not universally applicable to arbitrary MLP
networks. Such techniques are thus not feasible for achieving the goal of readily
interpreting trained neural networks that solve real-world problems with a large
number of input features [5], as in the current study.
Using a new method published by the first author [13, 14], it is shown in Section 3
of this chapter how to directly interpret an input case and extract knowledge from a
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp.270-285, 2000.
© Springer-Verlag Berlin Heidelberg 2000
standard MLP network with binary inputs. For any MLP input case the method
directly finds the key inputs, both positive and negated, used by the network to classify
the case. The top ranked key inputs for the input case represent a data relationship
which provides an explanation of the classification of the case for the domain user.
The process of discovering the key inputs provides an explanation for the network
developer of how the MLP uses the hidden layer neurons to classify the input case.
In Section 3 it is shown how the knowledge learned by the MLP from a training
example can also be represented as a rule which is directly induced from the ranked
data relationship. With the assistance of domain experts the validation of the
knowledge extracted from all of the training examples provides a method for
validating the MLP network.
The interpretation and knowledge extraction method is demonstrated in Section 4,
for selected example training cases from a MLP network that classifies low back pain
(LBP) patients into three diagnostic classes. A ranked data relationship is found for
each example case and a rule is directly induced from each of the ranked data
relationships.
In Section 5 it is shown how the average highest key input rankings for each
diagnostic class can be used to represent the network’s knowledge for validation
purposes. Results are presented of the average highest ranked key inputs that the low
back pain MLP uses to classify all training cases for each diagnostic class. It is shown
how the validation of this knowledge by the domain experts leads to the validation of
the low back pain MLP network both during training and testing.
Finally, the interpretation and knowledge discovery method is compared with other
methods of rule extraction in Section 6 and the chapter is summarised in Section 7.
2 The Low Back Pain MLP Network
Low back pain is one of the most common medical problems encountered in
healthcare, with 60-80% of the population suffering some form within their life-span
[15,16]. It has been estimated that only 15% of patients with low back pain obtain an
accurate diagnosis of their problem with any degree of certainty, and many receive
care that is less than optimal [16,17]. Low back pain is a difficult multi-factorial
problem that includes physical, psychological and social aspects of illness [15].
For this study, low back pain is classified into three diagnostic classes: simple low
back pain (SLBP) - mechanical low back pain, minor scoliosis and old spinal
fractures; root pain (ROOTP) - nerve root compression due to either disc, bony
entrapment or adhesions; and abnormal illness behaviour (AIB) - mechanical low
back pain, degenerative disc or bony changes with signs and symptoms magnified as a
sign of distress in response to chronic pain.
A data set of 198 actual cases was collected from patient questionnaires, physical
findings and clinical findings. In this preliminary study, the MLP network is a fully
connected, feed-forward network with 92 binary encoded input neurons corresponding
to 39 patient attributes, and 3 output layer neurons, each representing a diagnostic
272
M.L. Vaughn et al.
class. Using the sigmoidal activation function and the generalised delta learning
rule, the network was trained with 99 randomly selected patient cases. The MLP
architecture with the lowest test error had 10 hidden layer neurons at 1100 training
cycles, giving a 96% classification accuracy on the training set and a 67%
classification accuracy on a test set of the remaining 99 patient cases.
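For concreteness, the forward pass of a network of this shape can be sketched in a few lines of Python. This is an illustrative sketch with toy dimensions and assumed names (`sigmoid`, `forward`), not the authors' implementation; the chapter's network has 92 binary inputs, 10 hidden neurons and 3 outputs.

```python
import math

def sigmoid(x):
    """Sigmoidal activation function used at both processing layers."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(case, w_ih, w_ho):
    """One forward pass through a fully connected feed-forward MLP.

    case : binary-encoded input vector (92 inputs in the chapter's network)
    w_ih : hidden weights; w_ih[j] holds the weights from each input to
           hidden neuron j, with the bias weight as the last element
    w_ho : output weights; w_ho[k] holds the weights from each hidden
           neuron to output neuron k, with the bias weight last
    Returns (hidden activations, output activations).
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(row[:-1], case)) + row[-1])
              for row in w_ih]
    outputs = [sigmoid(sum(w * h for w, h in zip(row[:-1], hidden)) + row[-1])
               for row in w_ho]
    return hidden, outputs
```

In use, the output neuron with the highest activation gives the diagnostic class of the presented case.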
3 The Interpretation and Knowledge Extraction Method
The method first finds the hidden layer neurons which positively activate the
classifying output neuron. This leads to the discovery of the key positive inputs in the
MLP input case which positively drive the output classification, as follows.
3.1 Discovery of the Feature Detector Neurons
For an MLP input case, the method [13] defines the feature detector neurons as the
hidden neurons which positively activate the classifying output neuron. For sigmoidal
activations the hidden layer neuron activation is always positive and, hence, the
feature detectors are hidden neurons connected to the classifying neuron with positive
weights. The hidden layer bias (which fixes the output neuron’s threshold at zero) also
makes a positive contribution when the bias connection weight is positive.
In performing a classification task, the first author has shown [18] that the MLP
network finds sufficiently many hidden layer feature detector neurons with activation
>0.5 which positively activate the classifying output neurons. These neurons also play
a major role in negatively activating the non-classifying output neurons. The relative
contribution of the feature detectors to the MLP output classification can be found by
ranking the detectors in order of decrease in classifying output activation when
selectively switched off at the hidden layer.
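The discovery and ranking of the feature detectors described above can be sketched as follows. The function name and data layout are assumptions for illustration, not the authors' code: each hidden neuron with a positive weight to the classifying neuron is switched off in turn, and the relative fall in the classifying neuron's activation gives its rank.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rank_feature_detectors(hidden, w_out):
    """Rank the feature detectors for a classifying output neuron.

    hidden : hidden layer activations for the input case
    w_out  : weights from each hidden neuron to the classifying neuron
    Returns the baseline activation and (neuron index, relative change)
    pairs, largest decrease first.
    """
    def out(h):
        return sigmoid(sum(w * a for w, a in zip(w_out, h)))

    base = out(hidden)
    # Feature detectors: hidden neurons connected to the classifying
    # neuron with positive weights (sigmoidal activations are positive).
    detectors = [j for j, w in enumerate(w_out) if w > 0]
    drops = []
    for j in detectors:
        clamped = list(hidden)
        clamped[j] = 0.0                        # switch the detector off
        drops.append((j, (out(clamped) - base) / base))
    return base, sorted(drops, key=lambda d: d[1])
```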
3.2 Discovery of the Significant Inputs
The interpretation and knowledge extraction method defines the significant inputs as
the inputs in an MLP input case which positively activate the classifying output
neuron. Thus the significant inputs positively activate the feature detector neurons and,
for binary encoded inputs, are positive inputs connected to the feature detectors with
positive weights.
The relative contribution of the significant inputs to the MLP output classification
can be found by ranking the significant inputs in order of decrease in activation at the
classifying neuron when each is selectively switched off in turn at the MLP input
layer. There is evidence [14,18] that the most significant inputs are the MLP inputs
that most uniquely discriminate the input case.
Direct Explanations and Knowledge Extraction from a Multilayer Perceptron Network
273
3.3 Discovery of the Negated Significant Inputs
The interpretation and knowledge extraction method defines the negated significant
inputs in the MLP input case as the significant inputs for another class which
deactivate the feature detectors for that class when not active at the MLP input
layer. For binary inputs these are zero-valued inputs connected with positive
weights to hidden neurons which are not feature detectors.
The relative contribution of the negated significant inputs to the MLP output
classification can be found by ranking the negated significant inputs in order of
decrease in activation at the classifying neuron when each is selectively switched on in
turn at the MLP input layer.
3.4 Knowledge Learned by the MLP from the Training Data
Using the knowledge extraction method, the knowledge learned by the MLP from the
training data can be expressed as a non-linear data relationship and as an induced rule
which is valid for the training set, as follows.
Data Relationships. The knowledge learned by the network from an input training
case is embodied in the data relationship between the significant (and negated
significant) inputs and the associated network outputs:
significant (and negated significant) training inputs => associated network outputs
The data relationship is non-linear due to the effect of the sigmoidal activation
functions at each processing layer of neurons. The most important inputs in the
relationship can be ranked in order of decrease in classifying neuron activation, as
discussed above. The ranked data relationship is generally exponentially decreasing
and embodies the graceful degradation properties of the MLP network.
For the domain expert the ranked data relationship represents the explanation of the
case which can be enhanced with the information about the feature detector neurons
for the network developer.
Induced Rules. For an input training example, a rule which is valid for the MLP
training set can be directly induced from the training data relationship in order of
significant and negated significant input rankings [14,18]. The most general rule
induced from each training example represents the key knowledge that the MLP
network has learned from the training case and other similar training examples.
From the most general rule for each training example a general set of rules can be
induced which defines the key knowledge that the MLP network has learned from the
training set. However, rules do not embody the fault-tolerant and graceful degradation
properties of the MLP network and are not a substitute for the network. The rules,
nonetheless, are useful for validating the network knowledge and the completeness of
the knowledge.
3.5 MLP Network Validation and Verification
The validation of the MLP network can be undertaken, with the assistance of the
domain experts, by validating the data relationships and induced rules that the network
has learned from all the training examples. The validation process can lead to the
discovery of previously unknown data relationships and the analysis leading to this
discovery provides a method for data mining [19].
In MLP network testing the test data relationships can be used to verify that the
network is correctly generalising the knowledge learned by the network from the
training process. Testing is not expected to reveal new significant (and negated
significant) input data relationships since network generalisation is inferred from the
network knowledge acquired during training.
4 Explanations and Knowledge Extraction from LBP Training
Cases
The interpretation and knowledge extraction method is demonstrated for three low
back pain example cases: a class SLBP training case, a class ROOTP training case,
and a class AIB training case, which have classifying output neuron activations of
0.96, 0.93 and 0.91 respectively when presented to the low back pain MLP network.
4.1 Discovery of the Feature Detectors for Example Training Cases
Using the knowledge extraction method, as presented in Section 3, the feature
detectors are discovered to be hidden neurons H3, H4, H6, and H8 for the SLBP case,
hidden neurons H1, H3, H7, and H10 for the ROOTP case, and hidden neurons H1, H2,
H5, and H8 for the AIB case. The feature detectors are ranked in order of contribution
to the classifying output neuron when the detector is switched off at the hidden layer.
These contributions are shown in Tables 1a, 1b and 1c.
Table 1a. Hidden layer feature detectors - SLBP training case

  Feature    Activation of      Connection weight       Contribution to     Feature
  detector   feature detector   to classifying neuron   output activation   detector rank
  H3         +1.0000            +0.5797                 +0.5797             3 (- 3.2%)
  H4         +0.9998            +1.5108                 +1.5104             1 (-12.8%)
  H6         +0.9999            +0.8979                 +0.8979             2 (- 5.7%)
  H8         +0.9552            +0.5704                 +0.5448             4 (- 2.9%)

  Positive input: +3.5327    Negative input: -0.3984
  Total input:    +3.1344    Sigmoid output: +0.9583
Table 1b. Hidden layer feature detectors - ROOTP training case

  Feature    Activation of      Connection weight       Contribution to     Feature
  detector   feature detector   to classifying neuron   output activation   detector rank
  H1         +0.9999            +1.4073                 +1.4072             2 (-17.7%)
  H3         +1.0000            +0.3424                 +0.3424             3 (- 2.8%)
  H7         +1.0000            +1.8821                 +1.8821             1 (-27.9%)
  H10        +0.7355            +0.4446                 +0.3270             4 (- 2.6%)

  Positive input: +3.9587    Negative input: -1.3657
  Total input:    +2.5930    Sigmoid output: +0.9304
Table 1c. Hidden layer feature detectors - AIB training case

  Feature    Activation of      Connection weight       Contribution to     Feature
  detector   feature detector   to classifying neuron   output activation   detector rank
  H1         +0.8475            +0.0393                 +0.0333             3 (- 0.3%)
  H2         +0.9967            +1.7291                 +1.7234             1 (-29.1%)
  H5         +0.9901            +1.7100                 +1.6932             2 (-28.3%)
  H8         +0.1648            +0.1861                 +0.0307             4 (- 0.3%)

  Positive input: +3.4805    Negative input: -1.1554
  Total input:    +2.3250    Sigmoid output: +0.9109
4.2 Discovery of the Significant Inputs for Example Training Cases
The low back pain MLP inputs are binary encoded and thus the significant inputs have
value +1 and a positive connection weight to the feature detectors. The negated
significant inputs have value 0 and a positive connection weight to hidden neurons
that are not feature detectors.
For the SLBP training case, there are 24 significant inputs, of which 16, including
the input bias, show a decrease in the classifying output activation when selectively
removed from the input layer. Similarly, the ROOTP training case has 26 significant
inputs, of which 15 show negative changes at the output neuron when selectively
removed from the input layer. Finally, the AIB training case has 33 significant inputs,
with 17 showing a decrease in the output activation when selectively removed from
the input layer.
4.3 Data Relationships/Explanations for Example Cases
For the SLBP case and the ROOTP case the top ten combined ranked significant and
negated inputs account for a total decrease in the classifying neuron activation of 95%
and 96% respectively when switched off together at the MLP input layer. For the AIB
case the top four combined ranked inputs account for a 97% drop in classifying
activation.
The ranked input data relationships for each example training case and the
accumulated decrease at the respective classifying neurons are shown in Tables 2a, 2b
and 2c, and in [19]. The ranked data relationships represent the direct explanations of
the example cases for the orthopaedic surgeons. The relationships show that the nonlinear data relationship for each example case is exponentially decreasing with respect
to the ranked key inputs. In a similar way explanations can be provided automatically
on a case-by-case basis for any input case presented to the MLP network.
Table 2a. SLBP case training data relationship

  Rank  SLBP training case (−95%)               Accumulated              Accumulated
                                                classifying activation   decrease
  1     pain brought on by bending over         0.94                     −1.6%
  2     back pain worse than leg pain           0.67                     −29.6%
  3     not low Zung depression score           0.33                     −65.9%
  4     not normal DRAM                         0.13                     −86.0%
  5     no leg pain symptoms                    0.07                     −92.8%
  6     not leg pain worse than back pain       0.06                     −93.6%
  7     straight right leg raise ≥70°           0.05                     −94.5%
  8     back pain aggravated by sitting         0.05                     −94.6%
  9     back pain aggravated by standing        0.05                     −94.6%
  10    straight right leg raise not limited    0.04                     −95.2%
Table 2b. ROOTP case training data relationship

  Rank  ROOTP training case (−96%)                 Accumulated              Accumulated
                                                   classifying activation   decrease
  1     not back pain worse than leg pain          0.91                     −1.6%
  2     not lumbar extension <5°                   0.90                     −3.3%
  3     not pain brought on by bending over        0.75                     −19.8%
  4     not no leg pain symptoms                   0.42                     −54.4%
  5     not straight right leg raise ≥70°          0.34                     −63.3%
  6     back pain aggravated by coughing           0.27                     −70.9%
  7     loss of reflexes                           0.23                     −74.9%
  8     lumbar extension (5 to 14°)                0.15                     −84.2%
  9     not straight right leg raise not limited   0.07                     −92.6%
  10    not high MSPQ score                        0.04                     −96.4%
Table 2c. AIB case training data relationship

  Rank  AIB training case (−97%)                          Accumulated              Accumulated
                                                          classifying activation   decrease
  1     not straight left leg raise limited by leg pain   0.87                     −4.7%
  2     straight left leg raise ≤45°                      0.41                     −54.8%
  3     claiming invalidity/disability benefit            0.06                     −93.1%
  4     not straight left leg raise (46 to 69°)           0.02                     −97.6%
4.4 Discussion of the Key Inputs for Example Cases
From Table 2a it can be seen that in the SLBP case the patient presents back pain
symptoms as worse than leg pain (rank 2) which supports no leg pain symptoms (rank
5) and not leg pain worse than back pain (rank 6). Also, a straight leg raise (SLR) ≥70°
(rank 7) is substantiated by no limitation on the SLR test (rank 10). This particular
patient case possibly indicates a not normal psychological profile (ranks 3, 4).
In the ROOTP case, as shown in Table 2b, high ranked inputs are negated SLBP
key inputs indicating leg pain symptoms: not back pain worse than leg pain (rank 1),
not no leg pain symptoms (rank 4) and not SLR ≥70° (rank 5). In the AIB case, as
shown in Table 2c, three key inputs are associated with the left leg (rank 1,2,4) and the
other key input is claiming invalidity/disability benefit (rank 3).
4.5 Induced Rules from Training Example Cases
Using the knowledge extraction method, a rule which is valid for the MLP training set
can be directly induced from the data relationship for an input training example in
combined ranked order of significant and negated significant inputs. This is
demonstrated for each of the example training cases, as follows.
SLBP Example Training Case. The following rule, which is valid for the MLP
training set, is induced in ranked order from the SLBP example training data
relationship shown in Table 2a:
IF pain brought on by bending over AND back pain worse than leg pain
AND NOT low Zung depression score AND NOT normal DRAM
AND no leg pain symptoms
THEN class SLBP
This rule represents the key knowledge that the MLP network has learned from this
training case. The most general valid rule for the example case that can be found
(from the above rule) which is valid for the MLP training set is given by:
IF no leg pain symptoms THEN class SLBP
This rule is verified by a frequency analysis of the ‘no leg pain symptoms’ attribute in
the training set which shows that eight SLBP cases are the only training cases with this
attribute present.
ROOTP Example Training Case. The following rule, which is valid for the MLP
training set, is induced in ranked order from the ROOTP example training data
relationship shown in Table 2b:
IF NOT back pain worse than leg pain AND NOT lumbar extension <5°
AND NOT pain brought on by bending over
AND NOT no leg pain symptoms AND NOT straight right leg raise ≥70°
AND back pain aggravated by coughing
THEN class ROOTP
This is the most general rule that can be found (from the above rule) for this example
case which is valid for the MLP training set.
AIB Example Training Case. The following rule, which is valid for the MLP
training set, is induced in ranked order from the AIB example training data
relationship shown in Table 2c:
IF NOT straight left leg raise limited by leg pain
AND straight left leg raise ≤45° AND claiming invalidity benefit
AND NOT straight left leg raise (46 to 69°) AND smoker
THEN class AIB
This is the most general rule that can be found (from the above rule) for this example
case which is valid for the MLP training set.
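The induction of these rules can be read as the following sketch. The greedy generalisation step is an assumption about how the most general valid rule might be found (the chapter does not spell out the exact search): conditions are added in ranked order until no training case of another class satisfies the rule, and conditions are then dropped while the rule stays valid for the training set.

```python
def induce_rule(ranked_conds, target, training_set):
    """Induce a rule, valid for the training set, from one training case.

    ranked_conds : (attribute, value) conditions in decreasing order of
                   significant / negated significant input ranking
    target       : the case's diagnostic class
    training_set : list of (attribute dict, class label) pairs
    """
    def valid(conds):
        # A rule is valid if every training case it covers has the target class.
        return all(label == target
                   for attrs, label in training_set
                   if all(attrs.get(a) == v for a, v in conds))

    rule = []
    for cond in ranked_conds:            # grow the rule in ranked order
        rule.append(cond)
        if valid(rule):
            break
    if not valid(rule):
        return None                      # no valid rule from this case
    for cond in list(rule):              # generalise: drop redundant conditions
        trial = [c for c in rule if c != cond]
        if trial and valid(trial):
            rule = trial
    return rule
```

On a toy training set mirroring the SLBP example, the sketch grows a two-condition rule and then generalises it to a single condition, analogous to "IF no leg pain symptoms THEN class SLBP".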
4.6 Comparison of Knowledge Representations
The ranked data relationship, as shown in Tables 2a, 2b and 2c for each example case,
embodies the graceful degradation properties of the MLP network and shows the
exponentially decreasing relationship between the most important inputs and the
classifying output activation for the input case. In general use the ranked data
relationship represents the explanation of any input case presented to the MLP for the
domain expert or network user.
In comparison with the ranked data relationship, the rule induced from a training
input case is brittle and does not indicate the relative importance of the attributes used
by the MLP in classifying the case. For example, the rule induced from the AIB
example training case in Section 4.5 is valid only if the 5th ranked attribute ‘smoker’
is included in the rule yet the data relationship shows that the attribute is of extremely
low importance in the classification of the case.
The advantage of the rule, however, is that it is valid for the training set and the
more general the rule the easier it is to understand what the network has learned. This
is important for network validation since all of the extracted rules represent the
knowledge that the MLP has learned from the training set.
5 Knowledge Extraction from all LBP Training Cases
Using the knowledge extraction method, the ranked significant inputs and the ranked
negated significant inputs were discovered separately for each of the 99 training cases
in the low back pain MLP training set. With the aim of aggregating this knowledge in
a meaningful way for validation by the clinicians, the average of the ranking values of
the significant inputs was taken by class, resulting in a ranked significant input profile
for each diagnostic class, as shown in Table 3 and in [19]. Similarly, the ranked
negated input class profiles for each diagnostic class are shown in Table 4 and in [19].
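One plausible reading of this aggregation step, sketched with assumed names (the chapter does not give the exact weighting or tie-breaking): average each attribute's rank value over the training cases of a class, and keep the attributes with the best average rank as that class's profile.

```python
from collections import defaultdict

def class_profiles(cases, top_n=10):
    """Build a ranked input profile per diagnostic class.

    cases : list of (class label, ranked attribute list) pairs, one per
            training case, attributes in decreasing order of importance
    Returns, per class, the attributes with the best (lowest) average rank.
    """
    rank_sum = defaultdict(lambda: defaultdict(float))
    rank_cnt = defaultdict(lambda: defaultdict(int))
    for label, ranked in cases:
        for rank, attr in enumerate(ranked, start=1):
            rank_sum[label][attr] += rank
            rank_cnt[label][attr] += 1
    profiles = {}
    for label, sums in rank_sum.items():
        avg = {a: sums[a] / rank_cnt[label][a] for a in sums}
        profiles[label] = sorted(avg, key=avg.get)[:top_n]
    return profiles
```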
5.1 Discussion of the Ranked Class Profiles
In summary, the significant input class profiles in Table 3 indicate that a typical SLBP
patient presents with back pain worse than leg pain and a good range of lumbar
flexion, whereas a typical ROOTP patient presents with leg pain greater than back
pain and limited lumbar movements. A typical AIB patient claims invalidity/disability
benefit, high Waddell’s inappropriate signs and a psychological profile indicating
distress. Some indicators are significant to both SLBP and AIB classes, which is to be
expected since AIB patients are SLBP patients that develop signs of illness behaviour
manifested by distress.
The negated input profiles in Table 4 indicate that a typical SLBP patient does not
present with many of the key significant inputs of class ROOTP patients and vice
versa. There is an indication that some SLBP patients may not have a normal
psychological profile, possibly because patients with chronic back pain have the
potential to develop signs of AIB. Many of the key negated AIB inputs support the
key significant inputs of class AIB patients.
5.2 Validation of the LBP Network
The ranking profiles for each diagnostic class were used to generally validate the low
back pain MLP network, with the assistance of the domain experts, as follows.
Validation of the Training Cases. The ranked class profiles, as shown in Tables 3
and 4, indicate to the domain experts that the low back pain MLP network has largely
determined relevant attributes as typical characteristics of patients belonging to the 3
diagnostic classes.
However, for the SLBP class, the 8th ranked attribute ‘no back pain’ is
contradictory to the top ranked attribute ‘worse back pain’. Further investigation
revealed that two SLBP training cases with attribute ‘no back pain’ had been
incorrectly included in the training set. These were the only two cases in the training
set with this attribute present which demonstrates the sensitivity of both the MLP
network and the knowledge extraction method. As a result of the two incorrect
training cases, the training performance of the low back pain MLP network was re-assessed at 98% classification accuracy.
Table 3. Top ten averaged ranked significant inputs for all training cases in each
diagnostic class

  R   All SLBP training cases
  1   back pain worse than leg pain
  2   lumbar flexion ≥45°
  3   straight right leg raise ≥70°
  4   no leg pain symptoms
  5   pain brought on by bending over
  6   minimal ODI score
  7   lumbar extension <5°
  8   no back pain symptoms
  9   straight left leg raise (46 to 69°)
  10  straight right leg raise not limited

  R   All ROOTP training cases
  1   back pain aggravated by coughing
  2   leg pain worse than back pain
  3   lumbar flexion <30°
  4   lumbar extension (5 to 14°)
  5   normal DRAM
  6   low MSPQ score
  7   low Zung depression score
  8   use of walking aids
  9   leg pain aggravated by coughing
  10  pain brought on by lifting

  R   All AIB training cases
  1   claiming invalidity/disability benefit
  2   straight left leg raise ≤45°
  3   high Waddell’s inappropriate signs
  4   distressed depressive
  5   pain brought on by bending over
  6   straight left leg raise ltd by hamstrings
  7   pain brought on by falling over
  8   back pain aggravated by standing
  9   high Zung depression score
  10  back pain worse than leg pain

Table 4. Top ten averaged ranked negated inputs for all training cases in each
diagnostic class

  R   All SLBP training cases
  1   back pain aggravated by coughing
  2   lumbar flexion <30°
  3   lumbar extension (5 to 14°)
  4   equal back and leg pains
  5   smoker
  6   normal DRAM
  7   loss of reflexes
  8   low Zung depression score
  9   recurring back pain
  10  leg pain worse than back pain

  R   All ROOTP training cases
  1   back pain worse than leg pain
  2   lumbar flexion ≥45°
  3   pain brought on by bending over
  4   no leg pain symptoms
  5   minimal ODI score
  6   straight right leg raise ≥70°
  7   lumbar extension <5°
  8   acute back pain
  9   high Zung depression score
  10  high MSPQ score

  R   All AIB training cases
  1   straight left leg raise ltd by leg pain
  2   low Waddell’s inappropriate signs
  3   normal DRAM
  4   low Zung depression score
  5   leg pain worse than back pain
  6   leg pain aggravated by walking
  7   chronic back pain
  8   straight left leg raise (46° to 69°)
  9   back pain aggravated by coughing
  10  straight right leg raise limited by back pain
Validation of the Test Cases. Of the 99 test cases, 35 were apparently mis-classified
by the low back pain MLP network. By directly interpreting each of the mis-classified
test cases using the knowledge extraction method it was possible to compare the top
ranked significant and negated inputs of each test case with the class profiles shown in
Table 3 and Table 4. Many of the mis-classified case test rankings indicated a high
degree of commonality with the ranked class profiles.
On further investigation it was agreed by the domain experts that 16 of the
apparently mis-classified cases were likely to have been correctly classified by the low
back pain MLP and incorrectly classified by the clinicians, based on the evidence of
the average class rankings. Of these, 13 cases (out of 19) were correctly classified by
the network as class AIB, 2 (out of 11) were correctly classified by the network as
class ROOTP and 1 (out of 5) was correctly classified by the network as class SLBP.
The difficulty experienced by clinicians in diagnosing AIB patients has been observed
in other studies [20].
As a result of the 16 test cases correctly classified by the MLP network, the test
performance of the low back pain MLP network was re-assessed at 81% classification
accuracy which is similar to the results of other researchers [21,22].
6 Comparison with other Methods
Most approaches for extracting knowledge in the form of rules from trained neural
networks use a search based approach [6-10] where the MLP network is regarded as a
collection of two-layer perceptrons with activation functions which approximate step
threshold units. A decompositional [10] approach is taken where rules are extracted for
each hidden and output unit separately. The rule extraction algorithms search for
combinations of input values which when satisfied guarantee that a given unit is
maximally active. A limitation of these methods is that obtaining all possible
combinations of rules is NP-hard [11,12] and the methods are not universally
applicable to arbitrary MLP networks [5]. Such methods are thus not feasible for
extracting rules from real-world networks with a large number of input features.
To reduce the search space in the rule extraction process some approaches incorporate
techniques such as specialised training procedures [9,23] and network pruning [24-27,12]
or special network topologies [28,29]. The Partial-RE method in [12] uses
weight ordering to reduce the search space and can be used for larger size problems if
a small number of premises per rule is sufficient, in which case the method is
polynomial in n, where n is the number of input features.
An approach that enables the extraction of rules that directly map inputs to outputs
for feedforward networks is DEDEC [30] which, as in the current study, also extracts
rules by ranking the inputs of the MLP according to their importance. However, the
ranking process is done by first examining the weight vectors of the network and then
clustering the ranked inputs. Each cluster is used to generate a set of optimal binary
rules that describes the functional dependencies between the attributes of a cluster and
the network outputs [12].
Another approach that enables the extraction of rules that directly map inputs to
outputs for arbitrary MLP networks is validity-interval analysis (VIA) [31] which uses
linear programming to determine if a set of constraints on a network’s activation
values is consistent. However, the method is computationally expensive since it
requires multiple runs of linear programming per rule. A further drawback is that the
method assumes that activations of the hidden layer units are independent.
The rule extraction method described in the current study also enables the extraction
of rules that directly map inputs to outputs for arbitrary MLP networks. However, the
method in this study uses a direct, holistic approach which is not computationally
expensive since the complexity of the method is linear in the number of hidden layer
and output layer neurons. Unlike VIA, the approach makes use of the dependent
hidden layer activations to interpret and discover knowledge from an input case.
The method aims to extract rules only from the training set because the network
knowledge is expected to embody what it has learned from the training examples. As a
result, the maximum number of rules that can be extracted is limited by the size of the
training set. However, duplication of the most general rules is expected, especially for
training examples lying in the same hidden layer decision region [18]. Use of the
method thus far, as indicated in this study and [14,18], indicates good rule
comprehensibility with a small number of premises per extracted rule due to the
exponentially decreasing relationships learned by the network.
In general use the interpretation and knowledge extraction method described in this
study directly provides an explanation for any input case presented to the network by
discovering the ranked data relationship for the case, as demonstrated in Section 4.
Potentially novel inputs beyond the network knowledge bounds, as defined by the
hidden layer decision regions with training examples, can be detected and prevented,
as shown in [18]. Since the method does not use combinatorial search or specialised
training procedures it potentially represents a significant advance towards the goal of
readily interpreting trained neural networks (for which solutions already exist) that
solve real-world problems with a large number of input features [5,11], as in the
current study.
7 Summary and Conclusions
Using a new interpretation and knowledge extraction method it is shown in this
chapter how to directly explain the classification of any MLP network input case by
discovering the most important inputs in the input case in the form of a ranked data
relationship. This reveals the non-linear exponentially decreasing relationship between
the most important inputs and the classifying output activation for the input case, and
embodies the graceful degradation properties of the MLP network.
The knowledge that the MLP network learns from a training case can be represented
as a ranked data relationship and as an induced general rule which is valid for the
training set. In this study, the knowledge that the MLP network learns from all of the
99 training cases is represented by the ranked class profiles of the averaged highest
input rankings for the significant inputs and for the negated inputs.
In validating the ranked class profiles the domain experts considered that the
preliminary network had determined largely valid attributes as typical characteristics
of each diagnostic class of patients. One evidently invalid SLBP attribute, however,
revealed that two training cases had been incorrectly included in the training set which
illustrates the sensitivity of both the MLP network and the knowledge extraction
method.
By directly interpreting 19 mis-classified AIB test cases it was agreed by the domain
experts that 13 cases were likely to have been correctly classified by the low back pain
MLP based on the evidence of the class characteristics. This demonstrates the greater
consistency by the MLP network in classifying the AIB patients when compared with
the clinicians.
Since the interpretation and knowledge extraction method uses a direct, holistic
approach which does not use combinatorial search or specialised training procedures it
is concluded that the method potentially represents a significant advance towards the
goal of readily interpreting trained feed-forward neural networks that solve real-world
problems with a large number of input features, as in the current study.
Future Work
Future research work will seek to automatically induce a valid rule for each training
input case to further enhance the MLP network knowledge validation process. Studies
will also be made to discover if the knowledge learned by the network is invariant
with respect to parameters in the learning environment such as network architecture,
input/output data encoding, activation function, selection of initial weights and
learning rules. The results of this research will be presented subsequently.
Acknowledgements
The financial support for this research was provided by the Ridgeway Hospital,
Swindon and Compass Health Care.
References
1. Spiegelhalter, D.J., Taylor, C.C., (eds) : Machine Learning, Neural and Statistical
Classification. Ellis Horwood, Chichester (1994)
2. Patterson D.: Artificial Neural Networks Theory and Applications. Prentice Hall: Singapore
(1996)
3. Dillon T., Arabshahi P., Marks R.J.: Everyday Applications of Neural Networks. IEEE
Trans. Neural Networks, Vol. 8. (1997)
4. Looney C.G.: Pattern Recognition Using Neural Networks. Oxford University Press: New
York (1997)
5. Craven M.W., Shavlik J.W.: Using Sampling and Queries to Extract Rules from Trained
Neural Networks, in Machine Learning. Proceedings of the Eleventh International Conference
on Machine Learning, Amherst, MA,USA. Morgan Kaufmann (1994) 73-80
6. Ourston D., Mooney R.J.: Changing the Rules: A Comprehensive Approach to Theory
Refinement. Proceedings of the Eighth National Conference on Artificial Intelligence (1990)
815-820
7. Gallant S.I.: Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA.
(1993)
8. Fu L.: Neural Networks in Computer Intelligence. McGraw-Hill, London (1994)
9. Towell, G.G., & Shavlik, J.W.: Extracting Refined Rules from Knowledge Based Neural
Networks. Machine Learning, Vol.13. (1993) 71-101
10. Andrews R., Diederich J., Tickle A.B.: Survey and Critique of Techniques for Extracting
Rules from Trained Artificial Neural Networks. Knowledge-Based Systems, Vol. 8 (6) (1995)
373-389
11. Tickle A.B., Andrews R., Golea M., Diederich J.: The Truth Will Come to Light: Directions
and Challenges in Extracting the Knowledge Embedded Within Trained Artificial Neural
Networks. IEEE Trans. Neural Networks, Vol. 9 (6) (1998) 1057-1067
12. Taha I.A., Ghosh J.: Symbolic Interpretation of Artificial Neural Networks. IEEE Trans.
Knowledge and Data Engineering, Vol. 11 (3) (1999) 448-463
13. Vaughn M.L.: Interpretation and Knowledge Discovery from the Multilayer Perceptron
Network: Opening the Black Box. Neural Computing & Applications, Vol. 4 (2) (1996) 72-82
14. Vaughn M.L., Ong E., Cavill S.J.: Interpretation and Knowledge Discovery from the
Multilayer Perceptron Network that Performs Whole Life Assurance Risk Assessment. Neural
Computing & Applications, Vol. 6 (4). (1997) 203-213
15. Waddell G.: A New Clinical Model for the Treatment of Low-Back Pain. Spine, Vol. 12 (7)
(1987) 632-644
16. Jackson D., Llewelyn-Phillips H., Klaber-Moffett J.: Categorization of Back Pain Patients
Using an Evidence Based Approach. Musculoskeletal Management, Vol. 2 (1996) 39-46
17. Bigos S., Bowyer O., Braen G.: Acute Low Back Problems in Adults. Clinical Practice
Guideline No. 14. AHCPR Publication No. 95-0642. U.S. DHHS. (1994)
18. Vaughn M.L.: Derivation of the Multilayer Perceptron Weight Constraints for Direct
Network Interpretation and Knowledge Discovery. To be published in Neural Networks
(1999)
19. Vaughn M.L., Cavill S.J., Taylor S.J., Foy M.A., Fogg A.J.B.: Direct Knowledge Discovery
and Interpretation from a Multilayer Perceptron Network that Performs Low-Back-Pain
Classification. In Bramer M. (ed.), Knowledge Discovery and Data Mining: Theory and
Practice. IEE Press (1999) 160-179
20. Waddell G., Bircher M., Finlayson D., Main C.: Symptoms and Signs: Physical Disease or
Illness Behaviour. BMJ, Vol. 289 (1984) 739-741
21. Bounds D. G., Lloyd P. J., Mathew B. G., Waddell G.: A Multi Layer Perceptron Network
for the Diagnosis of Low Back Pain. Proceedings of IEEE International Conference on Neural
Networks, San Diego, California (1988) 481-489
22. Bounds D. G., Lloyd P. J., Mathew B.: A Comparison of Neural Network and Other Pattern
Recognition Approaches to the Diagnosis of Low Back Disorders. Neural Networks, Vol. 3.
(1990) 583-591
Direct Explanations and Knowledge Extraction from a Multilayer Perceptron Network
285
23. Craven, M.W., & Shavlik, J.W.: Learning symbolic rules using artificial neural networks,
Proceedings of the Tenth International Conference on Machine Learning, Amherst, MA, USA.
Morgan Kaufmann (1993) 73-80
24. Viktor, H.L., Engelbrecht, A.P. & Cloete, I.: Reduction of Symbolic Rules from Artificial
Neural Networks Using Sensitivity Analysis, Proceedings of the 1995 IEEE International
Conference on Neural Networks, Perth, Western Australia (1995)
25. Krishnan R.: A Systematic Method for Decompositional Rule Extraction from Neural
Networks. Proceedings of NIPS’96 Workshop on Rule Extraction from Trained Artificial
Neural Networks. Queensland Univ. Technol. (1996) 38-45
26. Maire F. : A Partial Order for the M-of-N Rule Extraction Algorithm. IEEE Trans. Neural
Networks, Vol.8 (1997) 1542-1544
27. Setiono R.: Extracting Rules from Neural Networks by Pruning and Hidden Unit Splitting.
Neural Comput, Vol 9. (1997) 205-225
28. Bologna, G.: Rule Extraction from the IMLP Neural Network: a Comparative Study.
Proceedings of NIPS’96 Workshop on Rule Extraction from Trained Artificial Neural
Networks. Queensland Univ. Technol. (1996)
29. Saito K., Nakano R.: Law Discovery Using Neural Networks. Proceedings of NIPS’96
Workshop on Rule Extraction from Trained Artificial Neural Networks. Queensland Univ.
Technol. (1996) 62-69
30. Tickle A.B., Orlowski M., Diederich J.: DEDEC: A Methodology for Extracting Rules from
Trained Artificial Neural Networks. Proceedings of NIPS’96 Workshop on Rule Extraction
from Trained Artificial Neural Networks. Queensland Univ. Technol. (1996) 90-102
31. Thrun, S.B.: Extracting rules from artificial neural networks with distributed representations.
In: Advances in Neural Information Processing Systems, Vol. 7. Tesauro G., Touretzky D.,
Leen T. (eds.). MIT Press (1995)
High Order Eigentensors as Symbolic Rules in
Competitive Learning
Hod Lipson¹ and Hava T. Siegelmann²
¹ Mechanical Engineering Department
² Industrial Engineering and Management Department
Technion – Israel Institute of Technology, Haifa 32000, Israel
Current e-mail: hlipson@mit.edu, iehava@ie.technion.ac.il
Abstract. We discuss properties of high order neurons in competitive learning.
In such neurons, geometric shapes replace the role of classic ‘point’ neurons in
neural networks. Complex analytical shapes are modeled by replacing the
classic synaptic weight of the neuron by high-order tensors in homogeneous
coordinates. Such neurons permit not only mapping of the data domain but also
decomposition of some of its topological properties, which may reveal
symbolic structure of the data. Moreover, eigentensors of the synaptic tensors
reveal the coefficients of polynomial rules that the network is essentially
carrying out. We show how such neurons can be formulated to follow the
maximum-correlation activation principle and permit simple local Hebbian
learning. We demonstrate decomposition of spatial arrangements of data
clusters including very close and partially overlapping clusters, which are
difficult to separate using classic neurons.
1 Introduction
A phase diagram contains data points representing measurements of a phenomenon
plotted in multi-dimensional space. If the measured phenomenon follows no rules, the
data points will uniformly fill the data space. However, if there are any governing
principles to the measured phenomenon, the data points will not fill the space
uniformly, but rather will form some kind of structure or pattern. Tools for
exploratory data analysis such as neural networks may then be used to map this
structure, and so create a model for the behavior of the phenomenon.
In many cases, however, mapping the phenomenon, i.e. determining what areas in
the data space it is more likely to occupy, is not sufficient. In order to understand the
laws that govern the phenomenon, it is necessary to decompose the mapped volume
into its components and derive relationships among them. It is convenient to associate
the simpler mapping of the phenomenon with determination of its metrics, and the
“deeper” understanding of the governing principles or symbolic structure with
determination of its topology. In this paper we take the view that, in order to extract
symbolic meaning from an observed phenomenon it is necessary to use neuronal units
that can individually account for a complete symbolic rule. In practice, we use
geometric shapes: the shape itself is the symbolic rule and its parameters (say
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp.286-297, 2000.
© Springer-Verlag Berlin Heidelberg 2000
curvatures and gradients) are the tunable parameters. We describe an augmentation to
classic competitive neurons that permits them to decipher these aspects.
The ability to determine the topological structure of the data domain is considered
a primary capacity of Kohonen’s self-organizing map (SOM) (Kohonen, 1997).
Under certain conditions, a SOM network may self-organize so that each neuron
relocates to the center of a cluster of input points which it represents. Due to the
connectivity of the net, the topology of the net is also mapped onto the topology of the
data domain, thus revealing topological properties of the structure such as cluster
proximity. However, the topological structure of the net may also act as a constraint
on the arrangement of the neurons, impeding its ability to capture certain
configurations. When this topological constraint is released by adopting full
connectivity, a general form of a vector quantization (VQ) clustering algorithm is
obtained. Such algorithms (Pal et al, 1993) can map any data arrangement, but they
do not provide information regarding the topological structure. Similarly, most other
networks acquire topological information implicitly into their weights; this
information cannot be directly extracted.
In this paper we explore an alternative approach to modeling the topology of the
data domain. Modeling is achieved by using neurons that not only map the location of
their corresponding data, but also explicitly map its local topological and geometrical
properties. These ‘geometric neurons’ are of higher dimensionality than their input
domain, and may therefore track features of the activation area that might correspond
to local symbolic properties. They use high-order ‘synaptic tensors’ instead of classic
synaptic weight vectors, where weights correspond also to combinations of inputs of
various orders. When these neurons are used in a network configuration, local
topological properties are accumulated to explicitly reveal the global topological
arrangement of the data. These neurons obey the simple Hebbian-type learning rule,
and, depending on their shape and base functions, can reveal configurations even
among close and partially overlapping clusters.
The paper first outlines the concept of the proposed enhancement in light of
existing work, and then provides a mathematical formulation for analytic geometric
neurons with polynomial base functions. We show that a classic neuron is a first-order
case of the geometric neuron, and the second-order neuron corresponds to the well-established ellipsoidal (Mahalanobis) metric neuron. We describe the neuron itself, its
activation, its learning scheme and then demonstrate its functionality within a net.
2 Shape-Sensitive Neurons
The fundamental property of a shape sensitive neuron is that it is capable of mapping
the local topological and geometrical properties of a data volume because, unlike a
point neuron, it has topological and geometric properties of its own. These properties
are parametric and hence adaptive. Unlike classic networks, the topological structure
of the data is then directly accessible, since the topology of each neuron is known and
simple. A schematic illustration of a network with geometric neurons is shown in
Figure 1(a), where point neurons are replaced by higher-order ‘blob’ neurons which
can take the form of various topological components such as links, forks and volumes,
as well as of ordinary point neurons. Four actual shape sensitive neurons are shown in
Figure 1(b).
High order neurons are defined as neurons which accept input not only from single
inputs, but also from combinations of inputs, such as sets of inputs multiplied to
various orders. The use of high order neurons in general is not new. High order
neurons are generally associated with more degrees of freedom rather than explicit
topological properties. Explicit geometric properties have been introduced for specific
cases of prototype-based clustering and competitive neural networks. Gustafson and
Kessel (1979) used the covariance matrix to capture ellipsoidal properties of clusters.
Davé (1989) used fuzzy clustering with a non-Euclidean metric to detect lines in
images. This concept was later expanded, and Krishnapuram et al (1995) used
general second-order shells such as ellipsoidal shells and surfaces. For an overview
and comparison of these methods see Frigui and Krishnapuram (1996). Incorporation
of Mahalanobis (elliptical) metrics in neural networks was addressed by Kavuri and
Venkatasubramanian (1993) for fault analysis applications, and by Mao and Jain
(1996) as a general competitive network with embedded principal component analysis
units. Kohonen uses adaptive tensorial weights (Kohonen, 1997) to capture significant
variances in the components of input signals, thereby introducing a weighted
Euclidean distance in the matching law. Abe et al (1997) attempt to extract fuzzy
rules using ellipsoidal units and compare successfully to other rule-extracting
methods. In this paper we explore the possibility of using neurons with general and
explicit geometric properties under direct Hebbian learning. These neurons are not
limited to ellipsoidal shapes.
Fig. 1. (a) A schematic illustration of a network with geometric neurons. Point neurons are
replaced by higher-order ‘blob’ neurons that can take the form of various topological
(symbolic) components such as links, forks and volumes, as well as of ordinary point neurons.
(b) Examples of four high order geometric neurons developed in this work, with different
geometric and topological structures, and the data points they represent.
In the following sections we adopt the notation: x, w are column vectors and W, D
are matrices; x_H, W_H denote vectors and matrices in homogeneous representation
(described later); x^(j), D^(j) denote the j-th vector/matrix, corresponding to the j-th
neuron/class; x_i, D_ij denote an element of a vector/matrix; m is the order of the
neuron; d is the dimensionality of the input; and N is the size of the layer (the number
of neurons).
3 A High-Order Neuron with Polynomial Base Functions
In classical self-organizing networks, each neuron j is assigned a synaptic weight
vector w^(j). The winning neuron j(x) in response to an input x is the one showing the
highest correlation with the input, i.e. the neuron j for which w^(j)T x is the largest. Note
that when the synaptic weights w^(j) are normalized to a constant Euclidean length, the
above criterion becomes identical to the minimum Euclidean distance matching
criterion. However, the use of a minimum-distance matching criterion incorporates
several difficulties. The minimum distance criterion implies that the features of the
input domain are spherical, i.e., matching deviations are considered equally in all
directions, and distances between features must be larger than the distances between
points within a feature.
These aspects preclude the ability to detect higher order, complex or ill-posed
feature configurations and topologies, as these are based on higher geometrical
properties such as directionality and curvature. This constitutes a major difficulty,
especially when the input is of high dimensionality, where such configurations are
difficult to visualize. Complex clusters may require complex metrics for separation.
The modeling constraints imposed by the maximum correlation matching criterion
stem from the fact that the neuron’s synaptic weight w^(j) has the same dimensionality
as the input x, i.e., the same dimensionality as a single point in the input domain,
while in fact the neuron is modeling a cluster of points which may have higher order
attributes such as directionality and curvature. We shall therefore refer to the classic
neuron as a ‘first-order’ (zero degree) neuron, due to its correspondence to a point in
multidimensional space.
Second Order
To circumvent this restriction, we augment the neuron with the capacity to map
additional geometric and topological information. For example, as the first-order case
is a point neuron, the second-order case will correspond to orientation and size
components, effectively attaching a local oriented coordinate system with non-uniform
scaling to the neuron center and using it to define a new distance metric.
Thus each second-order neuron will represent not only the mean value of the data
points in the cluster it is associated with, but also the principal directions of the cluster
and the variance of the data points along these directions. Intuitively, we can say that
rather than defining a sphere, the second-order distance metric now defines a multi-dimensional
oriented ellipsoid. After some mathematical manipulation (Gustafson and
Kessel, 1979), the second order information (orientation and scaling) can be shown to
reside entirely in the correlation matrix R of the zero-mean data, and the matching
criterion becomes
    i(x) = arg min_j (x − w^(j))^T R^−1 (x − w^(j)),   j = 1, 2, …, N        (1)
This criterion also corresponds to the term used in the maximum likelihood
Gaussian classifier (Duda and Hart, 1973). As it stands, Eq. (1) requires separate
tracking of the orientation and size information, in R^(j), and the position in w^(j). For a
more systematic treatment, we combine these two coefficients into one expanded
covariance, denoted R_H^(j), in homogeneous coordinates (Faux and Pratt, 1981), as
    R_H^(j) = Σ x_H^(j) x_H^(j)T        (2)
where in homogeneous coordinates

    x_H = [ x ]
          [ 1 ]

so that now R_H^(j) ∈ ℝ^(d+1)×(d+1) and x_H ∈ ℝ^(d+1)×1. When working in homogeneous
coordinates, the constant unit 1 is appended to the vector, so that when the vector is
multiplied, it carries lower orders as well. Thus, a single representation encapsulates
both first degree (linear) and zero degree (constant) elements. In analogy to the
‘correlation matrix memory’ and ‘autoassociative memory’ (see Haykin, 1994), the
extended matrix R_H^(j), in its general form, can be viewed as a ‘homogeneous
autoassociative tensor’. Now the new matching criterion becomes simply
    i(x) = arg min_j [ x_H^T (R_H^(j))^−1 x_H ]
         = arg min_j ‖(R_H^(j))^−1/2 x_H‖²
         = arg min_j ‖(R_H^(j))^−1 x_H‖,   j = 1, 2, …, N        (3)
This representation retains the notion of maximum correlation, and for
convenience, we now call R_H^−1 the synaptic tensor. Note that this step is based on the
fact that the eigenstructure of the extended correlation matrix R_H^(j) in homogeneous
coordinates corresponds to the principal directions and average (i.e. both second-order
and first-order) properties of the cluster accumulated in R_H^(j). This property permits
extension to higher order tensors, where direct eigenstructure analysis is not well
defined. The transition into homogeneous coordinates also dispenses with the need to
zero-mean the correlation data.
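To make the homogeneous formulation concrete, here is a minimal sketch in Python/NumPy (our illustration, not the authors' code; the function names and the synthetic clusters are ours). Each neuron accumulates R_H = Σ x_H x_H^T over its cluster (Eq. 2), and the winner for an input minimizes x_H^T (R_H^(j))^−1 x_H (Eq. 3), with the cluster mean handled implicitly by the homogeneous coordinate:

```python
import numpy as np

def homogeneous(x):
    """Append the constant 1 (homogeneous coordinates)."""
    return np.append(x, 1.0)

def correlation_matrix(points):
    """R_H = sum of outer products x_H x_H^T over a cluster (Eq. 2)."""
    X = np.array([homogeneous(p) for p in points])
    return X.T @ X

def winner(x, tensors):
    """Matching criterion of Eq. (3): argmin_j  x_H^T (R_H^(j))^-1 x_H."""
    xh = homogeneous(x)
    return int(np.argmin([xh @ np.linalg.inv(R) @ xh for R in tensors]))

# Two elongated 2-D clusters of equal size (synthetic data).
rng = np.random.default_rng(0)
c0 = rng.normal([0.0, 0.0], [2.0, 0.2], size=(200, 2))  # long along x
c1 = rng.normal([5.0, 0.0], [0.2, 2.0], size=(200, 2))  # long along y
R0, R1 = correlation_matrix(c0), correlation_matrix(c1)

print(winner(np.array([3.5, 0.0]), [R0, R1]))  # neuron 0 wins
```

The point (3.5, 0) is closer in plain Euclidean distance to the second cluster's mean, yet the ellipsoidal metric assigns it to the elongated cluster at the origin.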
Digressing for a moment, we recall that the classic neuron possesses a synaptic
weight vector w^(j) which corresponds to a point in the input domain. The synaptic
weight w^(j) can be seen to be the first order average of its signals, i.e., w^(j) = Σ x_H (the last
element x_H(d+1) = 1 of the homogeneous coordinates has no effect in this case). The
shape-sensitive neurons hold information regarding the linear correlations among the
coordinates of data points represented by the neuron, by using R_H^(j) = Σ x_H x_H^T. Each
element of R_H is thus a proportionality constant relating two specific dimensions of
the cluster.
Higher Orders
The second-order neuron can be regarded as a second-order approximation of the
corresponding data distribution. We may consequently introduce an m-th-order shape
sensitive neuron capable of modeling a d-dimensional data cluster to an m-th-order
approximation. For example, a third-order neuron is capable of storing not only the
principal directions and size of the d-dimensional data cluster, but also its curvatures
along each of these axes; hence, it is capable of matching the topology of, say, a
Y-shaped fork.
In order to obtain m-th-order components of the data cluster, we use an analogy
between eigenstructure decomposition and least-squares fitting. The analogy holds
that the principal directions of a data cluster (its eigenvectors) correspond to the
normals of the orthogonal set of best-fit hyperplanes through the data set, and the
eigenvalues correspond to the variances of the data from those hyperplanes. In
homogeneous coordinates, the hyperplanes contain also an offset, and hence each
homogeneous eigenvector corresponds to the coefficients of the corresponding
hyperplane equation. Extending this analogy to higher orders, the higher principal
components of the cluster (say, the principal curvatures) correspond to the
eigentensors of higher-order correlation tensors in homogeneous coordinates. The
neuron is therefore represented by a 2(m−1)-dimensional tensor of rank d+1, denoted
by Z^(j) ∈ ℝ^(d+1)×…×(d+1). The factor of 2 is introduced by the squared error used by the
least-squares method. The tensor is created by successive ‘outer products’ of the
homogeneous vector x_H by itself.
    Z_H^(j) = x_H ⊗ x_H ⊗ ··· ⊗ x_H   (2(m−1) factors)        (4)
In practice, in order to extract the winner neuron it is only necessary to determine
the amount of correlation between an input and the tensors. As in Eq. (3), in higher
orders too this amounts to multiplication of the input x_H by the tensor.
The exponent consists of the factor (m−1), which is the degree of the
approximation, and a factor of 2, since we are performing auto-correlation and the
function is multiplied by itself. In analogy to the reasoning that led to Eq. (3), each
eigenvector (now an eigentensor of order m−1) corresponds to the coefficients of a
principal curve, and multiplying it by an input point produces an approximation of
the distance of that input point from the curve. Consequently, the inverse of the tensor
Z_H^(j) can then be used to compute the high-order correlation of the signal with the
nonlinear shape neuron, by simple tensor multiplication:
    i(x) = arg min_j ‖ (Z_H^(j))^−1 ⊗ x_H^⊗(m−1) ‖,   j = 1, 2, …, N        (5)
where ⊗ denotes tensor multiplication. Note that the covariance tensor can only be
inverted if its order is an even number, as satisfied by Eq. (4). Note also that the
amplitude operator ‖·‖ is carried out by computing the root of the sum of the squares of
the elements of the argument. The computed metric is now not necessarily spherical
and may take various other forms.
In practice, however, high-order tensor inversion is not directly required. To make
this analysis simpler, we use Kronecker notation for tensor products (see Graham,
1981). The Kronecker tensor product ‘flattens out’ the elements of X ⊗ Y into a large
matrix formed by taking all possible products between the elements of X and those of
Y. For example, if X is a 2 by 3 matrix, then X ⊗ Y is
    X ⊗ Y = [ X_1,1·Y   X_1,2·Y   X_1,3·Y ]
            [ X_2,1·Y   X_2,2·Y   X_2,3·Y ]        (6)
where each block is a matrix of the size of Y. In this notation, the internal structure of
higher-order tensors is easier to perceive, and their correspondence to linear regression
of principal polynomial curves is revealed. Consider, for example, a fourth order
covariance tensor of the vector x = (x, y). The fourth-order tensor corresponds to the
simplest non-linear neuron according to Eq. (4), and takes the form of the 2×2×2×2
tensor
    x ⊗ x ⊗ x ⊗ x = R ⊗ R = [ x²  xy ] ⊗ [ x²  xy ]
                            [ xy  y² ]   [ xy  y² ]

                          = [ x⁴    x³y   x³y   x²y² ]
                            [ x³y   x²y²  x²y²  xy³  ]
                            [ x³y   x²y²  x²y²  xy³  ]        (7)
                            [ x²y²  xy³   xy³   y⁴   ]
The homogeneous version of this tensor includes also all lower-order permutations
of the coordinates of x_H = (x, y, 1), namely, the 3×3×3×3 tensor

    Z_H(4) = x_H ⊗ x_H ⊗ x_H ⊗ x_H = R_H ⊗ R_H
           = [ x²  xy  x ]   [ x²  xy  x ]
             [ xy  y²  y ] ⊗ [ xy  y²  y ]        (8)
             [ x   y   1 ]   [ x   y   1 ]
Extracting Symbolic Rules
It is immediately apparent that the above matrix corresponds to the matrix to be
solved for finding a least squares fit of a conic section equation
ax² + by² + cxy + dx + ey + f = 0 to the data points. Moreover, the set of eigenvectors of this
matrix corresponds to the coefficients of the set of mutually orthogonal best-fit conic
section curves that are the principal curves of the data. This notion accords with
Gnanadesikan’s method for finding principal curves (Gnanadesikan, 1977). Now,
substitution of a data point into the equation of a principal curve yields an
approximation of the distance of the point from that curve, and the sum of squared
distances amounts to the term evaluated in Eq. (5). Note that each time we increase
complexity, we are seeking a set of principal curves of one degree higher. This
implies that the least-squares matrix needs to be two degrees higher (because it is
minimizing the squared error), thus yielding the coefficient 2 in the exponent of Eq.
(4) for the general case. Figure 2 shows a cluster of data points and one of its
eigentensors.
Rule extraction is performed as follows. First, the synaptic tensor of each neuron is
analyzed to extract its eigentensors, which are half the order of the synaptic tensor.
The terms of each eigentensor define the coefficients of a polynomial curve, such as
the one shown in Figure 2(b). Distance from this polynomial curve is one of the
clustering metrics used by this neuron, and hence can be used as a symbolic rule for
classification of the associated data points. The algebraic rule can thus be directly
extracted analytically.
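As a minimal sketch of this extraction (our illustration, not the authors' implementation): for points sampled from the unit circle x² + y² − 1 = 0, the eigenvector of the least-squares scatter matrix with the smallest eigenvalue recovers the conic coefficients. For brevity we work with the six distinct monomials rather than the full 9 × 9 flattened tensor, whose extra rows merely duplicate these monomials:

```python
import numpy as np

def monomials(x, y):
    """Distinct monomials of Z_H(4): (x^2, xy, y^2, x, y, 1)."""
    return np.array([x * x, x * y, y * y, x, y, 1.0])

# Sample points lying on the unit circle x^2 + y^2 - 1 = 0.
t = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
M = np.array([monomials(np.cos(a), np.sin(a)) for a in t])
S = M.T @ M                      # least-squares scatter matrix

# The eigenvector with the smallest eigenvalue holds the conic coefficients
# in the monomial order above; for the unit circle this is (1, 0, 1, 0, 0, -1).
w, V = np.linalg.eigh(S)         # eigenvalues in ascending order
coeffs = V[:, 0] / V[0, 0]       # normalise so the x^2 coefficient is 1
print(np.round(coeffs, 6))       # close to [1, 0, 1, 0, 0, -1]
```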
Fig. 2. (a) The cluster of points, (b) one of the six eigentensors, and (c) the best-fit 3rd-order 2D
volume corresponding to a unit circle in the space spanned by the six orthogonal eigentensors.
4 Hebbian Learning for High-Order Neurons
In order to show that higher-order shapes are a direct extension of classic neurons, we
show that they are also subject to simple Hebbian learning. Following an
interpretation of Hebb’s postulate of learning, synaptic modification (i.e., learning)
occurs when there is a correlation between presynaptic and postsynaptic activities.
We have already shown in Eq. (3) above that (a) the presynaptic activity x_H and the
postsynaptic activity of neuron j coincide when the synapse is strong, i.e. ‖(R_H^(j))^−1 x_H‖ is
minimum. We now proceed to show that, in accordance with Hebb’s postulate of
learning, (b) it is sufficient to incur self-organization of the neurons by increasing
synapse strength when there is a coincidence of presynaptic and postsynaptic signals.
In order to provide quality (b) above, we need to show how self-organization is
obtained merely by increasing R_H^(j), where j is the winning neuron. As each new data
point arrives at a specific neuron j in the net, the synaptic weight of that neuron is
adapted by the increment corresponding to Hebbian learning,
    Z_H^(j)(k+1) = Z_H^(j)(k) + η(k) x_H^⊗2(m−1)        (9)
where k is the iteration counter and η(k) is the iteration-dependent learning rate
coefficient. It should be noted that in Eq. (9), Z_H^(j) becomes a weighted sum of the input
signals x_H^⊗2(m−1) (unlike R_H in Eq. (2), which is a uniform sum). The eigenstructure
analysis of a weighted sum still provides the principal components under the
assumption that the weights are uniformly distributed over the cluster signals. This
assumption holds true if the process generating the signals of the cluster is stable over
time - a basic assumption of all neural network algorithms (Bishop, 1997). In practice
this assumption is easily acceptable, as will be demonstrated in the following sections.
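In flattened form, the update of Eq. (9) is a single outer-product increment per input. The sketch below (our own illustration, with an assumed η(k) = 1/(k+1) schedule) runs a second-order neuron (m = 2) over a stable input stream; normalising the accumulated tensor by its bottom-right entry, which accumulates Σ η(k), recovers a weighted-average autocorrelation whose last row estimates the cluster mean:

```python
import numpy as np

def hebbian_update(Z, xh, eta):
    """Eq. (9) for m = 2: Z(k+1) = Z(k) + eta(k) * x_H x_H^T."""
    return Z + eta * np.outer(xh, xh)

rng = np.random.default_rng(1)
Z = np.zeros((3, 3))
for k in range(500):
    x = rng.normal([1.0, 2.0], 0.1)            # stream from one stable cluster
    Z = hebbian_update(Z, np.append(x, 1.0), eta=1.0 / (k + 1))

mean_est = Z[-1, :2] / Z[-1, -1]               # weighted mean of the inputs
print(np.round(mean_est, 2))                   # close to [1. 2.]
```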
In principle, continuous updating of the covariance tensor may create instability in
a competitive environment, as a winner neuron becomes more and more dominant. To
force competition, the covariance tensor can be normalized using any of a number of
factors dependent on the application, such as the number of signals assigned to the
neuron so far, or the distribution of data among the neurons (forcing uniformity). Some
basic criteria are discussed in (Lipson and Siegelmann, 1999).
5 Implementation
The proposed neuron has been implemented both in an unsupervised and in a
supervised setup. High order tensor multiplications have been performed in practice
by ‘flattening out’ the tensors using Kronecker’s tensor product notation. Briefly,
unsupervised learning is attained by letting randomly initialized neurons compete
over input data, using the Hebbian learning and ‘winner takes all’ principles
described earlier. Supervised learning is attained by training one neuron per input
class with inputs of that class only, and then cross-validating the results. The precise
implementation is described in (Lipson and Siegelmann, 1999). Below we
demonstrate some results.
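The supervised setup can be sketched deterministically (our reconstruction, not the authors' MATLAB implementation; the two synthetic classes stand in for real data). One second-order neuron is fitted per class, and test points are assigned by the matching criterion of Eq. (3):

```python
import numpy as np

rng = np.random.default_rng(2)
train0 = rng.normal([0.0, 0.0], 0.3, size=(100, 2))   # class 0
train1 = rng.normal([4.0, 4.0], 0.3, size=(100, 2))   # class 1

def fit_neuron(points):
    """Train one 2nd-order neuron on one class: accumulate R_H (Eq. 2)."""
    X = np.hstack([points, np.ones((len(points), 1))])  # homogeneous coords
    return X.T @ X / len(points)

neurons = [fit_neuron(train0), fit_neuron(train1)]

def classify(x):
    """Winner-takes-all matching of Eq. (3)."""
    xh = np.append(x, 1.0)
    return int(np.argmin([xh @ np.linalg.inv(Z) @ xh for Z in neurons]))

print(classify(np.array([0.2, -0.1])), classify(np.array([3.8, 4.1])))  # 0 1
```

The unsupervised variant replaces the per-class fit with the competitive loop of Section 4: present each input, select the winner by the same criterion, and apply the Hebbian increment of Eq. (9) to the winner only.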
Figure 3(a) shows three 2nd-order neurons (ellipsoids) that self-organized to
decompose a point set into three natural groups. Note the correct decomposition
despite the significant proximity and partial overlap of the clusters, a factor which
usually ‘confuses’ classic networks. Figures 3(b, c) show decompositions of point sets
using 3rd-order neurons, capable of modeling the data with more complex shapes than
mere ellipses. Note how the direct determination of the area of overlap permits
explicit modeling of uncertainty or ambiguity in the data domain, a factor which is
crucial for symbolic understanding. Figure 4 shows three instances of different data
topologies, modeled using three 2nd-order neurons.
Fig. 3. Self-classification of point clusters: (a) 2nd-order (elliptical), (b, c) 3rd-order.
Fig. 4. Point clusters with various topologies and their interpretation: (a) string, (b) hole, (c)
fork. The analysis shown was completed after a single pass.
The geometric neurons have also been tested on the 4-dimensional IRIS data
benchmark (Anderson, 1939), both in unsupervised and in supervised modes. Out of
150 flowers, the networks misclassified 3 (unsupervised) and 0 (supervised) flowers,
respectively. Supervised learning was achieved by training each neuron on its
designated class separately with 20% cross-validation. Tables 1 and 2 summarize the
results.
Table 1. Comparison of self-classification results for IRIS data. (Blank cells correspond to
unavailable data.)

    Method                                    Epochs   # misclassified or unclassified
    Super Paramagnetic (Blatt et al, 1996)             25
    LVQ / GLVQ (Pal et al, 1993)              200      17
    K-Means (Mao and Jain, 1996)                       16
    HEC (Mao and Jain, 1996)                           5
    2nd-order Unsupervised                    20       4
    3rd-order Unsupervised                    30       3
Table 2. Supervised classification results for IRIS data. Our results are after one training epoch,
with 20% cross-validation, averaged over 250 experiments. (Blank cells correspond to
unavailable data.)

    Order                                     Epochs   # misclassified
                                                       Average   Best
    3-Hidden-Layer N.N. (Abe et al, 1997)     1000     2.2       1
    Fuzzy Hyperbox (Abe et al, 1997)          2
    Fuzzy Ellipsoids (Abe et al, 1997)        1000               1
    2nd-order Supervised                      1        3.08      1
    3rd-order Supervised                      1        2.27      1
    4th-order Supervised                      1        1.60      0
    5th-order Supervised                      1        1.07      0
    6th-order Supervised                      1        1.20      0
    7th-order Supervised                      1        1.30      0

6 Conclusions and Further Research
In this paper, we discussed the use of high-order geometric neurons and demonstrated
their practical use for modeling the principal structure of spatial distributions.
Although high-order neurons do not directly correspond to neurobiological details, we
believe that they can provide powerful symbolic modeling capabilities. In particular,
they exhibit useful properties for correctly handling partially overlapping clusters, an
occurrence that may represent a key symbolic property. Moreover, when an entire
shape or ‘rule’ is encoded into a single neuron, it is easy to find a minimal set of ‘key
examples’ that can be used to induce the rule in a learning system. For example, the
eigentensors of the synaptic tensor appear to be such a set.
The use of geometric neurons raises some further practical questions, which have
not been addressed in this paper, such as selecting the number of neurons for a
particular task (network size), the cost implication of the increased number of degrees
of freedom, and network initialization. It is also necessary to investigate the
relationship between the values of the synaptic tensor and the shapes it may acquire.
Acknowledgements
We thank Eitan Domani for his helpful comments and insight. This work was
supported in part by the U.S.-Israel Binational Science Foundation (BSF), by the
Israeli Ministry of Arts and Sciences, and by the Fund for Promotion of Research at the
Technion. Hod Lipson acknowledges the generous support of the Charles Clore
Foundation and the Fischbach Fellowship.
Further information, MATLAB demos, and implementation details of this work can be
downloaded from http://www.cs.brandeis.edu/~lipson/papers/geom.htm
High Order Eigentensors as Symbolic Rules in Competitive Learning
297
References
Abe, S. and Thawonmas, R., 1997, “A fuzzy classifier with ellipsoidal regions”, IEEE
  Transactions on Fuzzy Systems, 5/3, pp. 358-368
Anderson, E., 1939, “The irises of the Gaspe Peninsula”, Bulletin of the American Iris
  Society, Vol. 59, pp. 2-5
Bishop, C. M., 1997, Neural Networks for Pattern Recognition, Clarendon Press, Oxford
Blatt, M., Wiseman, S. and Domany, E., 1996, “Superparamagnetic clustering of data”,
  Physical Review Letters, 76/18, pp. 3251-3254
Davé, R. N., 1989, “Use of the adaptive fuzzy clustering algorithm to detect lines in digital
  images”, in Proc. SPIE Conf. Intell. Robots and Computer Vision, SPIE Vol. 1192, No. 2,
  pp. 600-611
Duda, R. O. and Hart, P. E., 1973, Pattern Classification and Scene Analysis, Wiley, New York
Faux, I. D. and Pratt, M. J., 1981, Computational Geometry for Design and Manufacture, John
  Wiley & Sons, Chichester
Frigui, H. and Krishnapuram, R., 1996, “A comparison of fuzzy shell-clustering methods for
  the detection of ellipses”, IEEE Transactions on Fuzzy Systems, 4/2, pp. 193-199
McLachlan, G. J. and Krishnan, T., 1997, The EM Algorithm and Extensions,
  Wiley-Interscience, New York
Gnanadesikan, R., 1977, Methods for Statistical Data Analysis of Multivariate Observations,
  Wiley, New York
Graham, A., 1981, Kronecker Products and Matrix Calculus: with Applications, Wiley,
  Chichester
Gustafson, E. E. and Kessel, W. C., 1979, “Fuzzy clustering with fuzzy covariance matrix”, in
  Proc. IEEE CDC, San Diego, CA, pp. 761-766
Haykin, S., 1994, Neural Networks: A Comprehensive Foundation, Prentice Hall, New Jersey
Kavuri, S. N. and Venkatasubramanian, V., 1993, “Using fuzzy clustering with ellipsoidal units
  in neural networks for robust fault classification”, Computers Chem. Eng., 17/8, pp. 765-784
Kohonen, T., 1997, Self-Organizing Maps, Springer-Verlag, Berlin
Krishnapuram, R., Frigui, H. and Nasraoui, O., 1995, “Fuzzy and probabilistic shell clustering
  algorithms and their application to boundary detection and surface approximation - Parts I
  and II”, IEEE Transactions on Fuzzy Systems, 3/1, pp. 29-60
Lipson, H. and Siegelmann, H. T., 1999, “Clustering irregular shapes using high-order neurons”,
  Neural Computation, accepted for publication
Mao, J. and Jain, A., 1996, “A self-organizing network for hyperellipsoidal clustering (HEC)”,
  IEEE Transactions on Neural Networks, 7/1, pp. 16-29
Pal, N., Bezdek, J. C. and Tsao, E. C.-K., 1993, “Generalized clustering networks and
  Kohonen’s self-organizing scheme”, IEEE Transactions on Neural Networks, 4/4, pp. 549-557
Pan, J. S., McInnes, F. R. and Jack, M. A., 1996, “Fast clustering algorithms for vector
  quantization”, Pattern Recognition, 29/3, pp. 511-518
Holistic Symbol Processing and the Sequential
RAAM: An Evaluation
James A. Hammerton1 and Barry L. Kalman2
1
School of Computer Science⋆ ⋆ ⋆ ,
The University of Birmingham, UK
james.hammerton@ucd.ie
2
Department of Computer Science,
Washington University, St Louis
barry@cs.wustl.edu
Abstract. In recent years connectionist researchers have demonstrated
many examples of holistic symbol processing, where symbolic structures
are operated upon as a whole by a neural network, by using a connectionist compositional representation of the structures. In this paper the
ability of the Sequential RAAM (SRAAM) to generate representations
that support holistic symbol processing is evaluated by attempting to
perform Chalmers’ syntactic transformations using it. It is found that the
SRAAM requires a much larger hidden layer for this task than the RAAM
and that it tends to cluster its hidden layer states close together, leading
to a representation that is fragile to noise. The lessons for connectionism
and holistic symbol processing are discussed and possible methods for
improving the SRAAM’s performance suggested.
1 Introduction
Recently, several connectionist techniques have been developed for representing
compositional structures, such as Pollack’s RAAM [15], Plate’s HRRs [14] and
Callan and Palmer-Brown’s (S)RAAM [3]. Connectionist researchers have demonstrated that some of these techniques support holistic symbol processing1
[4, 5, 8], where a symbol structure can be operated upon holistically in contrast
to the symbol by symbol manipulations performed by traditional symbolic systems. For example, Chalmers [4] has performed the transformation of simple
active sentences into their passive equivalents, holistically. Other examples of
holistic symbol processing include Niklasson and Sharkey’s work [13] on transforming logical implications into their equivalent disjunctions, Blank et al.’s work
[2] on transforming sentences of the form X chases Y to the form Y flees X and
Weber’s constant-time term unification [19].
⋆⋆⋆ The first author has now moved to the Department of Computer Science, University
College Dublin, Ireland.
1 Much of the literature refers to this as holistic computation. Hammerton [7] argues
that the focus is on holistic symbol processing rather than holistic computation per
se.
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 298–312, 2000.
c Springer-Verlag Berlin Heidelberg 2000
For holistic symbol processing to progress from an interesting possibility to
a phenomenon that can be exploited to enhance the performance of intelligent
systems, it is necessary to develop an understanding of how the techniques work,
what their limitations are and what it is that enables them to support holistic
computation. Developing such an understanding may indicate ways of overcoming any limitations and may hold lessons for the development of other techniques. This paper2 presents an evaluation of the ability of the Sequential RAAM
(SRAAM) [12, 15] to generate representations that support holistic symbol processing, by attempting to recreate Chalmers’ active to passive transformations
[4] using the SRAAM in place of the RAAM, and then analysing its performance
to understand how it is performing the task and whether the technique has any
limitations.
2 Syntactic Transformations with the SRAAM

2.1 Why the SRAAM was Chosen
Many of the connectionist representations that have been developed could be
used for this work as they support holistic symbol processing to some degree.
The SRAAM was chosen for various reasons:
– Its ability to generate representations that support holistic symbol processing
was not clear from the literature and thus required clarification. The more
complex tasks for which it has been used involved confluent inference [5, 9].
Hammerton [8, 7] argues that confluent inference is not a form of holistic
symbol processing but rather a separate phenomenon. The work of Blank
et al. [2] does involve holistic symbol processing but the tasks involved are
simpler than many of the tasks for which the RAAM has been used. Kwasny
and Kalman [12] train SRAAMs to encode relatively complex structures but
their attempts to perform holistic operations on those structures met with
ambiguous results.
– Kwasny and Kalman’s work [12] suggests that the SRAAM is easier to train,
offers higher levels of generalisation and is more flexible in the range of
structures it can represent than the RAAM. It is easier to train because the
training scheme is a simple modification of the algorithm used to train simple
recurrent networks (SRNs) and does not require an external memory to store
intermediate results. The SRAAM achieved higher levels of generalisation.
When trained on 30 sequences out of a set of 183, it could correctly encode
the entire set. Chalmers [4] trained a RAAM to encode a set of simple active
and passive sentences. When trained on 80 of these and tested on a further
80 novel sentences, it made errors on 13 of the novel sentences. Finally the
SRAAM is more flexible because it can represent arbitrarily branching trees
rather than the fixed branching trees of the RAAM. If these advantages
2 This paper forms a summary of the work reported in Chapters 3, 4 and 5 of Hammerton's thesis [7].
could be combined with effective support for holistic symbol processing then
the SRAAM would be a strong candidate as a vehicle for holistic symbol
processing.
– The SRAAM is closer to standard connectionist models than the (S)RAAM
[3] or HRRs [14], both of which appear to support holistic symbol processing at least as effectively as the RAAM if not more so. Neither (S)RAAM
nor HRRs employs error minimisation, nor do they utilise the networks of interconnected units favoured by connectionists. The SRAAM can be trained
with standard techniques and is a simple variation on the SRN [12].
– Finally the Labelling RAAM [17, 18] is rather different in its operation compared to other methods due to the practice of turning it into a bi-directional
associative memory. The production of reduced descriptions of symbol structures for use with other networks is thus not as natural with the LRAAM
as with other methods. Furthermore there has already been extensive investigation of its properties and it is thus well understood.
2.2 Syntactic Transformations: A Benchmark
Chalmers’ task involving active to passive transformations was chosen as this is
a simple task on which the RAAM performed well, and it provides a benchmark for
other techniques. Where the sentences used by Chalmers were represented by
ternary trees to be encoded by the RAAM, here they are encoded in the SRAAM
as sequences. In the experiments reported here, 3 training regimes were used:
– Back-propagation with sigmoidal units. Back-propagation is used to train
SRAAMs employing sigmoidal units and the sum of squares error function (SSE).
– Back-propagation with hyperbolic tangent units. Back-propagation is used to
train SRAAMs employing units that use the hyperbolic tangent activation
function. The Kalman-Kwasny error function (KKE) [10, 12] is employed
with these networks. This leads to higher error values for the same data.
Kalman and Kwasny claim improved performance with this form of training
compared to standard back-propagation.
– Kwasny and Kalman’s training regime with hyperbolic tangent units. Kwasny
and Kalman’s work employs a variant of conjugate gradient training with
the error derivatives recomputed to take into account the recurrence in the
SRAAM and the fact that the target output patterns change over time. They
claim superior performance of this training method over back-propagation.
Additionally, with each training regime, networks with hidden layers of 12 or
39 hidden units were employed, and the experiments were performed using both
the representation of symbols used by Chalmers (Table 1), referred to here
as the “2 unit representation”, and a set of orthogonal patterns (Table 2), referred
to as the “1 unit representation”.
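The SRAAM mechanics described above (fold a sequence of symbols into one fixed-width hidden vector with SRN-style weights; unfold by repeatedly splitting the decoder output into a symbol slot and a previous-state slot) can be sketched in numpy. The weights here are random and untrained, so the decoded symbols are meaningless; the sketch shows only the data flow, and every name in it is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

N_SYM, N_HID = 13, 12            # 13 symbols, 12 hidden units (a 25-12-25 net)
W_enc = rng.normal(scale=0.1, size=(N_HID, N_SYM + N_HID))   # encoder weights
W_dec = rng.normal(scale=0.1, size=(N_SYM + N_HID, N_HID))   # decoder weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(sequence):
    """Fold a sequence of one-hot symbol vectors into one hidden vector."""
    h = np.zeros(N_HID)                       # empty start state
    for sym in sequence:
        h = sigmoid(W_enc @ np.concatenate([sym, h]))
    return h

def decode(h, length):
    """Unfold: each step yields a symbol slot and the previous hidden state."""
    symbols = []
    for _ in range(length):
        out = sigmoid(W_dec @ h)
        symbols.append(int(np.argmax(out[:N_SYM])))   # read symbol slot
        h = out[N_SYM:]                               # read previous state
    return symbols[::-1]                              # decoded last-in, first-out

# "john love michael" as one-hot vectors (the 1 unit representation).
sentence = [np.eye(N_SYM)[i] for i in (0, 5, 1)]
code = encode(sentence)
print(code.shape, decode(code, 3))
```

A trained SRAAM would adjust W_enc and W_dec so that decode(encode(s), len(s)) reproduces s; the 25-12-25 architecture in Table 3 corresponds to N_SYM + N_HID = 25 visible units and 12 hidden units.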
Table 1. The 2 unit representation of the symbols used by Chalmers (john, michael,
helen, diane, chris, love, hit, betray, kill, hug, is, by, nil). For SRAAMs employing
hyperbolic tangent units, replace “0”s with “-1”s. [13 x 13 binary matrix: each symbol
activates 2 of the 13 units, one indicating its broad category (noun, verb, or “is”/“by”)
and one distinguishing the individual symbol; the full matrix is not reproduced here.]
Table 2. The 1 unit representation of the symbols used by Chalmers. For SRAAMs
employing hyperbolic tangent units, replace “0”s with “-1”s. Each of the 13 symbols
(john, michael, helen, diane, chris, love, hit, betray, kill, hug, is, by, nil) activates
exactly one of the 13 units: john activates unit 1, michael unit 2, and so on through
nil, which activates unit 13, giving a mutually orthogonal set of patterns.

2.3 Results
Tables 3 and 4 summarise the results of training the SRAAM to encode and
decode 130 of the sentences used by Chalmers, 65 active sentences and their
passive equivalents. The remaining 120 sentences were used as a testing set to
indicate generalisation. The results reported here for back-propagation are the
best achieved for each network after a range of values for the momentum and
learning rate were tried. The range used for both values was 0.01 to 0.9, and
different combinations of values in these ranges were tried. Kwasny and Kalman’s
simulator employs an adaptive step size so these parameters do not exist in their
training.
The “Method” column indicates which training regime was used. BP SSE is
back-propagation with sigmoidal units and the sum of squares error. BP KKE is
Table 3. Various attempts to learn to encode and decode 130 sequences from Chalmers’
data set, using the 2 unit representation.

Method   Network    Iterations       Error    % train   E/seq Train   % test   E/seq Test
BP SSE   25-12-25   1350             98.87    0.8       1.80          1.7      1.83
BP SSE   52-39-52   1520             71.28    42.3      0.96          21.7     1.19
BP KKE   25-12-25   570              492.80   3.8       2.65          0.0      2.85
BP KKE   52-39-52   320              243.71   47.7      0.59          50.8     0.64
KK       25-12-25   D:8310 F:80587   1.23     46.9      0.53          45       0.55
KK       52-39-52   D:6950 F:71384   0.04     85.3      0.15          87.5     0.12
0.12
Table 4. Various attempts to learn to encode and decode 130 sequences from Chalmers’
data set, using the 1 unit representation.

Method   Network    Iterations       Error    % train   E/seq Train   % test   E/seq Test
BP SSE   25-12-25   2400             123.05   3.1       1.81          0.0      2.09
BP SSE   52-39-52   1820             131.38   32.3      0.92          17.5     1.22
BP KKE   25-12-25   120              535.85   0.0       2.37          0.0      2.40
BP KKE   52-39-52   270              243.71   23.8      0.90          15.8     1.16
KK       25-12-25   D:7679 F:92470   0.81     31.5      0.85          20.8     0.88
KK       52-39-52   D:7153 F:71545   0.06     85.3      0.15          50.0     0.50
back-propagation with hyperbolic tangent units and the Kalman-Kwasny error.
KK is Kwasny and Kalman’s training method which employs hyperbolic tangent
units. The “Network” column gives the network architecture for each SRAAM.
The “Iterations” column indicates how many iterations were used in training.
With Kwasny and Kalman’s method there are two types of iteration, one where
derivatives are computed (D) and one where a line search is employed (F). The
“Error” column shows the minimum error achieved during training. The “%
train” column indicates the percentage of the training set correctly encoded and
decoded at the end of training. The “E/seq Train” shows the number of errors
per sequence produced by the encoding and decoding process for the training
set. The “% test” column indicates the percentage of the testing set correctly
encoded and decoded. The “E/seq Test” column shows the errors per sequence
produced by encoding and decoding the testing set.
None of the networks learned to encode and decode the entire training set
without error, despite employing hidden layers of up to 39 units, three times
larger than that used with Chalmers’ RAAM. The error levels in the testing
set were generally higher than in the training set, although the error levels for
set were generally higher than in the training set, although the error levels for
the testing set were closer to those for the training set when Kwasny and Kalman’s training was used, suggesting greater generalisation power for this method
compared to back-propagation. Kwasny and Kalman’s method consistently outperformed back-propagation, confirming their claims of superior training. The use of the hyperbolic tangent function improved the performance for
back-propagation when using the 2 unit representation, but made it worse when
using the 1 unit representation. The convergence of back-propagation was faster with the hyperbolic tangent units than with sigmoidal units, as Kwasny and
Kalman suggested. Finally, the 2 unit representation generally led to better
performance than the 1 unit representation, suggesting that the use of units to
indicate whether a symbol is a verb, a noun, or “is” or “by” in the 2 unit representation may have helped training. To test this hypothesis, SRAAMs with 39 hidden
units employing a randomized 2 unit representation, in which 2 of the 13 units
were selected at random and set to “1”, were trained. The errors were higher
than in either of the above two cases, consistent with the hypothesis.
Attempts to extend training beyond the points reported in these tables resulted in the error rising and then oscillating chaotically with back-propagation.
With Kwasny and Kalman’s training, restarting the training at the point where
it stopped, with lower tolerances, simply resulted in the training terminating immediately, suggesting it could not improve matters any further.
Finally, it was found that a transformation network could be trained to perform the active to passive transformations on the imperfectly learned encodings
from the above experiment without inducing extra errors on top of those produced by the encoding and decoding process itself. This suggests that holistic
symbol processing can be supported if the resources are available to achieve
error-free encoding and decoding of the sequences.
3 Analysing the Performance of the SRAAM

3.1 Determining the Cause of Failure
The failure of the SRAAM to learn to encode and decode Chalmers’ sentences
correctly was unexpected given the success other researchers have had. This
failure may have occurred for any of several reasons. It may be that the solution
does not exist. A proof it does exist in this case for SRAAMs of 20 or more
hidden units is presented in Chapter 4 of Hammerton’s thesis [7]3 . It may be
that the error landscape is rugged, making it difficult for back-propagation or
conjugate gradient to find the global optimum. Another possibility is that the
representations being developed interfere with each other, hindering training.
The analysis summarised here was aimed at indicating whether the problem lay
in features of the error landscape or in the way the SRAAM was organising its
hidden-layer states. Thus analyses were performed on both the weights and the
representations developed by the SRAAM in order to try and get a complete
picture of what is going on. The analysis consisted of the following parts:
– Determining the difficulty of finding the solution. This was done by determining how much noise needed to be added to solutions derived by hand
from the existence proof to prevent subsequent training from finding them
again, and by attempting to train the SRAAM using simulated annealing,
a more global optimisation method.
3 This does not mean a solution does not exist for smaller networks.
– Determining the sensitivity of the network to noise in the weights. Every
weight in the network has a percentage of its value either added or subtracted
at random, unless it is zero in which case a small constant is added or
subtracted instead. Then the errors produced in encoding and decoding are
compared to the errors produced without the noise being added.
– Determining the sensitivity of the network to noise in the hidden-layer patterns. The method used is an adaptation of the single-unit lesioning employed
by Balogh [1] to determine the distributedness of the representations produced by Chalmers’ RAAM. Balogh’s method takes the encoding of a sentence,
sets one of the units in the encoding to zero, and then decodes the sentence.
The errors are then compared with the errors produced with the original encoding. This is adapted here so that instead of setting the value of the unit
to zero, the unit is set to a percentage of the original value. By varying this
percentage, one can determine both the sensitivity to noise and the nature
of the representations developed by the SRAAM.
– Determining how the SRAAM encodings cluster. Hierarchical Cluster Analysis (HCA) is performed on the hidden-layer encodings produced for each
sentence.
– Determining how the encoding and decoding trajectories are organised. The
method employed here is to train a self-organising map [11] on the hiddenlayer states produced during encoding and decoding the sequences and then
plot the trajectories taken when encoding and decoding the sequences on the
map.
– Determining the range of activations generated during encoding and decoding. This indicates the hypercube into which all the hidden-layer states are
packed and thus the extent to which the available hyperspace is utilised.
– Determining the range, average and average magnitude of the weights. If
the magnitude of the weights is very high this may indicate a local optimum
which is difficult to escape from due to large weight changes being required.
– The HCA, analysis of the range, average and magnitude of the weights, the
plotting of the encoding and decoding trajectories and the range of activations generated were all performed at three points during training to see if
any trends would occur.
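The single-unit lesioning step above can be sketched as follows. The decoder and targets here are toy stand-ins (a simple thresholding "decoder"), not the trained SRAAM used in the experiments; the sketch shows only how induced errors are counted per lesioned unit, and all names are hypothetical:

```python
import numpy as np

def count_errors(decoded, target):
    """Number of positions where the decoded sequence differs from the target."""
    return sum(d != t for d, t in zip(decoded, target))

def lesion_analysis(encodings, targets, decode, scale=0.5):
    """For each hidden unit in turn, scale that unit of every encoding to
    `scale` times its value and count the extra decoding errors induced."""
    n_units = encodings.shape[1]
    extra = np.zeros(n_units)
    for enc, target in zip(encodings, targets):
        baseline = count_errors(decode(enc), target)
        for u in range(n_units):
            lesioned = enc.copy()
            lesioned[u] *= scale              # perturb one unit only
            extra[u] += count_errors(decode(lesioned), target) - baseline
    return extra / len(encodings)             # mean induced errors per unit

# Toy stand-in decoder: threshold each unit; targets are the unperturbed
# decodings, so every counted error is induced by the lesion itself.
decode = lambda enc: list((enc > 0.5).astype(int))
encodings = np.array([[0.9, 0.51, 0.1], [0.6, 0.8, 0.49]])
targets = [decode(e) for e in encodings]
print(lesion_analysis(encodings, targets, decode, scale=0.5))  # → [1. 1. 0.]
```

Varying `scale` corresponds to varying the perturbation percentage, which is how both the noise sensitivity and the distributedness of the representations were probed.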
3.2 Results
This section summarises the results of the analysis. More detail can be found in
Chapters 4 and 5 of Hammerton’s thesis.
Using Simulated Annealing to Train. Initially intended as a preliminary
run, a 26-13-26 SRAAM was trained on 60 of the sequences using simulated
annealing for 1.5 million iterations, taking 2 weeks to run on a 300 MHz Sun
Ultra 2. The network had failed to find a solution, and an attempt to take the
resulting network and train it further using back-propagation also failed to find
a solution. As this was a smaller network and data set than used above, it was
felt that there was not enough time to repeat this with the full data set or
larger networks, given the length of training involved. Thus the result is only indicative of
a solution that is difficult to find.
Starting from Hand-Derived Solutions with Noise Added. With the
hand-derived solutions, it was found that 10% noise was usually sufficient to
prevent back-propagation from finding them again, and that 50% noise was sufficient
in the case of Kwasny and Kalman’s method. This suggests that the hand-derived solutions lie in a fairly small area of weight space for back-propagation,
and in a somewhat larger, but apparently still inaccessible, area for Kwasny and
Kalman’s method.
Adding Noise to Partial Solutions Found in Training. The SRAAMs
trained by back-propagation showed serious degradation in performance when
10% noise was added and complete failure to encode and decode the sequences
when 50% noise was added. With Kwasny and Kalman’s method, 1% noise was
sufficient to cause serious degradation in the performance, with 10% resulting
in complete failure. Thus it can be seen that the partial solutions developed by
the SRAAMs were highly sensitive to noise in the weights, with Kwasny and
Kalman’s method being more sensitive. This is suggestive of the networks being
caught in a local optimum.
Representational Lesioning. Figures 1 and 2 show the results of perturbing
any single unit by 10% and 50% respectively, for the network trained by back-propagation using sigmoidal units, employing the 1 unit representation and a 39
unit hidden layer. The errors are counted for each “slot”. For active sentences,
slots 0 to 2 are the first, second and third words respectively. For passive sentences, slot 0 is the first word, slot 1 is the second and third word and slot 2 is the
third and fourth word. This corresponds to the slots used in Balogh’s analysis
of the RAAM representations [1] developed in Chalmers’ experiments, except
Balogh numbers them 1 to 3 (where here it is 0 to 2), thus allowing comparison
with the results he found. Slot 3 here corresponds to the length. As can be seen,
perturbing any single unit in the encodings of the sequences with 10% noise led
to a noticeable degradation in performance, with up to 5% or so errors occurring
in any slot and errors usually occur in multiple slots for any particular unit. Recall (Section 3.1) that these are errors produced in addition to those produced
in the encoding and decoding process without perturbing the hidden-layer patterns. 50% noise led to serious degradation, with up to 20% error occurring in any
slot and errors usually occurring across all the slots regardless of which unit was
lesioned. The pattern of induced errors was roughly similar regardless of which
unit was perturbed, with more deeply embedded symbols being more likely to
be corrupted than less deeply embedded symbols. This was suggestive of a distributed representation being developed that was sensitive to noise. This pattern
of errors was repeated for the networks trained with the 2 unit representation
and also for the networks trained with the hyperbolic tangent units.
Fig. 1. The results of perturbing each unit in the encodings developed by the SRAAM
trained using back-propagation, sigmoidal units, the 1 unit representation and 10%
noise.
Fig. 2. The results of perturbing each unit in the encodings developed by the SRAAM
trained using back-propagation, sigmoidal units, the 1 unit representation and 50%
noise.
Figures 3 and 4 show the results of perturbing any single unit by 1% and
10% respectively in the networks trained by Kwasny and Kalman’s training regime, with a 20 unit hidden layer and the 1 unit representation. As can be seen,
1% perturbation of any single unit can lead to as much as 40% error across multiple slots, with 10% perturbations sometimes yielding 100% error across multiple
slots. Again, most of the time errors occur in multiple slots and more deeply
embedded information is the worst affected; however, it is clear that the representations developed here are far more sensitive to error than those developed
with back-propagation. This suggests that Kwasny and Kalman’s method produced highly fragile representations, despite the better encoding and decoding
performance it achieved. Finally, a point worth noting here is that the networks
were developing distributed yet fragile representations, in contradiction to the
common assumption that distributed representations are robust representations.
These results show that the assumption need not necessarily hold.
Hierarchical Cluster Analysis. The results from the hierarchical cluster analysis showed that the networks clustered the sequences by two strategies: by the
value of the last symbol to be encoded, and by whether the sequence was an
active or a passive sentence. With the back-propagation networks the latter
strategy would dominate; with Kwasny and Kalman’s networks, the former.
There seemed to be no particular trend during training, suggesting that the
strategies are settled on early in training and then refined.
Fig. 3. The results of perturbing each unit in the encodings developed by the SRAAM
trained using Kwasny and Kalman’s method, hyperbolic tangent units, the 1 unit
representation and 1% noise.
Fig. 4. The results of perturbing each unit in the encodings developed by the SRAAM
trained using Kwasny and Kalman’s method, hyperbolic tangent units, the 1 unit
representation and 10% noise.
Encoding and Decoding Trajectories. Figure 5 shows the trajectories produced for the SRAAM trained on 130 sequences with back-propagation, sigmoidal units and the 1 unit representation. The trajectories are plotted on a 25x25
Kohonen map. The numbers labelling each point indicate which position in a
sequence that point corresponds to, i.e. 0 being the start, 1 the next symbol
and so on. This figure exemplifies the general finding for the back-propagation
networks: the hidden-layer patterns for the same position in different sequences
cluster together, which explains the sensitivity of the patterns to noise in the
hidden layer, since the final encodings are clustered together. With Kwasny and
Kalman’s method, the Kohonen maps would show just 2 points, suggesting that
the hidden-layer states are clustered into 2 tightly packed groups. This explains
the greater sensitivity to noise in the hidden-layer patterns on the part of the
networks trained by Kwasny and Kalman’s method. There was no identifiable
trend in the organisation of the trajectories during training.

Range of Activations Produced. The range of activations produced during encoding and decoding was wider for back-propagation than for Kwasny
and Kalman’s method. There was, however, no identifiable trend in this during
training.
Fig. 5. Encoding and decoding trajectories for the SRAAM trained on 130 sequences
with the 1 unit representation, back-propagation and sigmoidal units at the end of
training. Active sentences on the left, passive on the right.
4 Discussion
The broad picture emerging from the analysis is of solutions that appear difficult
to find, partial solutions that are sensitive to noise in the weights, hidden-layer
states that are clustered closely together and thus sensitive to noise, and a strategy for organising the hidden-layer states that is settled on early in training and
then subsequently refined. It should also be noted that the hand-derived solutions had well spread out hidden-layer states. This suggests that if some way could
be found to keep the hidden-layer states well separated, improved training might
result. This seems at first sight to be contradicted by the performance of Kwasny
and Kalman’s method, which is superior to that of back-propagation, yet packs
the hidden-layer states much more tightly together than back-propagation. This
can be reconciled by observing that the basic strategy is decided early on in
training and then refined. It may simply be that Kwasny and Kalman’s method is better
at refining the strategy than back-propagation, and that if the strategy can be
improved the training may be improved. This is backed up by the superior performance of Kwasny and Kalman’s method on finding the solution when noise
has been added to a hand-derived network. To test the hypothesis, the range
of activations was computed for the hand-derived solution with 10% noise after
training with Kwasny and Kalman’s method. This run had found an error free
solution again and the range of activations produced was far wider than normal
training produced. Unfortunately time was not available for further investigation of this hypothesis such as a direct attempt to create a training method that
would try and keep the hidden-layer patterns well separated.
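One way the hypothesis could be pursued is to add an auxiliary penalty to the training objective that punishes pairs of hidden-layer states lying closer together than some margin. The sketch below is not the method investigated in this chapter, only a minimal NumPy illustration of such a penalty; the margin value and the toy state vectors are invented.

```python
import numpy as np

def separation_penalty(h, margin=0.5):
    """Penalty that grows as hidden-layer state vectors crowd together.

    h: (n, d) array of hidden-layer activation vectors.
    Each pair closer than `margin` (Euclidean) contributes (margin - dist)^2.
    """
    n = h.shape[0]
    diffs = h[:, None, :] - h[None, :, :]      # (n, n, d) pairwise differences
    dists = np.sqrt((diffs ** 2).sum(-1))      # (n, n) pairwise distances
    iu = np.triu_indices(n, k=1)               # count each pair once
    gaps = np.maximum(0.0, margin - dists[iu])
    return float((gaps ** 2).sum())

# Toy hidden-layer states: tightly clustered vs well spread out.
clustered = np.array([[0.50, 0.50], [0.51, 0.50], [0.50, 0.52]])
spread    = np.array([[0.05, 0.05], [0.95, 0.05], [0.50, 0.95]])

assert separation_penalty(clustered) > separation_penalty(spread)
```

Added to the error function with a small weight, a term like this would push hidden-layer states apart during training, in the spirit of the context-biasing idea discussed in the conclusion.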
5 Conclusion and Further Work
The main conclusion to draw from this work is that the SRAAM in its current
form is not an effective vehicle for holistic symbol processing, though, judging
by the success others have had with confluent inference [5, 9], that technique
may yet improve its performance. Not only did the SRAAM fail to learn a task
the RAAM has little trouble with, despite being given a large hidden layer,
but it has been found that there are features of the SRAAM’s behaviour that
may be problematic for training generally. The close clustering of hidden-layer
patterns can lead to interference and sensitivity to noise, whilst the sensitivity to
noise in the weights suggests that solutions and partial solutions occupy
small areas of weight space, making them difficult to find. However, if a way can
be found of keeping the hidden-layer patterns separated, performance may
improve accordingly. It may be, for example, that Robert French’s context-biasing
technique [6], which aims to do exactly this in feedforward networks, could be
adapted for use with the SRAAM.
There are, however, more general lessons. The main one is that techniques for
producing connectionist representations of compositional structures need to be
analysed thoroughly, to see whether problems exist in the way they create their
representations. It is also important that such analyses examine both the weights
produced during training and the encodings produced, rather than either alone,
since there may be indications of trouble in either. Much of the work on the RAAM
and its derivatives looks solely at the representations produced, though there are
exceptions (such as the work on the LRAAM). Further work is thus needed to
develop a full understanding of these techniques; it should include formal analyses
where appropriate, and should examine factors such as how the use of confluent
inference affects the behaviour of the techniques and how the nature of the task
affects their performance and behaviour.
Acknowledgements
The authors wish to thank Peter Hancox, the first author’s supervisor, for his
guidance in this work; Russell Beale, Ela Claridge and Riccardo Poli for their
help, advice and discussions of this work; Stan Kwasny for answering questions
Holistic Symbol Processing and the Sequential RAAM: An Evaluation
311
about his work and for reading a draft of this paper; Jean Hammerton for proofreading a draft of this paper, and John Barnden for his interest in and discussions
of this work. The back-propagation simulations and the analyses presented here
were all run with the PDP++ neural network simulator and the authors wish to
thank the maintainers of PDP++ and the contributors to the PDP++ discussion
list for help in setting up and using PDP++. The training runs employing
Kwasny and Kalman’s training schedule were performed using the simulator
written by Barry Kalman. This work was supported by a research studentship
from the School of Computer Science, The University of Birmingham, UK.
References
[1] I. L. Balogh. An analysis of a connectionist internal representation: Do RAAM
networks produce truly distributed representations? PhD thesis, New Mexico State
University, 1994.
[2] D. S. Blank, L. A. Meeden, and J. B. Marshall. Exploring the symbolic/subsymbolic continuum: A case study of RAAM. In J. Dinsmore, editor, Symbolic and Connectionist Paradigms: Closing the Gap, pages 113–148. Lawrence
Erlbaum Associates, Hillsdale, NJ, 1992.
[3] R.E. Callan and D. Palmer-Brown. (S)RAAM: An analytical technique for fast
and reliable derivation of connectionist symbol structure representations. Connection Science, 9(2):139–159, 1997.
[4] D. J. Chalmers. Syntactic transformations on distributed representations.
Connection Science, 2(1–2):53–62, 1990. Reprinted in [16], pages 46–55.
[5] L. Chrisman. Learning recursive distributed representations for holistic computation. Connection Science, 3(4):345–366, 1991.
[6] R. M. French. Dynamically constraining connectionist networks to produce distributed, orthogonal representations to reduce catastrophic interference. In Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, Atlanta, GA, August 1994, pages 335–340, Hillsdale, NJ, 1994. Lawrence Erlbaum
Associates.
[7] J. A. Hammerton. Exploiting Holistic Computation: An evaluation of the Sequential RAAM. PhD thesis, School of Computer Science, The University of
Birmingham, UK, 1998.
[8] J. A. Hammerton. Holistic computation: Reconstructing a muddled concept.
Connection Science, 10(1):3–19, 1998.
[9] E. K. S. Ho and L. W. Chan. Confluent preorder parser as a finite state automata. In Proceedings of International Conference on Artificial Neural Networks
ICANN’96, Bochum, Germany, July 16–19 1996, pages 899–904, Berlin, 1996.
Springer-Verlag.
[10] B. L. Kalman and S. C. Kwasny. High performance training of feedforward and
simple recurrent networks. Neurocomputing, 14:63–83, 1997.
[11] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480,
1990.
[12] S. C. Kwasny and B. L. Kalman. Tail-recursive distributed representations and
simple recurrent networks. Connection Science, 7(1):61–80, 1995.
[13] L. Niklasson and N. E. Sharkey. Systematicity and generalization in compositional
connectionist representations. In G. Dorffner, editor, Neural Networks and a New
Artificial Intelligence, pages 217–232. Thomson Computer Press, London, 1997.
[14] T. A. Plate. Distributed Representations and Nested Compositional Structure.
PhD thesis, University of Toronto, 1994.
[15] J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46(1–
2):77–105, 1990.
[16] N. E. Sharkey, editor. Connectionist Natural Language Processing: Readings from
Connection Science. Intellect, Oxford, 1992.
[17] A. Sperduti. Labeling RAAM. Technical Report TR-93-029, International Computer Science Institute, Berkeley, California, 1993.
[18] A. Sperduti. On Some Stability Properties of the LRAAM Model. Technical
Report TR-93-031, International Computer Science Institute, Berkeley, California,
1993.
[19] V. Weber. Connectionist unification with a distributed representation. In Proceedings of the International Joint Conference on Neural Networks – IJCNN ’92,
Beijing, China, pages 555–560, Piscataway, NJ, 1992. IEEE.
Life, Mind, and Robots
The Ins and Outs of Embodied Cognition
Noel Sharkey1 and Tom Ziemke2,1
1
University of Sheffield
Dept. of Computer Science
Sheffield S1 4DP, UK
noel@dcs.shef.ac.uk
2
University of Skövde
Dept. of Computer Science
54128 Skövde, Sweden
tom@ida.his.se
Abstract. Many believe that the major problem facing traditional artificial intelligence (and the functional theory of mind) is how to connect
intelligence to the outside world. Some turned to robotic functionalism
and a hybrid response that attempts to rescue symbolic functionalism
by grounding the symbol system with a connectionist hook to the world.
Others turned to an alternative approach, embodied cognition, which emerged from an older tradition in biology, ethology, and behavioural modelling. Both approaches are contrasted here before a detailed exploration
of embodiment is conducted. In particular we ask whether strong embodiment is possible for robotics, i.e. are robot “minds” similar to animal
minds, or is the role of robotics to provide a tool for scientific exploration, a weak embodiment? We define two types of embodiment, Loebian
and Uexküllian, that express two different views of the relation between
body, mind and behaviour. It is argued that strong embodiment, either
Loebian or Uexküllian, is not possible for present day robotics. However,
weak embodiment is still a useful way forward.
1 Introduction
Cognitive science and artificial intelligence (AI) have been dominated by computationalism and functionalist theories of mind since their inception. The development of computers and their rapid increase in information processing power
during the 1940s and 50s led many theorists to assert that the relation between brain/body and mind in humans was the same as or similar to the relation
between hardware and software in computers. The functionalist theory of mind
seemed to solve in an elegant fashion the dispute between dualists and materialists on the relation between mind and matter. Thus, cognitive science and
AI emerged as new disciplines, both of them initially based on the computer
metaphor for mind.
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 313–332, 2000.
c Springer-Verlag Berlin Heidelberg 2000
314
N. Sharkey and T. Ziemke
Serious doubts about the appropriateness of the computer metaphor were
formulated in the arguments of Dreyfus [17] and Searle [37] around 1980, although for a long time they left most cognitive scientists and AI researchers cold.
More attention was paid to the re-emergence of connectionism during the 1980s
which, although to some degree concerned with neural hardware and thus taking
a subsymbolic view on representation, did not question the computationalist framework in general. Ignoring the problems of purely computational theories of
mind did not, however, make them disappear and theorists began to respond to
the AI “mind in a vacuum” problem in one of two main approaches.
Firstly, there was the approach of those, such as Harnad [22], who recast the
problem as the symbol grounding problem and suggested it could be solved by
using a hybrid combination of the respective strengths of symbolic and connectionist theories. This was an attempt to maintain a computational theory of
cognition augmented with robotic capacities supposed to connect internal representations to the world they are supposed to represent. The relationship between
functional theories of mind, AI, connectionism, and the hybrid response is examined in Section 2. We focus on the separation of functionalism from the world
and why this is seen as a weakness by opponents of strong AI. This leads us to a
discussion of theoretical attempts to connect the physical symbol system to the
world using connectionist nets as a front end.
The second approach, which is the main focus of this chapter, represents
a more radical conception of cognition and AI arising out of biology and the
study of animal behaviour. Led by a ‘new robotics’ that rejected traditional
AI and representationalism altogether, it embraced the tenets of a situated and
embodied cognition and challenged core assumptions of the funtionalist theory of
mind. Two different types of embodiment are defined and discussed in Section 3
in relation to their biological underpinnings: Loebian or mechanistic embodiment
and Uexküllian or phenomenal embodiment. Section 4 then moves on to critically
examine two questions. Firstly, to what extent can the concept of embodiment be
applied to artefacts such as robots? The arguments revolve around distinctions
between living systems and machines that are made by humans, e.g. autopoiesis
versus allopoiesis. Secondly, to what extent can meaning be attributed to the
behaviour of a robot? It is argued that strong embodiment, either Loebian or
Uexküllian, is not possible for present day robotics. However, weak embodiment
is still a useful way forward.
2 Connectionism, Strong AI, and the Hybrid Response
In this section, we discuss some of the criticisms faced by proponents of the
functional theory of mind and computationalism as instantiated in strong AI.
Searle [38] states that the thesis of strong AI is that, “the implemented program,
by itself, is constitutive of having a mind. The implemented program, by itself,
guarantees mental life”. The functional theory of mind is the view that mental
states are physical states that are defined as functional states because of their
causal relations. Thus, the functional role that a representation plays in a given
computational system as a whole imbues it with meaning (hence Functional Role
Semantics (FRS) [3]).
Connectionism is an attempt to get closer to the physical basis of mind by viewing representations as brain states. However, as Smolensky (1988) [50] pointed
out, the subsymbolic representations of connectionists are closer to the symbolic
realm than to physical brain states. The main debates between the cognitivists
and the connectionists were about the form that representation should take, and
how they are structured in order to realise their functional roles [44]. The cognitivists stood for syntactically structured compositional representations [18].
For them it was the syntactic relations between the elements of the representational system as a whole that allowed for systematic inference. In contrast, the
connectionists argued for non-concatenative subsymbolic representations that
were spatially rather than syntactically structured [45]. Several connectionists
argued this case by building connectionist models that demonstrated systematicity of subsymbolic representation in some form or other [12,14,31,32].
So far as we have stated it, the debate was one within the remit of the
functional theory of mind. But the connectionists also had mathematical learning
techniques that were based on abstract models of neural computation. Thus the
representations and their spatial organisation could be learned “from the world”
by extensional programming, self-organisation, reinforcement learning, or, more
lately, genetic algorithms. Although most of the representation research was
carried out in worlds that mainly consisted of symbols of one form or another (but
see [40]), the theory was that connectionist networks were hardwired physical
machines (like brains) with connections that change in response to the outside
world [39]. Connectionism had opened questions about the physical basis of
representation and how neural adaptation can create and change representations
and their spatial organisation through interaction with the world.
There were two ways in which the more philosophical connectionists could
jump. One way was to go with eliminative materialists and see connectionism as
the reduction of mental states to physical states, e.g. patterns of brain activation.
It is here that the connectionists separate from ‘run of the mill’ cognitivists. It
does not matter what machine the cognitivist runs the mind program on as long
as it is capable of running it. For the connectionists, particular hardware was
required, the brain. The other way to jump was to go for what Searle refers to
as “causal emergence” [38], i.e. cognition emerges from a brain [47].
There may be another type of strong AI lurking in these two positions. Stated
in its most general form, if we were to physically implement a neural net system
that had the causal powers of a brain, it would have mental states in the same
way that we do. This does not sound much like strong AI. However, it could be
argued that the properties of an artificial neural net are sufficient to provide a
system with the causal power of a brain. Or, alternatively, the states of a neural
net are sufficient to represent states of mind.
There were also those who did not want to ‘go the whole hog’ with connectionism and, instead, decided to occupy the middle ground by proposing hybrid
systems that attempted to integrate the best features of both connectionism
and symbolic computation (see [44]). Some of these researchers simply wanted
to bring connectionism into the mainstream by embedding artificial neural networks in larger systems. Others, with commitments to some of the tenets of
cognitivism, used the connectionist learning to “hook” symbols onto the world
[22] (see also [46,61]), in an attempt to rescue functionalism from what they saw
as one of its major weaknesses, i.e. minds are semantic, AI programs are purely
syntactic, therefore AI programs cannot be minds [37].
Lloyd (1989) points out that a problem with such an internalist conception is
that: “...some kernel of the system must win representationality by other means.”
[26]. He questions how a person might learn the meaning of a new word, say
aham, without ostension, simply by using the context in which it occurs, just
as is postulated to occur in a functional role semantics. This might be possible
for a few terms, Lloyd argues, but there must come a critical point where there
are just too many new terms to be able to allow an individual to learn them all
without ostension (and to be able to provide ostensive definitions).
Lloyd’s argument has similarities with John Searle’s famous 1980 Chinese
Room argument [37]. In brief, Searle imagines himself placed in a room where
questions written in Chinese are passed to him and he uses a sort of lookup table
combined with transformation rules to write down the correct answer in Chinese
and pass it out of the room. The point is that Searle does not understand Chinese;
he has no idea of the meaning relation between the questions and the answers;
it is just a matter of manipulating meaningless symbols. Searle’s point was that
the operation of AI programs resembles what goes on in the Chinese room.
Although the running programs manipulate symbols and provide outputs that
are appropriate for their inputs, the programs or the machine do not understand
what they are processing. There is no intrinsic meaning; the only meaning is
for external observers. Therefore, AI programs are not intelligent in the strong
sense.
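The mechanics of the room can be caricatured in a few lines: appropriate answers are produced by pure symbol lookup, with nothing in the system having access to what the symbols mean. The rulebook entries below are invented for illustration.

```python
# A caricature of Searle's Chinese Room: the "room" answers correctly
# by shape-matching symbols against a rulebook; nothing in it knows
# what the symbols mean. The symbol pairs here are invented examples.
RULEBOOK = {
    "你好吗": "我很好",   # a question squiggle and its answer squiggle
    "你几岁": "我五岁",
}

def room(question):
    # Pure lookup: find the stored squiggle, emit its partner.
    # A fallback squiggle covers unrecognised input.
    return RULEBOOK.get(question, "不知道")

print(room("你好吗"))   # prints 我很好 — appropriate, with zero understanding
```

To an outside observer the exchanges look competent; internally there is only syntax, which is exactly the gap Searle's argument exploits.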
Harnad (1990), taking Searle’s criticisms on board, suggested a way to save
strong AI. In what is termed a Hybrid response, the Chinese Room argument
was recast as an argument against symbolic functionalism in which mind is held
to be intrinsically, and solely, symbolic. Harnad proposed a robotic functionalism that extends symbolic functionalism to include some components that are
intrinsically robotic. Robotic capacity, in this context, is a label for a range of
sensorimotor interactions with the external environment, of which Harnad cites
the discrimination and identification of objects, events, and states of affairs in
the world as principal [22].
Harnad’s view appears to be that sensory transduction is sufficient to foil the
thrust of the Chinese Room argument. Harnad is quite explicit about the nature
of the sensory transduction, or robotic capacity, that serves to ground a symbolic
capacity: Specifically he names the ability to discriminate and identify nonsymbolic representations. Discrimination involves the machine making similarity
judgements about pairs of inputs, and constructing iconic representations. These
icons are nonsymbolic transforms of distal stimuli. However, Harnad argues that
merely being able to discriminate inputs does not make for a robotic capacity.
In addition, the machine must also be able to identify the iconic representations.
That is, it must be able to extract those invariant features of the sensory icons
that will serve as a basis for reliably distinguishing one icon from another and
for forming categorical representations.
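Harnad's two capacities can be rendered schematically. In the hypothetical fragment below, icons are raw sensory vectors, discrimination is a similarity judgement on a pair of them, and identification assigns an icon to the category whose stored invariant (here simply a prototype vector) is nearest. The categories, vectors, and threshold are all invented for illustration.

```python
import numpy as np

def discriminate(icon_a, icon_b, threshold=0.2):
    """Similarity judgement on a pair of sensory icons: same or different?"""
    return np.linalg.norm(icon_a - icon_b) < threshold

# Categorical representations: invariant features (here, a prototype
# vector per category) extracted from experience with each category.
prototypes = {
    "horse":   np.array([0.9, 0.1, 0.4]),
    "stripes": np.array([0.1, 0.9, 0.8]),
}

def identify(icon):
    """Assign an icon to the category with the nearest stored invariant."""
    return min(prototypes, key=lambda c: np.linalg.norm(icon - prototypes[c]))

assert identify(np.array([0.8, 0.2, 0.5])) == "horse"
```

The point of the sketch is only structural: both operations work on nonsymbolic transforms of the input, which is what Harnad takes to ground the symbolic level.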
But Searle is not just arguing that AI programs need a way to point at and
categorise objects in the world. While Harnad sees the need to link the physical
symbol system to the world to give the symbols a referential meaning in the
same way as Lloyd (op. cit.) has argued, he is still committed to functionalism.
Searle, on the other hand, is arguing for much more; a machine cannot have
intrinsic semantics because it is not intentional and it has no consciousness. To
get the idea of what he means by consciousness, as opposed to the functionalist’s
representational states, Searle [38] asks us to pinch our skin sharply on the left
forearm. He then lists a number of different things that will happen as a result
and includes a list of neural and brain events. However, the most important event
for Searle happens a few hundred milliseconds after you pinched your skin: “A
second sort of thing happened, one that you know about without professional
assistance. You felt pain. ... This unpleasant sensation had a certain particular
sort of subjective feel to it, a feel which is accessible to you in a way that is
not accessible to others around you.” What he is getting at here is that
the feeling of pain is one of the many qualia. But Searle is not a dualist; he
asks, “How is it possible for physical, quantitatively describable neuron firings
to cause qualitative, private subjective experiences?”. Searle’s view is biological.
He holds that the phenomenal mind is caused by a real living brain.
3 Embodied Cognition
Behaviour-based robotics represents an alternative approach to AI that has been
gaining ground over the last decade (for overviews see [11,1,60,41]). Although
the basis for the approach has been around in biology for nearly a century and
in robotics for over fifty years, it entered mainstream AI in the 1980s through
the work and ideas of Rodney Brooks [6,7,8,9,10,11]. His approach to the study
of intelligence was through the construction of physical robots embedded in,
and interacting with, their environment by means of a number of behavioural
modules working in parallel. The modules receive input from the sensors on
the robot and the behaviour is controlled by the appropriate module for the
situation. But there is no central controller. The modules are connected to each
other in a way that allows certain modules to subsume the activity of others,
hence this type of architecture is referred to as subsumption architecture.
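The layered control scheme just described can be caricatured as follows. Real subsumption controllers are networks of augmented finite-state machines running asynchronously; this hypothetical sketch collapses that into a priority list in which the first triggered module suppresses those below it. The behaviours and sensor fields are invented.

```python
# Minimal caricature of a subsumption architecture: behavioural modules
# read the sensors directly; a higher layer, when triggered, subsumes
# (suppresses) the layers beneath it. No central controller, no world model.

def avoid(sensors):            # higher layer: obstacle reflex
    if sensors["bump"]:
        return "reverse-and-turn"

def wander(sensors):           # lowest layer: default behaviour
    return "forward"

LAYERS = [avoid, wander]       # ordered with the subsuming layer first

def control(sensors):
    for behaviour in LAYERS:   # first layer that produces an action wins
        action = behaviour(sensors)
        if action is not None:
            return action

assert control({"bump": True}) == "reverse-and-turn"
assert control({"bump": False}) == "forward"
```

The design point is that coherent overall behaviour emerges from the interaction of the modules with the world, not from a planner reasoning over internal representations.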
This approach differs from both symbolic and robotic functionalism. Unlike
classical AI, the modules do not make use of explicit internal representations
in the sense of representational entities that correspond to entities in the ‘real’
world. Here, there is no need to talk about grounding symbols or representations
in the world1 . There are no internal representations to ground and there are
1
Brooks is not entirely consistent with this point as he does write about physical
grounding of representations in [7,10].
no world models or planners; the world is its own world model for the robot.
Intelligence is found in the interaction of the robot with its environment.
In terms of theory of mind, the literature appears to suggest a re-emergence
of behaviourism in that there are no mental representations, just behaviour and
predispositions to behaviour; sensing and acting in the world are necessary and
sufficient for intelligence. Such a view takes form from considering a less anthropocentric universe than cognitivism. Much recent behaviour-based robotics takes
inspiration from simple intelligent life forms such as arthropods, e.g. [2,57,25].
This provides a way to approach the fundamental building blocks of intelligence
through a consideration of its biological basis.
But there is an unexpected turn in the theoretical picture that we have been
painting about the new AI. Brooks (1991) proposed that the two cornerstones of
the new approach to Artificial Intelligence are situatedness and embodiment [8].
Situatedness means that the physical world directly influences the behaviour of
the robot - it is embedded in the world and deals with the world as it is, rather
than as an abstract description. Embodiment commits the research to robots
rather than to computer programs, as the object of study. This is consistent
with the anti-mentalist stance of the new robotics.
However, the way in which embodiment is discussed has at least two radically different interpretations within the theory of mind, and behaviour-based
roboticists appear to move between them with impunity. In the next two subsections, we examine these contrasting positions in relation to the work of the
biologists, Charles Sherrington (1857-1952), Jacques Loeb (1859-1924) and Jakob von Uexküll (1864-1944), who, in different ways, developed theories of a
biological basis for behaviour.
3.1 Loebian Embodiment
The first type of embodiment follows from the mechanistic or behaviourist line
that Brooks appeared to be following. In this view one would expect embodiment
to mean that cognition is embodied in the mechanism itself. That is, cognition
is embodied in the control architecture of a sensing and acting machine. There
is nothing else. There is no cognitive apparatus separate from the mechanism
itself. There is no need for symbol grounding in that there are no symbols to
ground. It is the behaviour or the dispositions to behaviour that are grounded
in the interaction between agent and environment. This is similar to notions
from physicalism in which the physical states of a machine are considered to
be its mental states, i.e. there is no subjectivity. Movement is “forced” by the
environment. This form of robot control relates mostly to the work of Sherrington
on reflexes [49] and Loeb on tropisms [27].
Sherrington [49] focused on the nervous system and how it constitutes the behaviour of the multicellular animal as a “social unit in the natural economy”. He
proposed that it was nervous reaction that produces an animal individual from
a mere collection of organs. The elementary unit of integration and behaviour
was the simple reflex consisting of three separable structures: an effector organ,
a conducting nervous pathway leading to that organ, and a receptor to initiate
the reaction. This is the reflex arc and it is this simple reflex which exhibits
the first grade of coordination. However, Sherrington admitted that the simple
reflex is most likely to be a purely abstract conception. Since, in his view, all
parts of the nervous system are connected together, no part may react without
affecting and being affected by other parts. Nonetheless, the idea of chains of
reflexes to form coherent behaviour became part of mainstream psychological
explanation beginning with the work of the Russian physiologist Ivan Pavlov,
who extended the study of reflex integration into the realm of animal learning:
classical conditioning [33].
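Sherrington's three separable structures compose naturally into a pipeline. The sketch below is only a schematic rendering of the reflex arc; the stimulus encoding, threshold, and action names are invented.

```python
# Sherrington's reflex arc as a minimal pipeline: a receptor initiates
# the reaction, a conducting nervous pathway carries it, and an effector
# organ acts. Threshold and stimulus fields are invented for illustration.

def receptor(stimulus):
    return stimulus["intensity"] > 0.5      # initiate only above threshold

def pathway(fired):
    return fired                            # conduct the signal onward

def effector(signal):
    return "withdraw" if signal else "rest"

def reflex_arc(stimulus):
    return effector(pathway(receptor(stimulus)))

assert reflex_arc({"intensity": 0.9}) == "withdraw"
assert reflex_arc({"intensity": 0.1}) == "rest"
```

As Sherrington himself admitted, such an isolated arc is an abstraction: in a real nervous system every arc affects, and is affected by, the rest.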
Loeb [27] had no patience with physiology and argued against the possibility
of expressing the conduct of the whole animal as the “algebraic sum of the reflexes
of its isolated segments”. His concern was with how the whole animal reacted in
response to its environment. Like the later behaviourists, Loeb was interested in
how the environment “forced” or determined the movement of the organism. He
derived his theory of tropisms (directed movement towards or away from stimuli)
from the earlier scientific study of plant life, e.g. the directed movement through
geotropism [24] and phototropism [16]. Thus for Loeb, animals were Cartesian
puppets whose behaviour was determined by the environmental puppeteer (see
also [43,59]). Loeb would certainly have been very interested in today’s biologically inspired robotics and was quick to see the implications of the artificial
heliotropic machine built by J.J. Hammond. Loeb claimed that construction of
the heliotropic machine, which was based on his theories, supported his mechanistic conception of the volitional and instinctive actions of animals [27].
Loeb used tropism2 rather than taxis to stress what he saw as the fundamental identity of the curvature movements of plants and the locomotion of animals
in terms of forced movement enabled by body symmetry. Although his specific
theory about animal symmetry eventually fell under the weight of counter experimental evidence, major parts of his general theory of animal behaviour were
taken up by later biologists and psychologists using the term taxis for directed
animal movement. Fraenkel and Gunn [19], sympathising with Loeb’s stance on
the objective study of animal behaviour, heralded behaviour-based robotics by
proposing that the behaviour of many organisms can be explained as a combination of taxes working together and in opposition.
However, it was the ideas of both reflex and taxis that first inspired what
is now called artificial life research. Grey Walter (1950, 1953) built two “electronic tortoises” in one of the earliest examples of a physical implementation of
intelligent behaviour. Each of these electromechanical robots was equipped with
two ‘sense reflexes’; a very small artificial nervous system built from a minimum
of miniature valves, relays, condensers, batteries and small electric motors, and
these reflexes were operated from two ‘receptors’: one photoelectric cell, giving
the tortoises sensitivity to light, and an electrical contact which served as a
touch receptor. The artificial tortoises were attracted towards light of moderate
intensity, repelled by obstacles, bright light and steep gradients, and never stood
2
Nowadays the term reflex is reserved for movements that are not directed towards
the source of stimulation whereas taxis and tropism are used to denote movements
with respect to the source of stimulation.
still except when re-charging their batteries. They were attracted to the bright
light of their hutch only when their batteries needed re-charging. Grey Walter
made many claims from his observations of these robots which included saying
that the ‘tortoises’ exhibited hunger, sought out goals, exhibited self-recognition
and mutual recognition. He also carried out the first artificial research on classical conditioning with his CORA system [21]. Grey Walter’s work combined
and tested ideas from a mixture of Loeb’s tropisms and Sherrington’s reflexes.
Although Loeb is not explicitly mentioned in the book, the influence is clear,
not least from the terms positive and negative tropisms instead of taxes.
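Read as Fraenkel and Gunn's taxes working together and in opposition, the tortoise's behaviour can be caricatured as a small set of prioritised attractions and repulsions. The thresholds and action names below are invented; this is not Grey Walter's valve-and-relay circuit.

```python
# Caricature of Grey Walter's tortoise as taxes in combination:
# a touch reflex, negative phototropism to bright light, positive
# phototropism to moderate light, and hutch-seeking when "hungry".

def tortoise(light, touch, battery):
    if touch:                      # touch reflex overrides everything
        return "back-off-and-turn"
    if battery < 0.2:              # hungry: the bright hutch light attracts
        return "steer-to-light"
    if light > 0.8:                # bright light repels when charged
        return "steer-away"
    if light > 0.3:                # moderate light attracts
        return "steer-to-light"
    return "explore"               # never stands still otherwise

assert tortoise(light=0.5, touch=False, battery=1.0) == "steer-to-light"
assert tortoise(light=0.9, touch=False, battery=0.1) == "steer-to-light"
```

The same stimulus (bright light) attracts or repels depending on internal state, which is what allowed Grey Walter to describe the machines in terms of hunger and goals.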
These same ideas turn up again in recent robotics research and form the basis
of much of modern robotics and Alife work (see also [48]). There is also considerable activity aimed at specific biological modelling that continues the line
of mechanistic modelling of animal behaviour started by Hammond’s heliotrope
machine. Sherrington believed that there was much more to mind than the execution of reflexes as we shall see later. Loeb is more representative of the antimentalistic mechanistic view of intelligence and thus we dub this view Loebian
embodiment.
3.2 Uexküllian Embodiment
Brooks [8] also espouses quite a different type of embodiment saying that, “The
robots have bodies and experience the world directly - their actions are part of a
dynamic with the world and have immediate feedback on their own sensations”.
He was anxious to take on board von Uexküll’s concept of Merkwelt or perceptual
world according to which each animal species, with its own distinctly non-human
sensor suites and body, has its own phenomenal world of perception [5,9]. This
notion of embodied cognition has its roots in von Uexküll’s idea of bringing
together an organism’s perceptual and motor worlds in its Umwelt (subjective
or phenomenal world) and hence we call it Uexküllian embodiment.
Von Uexküll tried to capture the seemingly tailor-made fit or solidarity between the organism’s body and its environment in his formulation of a theoretical biology [53], and his Umwelt theory [52,54,56]. Von Uexküll criticized the
mechanistic doctrine “that all living beings are mere machines” for the reason
that it overlooked the organism’s subjective nature, which integrates the organism’s components into a purposeful whole. Thus his view is to a large degree
compatible with Sherrington and Loeb’s ideas of the organism as an integrated unit of components interacting in solidarity among themselves and with
the environment. However, he differed from them in suggesting a rudimentary
non-anthropomorphic psychology in which subjectivity acts as an integrative
mechanism for coherent action:
We no longer regard animals as mere machines, but as subjects whose
essential activity consists of perceiving and acting ... for all that a subject
perceives becomes his perceptual world and all that he does, his effector
world. Perceptual and effector worlds together form a closed unit, the
Umwelt. [54]
Life, Mind, and Robots
321
Although he strongly opposed purely mechanistic/materialistic conceptions of life, and in particular the work of Loeb (e.g. [55]), von Uexküll was not
a dualist. He did not deny the material nature of the organism’s components.
For example, he discussed how a tick’s reflexes are “elicited by objectively demonstrable physical or chemical stimuli” [54]. However, for von Uexküll, the
organism’s components are forged into a coherent unit that acts as a behavioural entity. It is a subject that, through functional embedding, forms a “systematic
whole” with its Umwelt.
Similar ideas have begun to emerge in robotics as part of a new enactive
cognitive science approach [51], and there is broad support in the field of embodied cognition, where the relevance of life and biological embodiment for the study of cognition and intelligent behaviour is being reassessed, e.g.
[15,13,36,58,34,61,62]. Two of the principals of the new approach, Maturana and
Varela [29,30], have proposed that cognition is first and foremost a biological
phenomenon. For them, “all living systems are cognitive systems, and living as
a process is a process of cognition” [29].
In this framework, cognition is viewed as embodied action by which Varela et
al. [51] mean “...first, that cognition depends upon the kinds of experience that
come from having a body with various sensorimotor capacities, and second, that
these individual sensorimotor capacities are themselves embedded in a more encompassing biological, psychological, and cultural context”. Thus, this view, like
that of von Uexküll, emphasises the organism’s embedding in not only its physical environment, but also the context of its own phenomenal world (Umwelt),
and the tight coupling between the two. In the words of Varela et al., “cognition
in its most encompassing sense consists in the enactment or bringing forth of a
world by a viable history of structural coupling” [51].
This is very different from Loebian embodiment where the mechanisms underlying behaviour are themselves controlled by the environment and where the
organism is a mere Cartesian puppet (cf. also [43,59]). What we have called
Uexküllian embodiment is the notion of a phenomenal cognition as an intrinsic
part of a living body; as a process of that body.
4 Weak but Not Strong Embodiment
In this section we take a closer look at the two types of embodiment proposed
in Section 3. First, Uexküllian embodiment is revisited to ask in what sense can
a robot be a subjective embodiment. Then Loebian embodiment is discussed in
terms of mechanistic minds and observer errors. For both types of embodiment
we inquire as to whether they exist in the strong sense of equating “robot minds”
with animal minds or they exist only in the weak sense of scientific or engineering
models for investigating or exploiting knowledge about living systems.
4.1 Uexküllian Embodiment Revisited
A critical question facing Loeb, Sherrington, and von Uexküll concerned how
a large number of living cells could act together in an integrated manner to
322
N. Sharkey and T. Ziemke
produce a unitary behaving organism; in other words, to create a unitary living
body. Loeb’s approach was entirely behaviouristic and anti-mentalistic while von
Uexküll proposed a subjective world as the method of integration. Sherrington
was slightly different. He worked on the nervous mechanisms of the reflexes as
an integrative mechanism. But he also recognised the limitations of the non-subjective. Writing about a decerebrate dog that was used in his research, Sherrington states that,
... it contains no social reactions. It evidences hunger by restlessness
and brisker knee jerks; but it fails to recognize food as food: it shows no
memory, it cannot be trained to learn... The mindless body reacts with
the fatality of a multiple penny-in-the-slot machine, physical, and not
psychical. [49]
There can be no Uexküllian embodiment on existing robots. Uexküllian
embodiment requires a living body. It is the embodiment of cognition or Umwelt as living processes. A robot does not have a body in this sense. It is a
collection of inanimate mechanisms and non-moving parts that form a loosely
integrated physical entity; a robot body is more like a car body and not at all
like a living body. According to von Uexküll there are principal differences between the construction of a mechanism and a living organism, see e.g. [55,62].
He uses the example of a pocket watch to illustrate how machines are centripetally constructed: The individual parts of the watch, such as its hands, springs,
wheels, and cogs must be produced first, so that they may be added to a common centerpiece. In contrast, the construction of an animal starts centrifugally;
animal organs grow outwards from single cells. Von Uexküll was clear about the
machine vision of animal life:
Now we might assume that an animal is nothing but a collection of
perceptual and effector tools [like microphones and motor cars], connected by an integrating apparatus which though still a mechanism, is yet
fit to carry on the life functions. This is indeed the position of all mechanistic theories, whether their analogies are in terms of rigid mechanics or
more plastic dynamics. They brand animals as mere objects. The proponents of such theories forget that, from the first, they have overlooked
the most important thing, the subject which uses the tools, perceives and
functions with their aid. [54]
More recently, Maturana and Varela [29] have also made a distinction between the organisation of living systems and machines made by humans. They
were not interested in listing the static properties of living systems or the properties of the components of such systems. Rather they were concerned with the
problem of how a living system, such as a single cell, can create and maintain its
own identity despite a continual flux of perturbations and the continual changing
of its components through destruction and transformation. As Boden [4], points
out, this is universally accepted as one of the core problems of biology and for
Maturana and Varela it is the problem.
Maturana and Varela (ibid.) attacked the problem by concentrating on the
organisation of matter in living systems. In his essay, Biology of Cognition in
[29], Maturana proposed that the organisation of a cell is circular because the
components that specify it are also the very components whose production and
maintenance the organisation secures. It is this circularity that must be maintained for the system to retain its identity as a living system. Maturana and Varela
[29] described the maintenance of the circularity with the new term autopoiesis,
meaning self- (auto) -creating, -making, or -producing (poiesis).
An autopoietic machine, such as a living system, is a special type of homeostatic machine for which the fundamental variable to be maintained constant is its
own organisation. This is unlike regular homeostatic machines which typically
maintain single variables, such as temperature or pressure. The structure of an
autopoietic system is the concrete realisation of the actual components (all of
their properties) and the actual relations between them. Its organisation is constituted by the relations between the components that define it as a unity of a
particular kind. These relations are a network of processes of production that,
through transformation and destruction, produce the components themselves.
It is the interactions and transformations of the components that continuously
regenerate and realize the network of processes that produced them. Although
these definitions apply to many systems, such as social systems, that may also
be autopoietic, living systems are physical in a particular way. Moreover, autopoiesis is held to be necessary and sufficient to characterise the organisation of
living systems.
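The contrast with a regular homeostatic machine can be made concrete with a toy thermostat, sketched below in Python. The setpoint, hysteresis, and action labels are purely illustrative; the point is only that the variable being held constant is chosen from outside, not produced by the machine's own organisation.

```python
def thermostat_step(temperature, setpoint=20.0, hysteresis=0.5):
    """An ordinary homeostatic machine in the sense used above: it holds a
    single externally chosen variable (temperature) near a setpoint.
    Nothing in this loop produces or maintains the machine itself."""
    if temperature < setpoint - hysteresis:
        return "heat_on"   # too cold: switch the heater on
    if temperature > setpoint + hysteresis:
        return "heat_off"  # too warm: switch the heater off
    return "hold"          # within the band: leave things as they are
```

An autopoietic machine, by contrast, has no such external setpoint: what it "regulates" is the very network of processes that produces its own components.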
Living systems are not the same as machines made by humans as some of
the mechanistic theories would suggest. In Maturana and Varela’s formulation,
machines made by humans, including cars and robots, are allopoietic. Unlike
an autopoietic machine, the organisation of an allopoietic machine is given in
terms of a concatenation of processes. These processes are not the processes
of production of the components that specify the machine as a unity. Instead,
its components are produced by other processes that are independent of the
organisation of the machine. Thus the changes that an allopoietic machine goes
through without losing its defining organisation are necessarily subordinated
to the production of something different from itself. In other words, it is not
truly autonomous. In contrast, a living system is an autopoietic machine whose
function it is to create and maintain the unity that distinguishes it from the
medium in which it exists. It is truly autonomous.
To be a living autonomous entity requires a unity that distinguishes an organism from its environment. At the core of autopoiesis lies the autonomy of the
individual cells, which Maturana and Varela refer to as “first-order autopoietic
units”, similar to what von Uexküll meant by the term Zellautonome for autonomous cellular unities [53]. The individual cells’ solidarity which constitutes
the organism as an integrated behavioural entity and “second-order autopoietic
unit” is due to the fact that “the structural changes that each cell undergoes
in its history of interactions with other cells are complementary to each other,
within the constraints of their participation in the metacellular unity they comprise” [30].
The chemical, mechanical, and integrating mechanisms of living things are
missing from robots. Consequently, there can be no notion of multicellular solidarity or even a notion of a cell in a current robot. Although some may argue
that the messaging between sensors, controllers, and actuators is a primitive
type of integration, this is very different from the dependency relationship between living cells in real neural networks. Artificial neural nets can be used as a
‘stand-in’ integrative mechanism between sensors and actuators. However, they
are not themselves integrated into the body of the robot; most of the body is a
container for the controller, a stand to hang the sensors on, and a box for the
motors and wheels. There is no interconnectivity or cellular communication. In
multicellular creatures, solidarity of different types of cells in the body is required for survival. Cells need oxygen and so living bodies need to breathe; they
need nutrition and so bodies need to behave in a way that enables ingestion of
appropriate nutrients.
Furthermore, like von Uexküll [53], Maturana and Varela point out that living
systems cannot be properly analyzed at the level of physics alone, but require
a biological phenomenology:
... autopoietic unities specify biological phenomenology as the phenomenology proper to those unities with features distinct from physical
phenomenology. This is so, not because autopoietic unities go against
any aspect of physical phenomenology - since their molecular components must fulfill all physical laws - but because the phenomena they
generate in functioning as autopoietic unities depend on their organization and the way this organization comes about, and not on the physical
nature of their components (which only determine their space of existence). [30]
Boden [4] has also recently argued that current metal robots cannot be living systems, or, in her words, “strong artificial life”. Although she goes along
with much of Maturana and Varela’s characterisation of the organisation of the
living, she stops short of accepting that all living processes are cognitive processes. Instead, Boden found it sufficient to argue from the weaker, but more
straightforward, position that biochemical metabolism is necessary for life. Robots, like Grey Walter’s tortoises, that can recharge their batteries do not have
metabolic processing that involves closely interlocking biochemical processes.
They only receive packets of energy that do not produce and maintain the robot’s body. It may be possible to produce an artificial life form, writes Boden,
but it will probably have to be a biochemical one.
Even unicellular organisms with no nervous system exhibit more bodily integration and intimacy with their environment than current robots. In no sense
does a ‘situated’ and ‘embodied’ robot actually experience the world. Its ‘experience’ is no different than that of an electronic tape measure. An electronic
tape measure uses sonar to take measurements of room dimensions and ceiling
heights in houses. You can point it at a wall and get a readout of say 54.4 cm.
This is similar to the sonar sensing used for robots. The output from the robot
sensors is a voltage proportional to distance. We could mount two of these tape
measures on the front of a three wheeled platform with motors for the two rear
wheels. The motors could be controlled directly by the tape measure outputs by
wiring the left sensor to the right motor and vice versa. After a little tinkering,
this device should make a reasonable job of avoiding boxes and walls. It is thus a
situated robot in the sense described above. However, it does not have its “own
sensations” nor does it have “a body” in the sense discussed above that could
enable it to “experience the world directly”. Thus, to repeat ourselves, strong
Uexküllian embodiment is not possible on current robots.
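The cross-wired platform just described can be sketched in a few lines of Python; the gain and the distance readings below are illustrative assumptions, not measurements from any actual device.

```python
def motor_speeds(left_dist, right_dist, gain=1.0):
    """Cross-wire two distance sensors to two drive motors: the left sensor
    drives the right motor and vice versa, so the platform speeds up the
    wheel opposite the more open side and turns away from nearby walls."""
    left_motor = gain * right_dist    # a wall close on the right slows the left wheel
    right_motor = gain * left_dist    # open space on the left speeds up the right wheel
    return left_motor, right_motor

# Wall close on the right, open space on the left: the right motor runs
# faster, so the differential-drive platform veers left, away from the wall.
left, right = motor_speeds(left_dist=2.0, right_dist=0.3)
assert right > left
```

Nothing in this loop constitutes an "experience" of boxes or walls; it is exactly the voltage-to-motor mapping of the tape measures, which is the point being made above.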
Weak Uexküllian embodiment is, of course, possible in the sense of simulating
embodied cognition with a physical robot. This would mean writing programs to
capture aspects of cognition, but in a different way and with a different notion
of cognition than used in disembodied AI. It would be a simulation of enactive
cognition that could provide a ‘wedge in the door’ for biological and psychological
research. In this sense it can be useful to view an autopoietic machine as an
allopoietic machine. Although this has scientific value, according to Maturana
and Varela, it will not reveal the autopoietic organisation.
4.2 Loebian Embodiment and the Clever Hans Error
Weak Loebian embodiment is not in question here. Robots have already been
applied usefully as physical tests of the plausibility of mechanistic hypotheses
about particular animal behaviour, e.g. [57,25]. Weak embodiment also incorporates the use of the robot as a thought tool for engineering or modelling
applications [6,42].
However, why should strong embodiment be in question? After all, we have
already said that the mechanistic or Loebian approach treats animals as mere
Cartesian puppets. But they are puppets that have been co-evolved with their
environments and their own niche that encapsulates the meaning of their existence. The intimate relationship between the body of a living organism and
its environment as described in the previous section is missing, and this is important for strong embodiment regardless of one’s views of mentality. Strong
embodiment implies that the robot is integrated and connected to the world in
the same way as an animal.
However, the identity of an allopoietic machine, like a robot, is not determined by its operation. Rather, because the products of the machine are independent of its organisation, its identity depends entirely on the observer. The
meaning of the robot’s “actions” is also in the observer’s world and not in the
“robot world”. The robot’s behaviour has meaning only for the observer. To
think otherwise is to commit what is known in biology as the Clever Hans error.
We shall expand on the nature and implications of such errors of attribution for
the notion of strong Loebian embodiment in the remainder of this section.
Clever Hans was a horse whose performance startled citizens and scientists in
the early part of the 20th century with its ability to perform simple arithmetic
operations. A sum was written on a poster and displayed so that the horse could
see it. Then the horse, to the amazement of the audience, would tap out the
numerical solution with a hoof. There were even “scientific” theories about how
it was using mental arithmetic. Hans even passed a test set up by Stumpf, the
director of the Berlin Institute of Psychology, with a panel including a zoologist,
a vet, and a circus trainer.
However, eventually the horse was submitted to an objective psychological
assessment [35] and all was revealed. The horse always got the answer correct
whenever there was an observer who knew the correct answer. With an observer
who did not know the answer, Hans tapped an arbitrary number of times. It
turned out that people who knew the answer to the sum were giving Hans a
subtle cue at the right moment and he stopped hoofing. Such was Hans’ ability
at cue detection that it worked even if the observer knew what Hans was up to.
Hans did not know or care much about human arithmetic; it was not part of a
horse’s natural world. The exact number of hoof taps bore no relevance to Hans.
If he responded appropriately to arbitrary start and stop cues, he was rewarded.
Hans’ cleverness was in being able to spot the stop cues without, apparently,
having been trained. His owner, an Austrian aristocrat named von Osten, did
not realise that he was being ‘tricked’. Otherwise, the behaviour is not that much
different to pigeons pecking in a Skinner box. It is as simple as that.
However, in the case of Clever Hans, it just so happens that, for human
observers, the start sign, the sum, etc. have a rule-governed (arithmetic) relationship
with the behaviour of the horse. Thus the observers attributed causal significance
to Hans’ behaviour. Hans might have impressed them even more if they had given
him questions such as, how many days are there in a week? Or, how many miles
is it from Sheffield to Rotherham? Or, how many times has Brazil won the World
Cup? And, of course, he would also need a clever audience.
The point is that the meaning that the observer received from Clever Hans
did not originate from the horse. Rather, it was merely a distorted reflection of
the observer’s own meaning. The meaning for Hans is less clear but it probably
had nothing to do with arithmetic.
At first blush, the tale of Clever Hans seems to have similarities with Searle’s
Chinese room argument. Like Clever Hans, the person in the room was operating
within the meaning domain of the observer. It was the observer who brought the
meaning of the questioning and answering to the configuration. However, there
is a major difference; Hans was simply responding to human cues to meet his
own agenda. He was a living system as well. The observer was, in a sense, both
in the ‘room’ of the system (as a cue provider) and outside of it at the same
time (as the observer).
However, to invoke the systems argument, the most popular rebuttal of the
Chinese Room argument, for Clever Hans would be inappropriate. The argument
would run that although Hans did not know the meaning of the human arithmetic
task, the system as a whole, consisting of the observer, the problem, Hans and
his hoof, understood the human arithmetic problem. But this would be an oddly
redundant system since one of its components, “the observer”, can understand
the whole problem so why have more components that understand nothing about
it? The situation had an entirely different meaning for Hans that the observers
did not understand. It would be a folly, however, to say that the ‘system as a
whole’ understood the meaning of Hans’ behaviour in terms of the horse world,
for the same reasons as with the human observer.
Bringing the Clever Hans error to bear on robotics research, suppose that we
have a light sensing robot, AutoHans, that has to perform the same task as its
living counterpart. A screen with the sum on it is lighted and the robot begins
moving back and forth (sadly there are no hoofs). When AutoHans has moved
the correct number of times, the screen is darkened and AutoHans stops. These
events are meaningful only to the observer and not to the machine. To think
otherwise would be a Clever AutoHans error.
Going a step further, suppose that we now “rewire” the light sensing robot
so that it can follow a light gradient. Now we can stand above it, call the sum
out loud, and wave a torch around in circles until the correct number of circles
has been achieved by the robot. This is such obvious trickery that it would not
be worth mentioning except to highlight that it is no different than putting well
placed lights in the environment to manipulate the behaviour of vehicles. For
example, Grey Walter’s 1950s tortoise robots were repelled by the bright
light in their hutch (battery charger) and moved only towards moderate light in
the room. When their batteries were low, they moved only towards the bright
light in the hutch. This was a predesigned world and a robot with a carefully
crafted controller. If this argument is sound, then to attribute “hunger”, “attraction”, or other anthropomorphic labels to the behaviour of these devices is
again making the Clever AutoHans error. They can only exhibit weak Loebian
embodiment.
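The predesigned character of this world is easy to see if the tortoise's light responses are caricatured as a decision rule. The sketch below is our reconstruction in Python, not Grey Walter's actual analogue circuitry; the thresholds and behaviour labels are hypothetical.

```python
def tortoise_steering(light_level, battery_low, moderate=0.5, bright=0.8):
    """Caricature of the tortoise behaviour described above: seek moderate
    light and avoid the bright hutch light, except when the battery is
    low, in which case the bright charger light becomes attractive."""
    if battery_low:
        return "approach"       # head for the bright light in the hutch
    if light_level >= bright:
        return "avoid"          # repelled by the charger's bright light
    if light_level >= moderate:
        return "approach"       # drawn to moderate room light
    return "wander"             # explore until some light is found
```

Read this way, "hunger" is just the `battery_low` flag flipping a branch; the label belongs to the observer, not to the machine.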
Such systems, and this applies to all behavior-based robots, are designed by
humans such that their movement in interaction with “cues” in the environment,
e.g. lights of a particular intensity, looks, to human observers, like the behaviour
of an organism. Even if the system adapts by neural and/or evolutionary methods, the goals or purpose of the system, by necessity, are designed by the
researcher as the part that will make the work comprehensible to other human
observers. These goals implicitly direct the themes of research papers and are
what give the devices their credibility as autonomous agents. Essentially they
search through “cue” space to find appropriate cues that lead to the satisfaction
of the observer’s/experimenter’s goals. These goals are not the robot’s but the
observer’s goals instantiated. Thus the robot’s interactions with the world carry
meaning only for the observer.
Take, for example, the case of Webb’s cricket studies [57,28] in which a wheeled
robot uses an auditory system capable of selectively localising the sound of a male
cricket stridulating (rubbing its wings together rapidly to produce a sound that
attracts potential mates). Comparison of the intensity of auditory information
from two microphones was used to directly control the motors on a mobile robot
and drive it towards the sound source.
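The control principle is simple enough to state in a few lines: compare the intensity at the two microphones and steer toward the louder side. The Python sketch below is our illustration of that principle; the deadband and the decibel values are assumptions, not parameters from Webb's robot.

```python
def phonotaxis_step(left_db, right_db, deadband=1.0):
    """One step of intensity-comparison phonotaxis: turn toward the louder
    microphone, and go straight when the two inputs roughly match."""
    difference = left_db - right_db
    if difference > deadband:
        return "turn_left"      # left microphone louder: steer left
    if difference < -deadband:
        return "turn_right"     # right microphone louder: steer right
    return "forward"            # no clear winner: keep heading
```

Run repeatedly in a closed loop, such a rule drives the platform up the sound gradient without any intermediate recognition or decision stage.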
Let us be clear about what this robot system tells us. It is a physical demonstration that mate selection can be mechanised through direct auditory-motor
connections. No intermediate recognition or decision processes are necessary.
This supports a biological model of mate selection as a taxis. Given the correspondence with some of the data on mate preferences in the male cricket, this is
a useful piece of biological modelling. Sound localisation and its use in control
is also useful from an engineering perspective.
However, care must be exercised not to attribute strong Loebian embodiment
to the physical instantiation of this robotic system. Webb is very careful with
her claims. It is inappropriate, except perhaps in a popular science magazine
headline, to call this a “robotic cricket”. Yes, it can respond to certain sound
frequencies and combinations of frequencies in a way that has some resemblances
to the cricket; but only at a distal level of description and only for one taxis.
This is at best a simulated partial embodiment.
Clearly at the proximal level of description there is little similarity between
the real cricket and the robot behaviours. Obviously the instant-to-instant behavioural output of a rigid platform on wheels will be very different from a
behavioural output of a legged insect body [23]. Both will behave quite differently towards a blade of grass on their trek to the male. The robot is actually a
single function device; a tracker to find the most “attractive” male crickets inside
a given sensory perimeter within a constrained physical environment. Out in the
wild what chance would the robot have to locate the sound of a male cricket?
And if it did, what are the chances that it would reach the male? (beware of
mud patches, swampland, and rocks). Moving to the end game of mate selection,
what if the robot did reach the male? What would the robot do when the cricket
stopped stridulating? The robot would be a less than amorous companion. The
cricket would certainly not recognise it as a mate even if he did notice it. Or,
for that matter, what would the robot do if the cricket stridulated a few inches
above it or even sat on top of it stridulating?
5 Conclusions
We began by examining one of the dominant criticisms that has dogged AI since
the 1980s; namely, that the symbolic realm of the AI program is connected to the
world only through the knowledge of its human designers. Nor was connectionism free from this criticism, although connectionist systems can be more readily
connected to the outside world by “learning” a mapping between sensors and
actuators. Indeed, many who agreed with the criticism but were still committed
to the functional theory of mind developed a hybrid response to the problem by
using connectionist nets to ‘hook’ the symbolic or representational domain onto
the world.
At the same time, through the 1980s and 1990s, a new movement emerged
in AI whose focus was primarily on robotics, i.e. physical machines which are,
unlike computer programs, capable of interacting with their environment. Terms
like “embodiment”, “physical grounding” and “cognitive robotics” have become
central in recent theories of mind, although they are used by cognitive theorists in
at least two different ways. Firstly, as discussed in Section 2, there are those, who
maintain the functionalist view of a computational mind and thus see the robot
body as a means of “hooking” internal representations to the world. Secondly,
as discussed in Section 3, there are those who reject the traditional notion of
representation, and believe that cognition will emerge from the sensorimotor
interaction of robot and environment.
We identified two significantly different notions of embodiment, Loebian (or
mechanistic) and Uexküllian (or phenomenal), and discussed them in terms of
their relationship to theories of body, mind and behaviour. In defining Loebian
embodiment, we showed how the mechanistic notions of an early behaviourism,
which pictures organisms as mechanisms more or less ‘puppeteered’ by their
environment, have found their way into robotic studies of cognition and behaviour. This was contrasted with Uexküllian embodiment which encapsulates the
view that living cognition relies on a phenomenal subjective interaction with
the world. Cognition is ‘embodied’ in each organism’s body and, in the extreme, all biological processes, including those of the single cell, are cognitive
(e.g. Maturana and Varela [29,30]). However, because the new AI is still finding
its (theoretical) feet, many of its practitioners move freely between these two
notions of embodiment despite their differences.
It has been argued here that strong embodiment of either type is, in principle,
not possible for current robots. Despite their apparent ‘autonomy’ during operation, robots remain allopoietic machines, which ultimately derive meaning only
from their designers and observers. However, weak embodiment of both types is
possible: There have been a number of successful robotic studies of mechanistic
theories of animal behaviour. Similarly, robots can certainly be used to study
and simulate how artificial agents can enact or bring forth, by means of adaptive techniques, their own ‘phenomenal’ environmental embedding in the form of
interactive representations and behavioural structures. Thus, studying an allopoietic machine, such as a robot, as if it was physically autopoietic (living), can
yield useful scientific insights and can bring much that is new into engineering.
The limitations of such work, however, should be kept in mind. Two of those,
dealt with here, are that, (i) current robots cannot experience a phenomenal
world that arises directly out of having a living body, and (ii) the study of an
allopoietic machine cannot reveal the organisation of the underlying autopoietic
machine.
Acknowledgements
The authors would like to warmly thank Amanda Sharkey for helpful comments
on an earlier draft of this paper.
References
1. Ronald C. Arkin. Behavior-Based Robotics. Intelligent Robotics and Autonomous
Agents Series. MIT Press, Cambridge, MA, 1998.
2. Randall D. Beer. Intelligence as Adaptive Behavior - An Experiment in Computational Neuroethology. Academic Press, San Diego, CA, 1990.
3. N. Block. Are absent qualia impossible? Philosophical Review, 89:257–272, 1980.
4. Margaret Boden. Is metabolism necessary? Technical Report CSPR 482, School of
Cognitive and Computing Sciences, University of Sussex, Brighton, UK, January
1998.
5. Rodney A. Brooks. Achieving artificial intelligence through building robots. Technical Report Memo 899, MIT AI Lab, 1986.
6. Rodney A. Brooks. A robust layered control system for a mobile robot. IEEE
Journal of Robotics and Automation, 2:14–23, 1986.
7. Rodney A. Brooks. Elephants don’t play chess. Robotics and Autonomous Systems,
6(1–2):1–16, 1990.
8. Rodney A. Brooks. Intelligence Without Reason. In Proceedings of the Twelfth
International Joint Conference on Artificial Intelligence (IJCAI-91), pages 569–
595, San Mateo, CA, 1991. Morgan Kaufmann.
9. Rodney A. Brooks. Intelligence without representation. Artificial Intelligence,
47:139–159, 1991.
10. Rodney A. Brooks. The Engineering of Physical Grounding. In Proceedings of
the Fifteenth Annual Conference of the Cognitive Science Society, pages 153–154,
Hillsdale, NJ, 1993. Lawrence Erlbaum.
11. Rodney A. Brooks. Cambrian Intelligence: The Early History of the New AI. MIT
Press, Cambridge, MA, 1999.
12. D.J. Chalmers. Syntactic transformations on distributed representations. Connection Science, 2:53–62, 1990.
13. Hillel J. Chiel and Randall D. Beer. The brain has a body: Adaptive behavior
emerges from interactions of nervous system, body, and environment. Trends in
Neurosciences, 20:553–557, December 1997.
14. L. Chrisman. Learning recursive distributed representations for holistic computation. Connection Science, 3:345–366, 1991.
15. Andy Clark. Being There - Putting Brain, Body and World Together Again. MIT
Press, Cambridge, MA, 1997.
16. De Candolle. Reported in Fraenkel & Gunn (1940) The Orientation of Animals:
Kineses, Taxes and Compass Reactions, Clarendon Press: Oxford, UK, 1832.
17. H.L. Dreyfus. What computers can’t do: The limits of artificial intelligence. Harper
& Row, New York, 2nd revised edition, 1979.
18. Jerry A. Fodor and Zenon W. Pylyshyn. Connectionism and cognitive architecture:
A critical analysis. Cognition, 28:3–71, 1988.
19. G.S. Fraenkel and D.L. Gunn. The Orientation of Animals: Kineses, Taxes and
Compass Reactions. Clarendon Press, Oxford, 1940.
20. William Grey Walter. An imitation of life. Scientific American, 182:42–54, 1950.
21. William Grey Walter. The living brain. Norton, New York, 1953.
22. Stevan Harnad. The symbol grounding problem. Physica D, 42:335–346, 1990.
23. Fred A. Keijzer. Some Armchair Worries about Wheeled Behavior. In From animals
to animats 5 - Proceedings of the Fifth International Conference on Simulation of
Adaptive Behavior, pages 13–21, Cambridge, MA, 1998. MIT Press.
24. Knight. Reported in Frankel & Gunn (1940) The Orientation of Animals: Kineses,
Taxes and Compass Reactions, Clarendon Press: Oxford, UK, 1806.
25. D. Lambrinos, M. Marinus, H. Kobayashi, T. Labhart, R. Pfeifer, and R. Wehner. An autonomous agent navigating with a polarized light compass. Adaptive
Behavior, 6(1):131–161, 1997.
26. D. E. Lloyd. Simple Minds. MIT Press, Cambridge, MA, 1989.
27. Jacques Loeb. Forced movements, tropisms, and animal conduct. Lippincott Company, Philadelphia, 1918.
28. H.H. Lund, B. Webb, and J. Hallam. Physical and temporal scaling considerations
in a robot model of cricket calling song preference. Artificial Life, 4(1):95–107,
1998.
29. H. R. Maturana and F. J. Varela. Autopoiesis and Cognition - The Realization of
the Living. D. Reidel Publishing, Dordrecht, Holland, 1980.
30. H. R. Maturana and F. J. Varela. The Tree of Knowledge - The Biological Roots
of Human Understanding. Shambhala, Boston, MA, 1987. Revised edition, 1992.
Life, Mind, and Robots
331
31. L. Niklasson and N.E. Sharkey. The systematicity and generalisation of connectionist compositional representations. In R.Trappl, editor, Cybernetics and Systems.
Kluwer Academic Press, Dordrecht, NL, 1992.
32. L. Niklasson and T. van Gelder. On being systematically connectionist. Mind and
Language, 3:288–302, 1994.
33. Ivan Petrovich Pavlov. Conditioned Reflexes. Oxford University Press, London,
UK, 1927.
34. Rolf Pfeifer and Christian Scheier. Understanding Intelligence. MIT Press, Cambridge, MA, 1999.
35. P. Pfungst. Clever Hans (The horse of Mr. von Osten): A contribution to experimental animal and human psychology. Henry Holt, New York, 1911.
36. Erich Prem. Epistemic autonomy in models of living systems. In Proceedings of the
Fourth European Conference on Artificial Life, pages 2–9, Cambridge, MA, 1997.
MIT Press.
37. John Searle. Minds, brains and programs. Behavioral and Brain Sciences, 3:417–
457, 1980.
38. J.R. Searle. The Mystery of Consciousness. Granta Books, London, 1997.
39. Noel E. Sharkey. Connectionist representation techniques. Artificial Intelligence
Review, 5:143–167, 1991.
40. Noel E. Sharkey. Neural networks for coordination and control: The portability of
experiential representations. Robotics and Autonomous Systems, 22(3-4):345–359,
1997.
41. Noel E. Sharkey. The new wave in robot learning. Robotics and Autonomous
Systems, 22(3-4), 1997.
42. Noel E. Sharkey. Learning from innate behaviors: A quantitative evaluation of
neural network controllers. Autonomous Robots, 5:317–334, 1998. Also appeared
in Machine Learning, 31, 115-139.
43. Noel E. Sharkey and Jan Heemskerk. The neural mind and the robot. In A. J.
Browne, editor, Neural Network Perspectives on Cognition and Adaptive Robotics,
pages 169–194. IOP Press, Bristol, UK, 1997.
44. Noel E. Sharkey and Stuart A. Jackson. Three horns of the representational trilemma. In V. Honavar and L. Uhr, editors, Symbol Processing and Connectionist
Models for Artificial Intelligence and Cognitive Modeling: Steps towards Integration, pages 155–189. Academic Press, Cambridge, MA, 1994.
45. Noel E. Sharkey and Stuart A. Jackson. An internal report for connectionists. In
R. Sun and L. Bookman, editors, Computational architectures integrating neural
and symbolic processes: A perspective on the state of the art, pages 223–244. Kluwer
Academic Press, Boston, MA, 1995.
46. Noel E. Sharkey and Stuart A. Jackson. Grounding computational engines. Artificial Intelligence Review, 10(10):65–82, 1996.
47. Noel E. Sharkey and Amanda J.C. Sharkey. Emergent cognition. In J. Hendler, editor, Handbook of Neuropsychology. Vol. 9: Computational Modeling of Cognition,
pages 347–360. Elsevier Science, Amsterdam, The Netherlands, 1994.
48. Noel E. Sharkey and Tom Ziemke. A consideration of the biological and psychological foundations of autonomous robotics. Connection Science, 10(3–4):361–391,
1998.
49. Charles Scott Sherrington. The integrative action of the nervous system. C. Scribner’s Sons, New York, 1906.
50. P. Smolensky. On the proper treatment of connectionism. Behavioral and Brain
Sciences, 11:1–74, 1988.
332
N. Sharkey and T. Ziemke
51. F. Varela, E. Thompson, and E. Rosch. The Embodied Mind - Cognitive Science
and Human Experience. MIT Press, Cambridge, MA, 1991.
52. Jakob von Uexküll. Umwelt und Innenwelt der Tiere. Springer, Berlin, Germany,
1921.
53. Jakob von Uexküll. Theoretische Biologie. Suhrkamp, Frankfurt/Main, Germany,
1928.
54. Jakob von Uexküll. A stroll through the worlds of animals and men - a picture
book of invisible worlds. In Claire H. Schiller, editor, Instinctive Behavior - The
Development of a Modern Concept, pages 5–80. International Universities Press,
New York, 1957. Originally appeared as von Uexküll (1934) Streifzüge durch die
Umwelten von Tieren und Menschen. Springer, Berlin.
55. Jakob von Uexküll. The Theory of Meaning. Semiotica, 42(1):25–82, 1982.
56. Jakob von Uexküll. Environment [Umwelt] and inner world of animals. In G. M.
Burghardt, editor, Foundations of Comparative Ethology, pages 222–245. Van Nostrand Reinhold, New York, 1985.
57. Barbara Webb. Using robots to model animals: A cricket test. Robotics and
Autonomous Systems, 16(2–4):117–134, 1995.
58. Michael Wheeler. Cognition’s coming home: The reunion of life and mind. In
Phil Husbands and Inman Harvey, editors, Proceedings of the Fourth European
Conference on Artificial Life, pages 10–19, Cambridge, MA, 1997. MIT Press.
59. Tom Ziemke. The ‘Environmental Puppeteer’ Revisited: A Connectionist Perspective on ‘Autonomy’. In Proceedings of the 6th European Workshop on Learning
Robots (EWLR-6), pages 100–110, Brighton, UK, 1997.
60. Tom Ziemke. Adaptive Behavior in Autonomous Agents. Presence, 7(6):564–587,
1998.
61. Tom Ziemke. Rethinking Grounding. In Alexander Riegler, Markus Peschl, and
Astrid von Stein, editors, Understanding Representation in the Cognitive Sciences.
Plenum Press, New York, 1999.
62. Tom Ziemke and Noel E. Sharkey. A stroll through the worlds of robots and
animals: Applying Jakob von Uexküll’s theory of meaning to adaptive robots and
artificial life. Semiotica, special issue on the work of Jakob von Uexküll, to appear
in 2000.
Supplementing Neural Reinforcement Learning
with Symbolic Methods
Ron Sun
CECS Department
University of Missouri – Columbia
Columbia, MO 65211, USA
Abstract. Several different ways of using symbolic methods to enhance reinforcement learning (RL) are identified and discussed in some detail. Each demonstrates to some extent the potential advantages of combining RL and symbolic methods. In contrast to existing work on combining RL and symbolic methods, we focus on autonomous learning from scratch, without a priori domain-specific knowledge. Thus the role of symbolic methods lies truly in enhancing learning, not in providing a priori domain-specific knowledge. The methods discussed point to the possibilities and the challenges in this line of research.
1 Introduction
Reinforcement learning is useful in situations where there is no exact teacher
input (but there is sparse feedback), especially in sequential decision making.
In sequential decision making tasks, an agent needs to perform a sequence of actions to reach some goal state. It may learn to perform such tasks from scratch, without teacher input, using only reinforcements provided externally. However, reinforcement learning suffers from a few shortcomings (such as slow learning, discussed later).
In this paper, I will identify four major ways in which reinforcement learning can benefit from incorporating symbolic methods, in terms of either improving learning processes or improving learning results. I will focus on autonomous learning that requires no additional a priori domain-specific knowledge when incorporating symbolic methods, which constitutes a more difficult task (compared with providing additional symbolic domain-specific knowledge to a reinforcement learner; cf. Maclin and Shavlik 1994). The benefit of such a focus is that our methods are more likely to be applicable to changing environments, unknown environments, or other environments in which a priori knowledge is hard or costly to obtain, and to tasks in which a priori domain-specific symbolic structures are difficult to code by hand. The important point is that, although in our methods symbolic structures are generated based on reinforcement learning, the generated symbolic structures can in turn help with reinforcement learning in various ways (Sun and Peterson 1998).
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 333–347, 2000.
c Springer-Verlag Berlin Heidelberg 2000
1.1 Review of Reinforcement Learning
One popular reinforcement learning algorithm for dealing with sequential decision making tasks is Q-learning (Watkins 1989). In a Q-learning model, a Q-value
is an evaluation of the “quality” of an action in a given state: Q(x, a) indicates how desirable action a is in state x (which consists of sensory input). We
can choose an action based on Q-values. One easy way of choosing an action is
to choose the one that maximizes the Q-value in the current state; that is, we
choose a if Q(x, a) = max_i Q(x, i). To ensure adequate exploration, a stochastic decision process, for example one based on the Boltzmann distribution, can be used, so that different actions are tried in accordance with their respective probabilities, ensuring that various possibilities are all explored. Using a Boltzmann
distribution, we have
p(a|x) = e^{Q(x,a)/α} / Σ_i e^{Q(x,a_i)/α}
where α controls the degree of randomness (temperature) of the decision-making
process.
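For concreteness, Boltzmann action selection as described above can be sketched in Python (an illustrative sketch, not code from the original paper; subtracting the maximum Q-value before exponentiating is added purely for numerical stability):

```python
import math
import random

def boltzmann_select(q_values, alpha):
    """Pick an action index with probability proportional to e^(Q/alpha).

    q_values: list of Q(x, a) for each action a in the current state x.
    alpha: temperature; higher values make the choice more random.
    """
    # Subtract the max Q-value before exponentiating, for numerical stability.
    m = max(q_values)
    weights = [math.exp((q - m) / alpha) for q in q_values]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Sample an action index according to the Boltzmann probabilities.
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]
```

With a very low temperature the selection becomes nearly greedy; with a high temperature it approaches a uniform choice.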
To acquire the Q-values, we can use the Q-learning algorithm of Watkins
(1989). In the algorithm, Q(x, a) estimates the maximum discounted cumulative reinforcement that the model will receive from the current state x on: max(Σ_{i=0}^{∞} γ^i r_i), where γ is a discount factor that favors reinforcement received sooner relative to that received later, and r_i is the reinforcement received
at step i (which may be 0). The updating of Q(x, a) is based on minimizing
r + γe(y) − Q(x, a); that is,
∆Q(x, a) = α(r + γe(y) − Q(x, a))
where α is a learning rate, γ is a discount factor, e(y) = max_a Q(y, a), and y is the new state resulting from action a. Thus, the updating is based on the temporal difference in evaluating the current state and the action chosen.¹ Through
successive updates of the Q function, the model can learn to take into account
future steps in longer and longer sequences (Watkins 1989). The agent may
eventually converge to a stable function or find an optimal sequence that maximizes the reinforcement received. Hence, the agent learns to deal with sequential
decision making tasks.
¹ In the above formula, Q(x, a) estimates, before action a is performed, the discounted cumulative reinforcement to be received if action a is performed, and r + γe(y) estimates the discounted cumulative reinforcement that the agent will receive after action a is performed; so their difference (the temporal difference in evaluating an action) enables the learning of Q-values that approximate the discounted cumulative reinforcement.

1.2 Problems with Reinforcement Learning

Fig. 1. The Q-learning method: a network maps the state to Q-values, an action-selection node chooses among them, and a critic supplies the error signal.

A major problem facing reinforcement learning is the large size of the state space of any problem that is realistic and interesting. The large size of the state space leads to the storage problem of a huge lookup table. Therefore, function approximation or decomposition/aggregation methods are called for. The lookup table implementation of Q-learning may even be out of the question because of a continuous input space and, when discretized, the resulting huge state space (e.g., in our navigation task experiments, there are more than 10¹² states; Sun and Peterson 1998). Function approximators may have to be used
states; Sun and Peterson 1998). Function approximators may have to be used
in such cases (Sutton 1990, Lin 1992, Tesauro 1992). The large size of the state
space also leads to the problem of slow learning, due to the need to explore a
sufficiently large portion of the space in order to find optimal or near optimal
solutions (Whitehead 1993). Again, we need to use function approximation or
decomposition/aggregation methods to remedy the problem.
The most often used function approximators are backpropagation neural networks. To implement Q-learning, we can use a four-layered backpropagation
network (see Figure 1), in which the first three layers form a backpropagation
network for computing Q-values and the fourth layer (with only one node) performs stochastic decision making. The output of the third layer (i.e., the output
layer of the backpropagation network) indicates the Q-value of each action (represented by an individual node), and the node in the fourth layer determines
probabilistically the action to be performed based on a Boltzmann distribution
(Watkins 1989). The training of the backpropagation network is based on minimizing the following:
err_i = r + γe(y) − Q(x, a)   if a_i = a
err_i = 0                     otherwise
where i is the index for an output node representing the action a_i. The backpropagation procedure is then applied as usual to adjust weights.
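The per-node error defined above can be computed as in this sketch (illustrative only; feeding the resulting error vector into backpropagation is omitted):

```python
def q_error_vector(q_out, action_index, r, e_y, gamma=0.9):
    """err_i = r + gamma*e(y) - Q(x, a) for the node of the action taken,
    and err_i = 0 for every other output node."""
    errs = [0.0] * len(q_out)
    errs[action_index] = r + gamma * e_y - q_out[action_index]
    return errs
```

Only the output node of the action actually performed receives a nonzero error; the other nodes are left unchanged by that training step.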
However, using function approximators leads to the lack of guarantee of convergence of reinforcement learning. In practice, learning performance can be arbitrarily bad as has been demonstrated by Boyan and Moore (1995) and others.
Thus, we will examine ways (with symbolic methods) that remedy this problem
to some extent.
It should be noted that decision-theoretic methods for decomposition and aggregation are generally too costly computationally. They need to perform not only decomposition and aggregation but also decision-theoretic computation to decide when and where to do so. Given that, good heuristic methods are called for that avoid such costly deliberative computation
and achieve reasonably good outcomes nevertheless, which we will look into in
this paper.
Another problem is that reinforcement learning algorithms only lead to closed-loop action (or control) policies, which require moment-to-moment sensing to
discern current states in order for an agent to make action decisions. This type
of policy is unusable in situations where there is no feedback or feedback is unreliable, in which case an open-loop policy (or a semi-open-loop policy; more
later) is much more desirable.
1.3 Remedies
Here are the outlines of a few possible remedies that we proposed, which include
the creation of spatial regions for simplifying a learning task by decomposition,
the creation of sequential segments for simplifying a learning task by forming a
hierarchical structure of sequences, the extraction of symbolic rules to help with
the learning of function approximators, and the extraction of symbolic plans to
make learning results more usable. The benefits of these methods lie in either
(1) improving learning processes or (2) improving results. They enhance reinforcement learners without requiring additional a priori domain-specific knowledge
(as opposed to Maclin and Shavlik 1994).
First, in trying to speed up reinforcement learning and to improve learning
results, we may want to form a coalition of multiple (modular) reinforcement
learners (agents). In other words, we may partition a state space to form
regions each of which is assigned to a different learner to handle. Incorporating
symbolic methods can be very useful here. This is because fully on-line gradient
descent and similar methods for partitioning suffer from the problem of slow
learning due to the enlarged space to be explored (enlarged by combining partitioning and learning). Our region-splitting algorithm (Sun and Peterson 1999)
addresses partitioning in complex reinforcement learning tasks by splitting regions on-line but separately using symbolic descriptions (i.e., semi-on-line), to
facilitate the overall process. Splitting regions separately reduces the learning
complexity, as well as simplifies individual modules (and their function approximators), thus facilitating overall learning.
Second, in terms of speeding up reinforcement learning, incorporating symbolic rule learning methods can be very useful. This is because reinforcement
learning methods such as Q-learning (Watkins 1989) are known to be extremely slow. They require exponential time to explore state spaces and can be too
costly without function approximators. However, with function approximators,
it is difficult to ensure proper generalization (Boyan and Moore 1995). Clarion has been developed as a framework for addressing this problem (Sun and
Peterson 1998). It is based on the two-level approach proposed in Sun (1997).
That is, the model utilizes both continuous and discrete knowledge (in neural
and symbolic representations respectively), tapping into the synergy of the two
types of processes. The model goes from neural to symbolic representations in
a gradual fashion. It extracts symbolic rules to supplement neural networks (as
function approximators) to ensure that improper generalizations are corrected.
Third, besides improving learning, we can improve the usability of results from
reinforcement learning, using symbolic methods. For instance, we can improve
usability by extracting explicit plans that can be used in an open-loop fashion
from closed-loop policies resulting from reinforcement learning. Different from
pure reinforcement learning that generates only reactive plans (closed-loop policies), as well as existing probabilistic planning that requires a substantial amount
of a priori knowledge to begin with, we devise a two-stage bottom-up process, in
which first reinforcement learning is applied, without the use of a priori domain-specific knowledge, to acquire a closed-loop policy, and then explicit plans are
extracted (Sun and Sessions 1998). The extraction of symbolic knowledge improves the applicability of learned policies, especially when environmental feedback
is unavailable or unreliable.
Fourth, yet another method, SSS, involves learning to segment sequences
based on reinforcement received during task execution (Sun and Sessions 1999).
It segments sequences to create hierarchical structures to reduce non-Markovian
temporal dependencies in order to facilitate the learning of the overall task.
Note that none of the above methods requires a priori domain-specific knowledge to begin with, the benefit of which was stated earlier. (There are, however,
a few parameters that need to be set, as will be discussed below.)
2 Details of Several Methods
Let us look into some details of these methods.
2.1 Space Partitioning
We developed the region-splitting method, which addresses automatic partitioning in complex reinforcement learning tasks with multiple modules (or “agents”),
without a priori domain knowledge regarding task structures (Sun and Peterson
1999). Partitioning a state/input space into multiple regions helps to exploit
differential characteristics of regions and differential characteristics of modules, thus facilitating learning and reducing the complexity of modules, especially
when function approximators are used. Usually local modules turn out to be a lot
simpler than a monolithic, global module. We adopt a learning/partitioning decomposition — separating the two issues and optimizing them separately, which
facilitates the whole task (as opposed to the on-line methods of Jacobs et al
1991, Jordan and Jacobs 1994). What is especially important is the fact that
we can use hard partitioning, because we do not need to calculate gradients of
partitioning and hard partitioning is easier to do.
In the region-splitting algorithm, a region in a partition (non-overlapping and
with hard boundaries) is handled exclusively by a single module in the form of a
backpropagation network. The algorithm attempts to find better partitioning by
splitting existing regions incrementally when certain criteria are satisfied. The
splitting criterion is based on the total magnitude of the errors incurred
in a region during training and also based on the consistency of the errors (the
distribution of the directions of the errors, either positive or negative); these two
considerations can be combined.
Specifically, in the context of Q-learning (Watkins 1989), error is defined as
the Q-value updating amount (i.e., the Bellman residual; see Sun and Peterson
1997). That is, errorx = g + γ maxa′ Qk′ (x′ , a′ ) − Qk (x, a), where x is a (full or
partial) state description, a is the action taken, x′ is the new state resulting from
a in x, g is the reinforcement received, k is the module (agent) responsible for x,
and k ′ is the module (agent) responsible for x′ . We select those regions to split
that have high sums of absolute errors (or alternatively, sums of squared errors),
which are indicative of the high total magnitude of the errors (Bellman residuals),
but have low sums of errors, which together with high sums of absolute errors
are indicative of low error consistency (i.e., that Q-updates/Bellman residuals
are distributed in different directions). That is, our combined criterion is
consistency(r) = |Σ_x error_x| − Σ_x |error_x| < threshold1
where x refers to the data points encountered during previous training that are
within the region r to be split.
Next we select a dimension to be used in splitting within each region to
be split. Instead of being random, we again use the heuristics of high sums of
absolute errors but low error consistency. Since we have already calculated the
sum of absolute errors and it remains the same regardless what we do, what we
can do is to best split a dimension to increase the overall error consistency, i.e.,
the sums of errors (analogous to CART; see Breiman et al 1984). Specifically, we
compare for each dimension i in the region r the following measure: the increase
in consistency if the dimension is optimally split, that is,
= max(|
vi
X
x:xi <vi
∆consistency(r, i)
X
X
errorx | + |
errorx |) − |
errorx |
x
x:xi ≥vi
where v_i is a split point for dimension i, and x refers to the points within region r on one side or the other of the split point when projected onto dimension i. This measure indicates how much more we can increase the error consistency if we split dimension i optimally. The selection of dimension i is contingent upon Δconsistency(r, i) > threshold2. Among those dimensions that satisfy Δconsistency(r, i) > threshold2, we choose the one with the highest Δconsistency(r, i). For a selected dimension i, we then optimize the selection of a split point v_i′ based on maximizing the sum of the absolute values of the total errors on both sides of the split point:

v_i′ = argmax_{v_i} ( |Σ_{x: x_i < v_i} error_x| + |Σ_{x: x_i ≥ v_i} error_x| )

where v_i′ is the chosen split point for dimension i. Such a point is optimal in
the exact sense that error consistency is maximized (Breiman et al 1984).
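For a single dimension, computing Δconsistency and the optimal split point can be sketched as follows (an illustrative sketch; the names `points` and `errors` are mine, standing for the data points encountered in a region and their Bellman residuals):

```python
def delta_consistency(points, errors, dim):
    """Return (gain, best split value) for one dimension: maximize
    |sum of errors left of the split| + |sum of errors right of the split|,
    minus |sum of errors over the whole region|."""
    total = abs(sum(errors))
    best_gain, best_v = float('-inf'), None
    # Candidate split points: midpoints between consecutive coordinates.
    coords = sorted({p[dim] for p in points})
    for lo, hi in zip(coords, coords[1:]):
        v = (lo + hi) / 2.0
        left = sum(e for p, e in zip(points, errors) if p[dim] < v)
        right = sum(e for p, e in zip(points, errors) if p[dim] >= v)
        gain = abs(left) + abs(right) - total
        if gain > best_gain:
            best_gain, best_v = gain, v
    return best_gain, best_v
```

A region whose positive and negative residuals separate cleanly along some dimension yields a large gain, and is therefore a good candidate for splitting along that dimension.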
Then, we split the region r using a boundary created by the split point. We create a split hyper-plane using the selected point: spec = x_i < v_i′. We then split the region using the newly created boundary: region1 = region ∩ spec and region2 = region ∩ ¬spec, where region is the specification of the original region.
The algorithm is as follows:
1. Initialize one partition to contain only one region that covers the whole input space
2. Train an agent on the partition
3. Further split the partition
4. Train a set of agents (with each region assigned to a different agent)
5. If no more splitting can be done, stop; else, go to 3

Further splitting a partition:
For each region r that satisfies consistency(r) < threshold1, do:
1. Select a dimension j in the input space that maximizes Δconsistency(r, j), provided that Δconsistency(r, j) > threshold2
2. In the selected dimension j, select a point (a value v_j) lying within the region and maximizing Δconsistency(r, j)
3. Using the selected point in the selected dimension, create a split hyper-plane: spec = x_j < v_j
4. Split the region using the newly created hyper-plane: region1 = region ∩ spec and region2 = region ∩ ¬spec, where region is the specification of the original region; create two new agents for handling these two new regions by replicating the agent for the original region
5. If the number of regions exceeds R, keep combining regions until the number is right: randomly select two regions (preferring two adjacent regions) and merge the two; keep one of the two agents responsible for these two regions and delete the other
The method was experimentally tested and compared to a number of other
algorithms. As expected, we found that the multi-module automatic partitioning method outperformed single-module learning and on-line gradient descent
partitioning methods (Sun and Peterson 1999).
2.2 Rule Extraction
We developed a two-level model for rule extraction (i.e., the Clarion model)
(Sun 1997, Sun and Peterson 1997, 1998, Sun et al 1999). The bottom level implements the usual Q-learning (Watkins 1989) using a backpropagation network.
The top level learns rules.
In the top level, we devised a novel rule learning algorithm based on neural
reinforcement learning. The basic idea is as follows: we perform rule learning
(extraction and subsequent revision) at each step, which is associated with the
following information: (x, y, r, a), where x is the state before action a is performed, y is the new state entered after action a is performed, and r is the reinforcement received after action a. If some action decided by the bottom level is successful, then the agent extracts a rule that corresponds to the decision
and adds the rule to the rule network. Then, in subsequent interactions with
the world, the agent verifies the extracted rule by considering the outcome of
applying the rule: if the outcome is not successful, then the rule should be made
more specific and exclusive of the current case (“shrinking”); if the outcome is
successful, the agent may try to generalize the rule to make it more universal
(“expansion”). Rules are in the following form: conditions −→ action, where the
left-hand side is a conjunction of individual conditions each of which refers to a
primitive: a value range or a value in a dimension of the (sensory) input state.
At each step, we update the following statistics for each rule condition and
each of its minor variations (i.e., the rule condition plus/minus one value), with
regard to the action a performed: PM_a (positive match) and NM_a (negative match). Here, positivity/negativity is determined by the Bellman
residual (the Q-value updating amount) which indicates whether or not the
action is reasonably good. Based on these statistics, we calculate the information
gain measure; that is,
IG(A, B) = log₂( (PM_a(A) + 1) / (PM_a(A) + NM_a(A) + 2) ) − log₂( (PM_a(B) + 1) / (PM_a(B) + NM_a(B) + 2) )
where A and B are two different conditions that lead to the same action a. The
measure compares essentially the percentage of positive matches under different
conditions A and B (with the Laplace estimator; Lavrac and Dzeroski 1994). If
A can improve the percentage to a certain degree over B, then A is considered
better than B. In the algorithm, if a rule is better compared with the match-all rule (i.e., the rule whose condition matches all inputs), then the rule
is considered successful (for the purpose of deciding on expansion or shrinking
operations).
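The information gain measure can be computed directly from the match statistics, as in this sketch (illustrative; the PM and NM arguments are the positive/negative match counts described above):

```python
import math

def information_gain(pm_a, nm_a, pm_b, nm_b):
    """IG(A, B): difference of log2 Laplace-corrected positive-match
    rates of conditions A and B for the same action."""
    rate_a = (pm_a + 1) / (pm_a + nm_a + 2)
    rate_b = (pm_b + 1) / (pm_b + nm_b + 2)
    return math.log2(rate_a) - math.log2(rate_b)
```

The Laplace correction (adding 1 and 2) keeps the estimate well-defined when a condition has few or no recorded matches.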
We decide on whether or not to extract a rule based on a simple success
criterion which is fully determined by the current step (x, y, r, a):
– Extraction: if r + γ max_b Q(y, b) − Q(x, a) > threshold, where a is the action
performed in state x, r is the reinforcement received, and y is the resulting
new state (that is, if the current step is successful), and if there is no rule
that covers this step in the top level, set up a rule C −→ a, where C specifies
the values of all the input dimensions exactly as in x.
The criterion for applying the expansion and shrinking operators, on the other hand, is based on the aforementioned statistical test. Expansion amounts to adding an additional value to one input dimension in the condition of a rule, so that the rule will have more opportunities of matching inputs, and shrinking amounts to removing one value from one input dimension in the condition of a rule, so that it will have fewer opportunities of matching inputs. Here are the
detailed descriptions of these operators:
– Expansion: if IG(C, all) > threshold1 and max_{C′} IG(C′, C) ≥ 0, where C is the current condition of a matching rule, all refers to no condition at all (with regard to the same action specified by the rule), and C′ is a modified condition such that C′ = C plus one value (i.e., C′ has one more value in one of the input dimensions) (that is, if the current rule is successful and the expanded condition is potentially better), then set C′′ = argmax_{C′} IG(C′, C) as the new (expanded) condition of the rule. Reset all the rule statistics.
– Shrinking: if IG(C, all) < threshold2 and max_{C′} IG(C′, C) > 0, where C is the current condition of a matching rule, all refers to no condition at all (with regard to the same action specified by the rule), and C′ is a modified condition such that C′ = C minus one value (i.e., C′ has one less value in one of the input dimensions) (that is, if the current rule is unsuccessful but the shrunk condition is better), then set C′′ = argmax_{C′} IG(C′, C) as the new (shrunk) condition of the rule. Reset all the rule statistics.
If shrinking the condition makes it impossible for a rule to match any input
state, delete the rule.
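Putting the two criteria together, the choice among expansion, shrinking, and leaving a rule unchanged might be sketched as follows (illustrative only; the threshold values are hypothetical, and the IG quantities are assumed to have been precomputed from the match statistics):

```python
def decide_operation(ig_vs_all, best_ig_expand, best_ig_shrink,
                     threshold1=0.2, threshold2=-0.2):
    """Choose 'expand', 'shrink', or 'keep' for a matching rule.
    ig_vs_all: IG(C, all); best_ig_*: max IG(C', C) over candidate
    expanded/shrunk conditions C'. Threshold values are illustrative."""
    if ig_vs_all > threshold1 and best_ig_expand >= 0:
        return 'expand'   # rule is successful; a wider condition is no worse
    if ig_vs_all < threshold2 and best_ig_shrink > 0:
        return 'shrink'   # rule is unsuccessful; a narrower condition helps
    return 'keep'
```

The statistics of a rule are reset after either operation, since the modified condition must be re-evaluated on fresh experience.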
For making the final decision of which action to take, we combine the corresponding values for each action from the two levels by a weighted sum; that is, if
the top level indicates that action a has an activation value v (which should be 0
or 1 as rules are binary) and the bottom level indicates that a has an activation
value q (the Q-value), then the final outcome is w1 ∗ v + w2 ∗ q. Stochastic decision making with Boltzmann distribution based on the weighted sums is then
performed to select an action out of all the possible actions. Relative weights or
percentages of the two levels are automatically set based on the relative performance of the two levels.
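The weighted-sum combination followed by Boltzmann selection might be sketched as follows (illustrative; here w1, w2, and the temperature α are fixed hypothetical values, whereas the model sets the weights automatically from the relative performance of the two levels):

```python
import math
import random

def combined_action(rule_vals, q_vals, w1=0.5, w2=0.5, alpha=0.1):
    """Combine top-level rule activations (0 or 1) with bottom-level
    Q-values per action, then select an action stochastically (Boltzmann)."""
    combined = [w1 * v + w2 * q for v, q in zip(rule_vals, q_vals)]
    m = max(combined)  # subtract max for numerical stability
    weights = [math.exp((c - m) / alpha) for c in combined]
    return random.choices(range(len(combined)), weights=weights, k=1)[0]
```

When a rule fires for an action (activation 1), that action's combined value, and hence its selection probability, is boosted relative to actions supported only by Q-values.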
Although it does not reduce learning complexity by an order of magnitude,
Clarion does help to speed up learning, empirically, in all our experiments in
various sequential decision task domains (see Sun and Peterson 1998 for details).
Note that this kind of rule extraction is very different from rule extraction at
the end of training a backpropagation network (as in Towell and Shavlik 1993),
in which costly search is done to find rules without benefiting learning processes
per se.
2.3 Plan Extraction
The above rule extraction method generates isolated rules, focusing on individual states, but not the chaining of these rules in accomplishing a sequential decision task. In contrast, the following plan extraction method generates a complete
explicit plan that can by itself accomplish a sequential task (Sun and Sessions
1998a). By an explicit plan, we mean a control policy consisting of an explicit
sequence of action steps, that does not require (or requires little) environmental
feedback during execution (compared with a completely closed-loop control policy). When no environmental feedback (sensing) is required, an explicit plan
amounts to an open-loop policy. When a small amount of feedback (sensing) is
required, it amounts to a semi-open-loop policy. In either case, an explicit plan
can lead to “policy compression” (i.e., it can lead to fewer specifications for fewer states, through explication of the closed-loop policy). For example, instead
342
R. Sun
of a closed-loop policy specifying actions for all of the 10^12 states in a domain
(see Sun and Sessions 1998a), an explicit plan may contain a sequence of 10
actions each of which either does not rely on sensing states, or relies only on
distinguishing 5 different states at a time.
Our plan extraction method (Sun and Sessions 1998a, b) turns a set of Q
values (and the corresponding policy resulting from these values) into a plan
in the form of a sequence of steps (in accordance with the traditional
AI formulation of planning). The basic idea is to use beam search to find
the best action sequences (or conditional action sequences) that achieve the goal
with a certain probability.
We recognize that the optimal Q value learned through Q-learning represents
the total future probability of reaching the goal (Sun and Sessions 1998a). Thus
Q-values can be used as a guide in searching for explicit plans.
We employ the following data structures in plan extraction. The current
state set, CSS, consists of multiple pairs in the form of (s, p(s)), in which the
first item indicates a state s and the second item p(s) indicates the probability
of that state. For each state in CSS, we find the corresponding best action. In
so doing, we have to limit the number of branches at each step, for the sake of
time efficiency of the algorithm as well as the representational efficiency of the
resulting plan. The set thus contains up to n pairs (n is a fixed number, the
branching factor of the beam search). In order to calculate the best default
action at each step, we include a second set of states, CSS′, which covers a certain
number (m) of possible states not covered by CSS.
The algorithm is as follows:
Set the current state set CSS = {(s0 , 1)} and CSS ′ = {}
Repeat until the termination conditions are satisfied (e.g., step > D)
- For each action u, compute the probabilities of transitioning to each of the
  possible next states (for all s′ ∈ S) from each of the current states
  (s ∈ CSS):
  p(s′, s, u) = p(s) ∗ p_{s,s′}(u)
- For each action u, compute its estimated utility with respect to each state
  in CSS:
  Ut(s, u) = Σ_{s′} p(s′, s, u) ∗ max_v Q(s′, v)
  That is, we calculate the probability of reaching the goal after performing
  action u from the current state s.
- For each action u, compute the estimated utility with respect to all the states
  in CSS′:
  Ut(CSS′, u) = Σ_{s∈CSS′} Σ_{s′} p(s) ∗ p_{s,s′}(u) ∗ max_v Q(s′, v)
- For each state s in CSS, choose the action u_s with the highest utility Ut(s, u):
  u_s = argmax_u Ut(s, u)
- Choose the best default action u with regard to all the states in CSS′:
  u = argmax_{u′} Ut(CSS′, u′)
Supplementing Neural Reinforcement Learning with Symbolic Methods
343
- Update CSS to contain the n states that have the highest probabilities, i.e.,
  with the highest p(s′)’s:
  p(s′) = Σ_{s∈CSS} p(s′, s, u_s)
  where u_s is the action chosen for state s.
- Update CSS′ to contain the m states that have the highest probabilities,
  calculated as follows, among those states that are not in the new (updated)
  CSS:
  p(s′) = Σ_{s∈CSS∪CSS′} p(s′, s, u_s)
  where u_s is the action chosen for state s (either a conditional action in case
  s ∈ CSS or a default action in case s ∈ CSS′), and the summations are
  over the old CSS and CSS′ (before updating).²
In the measure Ut, we take into account the probability of reaching the goal
in the future from the current states (based on the Q values; see Theorem 1 in
Sun and Sessions 1998a), as well as the probability of reaching the current states
based on the history of the paths traversed (based on the p(s)’s). This is because
what we are aiming at is an estimate of the overall success probability of a path.
In the algorithm, we select the best actions (for s ∈ CSS and CSS′) that are
most likely to succeed based on these measures, respectively.
Note that (1) as a result of incorporating nonconditional default actions,
nonconditional plans are special cases of conditional plans. (2) If we set n = 0
and m > 0, we then in effect have a nonconditional plan extraction algorithm
and the result from the algorithm is a nonconditional plan. (3) If we set m = 0,
then we have a purely conditional plan (with no default action attached).
An issue is how to determine the branching factor, i.e., the number of conditional actions at each step. We can start with a small number, say 1 or 2, and
gradually expand the search by adding to the number of branches, until a certain
criterion (a termination condition) is satisfied. We can terminate the search
when p(G) > δ (where δ is specified a priori) or when a time limit is reached (in
which case failure is declared).
The advantage of extracting plans is that, instead of relying on moment-to-moment
sensing as closed-loop policies must, extracted plans can be used in an
open-loop fashion, which is useful when feedback is unavailable or unreliable.
In addition, extracting plans usually leads to savings in sensing and storage
costs (Sun and Sessions 1998a).
² For both CSS and CSS′, if a goal state or a state of probability 0 is selected, we
may remove it and, optionally, reduce the beam width of the corresponding set by 1.
2.4
Segmentation
The SSS method (which stands for Self-Segmentation of Sequences; Sun and Sessions 1999) involves learning to segment sequences to create hierarchical structures, based on reinforcement received during task execution, with different levels
of control communicating with each other by sharing the reinforcement estimates each obtains.
In SSS, there are three types of learning modules:
– Individual action module Q: Each performs actions and learns through Q-learning
to maximize its reinforcements.
– Individual controller CQ: Each CQ learns when a Q module (corresponding to the
CQ) should continue its control and when it should give up control, in terms
of maximizing reinforcements. The learning is accomplished through (separate)
Q-learning.
– Abstract controller AQ: It performs and learns abstract control actions, that is,
which Q module to select under what circumstances. The learning is accomplished
through (separate) Q-learning to maximize reinforcements.
The model works as follows:
1. Observe the current state s.
2. The currently active Q/CQ pair takes control. If there is no active one (when
the system first starts), go to step 5.
3. The active CQ selects and performs a control action based on CQ(s, ca) for
different ca. If the action chosen by CQ is en , go to step 5. Otherwise, the
active Q selects and performs an action based on Q(s, a) for different a.
4. The active Q and CQ perform learning. Go to step 1.
5. AQ selects and performs an abstract control action based on AQ(s, aa) for
different aa, to select a Q/CQ pair to become active.
6. AQ performs learning. Go to step 3.
For the details of learning of the three types of modules, see Sun and Sessions
(1999).
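The control regime of the three module types (steps 1–6 above) can be sketched as follows. The environment interface (`observe`/`done`/`step`), the `select` and `learn` callbacks, and the "end" control-action name are assumptions of this sketch; the separate Q-learning updates for Q, CQ, and AQ are abstracted into the `learn` placeholder.

```python
def sss_episode(env, modules, AQ, select, learn):
    """One episode of the SSS control regime (sketch). `modules` is a list
    of (Q, CQ) pairs; `AQ` is the abstract controller. `select(table, s)`
    picks an action from a Q-table; `learn(...)` stands in for the separate
    Q-learning updates, which are omitted here."""
    s = env.observe()                 # step 1: observe the current state
    active = None
    while not env.done():
        if active is None:
            aa = select(AQ, s)        # step 5: AQ selects a Q/CQ pair
            learn(AQ, s, aa)          # step 6: AQ learns
            active = modules[aa]
        Q, CQ = active
        ca = select(CQ, s)            # step 3: CQ decides continue / give up
        if ca == "end":
            active = None             # control returns to AQ (step 5)
            continue
        a = select(Q, s)              # the active Q selects an action
        env.step(a)
        learn(Q, s, a)                # step 4: Q and CQ learn
        learn(CQ, s, ca)
        s = env.observe()             # back to step 1
```

Segmentation emerges where CQ chooses "end": each stretch of control by one Q/CQ pair is one subsequence in the learned hierarchy.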
Through the interaction of the three types of modules, SSS segments sequences to create a hierarchy of subsequences, in an effort to maximize the overall
reinforcement (see Sun and Sessions 1999 for detailed analyses). It distinguishes
global and local contexts (i.e., non-Markovian dependencies) and treats them
separately. This is done through automatically determining segments (subsequences) or, in other words, automatically seeking out proper configurations of
temporal structures (i.e., global and local non-Markovian dependencies), which
leads to more reinforcements.
Note that such segmentation is different from reinforcement learning using
pre-given hierarchical structures. In the latter work, a vast amount of a priori
domain-specific knowledge has to be worked out by hand and engineered into a
learner.
3
Discussion
We can characterize the symbolic methods discussed above along several different
dimensions. First of all, in terms of the timing of applying symbolic methods,
we have the following possibilities:
– off-line: symbolic methods are applied at the end (e.g., in extracting plans)
– semi-on-line: symbolic methods are applied along the way but separately
(e.g., in extracting rules and in creating regions)
– on-line: symbolic methods are applied within a unified mathematical or computational framework (e.g., in creating hierarchies)
Second, in terms of the techniques used for creating symbolic structures, we have
the following possibilities:
– beam search
– IG-based incremental search
– modular competition
– incremental splitting
Many other possibilities that we have not yet explored exist. Finally, we can
have the following varieties of resulting symbolic structures:
– small pieces of knowledge (e.g., rules)
– large chunks of knowledge (e.g. plans)
– global structures (e.g. hierarchies and regions)
These dimensions together define a very large space which we have not yet fully
explored. There are many possibilities out there.
There are many challenges that we need to address in order to further develop
hybrid reinforcement learning methods along the line outlined above. First of
all, we need to develop better extraction algorithms that generate useful and
comprehensible regions, rules, plans, and hierarchical structures. We can further
explore the space outlined above. We shall also look into other mathematical
frameworks for these processes. Second, we need to test these algorithms on large-scale real-world application problems, in order to fully validate them. Third, we
shall also look into other ways of incorporating symbolic processes and structures
into reinforcement learning. Finally, we need to investigate ways of mixing these
techniques to achieve even better performance.
In sum, there are a variety of ways of improving reinforcement learning in
practice by combining it with symbolic methods, without losing its characteristic autonomy or requiring a priori domain-specific knowledge to begin with (besides reinforcement). This paper highlights a few possibilities and challenges in this line of work, and hopes to stimulate further research
Acknowledgements
This work was supported in part by Office of Naval Research grant N00014-95-1-0440. The work was done in collaboration with Todd Peterson, Chad Sessions,
and Ed Merrill.
References
1. J. Boyan and A. Moore, (1995). Generalization in reinforcement learning: safely
approximating the value function. In: G. Tesauro, D. Touretzky, and T. Leen,
(eds.) Advances in Neural Information Processing Systems, 369-376, MIT Press, Cambridge,
MA.
2. L. Breiman, J. Friedman, R. Olshen, and C. Stone, (1984). Classification and Regression Trees.
Wadsworth, Belmont, CA.
3. R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton, (1991). Adaptive mixtures of
local experts. Neural Computation. 3, 79-87.
4. M. Jordan and R. Jacobs, (1994). Hierarchical mixtures of experts and the EM
algorithm. Neural Computation. 6, 181-214.
5. N. Lavrac and S. Dzeroski, (1994). Inductive Logic Programming. Ellis Horwood,
New York.
6. L. Lin, (1992). Self-improving reactive agents based on reinforcement learning,
planning, and teaching. Machine Learning. Vol.8, pp.293-321.
7. R. Maclin and J. Shavlik, (1994). Incorporating advice into agents that learn from
reinforcements. Proc. of the National Conference on Artificial Intelligence (AAAI-94). Morgan Kaufmann, San Mateo, CA.
8. S. Singh, (1994). Learning to Solve Markovian Decision Processes. Ph.D Thesis,
University of Massachusetts, Amherst, MA.
9. R. Sun, (1992). On variable binding in connectionist networks. Connection Science,
Vol.4, No.2, pp.93-124. 1992.
10. R. Sun, (1997). Learning, action, and consciousness: a hybrid approach towards
modeling consciousness. Neural Networks, 10 (7), pp.1317-1331
11. R. Sun and T. Peterson, (1997). A hybrid model for learning sequential navigation.
Proc. of IEEE International Symposium on Computational Intelligence in Robotics
and Automation (CIRA’97). Monterey, CA. pp.234-239. IEEE Press, Piscataway,
NJ.
12. R. Sun and T. Peterson, (1998). Autonomous learning of sequential tasks: experiments and analyses. IEEE Transactions on Neural Networks, Vol.9, No.6, pp.1217-1234.
13. R. Sun and T. Peterson, (1999). Multi-agent reinforcement learning: weighting and
partitioning. Neural Networks, Vol.12 No.4-5. pp.127-153.
14. R. Sun, T. Peterson, and E. Merrill, (1999). A hybrid architecture for situated
learning of reactive sequential decision making. Applied Intelligence, in press.
15. R. Sun and C. Sessions, (1998a). Extracting plans from reinforcement learners.
Proceedings of the 1998 International Symposium on Intelligent Data Engineering
and Learning (IDEAL’98). pp.243-248. eds. L. Xu, L. Chan, I. King, and A. Fu.
Springer-Verlag, Heidelberg.
16. R. Sun and C. Sessions, (1998b). Learning to plan probabilistically from neural
networks. Proceedings of IEEE International Joint Conference on Neural Networks,
pp.1-6. IEEE Press, Piscataway, NJ.
17. R. Sun and C. Sessions, (1999). Self segmentation of sequences. Proceedings of
IEEE International Joint Conference on Neural Networks, IEEE Press, Piscataway,
NJ.
18. R. Sutton, (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proc. of the Seventh International Conference on Machine Learning. Morgan Kaufmann, San Mateo, CA.
19. G. Tesauro, (1992). Practical issues in temporal difference learning. Machine Learning. Vol.8, 257-277.
20. G. Towell and J. Shavlik, (1993). Extracting refined rules from Knowledge-Based
Neural Networks, Machine Learning. 13 (1), 71-101.
21. C. Watkins, (1989). Learning from Delayed Rewards. Ph.D. Thesis, Cambridge University, Cambridge, UK.
22. S. Whitehead, (1993). A complexity analysis of cooperative mechanisms in reinforcement learning. Proc. of the National Conference on Artificial Intelligence
(AAAI’93), 607-613. Morgan Kaufmann, San Francisco, CA.
Self-Organizing Maps in Symbol Processing
Timo Honkela
Media Lab, University of Art and Design,
Hämeentie 135 C, FIN-00560 Helsinki, Finland
Timo.Honkela@uiah.fi
http://www.mlab.uiah.fi/˜timo/
Abstract. A symbol as such is disassociated from the world. In addition, as a discrete entity a symbol does not mirror all the details of
the portion of the world that it is meant to refer to. Humans establish
the association between the symbols and the referenced domain — the
words and the world — through a long learning process in a community. This paper studies how Kohonen self-organizing maps can be used
for modeling the learning process needed in order to create a conceptual
space based on a relevant context with which the symbols are associated.
The categories that emerge in the self-organizing process and their implicitness are considered as well as the possibilities to model contextuality,
subjectivity and intersubjectivity of interpretation.
1
Introduction
Models of natural language may test the background assumptions of the developers, or, at least, reflect them. In the predominant approaches among computerized models of language, the linguistic categories and rules are predetermined
and coded by hand explicitly as symbolic representations. The field of connectionist natural language processing, based on the use of artificial neural networks,
may be characterized as taking the opposite stand. The critical view of symbolic
representations is based on the idea that the symbolic and discrete nature of
written expressions in natural language does not imply that symbolic descriptions of linguistic phenomena are sufficient as such. This view appears to be
relevant especially when semantic and pragmatic issues are considered. A traditional, logic-based analysis studies examples like the ones given below (from
[35]). The emphasis lies in phenomena that are suitable to be explained in the
framework of predicate logic, e.g., propositional forms, connectives, quantifiers,
truth values, presuppositions, and logical ambiguity.
– “Each one of Mozart’s works is a masterpiece.”
– “If butter is heated, it melts.”
However, if one considers the sentences and conversations given below, it
should be apparent that there is a need for formalisms and tools that enable modeling of, for instance, adaptation, vagueness, contextuality, subjectivity of interpretation, and the relationship between discrete symbols and continuous spaces
in the domain under consideration.
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 348–362, 2000.
© Springer-Verlag Berlin Heidelberg 2000
– “Please, show me some pictures with beautiful Finnish lake sceneries.”
– “Do you see that small woman there?”
“Actually, I don’t consider her small, since people in her country are usually
much shorter than here.”
A radically connectionist natural language processing approach is based on
the following assumptions. The ability to understand natural language utterances can be learned via examples. The categories necessary in the interpretation
emerge in the self-organizing learning processes and they may be implicit rather
than explicit as will be shown later in this article. An implicit category can be
used during interpretation even if it is not named. The processing mechanisms
are mainly statistical rather than rule-like. The process of symbol grounding,
i.e., associating symbols with continuous multi-dimensional spaces and dynamic
processes, as well as the assumptions outlined above, are discussed in this paper.
The methodological basis is Kohonen’s self-organizing map algorithm.
2
Self-Organizing Map
The basic self-organizing map (SOM) [22,24] can be visualized as a sheet-like
neural-network array, the cells (nodes, units) of which become specifically tuned
to various input signal patterns or classes of patterns in an orderly fashion. The
learning process is competitive and unsupervised, meaning that no teacher is
needed to define for an input the correct output, i.e., the cell into which the input
is mapped. The locations of the responses in the map tend to become ordered in
the learning process as if some meaningful nonlinear coordinate system for the
different input features were being created over the network [24].
2.1
Self-Organizing Map Algorithm
Assume that some sample data sets (such as in Table 1) have to be mapped onto
a 2-dimensional array. The set of input samples is described by a real vector
x(t) ∈ Rn where t is the index of the sample, or the discrete-time coordinate.
Each node i in the map contains a model vector mi (t) ∈ Rn , which has the same
number of elements as the input vector x(t).
The initial values of the components of the model vector, mi (t), may even
be selected at random. In practical applications, however, the model vectors are
more profitably initialized in some orderly fashion, e.g., along a two-dimensional
subspace spanned by the two principal eigenvectors of the input data vectors [24].
Any input item is thought to be mapped into the location, the mi (t) of
which matches best with x(t) in some metric (e.g. Euclidean). The self-organizing
algorithm creates the ordered mapping as a repetition of the following basic
tasks:
1. An input vector x(t) is compared with all the model vectors mi (t). The bestmatching unit (node) on the map, i.e., the node where the model vector is
most similar to the input vector in some metric is identified. This bestmatching unit is often called the winner.
Table 1. Three-dimensional input data in which each sample vector x consists of the
red-green-blue values of the color shown in the rightmost column.
250 235 215   antique white
165  42  42   brown
222 184 135   burlywood
210 105  30   chocolate
255 127  80   coral
184 134  11   dark goldenrod
189 183 107   dark khaki
255 140   0   dark orange
233 150 122   dark salmon
...
2. The model vectors of the winner and a number of its neighboring nodes in
the array are changed towards the input vector according to the learning
principle specified below.
The basic idea in the SOM learning process is that, for each sample input
vector x(t), the winner and the nodes in its neighborhood are changed closer
to x(t) in the input data space. During the learning process, individual changes
may be contradictory, but the net outcome in the process is that ordered values
for the mi (t) emerge over the array. If the number of available input samples is
restricted, the samples must be presented reiteratively to the SOM algorithm.
Adaptation of the model vectors in the learning process takes place according
to the following equations:
m_i(t + 1) = m_i(t) + α(t)[x(t) − m_i(t)]   for each i ∈ N_c(t),
m_i(t + 1) = m_i(t)   otherwise,

where t is the discrete-time index of the variables, the factor α(t) ∈ [0, 1] is a
scalar that defines the relative size of the learning step, and N_c(t) specifies the
neighborhood around the winner in the map array.
At the beginning of the learning process the radius of the neighborhood is
fairly large, but it is made to shrink during learning. This ensures that the global
order is obtained already at the beginning, whereas towards the end, as the radius
gets smaller, the local corrections of the model vectors in the map will be more
specific. The factor α(t) also decreases during learning. The resulting map is
shown in Figure 1. The experiment was conducted using SOM PAK software
[25].
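The two basic tasks and the shrinking schedules described above can be sketched as a minimal training loop. This is a sketch under stated assumptions, not SOM_PAK: the linear decay schedules, the square neighborhood, the random initialization (the recommended principal-subspace initialization is omitted), and the parameter names are illustrative choices.

```python
import random

def train_som(data, rows, cols, dim, epochs=20, alpha0=0.5, radius0=None, seed=0):
    """Minimal SOM training loop (sketch). Both the learning rate alpha(t)
    and the neighborhood radius shrink during learning, as in the text."""
    rng = random.Random(seed)
    if radius0 is None:
        radius0 = max(rows, cols) / 2.0
    # model vectors m_i(t), here initialized at random
    m = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
         for _ in range(rows)]

    def bmu(x):
        # best-matching unit: node whose model vector is closest to x (Euclidean)
        return min(((r, c) for r in range(rows) for c in range(cols)),
                   key=lambda rc: sum((a - b) ** 2
                                      for a, b in zip(x, m[rc[0]][rc[1]])))

    steps, t = epochs * len(data), 0
    for _ in range(epochs):          # samples presented reiteratively
        for x in data:
            frac = t / steps
            alpha = alpha0 * (1.0 - frac)    # shrinking learning step
            radius = radius0 * (1.0 - frac)  # shrinking neighborhood
            wr, wc = bmu(x)
            for r in range(rows):
                for c in range(cols):
                    # move the winner and its neighbors toward x
                    if max(abs(r - wr), abs(c - wc)) <= radius:
                        mv = m[r][c]
                        for i in range(dim):
                            mv[i] += alpha * (x[i] - mv[i])
            t += 1
    return m, bmu
```

After training, labeling a map as in Figure 1 amounts to calling `bmu` for each input sample and attaching the sample's symbol to the returned node.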
Perhaps the most typical notion of the SOM is to consider it as an artificial
neural network model of the brain, especially of the experimentally found ordered
“maps” in the cortex. There exists a lot of neurophysiological evidence to support
the idea that the SOM captures some of the fundamental processing principles of
the brain [27]. Other early artificial-neural-network models of self-organization
have been presented, e.g., in [1], [3], and [54].
The SOM can also be viewed as a model of unsupervised machine learning,
and as an adaptive knowledge representation scheme.

Fig. 1. A map of colors based on their red-green-blue values. The color symbols in the
rightmost column of Table 1 are used in labeling the map: the best-matching unit is
searched for each input sample and that node is labeled accordingly.

The traditional knowledge representation formalisms – semantic networks, frame systems, predicate logic,
to provide some examples – are static and the reference relations of the elements
are determined by a human. Moreover, those formalisms are based on the tacit
assumption that the relationship between natural language and the world is one-to-one: the world consists of objects and the relationships between the objects,
and these objects and relationships have straightforward correspondence to the
elements of language. An alternative point of view is that the pattern recognition
process must be taken into account: expressions of languages refer to patterns
and distributions of patterns in the concrete perceptual domain and often also
in the abstract domain.
One does not need to question the existence of the world in order to be critical
towards the notion of entities or objects as a basis for epistemological considerations. Both anticipation and context influence the perception and the naming
process. This view is adopted, e.g., in constructivism. One early constructivist,
Heinz von Foerster, has stated that objects and events are not primitive experiences but they are representations of relations. The construction of these relations
is subjective, constrained by anatomical and cultural factors. The postulate of
an external (objective) reality gives way to a reality that is determined by modes
of internal computations [7]. This relativity or subjectivity of interpretation of
symbols does not, however, lead to arbitrariness, since language users can
refine their interpretation models toward each other through communication.
The use of the SOM in modeling such a learning process is considered in [12].
3
SOM-Based Symbol Processing
The self-organizing map can be used in several ways when symbol processing is
considered. One can create a map of symbols by associating each label with a
numerical vector and finding the corresponding best-matching location on the map.
Another approach is to encode each symbol with, e.g., a unique random vector and
use this coding as the basis in the learning process [47]. The order of the map
is based on presenting the encoded word with its context during the learning.
The context can, e.g., be textual [13,47], or numerical measurements and representations [11]. The latter can originate from a visual source. An overview of
connectionist, statistical and symbolic approaches in natural language processing
and an introduction to several articles is given in [55].
3.1
Maps of Words
Contextual information has been widely used in the statistical analysis of natural
language corpora. Charniak [4] presents the following scheme for grouping or
clustering words into classes that reflect the commonality of some property.
1. Define the properties that are taken into account and can be given a numerical value.
2. Create a vector of length n with n numerical values for each item to be
classified.
3. Cluster the points that are near each other in the n-dimensional space.
The open questions are: which properties are used in the vector, which
distance metric is used to decide whether two points are close to each other, and
which algorithm is used in clustering. The SOM does both vector quantization and
clustering at the same time. Moreover, it produces a topologically ordered result.
Word Encoding Handling the computerized form of written language rests on
the processing of discrete symbols. One useful numerical representation of written
text can be obtained by taking into account the sentential context in which
the words occur. Prior to the utilization of the context information, however, the
numerical code should not imply any order on the words. Therefore, it
will be necessary to use uncorrelated vectors for encoding. The simplest method
to introduce uncorrelated codes is to assign a unit vector for each word. When
all different word forms in the input material are listed, a code vector can be
defined to have as many components as there are word forms in the list. As an
example related to Table 1 shown earlier, the color symbols of Table 2 are here
replaced by binary numbers that encode them. One vector element (column in
the table) corresponds to one unique color symbol.
Table 2. A simple example of input data for the SOM algorithm in order to obtain a
map of symbols. The first three columns correspond to the red-green-blue values and the
rest of the columns code the color symbols as binary values.
0.250 0.235 0.215   1 0 0 0 0 0 ...
0.165 0.042 0.042   0 1 0 0 0 0 ...
0.222 0.184 0.135   0 0 1 0 0 0 ...
0.210 0.105 0.030   0 0 0 1 0 0 ...
0.255 0.127 0.080   0 0 0 0 1 0 ...
0.184 0.134 0.011   0 0 0 0 0 1 ...
...
The component of the vector where the index corresponds to the order of
the word in the list is set to the value “1”, whereas the rest of the components
are “0”. This method, however, is only practicable in small experiments. With
a vocabulary picked from an even reasonably large corpus, the dimensionality of
the vectors would become intolerably high. If the vocabulary is large, the word
forms can be encoded by quasi-orthogonal random vectors of a much smaller
dimensionality [47]. Such random vectors can still be considered to be sufficiently
dissimilar mutually and not to convey any information about the meaning of the
words. Mathematical analysis of the dimensionality reduction and the random
encoding of the word vectors is presented in [47] and [21]. The random encoding
can also be motivated from the linguistic point of view. The appearance of a word
does not usually correlate with its meaning. However, it may be interesting to
consider an encoding scheme in which the form of the words is taken into account
to some extent. Motivation for such an experiment may stem, for instance, from
the attempt to model aphasic phenomena.
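The two encoding schemes discussed above, unit (one-hot) vectors and quasi-orthogonal random vectors of much smaller dimensionality, can be sketched as follows. The function names and the normalized-Gaussian construction are assumptions of this sketch; see [47] and [21] for the mathematical analysis.

```python
import random

def one_hot_codes(vocab):
    """Unit-vector (one-hot) codes: dimensionality equals vocabulary size."""
    n = len(vocab)
    return {w: [1.0 if i == j else 0.0 for j in range(n)]
            for i, w in enumerate(vocab)}

def random_codes(vocab, dim, seed=0):
    """Quasi-orthogonal random codes of a much smaller dimensionality:
    normalized random vectors in high dimension are nearly orthogonal."""
    rng = random.Random(seed)
    codes = {}
    for w in vocab:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = sum(x * x for x in v) ** 0.5
        codes[w] = [x / norm for x in v]
    return codes

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```

For a 50,000-word vocabulary, one-hot codes would need 50,000 components, while random codes of a few hundred dimensions remain mutually near-orthogonal and convey no information about word meaning.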
Map Creation and Implicit Categories The basic steps for creating maps
of words are given in the following.
1. A unique random vector is created for each word form in the vocabulary.
2. All the instances of the word under consideration, so-called key words, are
found in the text collection. The average over the contexts of each key word
is calculated. The random codes formed in step 1 are used in the calculation.
The context may consist of, e.g., the preceding and the succeeding word, or
some other window over the context. As a result each key word is associated
with a contextual fingerprint.
3. Each vector formed in step 2 is input to the SOM. The resulting map is
labeled after the training process by inputting the input vectors once again
and by naming the best-matching neurons according to the key word part
of the input vector.
The averaging process is well motivated when the computational point of
view is considered. The number of training samples is reduced considerably in
the averaging.
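Steps 1–3 above can be sketched as follows, assuming the random codes from step 1 are given. The concatenation of averaged preceding and succeeding context parts is one plausible reading of the windowing described in step 2; all names here are illustrative.

```python
def context_fingerprints(tokens, codes, window=1):
    """Average context vectors for each word (sketch of steps 1-3).
    For every occurrence of a word, the codes of the `window` preceding
    and succeeding words are summed separately, concatenated, and then
    averaged over occurrences to give a contextual fingerprint."""
    dim = len(next(iter(codes.values())))
    sums, counts = {}, {}
    for i, w in enumerate(tokens):
        left, right = [0.0] * dim, [0.0] * dim
        for k in range(1, window + 1):
            if i - k >= 0:
                left = [a + b for a, b in zip(left, codes[tokens[i - k]])]
            if i + k < len(tokens):
                right = [a + b for a, b in zip(right, codes[tokens[i + k]])]
        ctx = left + right  # concatenated context parts
        acc = sums.setdefault(w, [0.0] * len(ctx))
        sums[w] = [a + b for a, b in zip(acc, ctx)]
        counts[w] = counts.get(w, 0) + 1
    return {w: [x / counts[w] for x in v] for w, v in sums.items()}
```

The resulting fingerprints would then be input to the SOM (step 3), and the trained map labeled by finding the best-matching neuron for each key word.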
The areas on a map of words can be considered as implicit categories or classes
that have emerged during the learning process. Consider, for instance, Figure 2
in which some syntactic classes have emerged on a map of words. The overall
organization of this map of words reflects syntactic categories. The context
of the analysis consisted of the immediately neighboring words. Single nodes
can be considered to serve as adaptive prototypes. Each prototype is involved
in the adaptation process in which the neighbors influence each other and the
map is gradually finding a form in which it can best represent the input. Local
organization of a map of words seems to follow semantic features. Often a node
becomes labeled by several symbols that are synonyms, antonyms or otherwise
belong to a closed class (see, e.g., [6,19]). Often the class borders can be detected
by analyzing the distances between the prototype vectors in the original input
space [51,52].
The prototype theory of concepts holds that concepts have a prototype
structure and that there is no delimiting set of necessary and sufficient conditions
for determining category membership; membership can also be fuzzy. Instances of a
concept can be ranked in terms of their typicality. Membership in a category is
determined by the similarity of an object’s attributes to the category’s prototype.
The development of prototype theory is based on the works by, e.g., Rosch [48]
and Lakoff [29]. MacWhinney [32] discusses the merits and problems of the
prototype theory. He mentions that prototype theory fails to place sufficient
emphasis on the relations between concepts. MacWhinney also points out that
prototype theory has not covered the issue of how concepts develop over time in
language acquisition and language change, and, moreover, it does not provide a
theory of representation. MacWhinney’s competition model has been designed
to overcome these deficits. Recently, MacWhinney has presented a model of
emergence in language based on the SOM [33]. His work is closely related to
the adaptive prototypes of the maps of words. In MacWhinney’s experiments
the SOM is used to encode auditory and semantic information about words.
Self-Organizing Maps in Symbol Processing
355

[Figure 2 appears here: a map of words with regions labeled PAST PARTICIPLE
AND PAST TENSE VERBS, MODAL VERBS, PRESENT TENSE VERBS, ADVERBS, PRONOUNS,
PREPOSITIONS, ETC., INANIMATE NOUNS, PERSONAL PRONOUNS, and ANIMATE NOUNS;
individual words shown include "am", "be", "been", "was", "is", "when", "now",
"what", "where", and "how".]
Fig. 2. Emergent implicit classes on a map of words. The input consisted of the English
translations of fairy tales collected by the Grimm brothers. The 150 most frequent
words were mapped. The area in the middle consists of adverbs, pronouns, and
prepositions, in only partial order. Some of the individual words have been
shown. Detailed results are presented in [13].
Gärdenfors’ recent work (e.g., [9,10]) is also very closely related to the issue
of adaptive prototypes. Mitra and Pal have studied the relationship between
fuzziness, self-organization and inferencing [42] and the use of the SOM as a
fuzzy classifier [41].
Handling Ambiguity The main disadvantage of the averaging process described earlier seems to be that information related to the varying use of single
words is lost. However, it is entirely possible to use the SOM to cluster the contexts of a word to obtain information about the potential ambiguity of a word.
Such a study has been conducted in [45]. Gallant [8] has presented a disambiguation method based on neural networks. In a study with similar objectives,
[50] used co-occurrence information to create lexical spaces. The dimensionality
reduction was based on singular value decomposition. An automatic method for
word sense disambiguation was developed: a training set of contexts is clustered,
each cluster is assigned a sense, and the sense of the closest cluster is assigned
to the new occurrences. Schütze used two clustering methods to determine the
sense clusters.
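The disambiguation procedure described above can be sketched as follows. This toy version represents contexts as bag-of-words vectors and uses plain two-means clustering; the singular value decomposition step of [50] is omitted for brevity, and the vocabulary, contexts, and sense labels are invented:

```python
# Toy training contexts for the ambiguous word "bank", represented as
# bag-of-words vectors over an invented mini-vocabulary.
vocab = ["money", "loan", "river", "water", "account", "shore"]
contexts = [
    "money loan account", "loan account money", "money account",
    "river water shore", "water shore river", "river shore",
]

def vec(text):
    words = text.split()
    return [words.count(w) for w in vocab]

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

X = [vec(c) for c in contexts]

# Cluster the training contexts with two-means.
centroids = [X[0][:], X[3][:]]
for _ in range(5):
    groups = [[], []]
    for x in X:
        nearest = 0 if dist2(x, centroids[0]) <= dist2(x, centroids[1]) else 1
        groups[nearest].append(x)
    centroids = [[sum(col) / len(g) for col in zip(*g)] for g in groups]

# Each cluster is assigned a sense; a new occurrence receives the sense
# of the closest cluster centroid.
senses = {0: "bank/FINANCE", 1: "bank/RIVER"}
new_context = vec("water near the shore")
nearest = 0 if dist2(new_context, centroids[0]) <= dist2(new_context, centroids[1]) else 1
print(senses[nearest])  # -> bank/RIVER
```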
Related Work Miikkulainen has used the SOM extensively in a model of
story comprehension. The SOM is used to perform a conceptual analysis of the words
356
T. Honkela
appearing in the phrases [37,38,40]. A model of aphasia based on the SOM is
presented in [39]. The model consists of two main parts: the maps for the lexical
symbols in the different input and output modalities, and the map for the lexical
semantics. The SOM appears to be an appealing model for aphasia. Consider, for
instance, a situation in which the word “lion” is used instead of “tiger”, i.e.,
a case of neighboring items in a semantic map. On the other hand, the use of
“sing” instead of “sink” corresponds to a small error in a phonetic map.
Scholtes has applied the SOM to parsing and to several other natural language
processing tasks, such as filtering in information retrieval [49]. Several authors
have presented methods for creating maps of documents rather than maps of
words, e.g., [20,30,36]. The basic idea is to provide a visual display for exploration
of a text database. On the display, two documents appear close to each other if
they are similar in content. In [14,15], a map of words is used as a filtering
preprocessor of the documents to be mapped.
3.2
Modeling Gradience and Non-symbolic Representations
The world is continuous and changing, and thus language is a medium of
abstraction rather than a tool to create an exact “picture” of selected portions
of the world. In the abstraction process, the relationship between a language
and the world is one-to-many in the sense that a single word or expression in
language is most often used to refer to a set or to a continuum of situations in the
world. To model the relationship between language and the world,
the mathematical apparatus of predicate logic, for instance, does not seem to
provide enough representational power. One way of enhancing the representation
is to take into account the unclear boundaries between different concepts. Many
names have been used to refer to this phenomenon such as gradience, fuzziness,
impreciseness, vagueness, or fluidity of concepts.
The possibility of abandoning the predetermined discrete, symbolic features
is worth consideration. However, a remark on the notion of ’symbol’ may be
necessary: the basic idea is to consider the possibility of grounding the symbols
based on the unsupervised learning scheme. Symbols are used at the level of
communication and may serve as labels for the usually continuous, multidimensional conceptual spaces. Symbols also serve as a means of compressing
information.
Figure 3c outlines a scheme for “breaking up” the symbols. If suitable “raw
data” are used, it may not be necessary to use any intermediate levels in facilitating interpretation. For example, de Sa [5] proposes a model of learning in
a cross-modal environment based on the SOM. The SOM is used to associate
input from different modalities (pictures, language). De Sa’s practical experiments use, however, input for which the interpretation of the features is given
beforehand. Nenov and Dyer [43,44] present an ambitious model and experiments
on perceptually grounded language learning which is based on creating associations between linguistic expressions and visual images. Hyötyniemi [16] discusses
semantic considerations when dealing with mental models, presenting three
levels: features based on raw observations, patterns, and categories.
A mathematical framework for continuous formal systems is given in [31]; it
can be based on, e.g., Gabor filters. Traditionally, the filters have to be designed
manually. To facilitate automatic extraction of features, the ASSOM method
[23,26] could be used. For instance, the ASSOM is able to learn Gabor-like filters
automatically from the input image data. It remains to be seen, though, what
kinds of practical results can be acquired by aiming at still further autonomy in
the processing, e.g., by combining uninterpreted speech and image input.
[Figure 3 appears here: (a) a one-dimensional curve giving the degree of
membership for "tallness" as a function of height; (b) a multidimensional
version in which the degree of membership depends on height, age, sex, time
in history, etc.; (c) a scheme in which the specification of membership is
given over a non-symbolic feature space with reduced dimensionality, obtained
by a nonlinear mapping based on adaptation from the original multidimensional
space.]
Fig. 3. Three stages of modeling continuity in the relation between linguistic expressions and the referenced continuous phenomena. In (a) a traditional view on fuzzy
set theory is provided: the fuzziness of a single feature is represented by a curve in
a one-dimensional space. The second alternative (b) points out the need to consider
multidimensional cases: the degree of membership related to “tallness” of a person is
not only based on the size of the person. The third scheme (c) is based on processing
of continuous, uninterpreted “raw data” [15].
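Panels (a) and (b) of Figure 3 can be illustrated with a small sketch; the sigmoid form of the membership curve and all numeric parameters (midpoint, slope, the age-dependent reference height) are invented for illustration:

```python
import math

def tallness_1d(height_cm):
    """Figure 3(a)-style one-dimensional membership curve: a smooth ramp
    from 0 to 1 (the 170 cm midpoint and the slope are invented)."""
    return 1.0 / (1.0 + math.exp(-(height_cm - 170.0) / 5.0))

def tallness_multi(height_cm, age_years):
    """Figure 3(b)-style multidimensional membership: the same height
    counts as 'taller' for a child than for an adult (the age-dependent
    reference height is again invented)."""
    reference = 170.0 if age_years >= 18 else 110.0 + 4.0 * age_years
    return 1.0 / (1.0 + math.exp(-(height_cm - reference) / 5.0))

print(round(tallness_1d(180.0), 2))           # clearly "tall" in one dimension
print(round(tallness_multi(150.0, 7.0), 2))   # 150 cm is tall for a 7-year-old
print(round(tallness_multi(150.0, 30.0), 2))  # but not for an adult
```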
3.3
Contextuality, Subjectivity and Intersubjectivity
A considerable number of person-years has been spent on coding knowledge
representations by means of traditional AI in terms of entities, rules, scripts, etc.
It seems, however, that the qualitative problems are not satisfactorily solved by
quantitative means. The world is changing all the time, and, perhaps still more
importantly, the symbolic descriptions are not grounded. Symbol grounding,
embodiment and their connectionist modeling is a central topic, e.g., in [53] and
[46]. The contextuality of interpretation is easily neglected, being, nevertheless,
a very commonplace phenomenon in natural language (see, e.g., [18]). The self-organizing map is a suitable method for contextual modeling: instead of handling
symbols or variables separately the SOM can be used in associating them with
the relevant context, or even in evaluating the relevancy of the context.
It seems that the SOM can be used in modeling the individual use of language: to create maps of subjective use of language based on examples, and,
furthermore, to model intersubjectivity, i.e., to have a map that also models the
contents of other maps (see, e.g., [12]). Two persons may have a different conceptual or terminological “density” with respect to the topic under consideration. A layman, for
instance, is likely to describe a phenomenon in general terms whereas an expert
uses more specific terms. However, in communication individuals tune in to each
other’s language use. De Boer has simulated the emergence of realistic vowel systems in a population of agents that try to imitate each other as well as possible.
The agents start with no knowledge of the sound system at all. Through communication a coherent vowel system emerges [2]. A similarly thorough experiment
using the SOM for conceptual emergence in an agent community remains
to be conducted.
Honkela [12] proposes a model that adds a third vector element in addition
to the symbol part and the context part: the specification of the identity of the utterer of
the expressions. Thus, the network would be selective according to the “listener”.
This kind of enhanced map would provide a means for selecting the terms that
can be used in communication. The detailed map could, for instance, include
many expressions for different versions of a certain color among which a general
term or a specific term would become selected based on the listener. This kind
of use of the SOM provides a model of subjectivity and intersubjectivity: strictly
speaking every “SOM-agent” has a symbol mapping of its own but the mappings
are adapted through interactions to become similar enough to enable meaningful
communication.
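A sketch of the three-part input vector described above, with invented codes and dimensionalities (the model in [12] is not specified at this level of detail here):

```python
import random

DIM = 3  # dimensionality of each part (illustrative)

def part(key):
    """A deterministic pseudo-random code for a symbol or an utterer."""
    rng = random.Random(key)
    return [rng.gauss(0, 1) for _ in range(DIM)]

def input_vector(symbol, context_words, utterer):
    """Three concatenated parts: the symbol code, the averaged code of the
    context words, and a code identifying the utterer of the expression."""
    symbol_part = part("sym:" + symbol)
    ctx = [part("sym:" + w) for w in context_words]
    context_part = [sum(v[d] for v in ctx) / len(ctx) for d in range(DIM)]
    utterer_part = part("who:" + utterer)
    return symbol_part + context_part + utterer_part

v1 = input_vector("crimson", ["dark", "red"], utterer="expert")
v2 = input_vector("crimson", ["dark", "red"], utterer="layman")

# Same symbol and context, different utterer: the vectors differ only in
# the utterer part, so a map trained on them can become selective with
# respect to who is speaking (or listening).
print(len(v1))           # -> 9
print(v1[:6] == v2[:6])  # -> True
```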
4
Conclusions
Natural language understanding as a field of artificial intelligence requires, for
instance, means to model the emergence of conceptual systems, intersubjectivity in
communication, the interpretation of expressions with contextual cues, and symbol
grounding in perceptual spaces. If such means are available, the methods can
be considered to be relevant also in cognitive science and epistemology. In this
paper, a critical view of the traditional means of modeling natural language understanding and symbol processing has been given. In particular, “purely symbolic
methods”, e.g., systems based on predicate logic or semantic networks, appear
insufficient to tackle the phenomena mentioned above.
As an alternative, the self-organizing map (SOM) has been considered in this
paper. There are several results based on the SOM that cover relevant areas of
the overall phenomena, and thus the SOM seems to be a promising alternative
to the more traditional approaches. There are, of course, related methods that
can be used for similar purposes, but the SOM covers several of the modeling
requirements at the same time and, moreover, serves as a neurophysiologically
grounded cognitive model. However, the basic SOM as such is not well suited to
processing structured information, but one can combine the SOM with, e.g.,
recurrent networks to obtain better coverage of both structural and content-related
linguistic phenomena (see, e.g., [34]). Moreover, the SOM principle can be
generalized by adopting principles of evolutionary computation, which makes it
possible to find spatial order based, e.g., on the functional similarity of
potentially highly structured input samples [17,28].
Acknowledgements
This work is mainly based on the results gained during my three-year stay at
Academy Professor Teuvo Kohonen’s Neural Networks Research Center at Helsinki University of Technology. I wish to thank Professor Kohonen for his most
invaluable advice and guidance. I am also grateful to Dr. Samuel Kaski, Dr. Jari
Kangas, Ms Krista Lagus, Dr. Dieter Merkl, Prof. Erkki Oja, Mr. Ville Pulkki,
and many others.
References
1. Amari, S.-I.: A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, EC-16:299–307 (1967)
2. Boer, B. de: Investigating the Emergence of Speech Sounds. Proceedings of IJCAI’99, International Joint Conference on Artificial Intelligence. Dean, T. (ed.),
Morgan Kaufmann, Vol. I, pp. 364–369 (1999)
3. Carpenter, G. and Grossberg, S.: A massively parallel architecture for a selforganizing neural pattern recognition machine. Computer Vision, Graphics, and
Image Processing, 37:54–115 (1987)
4. Charniak, E.: Statistical Language Learning. MIT Press, Cambridge, MA (1993)
5. de Sa, V.: Unsupervised Classification Learning from Cross-Modal Environmental
Structure. PhD thesis, University of Rochester, Department of Computer Science,
Rochester, New York (1994)
6. Finch, S. and Chater, N.: Unsupervised methods for finding linguistic categories. I.
Aleksander and J. Taylor, eds., Artificial Neural Networks, vol. 2, pp. II-1365-1368,
North-Holland (1992)
7. Foerster, H. von: Notes on an epistemology for living things. Observing Systems,
Intersystems publications, pp. 258-271 (1981)
8. Gallant, S. I.: A practical approach for representing context and for performing
word sense disambiguation using neural networks. ACM SIGIR Forum, 3(3):293–
309 (1991)
9. Gärdenfors, P.: Mental representation, conceptual spaces and metaphors. Synthese,
106:21–47 (1996)
10. Gärdenfors, P.: Philosophy and Cognitive Science, chapter Conceptual spaces as a
framework for cognitive semantics, pp. 159–180. Kluwer, Dordrecht (1996)
11. Honkela, T. and Vepsäläinen, A.M.: Interpreting imprecise expressions: experiments with Kohonen’s Self-Organizing Maps and associative memory. Proceedings
of ICANN-91, International Conference on Artificial Neural Networks, T. Kohonen,
K. Mäkisara, O. Simula and J. Kangas (eds.), North-Holland, vol. I, pp. 897-902
(1991)
12. Honkela, T.: Neural Nets that Discuss: A General Model of Communication Based
on Self-Organizing Maps. Proceedings of ICANN-93, International Conference on
Artificial Neural Networks, Amsterdam, Gielen, S. and Kappen, B. (eds.), SpringerVerlag, London, pp. 408-411 (1993)
13. Honkela, T., Pulkki, V. and Kohonen, T.: Contextual relations of words in Grimm
tales analyzed by self-organizing map. Proceedings of ICANN-95, International Conference on Artificial Neural Networks, F. Fogelman-Soulié and P. Gallinari (eds),
vol. 2, EC2 et Cie, Paris, pp. 3-7 (1995)
14. Honkela, T., Kaski, S., Lagus, T., and Kohonen, T.: Newsgroup exploration with
WEBSOM method and browsing interface. Technical report A32, Helsinki University
of Technology, Laboratory of Computer and Information Science, Espoo, Finland
(1996)
15. Honkela, T.: Self-Organizing Maps in Natural Language Processing. PhD Thesis,
Helsinki University of Technology, Espoo, Finland (1997)
See http://www.cis.hut.fi/~tho/thesis/
16. Hyötyniemi, H.: On mental images and ’computational semantics’. Proceedings
of Finnish Artificial Intelligence Conference, Finnish Artificial Intelligence Society,
Espoo, pp. 199–208 (1998)
17. Nissinen, A.S., and Hyötyniemi, H.: Evolutionary Self-Organizing Map. Proceedings
of EUFIT’98: European Congress on Intelligent Techniques and Soft Computing, pp.
1596-1600 (1998)
18. Hörmann, H.: Meaning and Context. Plenum Press, New York (1986)
19. Kaski, S., Honkela, T., Lagus, K., and Kohonen, T.: Creating an order in digital
libraries with self-organizing maps. In Proceedings of WCNN’96, World Congress
on Neural Networks (1996)
20. Kaski, S., Honkela, T., Lagus, K., and Kohonen, T.: WEBSOM—Self-Organizing
Maps of Document Collections. Neurocomputing, 21:101-117 (1998)
21. Kaski, S.: Dimensionality reduction by random mapping: Fast similarity computation for clustering. Proceedings of IJCNN’98, International Joint Conference on
Neural Networks (1998)
22. Kohonen, T.: Self-organizing formation of topologically correct feature maps. Biological Cybernetics, 43(1):59–69 (1982)
23. Kohonen, T.: The Adaptive-Subspace SOM (ASSOM) and its use for the implementation of invariant feature detection. Fogelman-Soulié, F. and Gallinari, P., editors, Proceedings of ICANN’95, International Conference on Artificial Neural Networks, volume I, pp. 3–10, Nanterre, France. EC2 (1995)
24. Kohonen, T.: Self-Organizing Maps. Springer, Berlin, Heidelberg (1995)
25. Kohonen, T., Hynninen, J., Kangas, J., and Laaksonen, J.: SOM_PAK: The Self-Organizing Map program package. Report A31, Helsinki University of Technology,
Laboratory of Computer and Information Science (1996)
26. Kohonen, T., Kaski, S., Lappalainen, H., and Salojärvi, J.: The adaptivesubspace self-organizing map (ASSOM). Proceedings of WSOM’97, Workshop on
Self-Organizing Maps, Espoo, Finland, June 4-6, pp. 191–196. Helsinki University
of Technology, Neural Networks Research Centre, Espoo, Finland (1997)
27. Kohonen, T., and Hari, R.: Where the abstract feature maps of the brain might
come from. Trends in Neurosciences, 22(3):135–139 (1999)
28. Kohonen, T.: Fast Evolutionary Learning with Batch-Type Self-Organizing Maps.
Neural Processing Letters, Kluwer, 9:153–162 (1999)
29. Lakoff, G.: Women, Fire and Dangerous Things. University of Chicago Press, Chicago (1987)
30. Lin, X., Soergel, D., and Marchionini, G.: A self-organizing semantic map for information retrieval. Proceedings of 14th. Ann. International ACM/SIGIR Conference
on Research & Development in Information Retrieval, pp. 262–269 (1991)
31. MacLennan, B.: Continuous formal systems: A unifying model in language and cognition. Proceedings of the IEEE Workshop on Architectures for Semiotic Modeling
and Situation Analysis in Large Complex Systems, August 27-29, 1995, Monterey,
CA (1995)
32. MacWhinney, B.: Linguistic categorization, chapter Competition and Lexical Categorization. Benjamins, New York (1989)
33. MacWhinney, B.: Cognitive approaches to language learning, chapter Lexical
Connectionism. MIT Press (1997)
34. Mayberry, M.R., and Miikkulainen, R.: SardSrn: A Neural Network Shift-Reduce
Parser. Proceedings of IJCAI’99, International Joint Conference on Artificial Intelligence. Dean, T. (ed.), Morgan Kaufmann, Vol. II, pp. 820–825 (1999)
35. McCawley, J.D.: Everything that Linguists have always Wanted to Know about
Logic but were Ashamed to Ask. Basil Blackwell, London (1981)
36. Merkl, D.: Self-Organization of Software Libraries: An Artificial Neural Network
Approach. PhD thesis, Institut für Angewandte Informatik und Informationssysteme, Universität Wien (1994)
37. Miikkulainen, R.: DISCERN: A Distributed Artificial Neural Network Model of
Script Processing and Memory. PhD thesis, Computer Science Department, University of California, Los Angeles, Tech. Rep UCLA-AI-90-05 (1990)
38. Miikkulainen, R.: Subsymbolic Natural Language Processing: An Integrated Model
of Scripts, Lexicon, and Memory. MIT Press, Cambridge, MA (1993)
39. Miikkulainen, R.: Self-organizing feature map model of the lexicon. Brain and
Language, 59:334–366 (1997)
40. Miikkulainen, R. and Dyer, M. G.: Natural language processing with modular
neural networks and distributed lexicon. Cognitive Science, 15:343–399 (1991)
41. Mitra, S. and Pal, S.: Self-organizing neural network as a fuzzy classifier. IEEE
Transactions on Systems, Man and Cybernetics, 24(3):385–99 (1994)
42. Mitra, S. and Pal, S.: Fuzzy self-organization, inferencing, and rule generation.
IEEE Transactions on Systems, Man & Cybernetics, Part A [Systems & Humans],
26(5):608–20 (1996)
43. Nenov, V. I. and Dyer, M. G.: Perceptually grounded language learning: Part 1 A neural network architecture for robust sequence association. Connection Science,
5(2):115–138 (1993)
44. Nenov, V. I. and Dyer, M. G.: Perceptually grounded language learning: Part 2 DETE: a neural/procedural model. Connection Science, 6(1):3–41 (1994)
45. Pulkki, V.: Data averaging inside categories with the self-organizing map. Report
A27, Helsinki University of Technology, Laboratory of Computer and Information
Science, Espoo, Finland (1995)
46. Regier, T.: A model of the human capacity for categorizing spatial relations. Cognitive Linguistics, 6(1):63–88 (1995)
47. Ritter, H. and Kohonen, T.: Self-organizing semantic maps. Biological Cybernetics,
61(4):241–254 (1989)
48. Rosch, E.: Studies in cross-cultural psychology, vol. 1, chapter Human categorization, pp. 3–49. Academic Press, New York (1977)
49. Scholtes, J. C.: Neural Networks in Natural Language Processing and Information
Retrieval. PhD thesis, Universiteit van Amsterdam, Amsterdam, Netherlands (1993)
50. Schütze, H.: Dimensions of meaning. Proceedings of Supercomputing, pp. 787–796
(1992)
51. Ultsch, A.: Self-organizing neural networks for visualization and classification.
Opitz, O., Lausen, B., and Klar, R., editors, Information and Classification, pp.
307–313, London, UK. Springer (1993)
52. Ultsch, A. and Siemon, H.: Kohonen’s self organizing feature maps for exploratory
data analysis. Proceedings of INNC’90, International Neural Network Conference,
pp. 305–308, Dordrecht, Netherlands. Kluwer (1990)
53. Varela, F. J., Thompson, E., and Rosch, E.: The Embodied Mind: Cognitive Science
and Human Experience. MIT Press, Cambridge, Massachusetts (1993)
54. von der Malsburg, C.: Self-organization of orientation sensitive cells in the striate
cortex. Kybernetik, 14:85–100 (1973)
55. Wermter, S., Riloff, E., and Scheler, G.: Connectionist, Statistical and Symbolic
Approaches to Learning for Natural Language Processing. Springer Verlag, New York
(1996)
Evolution of Symbolisation:
Signposts to a Bridge between Connectionist
and Symbolic Systems
Ronan G. Reilly
Department of Computer Science
University College Dublin
Belfield
Dublin 4
Ireland
Ronan.Reilly@ucd.ie
http://www.cs.ucd.ie/staff/rreilly
Abstract. This paper describes recent attempts to understand the evolution of language in humans and argues that useful lessons can be learned from this analysis by designers of hybrid symbolic/connectionist systems. A specification is sketched out for a biologically grounded hybrid
system motivated by our understanding of both the evolution and development of symbolisation in humans.
1
Introduction
If one does not subscribe to the theory of a serendipitous emergence of language in our ancestors several hundred thousand years ago, one is faced with a
conundrum. If our language capacity has been subject to the forces of evolution,
what were its intermediate stages, what foundation was it constructed upon?
An answer to this question would help us understand how language is achieved
in contemporary brains, and go some way to helping us build more effective
natural language processing (NLP) systems. In particular it would help us overcome what I believe are the significant obstacles confronting pure connectionist
approaches to NLP, and lead to better motivated hybrid designs.
The question of intermediate stages in the emergence of language has recently
been addressed by Terrence Deacon [4] in his book The Symbolic Species. Deacon argues for the existence of an evolutionary path in the development of the
relationship between sign and signified that proceeds from an iconic relationship,
through a process of temporal and spatial indexicalisation (i.e., association), to
an arbitrary and culturally licensed symbolic relationship. The purpose of this
paper is to highlight possible lessons to be learned from the evolution of symbolisation in early hominids, as described by Deacon, that might support our
efforts to improve the symbolic capabilities of connectionist networks.
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 363–371, 2000.
c Springer-Verlag Berlin Heidelberg 2000
364
R.G. Reilly
2
An Evolutionary Perspective on Symbolisation
Like C.S. Peirce, Deacon distinguishes between three categories of sign:
iconic, indexical, and symbolic. Iconic signs have some physical similarity with
what they signify, indexical signs are related to their referent either spatially or
temporally, and symbolic signs have an arbitrary relationship with their referent. In keeping with an evolutionary perspective, Deacon makes the case that
symbol systems emerged in human species through a process that moved from
iconic through indexical to symbolic sign usage. Each succeeding level of sign
usage subsumed the preceding one. Thus indexical signs are built upon spatio–
temporal relationships between icons, and symbols are constructed upon relationships between indices and, most importantly, on relationships between other
symbols. So, for example, a child learning to read first encounters printed words
as iconic of print in general, in much the same way as someone who doesn’t
read Chinese might look upon a book of Chinese characters. Each character is
equally iconic (trivially) of written language. As the child learns the writing system she proceeds to a stage where the written signs index the spoken language
and her perceptual environment in systematic ways. Finally, the relationships
among these indices allow her to access the symbolic aspects of the words. However, the written words acquire their symbolic status not simply by standing
for something in an associationistic sense, but from the mesh of inter-symbolic
relationships in which they are embedded. This contrasts with the naive view of
symbolisation that sees the acquisition of indexical reference as the essence of
symbolisation. According to Deacon, this is a necessary stage in the development
of symbolisation, but by no means the full story.
One of Deacon’s central points is that there is an isomorphism between the
stages leading to symbol development and the transition from perception (of
icons), through associationistic learning (of indices), through cognising (of symbols). Only humans have reached the final stage, though Deacon argues that
the bonobo chimpanzee, Kanzi [19], has attained a level of sophistication in the
symbolic domain that poses a major challenge to language nativists. Overall,
Deacon’s thesis is an alternative to the nativists’ arguments for the origins of
language [1][12]. According to Deacon, symbolic cognition, of which language is
just one manifestation, emerged as a result of an evolutionary dynamic between
the selectional advantages provided by simple symbol usage and pre–frontal cortical changes that favoured and supported symbol usage.
Deacon supports his case by looking at the comparative neural anatomy
of apes and humans. He demonstrates that, contrary to conventional wisdom,
overall brain size is not the key inter–species difference. Rather, it is the disproportionate growth of the pre–frontal cortex relative to other cortical areas. If we
were to scale up a modern ape’s brain size to the correct proportions for a human
body, the pre–frontal region in humans would be twice that of the scaled–up ape.
More importantly, this scaling–up is not just of cortical mass, but of connectivity
and the capacity to influence other cortical regions.
An example of the role of the pre–frontal cortex in a natural primate environment would be its use in foraging behaviour. In foraging, an effective strategy
is not to return to the locations that one has most recently visited, since they
are least likely to provide a food reward. Thus, one needs to be able to suppress
the more basic drive to return to where one has previously had one’s behaviour
reinforced.
In humans pre–frontal enlargement biases us to a specific style of learning.
This involves, among other things, the ability to detach from the immediate
perceptual demands of a task, to switch between alternative courses of action,
to delay an immediate response to a stimulus, to tune into higher–order features
of the stimulus environment, and so on. Deacon argues that these capabilities are
essential pre–requisites for a facility with symbols. They provide the necessary
attentional and mnemonic resources for the task of symbol learning. Damage
to the frontal areas of the cortex in humans gives rise to subtle impairments
in tasks involving so–called executive function. The deficits appear to lie in an
inability to start, stop or modulate an action. In neuropsychology, this can often
be observed in tests of fluency or card sorting. Patients with pre–frontal damage,
for example, have difficulty in performing tasks involving the random generation
of words of a given semantic category.
Studies involving the teaching of language skills to chimpanzees [19], have
shown that they could with the aid of external symbolic devices (e.g., boards of
lexigrams) improve their symbol acquisition and manipulation skills considerably
beyond those of non–symbol using chimpanzees. What appears to have occurred
in human evolution, and specifically as a result of the enlargement of the pre–
frontal area is the internalisation of the functional equivalents of these external
mnemonic aids.
It is a mistake, however, to view the pre–frontal area as a brain region that
assumes its functional role at the same time as all other regions. Indeed, there is
a rather specific pattern to the way in which all of the cortical regions develop. In
the next section I will argue that the pattern of maturation plays an important
role in the unfolding of the functional capabilities of the pre–frontal and other
brain regions.
3
Maturational Wave
One of the paradoxes of artificial neural network research is that the capabilities of artificial neural networks fall far short of those of the real thing, yet
the learning algorithm(s) employed by real neural networks may well be considerably simpler and less powerful than, say, error backpropagation [17]. The
evidence to date suggests that some variant of Hebb’s rule, possibly mediated
by NMDA–based long term potentiation, may be the dominant learning rule in
natural nervous systems [3]. The problem is that Hebb’s rule cannot be used to
learn even a trivial higher–order function such as XOR, at least not directly. So
this presents the evolutionary account of the emergence of symbolisation with a
problem. If the symbolic system is indeed built upon layers of indexical and iconic relationships, we don’t have, prima facie, a biological learning algorithm that
is up to the job of detecting the type of second–order features of the environment
necessary for symbol use. Not only do we have an impoverishment of stimulus
if the nativist position is to be believed, but we also have an impoverishment of
learning mechanism.
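The XOR limitation can be made concrete with a small sketch. With bipolar coding, the plain Hebbian update accumulates the correlation between each input and the target, and for XOR these correlations cancel exactly, so the rule extracts no usable structure:

```python
# Bipolar XOR: inputs in {-1, +1}^2, target = XOR recoded to {-1, +1}.
patterns = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), -1)]

eta = 0.1
w = [0.0, 0.0]
for _ in range(100):                 # any number of epochs gives the same result
    for x, y in patterns:
        for i in range(2):
            w[i] += eta * x[i] * y   # plain Hebb: no error term

# Each input is uncorrelated with the XOR target, so the updates cancel.
print(w)  # -> [0.0, 0.0]
```

A correlational rule can only pick up first-order structure; XOR has none, which is exactly the higher-order problem the text describes.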
Shrager and Johnson [20], however, demonstrated in an elegant computational
study that the inclusion of a wave of learning plasticity passing through
a model cortex permitted the acquisition of higher–order functions using only
simple Hebbian learning. There is good evidence that just such a modulation of
plasticity occurs during cortical development [21]. An important feature of the
process is that it affects the sensory–motor regions of the cortex initially and
then moves through to regions more distal from sensory–motor areas. In Shrager
and Johnson’s model, this gives rise to the emergence of higher–order feature
detectors in these distal areas that use outputs of low–order feature detectors in
the sensory areas. There are obvious implications in this process for the role of
the pre–frontal cortex in the acquisition of symbolisation. By virtue of its remove
from sensory regions — it receives no direct sensory inputs — and the fact that it
is one of the last regions to mature, the pre–frontal area can be considered to be
a configurable resource with the capacity to be sensitive to higher–order features
of input from other cortical regions. This makes it an obvious candidate for mediating the acquisition of symbolic relationships. In a sense, therefore, there may
be nothing especially “symbolic” about the capabilities of the pre–frontal cortex; it was a relatively uncommitted computational resource that got exploited
by evolution. Moreover, other factors, in addition to its uncommittedness, may
also have ensured that it took on a central role in the emergence of symbolisation.
Evidence for this comes from research by Patricia Greenfield [9] on the evolution
of Broca’s area, which suggests that its prior computational function provided
a useful building block in its subsequent evolution as a centre for handling language syntax (I will explore this idea in more detail in a later section). It is my
contention that something similar may also have happened with respect to the
pre–frontal area, and a good candidate for a re–usable computational resource
in the pre–frontal cortex was the foraging behaviour of our primate ancestors.
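A minimal caricature of such a travelling plasticity window (my own sketch, not Shrager and Johnson's actual model) treats the cortex as a line of regions whose Hebbian learning rate is gated by a Gaussian bump that drifts from the sensory end (r = 0) toward more distal, pre-frontal-like regions; the region index, speed, and width below are all invented for illustration:

```python
import numpy as np

R, T = 10, 50          # number of cortical "regions", number of time steps

def plasticity(r, t, speed=0.2, width=1.5):
    """Learning-rate gate for region r at time t: a Gaussian bump
    whose centre drifts away from the sensory end at a fixed speed."""
    centre = speed * t
    return np.exp(-(r - centre) ** 2 / (2 * width ** 2))

# A gated Hebbian update would then read dw = plasticity(r, t) * x * y,
# so distal regions only become plastic after proximal ones have learned.
peak_time = [max(range(T), key=lambda t: plasticity(r, t)) for r in range(R)]
print(peak_time)   # increasing: plasticity peaks later in more distal regions
```

The point of the gating is exactly the one in the text: by the time the distal regions become plastic, the proximal regions have stabilised low-order feature detectors whose outputs the distal regions can then learn from, even with a purely Hebbian rule.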
4
Language Development
Rebotier and Elman [14] have argued that there are suggestive parallels between
the results of Shrager and Johnson’s model [20] and the language learning modelling of Elman and his colleagues [5]. Elman has demonstrated an interaction
between developing cognitive capacity and the acquisition of rule–like behaviour
of increasing complexity. In particular, he showed that a simple recurrent network (SRN) could not learn a complex context–free grammar without an initial
limitation in its memory capacity. Elman’s experiments involved training an
artificial neural network to learn a small artificial language with many of the
characteristics of real English. The sample sentences he used in training comprised only grammatical sentences. When he first tried to train a network to
learn this language it failed to do so. However, he discovered that if he limited
the network’s memory capacity early in training and then gradually increased
it, the network could ultimately master the grammar. This gradual increase in
capacity is analogous to what happens to a child’s cognitive abilities as he or
Evolution of Symbolisation
367
she develops. Elman’s network succeeded in learning the complex grammar by
first acquiring a simpler version of it. What is particularly interesting about this
finding is that, counter-intuitively, a capacity limitation early in development
turns out to be an advantage.
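Elman implemented the memory limitation by periodically wiping the recurrent context layer, with the interval between wipes growing over training (e.g. every 3, then 4, then 5 words, then never). The sketch below (illustrative only; the weights are random and untrained) shows the mechanism itself: with a reset interval of 3, the hidden state at step 3 is independent of everything that came before, whereas without resets it carries the full history:

```python
import numpy as np

rng = np.random.default_rng(1)
nx, nh = 4, 6
Wxh = rng.normal(0, 0.5, (nh, nx))   # input-to-hidden weights
Whh = rng.normal(0, 0.5, (nh, nh))   # context (hidden-to-hidden) weights

def srn_states(inputs, reset_every=None):
    """Forward pass of a simple recurrent network whose context layer
    is zeroed every `reset_every` steps -- the limited-memory regime."""
    h = np.zeros(nh)
    states = []
    for t, x in enumerate(inputs):
        if reset_every and t > 0 and t % reset_every == 0:
            h = np.zeros(nh)                 # wipe the context: forget
        h = np.tanh(Wxh @ x + Whh @ h)
        states.append(h.copy())
    return states

# Two input sequences that agree from t = 3 onwards but differ before that
seq_a = [rng.normal(size=nx) for _ in range(6)]
seq_b = [rng.normal(size=nx) for _ in range(3)] + seq_a[3:]

limited = np.allclose(srn_states(seq_a, 3)[3], srn_states(seq_b, 3)[3])
unlimited = np.allclose(srn_states(seq_a)[3], srn_states(seq_b)[3])
print(limited, unlimited)   # True False: resets erase the differing history
```

A training schedule in this spirit would run successive phases with `reset_every = 3, 4, 5, None`, so the network first masters short-range dependencies before longer ones even become visible to it.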
Elman’s finding has, I believe, significant implications for the nativist position on language acquisition. Given that a child learning a language is exposed
to predominantly grammatical utterances, Chomsky [1] and others have argued
that some innate knowledge of language is needed to circumvent the obstacle
to learning identified by Gold [7]. The latter proved that grammars of the complexity of natural language are unlearnable from positive examples alone. With
Elman’s finding we can discern the shape of a response to Chomsky and Gold’s
position. If we take account of the fact that language is acquired by a child
with certain cognitive capacity limitations, these limitations act as a filter on
the complexity of the language, transforming it into one that is learnable from
positive instances alone. The incremental expansion of these capacities allows
additional features of the language to be constructed on a simpler foundation.
There appears to be some similarity between the arguments marshalled by
those in favour of hybrid approaches to NLP, and those of the linguistic nativists.
Elman’s results suggest that employing the principle of starting small and paying
attention to developmental issues may be one way to push further the capabilities of the connectionist component, and postpone the introduction of a hybrid
mechanism to a stage where it is really needed.
5
Symbol Emergence
Further support for Elman’s position comes from a study of language learning
in chimpanzees [19]. Kanzi, a bonobo chimpanzee, acquired the ability to use
a limited artificial language involving a board of lexigrams while being cared
for by his mother. One of the more interesting features of this case is that
Kanzi was not explicitly taught how to use the lexigrams himself, but acquired
his ability incidentally while his mother was going through a training regime
that ultimately proved ineffective for her. This appears to be analogous to the
phenomenon described by Elman [5], where a complex grammar could only be
acquired by a connectionist network by “filtering” it through an initially limited
memory capacity, akin to the capacity limitations we find with children. What
is important from an evolutionary point of view, however, is not the means
by which Kanzi acquired a significant facility with symbols, but that he was
able to do so at all. This suggests that somewhere back in hominid evolution
conditions prevailed that facilitated and favoured the expression of an incipient
symbol ability. This then set in train a process of dynamical co-evolution between
brain structure and cognitive abilities that led to the emergence of the complex
cognitive and linguistic skills that we manifest as a species.
The evolutionary jump to symbolisation from an association-based indexical
system, argued for by Deacon, requires an ability to discern certain key relationships between symbols. Although not referred to by Deacon [4] as such, these
relationships are primarily those of compositionality and systematicity. These
are the very features identified by Fodor and Pylyshyn [6] as notably absent
from connectionist approaches to language and cognition. While the compositionality criticism has been positively addressed to most people’s satisfaction by
Van Gelder [22], systematicity still remains a keenly debated issue [10]. In the
case of connectionist NLP models, it boils down to their ability (or lack of it)
to generalise their behaviour to new words in new positions in test sentences.
Notwithstanding some positive indications [2], current connectionist models do
not as yet demonstrate the strong systematicity characteristic of a human language user. There are hybrid symbolic/connectionist solutions to the systematicity
problem [11], but a more satisfactory solution would be a biologically grounded
hybrid system motivated by our understanding of the evolution and development
of symbolisation in humans. As yet, nothing of this sort exists, and all one can
do at present is sketch out a specification for such a system. In the next section,
I will summarise its main features.
6
Evolutionary and Developmental Re-Use
As was discussed earlier, Greenfield [9] proposed that both language production
and high–level motor programming are initially subserved by the same cortical
region, which subsequently differentiates and specialises. This homology arises
from the exploitation, during the evolution of the human language capacity, of the
motor–planning capabilities of what is now Broca’s area. The object–assembly
specialisation in this incipient Broca’s area effectively provided a re–usable computational resource for the evolution of a language production system.
Greenfield [9] based her argument for the evolution of Broca’s region in part
on developmental evidence. She observed that there are parallels in the developmental complexity of speech and object manipulation. In studying the object
manipulation of children aged 11–36 months, she noted that the increase in
complexity of their object combination abilities mirrored the phonological and
syllabic complexity of their speech production. There are two possible explanations for this phenomenon: (1) It represents analogous and parallel development
mediated by separate neurological bases; or (2) the two processes are founded
on a common neurological substrate. Greenfield used evidence from neurology,
neuropsychology, and animal studies to support her view that the two processes
are indeed built upon an initially common neurological foundation, which then
divides into separate specialized areas as development progresses. A significant
part of Greenfield’s argument centered on the results of an earlier study [8] in
which children were asked to nest a set of cups of varying size. The very young
children could do little more than pair cups. Slightly older children showed a
capacity to nest the cups, but employed what Greenfield referred to as a “pot”
strategy as their dominant approach. This entailed the child only ever moving
one cup at a time to carry out the task. Still older children tended to favor a
“sub–assembly” strategy in which the children moved pairs or triples of stacked
cups. At a given age one could detect a combination of different nesting strategies
being employed, but with one strategy tending to dominate.
In [15], I sought to explore some aspects of Greenfield’s thesis within a
computational model. I described an abstract connectionist characterization
of a language processing and object manipulation task involving the recursive
auto–associative memory (RAAM) technique developed by Jordan Pollack [13].
RAAMs were chosen because they are a general representational scheme for
hierarchical structures, and thus can cope equally well with both linguistic and
motoric representations. They are neurally plausible on several counts. The representations are distributed, and neural representations appear also to be distributed. They preserve similarity relationships: RAAM encodings of structures
that are similar are themselves similar. This also appears to be a pervasive feature of neural representations [18]. The fact that RAAM representations are of
fixed-width also gives them some degree of neural plausibility, since the neural
pathways (i.e., cortico–cortical projections) along which their neural counterparts are transmitted, are also of fixed width.
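A RAAM in this spirit can be sketched as a bottleneck autoencoder over (left, right) child pairs. The fragment below is a deliberately stripped-down illustration (random leaf vectors, and for brevity only the decoder weights are trained by gradient descent); a full RAAM trains encoder and decoder jointly and recurses so that internal codes themselves become children:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # fixed width of every representation

W_enc = rng.normal(0, 0.3, (d, 2 * d))   # compresses a (left, right) pair
W_dec = rng.normal(0, 0.3, (2 * d, d))   # tries to reconstruct the pair

def encode(left, right):
    return np.tanh(W_enc @ np.concatenate([left, right]))

a = rng.uniform(-0.8, 0.8, d)            # leaf vectors for two terminal symbols
b = rng.uniform(-0.8, 0.8, d)
target = np.concatenate([a, b])
code = encode(a, b)

def loss():
    return np.mean((np.tanh(W_dec @ code) - target) ** 2)

first = loss()
for _ in range(200):                     # gradient descent on the decoder only
    out = np.tanh(W_dec @ code)
    grad = np.outer((out - target) * (1 - out ** 2), code)
    W_dec -= 0.05 * grad
print(first, loss())                     # reconstruction error falls

# Crucially, a code has the same width as a leaf, so trees nest:
assert encode(code, a).shape == (d,)     # ((a b) a) is still d-dimensional
```

The final assertion is the fixed-width property emphasised above: however deep the tree, the representation travelling along any "pathway" is always a d-vector.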
The simulations described in [15] involved training a simple recurrent network (SRN) to generate, on the basis of an input RAAM “plan”, a sequence of
either object–assembly actions or phonemes. The results of the simulations indicated an advantage in terms of rate of error decline for networks that had prior
training on a simulated object assembly task, when compared with various control conditions. This was taken as support for Greenfield’s view of a functional
homology underlying both language and object assembly tasks.
To understand why the object assembly task provided a significant training
advantage, it is useful to look at the control condition in which prior training was
given first on the language task and then on the object assembly task. In this
case, the prior language training did not benefit the SRN in learning the object
assembly task. This suggests that the advantage provided by initial training on
object assembly may be another case of the “importance of starting small” in
the sense of Elman [5], and already discussed above. The tree structures associated with object assembly were relatively simple, with few deep embeddings.
The language RAAM structures were more complex in this respect. Elman [5]
demonstrated in his grammar-learning studies that if a complex grammar is to
be learned by an SRN, the network must first be trained on a simpler version of
the grammar. In the model described in [15], when initial training was provided
on the language task, the network was being trained on complex structures first
and then simpler ones, the reverse of the approach that Elman found to be effective. It is unsurprising, therefore, that training is retarded when the complexity
ordering is reversed.
As well as demonstrating the viability of the re–use position, the model also
successfully reproduced the relative difficulty that children have in producing
the “pot” and “sub-assembly” strategies. In addition, there appeared to be a
divergence in the exploitation of hidden-unit space as the complexity of the
language task increased. This partitioning of the hidden unit space was analogous
to what Greenfield argues occurs during the development of Broca’s region in
children; early in development the one cortical region subsumes both object
assembly and language tasks, but as development progresses the upper part of
the region specialises for object assembly and the lower part for language. What
the model demonstrated was that this divergence could result simply from an
increase in complexity of the two tasks, rather than a genetically programmed
differentiation, as Greenfield herself has proposed [8].
In summary, the evidence from the modelling described above supports the
argument that there are good computational reasons for building a language
production system on an object-assembly foundation, and lends support to the
more general argument that a process of re–utilisation of cortical computation
may provide a mechanism for the construction of high–level cognitive processes
on lower–level sensory–motor ones.
7
Signposts to a Connectionist/Symbolic Bridge
In the preceding sections I have described a possible route taken in the evolution
of symbolic capacity in humans. The account relies heavily on Deacon’s analysis
of the role of the pre–frontal cortex in the emergence of symbolisation. Deacon
focuses on a neurobiological basis for this capacity, and its possible origins in
evolution. I have augmented this with a description of two general computational mechanisms that might underpin the changes he identified. The first of
these is the role of maturation in conjunction with a neurally plausible learning
mechanisms. I have also proposed that something akin to software re–use may
be at work in biasing the selection of one cortical region rather than another as
the substrate for symbolisation. The jump to symbolisation represents a significant discontinuity in functional capabilities, but I believe that it may be a small
underlying difference that makes all the difference. In other words, the change in
computational mechanism supporting this jump may be quite subtle. So, while
we may have a functionally hybrid system, appearing to combine associationistic and symbolic capabilities, its computational infrastructure may not be that
hybrid at all, merely an extension of existing associationistic mechanism. The
challenge, then, is to find that subtle difference that makes the difference.
Some of the indications are that:
– we should aim to build complex, functionally hybrid, systems from simpler
connectionist components;
– we should explore the use of simple learning rules, such as Hebbian learning,
in conjunction with a maturation dynamic;
– we should try to resolve the limitations in systematicity of connectionist
models by focussing on the overall role of the pre–frontal cortex in language
and cognition.
Explorations in the space of connectionist models constrained by the above
guidelines will, I believe, yield fruitful results.
References
1. Chomsky, N. (1986). Knowledge of language. New York: Praeger.
2. Christiansen, M.H., & Chater, N. (1994). Generalisation and connectionist language learning. Mind and Language, 9, 273–287.
3. Cruikshank, S.J., & Weinberger, N.M. (1996). Evidence for the Hebbian hypothesis
in experience-dependent physiological plasticity of neocortex: A critical review.
Brain Research Reviews, 22, 191–228.
4. Deacon, T. (1997). The symbolic species: The co-evolution of language and the
human brain. London, UK: The Penguin Group.
5. Elman, J.L. (1993). Learning and development in neural networks: The importance
of starting small. Cognition, 48, 71–99.
6. Fodor J. A., & Pylyshyn, Z.W. (1988). Connectionism and cognitive architecture:
A critical analysis. Cognition, 28, 3–71.
7. Gold, E.M. (1967). Language identification in the limit. Information and Control,
10, 447–474.
8. Greenfield, P., Nelson, K., & Saltzman, E. (1972). The development of rule-bound
strategies for manipulating seriated cups: A parallel between action and grammar.
Cognitive Psychology, 3, 291–310.
9. Greenfield, P. (1991). Language, tool and brain: The ontogeny and phylogeny of
hierarchically organized sequential behavior. Behavioral and Brain Sciences, 14,
531–595.
10. Hadley, R.F. (1994). Systematicity in connectionist language learning. Mind and
Language, 9, 247–271.
11. Hadley, R.F., & Hayward, M.B. (1997). Strong semantic systematicity from Hebbian connectionist learning. Minds and Machines, 7, 1–37.
12. Pinker, S. (1994). The language instinct: how the mind creates language. New
York: William Morrow.
13. Pollack, J.B. (1990). Recursive distributed representations. Artificial Intelligence,
46, 77–105.
14. Rebotier, T.P., & Elman, J.L. (1996). Explorations with the dynamic wave model.
In D. Touretzky, M. Mozer, and M. Hasselmo (Eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press.
15. Reilly, R.G. (in press). The relationship between object manipulation and language
development in Broca’s area: A connectionist simulation of Greenfield’s hypothesis.
Behavioral and Brain Sciences.
16. Reilly, R.G. (1998). Cortical software re–use: A neural basis for creative cognition.
In T. Veale (Ed.), Computational Models of Creative Computation, pp. 36-42.
17. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal
representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and
The PDP Research Group (Eds.), Parallel distributed processing. Explorations in
the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: MIT
Press, pp. 318–362.
18. Sejnowski, T.J. (1986). Open questions about computation in the cerebral cortex.
In J. L. McClelland, D. E. Rumelhart, & The PDP Research Group (Eds.), Parallel
distributed processing. Explorations in the microstructure of cognition. Volume 2:
Psychological and biological models. Cambridge, MA: MIT Press, pp. 372–389.
19. Savage–Rumbaugh, E.S., & Lewin, R. (1994). Kanzi: The ape at the brink of the
human mind. New York: John Wiley.
20. Shrager, J., & Johnson, M.H. (1996). Dynamic plasticity influences the emergence
of function in a simple cortical array. Neural Networks, 9, 1119–1129.
21. Thatcher, R.W. (1992). Cyclic cortical reorganization during early childhood.
Brain and Cognition, 20, 24–50.
22. Van Gelder, T. (1990). Compositionality: A connectionist variation on a classical
theme. Cognitive Science, 14, 355–384.
A Cellular Neural Associative Array for
Symbolic Vision
Christos Orovas and James Austin
Advanced Computer Architectures Group
Computer Science Department, University of York
York, YO10 5DD, UK
[christos|austin]@minster.york.ac.uk
http://www.cs.york.ac.uk/arch/
Abstract. A system which combines the descriptional power of symbolic representations with the parallel and distributed processing model of
cellular automata and the speed and robustness of connectionist symbol
processing is described. Following a cellular automata based approach,
the aim of the system is to transform initial symbolic descriptions of
patterns to corresponding object level descriptions in order to identify
patterns in complex or noisy scenes. A learning algorithm based on a
hierarchical structural analysis is used to learn symbolic descriptions of
objects. The underlying symbolic processing engine of the system is a
neural based associative memory (AURA) which enables the system to
operate at high speed. In addition, the use of distributed representations
allows both efficient inter-cellular communication and compact storage
of rules.
1
Introduction
One of the basic features of syntactic and structural pattern recognition systems
is the use of the structure of the patterns in order to classify them. This approach
is the most appropriate when the patterns under consideration are characterized
by complex structural relationships [1]. In these systems each pattern is represented using a symbolic data structure (strings, arrays, trees, graphs) where
symbols represent basic pattern primitives. Structural methods rely on prototype
matching which compares the unknown pattern with a set of models. Syntactic
systems follow concepts from formal language theory aiming to represent each
class of patterns with a corresponding grammar. Although these methods have
been successfully applied in a number of cases, they have their drawbacks such as
the computational complexity of the algorithms, sensitivity to noise and errors
in the patterns and the lack of generality and robust learning abilities [2]. In
relation to syntactic systems, the problems of grammatical inference, the representational abilities of the grammar used, and the complexity and
sensitivity of parsing are important obstacles to be addressed [3,2].
In order to deal with the problem of high dimensionality which occurs when
the entire pattern is treated as the entity to be recognized, a decentralized and
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 372–386, 2000.
© Springer-Verlag Berlin Heidelberg 2000
distributed approach can be followed where the description of the patterns is
achieved by the co-operation of a set of simple and homogeneous symbolic processing units. This approach follows a computational framework similar to cellular automata [4].
The basic idea in cellular automata is that a cellular array of relatively simple processing units exists and at each time instant the state of each cell is
determined by its previous state and the previous states of its direct neighbours
using a common set of simple rules. Despite the homogeneity and simplicity of the units
and the local neighbourhood connectivity, this model can demonstrate examples
of complex behaviour and propagation of information [5] and can be used to
simulate several physical systems containing many discrete elements with local
interactions [6]. Apart from von Neumann’s cellular automaton, with its twenty-nine states and four-neighbour connectivity, which was capable of simulating a Turing
machine [7], and the well known Life game [4], cellular automata have found
applications in simulating physical systems [8,9] and also in image processing
[4,10,11]. The system in [8] is also an example where the rules are extracted
directly from the experimental data that have to be simulated; this is achieved by using a
genetic algorithm to search the rule space for those rules that represent the data best. It is, however, one of the rare examples where
automatic generation of rules for a cellular automata system is performed.
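As a concrete instance of the local-rule update scheme, here is a minimal step of the Life game just mentioned, on a toroidal grid (this is the standard Conway rule set, not the CANN update itself):

```python
import numpy as np

def life_step(grid):
    """One synchronous Game-of-Life update on a toroidal grid."""
    # Count each cell's eight neighbours using wrap-around shifts.
    n = sum(np.roll(np.roll(grid, dy, 0), dx, 1)
            for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            if (dy, dx) != (0, 0))
    # Birth on exactly 3 neighbours; survival on 2 or 3.
    return ((n == 3) | ((grid == 1) & (n == 2))).astype(np.uint8)

# The classic "blinker" oscillates with period 2.
g = np.zeros((5, 5), dtype=np.uint8)
g[2, 1:4] = 1                    # horizontal bar of three live cells
g1 = life_step(g)
print(g1[1:4, 2])                # the bar is now vertical: [1 1 1]
```

Every cell applies the same rule to the same local neighbourhood, which is exactly the homogeneity that the CANN approach inherits; the difference is that the CANN's cell states are symbols and its rules are learned rather than fixed.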
Using this paradigm of parallel, distributed and ‘virtual’ multilayered processing, the problem of dimensionality is overcome by partitioning the object into
segments using processing nodes which communicate with each other. By allowing the states that each cell can be in to represent information at the different
stages of interpretation of low level features towards world models, the cells can
individually decide whether or not the segments they hold are parts of the same
object. Effectively, a bottom up parsing is performed in a decentralized manner;
each cell tries to build its derivation tree upwards while at the same time
obeying orders set by its neighbours.
The questions that need to be answered for such an approach are how the
grammar used to recognize each pattern is held in each node, how the rules of
the grammar can be automatically created, how this rule set is handled efficiently
and how generalization and noise tolerance are provided.
Our approach is based on Cellular Associative Neural Networks (CANNs) [12]
which learn the underlying grammar of the objects in a scene and operate as
outlined above. A similar idea of ‘communicating grammars’ is also followed by
the parallel grammar systems [13]. The network of language processors (NLP)
is a typical example of such a system [14]. This consists of several language
identifying devices which are called language processors and are associated with
the nodes of a graph. The processors operate on strings (which can represent data
and/or programs) by performing rewriting and communication steps. Although
the potential of this approach is revealed by the extensive study of its properties
in [14] there still remain the problems of generality, complexity of parsing and
inferring of grammars. In our model the idea is extended by the use of associative
neural networks in order to efficiently handle large sets of simple symbolic rules
and by the use of a hierarchical approach to learn the structure of the patterns
and produce the sets of rules which make up the communicating grammar.
Associative memories are characterized by their ability to address data by
their content [15,16]. The neural associative memory model which is used in our
system, AURA, is also capable of symbolic and sub-symbolic processing, in the sense
that symbolic information is handled sub-symbolically; symbols are converted to
patterns of activity which are managed in a distributed manner. In this way we
gain the ability to perform symbolic processing
and at the same time benefit from the generalization, noise and error tolerance
and adaptability of neural information processing. The integration of neural
(sub-symbolic) and symbolic processing is the main characteristic of the hybrid
systems [17,18]. Connectionist symbolic processing is also a term related with
these systems. Depending on the way in which neural and symbolic processing
interact, hybrid systems can be classified into various categories, though not necessarily
with very sharp boundaries [18]. In CANNs, the overall architecture is a symbolic
one; a cellular array of communicating symbolic processors. Neural associative
memories with symbolic processing capabilities are used by each unit in the
system in order to meet the requirements for high speed of store and recall,
noise tolerance and generalization. The neural associative memories are basic
elements and are embedded in each processor. This characteristic can classify
CANNs as an embedded hybrid system according to the taxonomy in [18].
The novelty of CANNs lies not only in the combination of ideas from structural pattern recognition with the computational paradigm of cellular automata
and the use of connectionist symbolic processing. It is also in the existence of
a simple yet effective learning algorithm which, using a hierarchical approach,
produces the necessary rules for the operation of the system. In this paper we
present the basic architecture of AURA, the 2D CANN, the learning method
and initial results of simple evaluations.
2
Cellular Associative Neural Networks
A CANN is a cellular array of associative processors. This is a new concept
based on both cellular automata and associative neural networks. As mentioned
earlier, the main characteristics are the cellular automata like operation and the
connectionist symbolic processing using the AURA model of neural associative
memory.
The initial labelling of the image is a 2D array of symbols produced by labelling the raw pixel image using a feature recognition stage. During recognition
each iteration of the cellular array corresponds to a higher level of representation, indicated by symbols associated with higher level structures (e.g. whole
edges, corner parts and then whole objects). After recognition is complete, each
node in the 2D array will contain a label indicating what the feature ‘beneath’
that cell represents.
During processing, messages are exchanged between the cells in the array,
and after every iteration each cell is aware of the state of more distant cells. An
example of the operation in a CANN is depicted in figure 1.
Fig. 1. The recognition of object O through the operation of a CANN. Object O consists of features f1 and f2, and ti(n) is the intermediate state at time n of a cell initially having feature label fi.
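The convergence in figure 1 can be mimicked with a toy one-dimensional cell array and a hand-written rule table. The rules below are invented for illustration; in the real system they are learned, and the messages are far richer than bare neighbour states:

```python
# (own state, left neighbour, right neighbour) -> next state; '_' marks a border
rules = {
    ("f1", "_", "f2"): "t1",   # f1 seeing f2 on its right becomes intermediate t1
    ("f2", "f1", "_"): "t2",
    ("t1", "_", "t2"): "O",    # compatible intermediates promote to object label O
    ("t2", "t1", "_"): "O",
}

def step(cells):
    """Synchronous update: every cell applies the shared rule set at once."""
    padded = ["_"] + cells + ["_"]
    return [rules.get((padded[i], padded[i - 1], padded[i + 1]), padded[i])
            for i in range(1, len(padded) - 1)]

cells = ["f1", "f2"]
cells = step(step(cells))
print(cells)    # ['O', 'O']: both cells agree on the object-level description
```

After two iterations each cell has climbed from a feature label, through an intermediate state, to the object label, which is the bottom-up, decentralized parse described above.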
All the cells have the same structure and they all apply the same sets of
symbolic rules. Thus, their operation is location invariant. As we will see later,
each cell is made up of a number of AURA based associative memories.
Recognition is performed by an iterative process. Symbols from three alphabets are used to form the states of the cells in the array and the messages used
to communicate between cells. The symbolic rules that govern the convergence
properties of the system are produced by a learning algorithm which uses a hierarchical approach to describe the structure of the patterns. Recall uses these
rules and, by employing constraint relaxation, is able to generalize. Following
is a more detailed description of these aspects of the system.
2.1
AURA
The Advanced Uncertain Reasoning Architecture (AURA) is a set of methods
for building associative memory systems capable of symbolic and sub-symbolic
processing. Each cell in the CANN described in this paper is made up of a
number of modules. Each module is a rule matching engine constructed using
the AURA methods [19], the details of which are outside the scope of this paper.
An outline is presented here. The architecture has two stages. The
first uses a set of binary correlation matrix memories (CMMs) [20] to
determine which rule will fire given a set of preconditions, and the second is an
enhanced indexing database used to store the postconditions of a rule. A basic
example of a CMM and a schematic diagram of the operation of the AURA
model are depicted in figure 2.

Fig. 2. a) Example of a CMM. b) Schematic diagram of AURA.

AURA can handle symbolic rules of the form
preconditions → postcondition. The preconditions are sets of variable : value
pairs connected either with logical AND or OR operators. For each variable and
value there is a unique binary vector and the input to the CMMs is formed after
superimposing the corresponding tensor products of these vectors. This input is
directed to the appropriate CMM according to the number of preconditions in the
rule (called arity). To identify each rule there is a unique binary pattern with a
constant number of bits set to one sparsely distributed in it. This pattern is called
separator and is associated with the pre-processed input at the relevant CMM.
The separator is also used as the key in order to store the relevant postcondition
to the database. In recall, the input is applied to the corresponding CMM and a
vector of summed values is produced at the output. This is then converted to one
or a number of superimposed separators using the L-max thresholding method
[21,22]. This method sets the L highest sums to ones and the rest to zeroes. L
is the number of bits initially set in the separator patterns. For each separator
identified in the recovered binary pattern a postcondition is then retrieved from
the database.
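The storage and recall scheme can be sketched with a binary correlation matrix memory in a few lines. The rule contents below are invented, and the bit positions are chosen disjoint so the behaviour is easy to inspect; real AURA separators are sparse random codes and the inputs are tensor-product bindings of variable/value vectors:

```python
import numpy as np

IN, SEP, L = 32, 32, 3       # input width, separator width, bits per separator

inputs     = [{0, 1, 2, 3}, {8, 9, 10, 11}, {16, 17, 18, 19}]
separators = [{0, 1, 2},    {10, 11, 12},   {20, 21, 22}]
database   = dict(zip(map(frozenset, separators),
                      ["post-A", "post-B", "post-C"]))  # separator -> postcondition

# Storage: OR the outer product of each (separator, input) pair into the CMM.
M = np.zeros((SEP, IN), dtype=np.uint8)
for x_bits, s_bits in zip(inputs, separators):
    for i in s_bits:
        for j in x_bits:
            M[i, j] = 1

def recall(x_bits):
    x = np.zeros(IN)
    x[list(x_bits)] = 1
    sums = M @ x                         # summed activations per separator bit
    top = np.argsort(sums)[-L:]          # L-max thresholding: keep L highest sums
    return database[frozenset(int(i) for i in top)]

print(recall({8, 9, 10, 11}))            # exact preconditions -> post-B
print(recall({8, 9, 10}))                # one input bit missing: still post-B
```

The second recall illustrates the partial-matching behaviour: dropping an input bit lowers all the sums uniformly, so L-max thresholding still recovers the same separator and hence the same postcondition.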
AURA provides a powerful symbolic processing engine and it is an essential
part of the system. It allows on-line learning and high speed in both learning
and recall modes; it can operate on the data in parallel (multiple queries will
give multiple responses) and is capable of partial and combinatorial matching
of rules. It also allows direct hardware implementation of the system using the
PRESENCE [23] architecture for very high speed processing.
2.2
The Cell Structure
Each cell in the CANN consists of three kinds of modules, spreader, passer and
combiner. They perform the three following tasks: (a) convert the input to a
form suitable for spreading in each direction away from the cell, (b) combine
incoming information with information to be passed to the neighbouring units
and operate as a symbolic gate, and (c) combine current state and information
from neighbours to produce the new state of the processor. An example of a
two-dimensional processor communicating with four neighbours is depicted in figure
3a. The spreaders incorporate a gating function, which can prevent symbols from
being passed to other units if the cell has no image features on its input. This
can prevent information pathways between passers becoming saturated. The
passers operate to ‘count’ the distance a message has been sent, akin to allowing
the system to state that ‘there is a corner feature five hops away’. The combiners
unite the information being passed by the passers about distant features and the
current information about what part of the object the cell may represent, and
translate this to a higher level description of the feature. The spreaders allow
input symbols to be translated to features suitable for the passers. If they are
not used, all the passers obtain the same symbols to be passed in all directions.
In operation, the output from the combiners becomes the input to the spreaders
on the next iteration of the array.
Fig. 3. a) A 2D associative processor and b) the patterns used for training.
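A minimal sketch of this cell structure, with plain dictionaries standing in for the CMM-backed rule modules; all symbol names here are invented for the example.

```python
# Illustrative sketch of one CANN cell (our names; the real system uses
# CMM-backed rule modules rather than Python dictionaries).
class Cell:
    def __init__(self, passer_rules, combiner_rules, state):
        self.passer_rules = passer_rules      # (incoming message, direction) -> outgoing message
        self.combiner_rules = combiner_rules  # (neighbour messages, state) -> new state
        self.state = state                    # frozenset of symbols; empty = null cell

    def pass_messages(self, incoming):
        # task (b): gate and forward information towards the neighbouring units
        return {d: self.passer_rules.get((m, d)) for d, m in incoming.items()}

    def combine(self, incoming):
        # task (c): combine current state and neighbour information into a new state
        key = (tuple(sorted(incoming.values())), self.state)
        if key in self.combiner_rules:
            self.state = self.combiner_rules[key]

rules = {(("corner@1",), frozenset({"edge"})): frozenset({"corner-near-edge"})}
cell = Cell({}, rules, frozenset({"edge"}))
cell.combine({"N": "corner@1"})
assert cell.state == frozenset({"corner-near-edge"})
```

The spreader module, omitted here, would sit between the combiner output and the passers, translating each new state into per-direction symbols.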
Messages and Rules. Information in a CANN is represented by symbolic messages consisting of one or more symbols. These symbols belong to three sets, or
alphabets. The first is the input alphabet: the set of symbols representing primitive pattern features. This can be thought of as the set of terminals
C. Orovas and J. Austin
in a grammar system [24]. The second is the set of the transition symbols used
during the evolution process of the CANN. These symbols correspond to the
non-terminals in a grammar system. They represent either combinations of symbols, sub-patterns, or formations of sub-patterns. The third set is
the output alphabet; it can be thought of as the set of starting symbols
in a grammar system, and its symbols represent complete patterns, the object level
symbols of the system.
Each module has its own set of symbolic rules. However, these sets are the
same for the modules of the same type; thus, the operation is location independent. The rules are of the form input conditions → output, where input conditions
are combinations of messages and output is either a transition symbol or a symbol from the output alphabet.
Connection Schemata. These determine the connection pattern which is followed, both inter-cell and intra-cell in the array. Cells are connected directly to
four of their immediate neighbours. Through the exchange of messages, cells
become aware of the states of more distant cells after a number of
iterations.
For the intra-cell case the connection schema defines which messages form
the input to a module and where the output of a module is directed. There
can be equivalent forms of connection using different types and numbers of
modules. It is important that there are always communicating and state-determining
units within the processor. These connection schemata are uniformly
applied to all the processors. The types followed for the current
experiments derive from the one depicted in figure 3a.
For the results presented in this paper no spreader modules were used, since
spreading different messages in each direction has little effect on the
problems addressed here.
2.3 Learning Session
The sets of rules for the operation of a CANN are produced during the learning
session. These rules allow the transition from the initial symbolic image to the
object level description. The training set which is used is a set of patterns along
with the corresponding high level symbol (object label) for each pattern. Two
inputs are given to the system for each pattern: (a) a symbolic description of the
pattern using labels representing primitive pattern features, placed in an array
where each location corresponds to a cell in the CANN, and (b) an object
label which is the ‘name’ of the pattern.
The algorithm which is followed for every processor in the cellular array
when the processor has a non null initial state is depicted in figure 4. Initially, a
preliminary stage (step 1) prepares the system for operation by placing the state
of each processor at the input points of its neighbours. Then, the main part of
the algorithm begins and it is applied for all the modules in all the processors
Step 1 Place the symbols representing the initial state of the processor
to the input channels of its neighbours.
Step 2 For all modules:
Check if the input or the combination of inputs is recognizable.
If recognizable
Retrieve the answer and place it at the location for the output
of the module.
else
Assign a new transition symbol to represent this input or
combination of inputs, store the new association to the module and
place the new symbol at the location for the output of the module.
Step 3 If all states are unique goto step 4 else goto step 2.
Step 4 Associate the current inputs to the combiner module with the object level
label provided and store the association to the module.
Fig. 4. The algorithm used in the learning session. Recall that the spreader
module takes only one input whereas the passer and the combiner modules have more
than one input.
with a non null state. Processors with null (empty) states do not participate in
this process.¹
For every module a ‘test and set’ approach is followed. If the combination
of the symbols at the input of a module is recognizable then the corresponding
postcondition is retrieved. If not, a new transition symbol is created and assigned
as a postcondition to the current formation of inputs. Then, the new rule is stored
at the relevant CMM. Rules previously produced are used and the new ones are
appended to the corresponding sets.
As mentioned earlier, after every iteration, due to the propagation of messages, each cell becomes aware of the states of more distant cells. This is reflected
in its state. Within a finite number of iterations (the number of which depends
on the size of the input pattern) all cells have a state which is unique in the array. Unique states indicate unique formations of subpatterns into which the input
pattern has been divided. This is the termination condition for the learning algorithm and corresponds to the configuration in which the entropy of the
cellular array is the maximum. At this stage, for all the non-empty cells, the
preconditions which exist at the input of the combiner modules are associated
with the symbol from the output alphabet specifying the pattern presented.
The beauty of this learning process is its simplicity, allowing complex patterns
to be learned without manual intervention. As is presented next, generalization
of the resulting rules to allow recognition in the presence of image noise is dealt
with using a relaxation method.
¹ An option exists for the passers of these processors to participate when they have
an incoming signal.
2.4 Recall Session
For recall, a symbolic image is presented to the CANN. As in the learning session, a bottom-up approach is followed. A ‘parsing’ using a universal grammar is
performed, transforming the input symbols to the nearest object level description
or descriptions. As mentioned in the beginning, each cell tries to build its
derivation tree in a bottom-up manner, obeying at the same time the orders
arriving from its neighbours. The initial states of the cells represent simple features which could exist in all objects that could be recognized by a CANN. As
messages from neighbours arrive, the cells are forced to alter their states to new
ones which represent feature formations that can be found in a reduced number
of objects. This process is repeated and leads to an ever decreasing number of
possible objects that the cell can belong to. If a pattern used for training is
presented then the system converges to the corresponding object level symbols
for that pattern. If there are similar patterns stored, the corresponding object
level symbols also appear at the common areas.
When an unknown pattern is presented, the system tries to label those formations of pattern primitives which are recognized. However, it is not always
possible for all cells to give a higher level or an object level label due to corruption
of, or differences in, the input pattern compared to the trained examples. To allow
generalization, relaxation is used. If a postcondition for a combination of inputs
cannot be found then the constraints are relaxed and responses with incomplete
precondition matching are accepted (the system has an increased tolerance).
This is achieved by accessing more than one CMM in the relevant module and
by reducing the threshold used for determining whether a valid separator can be
retrieved from the CMM’s output. The decision for increased tolerance is taken
either when none of the cells can alter their state (global) or when a given cell fails
to output a known separator (local). In the latter case we have a completely
decentralized operation.
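One way to picture this threshold relaxation is a Willshaw-style recall whose acceptance threshold is lowered by the tolerance. This is our own simplified sketch, not AURA's exact thresholding scheme.

```python
import numpy as np

def relaxed_recall(cmm, key, tolerance):
    # A separator bit fires if its summed response misses at most
    # `tolerance` of the set input bits (threshold reduced from key.sum()).
    sums = key @ cmm
    return (sums >= key.sum() - tolerance).astype(int)

# Store one rule, then query with a corrupted precondition key.
cmm = np.zeros((8, 8), dtype=int)
full_key = np.array([1, 0, 1, 0, 0, 0, 0, 0])
sep = np.array([0, 1, 0, 0, 1, 0, 0, 0])
cmm |= np.outer(full_key, sep)

noisy_key = np.array([1, 0, 0, 1, 0, 0, 0, 0])   # one bit wrong, one missing
assert relaxed_recall(cmm, noisy_key, 0).sum() == 0         # strict match fails
assert np.array_equal(relaxed_recall(cmm, noisy_key, 1), sep)  # relaxed succeeds
```

With tolerance 0 the corrupted query recovers nothing; allowing one missed precondition bit recovers the stored separator, which mirrors the 1/5 and 1/2 tolerances used in the experiments below.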
Due to partial matching it is possible for a module to output more than one
symbol. This is why messages and states can consist of more than one symbol.
There are two ways in which multiple symbols for the same precondition are presented to the modules: consecutive and simultaneous. The former
presents inputs one by one and trades speed for size of the CMMs, while the latter presents the inputs superimposed and is faster but needs larger CMMs. For
example, consider rules with two preconditions. Symbols A and B are both present for the first precondition while symbol C is the second precondition. Using
the consecutive presentation, the CMM of arity two is accessed twice: once
for the combination A and C and once for the combination B and C. The results
are then combined to form a multiple symbol message. On the other hand, using
the simultaneous approach the CMM is accessed only once. In that case the
binary patterns corresponding to symbols A and B are superimposed. We can
see that using the consecutive approach the number of times that the CMMs are
accessed depends on the number of symbols existing as pre-conditions while the
simultaneous method always accesses the CMMs only once. This is one of the
major advantages of using the AURA methods for rule matching in the CANN.
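The difference in access counts can be made concrete with a small sketch (our own code; `recall_fn` stands for one CMM access):

```python
from itertools import product

def consecutive(recall_fn, preconditions):
    # one CMM access per combination of single symbols; results are united
    results = set()
    for combo in product(*preconditions):
        results |= recall_fn(frozenset(combo))
    return results

def simultaneous(recall_fn, preconditions):
    # symbols of each precondition superimposed; a single CMM access
    return recall_fn(frozenset().union(*preconditions))

# Count accesses for the example in the text: {A, B} and {C}.
calls = []
def counting_recall(query):
    calls.append(query)
    return {"R"}          # dummy separator set

consecutive(counting_recall, [{"A", "B"}, {"C"}])
assert len(calls) == 2    # (A, C) and (B, C)

calls.clear()
simultaneous(counting_recall, [{"A", "B"}, {"C"}])
assert len(calls) == 1    # single superimposed access
```

The consecutive variant's access count grows with the product of the precondition multiplicities, while the simultaneous variant is constant, at the cost of larger CMMs to keep superimposed patterns separable.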
The recall session stops when there are no alterations to the configuration
of the system, when the number of state-altering cells falls below a threshold, or
when a preset maximum number of iterations has been reached.
3 Experiments
The aim of the system is the recognition of objects in multi-object and noisy
scenes. Applications can be found in specific areas such as the analysis of electronic
and mechanical drawings, although the long term aim is to build an adaptable
and generic pattern recognition system.
Various parameters of the system have been tested using a set of prototype
patterns. These parameters include the intra-processor connection
schemata, the effect of global/local relaxation, simultaneous/consecutive presentation, and parameters of the AURA modules. At the same time, the behaviour
of the system with complex patterns (combinations of the training patterns),
noise, and scale variations of the training patterns has been tested. Some initial results with complex patterns were presented in [25]. In this paper, results
of experiments where symbolic noise was injected into the training patterns,
and also with scale variations of them, are presented.
The patterns used for training (R1-R6) can be seen in figure 3b. Their size
is of the order of 6×12 and 12×6 symbols. For training, five repetitions for each
of the patterns R1 to R6 were needed. For each pattern, a decreasing number
of rules for the combiner module were produced, starting at 153 (for R1) and
ending at 82 (for R6). As mentioned earlier, the number of iterations needed
to complete the learning session depends on the size of the patterns while the
number of rules produced depends on the level of similarity among the patterns
since common formations are represented with the same rules.
Symbolic noise can be inserted into the input image during the initial labelling process. For example, consider a symbol, A, at position (x, y) of a symbolic
image. There are three categories of noise: (a) absence of symbol A from (x, y),
(b) replacement of A by one or more different symbols, and (c) addition of one
or more symbols at (x, y).
For testing purposes, random noise was injected into the training set at
various levels according to predefined values for P(x|α), the probability
of having noise of type x when symbol α (representing a primitive pattern
feature) should be present, given that noise exists at that position.
Ten versions at each noise level were tested for each pattern and the results
for pattern R4 can be seen in the graphs in figures 5 and 6. These graphs show
the average recognition success while recalling noisy versions of pattern R4. The
standard deviations of the averages for pattern R4 itself are also shown. Figure
7 shows similar results for pattern R6.
The graph in figure 5a shows the recognition results when no tolerance was
allowed for either the combiner or the passer modules. We can see that
the percentage for R4 is always the highest, although its value decreases to
almost 5% at 50% noise. A tolerance of 1/5 for the combiner modules is allowed
Fig. 5. Percentages of object level labels while recalling noisy versions of pattern R4.
Tolerance for the combiner modules is 0/5 (see text) in (a) and 1/5 in (b). Tolerance
for the passer modules is 0/2 in both (a) and (b).
in graph 5b. That means that one out of five preconditions is allowed not to
match in the inputs to the combiner modules. The effect is a direct improvement
in recognition success, as more cells are able to alter their states and end
up with object level labels even when not all the necessary preconditions are
present. This improvement is even better in graph 6b, where a tolerance of 1/2 is
Fig. 6. Percentages of object level labels while recalling noisy versions of pattern R4.
Tolerance for the combiner modules is 0/5 in (a) and 1/5 in (b) while tolerance for the
passer modules is 1/2 in both (a) and (b).
Fig. 7. Percentages of object level labels while recalling noisy versions of pattern R6.
Tolerance for the combiner modules is 2/5 for both (a) and (b) while no tolerance is
allowed for passer modules in graph (a).
also allowed for the passer modules. This enables the cells to communicate and
exchange messages even when there are empty spaces or erroneous conditions
between them. The effect of relaxing the constraints of the system is
obvious from these graphs. Even with 50% noise, object level symbol R4 has
an average occurrence of almost 80% in 6b.
Graph 6a shows the results when tolerance is allowed only for the passer modules. These results show indications of improvement, but it is only when both
modules are relaxed that the best performance is achieved. Comparing 5b and
6a we can see that the behaviour of the system is influenced to a greater degree by
the combiner units than by the passer units. This is expected since the combiner
modules are the ones that decide the next state of the processor while the
passer modules are only responsible for passing this information to the neighbouring units. The effect of relaxing the passer modules is more obvious
in graphs 7a and 7b. We can see there that although the correct behaviour is
followed in 7a, it needs to be augmented by relaxing the passer modules (in
graph 7b). This has the effect of ‘fine-tuning’ the recognition levels.
We can notice from all the graphs that the next highest percentages are
for patterns which are similar to the actual pattern. Thus we have a sort of
‘multilevel classification’ where recognition percentages are distributed according
to the level of similarity with the actual pattern at the input. We can also notice
that as the level of noise increases so do the percentage of symbols output for
patterns with no similarity to the actual one. This is because as noise increases
there are more randomly created subpatterns belonging to these patterns.
The behaviour with the increased tolerance demonstrates both the capabilities of AURA for uncertain reasoning and partial matching and the robustness
Fig. 8. Recalling scaled versions of pattern R1 (a) and pattern R4 (b). The combiner
modules are allowed a tolerance of 1/5 while no tolerance is allowed for passers.
of the distributed processing approach. Relaxing the combiners enables the
production of states for the units even if some messages are missing, while
relaxing the passers enables messages to overcome obstacles such as empty spaces
and erroneous conditions. It is important to note, however, that even with no
relaxation the behaviour of the system is still good.
Another set of experiments was performed in order to examine the behaviour
with scale variations in patterns. The current results demonstrate that the system can cope very well with scaling of up to 120% in all cases, and in some cases
up to 150%. The graphs in figure 8 give the results in two cases.
The results in 8a are characteristic for patterns R1 and R2 while the ones in
8b are typical for the rest of the patterns. Increasing the tolerance of the passer
modules provides better results, although not at the same level as with noise.
Ideas which would allow a more ‘distance free’ recognition, i.e. recognition of a
pattern giving more weight to the nature of the existing subpatterns than to their
distances, are currently being considered.
Local relaxation and consecutive presentation were used for these experiments. Recognition of the patterns was achieved by the 6th iteration in
most cases, with only small fluctuations in the percentages after that. As
with the learning session, the number of iterations before the first recognition
is achieved depends on the size of the patterns. A software simulation of the
AURA model running on Silicon Graphics workstations was used. We plan to
extend the experiments in the near future using the dedicated hardware
platform for the AURA system. This is estimated to provide a speed-up of up
to 200 times [23], thus allowing for real time image processing, since the initial
labelling stage can also be performed using the dedicated hardware.
4 Summary
A cellular system combining the enhanced representational power of symbolic
descriptions and the robustness of associative information processing has been
presented. Homogeneous associative processors are placed in a cellular array and
they interact in a parallel, distributed and decentralized fashion. Each processor
has communicating and state determining modules and together they process
the rules of a global and universal grammar. These rules are produced during
a simple, but powerful, learning method which uses a hierarchical approach.
A basic ‘test and set’ principle is followed, and the grammar which is
produced describes the structure of the patterns and the relations among their
constituent parts at all levels of abstraction.
Starting with initial states representing primitive pattern features, a bottom-up
approach is followed by each cell in the system in its effort to reach
an object level state. This state indicates the object(s) of which the cell is part.
The process is directed by the messages arriving from the neighbouring cells.
The parallel operation of the units relieves the model of the complexity
associated with parsing, because each cell operates autonomously
while still contributing to a global solution. This characteristic also enables the
model to tackle erroneous conditions and noise at a local level.
An essential component of the system is the underlying associative symbolic
processing engine, AURA, which endows the system with high-speed management of large sets of symbolic rules, generalization, partial matching, and noise
tolerance.
The descriptional and computational potential of parallel grammar systems
has been well studied and demonstrated in similar approaches [13,14]. The system presented in this paper also demonstrates how the incorporation of a powerful
connectionist symbolic processing mechanism, along with a simple
yet effective learning algorithm, can result in a flexible pattern recognition system
with direct hardware implementation.
Acknowledgements
The funding for this project came from the State Scholarships Foundation of
Greece, while C. Orovas’s participation in the workshop was subsidized by the
TMR PHYSTA (Principled Hybrid Systems: Theory and Applications) research
project of the EC.
References
1. H. Bunke. Structural and syntactic pattern recognition. In C.H. Chen, L.F. Pau,
and P.S.P. Wang, editors, Handbook of Pattern Recognition & Computer Vision,
pages 163–209. World Scientific, 1993.
2. K. Tombre. Structural and syntactic methods in line drawing analysis: To which
extent do they work? In P. Perner, P. Wang, and A. Rosenfeld, editors, Advances in
Syntactic and Structural Pattern Recognition, 6th Int. Workshop, SSPR’96, pages
310–321, Germany, 1996.
3. E. Tanaka. Theoretical aspects of syntactic pattern recognition. Pattern Recognition, 28(7):1053–1061, 1995.
4. K. Preston and M. Duff, editors. Modern Cellular Automata: Theory and Applications. Plenum Press, 1984.
5. S. Wolfram. Computation theory of cellular automata. Communications in Mathematical Physics, 96:15–57, 1984.
6. S. Wolfram. Statistical mechanics of cellular automata. Reviews of Modern Physics,
55(3):601–643, Jul 1983.
7. A.W. Burks, editor. Essays on Cellular Automata. University of Illinois Press,
1970.
8. F.C. Richards, P.M. Thomas, and N.H. Packard. Extracting cellular automaton
rules directly from experimental data. Physica D, 45:189–202, 1990.
9. Y. Takai, K. Ecchu, and K. Takai. A cellular automaton model of particle motions
and its applications. Visual Computer, 11, 1995.
10. T. Pierre and M. Milgram. New and efficient cellular algorithms for image processing. CVGIP: Image Understanding, 55(3):261–274, 1992.
11. M.J.B. Duff and T.J. Fountain, editors. Cellular Logic Image Processing. Academic
Press, 1986.
12. C. Orovas. Cellular Associative Neural Networks for Pattern Recognition. PhD
thesis, University of York, 1999.
13. G. Paun and A. Salomaa, editors. New Trends in Formal Languages. LNCS 1218.
Springer, 1997.
14. E. Csuhaj-Varju and A. Salomaa. Networks of parallel language processors. In
Paun and Salomaa [13], pages 299–318.
15. T. Kohonen. Content-Addressable Memories. Springer-Verlag, 1980.
16. J. Austin. Associative memory. In E. Fiesler and R. Beale, editors, Handbook of
Neural Computation. Oxford University Press, 1996.
17. G.E. Hinton, editor. Connectionist Symbol Processing. MIT/Elsevier, 1990.
18. R. Sun and L.A. Bookman, editors. Computational Architectures Integrating Neural
and Symbolic Processing. Kluwer Academic Publishers, 1995.
19. J. Austin and K. Lees. A neural architecture for fast rule matching. In Proceedings
of the Artificial Neural Networks and Expert Systems Conference, Dunedin, New
Zealand, Dec 1995.
20. D.J. Willshaw, O.P. Buneman, and H.C. Longuet-Higgins. Non-holographic associative memory. Nature, 222:960–962, Jun 1969.
21. D. Casasent and B. Telfer. High capacity pattern recognition associative processors.
Neural Networks, 5:687–698, 1992.
22. J. Austin and T.J. Stonham. Distributed associative memory for use in scene
analysis. Image and Vision Computing, 5(4):251–261, 1987.
23. J. Kennedy and J. Austin. A parallel architecture for binary neural networks. In
Proceedings of the 6th International Conference on Microelectronics for Neural
Networks, Evolutionary & Fuzzy Systems (MICRONEURO’97), pages 225–232,
Dresden, Sep 1997.
24. G. Rozenberg and A. Salomaa, editors. Handbook of Formal Languages, Vol. I–III.
Springer-Verlag, 1997.
25. C. Orovas and J. Austin. A cellular system for pattern recognition using associative
neural networks. In 5th IEEE International Workshop on Cellular Neural Networks
and their Applications, pages 143–148, London, April 1998.
Application of Neurosymbolic Integration for
Environment Modelling in Mobile Robots
Gerhard Kraetzschmar, Stefan Sablatnög, Stefan Enderle, and Günther Palm
Neural Information Processing, University of Ulm
James-Franck-Ring, 89069 Ulm, Germany,
{gkk,stefan,steve,palm}@neuro.informatik.uni-ulm.de,
WWW:http://www.informatik.uni-ulm.de/ni/staff/gkk.html
Abstract. We present an architecture for representing spatial information on autonomous robots. This architecture integrates several kinds
of representations, each of which is tailored for different uses by the robot control software. We discuss various issues regarding neurosymbolic
integration within this architecture. For one particular problem – extracting topological information from metric occupancy maps – various
solution methods have been evaluated. Preliminary empirical
results based on our current implementation are given.
1
Introduction and Motivation
A close investigation of strengths and weaknesses of both symbolic, classical
AI methods and subsymbolic, so-called soft computing methods suggests that
a successful integration of both promises to yield more powerful methods that
exhibit the strengths and avoid the weaknesses of either component approach.
This idea is the underlying assumption of the majority of efforts to build hybrid
systems over the past few years. Kandel and Langholz [8], Honavar and Uhr [7],
Bookman and Sun [14], Goonatilake and Khebbal [5], Medsker [10] and Wermter
[18] all have edited collections of papers describing a wide variety of approaches to
integrate symbolic and subsymbolic computation. Hilario [6] provides a suitable
classification and defines the notion of neurosymbolic integration for combining
neural networks with symbolic information processing.
The study of neurosymbolic integration is also a central research topic in
our long-term, multi-project research effort SMART.¹ In this project, the topic
is studied in the context of adaptive mobile systems, such as service robots. In
our group, we particularly investigate neurosymbolic integration for representing
and reasoning about space.
In this paper, we first give a very brief overview of the most common approaches to representing spatial concepts in robotics. An analysis of their respective strengths and weaknesses motivates the development of integrated, neurosymbolic systems for spatial representations on autonomous mobile robots.
The Dynamo Architecture is then suggested as a framework for this endeavor,
¹ See http://www.uni-ulm.de/SMART for more information.
S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 387–401, 2000.
© Springer-Verlag Berlin Heidelberg 2000
G. Kraetzschmar et al.
and its potential for applying neurosymbolic integration is outlined. One of the
candidate problems for applying neurosymbolic integration is the extraction of
topological spatial concepts from metric occupancy maps. We discuss various
approaches we have pursued so far and report on their respective strengths and
weaknesses based on experimental results.
2 Overview of Representations of Space
Reasoning about space is a widely occurring problem. Application areas range
from Geographic Information Systems (GIS), Computer-Aided Design (CAD),
and architectural interior design all the way to industries like clothing, shoes,
and logistics. In most of these problems, reasoning about space is used either
and logistics. In most of these problems, reasoning about space is used either
as an optimization tool (minimizing waste of raw material, maximizing use of
given storage capacity) or as a structural aid to solve problems later in the work
process (providing precise specs for parts manufacturing or interior design work
on a building). In these applications, a global view on the problem can be taken,
and once a suitable representation for the relevant spatial concepts has been
chosen, the representation remains fixed and static.
The situation in robotics, even for classical industrial robots, is quite different. Either the whole robot, or at least parts of it, like a multiple-joint robot
arm, moves through space, thereby changing its spatial relations to its surroundings over time. Moreover, most intellectually and economically attractive
applications of autonomous mobile robots involve environments exhibiting dynamics of some sort: various kinds of manipulatory objects (trash cans
etc.), objects relocatable by other agents like chairs and crates, objects with
changing spatial state like doors, and both human and non-human agents all
introduce a certain dynamic aspect. Thus, a fundamental, but often neglected
question to ask about spatial representations used in robotics is how they can
deal with dynamic aspects of the world.
Any evaluation of spatial representations and the respective reasoning capabilities in robotics has to take into account the particular tasks the robots are
supposed to perform. Some typical tasks are briefly overviewed in the next few
paragraphs.
Navigating Through Space: One of the most investigated tasks is the pick-up-and-delivery task for office robots (see, e.g., [1]). It requires autonomous navigation
in office environments. Leonard and Durrant-Whyte [9] have framed the problem concisely as three questions: 1) Where am I? 2) Where am I going? and
3) How should I get there? Almost all approaches to solving the navigation task
for indoor robots are based on some kind of two-dimensional spatial representation. The range of representations varies widely and includes at least polygonal representations/computational geometry, relational representations/logics
of space [12], topological representations/graph theory, and probabilistic metric
occupancy grids/Bayesian inference [2], [16]. In all cases, allocentric representations of the environment are most commonly used, i.e. a global perspective
Neurosymbolic Integration for Environment Modelling
389
is taken. Most approaches, however, are best suited for modelling static environments and require substantial effort for dealing with dynamic world aspects,
especially if the environment is very large.
Several biologically inspired approaches have received closer attention, e.g. Mallot's view graphs [3] or Tani's sensory-based navigation system [15]. Experiments in biology and psychophysics have also triggered increased interest in egocentric representations (see e.g. [13]), where the environment is modelled relative to the robot. Egocentric representations have strong implicit dynamics (almost everything changes whenever the robot moves) and are often transient, i.e. they are constructed from the sensory information currently available and are not stored permanently. As a consequence, the information tends to be noisy (the few available sensory input data yield limited reliability) but timely (changes in the environment are seen immediately, provided the sensors can detect them).
Following People: A seemingly similar, yet different robot task is following a person. Spatially speaking, the robot is supposed to track a person by keeping the distance between the person and itself within given bounds. The task requires visually tracking the person to be followed and extracting information about the spatial relation (angular position, distance) between the robot and the tracked person. Having information on the dynamics of the tracked person (speed, acceleration) can simplify the task. An essential requirement is the automatic avoidance of obstacles by the robot. Having a model of the environment may be of help, but in no way solves the problem by itself. A robust people-following system would also have to compensate for temporary interruptions of the tracking capability due to occlusions. For this kind of task, egocentric models of the immediate environment seem to be quite well suited.
Manipulating Objects: Manipulating objects by robots involves even more challenging problems of reasoning about space. Contrary to navigation tasks, three-dimensional representations are considered essential for manipulation tasks by most roboticists. However, most 3D representations require substantially higher computational effort, which makes it even more difficult to model dynamic world aspects than in the 2D case. Also, there is psychological and psychophysical evidence that humans use just 2D representations even for manipulatory tasks.
Speaking with Humans: Interacting with humans is a robot task that has recently received significantly increased interest; many researchers believe it to be the field attracting most robotics research over the next decade or so. Using speech is the most natural and appealing way of interacting with a human. Spatial representations and reasoning about space are essential requirements for providing powerful and natural speech interfaces. Even the specification of the simplest robot tasks (“go and get me coffee”, “pick up the blue box next to the trash can in the kitchen and bring it to Peter’s room”) involves substantial and complex reasoning about space. Logics of space and topological approaches are most common in these areas, where precise metric distances can often be neglected.
390
G. Kraetzschmar et al.
Reviewing the various kinds of representations for the different tasks, it is immediately plausible that only the integration of several of the discussed approaches is likely to provide the necessary spatial reasoning capabilities for mobile robots performing the whole range of tasks. Thus, hybrid spatial representations are of great interest in robotics. Several hybrid approaches, e.g. by Thrun [16] or Kuipers [11], are already known, but these systems use a combination of several representations for navigation only and do not specifically support other robot tasks.
3 The Dynamo Architecture
We present the Dynamo architecture (see Figure 1) as a framework for the integration of symbolic and subsymbolic spatial representations. Basically, Dynamo
features three layers of spatial representations:
– At the lowest level, we use egocentric occupancy maps for representing the
local environment of the robot. For each sensor modality, a neural network
interprets the most recent sensory data and approximates a continuous 2D
occupancy landscape. Locality is determined by the range of the particular
sensor. An integrated view of the local environment is produced by fusing the
continuous representations of the different sensor modalities. This so-called
multimodal egocentric map is continuous as well. If needed, e.g. for implementing collision avoidance during trajectory execution, discrete equidistant
occupancy grid maps can be produced for each continuous map by sampling
it in an anytime manner. All egocentric maps are transient.
– At the medium level, we use allocentric occupancy maps to represent a global
view of the environment. These maps (usually one, but several for different
floor levels) are discrete, usually equidistant (but other representations are
possible and under consideration), and constructed automatically by integrating egocentric occupancy maps over time. The integration requires an
estimate of the global position of the robot; the quality of position estimates
strongly influences overall allocentric map quality.
– At the top level, we basically use a relational representation, which has been extended by topological and polygonal elements. The relational representation, itself integrated into the knowledge representation used by the robot, provides the means to represent relevant spatial concepts like rooms, furniture, doors, etc., including all necessary attributes. Each such concept is associated with a polygonal description (currently a set of rectangular regions). The defining characteristic of a region is that it has an attribute which is spatially invariant over the region. Thus, regions are typed. Examples are the region of a table, the region of a rug, or a region representing an abstract spatial concept like a “forbidden zone” (e.g. stairs). Depending upon their type, regions may overlap (Bob’s room and free space) or not (wall and free space). Traversability of and between regions is topologically represented by a graph.
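The sampling step at the lowest level can be illustrated with a minimal sketch: a continuous egocentric occupancy landscape (here a toy Gaussian blob standing in for a learned neural approximation; all names and parameters are illustrative, not the actual Dynamo interfaces) is discretized into an equidistant grid.

```python
import math

# Hypothetical continuous occupancy landscape for one sensor modality:
# maps a point (x, y) in the robot's egocentric frame to an occupancy
# value in [0, 1].  Here a toy landscape with a single obstacle blob.
def occupancy(x, y):
    return math.exp(-((x - 1.0) ** 2 + y ** 2))

def sample_grid(occ, size_m=4.0, resolution_m=0.5):
    """Produce a discrete equidistant occupancy grid centred on the
    robot by sampling the continuous landscape at cell centres
    (illustrative helper, not the authors' API)."""
    n = int(size_m / resolution_m)
    half = size_m / 2.0
    grid = []
    for i in range(n):
        row = []
        for j in range(n):
            x = -half + (i + 0.5) * resolution_m
            y = -half + (j + 0.5) * resolution_m
            row.append(occ(x, y))
        grid.append(row)
    return grid

grid = sample_grid(occupancy)
print(len(grid), len(grid[0]))  # 8 8
```

A coarse resolution can be sampled first and refined while time remains, which is one way to realize the anytime sampling mentioned above.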
Fig. 1. The Dynamo Architecture. (Bottom to top: visual, IR, sonar, and laser data are interpreted into unimodal egocentric maps and fused into a multimodal egocentric map; integration with position estimation and odometry yields an occupancy grid; segmentation produces a region layer, concept matching an annotation layer, and metric/topological projection a relational/topological map used for route planning and spatial queries.)
The Dynamo architecture provides many opportunities to study neurosymbolic integration in various forms:
1. Fusing the multiple modalities of egocentric maps can be done symbolically (priority rules, majority rules) or subsymbolically (learning context-dependent fusion functions with neural networks).
2. In order to reduce the dependency on reliable position estimates for temporal
map integration, a matching process can try to compute the position where
the current (fused) egocentric map matches the allocentric map best. Methods to do this in a subsymbolic manner are based e.g. on expectation maximization. Problems arise when the world changes dynamically. Including
expectations generated from symbolically available knowledge may improve
and simplify this process.
3. The segmentation of grid maps links sets of grid cells to symbolic concepts like rectangular regions or polygons. This process is considered a central
element in integrating subsymbolic and symbolic representations and will be
discussed in more detail below.
4. By map annotation we mean associating low-level symbolic concepts like
regions with higher-level concepts present in the knowledge base, thereby
augmenting the information available for the region under consideration.
This association can be derived either by using object classification (we can
see that this region is a table) or by matching low-level and higher-level concepts based on spatial occupancy (we already know that the room contains
a single table).
5. The extraction of topological information can be done based on a polygonal representation (regions) by computing various spatial relationships (connectedness, overlap, etc.). This process can be very expensive. Extracting topological information directly from the allocentric map could exploit the contextual spatial information that is inherent in the allocentric map but must be inferred when using a polygonal map.
6. Projecting topological information into allocentric maps is the reverse idea
to the previous issue. It is useful for exploiting knowledge about the environment that is available a priori or acquired through communication with
humans and can significantly ease exploration.
7. Probably the most interesting long-term aspect is modelling dynamic world
aspects. The basic idea is to have a symbolic model of the spatial dynamics
of certain objects like humans or robots. These models are applied to update
the position and orientation of the regions associated with these objects. The
regions can be projected onto the allocentric map, e.g. for path planning
purposes.
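As an example of the first, symbolic fusion option, a per-cell priority rule for combining unimodal egocentric maps might look as follows (the sensor ordering, the UNKNOWN marker, and the rule itself are illustrative assumptions, not the fusion scheme actually used in Dynamo):

```python
# Illustrative symbolic fusion of unimodal egocentric grids by a
# priority rule: per cell, take the estimate of the highest-priority
# modality that has an opinion (i.e. is not UNKNOWN).
UNKNOWN = -1.0

def fuse_by_priority(maps, priority=("laser", "sonar", "ir")):
    rows = len(maps[priority[0]])
    cols = len(maps[priority[0]][0])
    fused = [[UNKNOWN] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for modality in priority:
                v = maps[modality][r][c]
                if v != UNKNOWN:
                    fused[r][c] = v
                    break
    return fused

maps = {
    "laser": [[0.9, UNKNOWN], [UNKNOWN, UNKNOWN]],
    "sonar": [[0.2, 0.7], [UNKNOWN, 0.1]],
    "ir":    [[0.5, 0.5], [0.5, 0.5]],
}
fused = fuse_by_priority(maps)
print(fused)  # [[0.9, 0.7], [0.5, 0.1]]
```

A learned subsymbolic fusion function would replace the fixed priority ordering with a context-dependent weighting.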
4 From Grid Maps to Regions: Map Segmentation
The annotation of subsymbolic spatial representations with symbolic information is considered a key element in Dynamo. Annotating grid cells directly is computationally expensive, given typical grid map resolutions of 5–15 cm, and poses a severe problem of maintaining map consistency. A simple table would require the annotation of hundreds of grid cells (e.g. a 2 m × 1 m table covers 200 cells at 10 cm resolution). The segmentation of grid maps into a set of regions, which then serve as basic elements for annotation, can reduce the required computational effort for annotation enormously.
We applied several methods to solve this problem. Two of them, Colored
Kohonen Map [17] and Growing Neural Gas [4], are based on self-organizing
maps. The latter method only segments free space. Common to both methods
is that the grid map is sampled during the adaptation process to retrieve single
data points. Each data point is handled in an isolated manner and neighboring
grid cells are not evaluated. Thus, these methods are based on the exploitation
of isolated data points.
The third method was first applied to the problem by Thrun and Bücken [16]. It is based on Voronoi diagrams and computes a segmentation of free space. The fourth method, developed in our group, applies ideas from computer vision to detect wall boundaries, which are then used to generate rectangles representing either free space or occupied space. Both methods take into account contextual spatial information.
4.1 Methods Based on Isolated Data Point Exploitation
Self-organizing maps lend themselves naturally to neurosymbolic integration: they define a mapping from a high-dimensional input space (usually subsymbolic data with unknown structure) to a low-dimensional topological output space. In our application to map segmentation, the topology of the output space defines a Voronoi tessellation of the input space, thereby providing a segmentation of the grid map.
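The resulting segmentation can be read off by assigning every grid cell to its nearest unit, which realizes the Voronoi tessellation (a toy sketch with hand-picked unit positions, not trained units):

```python
# Assigning each grid cell centre to its nearest self-organizing-map
# unit induces a Voronoi tessellation of the map, i.e. a segmentation.
def nearest_unit(cell, units):
    return min(range(len(units)),
               key=lambda k: (units[k][0] - cell[0]) ** 2 +
                             (units[k][1] - cell[1]) ** 2)

units = [(1.0, 1.0), (4.0, 4.0)]          # unit positions after training
cells = [(0, 0), (1, 2), (5, 5), (3, 4)]  # grid cell centres
segmentation = {cell: nearest_unit(cell, units) for cell in cells}
print(segmentation)  # {(0, 0): 0, (1, 2): 0, (5, 5): 1, (3, 4): 1}
```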
Colored Kohonen Map Vleugels et al. [17] developed a colored variant of
Fritzke’s Growing Cell Structures, which themselves are a variant of Kohonen
maps with adaptive topology, for mapping the configuration space of a robot.
The Algorithm: Two different types of neurons are used: one is specially designed to settle on obstacle boundaries, the other is tuned to approximate the Voronoi diagram of free space. This behavior is achieved by applying a repulsive force to the neurons modelling free space from inputs inside obstacles, and an attractive force to the neurons modelling obstacles from inputs inside free space. Neurons are not allowed to change color, i.e. the control algorithm prevents obstacle neurons from being drawn into free space and vice versa. The size of the Kohonen map adapts automatically, as new neurons are inserted when needed. The amount of data is dramatically reduced, especially in high-dimensional configuration spaces.
After the learning process, neurons in free space and their topological connections are interpreted as road maps, which are used for solving path planning problems.
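One possible reading of the colored update rule is sketched below (the force model, the learning rate, and the treatment of same-color inputs are assumptions for illustration; see [17] for the actual scheme):

```python
def update(pos, color, inp, inp_color, eta=0.1):
    """Move one neuron for one input.  In this interpretation, a
    free-space neuron is repelled by inputs inside obstacles; all other
    combinations attract (same-color inputs as in a standard Kohonen
    update, obstacle neurons pulled towards the boundary by free-space
    inputs)."""
    repel = (color == "free" and inp_color == "obstacle")
    sign = -1.0 if repel else 1.0
    return (pos[0] + sign * eta * (inp[0] - pos[0]),
            pos[1] + sign * eta * (inp[1] - pos[1]))

p = update((0.0, 0.0), "free", (1.0, 0.0), "free")      # attracted
q = update((0.0, 0.0), "free", (1.0, 0.0), "obstacle")  # repelled
print(p, q)
```

The color check is what keeps obstacle neurons on obstacle boundaries and free-space neurons inside free space.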
Fig. 2. Update of colored neurons (attractive force from input I2, repulsive force from input I1).
Application To Our Problem: Figure 3 shows an example of a grid map that represents the configuration space of an omnidirectional point robot in an artificial environment, together with the corresponding colored Kohonen map. The input space is only two-dimensional; data points were acquired by random sampling of the grid map.
Fig. 3. Example of a colored Kohonen map.
Results: A problem we often encountered with this method was that the map tends to move out of narrow free-space pockets, leaving free space which is not appropriately represented by neurons (see top left of Figure 3). Due to the dynamic insertion of neurons, the robot is able to easily increase the size of the grid map and to adapt the size of the Kohonen map accordingly. Also, the incremental creation and adaptation process enables permanent adaptation to slowly changing environments.
Growing Neural Gas The colored Kohonen map does not try to fully represent free space; it just builds road maps. For autonomous robots navigating in two-dimensional space (ignoring the robot's nonholonomic restrictions), we are interested in a more complete representation of free space. We tried to achieve this by using the Growing Neural Gas method by Fritzke [4].
The Algorithm: The algorithm described in [4] uses a set of units A, each of which is associated with a reference vector wc ∈ Rn (also referred to as the position of the unit). Units keep track of a local error variable, which is incremented by the actual error whenever they are the best matching unit. Pairs of units c1, c2 ∈ A can be connected by edges ec1c2. Edges are equipped with a counter age.
The algorithm usually starts with two units connected to each other. Every unit is assigned a random position. Then, for each input, the nearest unit s1 and the second nearest unit s2 are computed, using the Euclidean distance measure. The age of every edge emanating from s1 is incremented, and the distance between the input and the best matching unit is added to the local error variable of s1. The unit s1 is moved towards the input, as are its neighbors, i.e. the units directly connected to s1 by an edge. If there is already an edge between s1 and s2, the age of this edge is set to zero; otherwise the edge is inserted. Then all edges are checked for whether they have reached a maximum age; if so, they are removed. Every λ steps, a new unit is inserted halfway between the unit q with the maximum accumulated error and its neighbor n with the largest error variable. The error of these two units is decreased, and the new unit is assigned the new error value of q. In every step, all error variables are decreased by a factor d. This is repeated until a specific stopping criterion is reached.
Application To Our Problem: We used this method to create a map of free space
by repeatedly presenting free space position vectors sampled from the grid.
Results: After a short time, units were equally distributed over the free space, and as soon as their density allowed it, connections crossing occupied space were removed. The neural gas automatically develops a topology-preserving structure, so that the resulting graph can be used for path planning purposes. This structure is not perfect, but improves with the time spent on presenting examples from the input space. The number of neurons needed to represent a map depends on the required final density of neurons and on the structural complexity inherent in the particular grid map. The algorithm allows for dynamic extension of the map as well as for adaptation to dynamic changes in the environment. This method needs many neurons, especially for modelling at a very detailed level (Figure 4). The result given by this method is usually not free of topological errors, e.g. connections crossing a wall. Longer training can reduce this, but not avoid it totally, as we do not want to make the structure of regions as detailed as the original grid map itself.
Fig. 4. Neural gas approximating free space
In the bottom left part of Figure 4, it can be seen that artefacts in the grid map tend to create artefact regions as well, if they are big enough. In order to avoid this, one should either choose an appropriate subpart of the map or cross-check topological consistency.
4.2 Methods Based on Contextual Data Point Exploitation
The methods described so far do not use any information about the neighborhood of a grid cell; every input to the self-organizing map is interpreted as an isolated point in the configuration space. The following methods also try to exploit contextual information.
Voronoi Skeletization and Segmentation
The Algorithm: Thrun and Bücken [16] try to extract topological information with image analysis methods. They skeletize free space by approximating a Voronoi diagram, which in turn is used to find locations with minimal clearance to obstacles. The underlying hypothesis is that such locations often mark doors or narrow pathways. Points on the Voronoi diagram with minimal clearance are used to define lines separating regions of free space. These free-space regions are then topologically represented by nodes, while region adjacency yields the edges. The resulting graph can be pruned using a few simple pruning rules to further reduce the number of nodes.
Results: We implemented the method as described in [16]. Our results show that the method is quite sensitive to noise and map distortions, which result in very complex Voronoi diagrams and, consequently, in large numbers of small regions. Figure 5 shows an example of a segmentation obtained by our implementation. We experimented with various methods of filtering the grid map before applying the Voronoi algorithm, which led to much better results. It should be noted that the regions generated by this approach do not match well with the regions intuitively associated with knowledge-level concepts such as rooms, etc. How a connection between regions and such concepts can be achieved is an unsolved problem for this approach.
Fig. 5. Example of Voronoi skeletization and segmentation
Also, as this method is not incremental in nature, it is not as easily extended to dynamic environments as the self-organizing maps. Whenever significant changes to the grid map occur, the whole segmentation process must be repeated, possibly leading to completely different topologies.
Wall Histograms The SMART project uses an autonomous system in an office environment. Most office environments are built with mainly orthogonal walls. The wall histogram method makes this assumption and considers only rectangular regions. The regions have an additional orientation parameter, though, which usually reflects the overall rotation of the robot’s coordinate system against the direction of the main wall axis in the environment.
Given a grid-based occupancy map G taken from G^(r×c), the set of all grid maps consisting of r rows and c columns, we want to apply a segmentation function S : G^(r×c) → 2^R to obtain a set of regions R consisting of a number of disjoint regions Ri, i = 1 . . . n. As mentioned, we chose to restrict ourselves to sets of rectangular regions, each of which should ideally be either fully occupied or fully free. Note that an equidistant grid-based occupancy map is itself a rectangular segmentation of the environment, and thus r × c is an upper bound for n. Usually, we are interested in segmentations yielding small numbers of regions, i.e. n ≪ r × c. However, some properties of G make it more difficult to achieve grid map segmentation into small region sets: i) The orientation of the environment with respect to the axes of the map is unknown. Thus, rectangles collinear with the map axes do not lead to small region sets. ii) The map may contain noise due to wrong sensor measurements. Thus, boundaries of regions representing walls, objects, etc. may appear to be curved, and it is hard to select line segments for separation. Regions that are neither fully free nor fully occupied will result. iii) The environment is not necessarily fully explored. Unexplored or sporadically explored regions will add additional noise. Under these considerations, we restrict ourselves to partitions defined by an angle and two orthogonal sets of segmenting lines (cuts) which define a set of disjoint rectangles.
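For an angle of zero, such a restricted partition is easy to write down: the two orthogonal sets of cuts induce a set of disjoint rectangles covering the map (an illustrative helper, not the authors' code):

```python
def rectangles(rows, cols, row_cuts, col_cuts):
    """Partition an rows x cols grid into disjoint rectangles.  Cut
    positions are indices between cells; each rectangle is returned as
    half-open bounds (r0, r1, c0, c1)."""
    rb = [0] + sorted(row_cuts) + [rows]
    cb = [0] + sorted(col_cuts) + [cols]
    return [(rb[i], rb[i + 1], cb[j], cb[j + 1])
            for i in range(len(rb) - 1)
            for j in range(len(cb) - 1)]

regs = rectangles(6, 8, row_cuts=[3], col_cuts=[2, 5])
print(len(regs))  # 6 regions: 2 row bands x 3 column bands
```

For a non-zero angle α, the same construction applies in a coordinate frame rotated by α.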
Even with all restrictions imposed so far, most grid-based occupancy maps representing non-trivial or real-world environments still allow a very large number of rectangular segmentations R. We need at least an informal criterion to measure the quality of a segmentation. Given our overall goals, the set of regions R should provide an intuitive segmentation, which can be more easily matched to symbolic concepts like rooms, tables, walls, etc.
The Algorithm: We developed an algorithm that tries to deal with the problems described above. The algorithm, called the wall histogram method, is based on methods developed in computer vision. It consists of the following steps:
– Identification of the main orientation α ∈ [0, π/2) of the environment within the coordinate system of the grid map G.
– Application of morphological operations on the grid map G to extract the obstacle boundaries.
– Identification of long edges by building histograms over the extracted boundaries.
Then the segmentation R is constructed by cutting the grid map G at the positions found in the previous step, at the angle α. The details of the orientation identification are as follows: i) blurring using a Gaussian filter (Figure 4.2a); ii) application of a Sobel operator in x and y direction (Figures 4.2b and 4.2c); iii) recombination of the gradient direction and length information from the two Sobel-filtered images (Figures 4.2d and 4.2e); iv) depending on the length of the gradient vectors, the directions are accumulated in a histogram, which is used to determine the most frequent direction of edges in the image part (Figure 4.2f).
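The orientation identification can be sketched as follows (pure-Python Sobel filtering on a toy map; the blurring step is omitted and the bin size is an illustrative choice). Since walls are assumed orthogonal, directions are accumulated modulo 90 degrees:

```python
import math

# 3x3 Sobel kernels for horizontal and vertical gradients
SX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def dominant_direction(img, bins=18):
    """Accumulate gradient magnitudes in an angle histogram over
    [0, 90) degrees and return the most frequent edge direction."""
    hist = [0.0] * bins
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            gx = sum(SX[i][j] * img[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            gy = sum(SY[i][j] * img[r - 1 + i][c - 1 + j]
                     for i in range(3) for j in range(3))
            mag = math.hypot(gx, gy)
            if mag > 0:
                ang = math.degrees(math.atan2(gy, gx)) % 90.0
                hist[int(ang / (90.0 / bins))] += mag
    return hist.index(max(hist)) * (90.0 / bins)

# toy map: a vertical wall in column 4 -> main orientation 0 degrees
img = [[1 if c == 4 else 0 for c in range(9)] for r in range(9)]
print(dominant_direction(img))  # 0.0
```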
Fig. 4.2. a) Blurred grid map. b) Horizontal Sobel directions. c) Vertical Sobel directions. d) Recombined directions. e) Confidence values. f) Angle histogram.

Wall hypotheses are created by first smoothing the image using a morphological operator which is “wall-shaped”. Afterwards, the edges are extracted by growing the image with another morphological operator and subtracting the original from the result. Then simple projections of the edge image along the dominant direction, as extracted above, and orthogonal to it are accumulated in histograms, which have peaks in places that are good hypotheses for wall or obstacle boundaries (Figure 6). These peaks are extracted, and their positions mark the boundaries where the grid map should be split into adjacent regions.
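The projection step can be sketched as follows (the binary edge image, the threshold, and the peak rule are illustrative assumptions):

```python
def wall_positions(edge_img, threshold=3):
    """Project a binary edge image along both axes; rows/columns whose
    accumulated edge count exceeds the threshold are wall hypotheses,
    i.e. candidate cut positions for the segmentation."""
    rows = [sum(row) for row in edge_img]
    cols = [sum(edge_img[r][c] for r in range(len(edge_img)))
            for c in range(len(edge_img[0]))]
    return ([r for r, v in enumerate(rows) if v >= threshold],
            [c for c, v in enumerate(cols) if v >= threshold])

# toy edge image: a horizontal wall in row 2, a vertical wall in col 5
edge = [[0] * 8 for _ in range(6)]
for c in range(8):
    edge[2][c] = 1
for r in range(6):
    edge[r][5] = 1
row_cuts, col_cuts = wall_positions(edge)
print(row_cuts, col_cuts)  # [2] [5]
```

In the real method, the projections run along the dominant direction α rather than the raw image axes.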
Preliminary Results: The method was tested on maps created by simulations and real robots using a preliminary version of the lower-level Dynamo mapping architecture. To test the robustness of the direction extraction algorithm, the images were rotated artificially. This is important, as the robot is not guaranteed to start map building with its coordinate system aligned to walls. Also, the method is robust against rectangular rooms that are placed in a non-rectangular manner.
Fig. 6. Histogram generation.
As shown in Figure 7, the extraction of wall hypotheses already works quite well. The remaining ambiguous hypotheses are reduced by restricting the calculation to parts of the map of reasonable size, which reduces the hypotheses to the few relevant ones in the area in question.
Fig. 7. Some results on grid maps with different orientations.
5 Conclusions
We presented a framework for hybrid spatial representations in mobile robotics, which provides rich opportunities to study neurosymbolic integration. For one of these problems, the segmentation of occupancy grid maps into sets of regions, several approaches have been described. Currently, no single approach produces satisfactory results for all spatial aspects, even for static environments: the approaches based on self-organizing maps are well suited for generating topological representations of free space, but have severe problems with occupied space. Both methods, as well as Voronoi skeletization, result in unintuitive tessellations of free space. Our own method gives better results in this regard, but provides no suitable topologies for navigating in free space. However, we believe that suitable variations and combinations of these methods will produce much better results and are therefore working on such implementations.
References
[1] Michael Beetz and Drew McDermott. Declarative goals in reactive plans. In James Hendler, editor, Proceedings of AIPS-92: Artificial Intelligence Planning Systems, pages 3–12, San Mateo, CA, 1992. Morgan Kaufmann.
[2] Alberto Elfes. Dynamic control of robot perception using stochastic spatial models. In Proceedings of the International Workshop on Information Processing in
Mobile Robots, March 1991.
[3] Matthias O. Franz, Bernhard Schölkopf, Philipp Georg, Hanspeter A. Mallot, and Heinrich H. Bülthoff. Learning view graphs for robot navigation. In W. Lewis Johnson and Barbara Hayes-Roth, editors, Proceedings of the 1st International Conference on Autonomous Agents, pages 138–147, New York, February 5–8, 1997. ACM Press.
[4] Bernd Fritzke. A growing neural gas network learns topologies. Advances in
Neural Information Processing Systems, 7, 1995.
[5] S. Goonatilake and S. Khebbal, editors. Intelligent Hybrid Systems. John Wiley
& Sons, Ltd, 1995.
[6] Melanie Hilario. An overview of strategies for neurosymbolic integration. In Ron Sun and Frederic Alexandre, editors, IJCAI-95 Workshop on Connectionist-Symbolic Integration: From Unified to Hybrid Approaches, pages 1–6, Montreal, August 1995.
[7] Vasant Honavar and Leonard Uhr, editors. Artificial Intelligence and Neural Networks: Steps toward Principled Integration. Academic Press, 1994.
[8] Abraham Kandel and Gideon Langholz, editors. Hybrid Architectures for Intelligent Systems. CRC Press, 1992.
[9] J. Leonard and H. F. Durrant-Whyte. Mobile robot localization by tracking
geometric beacons. IEEE Transactions on Robotics and Automation, 7(3):376–
382, 1991.
[10] Larry R. Medsker, editor. Hybrid Intelligent Systems. Kluwer Academic Publishers, Norwell, 1995.
[11] David Pierce and Benjamin Kuipers. Map learning with uninterpreted sensors
and effectors. Technical Report AI96-246, University of Texas, Austin, 1997.
[12] D.A. Randell, Z. Cui, and A.G. Cohn. A spatial logic based on regions and
connection. In Proceedings of the Third International Conference on Knowledge
Representation and Reasoning, pages 165–176, Cambridge, MA, USA, October
1992.
[13] Michael Recce and K. D. Harris. Memory for places: A navigational model in
support of Marr’s theory of hippocampal function. Hippocampus, 6:735–748, 1996.
[14] Ron Sun and L. Bookman, editors. Computational Architectures Integrating Neural and Symbolic Processes. Kluwer Academic Publishers, Boston, 1995.
[15] Jun Tani and Naohiro Fukumura. Learning goal-directed sensory-based navigation of a mobile robot. Neural Networks, 7(3):553–563, 1994.
[16] Sebastian Thrun and Arno Bücken. Learning maps for indoor mobile robot navigation. Technical Report CMU-CS-96-121, Carnegie Mellon University, April
1996.
[17] Jules M. Vleugels, Joost Kok, and Mark H. Overmars. Motion planning using
a colored Kohonen network. Technical Report RUU-CS-93-39, Department of
Computer Science, Utrecht University, 1993.
[18] Stefan Wermter, editor. Connectionist, Statistical and Symbolic Approaches to
Learning for Natural Language Processing: IJCAI-95 Workshop, volume 1040 of
Lecture Notes in Computer Science. Springer Verlag, New York, 1996.