
Hybrid Neural Systems

2000

Lecture Notes in Artificial Intelligence
Subseries of Lecture Notes in Computer Science
Edited by J. G. Carbonell and J. Siekmann

Lecture Notes in Computer Science
Edited by G. Goos, J. Hartmanis, and J. van Leeuwen

1778

Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo

Stefan Wermter, Ron Sun (Eds.)

Hybrid Neural Systems

Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany

Volume Editors
Stefan Wermter
University of Sunderland, Centre of Informatics, SCET
St Peter's Way, Sunderland, SR6 0DD, UK
E-mail: stefan.wermter@sunderland.ac.uk

Ron Sun
University of Missouri-Columbia, CECS Department
201 Engineering Building West, Columbia, MO 65211-2060, USA
E-mail: rsun@cecs.missouri.edu

Cataloging-in-Publication Data applied for
Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Hybrid neural systems / Stefan Wermter ; Ron Sun (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2000
(Lecture notes in computer science ; Vol. 1778 : Lecture notes in artificial intelligence)
ISBN 3-540-67305-9

CR Subject Classification (1991): I.2.6, F.1, C.1.3, I.2
ISBN 3-540-67305-9 Springer-Verlag Berlin Heidelberg New York

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer is a company in the BertelsmannSpringer publishing group.

(c) Springer-Verlag Berlin Heidelberg 2000
Printed in Germany

Typesetting: Camera-ready by author, data conversion by PTP Berlin, Stefan Sossna
Printed on acid-free paper
SPIN: 10719871 06/3142 5 4 3 2 1 0

Preface

The aim of this book is to present a broad spectrum of current research in hybrid neural systems and to advance the state of the art in neural networks and artificial intelligence. Hybrid neural systems are computational systems which are based mainly on artificial neural networks but which also allow a symbolic interpretation or interaction with symbolic components. This book focuses on the following issues related to different types of representation: How does neural representation contribute to the success of hybrid systems? How does symbolic representation supplement neural representation? How can these types of representation be combined? How can we utilize their interaction and synergy? How can we develop neural and hybrid systems for new domains? What are the strengths and weaknesses of hybrid neural techniques? Are current principles and methodologies in hybrid neural systems useful? How can they be extended? What will be the impact of hybrid and neural techniques in the future?

In order to bring together new and different approaches, we organized an international workshop. This workshop on hybrid neural systems, organized by Stefan Wermter and Ron Sun, was held during December 4–5, 1998 in Denver. In this well-attended workshop, 27 papers were presented.
Overall, the workshop was wide-ranging in scope, covering the essential aspects and strands of hybrid neural systems research, and it successfully addressed many important issues in the field. The best and most appropriate contributions were selected and revised twice. This book contains these revised papers, some of which are presented as state-of-the-art surveys, to cover the various research areas of the collection. This selection of contributions is a representative snapshot of the state of the art in current approaches to hybrid neural systems. This is an extremely active area of research that is growing in interest and popularity. We hope that this collection will be stimulating and useful for all those interested in the area of hybrid neural systems.

We would like to thank Garen Arevian, Mark Elshaw, Steve Womble, and in particular Christo Panchev, from the Hybrid Intelligent Systems Group of the University of Sunderland, for their important help and assistance during the preparations of the book. We would like to thank Alfred Hofmann from Springer for his cooperation. Finally, and most importantly, we thank the contributors to this book.

January 2000
Stefan Wermter
Ron Sun

Table of Contents

An Overview of Hybrid Neural Systems (Stefan Wermter and Ron Sun) ... 1

Structured Connectionism and Rule Representation

Layered Hybrid Connectionist Models for Cognitive Science (Jerome Feldman and David Bailey) ... 14
Types and Quantifiers in SHRUTI: A Connectionist Model of Rapid Reasoning and Relational Processing (Lokendra Shastri) ... 28
A Recursive Neural Network for Reflexive Reasoning (Steffen Hölldobler, Yvonne Kalinke, and Jörg Wunderlich) ... 46
A Novel Modular Neural Architecture for Rule-Based and Similarity-Based Reasoning (Rafal Bogacz and Christophe Giraud-Carrier) ... 63
Addressing Knowledge-Representation Issues in Connectionist Symbolic Rule Encoding for General Inference (Nam Seog Park) ... 78
Towards a Hybrid Model of First-Order Theory Refinement (Nelson A. Hallack, Gerson Zaverucha, and Valmir C. Barbosa) ... 92

Distributed Neural Architectures and Language Processing

Dynamical Recurrent Networks for Sequential Data Processing (Stefan C. Kremer and John F. Kolen) ... 107
Fuzzy Knowledge and Recurrent Neural Networks: A Dynamical Systems Perspective (Christian W. Omlin, Lee Giles, and Karvel K. Thornber) ... 123
Combining Maps and Distributed Representations for Shift-Reduce Parsing (Marshall R. Mayberry and Risto Miikkulainen) ... 144
Towards Hybrid Neural Learning Internet Agents (Stefan Wermter, Garen Arevian, and Christo Panchev) ... 158
A Connectionist Simulation of the Empirical Acquisition of Grammatical Relations (William C. Morris, Garrison W. Cottrell, and Jeffrey Elman) ... 175
Large Patterns Make Great Symbols: An Example of Learning from Example (Pentti Kanerva) ... 194
Context Vectors: A Step Toward a "Grand Unified Representation" (Stephen I. Gallant) ... 204
Integration of Graphical Rules with Adaptive Learning of Structured Information (Paolo Frasconi, Marco Gori, and Alessandro Sperduti) ... 211

Transformation and Explanation

Lessons from Past, Current Issues, and Future Research Directions in Extracting the Knowledge Embedded in Artificial Neural Networks (Alan B. Tickle, Frederic Maire, Guido Bologna, Robert Andrews, and Joachim Diederich) ... 226
Symbolic Rule Extraction from the DIMLP Neural Network (Guido Bologna) ... 240
Understanding State Space Organization in Recurrent Neural Networks with Iterative Function Systems Dynamics (Peter Tiňo, Georg Dorffner, and Christian Schittenkopf) ... 255
Direct Explanations and Knowledge Extraction from a Multilayer Perceptron Network that Performs Low Back Pain Classification (Marilyn L. Vaughn, Steven J. Cavill, Stewart J. Taylor, Michael A. Foy, and Anthony J.B. Fogg) ... 270
High Order Eigentensors as Symbolic Rules in Competitive Learning (Hod Lipson and Hava T. Siegelmann) ... 286
Holistic Symbol Processing and the Sequential RAAM: An Evaluation (James A. Hammerton and Barry L. Kalman) ... 298

Robotics, Vision and Cognitive Approaches

Life, Mind, and Robots: The Ins and Outs of Embodied Cognition (Noel Sharkey and Tom Ziemke) ... 313
Supplementing Neural Reinforcement Learning with Symbolic Methods (Ron Sun) ... 333
Self-Organizing Maps in Symbol Processing (Timo Honkela) ... 348
Evolution of Symbolization: Signposts to a Bridge Between Connectionist and Symbolic Systems (Ronan G. Reilly) ... 363
A Cellular Neural Associative Array for Symbolic Vision (Christos Orovas and James Austin) ... 372
Application of Neurosymbolic Integration for Environment Modelling in Mobile Robots (Gerhard Kraetzschmar, Stefan Sablatnög, Stefan Enderle, and Günther Palm) ... 387

Author Index ... 403

An Overview of Hybrid Neural Systems

Stefan Wermter (1) and Ron Sun (2)
(1) University of Sunderland, Centre for Informatics, SCET, St. Peter's Way, Sunderland, SR6 0DD, UK
(2) University of Missouri, CECS Department, Columbia, MO 65211-2060, USA

Abstract. This chapter provides an introduction to the field of hybrid neural systems. Hybrid neural systems are computational systems which are based mainly on artificial neural networks but also allow a symbolic interpretation or interaction with symbolic components. In this overview, we will describe recent results of hybrid neural systems. We will give a brief overview of the main methods used, outline the work that is presented here, and provide additional references. We will also highlight some important general issues and trends.
1 Introduction

In recent years, the research area of hybrid and neural processing has seen a remarkably active development [62,50,21,4,48,87,75,76,25,49,94,13,74,91]. Furthermore, there has been an enormous increase in the successful use of hybrid intelligent systems in many diverse areas such as speech/natural language understanding, robotics, medical diagnosis, fault diagnosis of industrial equipment, and financial applications.

Looking at this research area, the motivation for examining hybrid neural models is based on different viewpoints. First, from the point of view of cognitive science and neuroscience, a purely neural representation may be most attractive, but a symbolic interpretation of a neural architecture is also desirable, since the brain not only has a neuronal structure but also has the capability to perform symbolic reasoning. This leads to the question of how different processing mechanisms can bridge the large gap between, for instance, acoustic or visual input signals and symbolic reasoning. The brain uses specialization of different structures. Although a lot of the functionality of the brain is not yet known in detail, its architecture is highly specialized and organized at various levels of neurons, networks, nodes, cortex areas, and their respective connections [10]. Furthermore, different cognitive processes are not homogeneous, and it is to be expected that they are based on different representations [73]. Therefore, there is evidence from cognitive science and neuroscience that multiple architectural representations are involved in human processing.

Second, from the point of view of knowledge-based systems, hybrid symbolic/neural representations have some advantages, since different, mutually complementary properties can be combined. Symbolic representations have advantages of easy interpretation, explicit control, fast initial coding, dynamic variable binding, and knowledge abstraction. On the other hand, neural representations show advantages for gradual analog plausibility, learning, robust fault-tolerant processing, and generalization. Since these advantages are mutually complementary, a hybrid symbolic/neural architecture can be useful if different processing strategies have to be supported. While from a neuroscience or cognitive science point of view it is most desirable to explore exclusively neural network representations, for knowledge engineering in complex real-world systems, hybrid symbolic/neural systems may be very useful.

2 Various Forms of Hybrid Neural Architectures

Various classification schemes of hybrid systems have been proposed [77,76,89,47]. Other characterizations of architectures covered specific neural architectures, for instance recurrent networks [38,52], or they covered expert systems/knowledge-based systems [49,29,75]. Essentially, a continuum of hybrid neural architectures emerges which contains neural and symbolic knowledge to various degrees. However, as a first introduction to the field, we present a simplified taxonomy here: unified neural architectures, transformation architectures, and hybrid modular architectures.

2.1 Unified Neural Architectures

Unified neural architectures are a type of hybrid neural system. They have also been referred to as unified hybrid systems [47]. They rely solely on connectionist representations, but symbolic interpretations of nodes or links are possible.
Often, specific knowledge of the task is built into a unified neural architecture. Much early research on unified neural architectures can be traced back to work by Feldman and Ballard, who provided a general framework of structured connectionism [16]. This framework was extended in many different directions including, for instance, parsing [14], explanation [12], and logic reasoning [30,40,70,71,72]. Recent work along these lines also focuses on the so-called NTL, the Neural Theory of Language, which attempts to bridge the large gap between neurons and cognitive behavior [17,65].

A question that naturally arises is: why should we use neural models for symbol processing instead of symbolic models? Possible reasons include the following: neural models are a more apt framework for capturing a variety of cognitive processes, as is argued in [15,66,86,72]; some inherent processing characteristics of neural models, such as similarity-based processing [72,6], make them more suitable for certain tasks such as cognitive modeling; and learning processes, such as gradient descent [63] and its various approximations, Expectation-Maximization, and even Inductive Logic Programming methods [26], may be more easily developed in neural models.

There can be two types of representations [77]:
- Localist connectionist architectures contain one distinct node for representing each concept [42,71,67,3,58,31,66].
- Distributed neural architectures comprise a set of non-exclusive, overlapping nodes for representing each concept [60,50,27].

The work of researchers such as Feldman [16,17], Ajjanagadde and Shastri [67], Sun [72], and Smolensky [69] has demonstrated why localist connectionist networks are suitable for implementing symbolic processes usually associated with higher cognitive functions. On the other hand, "radical connectionism" [13] is a distributed neural approach to modeling intelligence. Usually, it is easier to incorporate prior knowledge into localist models, since their structures can be made to correspond directly to the structure of symbolic knowledge [19]. On the other hand, neural learning usually leads to distributed representations. Furthermore, there has been work on integrating localist and distributed representations [28,72,87].

2.2 Transformation Architectures

Hybrid transformation architectures transform symbolic representations into neural representations or vice versa. The main processing is performed by neural representations, but there are automatic procedures for transferring neural representations to symbolic representations or vice versa. Using a transformation architecture it is possible to insert or extract symbolic knowledge into or from a neural architecture. Hybrid transformation architectures differ from unified neural architectures by this automatic transfer. While certain units in unified neural architectures may be interpreted symbolically by an observer, hybrid transformation architectures actually allow the transfer of knowledge into symbolic rules, symbolic automata, grammars, etc. Examples of such transformation architectures include the work on activation-based automata extraction from recurrent networks [54,90]. Alternatively, a weight-based transformation between symbolic rules and feedforward networks has been extensively examined in knowledge-based artificial neural networks [68,20]. The most common transformation architectures are rule extraction architectures, where symbolic rules are extracted from neural networks [19,1].
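Before the discussion of these architectures continues, a minimal hypothetical sketch may help fix the idea: a single threshold unit's weights and bias define a hyperplane, and for Boolean inputs its behavior can be rewritten as if-then rules. This is only an illustration of the principle, not any of the cited extraction algorithms, which operate on trained multi-unit networks.

```python
from itertools import product

def extract_rules(weights, bias, names):
    """Rewrite a threshold unit (fires iff w.x + bias >= 0, x in {0,1}^n)
    as a list of if-then rules, one per firing input combination."""
    rules = []
    for values in product([0, 1], repeat=len(names)):
        if sum(w * v for w, v in zip(weights, values)) + bias >= 0:
            conds = [n if v else "not " + n for n, v in zip(names, values)]
            rules.append("IF " + " AND ".join(conds) + " THEN fire")
    return rules

# A made-up unit that fires when at least two of three input features are on.
for rule in extract_rules([1.0, 1.0, 1.0], -2.0, ["a", "b", "c"]):
    print(rule)
```

Real extraction methods avoid this exponential enumeration by searching for compact rule sets, but the correspondence between a unit's hyperplane and a symbolic condition is the same.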
These architectures have received a lot of attention, since rule extraction discovers the hyperplane positions of units in neural networks and transforms them into if-then-else rules. Rule extraction has been performed mostly with multi-layer perceptron networks [79,5,8,11], Kohonen networks, radial basis functions [2,33], and recurrent networks [53,90]. Extraction of symbolic knowledge from neural networks also plays an important role in the current volume, e.g. [81,7,84]. Furthermore, insertion of symbolic knowledge can be either gradual through practice [23] or one-shot.

2.3 Hybrid Modular Architectures

Hybrid modular architectures contain both symbolic and neural modules appropriate to the task. Here, symbolic representations are not just initial or final representations as in a transformation architecture. Rather, they are combined and integrated with neural representations in many different ways. Examples in this class include CONSYDERR [72], SCREEN [95], and robot navigators where sensors and neural processing are fused with symbolic top-down expectations [37]. A variety of distinctions can be made. Neural and symbolic modules in hybrid modular architectures can be loosely coupled, tightly coupled, or completely integrated [48].

Loosely Coupled Architectures

A loosely coupled hybrid architecture has separate symbolic and neural modules. The control flow is sequential in the sense that processing has to be finished in one module before the next module can begin. Only one module is active at any time, and the communication between modules is unidirectional. There are several loosely coupled hybrid modular architectures for semantic analysis of database queries [9], dialog processing [34], or simulated navigation [78]. Another example of a loosely coupled architecture has been described in a model for structural parsing [87] combining a chart parser and feedforward networks. Other examples of loose coupling, which is sometimes also called passive coupling, include [45,36]. In general, this loose coupling enables various loose forms of cooperation among modules [73]. One form of coupling is in terms of pre/postprocessing vs. main processing: while one or more modules take care of pre/postprocessing, such as transforming input data or rectifying output data, a main module focuses on the main part of the processing task. Commonly, while pre/postprocessing is done using a neural network, the main task is accomplished through the use of symbolic methods. Another form of cooperation is through a master-slave relationship: while one module maintains control of the task at hand, it can signal other modules to handle some specific aspects of the task. Yet another form of cooperation is the equal partnership of multiple modules.

Tightly Coupled Architectures

A tightly coupled hybrid architecture contains separate symbolic and neural modules where control and communication are via common shared internal data structures in each module. The main difference between loosely and tightly coupled hybrid architectures is the common data structures, which allow bidirectional exchanges of knowledge between two or more modules. This makes communication faster and more active, but also more difficult to control. Therefore, tightly coupled hybrid architectures have also been referred to as actively coupled hybrid architectures [47].
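As a rough illustration of this coupling pattern, the following hypothetical sketch shows a neural module and a symbolic module alternating control over one shared data structure. All module internals, field names, and numbers are invented; only the control and communication pattern (bidirectional exchange before either module finishes its global processing) is meant to reflect the description above.

```python
def neural_module(bb):
    # Stand-in for a trained network: scores candidate interpretations,
    # taking any symbolic feedback already on the shared structure into account.
    bias = 0.2 if bb.get("feedback") == "prefer-motion" else 0.0
    bb["hypotheses"] = [("move-object", 0.6 + bias), ("point-at", 0.4)]

def symbolic_module(bb):
    # Stand-in for a rule base: inspects the scores, posts feedback for the
    # neural module, and commits to an action once it is confident enough.
    schema, conf = max(bb["hypotheses"], key=lambda h: h[1])
    if conf < 0.7:
        bb["feedback"] = "prefer-motion"          # hand control back with a hint
    else:
        bb["action"] = {"schema": schema, "confidence": conf}

blackboard = {"input": ["put", "the", "cube", "down"], "hypotheses": []}
while "action" not in blackboard:                 # control alternates until done
    neural_module(blackboard)
    symbolic_module(blackboard)
print(blackboard["action"])
```

In a loosely coupled system the two calls would instead run once, in a fixed order, with one-way data flow.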
As examples of tightly coupled architectures, systems for neural deterministic parsing [41] and inferencing [28] have been built where control changes between symbolic marker passing and neural similarity determination. Furthermore, a hybrid system developed by Tirri [83] consists of a rule base, a fact base, and a neural network of several trained radial basis function networks [57,59]. In general, a tightly coupled hybrid architecture allows multiple exchanges of knowledge between two or more modules. The result of a neural module can have a direct influence on a symbolic module, or vice versa, before it finishes its global processing. For instance, CDP is a system for deterministic parsing [41], and SCAN contains a tightly coupled component for structural processing and semantic classification [87]. While the neural network chooses which action to perform, the symbolic module carries out the action. During the process of parsing, control is switched back and forth between these modules. Other tightly coupled hybrid architectures for structural processing have been described in more detail in [89]. CLARION is also a system that couples symbolic and neural representations to explore their synergy.

Fully Integrated Architectures

In a fully integrated hybrid architecture there is no discernible external difference between symbolic and neural modules, since the modules have the same interface and they are embedded in the same architecture. The control flow may be parallel. Communication may be bidirectional between many modules, although not all possible communication channels have to be used. One example of an integrated hybrid architecture is SCREEN, which was developed for exploring integrated hybrid processing for spontaneous language analysis [95,92]. In fully integrated and interleaved systems, the constituent modules interact through multiple channels (e.g., various possible function calls), or may even have node-to-node connections across two modules, such as CONSYDERR [72], in which each node in one module is connected to a corresponding node in the other module. Another hybrid system, designed by Lees et al. [43], interleaves case-based reasoning modules with several neural network modules.

3 Directions for Hybrid Neural Systems

In Feldman and Bailey's paper, it was proposed that there are the following distinct levels [15]: the cognitive linguistic level, the computational level, the structured connectionist level, the computational biology level, and the biological level. A condition for this vertical hybridization is that it should be possible to bridge the different levels, and the higher levels should be reduced to, or grounded in, lower levels. A top-down research methodology is advocated and examined for concepts towards a neural theory of language. Although the particulars of this approach are not universally agreed upon, researchers generally accept the overall idea of multiple levels of neural cognitive modeling. In this view, models should be constructed entirely of neural components; both symbolic and subsymbolic processes should be implemented in neural networks. Another view, horizontal hybridization, argues that it may be beneficial, and sometimes crucial, to "mix" levels so that we can make better progress on understanding cognition.
This latter view is based on a realistic assessment of the state of the art of neural model development, and on the need to focus on the essential issues (such as the synergy between symbolic and subsymbolic processes [78]) rather than nonessential details of implementation. Horizontal approaches have been used successfully for real-world hybrid systems, for instance in speech/language analysis [95]. Purely neural systems in vertical hybridization are more attractive for neuroscience, but hybrid systems of horizontal hybridization are currently also a tractable way of building large-scale hybrid neural systems.

Representation, learning, and their interaction represent some of the major issues for developing symbol processing neural networks. Neural networks designed for symbolic processing often involve complex internal structures consisting of multiple components and several different representations [67,71,3]. Thus learning is made more difficult. There is a need to address the problems of what type of representation to adopt, how the representational structure in such systems is built up, how the learning processes involved affect the representation acquired, and how the representational constraints may facilitate or hamper learning. In terms of what is being learned in hybrid neural systems, we can have (1) learning contents for a fixed architecture, (2) learning architectures for given contents, or (3) learning both contents and architecture at the same time. Although most hybrid neural learning systems fall within the first two categories, e.g. [18,46], there are some hybrid models that belong to the third category, e.g. [50,92]. Furthermore, there is some current work on parallel neural and symbolic learning, which includes (1) two separate neural/symbolic algorithms applied simultaneously [78], (2) two separate algorithms applied in succession, (3) integrated neural/symbolic learning [80,35], and (4) purely neural learning of symbolic knowledge, e.g. [46,51].

The issues described above are important for making progress in theories and applications of hybrid systems. Currently, there is not yet a theory of "hybrid systems". There has been some preliminary early work towards a theoretical framework for neural/symbolic representations, but to date there is still a lack of an overall theoretical framework that abstracts away from the details of particular applications, tasks, and domains. One step towards such a direction may be the research into the relationship between automata theory and neural representations [39,24,88].

Processing natural language has been and will continue to be a very important test area for exploring hybrid neural architectures. It has been argued that "language is the quintessential feature of human intelligence" [85]. While certain learning mechanisms and architectures in humans may be innate, most researchers in neural networks argue for the importance of development and environment during language learning [87,94]. For instance, it was argued [51] that syntax is not innate, that it is a process rather than a representation, and that abstract categories, like subject, can be learned bottom-up. The dynamics of learning natural language is also important for designing parsers using techniques like the SRN and RAAM. SARDSRN and SARDRAAM were presented in the context of shift-reduce parsing [46] to avoid a problem associated with the SRN and RAAM, namely losing constituent information.
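For readers less familiar with the simple recurrent network (SRN) mentioned above, the following is a minimal sketch of one SRN step with made-up dimensions and random weights; it is not the SARDSRN or SARDRAAM implementation. The point is only that the hidden state at each step depends on the current input and on the previous hidden state, which is what lets such networks process word sequences for parsing.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 10, 20, 5
W_xh = rng.normal(scale=0.1, size=(n_hid, n_in))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hid, n_hid))  # context (previous hidden) -> hidden
W_hy = rng.normal(scale=0.1, size=(n_out, n_hid))  # hidden -> output

def srn_step(x, h_prev):
    """One time step of an Elman-style SRN."""
    h = np.tanh(W_xh @ x + W_hh @ h_prev)
    y = W_hy @ h                                    # e.g. scores over parser actions
    return y, h

h = np.zeros(n_hid)
for word_vec in rng.normal(size=(3, n_in)):         # a toy three-word sequence
    y, h = srn_step(word_vec, h)
print(y.shape)                                      # (5,)
```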
Interestingly, it has been argued that compositionality and systematicity in neural networks arise from an associationistic substrate [61] based on principles from evolution.

Also, research into improving WWW use by using neural networks may be promising [93]. While currently most search engines only employ fairly traditional search strategies, machine learning and neural networks could improve the processing of heterogeneous, unstructured multimedia data. Another important and promising research area is knowledge extraction from neural networks in order to support text mining and information retrieval [81]. Inductive learning techniques from neural networks and symbolic machine learning algorithms could be combined to analyze the underlying rules for such data. A crucial task for applying neural systems, especially learning distributed systems, is the design of appropriate vector representations for scaling up to real-world tasks. Large context vectors are also essential for learning document retrieval [22]. Due to the size of the data, only linear computations are useful for full-scale information retrieval. However, vector representations are still often restricted to co-occurrences, rather than focusing on syntax, discourse, logic, and so on [22]. Complex representations may, however, be formed and analyzed using fractal approaches [82].

Hard real-world applications are important. A system was built for foreign exchange rate prediction that uses a SOM for reduction and that generates a symbolic representation as input for a recurrent network which can produce rules [55]. Another self-organizing approach for symbol processing was described for classifying Usenet texts and presenting the classification as a hierarchical two-dimensional map [32]. Related neural classification work for text routing has been described in [93]. Neural network representations have also been used for important parts of vision and association [56].

Finally, there is promising progress in neuroscience. Computational neuroscience is still in its infancy, but it may be very relevant to the long-term progress of hybrid symbolic neural systems. Related to that, more complex high order neurons may be one possibility for building more powerful functionality [44]. Another way would be to focus more on global brain architectures, for instance for building biologically inspired robots with rooted cognition [64]. It was argued [85] that in 20 years computer power will be sufficient to match human capabilities, at least in principle. But meaning and deep understanding are still lacking. Other important issues are perception, situation assessment, and action [78], although perceptual pattern recognition is still in a very primitive state. Rich perception also requires links with rich sets of actions. Furthermore, it has been argued that language is the "quintessential feature" of human intelligence [85], since it is involved in many intelligent cognitive processes.

4 Concluding Remarks

In summary, further work towards a theory and fundamental principles of hybrid neural systems is needed. First of all, there is promising work towards relating automata theory with neural networks, or logics with such networks. Furthermore, the issue of representation needs more focus. In order to tackle larger real-world tasks using neural networks, for instance in information retrieval, learning
internet agents, or large-scale classification, further research on the underlying vector representations for neural networks is important. Vertical forms of neural/symbolic hybridization are widely used in cognitive processing, logic representation, and language processing. Horizontal forms of neural/symbolic hybridization exist for larger tasks, such as speech/language integration, knowledge engineering, intelligent agents, or condition monitoring. Furthermore, it will be interesting to see in the future to what extent computational neuroscience will offer further ideas and constraints for building more sophisticated forms of neural systems.

References

1. R. Andrews, J. Diederich, and A. B. Tickle. A survey and critique of techniques for extracting rules from trained artificial networks. Technical report, Queensland University of Technology, 1995.
2. R. Andrews and S. Geva. Rules and local function networks. In Proceedings of the Rule Extraction From Trained Artificial Neural Networks Workshop, Artificial Intelligence and Simulation of Behaviour, Brighton, UK, 1996.
3. J. Barnden. Complex symbol-processing in Conposit. In R. Sun and L. Bookman, editors, Architectures incorporating neural and symbolic processes. Kluwer, Boston, 1994.
4. J. A. Barnden and K. J. Holyoak, editors. Advances in connectionist and neural computation theory, volume 3. Ablex Publishing Corporation, 1994.
5. J. Benitz, J. Castro, and J. I. Requena. Are artificial neural networks black boxes? IEEE Transactions on Neural Networks, 8(5):1156–1164, 1997.
6. R. Bogacz and C. Giraud-Carrier. A novel modular neural architecture for rule-based and similarity-based reasoning. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
7. G. Bologna. Symbolic rule extraction from the DIMLP neural network. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
8. G. Bologna and C. Pellegrini. Accurate decomposition of standard MLP classification responses into symbolic rules. In International Work Conference on Artificial and Natural Neural Networks, IWANN'97, pages 616–627, Lanzarote, Canaries, 1997.
9. Y. Cheng, P. Fortier, and Y. Normandin. A system integrating connectionist and symbolic approaches for spoken language understanding. In Proceedings of the International Conference on Spoken Language Processing, pages 1511–1514, Yokohama, 1994.
10. P. S. Churchland and T. J. Sejnowski. The Computational Brain. MIT Press, Cambridge, MA, 1992.
11. T. Corbett-Clarke and L. Tarassenko. A principled framework and technique for rule extraction from multi-layer perceptrons. In Proceedings of the 5th International Conference on Artificial Neural Networks, pages 233–238, Cambridge, England, July 1997.
12. J. Diederich and D. L. Long. Efficient question answering in a hybrid system. In Proceedings of the International Joint Conference on Neural Networks, Singapore, 1992.
13. G. Dorffner. Neural Networks and a New AI. Chapman and Hall, London, UK, 1997.
14. M. A. Fanty. Learning in structured connectionist networks. Technical Report 252, University of Rochester, Rochester, NY, 1988.
15. J. Feldman and D. Bailey. Layered hybrid connectionist models for cognitive science. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
16. J. A. Feldman and D. H. Ballard. Connectionist models and their properties. Cognitive Science, 6:205–254, 1982.
17. J. A. Feldman, G. Lakoff, D. R. Bailey, S. Narayanan, T. Regier, and A. Stolcke.
L0 - the first five years of an automated language acquisition project. AI Review, 8, 1996.
18. P. Frasconi, M. Gori, and A. Sperduti. Integration of graphical rules with adaptive learning of structured information. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
19. L. M. Fu. Rule learning by searching on adapted nets. In Proceedings of the National Conference on Artificial Intelligence, pages 590–595, 1991.
20. L. M. Fu. Neural Networks in Computer Intelligence. McGraw-Hill, Inc., New York, NY, 1994.
21. S. I. Gallant. Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA, 1993.
22. S. I. Gallant. Context vectors: a step toward a grand unified representation. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
23. J. Gelfand, D. Handleman, and S. Lane. Integrating knowledge-based systems and neural networks for robotic skill. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 193–198, San Mateo, CA, 1989.
24. L. Giles and C. W. Omlin. Extraction, insertion and refinement of symbolic rules in dynamically driven recurrent neural networks. Connection Science, 5:307–337, 1993.
25. S. Goonatilake and S. Khebbal. Intelligent Hybrid Systems. Wiley, Chichester, 1995.
26. N. A. Hallack, G. Zaverucha, and V. C. Barbosa. Towards a hybrid model of first-order theory refinement. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
27. J. A. Hammerton and B. L. Kalman. Holistic symbol computation and the sequential RAAM: An evaluation. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
28. J. Hendler. Developing hybrid symbolic/connectionist models. In J. A. Barnden and J. B. Pollack, editors, Advances in Connectionist and Neural Computation Theory, Vol. 1: High Level Connectionist Models, pages 165–179. Ablex Publishing Corporation, Norwood, NJ, 1991.
29. M. Hilario. An overview of strategies for neurosymbolic integration. In Proceedings of the Workshop on Connectionist-Symbolic Integration: From Unified to Hybrid Approaches, pages 1–6, Montreal, 1995.
30. S. Hölldobler. A structured connectionist unification algorithm. In Proceedings of the National Conference of the American Association on Artificial Intelligence 90, pages 587–593, Boston, MA, 1990.
31. S. Hölldobler, Y. Kalinke, and J. Wunderlich. A recursive neural network for reflexive reasoning. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
32. T. Honkela. Self-organizing maps in symbol processing. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
33. J. S. R. Jang and C. T. Sun. Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Transactions on Neural Networks, 4(1):156–159, 1993.
34. D. Jurafsky, C. Wooters, G. Tajchman, J. Segal, A. Stolcke, E. Fosler, and N. Morgan. The Berkeley Restaurant Project. In Proceedings of the International Conference on Speech and Language Processing, pages 2139–2142, Yokohama, 1994.
35. P. Kanerva. Large patterns make great symbols: an example of learning from example. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
36. C. Kirkham and T. Harris. Development of a hybrid neural network/expert system for machine health monitoring. In R. Rao, editor, Proceedings of the 8th International Congress on Condition Monitoring and Engineering Management, COMADEM95, pages 55–60, 1995.
37. G. K. Kraetzschmar, S. Sablatnög, S. Enderle, and G. Palm.
Application of neurosymbolic integration for environment modelling in mobile robots. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
38. S. C. Kremer. A theory of grammatical induction in the connectionist paradigm. PhD dissertation, Dept. of Computing Science, University of Alberta, Edmonton, 1996.
39. S. C. Kremer and J. Kolen. Dynamical recurrent networks for sequential data processing. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
40. F. Kurfeß. Unification on a connectionist simulator. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 471–476. North-Holland, 1991.
41. S. C. Kwasny and K. A. Faisal. Connectionism and determinism in a syntactic parser. In N. Sharkey, editor, Connectionist natural language processing, pages 119–162. Lawrence Erlbaum, Hillsdale, NJ, 1992.
42. T. Lange and M. Dyer. High-level inferencing in a connectionist network. Connection Science, 1:181–217, 1989.
43. B. Lees, B. Kumar, A. Mathew, J. Corchado, B. Sinha, and R. Pedreschi. A hybrid case-based neural network approach to scientific and engineering data analysis. In Proceedings of the Eighteenth Annual International Conference of the British Computer Society Specialist Group on Expert Systems, pages 245–260, Cambridge, 1998.
44. H. Lipson and H. T. Siegelmann. High order eigentensors as symbolic rules in competitive learning. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
45. J. MacIntyre and P. Smith. Application of hybrid systems in the power industry. In L. Medsker, editor, Intelligent Hybrid Systems, pages 57–74. Kluwer Academic Press, 1995.
46. M. R. Mayberry and R. Miikkulainen. Combining maps and distributed representations for shift-reduce parsing. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
47. K. McGarry, S. Wermter, and J. MacIntyre. Hybrid neural systems: from simple coupling to fully integrated neural networks. Neural Computing Surveys, 2:62–94, 1999.
48. L. R. Medsker. Hybrid Neural Network and Expert Systems. Kluwer Academic Publishers, Boston, 1994.
49. L. R. Medsker. Hybrid Intelligent Systems. Kluwer Academic Publishers, Boston, 1995.
50. R. Miikkulainen. Subsymbolic Natural Language Processing. MIT Press, Cambridge, MA, 1993.
51. W. C. Morris, G. W. Cottrell, and J. L. Elman. A connectionist simulation of the empirical acquisition of grammatical relations. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
52. M. C. Mozer. Neural net architectures for temporal sequence processing. In A. Weigend and N. Gershenfeld, editors, Time series prediction: Forecasting the future and understanding the past, pages 243–264. Addison-Wesley, Redwood City, CA, 1993.
53. C. W. Omlin and C. L. Giles. Extraction and insertion of symbolic information in recurrent neural networks. In V. Honavar and L. Uhr, editors, Artificial Intelligence and Neural Networks: Steps towards Principled Integration, pages 271–299. Academic Press, San Diego, 1994.
54. C. W. Omlin and C. L. Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41–52, 1996.
55. C. W. Omlin, L. Giles, and K. K. Thornber. Fuzzy knowledge and recurrent neural networks: A dynamical systems perspective. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
56. C. Orovas and J. Austin. A cellular neural associative array for symbolic vision. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
57. J. Park and I. W.
Sandberg. Universal approximation using radial basis function networks. Neural Computation, 3:246–257, 1991.
58. N. S. Park. Addressing knowledge representation issues in connectionist symbolic rule encoding for general inference. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
59. T. Peterson and R. Sun. An RBF network alternative for a hybrid architecture. In International Joint Conference on Neural Networks, Anchorage, AK, May 1998.
60. J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46:77–105, 1990.
61. R. Reilly. Evolution of symbolisation: Signposts to a bridge between connectionist and symbolic systems. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
62. R. G. Reilly and N. E. Sharkey. Connectionist Approaches to Natural Language Processing. Lawrence Erlbaum Associates, Hillsdale, NJ, 1992.
63. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, volume 1, pages 318–362. MIT Press, Cambridge, MA, 1986.
64. N. Sharkey and T. Ziemke. Life, mind and robots: The ins and outs of embodied cognition. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
65. L. Shastri. A model of rapid memory formation in the hippocampal system. In Proceedings of the Meeting of the Cognitive Science Society, pages 680–685, Stanford, 1997.
66. L. Shastri. Types and quantifiers in SHRUTI: a connectionist model of rapid reasoning and relational processing. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
67. L. Shastri and V. Ajjanagadde. From simple associations to systematic reasoning: A connectionist representation of rules, variables and dynamic bindings. Behavioral and Brain Sciences, 16(3):417–494, 1993.
68. J. Shavlik. A framework for combining symbolic and neural learning. In V. Honavar and L. Uhr, editors, Artificial Intelligence and Neural Networks: Steps towards Principled Integration, pages 561–580. Academic Press, San Diego, 1994.
69. P. Smolensky. On the proper treatment of connectionism. Behavioral and Brain Sciences, 11(1):1–74, March 1988.
70. A. Sperduti, A. Starita, and C. Goller. Learning distributed representations for the classifications of terms. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 494–515, Montreal, 1995.
71. R. Sun. On variable binding in connectionist networks. Connection Science, 4(2):93–124, 1992.
72. R. Sun. Integrating Rules and Connectionism for Robust Commonsense Reasoning. Wiley, New York, 1994.
73. R. Sun. Hybrid connectionist-symbolic models: A report from the IJCAI95 workshop on connectionist-symbolic integration. Artificial Intelligence Magazine, 1996.
74. R. Sun. Supplementing neural reinforcement learning with symbolic methods. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
75. R. Sun and F. Alexandre. Proceedings of the Workshop on Connectionist-Symbolic Integration: From Unified to Hybrid Approaches. McGraw-Hill, Inc., Montreal, 1995.
76. R. Sun and F. Alexandre. Connectionist Symbolic Integration. Lawrence Erlbaum Associates, Hillsdale, NJ, 1997.
77. R. Sun and L. A. Bookman. Computational Architectures Integrating Neural and Symbolic Processes. Kluwer Academic Publishers, Boston, MA, 1995.
78. R. Sun and T. Peterson. Autonomous learning of sequential tasks: experiments and analyses. IEEE Transactions on Neural Networks, 9(6):1217–1234, 1998.
79. S. Thrun.
Extracting rules from artificial neural networks with distributed representations. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7. MIT Press, San Mateo, CA, 1995.
80. S. Thrun. Explanation-Based Neural Network Learning. Kluwer, Boston, 1996.
81. A. Tickle, F. Maire, G. Bologna, R. Andrews, and J. Diederich. Lessons from past, current issues and future research directions in extracting the knowledge embedded in artificial neural networks. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
82. P. Tino, G. Dorffner, and C. Schittenkopf. Understanding state space organization in recurrent neural networks with iterative function systems dynamics. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
83. H. Tirri. Replacing the pattern matcher of an expert system with a neural network. In S. Goonatilake and S. Khebbal, editors, Intelligent Hybrid Systems, pages 47–62. John Wiley and Sons, 1995.
84. M. L. Vaughn, S. J. Cavill, S. J. Taylor, M. A. Foy, and A. J. B. Fogg. Direct knowledge extraction and interpretation from a multilayer perceptron network that performs low back pain classification. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
85. D. Waltz. The importance of importance. Presentation at the Workshop on Hybrid Neural Symbolic Integration, Breckenridge, CO, 1998.
86. D. L. Waltz and J. A. Feldman. Connectionist Models and their Implications. Ablex, 1988.
87. S. Wermter. Hybrid Connectionist Natural Language Processing. Chapman and Hall, Thomson International, London, UK, 1995.
88. S. Wermter. Preference Moore machines for neural fuzzy integration. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 840–845, Stockholm, 1999.
89. S. Wermter. The hybrid approach to artificial neural network-based language processing. In R. Dale, H. Moisl, and H. Somers, editors, A Handbook of Natural Language Processing. Marcel Dekker, 2000.
90. S. Wermter. Knowledge extraction from transducer neural networks. Applied Intelligence: The International Journal of Artificial Intelligence, Neural Networks, and Complex Problem-Solving Techniques, 12:27–42, 2000.
91. S. Wermter, G. Arevian, and C. Panchev. Towards hybrid neural learning internet agents. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000.
92. S. Wermter and M. Meurer. Building lexical representations dynamically using artificial neural networks. In Proceedings of the International Conference of the Cognitive Science Society, pages 802–807, Stanford, 1997.
93. S. Wermter, C. Panchev, and G. Arevian. Hybrid neural plausibility networks for news agents. In Proceedings of the National Conference on Artificial Intelligence, pages 93–98, Orlando, USA, 1999.
94. S. Wermter, E. Riloff, and G. Scheler. Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing. Springer, Berlin, 1996.
95. S. Wermter and V. Weber. SCREEN: Learning a flat syntactic and semantic spoken language analysis using artificial neural networks. Journal of Artificial Intelligence Research, 6(1):35–85, 1997.

Layered Hybrid Connectionist Models for Cognitive Science

Jerome Feldman and David Bailey
International Computer Science Institute, Berkeley, CA 94704

Abstract. Direct connectionist modeling of higher cognitive functions, such as language understanding, is impractical.
This chapter describes a principled multi-layer architecture that supports AI-style computational modeling while preserving the biological plausibility of structured connectionist models. As an example, the connectionist realization of Bayesian model merging as recruitment learning is presented.

1 Hybrid Models in Cognitive Science

Almost no one believes that connectionist models will suffice for the full range of tasks in creating and modeling intelligent systems. People whose goals are primarily performance programs have no compunction about deploying hybrid systems, and rightly so. But many connectionists are primarily interested in modeling human and other animal intelligence, and it is not as clear what methodology is most appropriate in this enterprise. This paper provides one answer that has been useful to our group in our decade of effort in modeling language acquisition in the NTL (originally L0) project.

Connectionists of all persuasions agree that intelligence will best be explained in terms of its neural foundations, using computational models with simple notions of spreading activation and experience-based weight change. But no one claims that a model containing millions (much less billions) of units is itself a scientific description of some phenomenon, such as vision or language understanding. Beyond this basic agreement there is a major bifurcation into two approaches. The (larger) PDP community believes that progress is best made by training large back-propagation networks (and more recently Elman-style recurrent nets) to perform specific functions and then examining the learned weights for patterns and insight. There is a lot of good current work on extracting higher-level representations from learned weights, and this is discussed in other chapters. But there is no evidence that PDP networks can simply learn the full range of intelligent behavior, and they will not be discussed in this chapter.

The other main approach to connectionist modeling is usually called structured, because varying amounts of specifically designed computational mechanism are built into the model. This work is almost always localist in character, because it is much more natural to postulate that the pre-wired computational mechanisms are realized by localized circuits, especially if one is actually building the model. In principle, structured connectionist models (SCM) could capture the exact brain structure underlying some behavior and, in fact, models of this sort are common in computational neuroscience. But for the complex behaviors underlying intelligence, a complete SCM would not be any simpler to understand than the neural circuitry itself. And, of course, we don't know nearly enough about either the brain or cognition to even approximate such iconic models.

One suggestion that came up in the early years was to just use conventional symbolic models for so-called higher functions and restrict connectionist modeling to simple sensory-motor behaviors. This defeatist stance was never very popular because it left higher cognition without the parallel, fault-tolerant, evidential computational mechanisms that are the heart of connectionism. Something like this is now more feasible because the conventional AI approach has become a probabilistic belief network and thus more likely to be mappable to connectionist models.
Even so, if one is serious about cognitive modeling, there are good reasons to restrict choices to computational mechanisms that are at least arguably within the size, speed, and learnability constraints of the brain. For example, in general it takes exponential time for an arbitrary belief network to settle, and thus specializations would be needed for a plausible cognitive model.

Although the idea was implicit in earlier structured connectionist work, it is only recently that we have enunciated a systematic philosophy on how to build hybrid connectionist cognitive models. The central idea is hierarchical reducibility: any construct posited at a higher modeling level must have a computationally and biologically plausible reduction to the level below. The table below depicts our current five-level structure. We view this approach as a perfectly ordinary instance of the scientific method as routinely practiced in the physical and life sciences. But it does seem to provide us with a good working style, providing tractability while maintaining connectionist principles and the potential for direct experimental testing.

  cognitive:      words, concepts
  computational:  f-structs, x-schemas (see below)
  connectionist:  structured models, learning rules
  comp. neuro.:   detailed neural models
  neural:         [implicit]

Our computational level is analogous to Marr's and comprises a mixture of familiar notions like feature structures and a novel representation, executing schemas, described below. Apart from providing a valuable scientific language for specifying proposed structures and mechanisms, these representational formalisms can be implemented in simulations to allow us to test our hypotheses. They also support computational learning algorithms, so we can use them in experiments on acquisition. Importantly, these computational mechanisms are all reducible to structured connectionist models so that embodiment can be realized.

It is not necessarily easy to carry out these reductions, and a great deal of effort has gone into understanding them. Perhaps the most challenging problem is the connectionist representation of variable binding. This has been addressed for over a decade by Shastri and his students [13,12] and also by a number of other groups [9,15]. This body of work has shown that connectionist models can indeed encode a large body of systematic knowledge and perform interesting inferences in parallel via spreading activation. A recent extension of these techniques [7] supports the connectionist realization of the X-schema formalism discussed below. For contrast, we focus here on a rather different kind of problem: the mapping of the Bayesian model merging technique [14] to the standard structured connectionist mechanism of recruitment learning [6]. The particular context is David Bailey's dissertation [2], a model of how children learn the words for simple actions of their hand.

2 Verblearn and Its Connectionist Reduction

Bailey's Verblearn system has three major subparts, as depicted at the computational level in Figure 1. The bottom section of Figure 1 depicts the underlying actions, encoded as X-schemas. The top third depicts various possible word senses associated with an action, encoded as feature structures. The task is to learn the best set of word senses for describing the variations expressed in the training language. The crucial linking mechanism (also encoded as a feature structure) is shown in the center of Figure 1.
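To make the feature-structure encoding of word senses concrete, here is a small sketch in Python notation. The feature names and strengths loosely follow the push/shove example of Figure 1; the dictionary layout itself is only illustrative and is not Bailey's actual data format.

```python
# Each word sense maps a linking feature to a preferred value and a strength.
push_sense_1 = {            # "push" as a palm slide across the table
    "schema":    ("slide",  1.0),
    "posture":   ("palm",   0.7),
    "elbow jnt": ("extend", 0.9),
    "aspect":    ("once",   0.8),
    "object":    ("cube",   0.8),
}
push_sense_2 = {            # "push" as pressing a button down
    "schema":    ("depress", 1.0),
    "posture":   ("index",   0.9),
    "accel":     ("low",     0.7),
}
push = [push_sense_1, push_sense_2]   # a verb is a set of such senses
print(len(push), "senses for 'push'")
```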
As will be described below, the system learns an appropriate set of word senses and can demonstrate its knowledge by labeling novel actions or carrying out commands specified with newly learned words. The goal of this paper is to show how Bailey's computational level account can be reduced to the structured connectionist level in a principled way.

The two main data structures we need to map are feature structures and executing schemas (X-schemas). As mentioned above, X-schemas can be modeled at the connectionist level as an extension of Shruti [7]. The modeling of individual feature values follows the standard structured connectionist strategy (and a biologically plausible one as well) of using place coding [6]. That is, for each feature, there is a dedicated connectionist unit (i.e. a separate place) for each possible value of the feature. Within the network representing the possible values of a feature, we desire a winner-take-all (WTA) behavior. That is, only one value unit should be active at a time, once the network has settled, again in the standard way [6]. Representing the link between entities, feature names, and feature values is done, again as usual, with triangle nodes.

Fig. 1. An overview of the verb-learner at the computational level, showing details of the Slide x-schema, some linking features, and two verbs: push (with two senses) and shove (with one sense).

2.1 Triangle Units

The essential building block is the triangle unit [6,4], shown in Figure 2(a). A triangle unit is an abstraction of a neural circuit which effects a three-way binding. In the figure, the units A, B and C represent arbitrary "concepts" which are bound by the triangle unit. All connections shown are bidirectional and excitatory. The activation function of a triangle unit is such that activation on any two of its incoming connections causes an excitatory signal to be sent out over all three outgoing connections. Consequently, the triangle unit allows activation of A and B to trigger C, or activation of A and C to trigger B, etc.

Fig. 2. (a) A simple triangle unit which binds A, B and C. (b) One possible neural realization.

Triangle nodes will be used here as abstract building blocks, but Figure 2(b) illustrates one possible connectionist realization. A single unit is employed to implement the binding, and each concept unit projects onto it. Concept units are assumed to fire at a uniform high rate when active, and all weights into the main unit are equal.
[Fig. 3. Using a triangle unit to represent the value (palm) of a feature (posture) for an entity ("push").]

A particularly useful type of three-way binding consists of an entity, a feature, and a value for the feature, as shown in Figure 3. With this arrangement, if posture and palm are active, then "push" will be activated—a primitive version of the labelling process. Alternatively, if "push" and posture are active, then palm will be activated—a primitive version of obeying. The full story [1] requires a more complex form of triangle units, but the basic ideas can be conveyed with just the simple form.

2.2 Connectionist Level Network Architecture

This section describes a network architecture which implements (approximately) the multiple-sense verb representation and its associated algorithms for labelling and obeying. The architecture is shown in Figure 4, whose layout is intended to be reminiscent of the upper half of Figure 1.

[Fig. 4. A connectionist version of the model, using a collection of triangle units for each word sense.]

On the top is a "vocabulary" subnetwork containing a unit for each known verb. Each verb is associated with a collection of phonological and morphological details, whose connectionist representation is not considered here but is indicated by the topmost "blob" in the figure. Each verb unit can be thought of as a binding unit which ties together such information. The verb units are connected in a winner-take-all fashion to facilitate choosing the best verb for a given situation. On the bottom is a collection of subnetworks, one for each linking feature. The collection is divided into two groups. One group—the motor-parameter features—is bidirectionally connected to the motor control system, shown here as a blob for simplicity. The other group—the world-state features—receives connections from the perceptual system, which is not modelled here and is indicated by the bottom-right blob. Each feature subnetwork consists of one unit for each possible value. Within each feature subnetwork, units are connected in a winner-take-all fashion. A separate unit also represents each feature name. The most interesting part of the architecture is the circuitry connecting the verb units to the feature units. In the central portion of Figure 4 the connectionist representation of two senses of push are shown, each demarcated by a box. Each sense requires several triangle units with specialized functions. One triangle unit for each sense can be thought of as primary; these are drawn larger and labelled "push1" and "push2". These units are of the soft conjunctive type and serve to integrate information across the features which the sense is concerned about. Their left side connects to the associated verb unit.
Their right 20 J. Feldman and D. Bailey side has multiple connections to a set of subsidiary triangle units, one for each world-state feature (although only one is shown in the figure). The lower side of the primary triangle unit works similarly, but for the motor-parameter features (two are shown in the figure). Note also that the primary triangle units are connected into a lexicon-wide winner-take-all network. 2.3 Labelling and Obeying We can now illustrate how the network performs labelling and obeying. Essentially, these processes involve providing strong input to two of the three sides of some word sense’s primary triangle unit, resulting in activation of the third side. For labelling, the process begins when x-schema execution and the perceptual system activate the appropriate feature and value units in the lower portion of Figure 4. In response—and in parallel—every subsidiary triangle unit connected to an active feature unit weighs the suitability of the currently active value unit according to its learned connection strengths. In turn, these graded responses are delivered to the lower and right-hand sides of each word sense’s primary triangle unit. The triangle units become active to varying degrees, depending on the number of activated subsidiary units and their degrees of activation. The winner-take-all mechanism ensures that only one primary unit dominates, and when that occurs the winning primary unit turns on its associated verb unit. For obeying, we assume one verb unit has been activated (say, by the auditory system) and the appropriate world-state feature and value units have been activated (by the perceptual system). As a result, the only primary triangle units receiving activation on more than one side will be those connected to the command verb. This precipitates a competition amongst those senses to see which has the most strongly active world-state subsidiary triangle units—that is, which sense is most applicable to the current situation. The winner-take-all mechanism boosts the winner and suppresses the others. When the winner’s activation peaks, it sends activation to its motor-parameter subsidiary triangle units. These, in turn, will activate the motor-parameter value units in accordance with the learned connection strengths. Commonly this will result in partial activation on multiple values for some features. The winner-take-all mechanism within each feature subnetwork chooses a winner. (Alternatively, we might prefer to preserve the distributed activation pattern for use by smarter x-schemas which can reason with probabilistic specification of parameters. E.g., if all the force value units are weakly active, the x-schema knows it should choose a suitable amount of force.) 3 Learning - Connectionist Account The ultimate goal of the system is to learn the right word senses from labeled experience. The verb learning model assumes that the agent has already acquired various x-schemas for the actions of one hand manipulating an object on a table and that an informant labels actions that the agent is performing. The algorithm starts by assuming that each instance (e.g. of a word sense) is a new category Layered Hybrid Connectionist Models for Cognitive Science 21 and then proceeds to merge these until a total information criterion no longer improves. More technically, the learning task is an optimization problem, in that we seek, amongst all possible lexicons, the “best” one given the training set. 
We seek the lexicon model m that is most probable given the training data t:

argmax_m P(m | t)    (1)

The probability being maximized is the a posteriori probability of the model, and our algorithm is a "maximum a posteriori (MAP) estimator". The fundamental insight of Bayesian learning is that this quantity can be decomposed, using Bayes' rule, into components which separate the fit to the training data and an a priori preference for certain models over others:

P(m | t) ∝ P(m) P(t | m)    (2)

Here, as usual, the prior term P(m) is proportional to the complexity of the model and the likelihood term P(t | m) is a measure of how well the model fits the data, in this case the labeled actions. The goal is to adjust the model to optimize the overall fit; model merging is one algorithm for this. In general terms, the algorithm is:

Model merging algorithm:
1. Create a simple model for each example in the training set.
2. Repeat the following until the posterior probability decreases:
   a) Find the best candidate pair of models to merge.
   b) Merge the two models to form a possibly more complex model, and remove the original models.

In our case, "model" in the name "model merging" refers to an individual word sense f-struct. The learning algorithm creates a separate word sense for every occurrence of a word, and then merges these word sense f-structs so long as the reduction in the number of word senses outweighs the loss of training-set likelihood resulting from the merge. A major advantage of the model merging algorithm is that it is one-shot. After a single training example for a new verb, the system is capable of using the verb in a meaningful, albeit limited, way. Model merging is also relatively efficient since it does not backtrack. Yet it often successfully avoids poor local minima because its bottom-up rather than top-down strategy is less likely to make premature irreversible commitments.
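The greedy loop above can be rendered schematically as follows. This is only a sketch of the control structure: `initial_sense`, `merge_senses` and `posterior` stand in for Bailey's model-specific sense construction, merging operation, and prior-plus-likelihood score; none of these names come from the original system.

```python
# Schematic model merging: start with one word sense per training example and
# greedily merge pairs while the (approximate) posterior keeps improving.
from itertools import combinations

def initial_sense(example):
    ...  # placeholder: build a word-sense f-struct from a single example

def merge_senses(a, b):
    ...  # placeholder: combine two word-sense f-structs

def posterior(lexicon, data):
    ...  # placeholder: log P(m) + log P(t | m); the prior favors fewer senses

def model_merging(data):
    lexicon = [initial_sense(ex) for ex in data]             # step 1
    best = posterior(lexicon, data)
    while len(lexicon) > 1:                                   # step 2
        trials = []
        for a, b in combinations(lexicon, 2):                 # step 2a: candidate pairs
            candidate = [s for s in lexicon if s is not a and s is not b]
            candidate.append(merge_senses(a, b))               # step 2b: merge, drop originals
            trials.append((posterior(candidate, data), candidate))
        score, candidate = max(trials, key=lambda t: t[0])
        if score < best:                                       # posterior would decrease: stop
            break
        lexicon, best = candidate, score
    return lexicon
```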
We now consider how the model merging algorithm can be realized in a connectionist manner, so that we will have a unified connectionist story for the entire system. At first glance, the model merging algorithm does not appear particularly connectionist. Two properties cause trouble. First, the algorithm is constructivist. That is, new pieces of representation (word senses) need to be built, as opposed to merely gradually changing existing structures. Second, the criterion for merging is a global one, rather than depending on local properties of word senses. Nevertheless, we have a proposed connectionist solution employing a learning technique known as recruitment learning.

3.1 Recruitment Learning

Recruitment learning [5,11] assumes a localist representation of bindings such as the triangle unit described in §2.1, and provides a rapid-weight-change algorithm for forming such "effective circuits" from previously unused connectionist units. Figure 5 illustrates recruitment with an example. Recall that a set of triangle nodes is usually connected in a winner-take-all (WTA) fashion to ensure that only one binding reaches an activation level sufficiently high to excite its third member. For recruitment learning, we further posit that there is a pool of "free" triangle units which also take part in the WTA competition. The units are free in that they have low, random weights to the various "concept units" amongst which bindings can occur. Crucially, though, they do have connections to these concept units. But the low weights prevent these free units from playing an active role in representing existing bindings.

[Fig. 5. Recruitment of triangle unit T3 to represent the binding E–F–G.]

This architecture supports the learning of new bindings as follows. Suppose, as in Figure 5, several triangle units already represent several bindings, such as T1, which represents the binding of A, C and F. (The bindings for T2 are not shown.) Suppose further that concept units E, F and G are currently active, and the WTA network of triangle units is instructed (e.g. by a chemical mechanism) that this binding must be represented. If there already exists a triangle unit representing the binding, it will be activated by the firing of E, F and G, and that will be that. But if none of the already-recruited triangle units represents the binding, then it becomes possible for one of the free triangle units (e.g. T3)—whose low, random weights happen to slightly bias it toward this new binding—to become weakly active. The WTA mechanism selects this unit and increases its activation, which then serves as a signal to the unit to rapidly strengthen its connections to the active concept units. (This kind of rapid and permanent weight change, often called long term potentiation or LTP, has been documented in the nervous system. It is a characteristic of the NMDA receptor, but may not be exclusive to it. It is hypothesized to be implicated in memory formation. See [10] for details on the neurobiology, or [12] for a more detailed connectionist model of LTP in memory formation.) It thereby joins the pool of recruited triangle units.

As described, the technique seems to require full connectivity and enough unrecruited triangle units for all possible conjunctions. Often, though, the overall architecture of a neural system provides constraints which greatly reduce the number of possible bindings, compared to the number possible if the pool of concept units is considered as an undifferentiated whole. For example, in our connectionist word sense architecture, it is reasonable to assume that the initial neural wiring is predisposed toward binding words to features—not words to words, or feature units to value units of a different feature. The view that the brain starts out with appropriate connectivity between regions on a coarse level is bolstered by the imaging studies of [3] which show, for example, different localization patterns for motor verbs (nearer the motor areas) vs. other kinds of verbs. Still, the number of potential bindings and connections may be daunting. It turns out, though, that sparse random connection patterns can alleviate this apparent problem [5]. The key idea is to use a multi-layered scheme for representing bindings, in which each binding is represented by paths amongst the to-be-bound units rather than direct connections. The existence of such paths can be shown to have high probability even in sparse networks, for reasonable problem sizes [16].
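The recruitment step can be sketched as follows. The data structures, weight values and threshold are our own simplifications for illustration, not the authors' simulations.

```python
# Recruitment learning, schematically: recruited and free triangle units share
# a WTA pool. If no recruited unit already encodes the active binding, the
# free unit most biased toward it wins and rapidly strengthens its weights
# (an LTP-like change), thereby joining the recruited pool.
import random

class TriUnit:
    def __init__(self, concepts):
        self.weights = {c: random.uniform(0.0, 0.1) for c in concepts}  # low, random
        self.recruited = False

    def response(self, active):
        return sum(self.weights[c] for c in active)

def represent_binding(pool, active, threshold=2.0):
    for unit in (u for u in pool if u.recruited):
        if unit.response(active) >= threshold:      # an existing unit already encodes it
            return unit
    winner = max((u for u in pool if not u.recruited),
                 key=lambda u: u.response(active))   # WTA among the free units
    for c in active:
        winner.weights[c] = 1.0                      # rapid, lasting weight change
    winner.recruited = True
    return winner

concepts = list("ABCDEFG")
pool = [TriUnit(concepts) for _ in range(4)]
t1 = represent_binding(pool, {"A", "C", "F"})        # recruits a unit for A-C-F
t3 = represent_binding(pool, {"E", "F", "G"})        # a different, free unit is recruited
print(t1 is not t3)                                   # True
```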
3.2 Merging Via Recruitment

The techniques of recruitment learning can be put to use to create the word sense circuitry shown earlier in Figure 4. The connectionist learning procedure does not exactly mimic the algorithm given above but captures the main ideas. To illustrate our connectionist learning procedure, we will assume that the two senses of push shown in Figure 4 have already been learned, and a new training example has just occurred. That is, the "push" unit has just become active, as have some of the feature value units reflecting the just-executed action.

The first key observation is that when a training example occurs, external activation arrives at a verb unit, motor-parameter feature value units, and world-state feature value units. This three-way input is the local cue to the various triangle units that adaptation should occur—labelling and obeying never produce such three-way external input to the triangle units. Depending on the circumstances, there are three possible courses of action the net may take:

– Case 1: The training example's features closely match those of an existing word sense. This case is detected by activation of the primary triangle unit of the matching sense—strong enough activation to dominate the winner-take-all competition. In this case, an abbreviated version of merging occurs. Rather than create a full-fledged initial word sense for the new example, only to merge it into the winning sense, the network simply "tweaks" the winning sense to accommodate the current example's features. Conveniently, the winning sense's primary triangle unit can detect this situation using locally available information, namely: (1) it is highly active; and (2) it is receiving activation on all three sides. The tweaking itself is a version of Hebb's Rule [8]: the weights on connections to active value units are incrementally strengthened. With an appropriate weight update rule, this strategy can mimic the probability distributions learned by the model merging algorithm.

– Case 2: The training example's features do not closely match any existing sense. This case is detected by failure of the winner-take-all mechanism to elevate any word sense above a threshold level. In this case, standard recruitment learning is employed. Pools of unrecruited triangle units are assumed to exist, pre-wired to function as either primary or subsidiary units in future word senses. After the winner-take-all process fails to produce a winner from the previously-recruited set of triangle units, recruitment of a single new primary triangle unit and a set of new subsidiary units occurs. The choice will depend on the connectivity and initial weights of the subsidiary units to the feature value units, but will also depend on the connections amongst the new units which are needed for the new sense to cohere. Once chosen, these units' weights are set to reflect the currently active linking feature values, thereby forming a new word sense which essentially is a copy of the training example.

– Case 3: The training example's features are a moderate match to two (or more) existing word senses. This case is detected by a protracted competition between the two partially active senses which cannot be resolved by the winner-take-all mechanism. Figure 6 depicts this case. As indicated by the darkened ovals, the training example is labelled "push" but involved medium force applied to a small size object—a combination which doesn't quite match either existing sense. This case triggers recruitment of triangle units to form a new sense as described for case 2, but with an interesting twist. The difference is that the weights of the new subsidiary triangle units will reflect not only the linking features of the current training example, but also the distribution of values
"push" "shove" "pull" new merged sense push1 push12 force force=low force=med force=high push2 dir dir=away motor control (x-schemas) dir=down size size=small size=large perceptual system Fig. 6. Connectionist merging of two word senses via recruitment of a new triangle unit circuit. represented in the partially active senses. Thus, the newly recruited sense will be a true merge of the two existing senses (as well as the new training example). Figure 6 illustrates this outcome by the varying thicknesses on the connections to the value units. If you inspect these closely you will see that the new sense “push12” encodes broader correlations with the force and size features than those of the previous senses “push1” and “push2”. In other words, “push12” basically codes for dir = away, force not high. How can this transfer of information be accomplished, since there are no connections from the partially active senses to the newly recruited sense? The trick is to use indirect activation via the feature value units. The partially active senses, due to their partial activation, will deliver some activation to the value units—in proportion to their outgoing weights. Each value unit adds any such input from the various senses which connect to it. Consequently, each feature subnetwork will exhibit a distributed activation pattern reflecting an average of the distributions in the two partially active senses (plus extra activation for the value associated with the current action). This distribution will then be effectively copied into the weights in the newly recruited triangle units, using the usual weight update rule for those units. A final detail for case 3: to properly implement merging, the two original senses must be removed from the network and returned to the pool of unrecruited units. If they were not removed, the network would quickly accumulate an implausible number of word senses. After all, part of the purpose 26 J. Feldman and D. Bailey of merging is to produce a compact model of each verb’s semantics. But there is another reason to remove the original senses. The new sense will typically be more general than its predecessors. If the original senses were kept, they would tend to “block” the new sense by virtue of their greater specificity (i.e. more peaked distributions). The new sense would rarely get a chance to become active, and its weights would weaken until it slipped back into unrecruited status. So to force the model to use the new generalization, the original senses must be removed. Fortunately, the cue for removal is available locally to these senses’ triangle units: the protracted period of partial activation, so useful for synthesizing the new sense, can serve double duty as a signal to these triangle units to greatly weaken their own weights, thus returning them to the unrecruited pool. The foregoing description is only a sketch, and activation functions have not been fully worked out. It is possible, for example, that the threshold distinguishing case 2 from case 3 could prove too delicate to set reliably for different languages. These issues are left for future work. Nonetheless, several consequences of this particular connectionist realization of a model-merging-like algorithm are apparent. First, the strategy requires presentation of an intermediate example to trigger merging of two existing senses. The architecture does not suddenly “notice” that two existing senses are similar and merge them. 
Another consequence of the architecture is that it never performs a series of merges as a “batch” as happens in model merging. On the other hand, the architecture does, in principle, allow each merge operation to combine more than two existing senses at a time. Indeed, technically speaking, the example illustrated in Figure 6 is a three-way merge of “push1,” “push2” and the current training example. The question of the relative merits of these two strategies is another good question to pursue. 4 Conclusion We have shown that the two seemingly connectionist-unfriendly aspects of model merging—its constructiveness and its use of a global optimization criterion—can be overcome by using recruitment learning and a modified winner-take-all mechanism. This, hopefully, elucidates the general point of this chapter. Of the many ways of constructing hybrid connectionist models, one seems particularly well suited for cognitive science. For both computational and explanatory purposes, it is convenient to do some (sometimes all) of our modeling at a computational level that is not explicitly connectionist. By requiring a biologically and computationally plausible reduction of all computational level primitives to the (structured) connectionist level, we retain the best features of connectionist models and promote the development of an integrated Cognitive Science. And it is a lot of fun. Layered Hybrid Connectionist Models for Cognitive Science 27 References 1. David R. Bailey. When Push Comes to Shove: A Computational Model of the Role of Motor Control in the Acquisition of Action Verbs. PhD thesis, Computer Science Division, EECS Department, University of California at Berkeley, 1997. 2. David R. Bailey, Jerome A. Feldman, Srini Narayanan, and George Lakoff. Modeling embodied lexical development. In Proceedings of the 19th Cognitive Science Society Conference, pages 19–24, 1997. 3. Antonio R. Damasio and Daniel Tranel. Nouns and verbs are retrieved with differently distributed neural systems. Proceedings of the National Academy of Sciences, 90:4757–4760, 1993. 4. Joachim Diederich. Knowledge-intensive recruitment learning. Technical Report TR-88-010, International Computer Science Institute, Berkeley, CA, 1988. 5. Jerome A. Feldman. Dynamic connections in neural networks. Biological Cybernetics, 46:27–39, 1982. 6. Jerome A. Feldman and Dana Ballard. Connectionist models and their properties. Cognitive Science, 6:205–254, 1982. 7. Dean J. Grannes, Lokendra Shastri, Srini Narayanan, and Jerome A. Feldman. A connectionist encoding of schemas and reactive plans. Poster presented at 19th Cognitive Science Society Conference, 1997. 8. Donald O. Hebb. The Organization of Behavior. Wiley, New York, NY, 1949. 9. J.E. Hummel and I. Biederman. Dynamic binding in a neural network for shape recognition. Psychological Review, 99:480–517, 1992. 10. Gary Lynch and Richard Granger. Variations in synaptic plasticity and types of memory in corticohippocampal networks. Journal of Cognitive Neuroscience, 4(3):189–199, 1992. 11. Lokendra Shastri. Semantic Networks: An evidential formalization and its connectionist realization. Morgan Kaufmann, Los Altos, CA, 1988. 12. Lokendra Shastri. A model of rapid memory formation in the hippocampal system. In Proceedings of the 19th Cognitive Science Society Conference, pages 680–685, 1997. 13. V. Ajjanagadde & L. Shastri. Rules and variables in neural nets. Neural Computation, 3:121–134, 1991. 14. Andreas Stolcke and Stephen Omohundro. 
Best-first model merging for hidden Markov model induction. Technical Report TR-94-003, International Computer Science Institute, Berkeley, CA, January 1994. 15. Ron Sun. On variable binding in connectionist networks. Connection Science, 4:93–124, 1992. 16. Leslie Valiant. Circuits of the mind. Oxford University Press, New York, 1994. Types and Quantifiers in shruti — A Connectionist Model of Rapid Reasoning and Relational Processing Lokendra Shastri International Computer Science Institute, Berkeley CA 94704, USA, shastri@icsi.berkeley.edu, WWW home page: http://icsi.berkeley/˜shastri Abstract. In order to understand language, a hearer must draw inferences to establish referential and causal coherence. Hence our ability to understand language suggests that we are capable of performing a wide range of inferences rapidly and spontaneously. This poses a challenge for cognitive science: How can a system of slow neuron-like elements encode a large body of knowledge and perform inferences with such speed? shruti attempts to answer this question by demonstrating how a neurally plausible network can encode a large body of semantic and episodic facts, and systematic rule-like knowledge, and yet perform a range of inferences within a few hundred milliseconds. This paper describes a novel representation of types and instances in shruti that supports the encoding of rules and facts involving types and quantifiers, enables shruti to distinguish between hypothesized and asserted entities, and facilitates the dynamic instantiation and unification of entities during inference. 1 Introduction In order to understand language, a hearer must draw inferences to establish referential and causal coherence, generate expectations, make predictions, and recognize the speaker’s intent. Hence our ability to understand language suggests that we are capable of performing a wide range of inferences rapidly, spontaneously and without conscious effort — as though they are a reflex response of our cognitive apparatus. In view of this, such reasoning has been described as reflexive reasoning [22]. This remarkable human ability poses a challenge for cognitive science and computational neuroscience: How can a system of slow neuron-like elements encode a large body of systematic knowledge and perform a wide range of inferences with such speed? The neurally plausible (connectionist) model shruti attempts to address the above challenge. It demonstrates how a network of neuron-like elements could encode a large body of structured knowledge and perform a variety of inferences within a few hundred milliseconds [3][22][14][23][20]. shruti suggests that the encoding of relational information (frames, predicates, etc.) is mediated by neural circuits composed of focal-clusters and the S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 28–45, 2000. c Springer-Verlag Berlin Heidelberg 2000 Types and Quantifiers in shruti 29 dynamic representation and communication of relational instances involves the transient propagation of rhythmic activity across these clusters. A role-entity binding is represented within this rhythmic activity by the synchronous firing of appropriate cells. Systematic mappings — and other rule-like knowledge — are encoded by high-efficacy links that enable the propagation of rhythmic activity across focal-clusters, and a fact in long-term memory is a temporal pattern matcher circuit. 
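Before turning to the detailed machinery, a toy sketch may help fix the idea of binding by synchrony: an entity and a role are bound exactly when their nodes fire in the same phase of the ongoing rhythmic activity. The code below is our own illustration (node labels follow the give(John, Mary, a-Book) example used in this chapter); it is not part of shruti.

```python
# An illustrative sketch of binding by temporal synchrony: each entity fires
# in its own phase, and a role is bound to an entity by firing in the same
# phase. Two spikes count as "synchronous" if they fall within a window
# OMEGA of each other.
OMEGA = 3          # window of synchrony, in arbitrary time steps
PERIOD = 25        # period of the rhythmic activity

# Firing phases (offsets within one period) for entity and role nodes:
# this pattern encodes give(giver=John, recipient=Mary, give-object=a-Book).
phase = {
    "+:John": 0,  "giver": 0,
    "+:Mary": 8,  "recip": 8,
    "+e:Book": 16, "g-obj": 17,   # one step apart, still within OMEGA
}

def bound(role, entity):
    """A role-entity binding holds if the two nodes fire in synchrony."""
    lag = abs(phase[role] - phase[entity]) % PERIOD
    return min(lag, PERIOD - lag) <= OMEGA

print(bound("giver", "+:John"))    # True: same phase, so the binding holds
print(bound("giver", "+:Mary"))    # False: different phases, no binding
```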
The possible role of synchronous activity in dynamic neural representations has been suggested by other researchers (e.g., [28]), but shruti offers a detailed computational account of how synchronous activity can be harnessed to solve problems in the representation and processing of high-level conceptual knowledge. A rich body of neurophysiological evidence has emerged suggesting that synchronous activity might indeed play an important role in neural computation [26] and several models using synchrony to solve the binding problem during inference have been developed (e.g., [9]; for other solutions to the binding problem within a structured connectionist framework see [11][5][27]).

As an illustration of shruti's inferential ability consider the following narrative: "John fell in the hallway. Tom had cleaned it. He got hurt." Upon being presented with the above narrative, shruti reflexively infers the following: Tom had mopped the floor. The floor was wet. John was walking in the hallway. John slipped and fell because the floor was wet. John got hurt because he fell. (Each sentence in the narrative is conveyed to shruti as a set of dynamic bindings (see Section 4). The sentences are presented in the order of their occurrence in the narrative. After each sentence is presented, the network is allowed to propagate activity for a fixed number of cycles. A detailed discussion of this example appears in [25].)

Notice that shruti draws inferences required to establish referential and causal coherence. It explains John's fall by making the plausible inference that John was walking in the hallway and he slipped because the floor was wet. It also infers that John got hurt because of the fall. Moreover, it determines that "it" in the second sentence refers to the hallway, and that "He" in the third sentence refers to John, and not to Tom.

The representational and inferential machinery developed in shruti can be applied to other problems involving relational structures, systematic but context-sensitive mappings between such structures, and rapid interactions between persistent and dynamic structures. The shruti model meshes with the "Neural Theory of Language" project [4] on language acquisition and provides neurally plausible solutions to several representational and computational requirements arising in the project. The model also offers a plausible framework for realizing the "Interpretation as Abduction" approach to language understanding described in [8]. Moreover, shruti's representational machinery has been extended to realize control and coordination mechanisms required for modeling actions and reactive plans [24].

This paper describes a novel representation of types and instances in shruti. This representation supports the encoding of rules and facts involving types and quantifiers, and at the same time allows shruti to distinguish between hypothesized entities and asserted entities. This in turn facilitates the dynamic instantiation and unification of entities and relational instances during inference. For a detailed description of various aspects of shruti's representational machinery refer to [22][20][25]. The rest of the chapter is organized as follows: Section 2 provides an overview of how relational knowledge is encoded in shruti. Section 3 discusses the representation of types and instances. Section 4 describes the representation of dynamic bindings.
Section 5 explains how phase separation between incompatible entities is enforced in the type hierarchy via inhibitory mechanisms, and how phases are merged to unify entities. Section 6 describes the associative potentiation of links in the type hierarchy. Next, Section 7 reviews the encoding of facts, and Section 8 outlines the encoding of rules (or mappings) between relational structures. A simple illustrative example is presented in Section 9.

[Fig. 1. An overview of shruti's representational machinery.]

2 An Overview of shruti's Representational Machinery

All long-term (persistent) knowledge is encoded in shruti via structured networks of nodes and links. Such long-term knowledge includes generic relations, instances, types, general rules, and specific facts. In contrast, dynamic aspects of knowledge are represented via the activity of nodes, the propagation of activity along excitatory and inhibitory links, and the integration of incident activity at nodes. Such dynamic knowledge includes active (dynamic) facts and bindings, propagation of bindings, fusion of evidence, competition among incompatible entities, and the development of coherence.

Figure 1 provides an overview of some of the key elements of shruti's representational machinery. The network fragment shown in the figure depicts a partial encoding of the following rules, facts, instances, and types:

1. ∀(x:Agent y:Agent z:Thing) give(x,y,z) ⇒ own(y,z) [800,800];
2. ∀(x:Agent y:Thing) buy(x,y) ⇒ own(x,y) [900,980];
3. EF: give(John, Mary, Book-17) [1000];
4. TF: ∀(x:Human y:Book) buy(x,y) [50];
5. is-a(John, Human);
6. is-a(Mary, Human);
7. is-a(Human, Agent);
8. is-a(Book-17, Book).

Item (1) is a rule which captures a systematic relationship between giving and owning. It states that when an entity x of type Agent, gives an entity z of type Thing, to an entity y of type Agent, then the latter comes to own it. Similarly, item (2) is a rule which states that whenever any entity of the type Agent buys something, it comes to own it. The pair of weights [a,b] associated with a rule have the following interpretation: a indicates the degree of evidential support for the antecedent being the probable cause (or explanation) of the consequent, and b indicates the degree of evidential support for the consequent being a probable effect of the antecedent. (Weights in shruti lie in the interval [0,1000]. The mapping of probabilities and evidential supports to weights in shruti is non-linear and loosely defined. The initial weights can be set approximately, and subsequently fine tuned to model a given domain via learning.) Item (3) corresponds to a long-term "episodic" fact (or E-fact) which states that John gave Mary a specific book (Book-17). Item (4) is a long-term "taxon" fact (or T-fact) which states that the prior evidential support for a given (random) human buying a given (random) book is 50. Item (5) states that John is a human. Similarly, items (6–8). Given the above knowledge, shruti can rapidly draw inferences of the following sort within a few hundred milliseconds (numbers in [] indicate strength of inference; the time required for drawing an inference is estimated by c∗π, where c is the number of cycles of rhythmic activity it takes shruti to draw an inference (see Section 9), and π is the period of rhythmicity, a plausible value of π being 25 milliseconds [22]):
1. own(Mary, Book-17) [784]; Mary owns a particular book (referred to as Book-17).
2. ∃x:Book own(Mary,x) [784]; Mary owns a book.
3. ∃(x:Agent y:Thing) own(x,y) [784]; Some agent owns something.
4. buy(Mary,Book-1) [41]; Mary bought a particular book (referred to as Book-1).
5. is-a(Mary, Agent); Mary is an agent.

Figure 2 depicts a schematized response of the shruti network shown in Figure 1 to the query "Does Mary own a book?" (∃ x:Book own(Mary, x)?). We will revisit this activation trace in Section 9 after we have reviewed shruti's representational machinery, and discussed the encoding of instances and types in more detail. For now it suffices to observe that the query is conveyed to the network by activating appropriate "?" nodes (?:own, ?:Mary and ?e:Book) and appropriate role nodes (owner and o-obj). This leads to a propagation of activity in the network which eventually causes the activation of the nodes +:own and +:Book-17. This signals an affirmative answer (Yes, Mary owns Book-17). Note that bindings between roles and entities are expressed by the synchronous activation of bound role and entity nodes.

[Fig. 2. A schematized activation trace of selected nodes for the query own(Mary,Book-17)?.]

2.1 Different Node Types and Their Computational Behavior

Nodes in shruti are computational abstractions and correspond to small ensembles of cells. Moreover, a connection from a node A to a node B corresponds to several connections from cells in the A ensemble to cells in the B ensemble. shruti makes use of four node types: m-ρ nodes, τ-and nodes, τ-or nodes of type 1, and τ-or nodes of type 2. This classification is based on the computational properties of nodes, and not on their functional or representational role. In particular, nodes serving different representational functions can be of the same computational type. The computational behavior of m-ρ nodes and τ-and nodes is described below:

m-ρ nodes: An m-ρ node with threshold n becomes active and fires upon receiving n synchronous inputs. Here synchrony is defined relative to a window of temporal integration ω. Thus all inputs arriving at a node with a lead/lag of no more than ω are deemed to be synchronous. Thus an m-ρ node A receiving above-threshold periodic inputs from m-ρ nodes B and C (where B and C may be firing in different phases) will respond by firing in phase with both B and C. A similar node type has been described in [15]. A scalar level (strength) of activity is associated with the response of an m-ρ node. (The response-level of an m-ρ node in a phase can be governed by the number of cells in the node's cluster firing in that phase.) This level of activity is computed by the activation combination function (ECF) associated with the node. Some ECFs used in the past are sum, max, and sigmoid. Other combination functions are under investigation [25].

τ-and nodes: A τ-and node becomes active on receiving an uninterrupted and above-threshold input over an interval ≥ πmax, where πmax is a system parameter.
Computationally, this sort of input can be idealized as a pulse whose amplitude exceeds the threshold, and whose duration is greater than or equal to πmax . Physiologically, such an input may be identified with a high-frequency burst of spikes. Thus a τ -and node behaves like a temporal and node and becomes active upon receiving adequate and uninterrupted inputs over an interval πmax . Upon becoming active, such a node produces an output pulse of width ≥ πmax . The level of output activation is determined by the ECF associated with the node for combining the weighted inputs arriving at the node. The model also makes use of inhibitory modifiers that can block the flow of activation along a link. This blocking is phasic and lasts only for a duration ω. 2.2 Encoding of Relational Structures Each relation (in general, a frame or a predicate) is represented by a focal-cluster which as an anchor for the complete encoding of a relation. Such focal-clusters are depicted as dotted ellipses in Figure 1. The focal-cluster for the relation give is depicted toward the top and the left of Figure 1. For the purpose of this example, it is assumed that give has only three roles: giver, recipient and giveobject. Each of these roles is encoded by a separate node labeled giver, recip and g-obj, respectively. The focal-cluster of give also includes an enabler node labeled ? and two collector nodes labeled + and –. The positive and negative collectors are mutually inhibitory (inhibitory links are depicted by filled blobs). In general, the focal-cluster for an n-place relation contains n role nodes, one enabler node, one positive collector node and one negative collector node. We will refer to the enabler, the positive collector, and the negative collector of a relation P as ?:P, +:P, and –:P, respectively. The collector and enabler nodes of relations behave like τ -and nodes. Role nodes and the collector and enabler nodes of instances behave like m-ρ nodes. Semantic Import of Enabler and Collector Nodes. Assume that the roles of a relation P have been dynamically bound to some fillers and thereby represent an active instance of P (we will see how this is done, shortly). The activation of the enabler ?:P means that the system is seeking an explanation for the active instance of P. In contrast, the activation of the collector +:P means that the system is affirming the active instance of P. Similarly, the activation of the collector -:P means that the system is affirming the negation of the active instance of P. The activation levels of ?:P, +:P and -:P signifies the strength with which information about P is being sought, believed, or disbelieved, respectively. For example, if the roles giver, recipient and object are dynamically bound to John, Mary, and a book, respectively, then the activation of ?:give means that the system is asking whether “John gave Mary a book” matches a fact in memory, or whether it can be inferred from what is known. In contrast, the Types and Quantifiers in shruti 35 activation of +:P with the same role bindings means that the system is asserting “John gave Mary a book”. Degrees of Belief: Support, no Information and Contradiction. The levels of activation of the positive and negative collectors of a relation measure the effective degree of support offered by the system to the currently active relational instance. 
Thus the activation levels of the collectors +:P and -:P encode a graded belief ranging continuously from no on the one extreme (only -:P is active), to yes on the other (only +:P is active), and don’t know in between (neither collector is very active). If both the collectors receive comparable and strong activation then a contradiction is indicated. Significance of Collector to Enabler Connections. Links from the collector nodes to the enabler node of a relation convert a dynamic assertion of a relational instance into a query about the assertion. Thus the system continually seeks an explanation for active assertions. The weight on the link from +:P (or -:P) to ?:P is a sum of two terms. The first term is proportional to the system’s propensity for seeking explanations — the more skeptical the system, the higher the weight. The second term is inversely proportional to the probability of occurrence of a positive (or negative) instance of P — the more unlikely a fact, the more intense the search for an explanation. The links from the collectors of a relation to its enabler also create positive feedback loops of activation and thereby create stable coalitions of active cells under appropriate circumstances. If the system seeks an explanation for an instance of P and finds support for this instance, then a stable coalition of activity arises consisting of ?:P, other ensembles participating in the explanation, +:P and finally ?:P. Such activity leads to priming (see Section 6), and the formation of episodic memories (see [19,21]). 3 Encoding Instances and Types The encoding of types and instances is illustrated in Figure 3. The focal-cluster of each entity consists of a ? and a + node. In contrast, the focal-cluster of each type consists of a pair of ? nodes (?e and ?v) and a pair of + nodes (+e and +v). While the nodes +v and ?v participate in the expression of knowledge (facts and attributes) involving the whole type, the nodes +e and ?e participate in the encoding of knowledge involving particular instances of the type. Thus nodes v and e signify universal and existential quantification, respectively. All nodes participating in the representation of types are m-ρ nodes. 3.1 Interconnections within Focal-Clusters of Instances and Types The interconnections shown in Figure 3 among nodes within the focal-cluster of an instance and among nodes within the focal-cluster of a type lead to the following functionality (I refers to an instance, T 1 refers to a type): 36 L. Shastri – Because of the link from +:I to ?:I, any assertion about an instance leads to a query or a search for a possible explanation of the assertion. – Because of the link from +v:T 1 to +e:T 1, any assertion about the type leads to the same assertion being made about an unspecified member of the type (e.g., “Humans are mortal” leads to “there exists a mortal human”).7 – Because of the link from +v:T 1 to ?v:T 1, any assertion about the whole type leads to a query or search for a possible explanation of the assertion (e.g., the assertion “Humans are mortal” leads to the query “Are humans mortal?”). – Because of the link from +e:T 1 to ?e:T 1, any assertion about an instance of the type leads to a query or search for a possible instance that would verify the assertion (e.g., the assertion “There is a human who is mortal” to the query “Is there is a human who is mortal?”). 
– Because of the link from ?e:T1 to ?v:T1, any query or search for an explanation about a member of the type leads to a query about the whole type (one way of determining whether "A human is mortal" is to find out whether "Humans are mortal").
– Moreover, paths formed by the above links lead to other behaviors. For example, given the path from +v:T1 to ?e:T1, any assertion about the whole type leads to a query or search for an explanation of the assertion applied to a given subtype/member of the type (e.g., "Humans are mortal" leads to the query "Is there a human who is mortal?").

Note that the closure between the "?" and "+" nodes is provided by the matching of facts (see Section 7).

3.2 The Interconnections Between Focal-Clusters of Instances and Types

The interconnections between nodes in the focal-clusters of instances and types lead to the following functionality:

– Because of the link from +v:T1 to +:I, any assertion about the type T1 leads to the same assertion about the instance I ("Humans are mortal" leads to "John is mortal").
– Because of the link from +:I to +e:T1, any assertion about I leads to the same assertion about a member of T1 ("John is mortal" leads to "A human is mortal").
– Because of the link from ?:I to ?v:T1, a query about I leads to a query about T1 as a whole (one way of determining whether "John is mortal" is to determine whether "Humans are mortal").
– Because of the link from ?e:T1 to ?:I, a query about a member of T1 leads to a query about I (one way of determining whether "A human is mortal" is to determine whether "John is mortal").

(Footnote 7: shruti infers the existence of a mortal human given that all humans are mortal, though this is not entailed in classical logic.)

Similarly, interconnections between sub- and supertypes lead to the following functionality.

– Because of the link from +v:T2 to +v:T1, any assertion about the supertype T2 leads to the same assertion about the subtype T1 ("Agents can cause change" leads to "Humans can cause change").
– Because of the link from +e:T1 to +e:T2, any assertion about a member of T1 leads to the same assertion about a member of T2 ("Humans are mortal" leads to "mortal agents exist").
– Because of the link from ?v:T1 to ?v:T2, a query about T1 as a whole leads to a query about T2 as a whole (one way of determining whether "Humans are mortal" is to determine whether "Agents are mortal").
– Because of the link from ?e:T2 to ?e:T1, a query about a member of T2 leads to a query about a member of T1 (one way of determining whether "an Agent is mortal" is to determine whether "a Human is mortal").

[Fig. 3. The encoding of types and (specific) instances, showing the focal-clusters of an instance I (John), its type T1 (Human) and a supertype T2 (Agent), with their connections to T-facts and E-facts. See text for details.]
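To summarize the connectivity just enumerated, the sketch below wires up the focal-clusters of one instance (John), its type (Human) and a supertype (Agent) as a small directed graph and spreads activation over it. The graph-and-loop rendering is our own simplification for illustration, not shruti's implementation.

```python
# A toy rendering of the links listed above. Spreading activation over these
# directed links reproduces behaviours such as "Humans are mortal" leading to
# "John is mortal" and to the corresponding queries.
from collections import deque

links = [
    ("+:John", "?:John"), ("+:John", "+e:Human"), ("?:John", "?v:Human"),
    ("+v:Human", "+e:Human"), ("+v:Human", "?v:Human"), ("+e:Human", "?e:Human"),
    ("?e:Human", "?v:Human"), ("+v:Human", "+:John"), ("?e:Human", "?:John"),
    ("+v:Agent", "+v:Human"), ("+e:Human", "+e:Agent"),   # supertype -> subtype, member -> supertype
    ("?v:Human", "?v:Agent"), ("?e:Agent", "?e:Human"),
]

def spread(seeds):
    """Return every node reachable from the seed nodes along directed links."""
    graph = {}
    for src, dst in links:
        graph.setdefault(src, []).append(dst)
    active, queue = set(seeds), deque(seeds)
    while queue:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in active:
                active.add(nxt)
                queue.append(nxt)
    return active

# Asserting something about the whole type Human also asserts it of John and of
# an unspecified Agent, and triggers the corresponding queries.
print(spread({"+v:Human"}))
```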
4 Encoding of Dynamic Bindings

The dynamic encoding of a relational instance corresponds to a rhythmic pattern of activity wherein bindings between roles and entities are represented by the synchronous firing of appropriate role and entity nodes. With reference to Figure 1, the rhythmic pattern of activity shown in Figure 4 is the dynamic representation of the relational instance (give: ⟨giver=John⟩, ⟨recipient=Mary⟩, ⟨give-object=a-Book⟩) (i.e., "John gave Mary a book"). Observe that the collector ensembles +:John, +:Mary and +e:Book are firing in distinct phases, but in phase with the roles giver, recip, and g-obj, respectively. Since +:give is also firing, the system is making an assertion. The dynamic representation of the query "Did John give Mary a book?" would be similar except that the enabler node would be active and not the collector node.

[Fig. 4. The rhythmic activity representing the dynamic bindings give(John, Mary, a-Book). Bindings are expressed by the synchronous activity of bound role and entity nodes.]

The rhythmic activity underlying the dynamic representation of relational instances is expected to be highly variable, but it is assumed that over short durations — ranging from a few hundred milliseconds to about a second — such activity may be viewed as being composed of k interleaved quasi-periodic activities where k equals the number of distinct entities filling roles in active relational instances. The period of this transient activity is at least k ∗ ω_int where ω_int is the window of synchrony, i.e., the amount by which two spikes can lead/lag and still be treated as being synchronous. As speculated in [22], the activity of role and entity cells engaged in dynamic bindings might correspond to γ band activity (∼ 40 Hz).

5 Mutual Exclusion and Collapsing of Phases

Instances in the type hierarchy can be part of a phase-level mutual exclusion cluster (ρ-mex cluster). The + node of every entity in a ρ-mex cluster sends inhibitory links to, and receives inhibitory links from, the + node of all other entities in the cluster. As a result of this mutual inhibition, only the most active entity within a ρ-mex cluster can remain active in any given phase. A similar ρ-mex cluster can be formed by +e: nodes of mutually exclusive types as well as +v: nodes of mutually exclusive types.

Another form of inhibitory interaction between siblings in the type hierarchy leads to an "explaining away" phenomenon in shruti. Let us illustrate this inhibitory interaction with reference to the type hierarchy shown in Figure 1. The link from +:John to +e:Human sends an inhibitory modifier to the link from ?e:Human to ?:Mary. Similarly, the link from +:Mary to +e:Human sends an inhibitory modifier to the link from ?e:Human to ?:John (such modifiers are not shown in the figure). Analogous connections exist between all siblings in the type hierarchy. As a result of such inhibitory modifiers, if ?e:Human propagates activity to ?:John and ?:Mary in phase ρ1, then the strong activation of +:John in phase ρ1 attenuates the activity arriving from ?e:Human into ?:Mary. In essence, the success of the query "Is it John?" in the context of the query "Is it human?" makes the query "Is it Mary?" unimportant. This use of inhibitory connections for explaining away is motivated by [2].

As discussed in Section 8, shruti supports the introduction of "new" phases during inference. In addition, shruti also allows multiple phases to coalesce into a single phase during inference. In the current implementation, such phase unification can occur under two circumstances. First, phase collapsing can occur whenever a single entity dominates multiple phases (for example, if the same entity comes to be the answer of multiple queries). Second, phase collapsing can occur if two unifiable instantiations of a relation arise within a focal-cluster.
For example, an assertion own(Mary, Book-17) alongside the query ∃ x:Book own(Mary,x)? (Does Mary own a book”) will result in a merging of the two phases for “a book” and “Book-17. Note that the type hierarchy will map the query ∃ x:Book own(Mary,x)? into own(Mary,Book-17)?, and hence, lead to a direct match between own(Mary,Book-17) and own(Mary,Book-17)?. 6 Priming: Associative Short-Term Potentiation of Weights Let I be an instance of T 1. If ?:I receives activity from ?e:T1 and concurrent activity from +:I, then the weight of the link from ?e:T1 to ?:I increases (i.e., gets potentiated) for a short-duration.8 Let T2 be a supertype of T1. If ?e:T1 receives activity from ?e:T2, and concurrent activity from +e:T1, then the weight of the link from ?e:T2 to ?e:T1 also increases for a short-duration. Analogous weight increases can occur along the link from ?v:T1 to ?v:T2 if ?v:T2 receives concurrent activity from +v:T2 and ?v:T1. Similarly, the weight of the link from ?:I to ?v:T1 can undergo a short-term increase if ?v:T1 receives concurrent activity from +v:T1 and ?:I.9 8 9 This is modeled after the biological phenomena of short-term potentiation (STP) [6]. In principle, short-term weight increases can occur along the link from +e:T1 to +e:T2 if +e:T2 receives concurrent activity from +v:T2 and +e:T1. Similarly, the weight of the link from +:I to +e:T1 can undergo a short-term increase, if +e:T1 receives concurrent activity from +v:T1 and +:I. 40 L. Shastri The potentiation of link weights can affect the system’s response time as well as the response itself. Let us refer to an entity whose incoming links are potentiated as a “primed” entity. Since a primed entity would become active sooner than an unprimed entity, a query whose answer is a primed entity would be answered faster (all else being equal). Furthermore, all else being equal, a primed entity would dominate an unprimed entity in a ρ-mex cluster, and hence, if a primed and an unprimed entity compete to be the filler of a role, the primed entity would emerge as the role-filler. 7 Facts in Long-Term Memory: E-Facts and T-Facts Currently shruti encodes two types of relational instances (i.e., facts) in its long-term memory (LTM): episodic facts (E-Facts) and taxon facts (T-facts). While an E-fact corresponds to a specific instance of a relation, a T-fact corresponds to a distillation or statistical summary of various instances of a relation (e.g., “Days tend to be hot in June”). An E-fact E1 associated with a relation P becomes active whenever all the dynamic bindings specified in the currently active instantiation of P match those encoded in E1 . Thus an E-fact is sensitive to any mismatch between the bindings it encodes and the currently active dynamic bindings. In contrast, a T-fact is sensitive only to matches between its bindings and the currently active dynamic bindings. Note that both E- and Tfacts tolerate missing bindings, and hence, respond to partial cues. The encoding of E-facts is described below – the encoding of T-facts is described in [20]. Figure 5 illustrates the encoding of E-facts love(John, Mary) and ¬love(Tom, Susan). Each E-fact is encoded using a distinct fact node (these are labeled F1 and F2 in Figure 5). A fact node sends a link to the + or – collector of the relation depending on whether the fact encodes a positive or a negative assertion. Given the query love(John,Mary)? the E-fact node F1 will become active and activate +:love, +:John and +:Mary nodes indicating a “yes” answer to the question. 
Similarly, given the query love(Tom,Susan)?, the E-fact node F2 will become active and activate –:love, +:Tom and +:Susan nodes indicating a "no" answer to the query. Finally, given the query love(John,Susan)?, neither +:love nor –:love would become active, indicating that the system can neither affirm nor deny whether John loves Susan (the nodes +:John and +:Susan will also not receive any activation).

Types can also serve as role-fillers in E-facts (e.g., Dog in "Dogs chase cats") and so can unspecified instances of a type (e.g., a dog in "a dog bit John"). Such E-facts are encoded by using the appropriate nodes in the focal-cluster for "Dog". In general, if an existing instance, I, is a role-filler in a fact, then ?:I provides the input to the fact cluster and +:I receives inputs from the binder node in the fact cluster. If the whole type T is a role-filler in a fact, then ?v:T provides the input to the fact cluster and +v:T receives inputs from the binder node in the fact cluster. If an unspecified instance of type T is a role-filler in a long-term fact, then a new instance of type T is created and its "?" and "+" nodes are used to encode the fact.

[Fig. 5. (a) The encoding of E-facts: love(John,Mary) and ¬love(Tom,Susan). The pentagon shaped nodes are "fact" nodes and are of type τ-and. The dark blobs denote inhibitory modifiers. The firing of a role node without the synchronous firing of the associated filler node blocks the activation of the fact node. Consequently, the E-fact is blocked whenever there is a mismatch between the dynamic binding of a role and its binding specified in the E-fact. (b) Links from the fact node back to role-fillers are shown only for the fact love(John,Mary) to avoid clutter. The circular nodes are m-ρ nodes with a high threshold which is satisfied only when both the role node and the fact node are firing. Consequently, a binder node fires in phase with the associated role node, if the fact node is firing. Weights α1 and α2 indicate strengths of belief.]

8 Encoding of Rules

A rule is encoded via a mediator focal-cluster that mediates the flow of activity and bindings between antecedent and consequent clusters (mediators are depicted as parallelograms in Figure 1). A mediator consists of a single collector (+), an enabler (?), and as many role-instantiation nodes as there are distinct variables in the rule. A mediator establishes links between nodes in the antecedent and consequent clusters as follows: (i) The roles of the consequent and antecedent relation(s) are linked via appropriate role-instantiation nodes in the mediator. This linking reflects the correspondence between antecedent and consequent roles specified in the rule. (ii) The enabler of the consequent is connected to the enabler of the antecedent via the enabler of the mediator. (iii) The appropriate (+/–) collector of the antecedent relation is linked to the appropriate (+/–) collector of the consequent relation via the collector of the mediator. A collector to collector link originates at the + (–) collector of an antecedent relation if the relation appears in its positive (negated) form in the antecedent.
The link terminates at the + (–) collector of the consequent relation if the relation appears in a positive (negated) form in the consequent [Footnote 10: The design of the mediator was motivated, in part, by discussions the author had with Jerry Hobbs].

Consider the encoding of the following rule in Figure 1:
∀ x:agent y:agent z:thing give(x,y,z) ⇒ own(y,z) [800,800]
This rule is encoded via the mediator, med1, containing three role-instantiation nodes r1, r2, and r3. The weight on the link from ?:med1 to ?:give indicates the degree of evidential support for give being the probable cause (or explanation) of own. The weight on the link from +:med1 to +:own indicates the degree of evidential support for own being a probable effect of give. These strengths are defined on a non-linear scale ranging from 0 to 1000.

A role-instantiation node is an abstraction of a neural circuit with the following functionality. If a role-instantiation node receives activation from the mediator enabler and one or more consequent role nodes, it simply propagates the activity onward to the connected antecedent role nodes. If, on the other hand, the role-instantiation node receives activity only from the mediator enabler, it sends activity to the ?e node of the type specified in the rule as the type restriction for this role. This causes the ?e node of this type to become active in an unoccupied phase [Footnote 11: A similar phase-allocation mechanism is used in [1] for realizing function terms. Currently, an unoccupied phase is assigned in software, but eventually this will result from inhibitory interactions between nodes in the type hierarchy]. The ?e node of the type conveys activity in this phase to the role-instantiation node which in turn propagates this activity to connected antecedent role nodes. The links between role-instantiation nodes and nodes in the type hierarchy have not been shown in Figure 1. shruti can encode rules involving multiple antecedents and consequents (see [20]). Furthermore, shruti allows a bounded number of instantiations of the same predicate to be simultaneously active during inference [14].

9 An Example of Inference

Figure 2 depicts a schematized response of the shruti network shown in Figure 1 to the query “Does Mary own a book?” (∃ x:Book own(Mary, x)?). This query is posed by activating ?:Mary and ?e:book nodes, the role nodes owner and o-obj, and the enabler ?:own, as shown in Figure 2. We will refer to the phases of activation of ?:Mary and ?e:book as ρ1 and ρ2, respectively. Activation from the focal-cluster for own reaches the mediator structure of rules (1) and (2). Consequently, nodes r2 and r3 in the mediator med1 become active in phases ρ1 and ρ2, respectively. Similarly, nodes s1 and s2 in the mediator med2 become active in phases ρ1 and ρ2, respectively. At the same time, the activation from ?:own activates the enablers ?:med1 and ?:med2 in the two mediators. Since r1 does not receive activation from any of the roles in its consequent’s focal-cluster (own), it activates the node ?e:agency in the type hierarchy in a free phase (say ρ3). The activation from nodes r1, r2 and r3 reaches the roles giver, recip and g-obj in the give focal-cluster, respectively. Similarly, activation from nodes s1 and s2 reaches the roles buyer and b-obj in the buy focal-cluster, respectively. In essence, the system has created new bindings for give and buy wherein giver is
bound to an undetermined agent, recipient is bound to Mary, g-obj is bound to a book, buyer is bound to Mary, and b-obj is bound to a book. These bindings together with the activation of the enabler nodes ?:give and ?:buy encode two new queries: “Did some agent give Mary a book?” and “Did Mary buy a book?”. At the same time, activation travels in the type hierarchy and thereby maps the query to a large number of related queries such as “Did a human give Mary a book?”, “Did John give Mary Book-17?”, “Did Mary buy all books?” etc. The E-fact give(John, Mary, Book-17) now becomes active as a result of matching the query give(John, Mary, Book-17)? and causes +:give to become active. This in turn causes +:med1 to become active and transmit activity to +:own. This results in an affirmative answer to the query and creates a reverberant loop of activity involving the clusters own, med1, give, the fact node F1, and the entities John, Mary, and Book-17.

10 Conclusion

The type structure described above, together with other enhancements such as support for negation, priming, and evidential rules, allows shruti to support a rich set of inferential behaviors, and perhaps sheds some light on the nature of symbolic neural representations. shruti identifies a number of constraints on the representation and processing of relational knowledge and predicts the capacity of the active (working) memory underlying reflexive reasoning [17][22]. First, on the basis of neurophysiological data pertaining to the occurrence of synchronous activity in the γ band, shruti leads to the prediction that a large number of facts (relational instances) can be active simultaneously and a large number of rules can fire in parallel during an episode of reflexive reasoning. However, the number of distinct entities participating as role-fillers in these active facts and rules must remain very small (≈ 7). Recent experimental findings as well as computational models lend support to this prediction (e.g., [12][13]). Second, since the quality of synchronization degrades as activity propagates along a chain of cell clusters, shruti predicts that as the depth of inference increases, binding information is gradually lost and systematic inference reduces to a mere spreading of activation. Thus shruti predicts that reflexive reasoning has a limited inferential horizon. Third, shruti predicts that only a small number of instances of any given relation can be active simultaneously.

A number of issues remain open. These include the encoding of rules and facts involving complex nesting of quantifiers. While the current implementation supports multiple existential and universal quantifiers, it does not support the occurrence of existential quantifiers within the scope of a universal quantifier. Also, the current implementation does not support the encoding of complex types such as radial categories [10]. Another open issue is the learning of new relations and rules (mappings). In [18] it is shown that a recurrent network can learn rules involving variables and semantic restrictions using gradient-descent learning. While this work serves as a proof of concept, it does not address issues of scaling and catastrophic interference. Several researchers are pursuing solutions to the problem of learning in the context of language acquisition (e.g., [16][4][7]). In collaboration with M. Cohen, B. Thompson, and C.
Wendelken, the author is also augmenting shruti to integrate the propagation of belief with the propagation of utility. The integrated system will be capable of seeking explanations, making predictions, instantiating goals, constructing reactive plans, and triggering actions that maximize the system’s expected future utility.

Acknowledgment
This work was partially funded by grants NSF SBR-9720398 and ONR N00014-93-1-1149, and subcontracts from Cognitive Technologies Inc. related to contracts ONR N00014-95-C-0182 and ARI DASW01-97-C-0038. Thanks to M. Cohen, J. Feldman, D. Grannes, J. Hobbs, D.R. Mani, B. Thompson, and C. Wendelken.

References
1. Ajjanagadde, V.: Reasoning with function symbols in a connectionist network. In the Proceedings of the 12th Conference of the Cognitive Science Society, Cambridge, MA. (1990) 285–292.
2. Ajjanagadde, V.: Abductive reasoning in connectionist networks: Incorporating variables, background knowledge, and structured explananda, Technical Report WSI 91-6, Wilhelm-Schickard Institute, University of Tübingen, Germany (1991).
3. Ajjanagadde, V., Shastri, L.: Efficient inference with multi-place predicates and variables in a connectionist network. In the Proceedings of the 11th Conference of the Cognitive Science Society, Ann Arbor, MI (1989) 396–403.
4. Bailey, D., Chang, N., Feldman, J., Narayanan, S.: Extending Embodied Lexical Development. In the Proceedings of the 20th Conference of the Cognitive Science Society, Madison, WI. (1998) 84–89.
5. Barnden, J., Srinivas, K.: Encoding Techniques for Complex Information Structures in Connectionist Systems. Connection Science, 3, 3 (1991) 269–315.
6. Bliss, T.V.P., Collingridge, G.L.: A synaptic model of memory: long-term potentiation in the hippocampus. Nature 361, (1993) 31–39.
7. Gasser, M., Colunga, E.: Where Do Relations Come From? Indiana University Cognitive Science Program, Technical Report 221, (1998).
8. Hobbs, J.R., Stickel, M., Appelt, D., Martin, P.: Interpretation as Abduction, Artificial Intelligence, 63, 1-2, (1993) 69–142.
9. Hummel, J. E., Holyoak, K.J.: Distributed representations of structure: a theory of analogical access and mapping. Psychological Review, 104, (1997) 427–466.
10. Lakoff, G.: Women, Fire, and Dangerous Things — What categories reveal about the mind, University of Chicago Press, Chicago (1987).
11. Lange, T. E., Dyer, M. G.: High-level Inferencing in a Connectionist Network. Connection Science, 1, 2 (1989) 181–217.
12. Lisman, J. E., Idiart, M. A. P.: Storage of 7 ± 2 Short-Term Memories in Oscillatory Subcycles. Science, 267 (1995) 1512–1515.
13. Luck, S. J., Vogel, E. K.: The capacity of visual working memory for features and conjunctions. Nature 390 (1997) 279–281.
14. Mani, D.R., Shastri, L.: Reflexive Reasoning with Multiple-Instantiation in a Connectionist Reasoning System with a Typed Hierarchy, Connection Science, 5, 3&4, (1993) 205–242.
15. Park, N.S., Robertson, D., Stenning, K.: An extension of the temporal synchrony approach to dynamic variable binding in a connectionist inference system. Knowledge-Based Systems, 8, 6 (1995) 345–358.
16. Regier, T.: The Human Semantic Potential: Spatial Language and Constrained Connectionism, MIT Press, Cambridge, MA, (1996).
17. Shastri, L.: Neurally motivated constraints on the working memory capacity of a production system for parallel processing. In the Proceedings of the 14th Conference of the Cognitive Science Society, Bloomington, IN (1992) 159–164.
18. Shastri, L.: Exploiting temporal binding to learn relational rules within a connectionist network. TR-97-003, International Computer Science Institute, Berkeley, CA, (1997).
19. Shastri, L.: A Model of Rapid Memory Formation in the Hippocampal System, In the Proceedings of the 19th Annual Conference of the Cognitive Science Society, Stanford University, CA, (1997) 680–685.
20. Shastri, L.: Advances in shruti — A neurally motivated model of relational knowledge representation and rapid inference using temporal synchrony. Applied Intelligence, 11 (1999) 79–108.
21. Shastri, L.: Recruitment of binding and binding-error detector circuits via long-term potentiation. Neurocomputing, 26-27 (1999) 865–874.
22. Shastri, L., Ajjanagadde V.: From simple associations to systematic reasoning: A connectionist encoding of rules, variables and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences, 16:3 (1993) 417–494.
23. Shastri, L., Grannes, D.J.: A connectionist treatment of negation and inconsistency. In the Proceedings of the 18th Conference of the Cognitive Science Society, San Diego, CA, (1996).
24. Shastri, L., Grannes, D.J., Narayanan, S., Feldman, J.A.: A Connectionist Encoding of Schemas and Reactive Plans. In Hybrid Information Processing in Adaptive Autonomous Vehicles, G.K. Kraetzschmar and G. Palm (Eds.), Lecture Notes in Computer Science, Springer-Verlag, Berlin (To appear).
25. Shastri, L., Wendelken, C.: Knowledge Fusion in the Large – taking a cue from the brain. In the Proceedings of the Second International Conference on Information Fusion, FUSION’99, Sunnyvale, CA, July (1999) 1262–1269.
26. Singer, W.: Synchronization of cortical activity and its putative role in information processing and learning. Annual Review of Physiology 55 (1993) 349–74.
27. Sun, R.: On variable binding in connectionist networks. Connection Science, 4, 2 (1992) 93–124.
28. von der Malsburg, C.: Am I thinking assemblies? In Brain Theory, ed. G. Palm & A. Aertsen. Springer-Verlag (1986).

A Recursive Neural Network for Reflexive Reasoning

Steffen Hölldobler (1), Yvonne Kalinke (2)⋆, and Jörg Wunderlich (3)⋆⋆
(1) Dresden University of Technology, Dresden, Germany
(2) Queensland University of Technology, Brisbane, Australia
(3) Neurotec Hochtechnologie GmbH, Friedrichshafen, Germany
[⋆: The author acknowledges support from the German Academic Exchange Service (DAAD) under grant no. D/97/29570. ⋆⋆: The results reported in this paper were achieved while the author was at the Dresden University of Technology.]

Abstract. We formally specify a connectionist system for generating the least model of a datalogic program which uses linear time and space. The system is shown to be sound and complete if only unary relation symbols are involved and complete but unsound otherwise. For the latter case a criterion is defined which guarantees correctness. Finally, we compare our system to the forward reasoning version of Shruti.

1 Introduction

Connectionist systems exhibit many desirable properties of intelligent systems like, for example, being massively parallel, context–sensitive, adaptable and robust (see eg. [10]). It is strongly believed that intelligent systems must also be able to represent and reason about structured objects and structure–sensitive processes (see eg. [12,25]). Unfortunately, we are unaware of any connectionist system which can handle structured objects and structure–sensitive processes in a satisfying way. Logic systems were designed to cope with such objects and processes and, consequently, it is a long–standing research goal to combine the advantages of connectionist and logic systems in a single system. There have been many results on such a combination which involves propositional logic (cf. [24,26]).
In [15] we have shown that a three–layered feed forward network of binary threshold units can be used to compute the meaning function of a logic program. The input and output layer of such a network consists of a vector of units, each representing a propositional letter. The activation pattern of these layers represents an interpretation I with the understanding that the unit representing the propositional letter p is active iff p is true under I. For certain classes of logic programs it is well–known that they admit a least model, and that this model can be computed as the least fixed point of the program’s meaning function applied to an arbitrary initial interpretation [1,11]. To compute the least fixed point in such cases, the feed forward network mentioned in the previous paragraph is turned into a recurrent one by connecting each unit in the output layer to the corresponding unit in the input layer. We were able to show — among other results — that such so-called Rnns, ie. recursive neural networks with a feed forward kernel, converge to a stable state which represents the least model of the corresponding logic program. Moreover, in [6] it was shown that the binary threshold units in the hidden layer of the kernel can be replaced by units with sigmoidal activation function. Consequently, the networks can be trained by backpropagation and after training new refined program clauses (or rules) can be extracted using, for example, the techniques presented in [31]. Altogether, this is a good example of how the properties of logic systems can be combined with the inherent properties of connectionist systems.

Unfortunately, this does not solve the aforementioned problem because structured objects and structure–sensitive processes cannot be modeled within propositional logic but only within first- and higher–order logics. For these logics, however, similar results combining connectionist and logic systems are not known. One of the problems is that as soon as the underlying alphabet contains a single non–nullary function symbol and a single constant, then there are infinitely many ground atoms which cannot be represented locally in a connectionist network. In [17,18] we have shown that for certain classes of first–order logic programs interpretations can be mapped onto real numbers such that the program’s meaning function can be encoded as a continuous function on the real numbers. Applying a result from [13] we conclude that three–layered feed forward networks with sigmoidal activation function for the units occurring in the hidden layer and linear activation function for the units occurring in the input and output layer can approximate the meaning function of first–order logic programs arbitrarily well. Moreover, turning this feed forward kernel into an Rnn, the Rnn computes an approximation of the least fixed point, i.e. the least model, of a given logic program. The notion of an approximation is based on a distance function between interpretations such that — loosely speaking — the distance is inversely proportional to the number of atoms on which both interpretations agree.
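The construction from [15] sketched in the previous paragraph can be made concrete with a small example. The following sketch is our own illustration, not the authors’ code (the function names are invented): the kernel contains one binary threshold unit per clause whose threshold equals the number of body atoms, and the recurrent loop feeds the output layer back to the input layer until the activation pattern stops changing, i.e. until it represents the least model.

```python
# Sketch of the RNN-with-feed-forward-kernel idea of [15] (illustrative only).
# A propositional program is a list of clauses (head, [body atoms]).
# The input/output layer is a 0/1 vector over all propositional letters.

def kernel_step(program, letters, activation):
    """One pass through the feed-forward kernel: computes T_P(I)."""
    out = {p: 0 for p in letters}
    for head, body in program:
        # the clause unit fires iff every body letter is active (threshold = len(body))
        if sum(activation[b] for b in body) >= len(body):
            out[head] = 1
    return out

def least_model(program):
    """Recurrent iteration: feed the output layer back until a stable state."""
    letters = {a for h, b in program for a in [h, *b]}
    activation = {p: 0 for p in letters}          # start from the empty interpretation
    while True:
        new = kernel_step(program, letters, activation)
        if new == activation:                     # stable state = least fixed point
            return {p for p, v in new.items() if v}
        activation = new

# Example: P = {p., q <- p, r <- q ∧ s} has least model {p, q}.
P = [("p", []), ("q", ["p"]), ("r", ["q", "s"])]
print(least_model(P))   # -> {'p', 'q'}
```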
Unfortunately, the result reported in [17,18] is purely theoretical and we have not yet developed a real connectionist system which uses it. One of the main obstacles for doing so is that we need to find a connectionist representation for terms. There are various alternatives:
• We may use a structured connectionist network as in [14]: In this case the network is completely local, all computations like, for example, unification can be performed, but it is not obvious at all how such networks can be learned. The structure is by far too complex for current learning algorithms based on the recruitment paradigm [9].
• We may use a vector of fixed length to represent terms as in the recursive auto–associative memory [28], the labeling recursive auto–associative memory [30] or in the memory based on holographic reduced representations [27]. Unfortunately, in extensive tests none of these proposals has led to satisfying results: The systems could not safely store and recall terms of depth larger than five [20].
• We may use hybrid systems, where terms are represented and manipulated in a conventional way. But this is not a kind of integration that we were hoping for because in this case results from connectionist systems cannot be applied to the conventional part.
• We may use connectionist encodings of conventional data structures like counters and stacks [16,21], but currently the models are still too simple.
• We may use a phase–coding to bind constants to terms as suggested in Shruti [29]: In this case we restrict our first–order language to contain only constants and multi–place relation symbols.
Considering this current state of the art in connectionist term representations we propose in this paper to extend our connectionist model developed in [15] to handle constants and multi–place relations by turning the units into phase–sensitive ones and solving the variable binding problem as suggested in Shruti. Because our system generates models for logic programs in a forward reasoning manner, it is necessary to consider the version of Shruti where forward reasoning is performed. There are three main difficulties with such a Shruti system:
• In almost any derivation more than one copy of the rules is needed, which leads to sequential processing.
• The structure of the system is quite complex: there are many different types of units with a complex connection structure, and it is not clear at all how such structures can be learned.
• The logical foundation of the system has not been developed yet.
In [2] we have developed a logical calculus for the backward reasoning version of Shruti by showing that reflexive reasoning as performed by Shruti is nothing but reasoning by reductions in a conventional, but parallel logic system based on the connection method [3]. In this paper we develop a logic system for forward reasoning in a first–order calculus with constants and multi–place relations and specify a recurrent neural network implementing this system. The logic system is again based on the connection method using reduction techniques which link logic systems to database systems. We define a calculus called Bur (for bottom–up reductions) which has the following properties:
• For unary relation symbols the calculus is sound and complete.
• For relation symbols with an arity larger than one the calculus is complete but not necessarily sound.
• We develop a criterion which guarantees that the results achieved in the case where the relation symbols have an arity larger than one are sound.
• Computations require only linear parallel time and linear parallel space.
Furthermore, we extend the feed forward neural networks developed in [15] by turning the units into phase–sensitive ones. We formally show that the Bur calculus can be implemented in these networks. Compared to Shruti our networks consist only of two types of phase–sensitive units. The connection structure is an Rnn with a four–layered feed forward neural network as kernel and, thus, is considerably simpler than the connection structure of Shruti. Besides giving a rigorous formal treatment of the logic underlying Shruti if run in a forward reasoning manner, this line of research may also lead to networks for reflexive reasoning, which can be trained using standard techniques like backpropagation.

The paper is organized as follows: In the following Section 2 we repeat some basic notions, notations and results concerning logic programming and reflexive reasoning. The Bur calculus is formally defined in Section 3. Its connectionist implementation is developed in Section 4. The properties of the implementation and its relation to the Shruti system are discussed in Sections 5 and 6 respectively. In the final Section 7 we discuss our results and point out future research.

2 Logic Programs and Reflexive Reasoning

We assume the reader to have some background in logic programming and deduction systems (see eg. [23,5]) as well as in connectionist systems and, in particular, in the Shruti system [29]. Thus, in this section we will just briefly repeat the basic notions, notations and results.

A (definite) logic program is a set of clauses, ie. universally closed formulas [Footnote 1: ie. all variables are assumed to be universally closed] of the form A ← A1 ∧ . . . ∧ An, where A, Ai, 1 ≤ i ≤ n, are first–order atoms. A and A1 ∧ . . . ∧ An are called head and body respectively. A clause of a logic program is said to be a fact if its body is empty and its head does not contain any occurrences of a variable; otherwise it is called a rule. A logic program is said to be a datalogic program if all function symbols occurring in the program are nullary, ie. if all function symbols are constants. For example, the database in Shruti is a datalogic program [Footnote 2: To be precise, existentially bound variables in a Shruti database must be replaced by new constants (see [2])].

Definite logic programs enjoy many nice properties, among which is the one that each program P admits a least model. This model contains precisely all the logical consequences of the program. Moreover, it can be computed iteratively as the least fixed point of a so–called meaning function TP which is defined on interpretations I as
TP(I) = {A | there exists a ground instance A ← A1 ∧ . . . ∧ An of a clause in P such that {A1, . . . , An} ⊆ I},
where an interpretation is a set of ground atoms. In case of a datalogic program P over a finite set of constants, the least fixed point of TP can be computed in finite, albeit exponential time (in the worst case) with respect to the size of P. Following the argumentation in [29], datalogic programs are thus unsuitable to model reflexive reasoning. Only by imposing additional conditions on the syntactic structure of datalogic programs as well as on their runtime behavior, it was possible to show that a backward reasoning version of Shruti is able to answer questions in linear time.
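As an illustration of the meaning function just defined (our own sketch, not code from the paper; the helper names are invented), TP for a datalogic program can be computed by grounding each clause over the finite set of constants and collecting the heads of ground instances whose bodies are contained in the current interpretation. The example uses the knowledge base P1 introduced in the next section.

```python
from itertools import product

# A datalogic clause is (head, body) where each atom is (pred, (term, ...));
# terms are constants (lower-case strings) or variables (upper-case strings).

def is_var(t):
    return t[0].isupper()

def ground(atom, subst):
    pred, args = atom
    return (pred, tuple(subst.get(t, t) for t in args))

def tp(program, constants, interpretation):
    """One application of the meaning function T_P to an interpretation."""
    result = set()
    for head, body in program:
        variables = sorted({t for _, args in [head, *body] for t in args if is_var(t)})
        for values in product(constants, repeat=len(variables)):
            subst = dict(zip(variables, values))
            if all(ground(a, subst) in interpretation for a in body):
                result.add(ground(head, subst))
    return result

# Facts p(a,b), p(c,d), q(a,c), q(b,c) and the rule r(X,Y,Z) <- p(X,Y) ∧ q(Y,Z).
P1 = [(("p", ("a", "b")), []), (("p", ("c", "d")), []),
      (("q", ("a", "c")), []), (("q", ("b", "c")), []),
      (("r", ("X", "Y", "Z")), [("p", ("X", "Y")), ("q", ("Y", "Z"))])]
I = tp(P1, ["a", "b", "c", "d"], set())          # the facts
I = tp(P1, ["a", "b", "c", "d"], I)              # adds r(a,b,c)
print(("r", ("a", "b", "c")) in I)               # -> True
```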
3 Bottom–Up Reductions: The BUR Calculus

In this section we develop a new calculus called Bur based on the idea of applying reduction techniques to a given knowledge base in a bottom–up manner, whereby the reduction techniques can be efficiently applied in parallel. We are particularly interested in reduction techniques which can be applied in linear time and space.

Let C be a finite set of constant symbols and R a finite set of relation symbols [Footnote 3: Throughout the paper we will make use of the following notational conventions: a, b, . . . denote constants, p, q, . . . relation symbols and X, Y, . . . variables]. A Bur knowledge base P is a number of formulas that are either facts or rules. Thus, a Bur knowledge base is simply a datalogic program. Before turning to the definition of the reduction techniques we consider a knowledge base P1 with the facts
p(a, b) and p(c, d)   (1)
as well as
q(a, c) and q(b, c)   (2)
and a single rule
r(X, Y, Z) ← p(X, Y) ∧ q(Y, Z),   (3)
where C = {a, b, c, d} is the set of constants, R = {p, q, r} the set of relation symbols and X, Y, Z are variables.

Using a technique known as database (or DB) reduction in the connection method (see [4]) the facts in (1) can be equivalently replaced by
p(X, Y) ← (X, Y) ∈ {(a, b), (c, d)}   (4)
and, likewise, the facts in (2) can be equivalently replaced by
q(X, Y) ← (X, Y) ∈ {(a, c), (b, c)}.   (5)
Although the transformations seem to be straightforward they have the desired side–effect that there is now only one possibility to satisfy the conditions p(X, Y) and q(Y, Z) in the body of rule (3), viz. by using (4) and (5) respectively. Technically speaking, there is an isolated connection between p(X, Y) occurring in the body of (3) and the head of (4) and, likewise, between q(Y, Z) occurring in the body of (3) and the head of (5) [4]. Such isolated connections can be evaluated. Applying the corresponding reduction technique yields
r(X, Y, Z) ← (X, Y, Z) ∈ π1,2,4(p ⋈p/2=q/1 q),   (6)
where ⋈ denotes the (natural equi-) join of the relations p and q, p/2 = q/1 denotes the constraint that the second argument of the relation p should be identical to the first argument of q and π1,2,4(s) denotes the projection of the relation s to the first, second and fourth argument. Evaluating the database operations occurring in equation (6) leads to the reduced expression
r(X, Y, Z) ← (X, Y, Z) ∈ {(a, b, c)}.   (7)

In general, after applying database reductions to facts the evaluation of isolated connections between the reduced facts and the atoms occurring in the body of rules leads to expressions containing the database operations union (∪), intersection (∩), projection (π), Cartesian product (⊗) and join (⋈). These are the standard operations of a relational database (see eg. [32]). The most costly operation is the join, which in the worst case requires exponential space and time with respect to the number of arguments of the involved relations and the number of atoms occurring in the body of a rule. Because it is our goal to set up a calculus which allows reasoning within linear time and space boundaries, we must avoid the join operation.
This can be achieved if we replace database reductions by so–called pointwise database reductions: In our example, the facts in (1) and (2) are replaced by
p(X, Y) ← X ∈ {a, c} ∧ Y ∈ {b, d}   (8)
and
q(X, Y) ← X ∈ {a, b} ∧ Y ∈ {c}   (9)
respectively. After evaluating isolated connections (3) now becomes
r(X, Y, Z) ← X ∈ π1(p) ∧ Y ∈ π2(p) ∩ π1(q) ∧ Z ∈ π2(q),   (10)
which can be further evaluated to
r(X, Y, Z) ← X ∈ {a, c} ∧ Y ∈ {b} ∧ Z ∈ {c}.   (11)
In general, the use of pointwise database reductions instead of database reductions leads to expressions involving only the database operations union, intersection, projection and Cartesian product, all of which can be computed in linear time and space using an appropriate representation. The drawback of this approach is that now so–called spurious tuples may occur in relations. For example, according to (11) not only (a, b, c) is in relation r (as in (7)) but also (c, b, c). r(a, b, c) is a logical consequence of the example knowledge base, whereas r(c, b, c) is not. It is easy to see that spurious tuples may occur only if multiplace relation symbols are involved. We will come back to this problem later in this section.

After this introductory example we can now formally define the reduction rules of the Bur calculus. One should keep in mind that these rules are used to compute the least fixed point of TP for a given Bur database P. Without loss of generality we may assume that the head of each rule contains only variable occurrences: Any occurrence of a constant c in the head of a rule may be replaced by a new variable X if the condition X ∈ {c} is added to the body of the rule [Footnote 4: This is called the homogeneous form in [8]]. A similar transformation can be applied to facts, ie. each fact of the form p(c1, . . . , cn) can be replaced by p(X1, . . . , Xn) ← ⋀_{i=1}^{n} Xi ∈ {ci}. We will call such expressions generalized facts.

The Bur calculus contains the following two rules:
• Pointwise DB reduction: Let p(X1, . . . , Xn) ← ⋀_{i=1}^{n} Xi ∈ Ci and p(X1, . . . , Xn) ← ⋀_{i=1}^{n} Xi ∈ Di be two generalized facts in P. Replace these facts by p(X1, . . . , Xn) ← ⋀_{i=1}^{n} Xi ∈ Ci ∪ Di.
• Evaluation of isolated connections: Let C be the set of constants and
p(X1, . . . , Xm) ← ⋀_{i=1}^{n} pi(ti1, . . . , tiki)   (12)
be a rule in P such that there are also generalized facts of the form pi(Yi1, . . . , Yiki) ← ⋀_{l=1}^{ki} Yil ∈ Cil, 1 ≤ i ≤ n, in P. Let Dj = ⋂{Cil | Xj occurs at the l-th position in pi(ti1, . . . , tiki)} for each variable Xj occurring in (12). If Dj ≠ ∅ for each j, then add the following generalized fact to P: p(X1, . . . , Xm) ← ⋀_{i=1}^{m} Xi ∈ C ∩ Di.
One should observe that by evaluating isolated connections generalized facts will be added to a Bur knowledge base P. Because pointwise DB reductions do not decrease the number of facts and C is finite, this process will eventually terminate in that no new facts are added. Let M denote the largest set of facts obtained from P by applying the Bur reduction rules.

Theorem 1. 1. The Bur calculus is sound and complete if all relation symbols occurring in P are unary, ie. M is precisely the least model of P. 2. The Bur calculus is complete but not necessarily sound if there are multiplace relation symbols in P, ie. M is a superset of the least model of P.

The proof of this theorem can be found in [33].
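The two reduction rules defined above can be phrased operationally. In the sketch below (our own illustration with invented names, not the authors’ implementation), a generalized fact is a list of per-argument constant sets; pointwise DB reduction merges two generalized facts of the same relation by a componentwise union, and the evaluation of an isolated connection intersects, for every head variable, the constant sets contributed by the body atoms in which that variable occurs (for brevity the sketch checks non-emptiness only for head variables).

```python
# Generalized fact for an n-ary relation p: a list of n constant sets,
# read as p(X1,...,Xn) <- X1 in C1 ∧ ... ∧ Xn in Cn.

def pointwise_db_reduction(gf1, gf2):
    """Merge two generalized facts of the same relation componentwise."""
    return [c | d for c, d in zip(gf1, gf2)]

def evaluate_isolated_connections(rule, gfacts, constants):
    """rule = (head_vars, body) with body = [(pred, arg_terms), ...].
    Returns a new generalized fact for the rule's head, or None if some
    head variable ends up with an empty binding set."""
    head_vars, body = rule
    new_gf = []
    for var in head_vars:
        d = set(constants)
        for pred, args in body:
            for pos, term in enumerate(args):
                if term == var:
                    d &= gfacts[pred][pos]       # intersect the sets C_il
        if not d:
            return None
        new_gf.append(d)
    return new_gf

# Example P1: generalized facts (8) and (9), rule (3) r(X,Y,Z) <- p(X,Y) ∧ q(Y,Z).
gfacts = {"p": [{"a", "c"}, {"b", "d"}], "q": [{"a", "b"}, {"c"}]}
rule = (["X", "Y", "Z"], [("p", ["X", "Y"]), ("q", ["Y", "Z"])])
print(evaluate_isolated_connections(rule, gfacts, {"a", "b", "c", "d"}))
# -> [{'a', 'c'}, {'b'}, {'c'}]   i.e. expression (11), including the spurious tuple (c, b, c)
```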
The first part of Theorem 1 confirms the fact that considering unary relation symbols and a finite set of constants neither extends the expressive power of a Bur knowledge base compared to propositional Horn logic nor affects the time and space requirements for computing the minimal model of a program (see [7]). Because the reduction techniques in the Bur calculus can be applied in linear time and space the minimal model of such a Bur knowledge base can be computed in linear time and space as well.

The second part of Theorem 1 confirms the fact that considering multiplace relation symbols and a finite set of constants does not change the expressive power of a Bur knowledge base compared to propositional Horn logic but does affect the time and space requirements. Turning a Bur knowledge base into an equivalent propositional logic program may lead to exponentially more rules and facts. Hence, the best we can hope for if we apply reduction techniques bottom–up and in linear time and space is a pruning of the search space. In the worst case the pruning can be neglected. In some cases however the application of the reduction techniques may lead to considerable savings. Such a beneficial case is characterized in the following theorem.

Theorem 2. If after d applications of the reduction techniques all relations have at most one argument for which more than one binding is generated, then all facts derived so far are logical consequences of the knowledge base.

In other words, the precondition of this theorem defines a correctness criterion in that the Bur calculus is also sound for multiplace relations if the criterion is met in the limit. The proof of the theorem can again be found in [33].

4 A Connectionist Implementation of the BUR Calculus

The connectionist implementation of the Bur calculus is based on two main ideas: (1) use the kernel of the Rnn model to encode the logical structure of the Bur knowledge base and its recursive part to encode successive applications of the reduction techniques and (2) use the temporal synchronous activation model of Shruti to solve the dynamic binding problem. In the Bur model two types of phase–sensitive binary threshold units are used, which are called btu–p–units and btu–c–units, respectively. They have the same functionality as the ρ–btu and τ–and–units in the Shruti model, ie. the output of a btu–p–unit and of a btu–c–unit in a phase πc in a cycle ω is
o_btu–p(πc) = 1 if i_btu–p(πc) ≥ θ_btu–p, and 0 otherwise,
and
o_btu–c(πc) = 1 if there exists a phase πc′ ∈ ω with i_btu–c(πc′) ≥ θ_btu–c, and 0 otherwise,
respectively, where θ denotes the threshold and i(πc) the input of the unit in the phase πc. Because all connections in the network will be defined as weighted with 1, the input i(πc) of a btu–p– and a btu–c–unit equals the sum of the outputs of all units that are connected to that unit in the phase πc. The number of phases in a cycle ω is determined by the number of constant symbols occurring in a given Bur knowledge base.

For a given Bur knowledge base we construct a four–layered feed forward network according to the following algorithm. To shorten the notation the superscripts I, O, 1 and 2 indicate whether the unit belongs to the input, output, first or second hidden layer of the network respectively.
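Before the construction itself, the behaviour of the two unit types can be written down directly (a minimal sketch of our own; the class names are invented). A btu–p–unit is phase-selective, firing only in phases where its summed input reaches the threshold, whereas a btu–c–unit fires in every phase of the cycle once its threshold has been reached in some phase.

```python
class BtuP:
    """Phase-sensitive threshold unit: fires in phase c iff the input in phase c >= theta."""
    def __init__(self, theta):
        self.theta = theta
    def output(self, inputs_by_phase):           # dict: phase -> summed input
        return {c: 1 if x >= self.theta else 0 for c, x in inputs_by_phase.items()}

class BtuC:
    """Fires in all phases of the cycle iff the threshold is reached in some phase."""
    def __init__(self, theta):
        self.theta = theta
    def output(self, inputs_by_phase):
        fired = any(x >= self.theta for x in inputs_by_phase.values())
        return {c: 1 if fired else 0 for c in inputs_by_phase}

# With two constants a and b the cycle has two phases:
print(BtuP(2).output({"pi_a": 2, "pi_b": 1}))   # -> {'pi_a': 1, 'pi_b': 0}
print(BtuC(2).output({"pi_a": 2, "pi_b": 1}))   # -> {'pi_a': 1, 'pi_b': 1}
```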
Definition 3. The network corresponding to a Bur knowledge base P is an Rnn with an input, two hidden and an output layer constructed as follows:
1 For each constant c occurring in P add a unit btu–p^I_c with threshold 1.
2 For each relation symbol p with arity k occurring in P add units btu–p^I_p[1], . . . , btu–p^I_p[k] and btu–p^O_p[1], . . . , btu–p^O_p[k], each with threshold 1.
3 For each formula F of the form p(. . .) ← p1(. . .) ∧ . . . ∧ pn(. . .) in P do:
3.1 For each variable X occurring in the body of F add a unit btu–c^1_X. Draw connections from each unit btu–p^I_p[j] to this unit iff relation p occurs in the body of F and its j-th argument is X. Set the threshold of the new unit btu–c^1_X to the number of incoming connections.
3.2 For each constant c occurring in F add a unit btu–c^1_c. Draw a connection from unit btu–p^I_c to this unit and connections from each unit btu–p^I_p[j] iff relation p occurs in the body of F and its j-th argument is c. Set the threshold of the new unit btu–c^1_c to the number of incoming connections.
3.3 For each unit btu–c^1_X that was added in step 3.1 add a companion unit btu–p^1_X iff variable X occurs in the head of F. For each unit btu–c^1_c that was added in step 3.2 add a companion unit btu–p^1_c iff constant c occurs in the head of F. Draw connections from the input layer to the companion units such that these units receive the same input as their companion units btu–c^1_X and btu–c^1_c, and assign the same threshold.
3.4 If k is the arity of the relation p(. . .) occurring in the head of F then add units btu–p^2_p[1], . . . , btu–p^2_p[k]. Draw a connection from each btu–c^1 unit added in steps 3.1 to 3.3 to each of these units.
3.5 Draw a connection from btu–p^1_X to btu–p^2_p[j] iff variable X occurs at position j in p(. . .). Draw a connection from btu–p^1_c to btu–p^2_p[j] iff constant c occurs at position j in p(. . .). Set the threshold of the btu–p^2 units to the number of incoming connections.
3.6 For each 1 ≤ j ≤ k draw a connection from unit btu–p^2_p[j] to unit btu–p^O_p[j].
4 For each relation p with arity k occurring in P and for each 1 ≤ j ≤ k draw a connection from unit btu–p^O_p[j] to unit btu–p^I_p[j].
5 Set the weights of all connections in the network to 1.

The network is a recursive network with a feed forward kernel. This kernel is constructed in steps (1) to (3.6). It is extended in step (4) to an Rnn. As an example consider the following Bur knowledge base P2:
p(a, b)
q(a, b) ← p(a, b)
p(Y, X) ← q(X, Y)
r(X) ← p(X, Y) ∧ q(Y, X)

Fig. 1. The Bur network for a simple knowledge base. btu–p–units are depicted as squares and btu–c–units as circles. For presentation clarity we have dropped the recurrent connections between corresponding units in the output and input layer.

Fig. 1 shows the corresponding Bur network. Because this knowledge base contains just the two constants a and b, the cycle ω is defined by {πa, πb}. The inference process is initiated by presenting the only fact p(a, b) to the input layer of the network. More precisely, the units labelled p[1] and a in the input layer of Fig. 1 are clamped in phase πa, whereas the units labelled p[2] and b are clamped in phase πb. The external activation is maintained throughout the inference process. Fig. 2 shows the activation of the units during the computation.
After five cycles (ie. 10 time steps) the spreading of activation reaches a stable state and the generated model can be read off as will be explained in the following paragraph.

Analogous to the Rnn model the input and the output layer of a Bur network represent interpretations for the knowledge base. The instantiation of a variable X (occurring as an argument of some relation) by a constant c is realized by activating the unit that represents X in the phase πc representing c. A ground atom p(c1, . . . , ck) is an element of the interpretation encoded in the activation patterns of the input (and output) layer in cycle ω iff for each 1 ≤ j ≤ k the units btu–p^I_p[j] (and btu–p^O_p[j]) are activated in phases πcj ∈ ω [Footnote 5: Because each relation is represented only once in the input and the output layer and each argument p[j] may be activated in several phases during one cycle, these activation patterns represent several bindings and, thus, several instances of the relation p simultaneously. This may lead to crosstalk as will be discussed later]. Because all facts of a Bur knowledge base are ground, the set of facts represents an interpretation for the given knowledge base.

Fig. 2. The activation of the units within the computation in the Bur network shown in Fig. 1. Each cycle ω consists of two time steps, where the first one corresponds to the phase πa and the second one to the phase πb. The bold line after two cycles marks the point in time up to which the condition in Theorem 2 is fulfilled, ie. each relation argument is bound to one constant only. During the third cycle the arguments p[1] and p[2] are both bound to both constants a and b.

A computation is initialized by presenting the facts to the network. This is done by clamping all units representing constants and all units representing the arguments of these facts. Thereafter the activation is propagated through the network. We refer to the propagation of activation from the input layer to the output layer as a computation step. Let I^d_BUR denote the interpretation represented by the output layer of a Bur network after d ≥ 1 computation steps. The computation in the Bur network terminates if the network reaches a stable state, ie. if for two interpretations computed in successive computation steps d and d + 1 we find I^{d+1}_BUR = I^d_BUR. Such a stable state will always be reached in finite time because all rules in a Bur knowledge base are definite clauses and there are only finitely many constants and no other function symbols.

5 Properties of the BUR Network

It is straightforward to verify that a Bur network encodes the reduction rules of the Bur calculus. A condition X ∈ C for some argument j of a relation p is encoded by activating the unit p[j] occurring in the input (and output) layer in all phases corresponding to the constants occurring in C. This basically covers pointwise DB reductions. The hidden layers and their connections are constructed such that they precisely realize the evaluation of isolated connections. This is in fact an enrichment of a McCulloch–Pitts network [24] by a phase–coding of bindings.

Let us now first consider the case where the Bur knowledge base P contains only unary relation symbols. In this case the computation of the Bur network with respect to an interpretation I within one computation step equals the computation of the meaning function TP(I) for the logic program P.
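The phase-coded read-out described above (and the crosstalk mentioned in the footnote) can be illustrated with a small sketch of our own (the function name is invented): given, for every argument of a relation, the set of phases (constants) in which its unit fires during a cycle, the represented instances are all combinations of these bindings — which is exactly why spurious instances appear as soon as an argument is bound in more than one phase.

```python
from itertools import product

def represented_instances(relation, phases_per_argument):
    """phases_per_argument: one set of constants per argument position,
    i.e. the phases in which the corresponding unit fires during a cycle."""
    return {(relation, combo) for combo in product(*phases_per_argument)}

# Second computation step for P2: p[1] and p[2] each fire in both phases,
# because p(a,b) and p(b,a) have to be represented simultaneously.
print(represented_instances("p", [{"a", "b"}, {"a", "b"}]))
# -> also contains ('p', ('a', 'a')) and ('p', ('b', 'b')): the crosstalk instances
```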
One should observe that TP(∅) is precisely the set of all facts occurring in P and, thus, corresponds precisely to the activation pattern presented as external input to initialize the Bur network. Hence, it is not too difficult to show by induction on the number of computation steps that the following proposition holds.

Proposition 4. Let TP be the meaning function for a Bur knowledge base P and d be the number of computation steps. If P contains only unary relation symbols, then I^d_BUR = TP^{d+1}(∅).

Because for a Bur knowledge base P the least fixed point of TP exists and can be computed in finite time, Proposition 4 ensures that in the case of unary relation symbols the Bur network computes the least model of the Bur knowledge base. Moreover, by Theorem 1(1) we learn that the Bur network is a sound and complete implementation of the Bur calculus in this case.

The result is restricted to a knowledge base with unary relation symbols only, because in the general case of multi–place relation symbols the so–called crosstalk problem may occur. If several instances of a multi–place relation are encoded in an interpretation the relation arguments can each be bound to several constants. Because it is not encoded which argument binding belongs to which instance, instances that are not actually elements of the interpretation may by mistake be supposed to be. This problem corresponds precisely to the problem of whether spurious tuples are computed in the Bur calculus. Reconsider P2 for which Fig. 2 shows the input to the Bur network and the activation of the output units during the first five computation steps of the computation. During the second computation step the units btu–p^O_p[1] and btu–p^O_p[2] are both activated in the phases πa and πb, because they have to represent the bindings p[1] = a ∧ p[2] = b of the instance p(a, b) and p[1] = b ∧ p[2] = a of the instance p(b, a) of the computed interpretation. But these activations also represent the instances p(a, a) and p(b, b). We can show, however, that despite the instances that are erroneously represented as a result of the crosstalk problem, all instances that result from an application of the meaning function to a given interpretation are computed correctly.

Proposition 5. Let P be a Bur knowledge base, TP the meaning function for P and d be the number of computation steps. Then, I^d_BUR ⊇ TP^{d+1}(∅).

One can be even more precise by showing that the Bur network is again a sound and complete implementation of the Bur calculus in that the stable state of the network precisely represents M (see Theorem 1(2)). As a consequence of the one–to–one correspondence between the Bur calculus and its connectionist implementation we can now apply the precondition of Theorem 2 as a criterion that, if met, ensures that all instances represented by an interpretation in the Bur network belong to the least model of the Bur knowledge base. More precisely, because the computation in the Bur network yields all ground atoms that actually are logical consequences of P and using the criterion stated in Theorem 2 we can determine the computation step d in which ground atoms that are not logical consequences of P are computed for the first time. Let I<d denote the set of ground atoms that are computed within computation steps less than d and I≥d the ones computed within computation steps equal to or higher than d.
The elements of I<d are logical consequences of the knowledge base by Theorem 2, whereas the elements in I≥d \ I<d may or may not be logical consequences of the knowledge base. This problem can be decided by presenting these elements to a sound and complete backward reasoning system for datalogic. One should observe, however, that in the worst case the set I≥d \ I<d may contain exponentially many elements with respect to the size of the knowledge base.

Finally, we will analyze the size of a Bur network and the time it takes for a network to settle down in a stable state. From Definition 3 we learn that the number of units in the Bur network grows linearly with the size n of the Bur knowledge base P [Footnote 6: n is determined by the number of clauses, the number of relation and constant symbols occurring in the knowledge base, the average arity of the relation symbols and the average number of variables and constant symbols in each clause body]. The time needed for one computation step is 4 × |ω|, where |ω| denotes the length of the cycle ω. An element A of the least model of P is computed in 4 × |ω| × (l − 1) time, where l is the length of the shortest derivation of A with respect to TP. In other words, the time is linear with respect to the shortest derivation. In the worst case, however, l = 2|C|, where |C| denotes the number of constants occurring in the alphabet underlying a Bur knowledge base. This time complexity comes as no surprise as the computation of the least model of a datalogic program is not in the class NC and thus is unlikely to be parallelizable in an optimal or efficient way (see eg. [22]).

6 The BUR vs. the Forward Reasoning SHRUTI System

Note: In this section Shruti always refers to the forward reasoning Shruti system, if not indicated otherwise. The Bur system was invented to show to what extent the Rnn model of [15] can be extended to multi–place relation symbols by using the phase–coding model of Shruti. It was not intended to be a logical reconstruction of Shruti. But the systems are quite close and, consequently, should be compared in detail.

Expressiveness: Both the Bur and the Shruti system [Footnote 7: after replacing existentially bound variables by new constants] can cope with datalogic programs. However, the programs in Shruti are syntactically restricted and certain conditions have to be met during the computation. There are no such restrictions in Bur and, consequently, the expressive power of the Bur system is larger than that of the Shruti system.

Soundness: A detailed analysis of the Shruti system showed that computing in Shruti is nothing but computing with reductions in the connection method [33]. Because these reductions are sound, Shruti is sound as well [Footnote 8: This should not be confused with the analysis done in [2], which was concerned with the backward reasoning version of Shruti]. As shown in Theorem 1(2), Bur may be unsound if relation symbols with arity larger than one are involved.

Completeness: Consider P2 as the knowledge base for a Shruti network. An inspection of this knowledge base shows that its least model contains two different instances of the relation p. If we extend this example by adding more facts concerning p and q, then many more — in fact, exponentially many — copies of the rule r(X) ← p(X, Y) ∧ q(Y, X) are needed. This example is not specifically chosen, but in a forward reasoning system it is almost always the case that multiple instances occur.
Such a problem can only be solved in the Shruti system using the so–called multiple instantiation switches, which are able to store a certain number of different instances of a relation (see [29]). The connectionist network encoding of these switches is quite complicated and its size depends on the number of needed copies. Hence, an ideal Shruti network requires exponential space. Because such networks cannot be realized, the number of copies is restricted to a certain fixed number. But now Shruti is no longer complete in that some logical consequences of the knowledge base cannot be computed anymore. In contrast, as shown in Theorem 1(2) Bur is complete.

Space: As already mentioned, an ideal Shruti system requires exponential space whereas Bur requires only linear space (see Section 5). Due to lack of space we cannot depict the Shruti network for P2. But the interested reader can easily verify that the Shruti network is much more complicated than the Bur network shown in Fig. 1. On the other hand, Bur has to face the crosstalk problem and expensive postprocessing may be needed to separate the logical consequences of a knowledge base from those facts which are computed but are not logical consequences. In other words, Bur trades space for time.

Time: The time to show that a certain fact is a logical consequence of the knowledge base is in both systems linear with respect to the shortest possible derivation of this fact.

Learning: The Bur network has a simple recurrent structure with a feed forward kernel. For feed forward networks backpropagation and its derivatives are well established learning techniques. There seems to be no major hurdle to extend these techniques to cope with phase–codings, although this has to be shown in the future. Hence, we believe that the Bur model is well suited for inductive learning tasks using sets of input/output patterns. The Shruti model and its extension using multiple instantiations and a special type hierarchy [Footnote 9: Such a special hierarchy is not needed in Bur because types can be represented by unary relation symbols and Bur is sound and complete for unary relations (see Theorem 1(1))] are encoded in a connectionist setting using many different and complicated unit types. Hence, learning in Shruti networks is hardly imaginable using standard algorithms.

Facts: A Bur network encodes only the rules of a knowledge base, whereas the facts are presented (clamped) to the input layer as an initial activation pattern. In other words, the same set of rules can be used with different facts without changing the structure of the network. In contrast, a Shruti network encodes the rules and the facts and, consequently, a change in the facts requires a change in the network structure.

Summing up, Shruti is undoubtedly a powerful and quite successful tool for reflexive backward reasoning. It is not so obvious that it is the best choice for reflexive forward reasoning. The Bur calculus presented in this paper has some advantages concerning expressive power, required space, simpler connectionist structure and the handling of facts. On the other hand, the systems differ as far as soundness and completeness issues are concerned: Bur is complete but unsound, whereas Shruti is sound and incomplete.

7 Discussion

In this paper we have presented a new calculus together with a connectionist implementation: the Bur system.
It is a rigorous design starting from a first–order logic (datalogic with the usual logical consequence relation), developing a calculus (bottom–up reductions in the connection method using database technologies) and specifying a connectionist implementation (recurrent neural networks with a feed forward kernel). The system is sound and complete for unary relation symbols and complete but unsound for relation symbols with arity larger than one. A correctness criterion is given, which provides a test for soundness in the latter case. The connectionist implementation requires linear space with respect to the size of the knowledge base. Furthermore, if a certain atom A is a logical consequence of the knowledge base, then this can be shown in time linear with respect to the shortest derivation of A. In general, however, due to the unsoundness, we may need exponential time to decide whether A is a logical consequence of the knowledge base. This is not a bad design but rather a consequence of the fact that this problem is in NP. Finally, Bur has a simple connectionist structure, which should make it possible to adapt standard learning techniques to the Bur networks.

In many cases a Bur network can be minimized by eliminating redundant units and connections. For example, the network shown in Fig. 1 contains several redundant units like the rightmost unit in the area marked as “clause 3” and shortcuts may be introduced. It is important, however, that along such shortcuts the activation is propagated in the time that is required if the additional units are present. In other words, the shortcuts may not lead to speedups because otherwise the logical structure is not maintained.

In contrast to the very successful backward reasoning version of Shruti, Bur is a forward reasoning system for reflexive reasoning. There is some evidence that humans reason in a forward direction by building partial models and basing their decisions on these models (see [19]). It remains to be tested whether the Bur system is a valid model for these findings.

Finally, let us come back to the problem mentioned at the beginning of this paper. The Bur calculus is not a solution to the problem of how to represent structured objects and structure sensitive processes in connectionist systems. But we believe that it is another step in the right direction because it gives us a better understanding of how logic systems and connectionist systems can be amalgamated.

References
1. K. R. Apt and M. H. Van Emden. Contributions to the theory of logic programming. Journal of the ACM, 29:841–862, 1982.
2. A. Beringer and S. Hölldobler. On the adequateness of the connection method. In Proceedings of the AAAI National Conference on Artificial Intelligence, pages 9–14, 1993.
3. W. Bibel. On matrices with connections. Journal of the ACM, 28:633–645, 1981.
4. W. Bibel. Advanced topics in automated deduction. In R. Nossum, editor, Fundamentals of Artificial Intelligence II, pages 41–59. Springer, LNCS 345, 1988.
5. W. Bibel. Deduction. Academic Press, London, San Diego, New York, 1993.
6. A.S. d’Avila Garcez, G. Zaverucha, and L.A.V. de Carvalho. Logic programming and inductive learning in artificial neural networks. In Ch. Herrmann, F. Reine, and A. Strohmaier, editors, Knowledge Representation in Neural Networks, pages 33–46, Berlin, 1997. Logos Verlag.
7. W. F. Dowling and J. H. Gallier. Linear-time algorithms for testing the satisfiability of propositional Horn formulae. Journal of Logic Programming, 1(3):267–284, 1984.
8. E. W. Elcock and P. Hoddinott. Comments on Kornfeld’s equality for Prolog: E-unification as a mechanism for argumenting the prolog search strategy. In Proceedings of the AAAI National Conference on Artificial Intelligence, pages 766–774, 1986.
9. J. A. Feldman. Memory and change in connection networks. Technical Report TR96, Computer Science Department, University of Rochester, 1981.
10. J. A. Feldman and D. H. Ballard. Connectionist models and their properties. Cognitive Science, 6(3):205–254, 1982.
11. M. Fitting. Metric methods – three examples and a theorem. Journal of Logic Programming, 21(3):113–127, 1994.
12. J. A. Fodor and Z. W. Pylyshyn. Connectionism and cognitive architecture: A critical analysis. In Pinker and Mehler, editors, Connections and Symbols, pages 3–71. MIT Press, 1988.
13. K.-I. Funahashi. On the approximate realization of continuous mappings by neural networks. Neural Networks, 2:183–192, 1989.
14. S. Hölldobler. A structured connectionist unification algorithm. In Proceedings of the AAAI National Conference on Artificial Intelligence, pages 587–593, 1990.
15. S. Hölldobler and Y. Kalinke. Towards a massively parallel computational model for logic programming. In Proceedings of the ECAI94 Workshop on Combining Symbolic and Connectionist Processing, pages 68–77. ECCAI, 1994.
16. S. Hölldobler, Y. Kalinke, and H. Lehmann. Designing a counter: Another case study of dynamics and activation landscapes in recurrent networks. In Proceedings of the KI97: Advances in Artificial Intelligence, volume 1303 of Lecture Notes in Artificial Intelligence, pages 313–324. Springer, 1997.
17. S. Hölldobler, Y. Kalinke, and H.-P. Störr. Recurrent neural networks to approximate the semantics of acceptable logic programs. In G. Antoniou and J. Slaney, editors, Advanced Topics in Artificial Intelligence, volume 1502 of LNAI, Berlin/Heidelberg, 1998. Proceedings of the 11th Australian Joint Conference on Artificial Intelligence (AI’98), Springer–Verlag.
18. S. Hölldobler, Y. Kalinke, and H.-P. Störr. Approximating the semantics of logic programs by recurrent neural networks. Applied Intelligence, 11:45–59, 1999.
19. P. N. Johnson-Laird and R. M. J. Byrne. Deduction. Lawrence Erlbaum Associates, Hove and London (UK), 1991.
20. Y. Kalinke. Using connectionist term representation for first–order deduction – a critical view. In F. Maire, R. Hayward, and J. Diederich, editors, Connectionist Systems for Knowledge Representation Deduction. Queensland University of Technology, 1997. CADE–14 Workshop, Townsville, Australia.
21. Y. Kalinke and H. Lehmann. Computations in recurrent neural networks: From counters to iterated function systems. In G. Antoniou and J. Slaney, editors, Advanced Topics in Artificial Intelligence, volume 1502 of LNAI, Berlin/Heidelberg, 1998. Proceedings of the 11th Australian Joint Conference on Artificial Intelligence (AI’98), Springer–Verlag.
22. R. M. Karp and V. Ramachandran. Parallel algorithms for shared-memory machines. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, chapter 17, pages 869–941. Elsevier Science Publishers B.V., New York, 1990.
23. J. W. Lloyd. Foundations of Logic Programming. Springer, Berlin, Heidelberg, 1987.
24. W. S. McCulloch and W. Pitts. A logical calculus and the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5:115–133, 1943.
25. A. Newell. Physical symbol systems. Cognitive Science, 4:135–183, 1980.
Cognitive Science, 4:135–183, 1980. 26. G. Pinkas. Symmetric neural networks and logic satisfiability. Neural Computation, 3:282–291, 1991. 27. T. A. Plate. Holographic reduced representations. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 30–35, 1991. 28. J. B. Pollack. Recursive auto-associative memory: Devising compositional distributed representations. In Proceedings of the Annual Conference of the Cognitive Science Society, pages 33–39, 1988. 29. L. Shastri and V. Ajjanagadde. From associations to systematic reasoning: A connectionist representation of rules, variables and dynamic bindings using temporal synchrony. Behavioural and Brain Sciences, 16(3):417–494, September 1993. 30. A. Sperduti. Labeling RAAM. Technical Report TR-93-029, International Computer Science Institute, Berkeley, CA, 1993. 31. G.G. Towell and J.W. Shavlik. Extracting refined rules from knowledge–based neural networks. Machine Learning, 131:71–101, 1993. 32. J. D. Ullman. Principles of Database Systems. Computer Science Press, Rockville, Maryland, USA, second edition, 1985. 33. J. Wunderlich. Erweiterung des RNN–Modells um SHRUTI–Konzepte: Vom aussagenlogischen zum Schließen über prädikatenlogischen Programmen. Master’s thesis, TU Dresden, Fakultät Informatik, 1998. A Novel Modular Neural Architecture for Rule-Based and Similarity-Based Reasoning Rafal Bogacz and Christophe Giraud-Carrier Department of Computer Science, University of Bristol Merchant Venturers Building, Woodland Rd Bristol BS8 1UB, UK {bogacz,cgc}@cs.bris.ac.uk Abstract. Hybrid connectionist symbolic systems have been the subject of much recent research in AI. By focusing on the implementation of highlevel human cognitive processes (e.g., rule-based inference) on low-level, brain-like structures (e.g., neural networks), hybrid systems inherit both the efficiency of connectionism and the comprehensibility of symbolism. This paper presents the Basic Reasoning Applicator Implemented as a Neural Network (BRAINN). Inspired by the columnar organisation of the human neocortex, BRAINN’s architecture consists of a large hexagonal network of Hopfield nets, which encodes and processes knowledge from both rules and relations. BRAINN supports both rule-based reasoning and similarity-based reasoning. Empirical results demonstrate promise. 1 Introduction Over the past few years, the mainly historical, and arguably unproductive, division between psychological and biological plausibility has narrowed significantly through the design and implementation of successful hybrid connectionist symbolic systems. Rather than committing to a single philosophy, such systems draw on the strengths of both biology and psychology, by implementing high-level human cognitive processes (e.g., rule-based inference) within low-level, brain-like structures (e.g., neural networks). Hence, hybrid systems inherit the characteristics of both traditional symbolic systems (e.g., expert systems) and connectionist architectures, including: – Complex reasoning – Learning and generalisation from experience – Efficiency through massive parallelism This paper presents the Basic Reasoning Applicator Implemented as a Neural Network (BRAINN). The original description of BRAINN is in [1]. The architecture of BRAINN mimics the columnar organisation of the human neocortex. It consists of a large hexagonal network of Hopfield nets [7] in which both rules and relations can be encoded in a distributed fashion. 
Each relation is stored in a single Hopfield net, whilst each rule is stored in a set of adjacent Hopfield nets. Through systematically orchestrated relaxations, BRAINN combines rule-based reasoning typical of expert systems with similarity-based reasoning typical of neural networks. Hence, BRAINN supports both monotonic reasoning and several forms of common-sense (non-monotonic) reasoning [10].

The paper is organised as follows. Section 2 details the BRAINN architecture and algorithms. Section 3 reports the results of a number of experiments with BRAINN. Section 4 reviews related work, and section 5 concludes the paper and outlines directions for future work.

2 BRAINN

BRAINN's architecture is inspired by the columnar organisation of the human neocortex [3]. The human neocortex is divided into minicolumns (Ø0.03mm), i.e., groups of neurons gathered around dendrite bundles. Minicolumns create a hexagonal network, and are, in turn, organised in hexagonal macrocolumns (Ø0.5mm). The axons of some pyramidal neurons have an unusually high number of synapses within about 0.5mm of their soma. Simultaneous activation of neurons in a 0.5mm radius is thus sometimes observed. These neurons excite one another and can in turn recruit additional neurons, since the adjacent neurons receive activation from two or more neurons, as shown in Figure 1.

Fig. 1. Neuron Recruitment

Calvin [3] speculates further that newly recruited neurons can subsequently recruit others, so that the pattern of activation of one or two macrocolumns can spread through the whole network. He argues that this process is especially relevant to short term memory phenomena.

2.1 Knowledge Implementation

BRAINN's underlying neural network architecture supports the encoding of both relations and if-then rules, as detailed in the following sections.

Relations
The relations considered here can be represented as triples of the form <Object, Attribute, Value>. In BRAINN, each component of the triple is represented by a unique pattern. The pattern is a sequence of bits of constant length N, where each bit may have value -1 or +1. Hence, each relation can be stored in a Hopfield network with 3N units. Figure 2 shows one such network for N = 4. The activations of units x1 to xN correspond to the binary representation of the object, the activations of units xN+1 to x2N correspond to the binary representation of the attribute, and the activations of units x2N+1 to x3N correspond to the binary representation of the value.

Fig. 2. Relation Encoding (the triple <cat, drinks, milk> represented as -1/+1 activations over the Object units x1..xN, the Attribute units xN+1..x2N, and the Value units x2N+1..x3N)

For simplicity, Hopfield networks encoding relations are referred to as assemblies, which work as associative memories. After delivering two components of a relation to an assembly, the third one may be retrieved. The weights of the network are set using either the Hebb rule or the Perceptron rule [5], as detailed below.

Weights Setting with the Hebb Rule. Upon creation of the network, all weights are initialised to 0. For each new triple to remember, the binary representations of its components are delivered to the corresponding neurons in the assembly as shown in Figure 2.
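Before the weight update itself, the delivery step can be made concrete with a small Python sketch. The paper only requires that each symbol be given a unique ±1 pattern of length N; the random codebook below, and the names Codebook and encode_triple, are assumptions made purely for illustration, not the authors' implementation.

import numpy as np

N = 4  # bits per triple component, as in Fig. 2

class Codebook:
    """Assigns each symbol a fixed bipolar (-1/+1) pattern of length N."""
    def __init__(self, n_bits, seed=0):
        self.n_bits = n_bits
        self.rng = np.random.default_rng(seed)
        self.patterns = {}

    def pattern(self, symbol):
        if symbol not in self.patterns:
            self.patterns[symbol] = self.rng.choice([-1, 1], size=self.n_bits)
        return self.patterns[symbol]

def encode_triple(codebook, obj, attr, value):
    """Concatenate the Object, Attribute and Value patterns into one 3N-unit state."""
    return np.concatenate([codebook.pattern(obj),
                           codebook.pattern(attr),
                           codebook.pattern(value)])

cb = Codebook(N)
x = encode_triple(cb, "cat", "drinks", "milk")   # a vector of 3N values in {-1, +1}

The resulting 3N-element vector is what gets clamped onto an assembly before the weight update described next.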
Then, the weights w_ij between units i and j (i ≠ j) are updated according to equation (1):

w_ij ← w_ij + x_i x_j    (1)

Weights Setting with the Perceptron Rule. The Perceptron rule is similar to the Hebb rule, but it increases storage capacity and reduces susceptibility to pattern correlation [5]. With the Perceptron rule, the weights w_ij to unit i are modified according to equation (1) only if the output x_i^{t+1} of unit i (as per equation (2) below) differs from the bit value x_i^t in the triple's representation. In addition, learning is iterative. Patterns are presented repeatedly until they are stored correctly or the learning time exceeds a pre-defined limit. The learning algorithm is shown in Figure 3.

The network functions as an associative memory, using the principle of relaxation to retrieve relations. A pattern is delivered to the network and, during relaxation, unit i changes its state x_i according to equation (2) until the network reaches a stable state:

x_i^{t+1} = sgn( Σ_{j=1}^{3N} w_ij x_j^t )    (2)

Initialise all weights to 0
Repeat
  For each triple to remember
    1. Deliver the triple's elements to the network's units x_i
    2. For each unit i
       (a) Compute the activation x_i^{t+1}
       (b) If x_i^{t+1} ≠ x_i^t Then update the weights: w_ij ← w_ij + x_i^t x_j^t (j ≠ i)
Until no updates are made or time is out

Fig. 3. Weight Setting with the Perceptron Rule

The Hopfield network stabilises on the stored pattern most similar to the delivered one, or sometimes in a random state called a spurious attractor. In BRAINN, all of the questions asked by the user take the form of a triple <Object, Attribute, Value>, where one component is replaced by a question mark (e.g., <mouse, eats, ?>). The network then retrieves the triple's missing component based on knowledge of the other two. The units corresponding to the unknown component are set to 0. If the question is delivered to an assembly remembering the expected relation, then, after relaxation, the units corresponding to the unknown component are equal to the binary representation of the relation's missing element. If the network does not store the triple, it stabilises in a spurious attractor or on another remembered triple. Examples are given in section 2.3.

Given the above mapping of relations to triples, it is possible that an object may have more than one value for a single attribute, e.g., <cat, eats, whiskas> and <cat, eats, mouse>. To solve this problem, such triples are stored in distinct assemblies. Details are described in section 2.4.

If-Then Rules
The rules that BRAINN uses are traditional if-then rules, where the left-hand side (LHS) consists of a conjunction of conditions and the right-hand side (RHS) is a single condition, for example:

IF <soil, is, sandy> AND <soil, humus_level, high> THEN <soil, compaction, high>

As with relations, conditions are represented by triples of the form <Object, Attribute, Value> and subsequently stored in assemblies of the form described in section 2.1.1. The various assemblies representing the conditions of a rule can then be connected into a network of assemblies that encodes the rule. Figure 4 shows such a network for the above rule. In the network, there are connections between each unit from the LHS assemblies and each unit from the RHS assembly. The weights between assemblies are set according to the Hebb rule. Therefore, if one knows one side of the rule, one can retrieve the other.
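Pulling equations (1) and (2) together for a single assembly, the following is a minimal, hypothetical sketch of Hebbian storage and retrieval by relaxation. Synchronous updates and the tie-breaking rule are assumptions; the paper does not specify the update schedule, and the component patterns are illustrative.

import numpy as np

def hebb_store(W, pattern):
    """Equation (1): w_ij <- w_ij + x_i * x_j, with no self-connections."""
    W += np.outer(pattern, pattern)
    np.fill_diagonal(W, 0)

def relax(W, state, max_steps=50):
    """Equation (2): update x_i <- sgn(sum_j w_ij x_j) until the state is stable."""
    for _ in range(max_steps):
        new_state = np.sign(W @ state)
        new_state[new_state == 0] = 1           # break ties away from zero
        if np.array_equal(new_state, state):
            break
        state = new_state
    return state

N = 4                                           # bits per component
size = 3 * N                                    # units per assembly
W = np.zeros((size, size))

cat, drinks, milk = (np.array(p) for p in
                     ([1, -1, -1, 1], [-1, -1, 1, 1], [1, -1, 1, -1]))
hebb_store(W, np.concatenate([cat, drinks, milk]))

# Ask <cat, drinks, ?>: unknown units start at 0, relaxation fills them in.
question = np.concatenate([cat, drinks, np.zeros(N)])
answer = relax(W, question)
print(np.array_equal(answer[2 * N:], milk))     # True if 'milk' is retrieved

The same relaxation, started with the unknown component set to 0, is what answers questions such as <cat, drinks, ?> in section 2.3.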
To accommodate rules with varying numbers of conditions in LHS and to provide a uniform network topology for both rules and relations (rather than a set of disconnected networks), assemblies are organised into a large hexagonal A Novel Modular Neural Architecture 67 soil For simplicity, only one neuron per element of relation is shown. soil sandy is soil compaction high high humus_level Fig. 4. Rule Encoding in Network network as shown in Figure 5. With such an architecture, each rule may have a maximum of 6 conditions in its LHS. Larger numbers of conditions in LHS can, of course, be handled by chaining rules appropriately, using new variables (e.g., IF A AND B AND C THEN D can be rewritten as IF A AND B THEN E, and IF E AND C THEN D). Each line between assemblies denotes connections between all units from the assemblies Fig. 5. Hexagonal Network of Assemblies When reasoning with rules, BRAINN implements backward chaining. Hence, the network retrieves the LHS of a rule upon delivery of its RHS. The following, along with Figure 6, details how this takes place in the network. For the sake of argument, assume that a rule consists of four conditions in its LHS. The RHS is stored in one assembly and the four conditions from the LHS are stored in adjacent assemblies. Upon activation of the assembly corresponding to the RHS, the network must retrieve the four conditions of the LHS. a) b) i. RHS RHS RHS ii. RHS RHS RHS RHS RHS RHS RHS RHS LHS RHS LHS LHS LHS iii. iv. LHS LHS LHS LHS LHS LHS RHS LHS LHS RHS LHS LHS Fig. 6. Rule Retrieval: a) Storage; b) Retrieval - i) RHS sent to all assemblies, ii) Relaxation, iii) LHS sent to adajacent assemblies and iv) Relaxation 68 R. Bogacz and C. Giraud-Carrier First, the RHS is delivered to all the assemblies in the hexagonal network. Once all of the Hopfield networks have relaxed, only the assembly storing the RHS has stabilised on the delivered pattern, since for that assembly, delivered and stored patterns are the same. Then, the assembly storing the RHS sends its pattern (vector of activation) to all six adjacent assemblies. Each adjacent assembly receives the pattern vector of the RHS assembly multiplied by the matrix of weights between the RHS assembly and itself. All of the adjacent assemblies relax after receiving the vector. In the case of the four LHS assemblies, the vector received is one of the patterns remembered in the local Hopfield network, so these assemblies will be stable. The other two assemblies will not be stable and will thus change their state. Moreover, the four LHS assemblies now send their patterns back to the RHS assembly. The RHS assembly receives these patterns multiplied by the matrix of weights between the LHS assemblies and itself. Thus, the pattern received by the RHS assembly is equal to its own pattern of activation. A kind of resonance is achieved, allowing the retrieval of the correct LHS. Note that LHS assemblies recruited by the aforementioned process are implicitly conjoined, i.e., the left-hand side of the rule is the conjunction of the conditions found in all of the LHS assemblies retrieved. To avoid confusion during backward chaining when several rules have the same right-hand sides (e.g., IF A AND B THEN C, and IF D THEN C), the right-hand sides are stored in different assemblies. Rules with Variables The rules discussed so far are essentially propositional. It is often useful, and even necessary, to encode and use more general rules, which include variables. 
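Before turning to variable binding, the propositional retrieval just described can be summarised in a structural sketch. This is deliberately not a neural implementation: each assembly's relaxation is reduced to a set-membership test and the inter-assembly weights to the pattern they would reconstruct in a neighbour, so only the control flow (broadcast, resonance, neighbour check) is illustrated, and all class and function names are hypothetical.

class Assembly:
    def __init__(self, stored):
        self.stored = set(stored)     # patterns (triples) this Hopfield net has stored
        self.neighbours = []          # list of (neighbour, pattern sent along the learned weights)

    def is_stable_on(self, pattern):
        """Stand-in for 'relaxation leaves the delivered pattern unchanged'."""
        return pattern in self.stored

def retrieve_lhs(assemblies, rhs_pattern):
    """Broadcast the RHS, find the assembly that resonates on it, then keep the
    neighbours that remain stable on what they receive: these hold the LHS conditions."""
    rhs_assembly = next(a for a in assemblies if a.is_stable_on(rhs_pattern))
    return [p for neighbour, p in rhs_assembly.neighbours if neighbour.is_stable_on(p)]

# IF <soil, is, sandy> AND <soil, humus_level, high> THEN <soil, compaction, high>
cond1 = ("soil", "is", "sandy")
cond2 = ("soil", "humus_level", "high")
rhs = ("soil", "compaction", "high")

a1, a2, a3 = Assembly([cond1]), Assembly([cond2]), Assembly([rhs])
a3.neighbours = [(a1, cond1), (a2, cond2)]

print(retrieve_lhs([a1, a2, a3], rhs))   # [('soil', 'is', 'sandy'), ('soil', 'humus_level', 'high')]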
To reason in the presence of such rules, an effective way of binding variables is required. In BRAINN, variable binding is achieved by using special weight values between LHS and RHS assemblies, as shown in Figure 7 for the rule IF <&someone, drinks, milk> THEN <&someone, is, strong>. Let &X be the variable. Then, the weights between the units representing &X in the LHS and the units representing &X in the RHS are equal to 1, whilst the weights between the units representing &X and all other units are equal to 0.

Fig. 7. Weight Setting for a Simple Rule (legend: weights equal to 1 between the variable units; remaining inter-assembly weights set up according to the Hebb rule; if there is no line between two units, the weight is equal to 0)

With such a set of weights, the pattern for the variable is sent between assemblies without any modification or interaction with the rest of the information in the assembly. The weights inside the LHS assemblies and the RHS assembly must also satisfy similar conditions. That is, the weight of the self-connection for every unit representing a variable is equal to 1, whilst the weight between each unit representing a variable and any other unit is equal to 0. These latter conditions guarantee the stability of the assembly, which is critical to the reasoning algorithm.

2.2 Functional Overview

Although BRAINN's knowledge implementation is inspired by biological considerations, its information processing mechanisms are not biologically plausible. A high-level view of BRAINN's overall architecture is shown in Figure 8.

Fig. 8. BRAINN's Architecture (a Control Process mediating between the Short Term Memory, the Reasoning Goal, and the Long Term Memory, i.e., the hexagonal network)

The system's knowledge (i.e., rules and relations) is stored in the Long Term Memory (LTM). Temporary, run-time information is stored in the Short Term Memory (STM) and the reasoning goal is stored in a dedicated variable. Reasoning is effected by a form of backward chaining. The following sections detail the reasoning mechanisms implemented by the Control Process.

2.3 Rule-Based Reasoning

To facilitate reasoning, BRAINN's assemblies are labelled with the type of information they store: SN for a (semantic net's) relation, LHS for a rule's left-hand side, and RHS for a rule's right-hand side. The label is represented by a unique sequence of 4 bits, stored in a few additional units in each assembly. Hence, each assembly actually consists of 3N + 4 units.

As previously stated, BRAINN's rule-based reasoning engine implements a form of backward chaining. The pseudocode for the algorithm is described in Figure 9. If more than one rule can be used, the rules are sorted by ascending number of conditions in their LHS. The algorithm checks that an LHS condition is satisfied by (recursively) asking the network to produce its value. For example,

ApplyRule(question)
1. Deliver question to all assemblies
2. Relax the network
3. If there is an SN assembly containing question Then return the corresponding answer
4. Else
   (a) For all RHS assemblies containing question
       i. Retrieve the LHS of the rule
       ii. Sort the rules by ascending number of LHS conditions
   (b) For all rules in the above order
       i. Load the rule into STM (both RHS and LHS assemblies)
       ii. For each LHS condition of the rule
           - If LHS.value ≠ ApplyRule(<LHS.object, LHS.attribute, ?>) Then try the next rule
       iii. Give the answer from the RHS of the rule

Fig. 9.
BRAINN’s Backward Chaining Algorithm the algorithm checks the condition sky has colour blue by asking the question <sky, has colour, ?>. Although adequate for single-valued attributes, this may cause problems for multi-valued attributes. The following illustrates the working of the rule application algorithm on a simple reasoning task. Assume that BRAINN’s knowledge base consists of the following relation and rule: <Garfield, drinks, milk> IF <&someone, drinks, milk> THEN <&someone, is, strong> For simplicity, also assume that the hexagonal network consists of only 3 assemblies, organised as shown in Figure 10. The divisions in the assemblies represent subsets of units, one for each element of information (i.e., object, attribute, value and label). Also assume that the relation is stored in the upper assembly and the rule in lower and right assemblies as shown in Figure 10. Fig. 10. Simple Hexagonal Network A Novel Modular Neural Architecture 71 The simplest question that the user can ask, is about what Garfield drinks, i.e., <Garfield, drinks, ?> The algorithm delivers the question to all the assemblies. The network, after relaxation, is shown in Figure 11. A label over a division denotes that the activation of units in that division is equal to the binary representation of that label. If there is no label over a division, the network is in a spurious attractor. Fig. 11. Network after Relaxation: <Garfield, drinks, ?> After relaxation, the bottom assembly is in a spurious attractor, the right assembly has settled to one of the patterns stored in that assembly, and the top assembly has settled to the relation (tag is SN) containing the question. Hence, the system gives the answer from this assembly, i.e., milk. The following question, which asks what Garfield is, causes BRAINN’s rulebased reasoning mechanisms to be applied. <Garfield, is, ?> As before, the question is delivered to all the assemblies. The network, after relaxation, is shown in Figure 12. Fig. 12. Network after Relaxation: <Garfield, is, ?> Two assemblies are empty, because the network has settled in spurious attractors. Sequences of bits in those assemblies have no meaning. The lower assembly 72 R. Bogacz and C. Giraud-Carrier stores the RHS condition, which contains the question. Neighbours of that assembly receive its pattern of activation multiplied by the matrices of weights between assemblies. The resulting network is shown in Figure 13. Fig. 13. Network after Retrieving LHS of Rule The upper assembly is clear because the weights between the upper and lower assemblies are equal to zero (no rule is stored). In the right assembly, the LHS has been retrieved. The rule, IF <Garfield, drinks, milk> THEN <Garfield, is, strong>, is written to STM and the question, <Garfield, drinks, ?>, is delivered to the network. The behaviour of the network for this question is as described above. The value returned is the same as the value in the LHS of the retrieved rule, hence the answer for the question, <Garfield, is, ?>, is given from the RHS of the rule, i.e., strong. Currently, the system cannot answer questions involving variables (e.g., <?, is, strong>) since, after relaxation, the assembly which stores the RHS has one part (i.e., the object) empty or without meaning. Further work is necessary to overcome this limitation. 2.4 Similarity-Based Reasoning In addition to encoding relations and rules as described in section 2.1, BRAINN learns and reasons from similarity, as shown below. 
Consider the knowledge database shown in Figure 14. Fig. 14. Sample Knowledge Base A Novel Modular Neural Architecture 73 The database contains some information about cars, planes and lorries. The user may ask “What is a lorry used for travelling on?” The database does not contain the answer explicitly. However, lorries and cars have more attributes in common than lorries and planes. In this sense, a lorry is more similar to a car than to a plane. Hence, the system can guess that a lorry is for travelling on the ground like a car. To increase the capacity of the network, BRAINN generally stores new information in the assembly where the most similar information is already present. Two mechanisms are then available for similarity-based reasoning, one using on a voting algorithm and the other relying on Pavlov-like connections. Voting Algorithm The voting algorithm assumes that all the relations with the same object are stored in the same assembly. The algorithm to retrieve the value of attribute Aquery for object Oquery is described in Figure 15. All the relations with object Oquery are retrieved from memory (one-by-one from the same assembly, using relaxation) For each retrieved relation <Oquery, Aretrived, Vretrived> 1. Aretrived and Vretrived are delivered to each assembly 2. For each assembly C – If ∃Osimilar s.t. C stores a relation <Osimilar, Aretrived, Vretrived> Then • If ∃Vsimilar s.t. C stores a relation <Osimilar, Aquery, Vsimilar> Then vote for Vsimilar Choose the value with the largest number of votes as the answer Fig. 15. Voting Algorithm For example, assume BRAINN implements the knowledge database presented in Figure 14. If asked <lorry, is for travelling, ?>, BRAINN would assert on ground since lorry shares two properties with car (made of metal + has wheels = 2 votes for on ground) and only one with plane (made of metal = 1 vote for in air). Pavlov-like Connections The algorithm based on Pavlov-like connections assumes that all the relations with the same attribute and the same value are stored in the same assembly, whilst relations with the same attribute but different values are stored in different assemblies. The hexagonal network is overlaid with a fully connected mesh. These additional connections between all the assemblies capture co-occurring features (e.g., if some values of some attributes occur together for one object, then some of the assemblies are active together). The strengths of these Pavlov’s connections represent how often assemblies are active together. When a new relation is learnt then: 74 R. Bogacz and C. Giraud-Carrier 1. The object from this relation is sent to all the assemblies 2. The assemblies are relaxed (only the assemblies that remember any information about the object are in “resonance”) 3. The strength of all the Pavlov’s connections between the assembly where the new relation is stored and the assemblies in resonance is increased Figure 16 shows the Pavlov’s connections for the knowledge database of Figure 14. Only non-zero connections are shown. Line thickness is proportional to strength. Fig. 16. Pavlov-like Connections The algorithm to answer the question <Oquery, Aquery, ?> is in Figure 17. The pattern of the object Oquery is sent to all the assemblies and all the assemblies that remember any information about the object Oquery are activated Pavlov’s connections are used to determine which value of the attribute Aquery usually occurs with the set of features of the object Oquery Fig. 17. 
Pavlov Algorithm For example, consider the behaviour of the system for the question: <lorry, is for travelling, ?>. The object lorry is sent to all the assemblies and two assemblies storing information about the lorry are activated (see Figure 16). The assembly remembering the relation <?, is for travelling, on the ground> receives activation from two assemblies and the assembly remembering the relation <?, is for travelling, in the air> from one assembly only, hence the system will guess that the answer is: on the ground. The advantage of the voting algorithm is its simplicity and relatively low computational cost (computations are strongly parallel). The Pavlov-like algorithm is slightly more involved, but has very interesting feature - connections A Novel Modular Neural Architecture 75 between features are remembered even if the particular cases are forgotten (also a feature of human learning). Therefore, such connections could be used for rule extraction. 3 Empirical Results BRAINN is implemented in C++ under Windows, with a GUI displaying traces of the network’s behaviour (see http://www.cs.bris.ac.uk/ bogacz/brainn.html). Results of preliminary experiments with BRAINN follow. 3.1 Classical Reasoning Protocols Several tasks from the set of Benchmark Problems for Formal Nonmonotonic Reasoning [8] were presented to BRAINN. The system incorporates the premises and correctly derives the conclusions for problems A1, A2, A3, A4, B1 and B2, which include default reasoning, linear inheritance and cancellation of inheritance. 3.2 Sample Knowledge Base BRAINN was also tested with a more realistic knowledge base in the domain of soil science. This knowledge base consists of 20 rules with up to 2 conditions in the LHS (e.g., IF <&soil, is, clay> AND <&soil, humus level, high> THEN <&soil, compaction, low>) and chains of inference of length 3 at most. The following is an example of BRAINN’s reasoning after “learning” the soil science knowledge base. User inputs: <My_soil, colour, brown> <My_soil, weight, heavy> User query: <My_soil, compaction, ?> The question is sent to all the assemblies. After relaxation, no SN assembly is found with the answer, but there is a RHS assembly that contains an answer. This RHS assembly belongs to the rule, IF <&soil, Fe level, high> AND <&soil, weight, heavy> THEN <&soil, compaction, high>. The RHS assembly sends its weight-multiplied pattern to all of its neighbours. After relaxation the two LHS neighbours are found. The rule is retrieved and written to STM. Then, the first condition, <My soil, Fe level, high>, is sent to all the assemblies. Again, no SN assembly is found, but the RHS assembly of the rule, IF <&soil, colour, brown> THEN <&soil, Fe level, high>, is activated. This second rule is retrieved (as above) and written to STM. The condition of the rule, <My soil, colour, brown>, is then sent to all the assemblies. After relaxation, one of the assemblies still contains the condition so that <My soil, Fe level, high> is confirmed. The system sends the second condition, <My soil, weight, heavy>, of the first rule from STM to all the assemblies. After relaxation one of assemblies contains the condition. Both conditions of the first rule are now confirmed and the system produces the answer, high, from the RHS of the first rule. Although BRAINN behaves as expected, the network is large (36 assemblies of 29 units each). 76 4 R. Bogacz and C. Giraud-Carrier Related Work The BRAINN system is inspired by some of Hinton’s early work [6]. 
In Hinton’s system, as in BRAINN, relations are implemented in a neural network in a distrributed fashion. However, a different network architecture is used. Many hybrid symbolic connectionist systems have been proposed. One of the first such systems is described in [14]. That system imitates the structure of a production system and is made up of several separate modules (working memory, production rules and facts). With its distributed representation in all of the modules, the system can match variables against data in the working memory module by using a winner-take-all algorithm. The system has complex structures and is computationally costly. Moreover, it is restricted to performing sequential rule-based reasoning. CONSYDERR [10] is a connectionist model for concept representation and commonsense reasoning. It consists of a two-level architecture that naturally captures the dichotomy between concepts and the features used to describe them. However, it does not address learning (how such a skill could be incorporated is also unclear) and is limited to reasoning from concepts. CLARION [13], like CONSYDERR, uses two modules of information processing. One module encodes declarative knowledge in a localist network where the nodes that represent a rule’s conditions are connected to the node representing that rule’s conclusion. The other module encodes procedural knowledge in a layered sub-symbolic network. Given input data, decisions are reached through processing in and interaction between both modules. CLARION also allows rule extraction from the procedural to the declarative knowledge module. ScNets [4] aim at offering an alternative to knowledge acquisition from experts. Known rules may be pre-encoded and new rules can be learned inductively from examples. The representation lends itself to rule generation but the constructed networks are complex. Finally, ASOCS [9] are dynamic, self-organizing networks that learn incrementally, from both examples and rules. ASOCS are massively parallel networks, but are restricted to binary classification. A number of other relevant systems are described in [12]. A thorough review of the literature on hybrid symbolic connectionist models is in [11] and a dynamic list of related papers is in [2]. 5 Conclusion This paper presents a hybrid connectionist symbolic system, called BRAINN (Basic Reasoning Applicator Implemented as a Neural Network). In BRAINN, a hexagonal network of Hopfield networks is used to store both relations and rules. Through systematically orchestrated relaxations, BRAINN supports both rule-based and similarity-based reasoning, thus allowing traditional (monotonic) reasoning, as well as several forms of common-sense (non-monotonic) reasoning. Preliminary experiments demonstrate promise. Future work will focus on developing Pavlov’s similarity based algorithm by implementing rules extraction and integrating similarity-based and rule-based A Novel Modular Neural Architecture 77 reasoning into one algorithm. In addition, biological plausibility will be improved by incorporating Goal and STM in the hexagonal network and using local rules to control the behaviour of each assembly and determine the global reasoning process. Acknowledgements This work is supported in part by an ORS grant held by the first author. The soil science knowledge base was donated by Adam Bogacz. References 1. Bogacz, R. and Giraud-Carrier, C. (1998). BRAINN: A Connectionist Approach to Symbolic Reasoning. 
In Proceedings of the First International ICSC Symposium on Neural Computation (NC’98), 907-913. 2. Boz, O. (1997). Bibliography on Integration of Symbolism with Connectionism, and Rule Integration and Extraction in Neural Networks. Online at http://www.lehigh.edu/ ob00/integrated/references-new.html. 3. Calvin, W. (1996). The Cerebral Code. MIT Press. 4. Hall, L.O. and Romaniuk, S.G. (1990). A Hybrid Connectionist, Symbolic Learning System. In Proceedings of the National Conference on Artificial Intelligence (AAAI’90), 783-788. 5. Herz, J., Krogh, A. and Palmer, R. (1991). Introduction to the Theory of Neural Computation. Addison-Wesley. 6. Hinton, G. (1981). Implementing Semantic networks in Parallel Hardware. In Hinton, G. and Anderson, J. (Eds.), Parallel Models of Associative Memory. Lawrence Erlbaum Associates, Inc. 7. Hopfield, J. and Tank, D. (1985). Neural Computation of Decisions in Optimization Problems. Biological Cybernetics, 52:141-152. 8. Lifschitz, V. (1988). Benchmark Problems for Formal Nonmonotonic Reasoning. In Proceedings of the Second International Workshop on Non-Monotonic Reasoning, LNCS 346:202-219. 9. Martinez, T.R. (1986). Adaptive Self-Organizing Networks. Ph.D. Thesis (Tech. Rep. CSD 860093), University of California, Los Angeles. 10. Sun, R. (1992). A Connectionist Model for Common Sense Reasoning Incorporating Rules and Similarities. Knowledge Acquisition, 4:293-331. 11. Sun, R. (1994). Bibliography on Connectionist Symbolic Integration. In Sun, R. (Ed.), Computational Architectures Integrating Symbolic and Connectionist Processing, Kluwer Academic Publishers. 12. Sun, R. and Alexandre, F. (Eds.) (1995). Working Notes of IJCAI’95 Workshop on Connectionist-Symbolic Integration. 13. Sun, R. (1997). Learning, Action and Consciousness: A Hybrid Approach Toward Modeling Consciousness. Neural Networks, 10(7):1317-1331. 14. Touretzky, D. and Hinton, G. (1985). Symbols Among the Neurons: Details of a Connectionist Inference Architecture. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’85), 238-243. Addressing Knowledge-Representation Issues in Connectionist Symbolic Rule Encoding for General Inference Nam Seog Park Information Technology Laboratory, GE Corporate Research and Development One Research Circle, Niskayuna, NY 12309 Abstract. This chapter describes one method for addressing knowledge representation issues that arise when a connectionist system replicates a standard symbolic style of inference for general inference. Symbolic rules are encoded into the networks, called structured predicate networks (SPN) using neuron-like elements. Knowledge-representation issues such as unification and consistency checking between two groups of unifying arguments arise when a chain of inference is formed over the networks encoding special type of symbol rules. These issues are addressed by connectionist sub-mechanisms embedded into the networks. As a result, the proposed SPN architecture is able to translate a significant subset of first-order Horn Clause expressions into a connectionist representation that may be executed very efficiently. 1 Introduction Connectionist symbol processing attempts to replicate symbol processing functionality using connectionist components. Until now, many connectionist systems have demonstrated their abilities to represent dynamic bindings in connectionist styles [1-7]. 
However, only a few connectionist systems were able to provide additional connectionist mechanisms to deal with knowledge representation issues such as unification and consistency checking within and across different groups of unifying arguments, which are required to support general inference [6,7]. Encoding symbolic knowledge in a connectionist style requires not only encoding symbolic expressions into corresponding networks but also implementing some symbolic constraints that the syntax of expression imposes. For instance, if a connectionist mechanism has to encode a symbolic rule such as p(X, X) → q(X) that requires repeated arguments of the p predicate to get bound to the same constant filler or free variable fillers during inference, an additional connectionist mechanism, other than a basic dynamic binding mechanism, is needed to force this condition. This additional connectionist mechanism should be capable of detecting a consistency violation when two different constant fillers are assigned to different occurrences of the same variable argument, X’s in this case. This chapter describes how these issues can be addressed using a generalized phase-locking mechanism [8-10]. The phase-locking mechanism was originally S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 78–91, 2000. c Springer-Verlag Berlin Heidelberg 2000 Addressing Knowledge-Representation Issues 79 proposed by Shastri and Ajjanagadde [7] and extended by Park, Robertson, and Stenning [8] to overcome some of fundamental limitations of the original mechanism in replicating standard symbolic inference. 2 A Structured Predicate Network A structured predicate network (SPN) is a knowledge encoding scheme that the generalized phase-locking mechanism employs. This scheme maps a symbolic rule in a first-order Horn Clause expression [10] to a corresponding localist network. When encoded, each symbolic rule is mapped to a corresponding SPN that is composed of three parts, {Pa , M, Pc }. As illustrated in Figure 1, Pa is a predicate assembly representing the antecedent of the rule, Pc represents the consequent, and M is the intermediate mechanism that connects them together. Encoding a symbolic rule in this architecture is considered as finding a one-to-one mapping between the given rule and the SPN. ... The antecedent predicate assembly .... Intermediate mechanism .... ... The consequent predicate assembly Fig. 1. The structure of an SPN The main focus of this chapter is how to establish the required mapping by providing a proper connectionist mechanisms to build predicate assemblies and the intermediate mechanism. Since each symbolic rule needs to be encoded into an SPN with a unique intermediate mechanism to support a target form of symbolic inference, an automatic rule and SPN mapping mechanism is necessary. 3 Basic Building Blocks The basic components of the SPN are three neuron-like elements: a π-btu, a τ or, and a multiphase τ -or element. Unlike ordinary nodes found in conventional neural network models, these elements have special temporal behaviors hypothesized. These elements sample their inputs over several phases of an oscillation cycle and determine their output patterns, depending on the input patterns sampled during this time period. A phase is a minimum time interval in which 80 N.S. Park a neuron element performs its basic computations – sampling its inputs and thresholding – and an oscillation cycle is a window of time in which neuron elements show their oscillatory behaviors. 
On becoming active, these elements continually produce the output (oscillating) for the duration of inference [7]. For instance, a π-btu element becomes active on receiving one or more spikes in different phases of any oscillation cycle. On becoming active, a π-btu element produces an oscillatory spike that is in-phase with the driving inputs. A τ -or element, on the other hand, becomes active on receiving one or more spikes within a period of oscillation. Once activated, this element produces an oscillatory pulse train whose pulse width is comparable to the period of an oscillation cycle. A multiphase τ -or element becomes active when it receives more than one input pulses in different phases within a period of oscillation and produces an oscillatory pulse train whose pulse width is comparable to the period of an oscillation cycle. A threshold, n, associated with these elements indicates that the elements will fire only if they receive n or more spike inputs in the same phase. Figure 2 demonstrates their behavior graphically. 12 34 56 12 34 56 π-btu element 12 34 56 12 34 56 12 34 56 12 34 56 τ -or element multiphase τ -or element Fig. 2. Temporal behavior of the neuron-like elements. When a mapping is carried out between a given symbolic rule and its corresponding SPN, the antecedent and consequent of the rule are translated into the corresponding predicate assemblies. An n-ary predicate, p(arg1 , arg2 , . . ., argn ), in the antecedent or consequent, is mapped to a predicate assembly consisting of n entity nodes as can be seen in Figure 3. arg1 p arg2 argn ... p {arg1([0],[0]]), arg2([0]],[0]), ... , argn([0],[0])} Fig. 3. The structure of a predicate assembly and corresponding symbolic notation. Addressing Knowledge-Representation Issues 81 An entity node is a pair of π-btu elements. When it is used to represent an argument of the predicate, the left element is used to represent a variable role of the argument and the right element a constant role. Either or both its elements may become active during a chain of inference. If an entity node is used to represent a constant, only its right element becomes active. Whereas when the entity node is used to represent a variable, only its left node becomes active during a chain of inference. In the symbolic notation of the entity node, argi ([0],[0]), the symbol “0” denotes the state in which the left and the right elements of the entity node are inactive. As the basic building blocks, an entity node, a τ -or element, and a multiphase τ -or element are used not only to build predicate assemblies but also to build connectionist sub-mechanisms to be used for the intermediate mechanism of the given symbolic rule. The readers are invited to refer to [9,10] for in-depth description of the behavior of the neuron elements and the structure of a predicate assembly, as well as a dynamic binding mechanism built on these components. 4 Knowledge-Representation Issues To identify the necessary connectionist sub-mechanisms to build the SPN for a given symbolic rule, let us consider the following rule: p(X, X, Y ) → q(X, Y ). A standard type of inference expected with this rule is obtaining the conclusion q(a,a) when the fact p(a,U,U) is known. If a symbolic inference system is used for this inference, it would perform the unification first to acquire the bindings, {a/X,U/X,U/Y}, from which the system can get the information, {a/X,a/Y}, systematically. 
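The symbolic side of this step is ordinary term unification, which can be sketched in a few lines. This is the standard bookkeeping, not the connectionist mechanism; the convention that capitalised names are variables and the helper names are assumptions made for the example.

def is_var(t):
    return t[0].isupper()              # convention for this sketch: capitalised names are variables

def walk(t, subst):
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def unify_args(fact_args, rule_args):
    subst = {}
    for f, r in zip(fact_args, rule_args):
        f, r = walk(f, subst), walk(r, subst)
        if f == r:
            continue
        if is_var(f):
            subst[f] = r
        elif is_var(r):
            subst[r] = f
        else:
            return None                # two different constants: no unifier
    return subst

# The fact p(a, U, U) against the antecedent p(X, X, Y):
print(unify_args(("a", "U", "U"), ("X", "X", "Y")))
# {'X': 'a', 'U': 'a', 'Y': 'a'}  ->  {a/X, a/Y} for the consequent, with {a/U} as a by-product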
This information is then used to substitute the variables in the consequent so that the conclusion q(a,a) can be reached. Mapping the given symbolic rule to the SPN means finding a corresponding connectionist mechanism to carry out similar inference. This implies finding both mechanisms to encode rules and mechanisms to replicate an important symbolic inference procedure [9]. When the above rule is mapped to the corresponding SPN, two predicate assemblies corresponding to the p(X,X,Y) and q(X,Y) predicates will be built using an assembly of entity nodes as follows: p{X1 ([0],[0]),X2 ([0],[0]),Y([0],[0])}, q{X([0],[0]),Y([0],[0])}. The subscript numbers attached to the argument names of p predicate indicate the order of each argument repeatedly appearing in the predicate. This is to differentiate repeated argument in different argument positions and does not affect the meaning of the original argument name. When the fact p(a,U,U) is presented to the SPN, the initial binding, {a/X, U/X, U/Y}, are represented in a connectionist manner by introducing two entity 82 N.S. Park nodes (called filler nodes) corresponding a and U respectively and by activating their neuron elements in the following fashion: – activate both the right elements of the a filler node and p:X1 argument node in the same phase, say the first phase, which results in a[0,1] and p:X1 [0,1], where the number 1 stands for the first phase; – activate the right element of the U filler node and those of the p:X2 and p:Y argument nodes in a different phase, the second phase for example, which results in U[2,0], p:X2 [2,0], and p:Y[2,0], where the number 2 indicates the second phase. This activation will lead the situations in which in-phase activation between the a filler node and the p:X1 argument node represent the binding, {a/X}, and similar in-phase activation among U, p:X2 , and p:Y nodes represent the bindings, {U/X,U/Y}: a([0],[1]), U([2],[0]), p{X1 ([0],[1]),X2 ([2],[0]),Y([2],[0])}, q{X([0],[0]),Y([0],[0])}. If the intermediate mechanism that will be built between the p predicate assembly and the q predicate assembly (see Figure 1) propagates these initial bindings to the q predicate assembly in such a way that arguments of the q predicate assembly are activated in q{X([2],[1]),Y([2],[1])}, the result of inference can be obtained from the in-phase activation among the filler nodes, a and U, and the argument nodes, q:X and q:Y. From in-phase activation between the a filler node and the right elements of the two argument nodes, the bindings {a/X,a/Y}, and from in-phase activation between the U filler node and the left elements of the two argument nodes, the bindings {U/X,U/Y} are obtained. These together provide the conclusion of the inference q(a,a) with the intermediate result of the unification, {a/U}. As this example illustrates, finding mapping between the symbolic rule and the corresponding SPN first requires building the p and q predicate assemblies for the rule and a sub-mechanism that will be used to propagate the initial bindings, in the form of active phases, from the p predicate assembly to the q predicate assembly as inference initiates. Furthermore, additional sub-mechanisms are also necessary to ensure the consistency conditions that the syntax of the rule imposes and to perform unification within and across groups of unifying arguments. The collection of these sub-mechanisms is called the intermediate mechanism in SPN architecture. 
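Reading bindings off the phase-coded activation can be illustrated with a small sketch, assuming the convention used in the example above: each filler fires in its own phase, and an argument element bound to a filler fires in that same phase. The data layout and names below are hypothetical.

fillers = {"a": 1, "U": 2}                     # filler node -> phase in which it fires

# q{X([2],[1]), Y([2],[1])}: (phase of the variable element, phase of the constant element)
q_arguments = {"X": (2, 1), "Y": (2, 1)}

def read_bindings(arguments, fillers):
    bindings = []
    for arg, (var_phase, const_phase) in arguments.items():
        for filler, phase in fillers.items():
            if phase in (var_phase, const_phase):   # in-phase activation is read as a binding
                bindings.append((filler, arg))
    return bindings

print(read_bindings(q_arguments, fillers))
# [('a', 'X'), ('U', 'X'), ('a', 'Y'), ('U', 'Y')]  ->  q(a, a), with the intermediate binding {a/U}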
In summary, the intermediate mechanism handles knowledge-representation issues during the inference on the SPN. The more detailed tasks that this mechanism is expected to perform are: Addressing Knowledge-Representation Issues 83 – keep the consistency of the initial bindings on the antecedent predicate assembly; – provide a path for binding propagation from the antecedent predicate assembly to the consequent predicate assembly; – perform binding interaction (unification) within and across unifying groups of arguments; – maintain consistency of the bindings in the consequent after binding propagation. These issues were initially articulated in [6], and the users can refer to this article for some other revevant knowledge-representation issues. 5 Building an Intermediate Mechanism How to build an intermediate mechanism for a given symbolic rule is a key issue in the SPN architecture. This section provides in more detail the structure of the intermediate mechanism. 5.1 Sub-mechanisms for Consistency Checking and Binding Propagation Two types of consistency checking are required when a symbolic rule is given to be encoded into the SPN: constant consistency checking for a constant argument and variable consistency checking for repeated variable arguments. Constant consistency checking forces the condition that all constant arguments in the antecedent must get bound to the same constant filler or free variable fillers during inference. Variable consistency checking, on the other hand, forces all the repeated variable arguments in the antecedent to get bound to the same constant filler or free variable fillers. The rules needing this treatment are determined by checking the type of arguments and whether they have repeated arguments in the antecedent. In case of the example rule p(X, X, Y ) → q(X, Y ) variable consistency checking is required to make sure the repeated arguments, X’s, are always bound to the same constant filler during inference. A connectionist submechanism that forces this consistency is called the consistency checking submechanism. After consistency checking, the initial bindings set between the filler nodes and the antecedent predicate assembly need to propagate to the consequent predicate assembly to complete the inference. Therefore, an additional submechanism is needed to provide paths between the argument nodes in the antecedent predicate assembly to the corresponding argument nodes in the consequent predicate assembly. The rules that need this sub-mechanism are determined by checking the argument matching between the arguments in the antecedent and those in the consequent of the given rule. The example rule has both arguments, X and Y, appearing in the antecedent and the consequent at the same time; therefore, the example rule needs the binding propagation sub-mechanism. 84 N.S. Park p:X2 p:X1 p:Y p b1 mto1 q q:X q:Y Fig. 4. Consistency checking and binding propagation sub-mechanisms for the rule p(X, X, Y ) → q(X, Y ) Figure 4 shows the consistency checking sub-mechanism and the binding propagation sub-mechanism required for the example rule. The figure illustrates two predicate assemblies at the top and the bottom of the SPN. Between them are sub-mechanisms for consistency checking and binding propagation. 
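As a rough illustration of what the two consistency checks must detect (plain code, not the π-btu/τ-or circuitry; the naming conventions are assumptions for the example):

def is_var(t):
    return t[0].isupper()

def consistent(antecedent_args, fact_args):
    seen = {}                                   # repeated variable argument -> constant filler seen so far
    for name, filler in zip(antecedent_args, fact_args):
        if is_var(filler):
            continue                            # free variable fillers never violate consistency
        if not is_var(name) and name != filler:
            return False                        # constant argument bound to a different constant
        if is_var(name):
            if name in seen and seen[name] != filler:
                return False                    # repeated variable bound to two different constants
            seen[name] = filler
    return True

print(consistent(("X", "X", "Y"), ("a", "U", "U")))   # True
print(consistent(("X", "X", "Y"), ("a", "b", "U")))   # False: X bound to both a and b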
Since any repeated arguments of the antecedent are forcing the condition that they have to get bound to the same constant filler or free variable fillers during inference, this requires any bindings generated from the repeated arguments to be collected for consistency checking. The node with the label b1 is added for this purpose. This node is called a binding node. When the repeated argument nodes (p:X’s) get bound to filler nodes by being activated at the same phase, the initial bindings are propagated to the b1 binding node automatically. The left element of the binding node then represents variable bindings of the repeated argument nodes and the right element constant bindings by becoming active at the corresponding phases. The consistency of these intermediate bindings is then checked by the separate consistency checking sub-mechanism. The mto1 node, which is a multi-phase τ -or element, is inserted for this purpose. Whenever the right elements of the repeated argument nodes are bound to two different constant filler nodes, the right element of the b1 node will become active at more than one phase. The mto1 node receiving direct input from the b1 node will detect this situation and projects the inhibitory signal to stop the flow of the activation from the antecedent predicate assembly to the consequential predicate assembly. Note the dashed link from mto1 to one black dot near the q predicate assembly, which is in turn connected to three other black dots by a dashed line. This is to abbreviate representation of a full connection from mto1 to each black dot. On becoming active, the mto1 node sends the same inhibitory signal to all of these black dots at the same time. Addressing Knowledge-Representation Issues 85 In Figure 4, the repeated argument nodes, p:X1 and p:X2 , in the p predicate assembly are connected to the corresponding argument node, q:X, in the q predicate assembly via the b1 binding node. These links serve as the binding propagation sub-mechanism between X arguments. At the same way, the links between the p:Y argument node to the q:Y argument node are used for the binding propagation sub-mechanism between Y arguments. Whenever the argument nodes of the p predicate assembly become active at the specific phases, this activation automatically propagates to the corresponding argument nodes of the consequent predicate assembly through these binding propagation submechanisms. 5.2 A Sub-mechanism for Binding Interaction The antecedent of a symbolic rule usually has several arguments, and these argument can be categorized into several groups based on their symbolic name. The example rule p(X, X, Y ) → q(X, Y ), for instance, has two groups of arguments, {p:X1 ,p:X2 } and {p:Y}. Because each of these groups share the same argument name, all the arguments in the same argument group are supposed to get bound to the same constant filler or free variables during inference to keep the consistency. For this reason, this argument group is called a Unifying Argument Group (UAG). At the beginning of inference, if the different types of fillers are assigned to the arguments pertaining to the same UAG, unification occurs among them. For example, presenting the fact, p(a,U,U), to the antecedent of the example rule will generate two sets of bindings, {a/X,U/X} and {U/Y}, to start inference. Since the first set of bindings is obtained from the same argument name, X, unification occurs. As a result, the new intermediate binding, {a/U}, will be obtained. 
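The grouping into UAGs and the unification within each group can be sketched as follows, under the convention that lowercase terms are constants and uppercase terms are variables; the helper names are assumptions, and cross-UAG interaction (discussed next) is deliberately left out.

from collections import defaultdict

def is_var(t):
    return t[0].isupper()

def uags(antecedent_args):
    groups = defaultdict(list)
    for position, name in enumerate(antecedent_args):
        groups[name].append(position)
    return dict(groups)

def unify_within_uag(fillers):
    """All fillers delivered to one UAG must agree; a constant wins over variables."""
    constants = {f for f in fillers if not is_var(f)}
    if len(constants) > 1:
        return None                     # consistency violation
    result = constants.pop() if constants else fillers[0]
    bindings = {f: result for f in fillers if f != result}
    return result, bindings

# The fact p(a, U, U) presented to the antecedent p(X, X, Y):
fact = ("a", "U", "U")
groups = uags(("X", "X", "Y"))          # {'X': [0, 1], 'Y': [2]}
for name, positions in groups.items():
    print(name, unify_within_uag([fact[p] for p in positions]))
# X ('a', {'U': 'a'})   <- the intermediate binding {a/U}
# Y ('U', {})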
In addition, the example rule also requires unification between these two UAGs. In the above situation, the same variable filler U is assigned to the two arguments, p:X2 and p:Y, which belong to two different UAGs. Therefore, the unification result obtained within the first UAG has to interact with that obtained within the second UAG to produce the desirable result of inference. This binding interaction between UAGs makes sure that the intermediate binding, {a/U}, produced from the first UAG to be used for the second UAG so that the third argument of the p predicate gets bound to the constant filler a in the end rather than being unified only with the variable filler U throughout the inference. In general, binding interaction between variable UAGs occurs when at least one of UAGs has repeated variable arguments. This type of binding interaction is not necessary between two UAGs having a single variable argument because no consistency checking is required in this case. The connectionist sub-mechanism which performs this task in a SPN architecture is called the binding interaction sub-mechanism. The symbolic rules which require this sub-mechanism are determined by checking if the rule has more than two variable UAGs in the antecedent and if one of them has repeated 86 N.S. Park variable arguments. Figure 5 visualizes the binding interaction sub-mechanism built for the example rule. p:X2 p:X1 p:Y p b1 to1 mto1 2 b2 2 2 mto2 q q:X q:Y Fig. 5. Binding interaction sub-mechanism for the rule p(X, X, Y ) → q(X, Y ) The additional components added to this network, compared to the network shown in Figure 4, are the to1, b2, and mto2 nodes. The first two nodes, to1 and b2, are inserted as the binding interaction sub-mechanism and the mto2 node as the consistency checking after binding interaction. The role of the to1 node is to block the direct binding propagation from the argument nodes of the antecedent predicate assembly to the corresponding argument nodes of the consequent predicate assembly. When the same variable filler is assigned to two arguments belong to different UAGs at the same time, the to1 node becomes active and projects the inhibitory signal to the binding propagation sub-mechanism between the antecedent predicate assembly and the consequential predicate assembly. As can be seen in the figure, the two links coming from the left element of the p:Y argument node and that of the b1 binding node provide inputs to the to1 node. On receiving these signals, the to1 node determines whether it will project the inhibitory signal or not. Since the threshold of the to1 node is 2, it will be activated only when the two input signals are active in the same phase; that is, they are in synchrony with the same variable filler node, U in this case. Activation of the to1 node allows the initial bindings to be propagated to the q predicate assembly only through the b2 binding node where the cross UAG binding interaction occurs. Before the activation of the b2 binding node Addressing Knowledge-Representation Issues 87 propagates further, its consistency is checked by the mto2 node to ensure the consistent result at the end of inference. There are several steps of the activation flow on the SPN through the entire step of inference. When the network is initialized with the fact p(a,U,U), the fillers and the antecedent predicate assembly will be activated in the following phases: a([0],[1]), U([2],[0]), p{X1 ([0],[1]),X2 ([2],[0]),Y([2],[0])}. 
In the next oscillation cycle, the b1 binding node is activated in b1([2],[1]). After checking the consistency, this activation propagates to the b2 binding node automatically. The b2 binding node also receives activation from the third argument node, p:Y, at the same time and becomes active in b2([2],[1]). During this binding propagation, the to1 node becomes active and blocks the direct binding propagation from the argument nodes in the p predicate assembly to those in the q predicate assembly. Consequently, the argument nodes in the q predicate assembly receive inputs only from the b2 binding node and are activated in the following phases: q{X([2],[1]),Y([2],[1])}. This situation is finally interpreted as the conclusion of the inference, q(a,a), with the binding {a/U} as the desired result of unification. 5.3 Additional Sub-mechanisms Until now only one example rule has been used to explain the rule-SPN mapping procedure. In order to build more comprehensive SPNs which replicate various standard symbolic styles of inference, additional sub-mechanisms are needed. First of all, the SPN architecture needs an extra consistency checking submechanism for consistency checking of the constant arguments in the antecedent of a rule. This sub-mechanism forces the condition that all constant arguments in the antecedent must get bound to the same constant fillers specified as the argument names or free variable fillers during inference. This means that the first argument of the rule, p(a, X) → q(X), should get bound to only the constant a to initiate the inference. Otherwise, the consistency violation occurs and the inference has to stop. Secondly, depending on the syntax of a rule, the binding interaction submechanism is needed not only between variable UAGs, as already explained, but also between the following UAGs: 88 N.S. Park – a pair of a constant UAG and a variable UAG, – a pair of two constant UAGs. For any pair of a constant and a variable UAG, if the same variable filler is assigned to their arguments at the same time, the binding obtained between the variable filler and the arguments in the constant UAG must migrate to the arguments in the variable UAG. This situation can be seen when the fact p(U,U) is presented to the rule p(c, X) → q(X). Presenting the fact first generates the bindings between the variable fillers (U’s) and the two arguments of the antecedent, which results in the set of bindings {U/c,U/X}. Since the arguments p:c and p:X are bound to the same variable filler, binding interaction between these two arguments generates the intermediate binding, {c/X}, which is then used to produce the desirable result q(c) from the consequent of the rule. Thus, a separate binding interaction sub-mechanism with different internal structure is needed. Binding interaction between constant UAGs refers to the situation where the same variable filler is assigned to two different constant argument nodes as exemplified with the rule p(a, b) → q(b, c). Presenting the fact p(U,U) should fail the matching between the presented fact and the antecedent of the rule because the same variable filler U is bound to two different constant arguments a and b at the same time. This indicates that for any two different constant UAGs in the antecedent, we need a binding interaction sub-mechanism which detects this situation and prevents the rule from firing during inference. Due to the space limitation, the details about these additional sub-mechanisms are not described in this chapter. 
However, the readers can refer to [11] for more details about these sub-mechanisms.

5.4 A Mapping Algorithm

The following summarizes how to map a given symbolic rule to the corresponding SPN:

RULE-SPN MAPPING PROCEDURE
STEP 1: For a given symbolic rule, build the corresponding antecedent predicate assembly and the consequential predicate assembly.
STEP 2: Build the intermediate mechanism in such a way that
1. Among the argument nodes in the same unifying argument group (UAG)
   a. Introduce a binding node for any repeated argument nodes
   b. Build the consistency checking sub-mechanism for constant argument nodes and repeated variable argument nodes
   c. Build the binding propagation sub-mechanism between the argument nodes in the antecedent predicate assembly and their corresponding argument nodes in the consequential predicate assembly which share the same argument name
2. For each of the following pairs of UAGs in the antecedent, build a binding interaction sub-mechanism between them:
   - a pair of variable UAGs, of which at least one has repeated variable arguments
   - a pair of a variable UAG and a constant UAG
   - a pair of constant UAGs

Apart from the example rule used throughout this chapter, the SPN architecture was explored to encode various types of symbolic rules and successfully demonstrated a symbolic style of inference. The following rules are some of them:
– p(a) → q(a),
– p(a, b) → q(a, b),
– p(X, X) → q(X),
– p(a, X, X) → q(a, X),
– p(X, X, Y, Y) → q(X, Y).
However, further experiments showed that the current SPN architecture has one limitation in processing unification across more than two UAGs. This situation was detected when the fact p(U,V,W,U,V) is presented to the SPN encoding the following rule: p(a, X, X, Y, Y) → q(a, X, Y). This rule has three UAGs – {a}, {p:X1, p:X2}, and {p:Y1, p:Y2} – in the antecedent. When the predicate p(U,V,W,U,V) is presented to the SPN to initiate inference, arguments in each UAG carry out unification to obtain three sets of bindings: {U/a}, {V/X,W/X}, and {U/Y,V/Y}. These bindings in turn interact with each other to produce the following sets of intermediate bindings:
– { } from interaction between {U/a} and {V/X,W/X}
– {a/Y} from interaction between {U/a} and {U/Y,V/Y}
– {X/Y} from interaction between {V/X,W/X} and {U/Y,V/Y}
These results, as they stand, only give the intermediate results {a/Y} and {X/Y}, and do not provide another intermediate binding, {a/X}, which is needed to produce the desirable result, q(a,a,a). In order to obtain this binding, one more step of unification is required based on the above intermediate bindings produced during the first stage of binding interaction between UAGs. The current SPN architecture, however, is designed to perform only single-stage unification between each pair of UAGs; it cannot therefore deal with inference which requires more than one step of unification. In order to achieve the additional unification step, the SPN needs another set of binding interaction sub-mechanisms between the first layer of the binding interaction sub-mechanisms and the q predicate assembly. In general, if there are m UAGs in the antecedent of a rule, the SPN requires m − 1 layers of binding interaction sub-mechanisms to perform full unification among all the UAGs involved.
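Purely as an illustration of the symbolic behaviour that these sub-mechanisms have to reproduce (and not as part of the SPN architecture itself), the following Python sketch performs the UAG bookkeeping for a single rule and fact: every rule argument and every filler is placed in an equivalence class, within-group unification and cross-group binding interaction both appear as class merges, and merging two distinct constants signals a consistency violation. The function names and the upper-/lower-case convention are our own.

def is_var(term):
    return term[0].isupper()   # convention: variable fillers and rule variables are upper case

def unify_rule_with_fact(rule_args, fact_args):
    """Group repeated rule arguments into UAGs implicitly (repeated names share a class),
    unify each group with the fillers presented at its positions, and let the groups
    interact by merging classes that share a filler.  Returns the value bound to each
    rule argument, or None if a consistency check fails."""
    parent = {}

    def find(t):
        parent.setdefault(t, t)
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    def union(s, t):
        rs, rt = find(s), find(t)
        if rs == rt:
            return True
        if not is_var(rs) and not is_var(rt):
            return False                      # two different constants: the rule must not fire
        if not is_var(rs):                    # keep constants as class representatives
            rs, rt = rt, rs
        parent[rs] = rt
        return True

    for arg, filler in zip(rule_args, fact_args):
        if not union(arg, filler):
            return None
    return {arg: find(arg) for arg in set(rule_args)}

# The rule p(a, X, X, Y, Y) -> q(a, X, Y) presented with the fact p(U, V, W, U, V):
print(unify_rule_with_fact(["a", "X", "X", "Y", "Y"], ["U", "V", "W", "U", "V"]))
# bindings a -> a, X -> a, Y -> a, i.e. the conclusion q(a, a, a)
print(unify_rule_with_fact(["a", "b"], ["U", "U"]))
# None: the filler U cannot be bound to the two different constants a and b

Note that this union-find bookkeeping reaches the binding {a/X} of the example above in a single pass, whereas the SPN, which only interacts pairs of UAGs, needs the extra layers of binding interaction sub-mechanisms discussed in the text.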
It is worth noting, however, that the number of binding interaction sub-mechanisms required is only proportional to the number of unique UAGs of a rule, not to the total number of the arguments of the rule.

6 Related Work

CONSYDERR [12] addresses all the knowledge-representation issues and proposes solutions for them. It performs unification and deals with consistency checking. However, CONSYDERR handles these issues not by a uniform connectionist mechanism but by different types of complex network elements which are hypothesized to have such a function. CHCL [5] can also deal with consistency checking and unification implicitly by an adopted connectionist unification algorithm [13]. Its representation scheme of terms using a set of position-label pairs can also easily represent function terms in the initial stages of inference and also variable bindings if any are generated during inference. SHRUTI [7], built on the basis of the original phase-locking mechanism, performs limited consistency checking and unification due to a limitation in representing full variable bindings. DCPS [1] and TPPS [2] only have built-in unification mechanisms in restricted forms of rules due to the distributed representation they employed. However, it is not clear how these connectionist models deal with consistency checking.

7 Conclusion

Any connectionist architecture which aspires to replicate symbolic inference should be able to deal with the knowledge-representation issues as well as representing dynamic bindings. A structured predicate network (SPN) architecture is a step in this direction. The rule encoding architecture proposed in this chapter provides a way of representing not only a group of unifying arguments but also many groups of such arguments with appropriate consistency checking between groups. Although what is described here is a forward chaining style of inference, the basic concept of the proposed connectionist architecture, with only slight adjustments, can be used to encode facts and rules to support backward chaining style inference. The proposed SPN architecture is able to translate a significant subset of first-order Horn clause expressions into a connectionist representation that may be executed very efficiently. However, in order to have the full expressive power of Horn clause FOPL, we need to add the ability to represent structured terms as arguments of a predicate and recursion in rules. Currently, no connectionist system has provided convincing solutions to all these problems.

References

1. Touretzky, D. S., Hinton, G. E.: A distributed connectionist production system. Cognitive Science 12(3) (1988) 423 – 466 2. Dolan, C. P., Smolensky, P.: Tensor product production system: A modular architecture and representation. Connection Science 1(1) (1989) 53 – 68 3. Lange, T. E., Dyer, M. G.: High-level inferencing in a connectionist network. Connection Science 1(2) (1989) 181 – 217 4. Barnden, J.: Neural-net implementation of complex symbol-processing in a mental model approach to syllogistic reasoning. Proceedings of The 11th International Joint Conference on Artificial Intelligence, San Mateo, CA, Morgan Kaufmann Publishers Inc. (1989) 568 – 573 5. Hölldobler, S. H., Kurfeß, F.: CHCL - A connectionist inference system, In B. Fronhöfer & G. Wrightson (Eds.). Parallelization in Inference Systems, Lecture Notes in Computer Science, Berlin, Springer-Verlag (1991) 318 – 342 6. Sun, R.: On variable binding in connectionist networks.
Connection Science 4(2) (1992) 93–124 7. Shastri, L., Ajjanagadde, V.: From simple associations to systematic reasoning: A connectionist representation of rules, variables, and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences 16(3) (1993) 417 – 451 8. Park, N. S., Robertson, D., Stenning, K.: An extension of the temporal synchrony approach to dynamic variable binding in a connectionist inference system. Knowledge-Based Systems 8(6) (1995) 345 – 357 9. Park, N. S., Robertson, D.: A connectionist representation of symbolic components, dynamic bindings and basic inference operations. Proceeding of ECAI-96 Workshop on Neural Networks and Structured Knowledge (1996) pp 33 – 39 10. Park, N. S., Robertson, D.: A localist network architecture for logical inference. In R. Sun, F. Alexandre (Eds.), Connectionist-Symbolic Integration, Hillsdale, NJ, Lawrence Erlbaum Associates Publishers (1997) 11. Park, N. S.: A Connectionist Representation of First-Order Formulae with Dynamic Variable Binding, Ph.D. dissertation, Department of Artificial Intelligence, University of Edinburgh (1997) 12. Sun, R.: Integrating Rules and Connectionism for Robust Commonsense Reasoning. John Wiley and Sons, New York, NY. (1994) 13. Hölldobler, S. H.: A structured connectionist unification algorithm. Proceedings of the National Conference on Artificial intelligence, Menlo Park, CA, AAAI Press (1990) 587 – 593 Towards a Hybrid Model of First-Order Theory Refinement Nelson A. Hallack, Gerson Zaverucha, and Valmir C. Barbosa Programa de Engenharia de Sistemas e Computação—COPPE Universidade Federal do Rio de Janeiro Caixa Postal 68511, 21945-970 Rio de Janeiro , RJ, Brasil {hallack, gerson, valmir}@cos.ufrj.br Abstract. The representation and learning of a first-order theory using neural networks is still an open problem. We define a propositional theory refinement system which uses min and max as its activation functions, and extend it to the first-order case. In this extension, the basic computational element of the network is a node capable of performing complex symbolic processing. Some issues related to learning in this hybrid model are discussed. 1 Introduction In recent years, systems that combine analytical and inductive learning using neural networks, like KBANN [8], CIL2 P [15,16], and RAPTURE [7], have been shown to outperform purely analytical or inductive systems in the task of propositional theory refinement. This process involves mapping an initial domain theory, which can be incorrect and/or incomplete, onto a neural network, which is then trained with examples. Finally, the revised knowledge is extracted from the network back into a logical theory. However, a similar mapping for first-order theories involves two well-known and unsolved problems on neural networks—the representation of structured terms and their unification. To overcome these problems, some models have been proposed: [17] uses symmetric networks, SHRUTI [18] employs a model which makes a restricted form of unification—actually this system only propagates bindings—, and CHCL [19] realizes complete unification. Unfortunately, these approaches have in common their inability to perform learning, which is our main goal. In [1], a restricted form of learning is performed. Kalinke [20] criticizes several approaches for representing structured terms using distributed representations. 
The use of hybrid systems seems a promising idea, in the sense that it in principle allows us to take advantage of the characteristics of both models. We describe the current state of a system that we are developing towards this goal. Basically, we replace the AND and OR usually achieved through the setting of weights and thresholds with the fuzzy AND (min) and OR (max) functions, and use as the computational element of our network a neuron with extended capabilities. The following is how the remainder of the paper is organized. Section 2 discusses some problems that arise in the standard model of propositional theory refinement with neural networks and describes the MMKBANN network with min/max activation functions. Section 3 presents some experimental results on this model. In Sect. 4, we provide an extension of the MMKBANN model to the first-order case, assuming a node with extended capabilities. In Sect. 5, a brief example of relational theory refinement is shown and some issues of learning are discussed.

2 MinMax Neural Networks

The task of propositional theory refinement is, given a knowledge base (KB) and a set of examples E—both describing the dependencies between some set of propositional symbols—, to modify KB to cover the positive examples and not to cover the negative ones. When KB is expressed as a normal logic program,1 this problem can be regarded as an instance of the problem of learning a mapping between input and output neurons of a standard feedforward neural network or a partially recurrent network. In the knowledge-based neural networks approach, propositional symbols appearing only in the antecedent of KB clauses (we call these features to distinguish them from the propositional symbols that appear in the consequent of some clause) constitute the network's input. Propositional symbols that are not features may be intermediate conclusions (represented as intermediate neurons), or final conclusions, that is, concepts which are being refined (represented as output neurons). A propositional symbol is considered true or false when its output value is within a predetermined range. For instance, in the CIL2P model with activations between 0 and 1, the range corresponding to true is [Amin, 1], while [0, Amax] corresponds to false, with Amax < 0.5 < Amin. When the output value is in (Amax, Amin), the truth-value is unknown. The logical AND and OR operations are implemented by a procedure that sets bias and weight values appropriately, as depicted in Fig. 1. The values shown in this figure comply with the typical activation function given by

yj(e, w) = 1 / (1 + e^(−vj(e, w))),   (1)

where

vj(e, w) = Σ_{i ∈ I(j)} wji yi(e, w) + θj.   (2)

In (1) and (2), e ∈ E is the running example, w is the network's weight vector, I(j) is the set of connections incoming to neuron j, wji is the weight of the connection from neuron i to neuron j, θj is neuron j's threshold, and yj(e, w) its output.

1 A set of definite Horn clauses allowing the negation of antecedents.

Fig. 1. The use of weights and thresholds to represent (a) D = A ∧ B ∧ ¬C and (b) D = A ∨ B ∨ C (positive antecedents enter with weight W and negated antecedents with weight −W; the thresholds are −3W/2 for the AND unit in (a) and −W/2 for the OR unit in (b))

Our standard knowledge-based neural network, designed to perform propositional theory refinement and henceforth called UCIL2P (for Unfolded-CIL2P), basically follows the KBANN topology: each propositional symbol is located in a
specific layer according to the initial KB and near-zero weights are added connecting neurons of one layer to those of the next. These connections are designed to allow the discovery of dependencies between the neurons that are not expressed in KB. Unlike KBANN, we choose to always represent the AND of a clause by a neuron, so in our network neurons in the same layer always compute the same logical function (AND or OR). Unlike CIL2P, we do not have a Jordan-like network architecture [5]. Therefore, a UCIL2P network (determined by KB) may have an arbitrary number of hidden layers. The Amin and Amax values are the same as in CIL2P, as described previously. The initial weights and thresholds are set in such a manner as to yield values similar to those of Fig. 1. For instance, KB = {A → D, A ∧ B → E, ¬C → E} leads to the values shown in Fig. 2. We call an antecedent of a clause a negative or positive literal, respectively, if it is negated or not. For instance, ¬C is a negative literal (or antecedent) of clause ¬C → E and B is a positive literal (or antecedent) of clause A ∧ B → E.

Fig. 2. A UCIL2P network with dashed lines to represent connections of near-zero weights
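As a concrete (and purely illustrative) reading of Figs. 1 and 2, the sigmoid unit of (1)-(2) behaves as an AND or an OR gate under the weight-and-threshold scheme just described; the following Python fragment is our own sketch of that scheme, with W = 10 chosen arbitrarily rather than computed from KB, Amin and Amax as in the text.

import math

W = 10.0   # illustrative magnitude; the paper derives W from KB and the Amin/Amax values

def sigmoid_unit(inputs, weights, theta):
    """Equations (1)-(2): y = 1 / (1 + exp(-(sum_i w_i * y_i + theta)))."""
    v = sum(w * y for w, y in zip(weights, inputs)) + theta
    return 1.0 / (1.0 + math.exp(-v))

def and_unit(a, b, c):
    """D = A AND B AND (NOT C): weight W on positive literals, -W on the negated one,
    threshold -3W/2, as in part (a) of Fig. 1."""
    return sigmoid_unit([a, b, c], [W, W, -W], theta=-1.5 * W)

def or_unit(a, b, c):
    """D = A OR B OR C: weight W on every antecedent, threshold -W/2, as in part (b)."""
    return sigmoid_unit([a, b, c], [W, W, W], theta=-0.5 * W)

print(and_unit(1, 1, 0))   # close to 1: A and B hold and C does not
print(and_unit(1, 1, 1))   # close to 0: C holds, so the conjunction fails
print(or_unit(0, 1, 0))    # close to 1: one disjunct suffices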
Some drawbacks arise from this architecture, and also from similar architectures like KBANN and CIL2P, as follows. The initial value of W, which is calculated according to KB and the Amin and Amax values, grows with max(m, n), where m is the maximum number of clauses with the same consequent and n is the maximum number of positive literals in a clause. Because high initial weights can disturb training, if in KB this problem occurs the calculated value of W is disregarded. This is not actually a serious problem, as seen from KBANN and CIL2P's results, but in these cases there is no formal guarantee of the correspondence between KB and the resulting network. Another drawback is that, after training, the initial topology, with some nodes standing for OR and others for the AND operator, is lost and we can no longer ensure the correspondence with a symbolic set of rules. This problem has only partially been addressed via extraction algorithms [4] and the introduction of a penalty function in the backpropagation algorithm that forces the network to maintain a correspondence with a symbolic set of rules [25]. To overcome these problems, we propose an architecture with the fuzzy AND and OR operators, min and max, respectively.2

2 Although not differentiable at all points, such functions did on preliminary tests allow faster convergence during training than differentiable approximations of them. Standard gradient descent can no longer be applied, though.

There are two types of dependency between an OR node j and some other node i ∈ I(j) in the network that we would like to represent:
1. Node i is one of the arguments of the OR node j;
2. There are no dependencies between node i and the OR node j.
Similarly, there are three types of dependency between an AND node j and some other node i ∈ I(j) in the network that we would like to represent:
1. Node i is a positive antecedent of the AND node j;
2. Node i is a negative antecedent of the AND node j;
3. There are no dependencies between node i and the AND node j.
Considering the range of activation as [0, 1], a natural choice to represent these dependencies in the OR and AND nodes would be to pick

yj(e, w) = max_{i ∈ I(j)} wji yi(e, w)   (3)

and

yj(e, w) = min_{i ∈ I(j)} [ wji yi(e, w) − wji/2 + 1/2 ],   (4)

respectively, as their activation functions. Using (3), we can represent the first type of OR-dependency simply with wji = 1 and the second type with wji = 0. Using (4), we can represent the first type of AND-dependency with wji = 1, the second type with wji = −1, and the third type with i ∉ I(j). But since our intention is to let the weight-learning procedure learn which antecedents should be in a clause, (4) turns out not to be a good choice. Unlike UCIL2P, we cannot delete an antecedent i from a clause represented by neuron j by simply changing the value of wji; nor can we use near-zero weight values to connect candidate antecedents to j, as in the standard model. So we choose

yj(e, w) = min_{i ∈ I(j)} [ wji g(yi(e, w)) − wji + 1 ]   (5)

as the activation function for AND nodes, where

g(yi(e, w)) = yi(e, w), if i is a positive antecedent; 1 − yi(e, w), if i is a negative antecedent.   (6)

The motivation for (5) lies in the binary variables that are used to model constraints in integer programming.
When wji = 1, wji g(yi) − wji + 1 becomes g(yi); when wji = 0, it becomes 1. Since min(A, B, 1) = min(A, B) if A, B ≤ 1, a weight-learning procedure is capable of adding and deleting antecedents in a clause if (5) is used. If we use [−1, 1] instead of [0, 1] as the range of activation, then (3) and (5) become

yj(e, w) = max_{i ∈ I(j)} [ wji yi(e, w) + wji − 1 ]   (7)

and

yj(e, w) = min_{i ∈ I(j)} [ wji g(yi(e, w)) − wji + 1 ],   (8)

respectively, where

g(yi(e, w)) = yi(e, w), if i is a positive antecedent; −yi(e, w), if i is a negative antecedent.   (9)

Although it is not yet quite clear whether we should use [−1, 1] or [0, 1] as the range of activation, in the experiments that we describe later [−1, 1] is employed. The network that employs (8) and (9) as activation functions is called MMKBANN (for MinMax KBANN) and has the same topology as UCIL2P. The differences between the two networks are the activation functions and the weights, which are initially 1 or 0 in the former network. Neurons in one layer are connected to the neurons of the next layer with weights 1 or 0, depending on whether the corresponding relation is in KB or not. Because we do not know in advance whether an antecedent should be added to a clause as a positive or a negative literal, there must be two connections between the corresponding neurons, one for each case. However, this is a domain-dependent design decision. Notice that the use of min/max functions ensures the correspondence of the initial KB and the resulting MMKBANN network. It also guarantees a closer correspondence between the trained network and a Horn-clause set. It is also worth noticing the resemblance between MMKBANN and the combinatorial neural model (CNM) [12], which also uses min/max as the activation functions of its neurons. In the CNM model, however, the network has an input layer, a hidden layer, and a single-neuron output layer, with weights between the hidden and output layers only. The functions min and max are not differentiable, so we apply a subgradient search method similar to the method used in [14] to minimize the mean square error given by

mse(w) = (1/|E|) Σ_{e ∈ E} Σ_{o ∈ O} [ yo(e, w) − ŷo(e) ]²   (10)

where w is the vector having as components the weights of the network, E is the training set of examples, O the set of output units, yo(e, w) the output value of unit o when example e is running, and ŷo(e) the desired value for this unit. Like in gradient-descent minimization, we can let

∆w = −α s(w)/|s(w)|   (11)

where s(w) is one of the subgradients of mse(w) at w and α is the learning rate parameter. The ji-component of one of such subgradients, corresponding to one of the subgradients of mse(w) with respect to a weight wji, is given by

sji(w) = (2/|E|) Σ_{e ∈ E} Σ_{o ∈ O} [ yo(e, w) − ŷo(e) ] (∂yo(e, w)/∂yj(e, w)) (∂yj(e, w)/∂wji).   (12)

The values of ∂yo(e, w)/∂yj(e, w) and ∂yj(e, w)/∂wji can be computed in the same way as done in standard backpropagation, by using

∂ max(x1, x2, ..., xn)/∂xi = 1, if xi = max(x1, x2, ..., xn); 0, otherwise   (13)

and

∂ min(x1, x2, ..., xn)/∂xi = 1, if xi = min(x1, x2, ..., xn); 0, otherwise,   (14)

where 1 ≤ i ≤ n. Equivalently, we can use

∂ max(x1, x2, ..., xn)/∂xi = 1, if |xi − max(x1, x2, ..., xn)| ≤ θ; 0, otherwise   (15)

and

∂ min(x1, x2, ..., xn)/∂xi = 1, if |xi − min(x1, x2, ..., xn)| ≤ θ; 0, otherwise   (16)
to compute ∂yo(e, w)/∂yj(e, w) and ∂yj(e, w)/∂wji, where θ is a user-defined parameter of training. The intuition behind this modification is that all values close enough to the max (min) value should be assigned credit equally. In our experiments, the use of (15) and (16) with θ = 0.2 instead of (13) and (14) led to better learning of the training set.

3 Experimental Results on Propositional Domains

In order to investigate the learning capabilities of MMKBANN, we will compare its results to those obtained by UCIL2P. But first, aiming at establishing UCIL2P's robustness, we will compare this system with one of the most successful models in the task of propositional theory refinement, namely the RAPTURE framework. The domains used in the comparison are the Promoter domain with 106 examples and the splice-junction recognition problem [8,9], both originating from the Human Genome Project. The algorithm used to train UCIL2P was the Resilient Propagation (RProp) [10], a variation of standard backpropagation with the desirable properties of:
– faster learning;
– robustness in the choice of parameters;
– greater appropriateness to networks with many hidden layers.
The RAPTURE results used in our comparison were obtained from [6]. Results of UCIL2P are averaged over 10 trials. In all domains we trained the network until it reached 100% accuracy in the set of examples or until 100 epochs were completed. The RProp parameters were η+ = 1.2, η− = 0.5, ∆max = 0.25, ∆min = 10^−6, and ∆0 = 0.1. The network topologies were exactly those defined by KB. With Amin = 0.975 and Amax = 0.025, in the Promoter domain the calculated W was 48.85 and the network always reached 100% accuracy in the training set. In the splice-junction problem, the calculated W was 16.65 and the network always reached 100% accuracy in the training sets with 50, 100, and 200 examples. The error function was given by (10), since the cross-entropy error, which is said to be better suited for problems of binary classification [11], performed slightly worse in our preliminary tests. We obtained better results than those reported here with the UCIL2P model when we used initial W values smaller than the calculated ones, and also when we added some new neurons to the initial topology. This shows the deleterious effect of high initial weights in training and indicates that algorithms allowing the dynamic addition of neurons can obtain better results in these domains. As we see in Fig. 3, the results obtained by UCIL2P were almost equal to those from RAPTURE in the Promoter domain, but clearly worse in the splice-junction problem. This is not disappointing, since RAPTURE relies not only on a weight-learning procedure but also on algorithms for the dynamic insertion of neurons and connections. Now we turn our attention to the MMKBANN model. The question we would like to answer is, for what domains is the MMKBANN model well-suited and for what domains is it not? As long as the knowledge that can be encoded into MMKBANN resembles a normal logic program, we can expect good performance when the concept to be learned is expressible in this formalism with a number of clauses that is less than or equal to the number of hidden neurons in the network. Bad performance is expected otherwise.
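To make the MMKBANN node computations concrete, the following Python fragment is a small sketch of our own (not the authors' implementation) of the OR and AND activations of (7)-(9) on the [−1, 1] range, together with the θ-tolerant credit-assignment masks of (15)-(16); all function names and example values are invented for illustration.

THETA = 0.2   # tolerance used with (15)-(16)

def or_node(weights, activations):
    """Equation (7): y = max_i [w_i * y_i + w_i - 1], activations in [-1, 1]."""
    return max(w * y + w - 1 for w, y in zip(weights, activations))

def and_node(weights, activations, negated):
    """Equations (8)-(9): y = min_i [w_i * g(y_i) - w_i + 1],
    where g flips the sign of negative antecedents."""
    g = [-y if neg else y for y, neg in zip(activations, negated)]
    return min(w * gy - w + 1 for w, gy in zip(weights, g))

def max_credit_mask(xs, theta=THETA):
    """Equation (15): credit every argument within theta of the maximum."""
    m = max(xs)
    return [1.0 if abs(x - m) <= theta else 0.0 for x in xs]

def min_credit_mask(xs, theta=THETA):
    """Equation (16): credit every argument within theta of the minimum."""
    m = min(xs)
    return [1.0 if abs(x - m) <= theta else 0.0 for x in xs]

# A clause with antecedents A and (NOT C), both switched on (weights 1):
print(and_node([1.0, 1.0], [0.9, -0.8], negated=[False, True]))   # min(0.9, 0.8) = 0.8
# Setting a weight to 0 removes the antecedent: its term becomes the neutral value 1.
print(and_node([1.0, 0.0], [0.9, -0.8], negated=[False, True]))   # 0.9
print(min_credit_mask([0.9, 0.8]))   # both arguments receive credit, since |0.9 - 0.8| <= 0.2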
........ .......... 20 ........ ........ ............ ........ .......... 40 .......... ........ ............ ........ 60 ........ ........ ............ ........ 90 (a) 05 00 .......... ........ ............ ........ ............ ........ ............ ........ ............ ........ ............ ........ .......... 50 ........ ............ ........ ............ ........ ............ ........ ............ ........ .......... 100 ............ ........ ............ ........ ............ ........ ............ ........ 200 .. .............. ........ ............ ........ ............ ........ 400 (b) Fig. 3. Percentage of error versus number of training examples in (a) the Promoter domain and (b) the splice-junction domain. MMKBANN results are shown in black, UCIL2 P results in white, and RAPTURE results in shaded bars Bad performance is expected otherwise. Aiming at confirming this expectation, we ran experiments in two domains: the already mentioned domain of Promoter and an artificial domain derived from the game of chess, which is a modification of a domain used in [2]. In neither domain have we created connections corresponding to negated literals. While in the artificial domain the target concept to be learned does not require such literals, in the Promoter domain all we have is evidence of this, since no negated literals appear in KB. The results on both models were averaged over 10 trials. For MMKBANN we also used the topology defined by KB and the RProp training method. Results for the Promoter domain, shown in Fig. 3, tend to confirm our expectation. As discussed in [6], this domain requires the evidence-summing property—that is, the property that small pieces of evidence sum up to form significant evidence. While present in UCIL2 P, this property does not exist in MMKBANN, due to the use of the min and max functions.3 We credit the poor performance of MMKBANN to this. In most of the training sets, MMKBANN could not reach 100% accuracy on the examples. The other domain is a 4 × 5 chess board (see Fig. 4), and the problem is to find out the board configurations in which the king placed at position c1 cannot move to the empty position c2. Other pieces that can be on the board are a queen, a rook, a bishop, and a knight, for both players. Each example is generated by randomly placing a subset of these pieces on the other empty board positions (for a detailed description, see [2]). Each board position is represented by six attributes, indicating whether one of the four types of pieces is located at that position, whether the piece is an enemy, and whether the position is empty. The correct definition of the concept comprises 33 definite Horn clauses. Since we did not delete or add clauses to the correct KB, this is the number of hidden neurons of the network. 3 For instance, max(10, 0, 0) = max(10, 9, 9). 100 N.A. Hallack, G. Zaverucha, and V.C. Barbosa a4 b4 a3 b3 c4 d4 e4 c3 d3 e3 c2 d2 e2 c1 d1 e1 EMPTY a2 b2 a1 b1 ........ ... .................. ..... . .. .... ............... Fig. 4. The 4 × 5 chess board used in our experiment We corrupted the correct knowledge in two ways: we added antecedents to existing clauses and we deleted antecedents from clauses. The results obtained by UCIL2 P on these domains were quite disappointing. Despite being able to learn the training sets with 100% accuracy, the network generalized poorly, as we see in Fig. 5, indicating that the network overfitted the training examples. The calculated W was 18.19, with Amin = 0.975 and Amax = 0.025. 
MMKBANN could not learn with 100% accuracy most of the training sets, but its results are better than those from UCIL2P, confirming to some extent our expectation.

Fig. 5. Percentage error versus number of training examples in the chess domain with (a) inserted antecedents and (b) deleted antecedents. MMKBANN results are shown in black, and UCIL2P results in white bars

4 A Hybrid Framework for First-Order Theory Refinement

Having described a model that makes inference and inductive refinement over a propositional theory while maintaining a close relation with symbolic processing, our goal is to extend this model to the case of first-order theory. Sun [21] describes the "discrete neuron," intended to be a description tool hiding unnecessary implementation details. He argues that the discrete neuron can be implemented with normal neurons. Inspired by this neuron, and aiming at handling the problems that arise in the first-order case, we make a similar automaton-theoretic description of a neuron.

Definition 1. Let Σ be an alphabet (for instance, Σ = {a, b, . . . , z}) and n a positive integer. Then U(Σ, n) = { (x, y) | x ∈ Σ^n, y ∈ IR }.

For example, if Σ = {a, b, . . . , z}, then (a, 1) ∈ U(Σ, 1) and (b, c, 0.3) ∈ U(Σ, 2).4 We represent the ground facts P(a) and P(b), relative to some unary predicate P, by associating a set {(a, 1), (b, 1)} ⊂ U(Σ, 1) with predicate P. In general, the real number r in a tuple (a1, . . . , at, r) ∈ U(Σ, t) represents our confidence about the truth of P(a1, . . . , at). We choose to represent this confidence value in [−1, 1].

Definition 2. A node j is a 5-tuple ⟨Σ, InputArity, arity, I, y⟩, where:
– Σ is an alphabet;
– InputArity ∈ IN^n for n ≥ 0;
– arity ∈ IN;
– I is a vector of components I1, . . . , In such that Ik ⊆ U(Σ, InputArityk) for 0 ≤ k ≤ n;
– y : U(Σ, InputArity1) × . . . × U(Σ, InputArityn) → U(Σ, arity).

We call this element a node instead of a neuron only to emphasize that we are not worried about its implementation with standard neurons. I denotes the set of incoming connections to the node and y its output. Our model for first-order theory refinement, called MMCILP,5 is based on this node. Unlike the MMKBANN model, where each node (predicate) stands at a specific layer in a feedforward network, and similarly to CIL2P, which uses a Jordan-like architecture, we have a three-layer network with a one-to-one mapping between predicate symbols and nodes in the input and output layers. The input-layer nodes compute identity, the hidden-layer nodes (one for each encoded clause) compute the fuzzy AND, and the output-layer nodes compute the fuzzy OR. There are recurrent connections between the corresponding predicate symbols in input and output layers. The justification for this three-layer architecture with recurrent connections, instead of the feedforward model of MMKBANN, is that there are concepts in the language of first-order logic whose definition is recurrent. While in propositional logic there is no sense in a clause with the same propositional symbol appearing in both its consequent and one of its antecedents (for instance A ∧ B → A), such is not true in the first-order case. For instance, one definition for the concept descendant is given by the two-clause logic program

4 We do, for simplicity, use a set's elements to represent the set. So b, c is used in place of {b, c}.
5 MMCILP stands for MinMax Connectionist Inductive Logic Programming.

c1: descendant(A,B) :- parent(B,A).
c2: descendant(A,C) :- descendant(A,B), parent(C,B).

where descendant(A,B) means that A descends from B and parent(A,B) means that A is B's parent. There are three types of connection in MMCILP, excluding those that represent recurrence:
– OR connections, from hidden to output nodes;
– AND+ connections, from input to hidden nodes, the former standing for positive antecedents;
– AND− connections, from input to hidden nodes, the former standing for negative antecedents.
Each connection has an associated weight w and can receive complex messages to be conveyed, like M = {(a, 1), (b, 0.5)}. In this case, the outgoing message is w(M) = {(a, f(1, w)), (b, f(0.5, w))}, where the f function is given by

f(x, w) = w x + w − 1, for OR connections; w x − w + 1, for AND+ connections; 1 − w x − w, for AND− connections.   (17)

As an example, consider the definite Horn clause program shown in Fig. 6, incorrectly defining the concept grandfather (this example was taken from [3]), with two training examples. In Fig. 7, we can see the corresponding "network." Consider the alphabet Σ = {walter, carol, lee, jim} for all nodes.6 The recurrent connections from output-layer nodes to input-layer nodes are represented by making, for each predicate N, yNi := yNo, where Ni and No denote the node corresponding to predicate N at the input and output layers, respectively. The hidden-layer nodes corresponding to ground facts will have fixed output7—for instance, yc5 = {(walter, carol, 1)}. Nodes corresponding to clauses with antecedents are more complex—for instance, the hidden-layer node c4 has the vector (2, 1) as InputArityc4, that is, the arity of the inputs I1 = w1(ymother) and I2 = w2(yfemale), where w1 denotes the weight of the AND+ connection from the input-layer node mother to c4, and w2 denotes the weight of the AND+ connection from the input-layer node female to c4. The hidden-layer node c4 has arityc4 = 2 and y function given by y(I) = { (A, B, r) | ∃A, B, r1, r2 : (A, B, r1) ∈ I1 ∧ (B, r2) ∈ I2 ∧ r = min(r1, r2) }. Output-layer nodes are simpler—for instance, parento has inputs I1 = w3(yc3) and I2 = w4(yc4), where w3 and w4 are the weights of the incoming OR connections to parento from c3 and c4, respectively. Its y function is defined as the union of its inputs, choosing the tuple of maximum confidence value when the same tuple, disregarding the confidence value, appears in two or more inputs.

6 Notice that the predicate names are not in the alphabet since they are not necessary in the operation of the network. They appear only in the translation of KB into the network.
7 The weights that correspond to facts are fixed, since they are supposed to be true.

c1: grandfather(A,B) :- father(A,C), parent(C,B).
c2: grandfather(A,B) :- father(A,B).
c3: parent(A,B) :- father(A,B).
c4: parent(A,B) :- mother(A,B), female(B).
c5: father(walter,carol).
c6: father(lee,jim).
c7: mother(carol,jim).
c8: male(walter).
c9: male(lee).
c10: male(jim).
c11: female(carol).
positive: grandfather(walter,jim).
negative: grandfather(lee,jim).

Fig. 6. An incorrect description of the concept grandfather and two examples
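As a purely illustrative sketch (our own, with invented helper names) of how such nodes operate on messages, the fragment below applies the connection function f of (17) and evaluates a hidden-layer AND node like c4 and an output-layer OR node like parent on the facts of Fig. 6, with all weights equal to 1.

def f(x, w, kind):
    """The connection function of (17)."""
    if kind == "OR":
        return w * x + w - 1
    if kind == "AND+":
        return w * x - w + 1
    return 1 - w * x - w          # AND- connections

def convey(message, w, kind):
    """A connection applies f to the confidence of every tuple in a message."""
    return {t[:-1] + (f(t[-1], w, kind),) for t in message}

def c4_node(mother_msg, female_msg):
    """Hidden node for clause c4: parent(A,B) :- mother(A,B), female(B).
    Tuples are joined on B and the smaller confidence (fuzzy AND) is kept."""
    return {(a, b, min(r1, r2))
            for (a, b, r1) in mother_msg
            for (b2, r2) in female_msg if b2 == b}

def or_node(*messages):
    """Output node: union of the inputs, keeping the largest confidence per tuple."""
    best = {}
    for msg in messages:
        for *args, r in msg:
            best[tuple(args)] = max(best.get(tuple(args), float("-inf")), r)
    return {key + (r,) for key, r in best.items()}

mother = convey({("carol", "jim", 1)}, 1.0, "AND+")
female = convey({("carol", 1)}, 1.0, "AND+")
parent_from_c4 = c4_node(mother, female)      # empty: female(jim) is not among the facts
parent_from_c3 = {("walter", "carol", 1), ("lee", "jim", 1)}   # clause c3 over the father facts
print(or_node(convey(parent_from_c3, 1.0, "OR"), convey(parent_from_c4, 1.0, "OR")))

With the incorrect theory of Fig. 6, parent(carol, jim) is therefore never derived, which is one reason the positive example grandfather(walter, jim) is left uncovered; this is the kind of error that the learning scheme of Sect. 5 corrects by adjusting connection weights.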
Fig. 7. The network obtained from the grandfather program. All connections have weight 1. Recurrent connections have been omitted. Node grandf represents concept grandfather
Given P, a function-free acceptable logic program8 (i.e., a subset of normal logic programs), it can be shown that the corresponding network obtained from P will compute the logical operator TP, that is, the stable model of P [22].

8 This restriction is to guarantee that the stable model of P is finite. It occurs also in ILP [23], where h-bounded and flattening techniques are applied to programs with function symbols.

5 Learning

Our learning scheme here is the same as for MMKBANN. Given a collection E of examples about some predicates whose definition we would like to learn, we minimize the mean square error function mse(w) given by

mse(w) = (1/|E|) Σ_{e ∈ E} [ y(e, w) − ŷ(e) ]²   (18)

where ŷ(e) is the desired confidence value of the example e and y(e, w) is the confidence value obtained for this example in the corresponding output-layer node. Values of y(e, w) can be obtained by computing TP. Since knowledge about the program is encoded not only in the connections between nodes but also in their activation functions, it is necessary to look at these functions when errors are backpropagated during training. For example, while backpropagating the error associated with the uncovered positive example grandfather(walter,jim) from the output predicate grandfather, when this error arrives at node c1 it is necessary to find the tuples of the antecedents father and parent that originated the tuple (walter, jim) at c1's output. Suppose these tuples are (walter, carol) for father and (carol, jim) for parent. Then these tuples receive an associated error and the backpropagation process starts again at the output-layer nodes father and parent. Learning in this way and guided by the two examples, our system was able to correct the given definition of concept grandfather. The corrections were a decrease in the weight connecting c2 to grandfather and a decrease in the weight connecting female to c4. Now recall that in the MMKBANN model we assumed that nodes in consecutive layers were fully interconnected, aiming at the addition of new antecedents and clauses. A similar assumption in MMCILP would be:
1. A node in the hidden layer is connected to all nodes in the output layer with the same arity. If this connection is stated in KB its weight is 1, otherwise it is 0;
2. A node in the hidden layer has all possible antecedents in its activation function. If the antecedent is stated in KB its associated weight is 1, otherwise it is 0.
Whereas 1 seems reasonable, 2 is unrealistic because the number of possible antecedents grows exponentially with the arity of predicates in the input layer. One possible solution, which is currently being studied, is to keep for each clause a pool of candidate antecedents and let the weight-learning procedure choose the best ones. Entropy-gain heuristics, like those encountered in [13,24], will select the pool members.

6 Conclusions and Future Work

The framework we have described can be regarded as a starting point in the direction of a connectionist-symbolic first-order refinement system. We have used
More experiments are needed to clarify whether this holds. As we evaluate our framework, some important issues will have to be considered. First is the high computational cost due to the use of subgradient descent methods, which typically require hundreds of iterations. The use of more clever ways to compute TP in each of the iterations instead of the naı̈ve way we have described here can soften the demand for computational power, but surely some considerable fraction of such a demand will still remain. There is also the combinatorial explosion in the search space for possible clauses, but this is inherent to all first-order theory refinement systems. The adoption of syntactical restrictions in the allowed clauses, as those that usually appear in ILP [23], must be considered. The use of min/max functions instead of the normal setting encountered at KBANN and CIL2 P, with weights and thresholds, ensures the AND/OR structure, but also makes the learning of the training set more difficult, as indicated by the MMKBANN results. The use of techniques such as feature selection [13] and the dynamic addition of nodes [2] can decrease this problem in both the propositional and the first-order cases. Acknowledgements The authors are partially supported by CNPq. This work is part of the ICOM project, also funded by CNPq/ProTeM-CC. References 1. Botta, M., Giordana, A., Piola, R.: FONN: Combining first order logic with connectionist learning. in Proceedings of the International Conference on Machine Learning-97. (1997) 48–56 2. Optiz, D.W., Shavlik, J.W.: Dynamically adding symbolically meaningful nodes to knowledge-based neural networks. Knowledge-Based Systems. 8 (1995) 301–311 3. Wogulis, J.: Revising relational domain theories. in Proceedings of the Eighth International Workshop on Machine Learning. (1991) 462–466 4. Towell, G., Shavlik, J.W.: Extracting refined rules from knowledge-based neural networks. Machine Learning. 13 (1993) 71–101 5. Jordan, M.I.: Attractor dynamics and parallelism in a connectionist sequential machine. in Proceedings of the Eighth Annual Conference of the Cognitive Science Society. (1986) 531–546 6. Mahoney, J.J.: Combining symbolic and connectionist learning methods to refine certainty-factor rule-bases. Ph.D. Thesis. University of Texas at Austin. (1996) 7. Mahoney, J.J., Mooney, R.J.: Combining connectionist and symbolic learning methods to refine certainty-factor rule-bases. Connection Science. 5 (special issue on architectures for integrating neural and symbolic processing) (1993) 339–364 8. Towell, G., Shavlik, J.: Knowledge-based artificial neural networks. Artificial Intelligence. 69 (1994) 119–165 106 N.A. Hallack, G. Zaverucha, and V.C. Barbosa 9. Towell, G.: Symbolic knowledge and neural networks: Insertion, refinement and extraction. Ph.D. Thesis. Computer Science Department, University of Wisconsin, Madison. (1992) 10. Riedmiller, M., Braun, H.: A direct adaptive method for faster backpropagation learning: The RProp algorithm. in Proceedings of the International Conference on Neural Networks. (1993) 586–591 11. Rummelhart, D.E., Durbin, R., Golden, R., Chauvin, Y.: Backpropagation: The basic theory. In: Backpropagation: Theory, Architectures, and Applications. Rummelhart, D.E., Chauvin, Y. (eds.) Hillsdale NJ: Lawrence Erlbaum Associates. (1995) 1–34 12. Machado, R.J., Rocha, A.F.: The combinatorial neural network: A connectionist model for knowledge-based systems. In: Uncertainty in Knowledge Bases. Bouchon, B., Zadeh, L., Yager, R. (eds.) 
Berlin Germany: Springer-Verlag. (1991) 578–587 13. Setiono, R., Liu, H.: Improving backpropagation learning with feature selection. Applied Intelligence. 6, no.2 (1996) 129–140 14. Machado, R.J., Barbosa, V.C., Neves, P.A.: Learning in the combinatorial neural model. IEEE Transactions on Neural Networks. 9, no.5 (1998) 831–847 15. Garcez, A., Zaverucha, G., Carvalho L.A.: Logic programming and inductive learning in artificial neural networks. Workshop on Knowledge Representation in Neural Networks (KI’96). Budapest. (1996) 9–18 16. Garcez, A., Zaverucha, G.: The connectionist inductive learning and logic programming system. Applied Intelligence Journal (special issue on neural networks and structured knowledge: representation and reasoning). 11, no.1 (1999) 59–77 17. Pinkas, G.: Logical inference in symmetric connectionist networks. Doctoral thesis. Sever Institute of Technology, Washington University. (1992) 18. Shastri, L., Ajjanagadde, V.: From simple associations to systematic reasoning. Behavioral and Brain Sciences. 16 no. 3 (1993) 417–494 19. Holldobler, S.: Automated inferencing and connectionist models. Postdoctoral thesis. Intellektik, Informatik, TH Darmstadt. (1993) 20. Kalinke, Y.: Using connectionist term representations for first-order deduction— a critical view. CADE-14, Workshop on Connectionist Systems for Knowledge Representation and Deduction. Townsville, Australia. (1997) http://pikas.inf.tudresden.de/˜yve/publ.html 21. Sun, R.: Robust reasoning: integrating rule-based and similarity-based reasoning. Artificial Intelligence. 75 (1995) 241–295 22. Gelfond, M., Lifschitz, V.: Classical negation in logic programs and disjunctive databases. New Generation Computing. 9 (1991) 365–385 23. Lavrac, N., Dzeroski, S.: Inductive Logic Programming: techniques and applications. Ellis Horwood series in Artificial Intelligence. 44 (1994) 24. Quinlan, J.R.: Learning logical definitions from relations. Machine Learning. 5 (1990) 239–266 25. Menezes, R., Zaverucha, G., Barbosa, V.C.: A penalty-function approach to rule extraction from knowledge-based neural networks. International Conference on Neural Information Processing (ICONIP98), Kitakyushu, Japan. (1998) 1497–1500 Dynamical Recurrent Networks for Sequential Data Processing Stefan C. Kremer1 and John F. Kolen2 1 Guelph Natural Computation Group, Dept. of Computing and Information Science, University of Guelph, Guelph, ON, N1G 4E1, CANADA skremer@uoguelph.ca 2 Dept. of Computer Science & Institute for Human and Machine Cognition, University of West Florida, Pensacola, FL 32514, USA jkolen@typhoon.coginst.uwf.edu Abstract. All symbol processing tasks can be viewed as instances of symbol-to-symbol transduction (SST). SST generalizes many familiar symbolic problem classes including language identification and sequence generation. One method of performing SST is via dynamical recurrent networks employed as symbol-to-symbol transducers. We construct these transducers by adding symbol-to-vector preprocessing and vector-to-symbol postprocessing to the vector-to-vector mapping provided by neural networks. This chapter surveys the capabilities and limitations of these mechanisms from both top-down (task dependent) and bottom up (implementation dependent) forces. 1 The Problem This chapter focuses on dynamical recurrent network solutions to symbol processing problems. All symbol processing tasks can be shown to be special cases of symbol-to-symbol transduction (SST). 
In SST, the transduction mechanism maps streams of input symbols selected from an input alphabet, Σ, to streams of output symbols from an output alphabet, Γ . A transducer can compute functions from a language of strings selected from Σ ∗ to another language of strings selected from Γ ∗ f : Σ∗ → Γ ∗. (1) One implementation of this mapping is a multi-tape Turing machine (TM) with these properties: there are three tapes, a read-only input tape, a working tape, and a write-only output tape. The TM must always advance the head on the input tape with every transition. Once the head passes the end of the given input string, it reads blanks. In addition to moving its input head, the TM always advances the output tape head one cell each time it writes a symbol to the tape. Even under these constraints, the transducer is capable of computing S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 107–122, 2000. c Springer-Verlag Berlin Heidelberg 2000 108 S.C. Kremer and J.F. Kolen any computable function. We call this the transduction task. There are a number of interesting, special-case scenarios that can be placed on the string-to-string transduction problem. 1.1 Output Symbol Generation Scenarios An important question regarding transducers is “What is the relationship between input and output string symbols?” All SSTs compute output symbols, Γ ∗ , from input symbols, Σ ∗ , and internal state, S. Under this assumption, we consider the following scenarios: One-to-one. One output symbol is generated for each input symbol: S × Σ → S × Γ. n-to-one. An output symbol after every nth input (one-to-one is a special case of this for n = 1): S × Σ n → S × Γ . Literal event. Output are generated on a one-to-one basis to a special “generate output” symbol, γ ∈ Σ presented on the input tape: S × (Σ)∗ γ → S × Γ. Internal selection. This is the most general case where the output symbols are generated according to some arbitrary rule mapping strings of input symbols to strings of output symbols. This rule is not required to produce output symbols at a particular ratio to input symbols, nor is it required to produce output symbols relative to a particular trigger symbol: “generate output”[7]. These two types of rules, however, are not precluded, and therefore internal selection represents an encompassing completely general case: Σ ∗ → Γ ∗ . While some would argue that these scenarios are fundamentally different, each scenario be cast in terms of the others. For example, one-to-one is obviously a special case of n-to-one which in turn is a special case of internal selection. A literal event transducer can be constructed from a one-to-one mechanism by outputing a “don’t care” symbol which can be interpreted as an empty string (ǫ). Since the transducers can compute recursively enumerable sets, they are closed under general sequential machine (GSM) mappings [14]. GSM mappings are performed by finite state transducers with the ability to emit ǫs. The scenarios above assume that sequences of inputs produce sequences of outputs. These sequences, however, may be of length one in a degenerate case where either the input tape or the output tape is limited to a single symbol. In the former case we are dealing with a sequence generation task where a sequence of output symbols is generated in response to a single input signal: f : Σ → Γ ∗ . 
In the latter case we have a classification task where a stream of input symbols is classified into a finite number of categories represented by the single output symbols: f : Σ ∗ → Γ . If the output alphabet is further limited to the symbols accept and reject, then the function accepts strings from a specific language. This accept scenario links formal languages to their formal computational mechnisms. For most tasks, the transducer is constructed by hand from a behavioral specification. Constructing transducers is difficult from full specifications, yet we find ourselves in situations where the specification is incomplete and defined only in terms of examples. Construction under these constraints is an induction Dynamical Recurrent Networks for Sequential Data Processing 109 problem similar to the classic grammar induction problem [12] and is usually the main justification of resorting to neural network solutions. 1.2 Application/Motivation SST is easily associated with human language processing tasks such as phonetic to acoustic mapping, text to phoneme mapping, grammar induction, and translation. However, many other problems can be cast into this paradigm as well. This was perhaps best exploited by King Sun Fu, one of the foremost researchers in the applications of grammatical induction. He used the term “syntactic pattern recognition” to describe this field [9]. Fu used grammars to describe many different things, only some of which would be considered languages in the lay sense. Specifically, symbol stream transduction has been applied to: modeling natural language learning, process control, signal processing, phonetic to acoustic mapping, speech generation, robot navigation, nonlinear temporal prediction, system identification, learning complex behaviors, motion planning, prediction of time ordered medical parameters, and speech recognition, to name but a few. In fact, grammatical induction can be used to induce anything that can be represented by some language. In this sense, it truly represents the archetypical form of induction. 2 Generalized Architecture We are interested in connectionist approaches to solving SST problems such as those described above. Connectionist mechanisms, however, are vector-to-vector transducers (VVT). They take a finite-length input vector and produce a finitelength output vector. Connectionist mechanisms, as concisely pointed out in [6], can not be directly applied to SST. Two additional processes must be added: a symbol-to-vector encoding which takes symbols from the input tape and converts them into vectors for presentation to the networks, and a converse mechanism to perform vector-to-symbol decoding. A hybrid, connectionist-symbolic architecture for SST is illustrated in Figure 1. We describe below a coherent picture of the emergence of computationally complex behavior of connectionist networks employed as SSTs. Three different sources cooperate in this emergence: input modulation, internal dynamics of the connectionist network, and observation [17]. Input channel effects generalize the effects that input changes have on the dynamics of the network. In essence, the input symbol indexes a particular internal dynamic. Internal dynamics refers to the laws of motion that guide the trajectory of the system. These equations of motions are taken as the definition of behavior of dynamic recurrent networks. Finally, observation is the transduction of the output vector to the output symbol. 
The observation mechanism partitions the output vector space into finite sets labeled with the output symbols. These three sources collaborate to produce the emergence of computation in the SST.

Fig. 1. Symbol-to-Symbol Transducer implemented with a recurrent connectionist network: a symbol sequence is encoded into vectors, processed by the recurrent network, and discretized back into an output symbol sequence.

2.1 Input Modulation

The symbol-to-vector encoding is most often implemented via table lookup. The symbol ai is mapped to the vector x(i), for example. Often a one-hot, or 1-in-n, encoding is used for the vectors, such that the vector x(i) has components

x(i) = (x1(i), x2(i), x3(i), . . . , xn(i)),   (2)

where the jth component, xj(i), is defined as

xj(i) = 1 if i = j, and 0 otherwise.   (3)

A variation on this approach involves the presentation of several consecutive input symbols at a time. Under this approach each symbol is mapped to a vector. An extended vector, composed by concatenating the representations of the individual symbols, is created for presentation to the network. This approach was pioneered in the NetTalk project [36]. Under these approaches, the encoder takes a finite set of symbols, or strings of symbols, and maps them to a finite set of vectors. The same symbol will always encode to the same vector. From the perspective of the recurrent network, the vector is a constant whenever the same input is presented. This constant vector reduces the internal dynamics of the network to a state-to-state mapping [18]. In other words, the input symbol selects one mapping from the finite set of all state-to-state mappings. For example, consider f(x, y) = x + y. If symbol a is mapped to 0 and symbol b is mapped to 1, then we get two indexed functions fa(y) = y and fb(y) = 1 + y. The input symbol sequence selects a function that is the composition of the fa and fb functions. The result of this selection on state spaces is similar to iterated function systems [17]. Other approaches include varying the mapping from symbol to vector over time. For instance, the encoding vector could be scaled by αᵗ, where α is some constant between 0 and 1 and t is the position of the symbol in the input stream. One could interpret this as a bias toward the initial symbols of the sequence [27].

2.2 The Internal Dynamics

While conventional feedforward networks with multiple input symbols presented at one time, as in [36], can compute SSTs, the computational power of such systems is limited. For example, such a system could not compute the parity of arbitrary-length strings. For this reason, we consider networks whose internals are not static. In dynamical systems, this change is governed by an internal dynamic. Many scientists expect to find the roots of complex behavior within the internal dynamics. It is a very enticing view, as it is the internal processing, or dynamics, that we have the most control over when we build models of cognitive processes. Committing to this approach frees researchers to focus their energies on issues of representation, organization, and communication within the context of the dynamic. Many formal computing machines offer explicit control mechanisms. Identifying aspects of their design crucial to the generation of computationally complex behavior is fairly straightforward.
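Before turning to formal machines, the input-indexed view of the internal dynamics can be made concrete with a small sketch built around the chapter's own toy map f(x, y) = x + y (the code itself is hypothetical and only illustrative):

# Each input symbol selects one state-to-state map; an input string selects
# the composition of the corresponding maps, as in an iterated function system.
def f(x, y):
    return x + y

SYMBOL_TO_VALUE = {"a": 0, "b": 1}   # table-lookup encoding of the two symbols

def indexed_map(symbol):
    # Freeze the input argument: "a" yields f_a(y) = y, "b" yields f_b(y) = 1 + y.
    x = SYMBOL_TO_VALUE[symbol]
    return lambda y: f(x, y)

def run(symbols, state=0):
    for s in symbols:
        state = indexed_map(s)(state)
    return state

print(run("abba"))   # composes f_a, f_b, f_b, f_a starting from state 0 -> 2

Formal computing machines make this kind of input-driven control explicit.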
Finite state automata, push-down automata, linear-bounded automata, and Turing machines all share finite state control but exploit different storage mechanisms which, in turn, produce differences in their behaviors. Alternatively, the same automata can come from a Turing machine by assuming restrictions on the control program: FSA by assuming unidirectional head movement, PDA by assuming that heads move normally in one direction, but always write a blank on moving in the other direction, nd LBA by assuming that they never move past the ends of the input string. The same types of restrictions can be imposed on grammars to convert more general grammatical frameworks to ones with limited representational powers. 112 S.C. Kremer and J.F. Kolen 2.3 The Act of Observation We have outlined the effects of internal dynamics and input modulation on the computational abilities of systems. The vector-to-symbol encoding is also important as well. Essentially, the output vector space is discretized into regions that map to a discrete symbol. One possible mapping function is nearest neighbor encoding (a prototype vector). A list of symbol/vector pairs is given (often based on a one-hot encoding scheme) and the nearest vector to the output vector is selected and its corresponding symbol is produced as output. This section will address the effects of observation on those the vector-to-symbol encoding. Kolen and Pollack [20] examined the vector-to-symbol encoding, or observation, effects on the induction of grammatical competence models of physical systems. The centerpiece of this work was a variable speed rotator that had interpretations both as a context-free generator and as a context-sensitive generator. Believing that physical systems possess an intrinsic computational model leads one to the Observers’ Paradox: the resulting interpretation depends upon measurement granularity. Because the variable speed rotator has two disjoint interpretations, computational complexity classes emerge from the interaction of system state dynamics and observer-established measurement. The complexity class of a system is an aspect of the property commonly referred to as computational complexity, a property we define as the union of observed complexity classes over all observation methods. While this definition undermines the traditional notion of system complexity, namely that systems have unique well-defined computational complexities, it accounts for our intuitions regarding the limitations of identifying computation in physical systems. This argument shows how observing the output of a system, such as a recurrent network, can lead to multiple hypotheses of internal complexities. The flip side of this statement implies that observed complexity can be tuned via selection of an observation mechanism. 2.4 Synthesis We described, above, the three sources of emergent computation in recurrent network SST. These sources include a sensitivity to input in constructing virtual state transformations, the ability to constrain internal dynamics capable of supporting complex behavior, and finally the fact that output is an observation of the internal states of the network. These three sources collaborate to provide proper conditions for the emergence of computation within the context of recurrent neural networks. The preceding arguments show conclusively that these three conditions are sufficient for emergence, but are they necessary? 
Clearly, input dependency is important because most of the time we identify computational systems as transducers of information. Even if a computational system ignored its input, as in the case of an enumerative generator, it could still be portrayed as utilizing input. Likewise, internal dynamics cannot be excluded either: it provides the underlying mechanism for system change. Finally, the role of observation is inseparable from the notion of emergent computation in that the system must be observed doing something. Emergent computation arises from the interaction between the environment, the system, and the observer. Thus, a computational model unaccompanied by its input encodings and observation mechanisms is useless.

3 Specific Architectures

In the previous section, we identified the three necessary and sufficient conditions for computationally complex behavior in recurrent-network-based SST. We now focus on the internal dynamics and consider three broad classes of dynamical recurrent network architectures: feedforward networks, output feedback networks, and hidden feedback networks. These are illustrated in Figure 2. In the subsections below, we consider each in turn.

3.1 Feedforward Networks

One of the simplest methods for dealing with input data that varies over time as well as space is to use not only the current input symbol to drive the network, but also previous input symbols (Figure 2(a)). In these types of systems, temporal delays can be introduced to provide the processing units in the network with a history of input values. The simplest approach is the Window in Time (WIT) network made famous by the NetTalk system of Sejnowski and Rosenberg [36]. In this architecture, the current and previous input vectors are presented as inputs to a (typically) one-hidden-layer feedforward network. The WIT approach constrains the machines which can be induced by using a fixed memory, in the sense that the memory is always formed by the vector concatenation of the current input symbol and the previous n − 1 input symbols. Here, n represents the size of the temporal window on previous inputs. This form of memory is very restrictive in the sense that the network itself cannot influence what is or is not stored in memory. A number of variations on the simple WIT approach have been suggested. The most notable of these is Waibel et al.'s Time-Delay Neural Networks (TDNNs) [39]. A general review of other time-delay networks can also be found in [3].

3.2 Output Feedback Networks

A second approach to dealing with sequential data is to use not only a finite history of previous input values to determine the current output, but also a finite history of previous output values (Figure 2(b)). Because these networks feed back their outputs, they bear a certain similarity to infinite impulse response (IIR) filters in communications theory [28, 29]. Some [26] have chosen to describe this class of networks as Nonlinear Auto-Regressive with eXogenous inputs (NARX) networks because of their similarity to NARX models in dynamical systems theory.

Fig. 2. Three classes of dynamical recurrent network architectures: (a) feedforward networks, (b) output feedback networks (output activations copied to input units), (c) hidden feedback networks (hidden activations copied to input units).
The NARX approach also uses a fixed memory, since the memory is always formed by the vector concatenation of the current input symbol and the previous n − 1 input symbols, plus the previous m output symbols. Here, m and n represent the sizes of the temporal windows on previous outputs and inputs, respectively. This form of memory is also restrictive in the sense that the weights of the network cannot be adjusted to influence what is stored in memory. Clearly, the WIT networks described above are a degenerate case of the output feedback networks where m = 0.

3.3 Hidden Feedback Networks

The final alternative for dealing with sequences is to use a memory system which does not explicitly store previous inputs or outputs. In such a system, the memory or state is invisible to an outside observer. This is not unlike the operation of a Hidden Markov Model (HMM). The advantage of using a memory that does not merely store previous inputs and outputs is that it can compute salient properties of an input pattern that may not be explicitly represented in a finite history of input or output symbols. In these networks, memory is computed in a set of hidden units (Figure 2(c)). Another analogy can be drawn between hidden feedback networks and Mealy and Moore machines [14]. These two computational models are described in terms of output and next-state mappings. The activation values of these hidden units are computed based on the activations of the input units and a layer of special units called context units. By convention (see [5]), the activations of the context units are initially set to 0.5 (or 0.0), and subsequently set to the activation values of the hidden units at the previous time step (the number of context and hidden units must be equal). Specifically, the hidden unit activation values are copied to the context units at each time step. Thus, at any given time, the hidden unit activations represent the current state, while the context unit activations represent the previous state. A number of variations on this approach have been explored. The three most significant involve: (1) using a set of first-order connections between the input and hidden units in a massively parallel connection scheme, (2) using a set of second-order connections between the input and hidden units in a massively parallel connection scheme, and (3) using a set of first-order connections between the input and hidden units in a one-to-one connection scheme. The first of these has been studied extensively in [5, 4] and represents the most straightforward approach to implementing a general hidden feedback network. The second-order connection scheme lends itself to encoding labelled state transitions, since each second-order connection combines one context unit (or previous state) and one input unit and transmits a signal to one hidden unit (or next state) [11, 32]. The one-to-one connection scheme represents a simplified method for implementing memory [8]. In this system, memory is represented locally in the sense that a given hidden unit is used to compute its own future value, but not the future activation values of any other units. This offers significant computational advantages when calculating a gradient with respect to error.

4 Training Methods

The primary method for adapting recurrent networks to perform specific desired computations involves computing a gradient in error space with respect to the weights of the network.
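Before looking at how such gradients are computed, it helps to have the hidden feedback (Elman-style) state update of Section 3.3 written out explicitly. The sketch below is hypothetical: the layer sizes and weight names are invented, and only the copy of hidden activations into context units and the 0.5 initialization follow the description above.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ElmanStep:
    # One time step of a hidden feedback network: context units hold the previous
    # hidden activations; hidden units combine the current input with the context.
    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))
        self.W_ch = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
        self.W_hy = rng.normal(scale=0.1, size=(n_out, n_hidden))
        self.context = np.full(n_hidden, 0.5)   # conventional initialization

    def __call__(self, x):
        hidden = sigmoid(self.W_xh @ x + self.W_ch @ self.context)
        self.context = hidden                   # copy hidden -> context for the next step
        return sigmoid(self.W_hy @ hidden)

net = ElmanStep(n_in=2, n_hidden=3, n_out=1)
for x in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 0.0])]:
    print(net(x))

It is weights such as W_xh, W_ch, and W_hy in this update that the algorithms below adjust.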
By adjusting the weights in the direction of decreasing error, the network can be adapted to perform as desired. While the standard backpropagation rule [34] can be used to adapt the weights of a feedforward network, like those used in feedforward networks with delays, a different algorithm must be used to compute gradients in networks with recurrent or feedback connections. Three main gradient descent algorithms for networks with recurrent connections have been proposed. All three compute an approximation to the error gradient in weight space, but they differ in how this computation is performed.

4.1 Backpropagation Through Time

The most intuitive approach is to temporally unfold the network and backpropagate the error through time (BPTT). That is, treat the hidden and input units at time t as two new virtual layers, the hidden and input units at time t − 1 as two more virtual layers, and so on back to the very first time step. Since there is really only one set of input and hidden units, and thus only one set of weights originating from them, weight values between the virtual layers must be shared. Weight changes to shared weights can be computed by using the standard backpropagation algorithm to calculate changes as if the shared weights were individual, then summing the changes for all virtual incarnations of the same weight, and applying the summed change to the actual weight. This approach gives a very intuitive way of implementing a gradient descent algorithm in a network with recurrent connections. The time complexity of the BPTT algorithm is Θ(n² · t), where n represents the number of hidden units in the network and t represents the length of the sequence. Analysis of the space complexity of this weight adaptation algorithm reveals that the memory requirement grows in proportion to the length of an input sequence (Θ(n · t)). This makes the approach generally unsuitable for learning tasks involving very long sequences. One solution to this problem is to truncate the gradient calculation and unravel the network for only a finite number of steps. In this situation, the space complexity is proportional to the selected number of steps of unraveling. In the most extreme case, the network is only unraveled for one step [5, 32]. Empirical evidence has shown that this truncation can make certain kinds of problems unlearnable [2, 22].

4.2 Real-Time Recurrent Learning

An alternative to BPTT which overcomes the limitation of requiring memory proportional to string length, without truncating the gradient, has been proposed in [40]. Rather than use virtual layers, this approach describes the propagation problem in terms of recursive equations. The solution of these equations results in the Real-Time Recurrent Learning (RTRL) method. This approach computes the gradient by storing additional information about the interactions between processing units. The computation of these interactions requires additional time, but allows the memory required to remain constant regardless of sequence length. Specifically, the time and space complexity is Θ(n³). While the time and space requirements of this method are bounded by the architecture, it may be more efficient to use BPTT if the sequence length is not much larger than the number of hidden units.

4.3 Schmidhuber's Algorithm

The tradeoff between time and space offered by BPTT and RTRL can be avoided using a clever algorithm which effectively combines the two approaches.
Developed by Schmidhuber [35], the algorithm operates in two modes of operation: one corresponding to BPTT and another corresponding to RTRL. By performing only a finite number of steps in BPTT mode, the memory required for this stage of the algorithm remains finite, yet the computational advantages are realized. This algorithm also operates in Θ(n3 ) time complexity. 4.4 Teacher Forcing In networks where output is fed back, the user is not forced to chose between regular backpropagation and recurrent gradient descent algorithms. The activation values of the output units can be treated as trainable values which can be adapted using a recurrent gradient descent algorithm that adapts the weights of the network. It is also possible, however, to use regular backpropagation if the desired output values are known for all points in the sequence. Using this approach, the activation values of the output units which are fed back into the network are not the actual values computed by the output units, but rather the values which these units should assume once training is complete (i.e. the target values). Obviously, this type of approach can only be used in a system where output values are known throughout the training sequence. Otherwise, the output values mid-sequence are essentially hidden and thus must be trained using a recurrent gradient descent algorithm. 5 General Learning Limitations In the previous section, we presented brief summaries several network architectures and their training algorithms1 . It is easy to think that these techniques are universal and will work under any circumstance. 1 For a more complete review, the reader is referred to [21] 118 S.C. Kremer and J.F. Kolen Before tackling a new problem, however, it is important to explore the a priori restrictions that exists due to computational characteristics of the problem and not the mechanism solving the problem. It may be the case that the problem in question is not solvable with the given resources (time or space). Much work has been done on the problem of learning SSTs in the form of automata and grammars. The simplest learning problem for SSTs is language decidability. An input sequence is transformed into a single symbol, either accept or reject. Gold [12] examined the resources necessary for solving this problem under a variety of circumstances. He identified two basic methods of presenting string to a language learner: “text” and “informant”. A text was defined as a sequence of legal strings containing every string of the language at least once. Typically, texts are presented one symbol after another, one string after another. Since most interesting languages have an infinite number of strings, the process of string presentation never terminates. By contrast, Gold defined an informant as a device which can tell the learner whether any particular string is in the language. Typically the informant presents one symbol at a time, and upon a string’s termination supplies a grammaticality judgment: grammatical or non-grammatical. Gold coined the term “identifiable in the limit” to answer the question: which classes of languages are learnable with respect to the above to methods of information presentation? A language is identifiable in the limit if there exists a point in time at which no further string presentations alter the categorization of strings, and all categorizations are correct. Gold showed that identifiability in the limit is a high ideal to strive for. 
When he assumed a text information presentation, Gold showed that only finite cardinality languages could be learned. Finite cardinality languages consist of a finite number of legal strings, and are a small subset of the regular sets (the smallest class in the Chomsky hierarchy). In other words, none of the language classes typically studied in language theory are text learnable. Gold went on to show that under informant learning, only two kinds of language are identifiable in the limit: regular sets (which are those languages having only production rules of the form A → wB, where A and B are variables and w is a (possibly empty) string of terminals), and context-free languages (which are those languages having only production rules of the form A → β, where A is a single variable). Other classes of language, like the recursively enumerable languages (those having production rules α → β, where α and β are arbitrary strings of terminals and non-terminals), were shown to be unlearnable.

6 Representational Limitations of Specific Architectures

It is possible to evaluate the representational capacities of specific architectures. This is done by considering what kinds of sequence transduction or classification problems can be solved for some set of weights. By computing the union of these problems over all possible weights, we can evaluate the representational capacity of the network. The representational capacity is important for two reasons. The first, and most obvious, is that if a network cannot represent a problem, then it cannot possibly be trained to solve that problem. The second reason is that selecting an architecture with an inappropriately large representational capacity means that the learning process has too many degrees of freedom, resulting in increased training time and the possibility of more local minima. Thus, it is critical to select an architecture whose representational capacity is neither too large nor too small. Giles, Horne and Lin [10] were the first to recognize that Kohavi [16] had previously identified a class of automata called definite machines. The representational capabilities of these machines exactly match those of feedforward networks with delays when operating on input and output sequences over finite input-output alphabets. These networks, and their formal machine counterparts, are able to represent only certain kinds of sequence transduction and recognition problems. They are able to recognize only a subset of the regular languages. In particular, they are unable to differentiate strings over the alphabet Σ = {0, 1} that contain an even number of 1's from those containing an odd number of 1's. Output feedback networks are much more versatile in terms of the types of problems that they can represent. In fact, it has been shown that these types of networks are Turing machine equivalent, assuming the appropriate information is available on the system's output [37]. Hidden feedback networks generally have significant computational power, and have been known for some time to have the computational capabilities of Turing machines (assuming enough hidden units are available) [30, 33, 31, 15, 38]. There are some hidden feedback networks, however, whose computational power has been shown to be limited [23, 24, 25].

7 Discussion and Conclusion

We have already examined the work of Gold that showed that identifying languages in the limit can be an intractable problem.
This fact holds regardless of implementation substrate (i.e. the symbolic-connectionist hybrid described above). He made no assumptions on the mechanism for learning, only on the data presented and the problem classes to be addressed. This means that recurrent networks cannot perform the difficult cases of language learning in the limit either. On the other hand, the learning mechanisms of recurrent connectionist networks differ from their symbolic counter parts in the sense that they can consider more than one potential solution at a time. If we consider the weights of such a network to represent a hypothetical language, then the error gradient of that hypothetical language provides us with information about neighboring languages. In this sense, the difference between symbolic and network approaches to 120 S.C. Kremer and J.F. Kolen SST are similar to the differences between interior and exterior methods in linear programming. One way in which the difficulties of language learning manifest themselves under this paradigm is via a phenomenon known as “shrinking gradients”. Hochreiter [13] and Bengio, Simard and Frasconi [1] independently discovered that it is very difficult for recurrent networks to learn to latch information for long periods of time. This paper has examined the topic of symbol processing in dynamical recurrent networks. By focusing on a general sequence to sequence transduction problem, we identified the three components of any such system: input modulation, internal dynamics and output observation. We also dicussed the difficulties of this task in the context of the special case grammar induction problem as well as recurrent network solutions to this problem. The potential benefits of applying neural networks to symbol processing must be tempered by the recognition that the theoretical restrictions of the task apply to any implementational substrate. A more detailed review of dynamical recurrent networks for symbol processing and other tasks can be found in [19]. Acknowledgements Stefan C. Kremer was supported by a grant from the Natural Sciences and Engineering Research Council of Canada. John F. Kolen was supported by subcontract number NCC2-1026 from the National Aeronautics and Space Administration. References [1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157– 166, 1994. [2] A. Cleeremans, D. Servan-Schreiber, and J. L. McClelland. Finite state automata and simple recurrent networks. Neural Computation, 1(3):372–381, 1989. [3] B. de Vries and J. M. Principe. A theory for neural networks with time delays. In Richard P. Lippmann, John E. Moody, and David S. Touretzky, editors, Advances in Neural Information Processing 3, pages 162–168, San Mateo, CA, 1991. Morgan Kaufmann Publishers, Inc. [4] J. L. Elman. Distributed representations, simple recurrent networks and gram matical structure. Machine Learning, 7(2/3):195–226, 1991. [5] J.L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990. [6] J. A. Fodor and Z. W. Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28:3–71, 1988. [7] M. L. Forcada and R. P. Ñeco. Recursive hetero-associative memories for translation. In Proceedings of the International Workshop on Artificial Neural Networks IWANN’97 (Lanzarote, Spain, June, 1997), pages 453–462, 1997. [8] P. Frasconi, M. Gori, and G. Soda. Recurrent networks for continuous speech recognition. 
In Computational Intelligence 90, pages 45–56. Elsevier, September 1990. Dynamical Recurrent Networks for Sequential Data Processing 121 [9] K.-S. Fu. Syntactic Pattern Recognition and Applications. Prentice-Hall, Inc., Engelwood Cliffs, NJ, 1982. [10] C. L. Giles, B.G. Horne, and T. Lin. Learning a class of large finite state machines with a recurrent neural network. Neural Networks, 8(9):1359–1365, 1995. [11] C.L. Giles, G.Z. Sun, H.H. Chen, Y.C. Lee, and D. Chen. Higher order recurrent networks & grammatical inference. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 380–387, San Mateo, CA, 1990. Morgan Kaufmann Publishers. [12] E. M. Gold. Language identification in the limit. Information and Control, 10:447– 474, 1967. [13] J. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Master’s thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München, 1991. [14] J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison–Wesley, 1979. [15] J. Kilian and H.T. Siegelmann. On the power of sigmoid neural networks. In Proceedings of the Sixth ACM Workshop on Computational Learning Theory, pages 137–143. ACM Press, 1993. [16] Z. Kohavi. Switching and Finite Automata Theory. McGraw-Hill, Inc., New York, NY, second edition, 1978. [17] J. F. Kolen. Exploring the Computational Capabilities of Recurrent Neural Networks. PhD thesis, Ohio State University, 1994. [18] J. F. Kolen. The origin of clusters in recurrent network state space. In Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pages 508– 513, Hillsdale, NJ, 1994. Earlbaum. [19] J. F. Kolen and S. C. Kremer, editors. A Field Guide to Dynamical Recurrent Networks. IEEE Press, 2000. [20] J.F. Kolen and J.B. Pollack. The paradox of observation and the observation of paradox. Journal of Experimental and Theoretical Artificial Intelligence, 7:275– 277, 1995. [21] S. C. Kremer. Spatio-temporal connectionist networks: A taxonomy and review. submitted. [22] S. C. Kremer. On the computational power of Elman-style recurrent networks. IEEE Transactions on Neural Networks, 6(4):1000–1004, 1995. [23] S. C. Kremer. Comments on ‘constructive learning of recurrent neural networks: Limitations of recurrent cascade correlation and a simple solution’. IEEE Transactions on Neural Networks, 7(4):1049–1051, July 1996. [24] S. C. Kremer. Finite state automata that recurrent cascade-correlation cannot represent. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 612–618. MIT Press, 1996. [25] S. C. Kremer. Identification of a specific limitation on local-feedback recurrent networks acting as mealy-moore machines. IEEE Transactions on Neural Networks, 10(2):433–438, March 1999. [26] T. Lin, B. G. Horne, P. Tiño, and C. L. Giles. Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338, 1996. [27] M. C. Mozer. Neural net architectures for temporal sequence processing. In A.S. Weigend and N.A. Gershenfeld, editors, Time Series Prediction, pages 243–264. Addison–Wesley, 1994. 122 S.C. Kremer and J.F. Kolen [28] K. S. Narendra and K. Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1(1):4– 27, 1990. [29] K. S. Narendra and K. Parthasarathy. 
Gradient methods for the optimization of dynamical systems containing neural networks. IEEE Transactions on Neural Networks, 2:252–262, March 1991. [30] J. B. Pollack. On Connectionist Models of Natural Language Processing. PhD thesis, Computer Science Department of the University of Illinois at UrbanaChampaign, Urbana, Illinois, 1987. [31] J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46:77– 105, 1990. [32] J. B. Pollack. The induction of dynamical recognizers. Machine Learning, 7:227– 252, 1991. [33] J.B. Pollack. Implications of recursive distributed representations. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems 1, pages 527– 536, San Mateo, CA, 1989. Morgan Kaufmann. [34] D. Rumberlhart, G. Hinton, and R. Williams. Learning internal representation by error propagation. In J. L. McClelland, D.E. Rumelhart, and the P.D.P. Group (Eds. ), editors, Parallel Distributed Processing: Explorations in the Micro structure of Cognition, volume 1: Foundations, pages 318–364. MIT Press, Cambridge, MA, 1986. [35] J. H. Schmidhuber. A fixed size storage o(n3 ) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243– 248, 1992. [36] T. J. Sejnowski and C. R. Rosenberg. NETtalk: a parallel network that learns to read aloud. In J.A. Anderson and E. Rosenfeld, editors, Neurocomputing: Foundations of Research, pages 663–672. MIT Press, 1988. [37] H.T. Siegelmann, B.G. Horne, and C.L. Giles. Computational capabilities of recurrent narx neural networks. IEEE Transactions on Systems, Man and Cybernetics, 1997. In press. [38] H.T. Siegelmann and E.D. Sontag. On the computational power of neural nets. Journal of Computer and System Sciences, 50(1):132–150, 1995. [39] A. Waibel. Consonant recognition by modular construction of large phonemic time-delay neural networks. In D.Z. Anderson, editor, Neural Information Processing Systems, pages 215–223, New York, NY, 1988. American Institute of Physics. [40] R. J. Williams and D. Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280, 1989. Fuzzy Knowledge and Recurrent Neural Networks: A Dynamical Systems Perspective⋆ Christian W. Omlin1 , Lee Giles2,3 , and K.K. Thornber2 1 Department of Computer Science, University of Stellenbosch, South Africa 2 NEC Research Institute, Princeton, NJ 08540 3 UMIACS, U. of Maryland, College Park, MD 20742 E-mail: omlin@cs.sun.ac.za {giles,karvel}@research.nj.nec.com Abstract. Hybrid neuro-fuzzy systems - the combination of artificial neural networks with fuzzy logic - are becoming increasingly popular. However, neuro-fuzzy systems need to be extended for applications which require context (e.g., speech, handwriting, control). Some of these applications can be modeled in the form of finite-state automata. This chapter presents a synthesis method for mapping fuzzy finite-state automata (FFAs) into recurrent neural networks. The synthesis method requires FFAs to undergo a transformation prior to being mapped into recurrent networks. Their neurons have a slightly enriched functionality in order to accommodate a fuzzy representation of FFA states. This allows fuzzy parameters of FFAs to be directly represented as parameters of the neural network. We present a proof the stability of fuzzy finite-state dynamics of constructed neural networks and through simulations give empirical validation of the proofs. 
1 Introduction

1.1 Preliminaries

We all use fuzzy concepts in our everyday lives. We say a car is traveling 'very slowly' or 'fast' without specifying the exact speed in miles per hour. Fuzzy set theory makes it possible to define such linguistic quantities mathematically and to carry out computations with vague information, such as fuzzy reasoning. It was introduced as an alternative to traditional set theory to provide a calculus for reasoning under uncertainty [1]. Whereas the latter specifies crisp set membership, i.e. an object either is or is not an element of a set, the former allows for a graded membership function which can be interpreted as vague or uncertain information. Fuzzy logic has been very successful in a variety of applications [2,3,4,5,6,7,8,9,10].

⋆ This chapter contains material reprinted from IEEE Transactions on Fuzzy Systems, Vol. 6, No. 1, pp. 76-89, © 1998, Institute of Electrical and Electronics Engineers, and from Proceedings of the IEEE, Vol. 87, No. 9, pp. 1623-1640, © 1999, Institute of Electrical and Electronics Engineers, by permission.

Fig. 1. Numbers as Fuzzy Sets: Numbers are defined as fuzzy sets with a triangular membership function µ(x). The fuzzy sets are Zero (ZE), small negative (SN), small positive (SP), medium negative (MN), medium positive (MP), large negative (LN), and large positive (LP). The membership function µ(x) defines to what degree points on the x-axis belong to these fuzzy sets.

An example of fuzzy sets for real numbers is shown in Fig. 1. It is impossible to say that the number one is a small positive number but that the number two is not. However, we can define a degree to which numbers belong to the set of small positive numbers. Since the number two is larger than the number one, it is plausible to define the number two to belong to the set of small positive numbers to a lesser degree than the number one. Fuzzy logic has been particularly successful in applications where linguistic variables often have a clear physical meaning, thus making it possible to incorporate rule-based and linguistic information in a systematic way. Then, powerful algorithms for training neural networks can be applied to refine the parameters of fuzzy systems, resulting in adaptive fuzzy systems. Fuzzy systems and feedforward neural networks share many similarities. Among other common characteristics, they are computationally equivalent. It has been shown in [11] that fuzzy systems of the form

IF x is Ai AND y is Bi THEN z is Ci

where Ai, Bi, and Ci are fuzzy sets with triangular membership functions, are universal approximators:

Theorem 1. For any given real-valued continuous function Γ defined on a compact set, there exists a fuzzy logic system whose output approximates Γ with arbitrary accuracy.

It has been shown in [12] that recurrent neural networks are computationally at least as powerful as Turing machines; whether or not these results also apply to recurrent fuzzy systems remains an open question. While the methodologies underlying fuzzy systems and neural networks are quite different, their functional forms are often similar.
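As a small illustration of graded membership, the triangular membership functions of Fig. 1 can be written down directly. The sketch is hypothetical: the breakpoints of the 'small positive' (SP) set are assumed for illustration rather than taken from the chapter.

def triangular(x, left, peak, right):
    # Triangular membership function mu(x): 0 outside [left, right], rising to 1 at peak.
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)

# Assumed placement of the 'small positive' set: zero membership at 0 and 20, full at 10.
def mu_SP(x):
    return triangular(x, left=0.0, peak=10.0, right=20.0)

for x in (2.0, 10.0, 18.0):
    print(x, mu_SP(x))   # graded memberships 0.2, 1.0, 0.2

A crisp set, by contrast, would return only 0 or 1.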
The development of learning algorithms for neural networks has been beneficial to the field of fuzzy systems, which has adopted some of these algorithms; there exist backpropagation training algorithms for fuzzy logic systems which are similar to those for neural networks [13,14]. However, most of the proposed combined architectures are only able to process static input-output relationships; they are not able to process temporal input sequences of arbitrary length.

Fig. 2. Fuzzy Feedforward Neural Network: An example of initializing a feedforward neural network with fuzzy linguistic rules (solid directed connections). Inputs x1 and x2 (layer 1) feed antecedent labels A1-A4 (layer 2), which feed rules R1-R4 (layer 3), which feed consequent labels C1-C3 (layer 4), which produce the action y (layer 5). Additional weights may be added for knowledge refinement or revision through training (dashed directed connections). They may be useful for unlearning incorrect prior knowledge or learning unknown rule dependencies. Typically, radial basis functions of the form e^(−((x−a)/b)²) are used to represent fuzzy sets (layers 2 and 4); sigmoid discriminant functions of the form 1/(1 + e^(−x)) are appropriate for computing the conjunction of the antecedents of a rule (layer 3).

In some cases, neural networks can be structured based on the principles of fuzzy logic [15,16]. Neural network representations of fuzzy logic interpolation have also been used within the context of reinforcement learning [17]. Consider the following set of linguistic fuzzy rules:

IF (x1 is A1) AND (x2 is A3) THEN C1.
IF ((x1 is A2) AND (x2 is A4)) OR ((x1 is A1) AND (x2 is A3)) THEN C2.
IF (x1 is A2) AND (x2 is A4) THEN C3.

where Ai and Cj are fuzzy sets and xk are linguistic input variables. A possible mapping of such rules into a feedforward neural network is shown in Fig. 2. These mappings are typical for control applications. The network has an input layer (layer 1) consisting of real-valued input variables x1 and x2 (e.g., linguistic variables), a fuzzification layer (layer 2) which maps input values xi to fuzzy sets Ai, an interpolation layer (layer 3) which computes the conjunction of all antecedent conditions in a rule (e.g., a differential softmin operation), a layer which combines the contributions from all fuzzy control rules in the rule base (layer 4), and a defuzzification layer (layer 5) which computes the final output (e.g., the mean of maximum method). Thus, fuzzy neural networks play the role of fuzzy logic interpolation engines.¹ The rules are then tuned using a training algorithm for feedforward neural networks.

2 Fuzzy Finite-State Automata

2.1 Motivation

There exist applications where the variables of linguistic rules are recursive, i.e., the rules are of the form

IF (x(t − 1) is α) AND (u(t − 1) is β) THEN x(t) is γ

where u(t − 1) and x(t − 1) represent previous input and state variables, respectively. The value of the state variable x(t) depends on both the previous input u(t − 1) and the previous state x(t − 1). Clearly, feedforward neural networks generally do not have the computational capabilities to represent such recursive rules when the depth of the recursion is not known a priori. Recurrent neural networks, on the other hand, are capable of representing recursive rules.
Recurrent neural networks have the ability to store information over indefinite periods of time, can develop ‘hidden’ states through learning, and are thus potentially useful for representing recursive linguistic rules. Since a large class of problems where the current state depends on both the current input and the previous state can be modeled by finite-state automata, it is reasonable to investigate whether recurrent neural networks can also represent fuzzy finite-state automata (FFAs). 2.2 Significance of Fuzzy Automata Fuzzy finite-state automata (FFAs). have a long history [20,21]; the fundamentals of FFAs have been discussed in [22] without presenting a systematic machine synthesis method. Their potential as design tools for modeling a variety of systems is beginning to be exploited. Such systems have two major characteristics: (1) the current state of the system depends on past states and current inputs, and (2) the knowledge about the system’s current state is vague or uncertain. Fuzzy automata have been found to be useful in a variety of applications such as in the analysis of X-rays [23], in digital circuit design [24], and in the design of intelligent human-computer interfaces [25]. Neural network implementations of 1 The term fuzzy inference is also used to describe the function of a fuzzy neural network. We choose the term fuzzy logic interpolation in order to distinguish between the function of fuzzy neural networks and fuzzy logic inference where the objective is to obtain some properties of fuzzy sets B1 , B2 , . . . from properties of fuzzy sets A1 , A2 , . . . with the help of an inference scheme A1 , A2 , . . . → B1 , B2 , . . . which is governed by a set of rules [18,19]. Fuzzy Knowledge and Recurrent Neural Networks 127 fuzzy automata have been proposed in the literature [26,27,28,29]. The synthesis method proposed in [26] uses digital design technology to implement fuzzy representations of states and outputs. In [29], the implementation of a Moore machine with fuzzy inputs and states is realized by training a feedforward network explicitly on the state transition table using a modified backpropagation algorithm. From a control perspective, fuzzy finite-state automata have been shown to be useful for modeling fuzzy dynamical systems, often in conjunction with recurrent neural networks [30,31,32,33]. There has been a lot of interest in learning, synthesis, and extraction of finite-state automata in recurrent neural networks [34,35,36,37,38,39,40,41]. In contrast to deterministic finite-state automata (DFAs), a set of FFA states can be occupied to varying degrees at any point in time; this fuzzification of states generally reduces the size of the model, and the dynamics of the system being modeled is often more accessible to a direct interpretation. We have previously shown how fuzzy finite-state automata (FFAs) can be mapped into recurrent neural networks with second-order weights using a crisp representation of FFA states [42]. That encoding required a transformation of a FFA into a DFA which computes the membership functions for strings; it is only applicable to a restricted class of FFAs which have final states. The transformation of a fuzzy automaton into an equivalent deterministic acceptor generally increases the size of the automaton and thus the network size. 
Furthermore, the fuzzy transition memberships of the original FFA undergo modifications in the transformation of the original FFA into an equivalent DFA suitable for implementation in a second-order recurrent neural network. Thus, the direct correspondence between system and network parameters is lost, which may obscure the natural fuzzy description of the systems being modeled. Here, we present a method for encoding FFAs using a fuzzy representation of states. The objectives of the FFA encoding algorithm are (1) ease of encoding FFAs into recurrent networks, (2) the direct representation of "fuzziness", i.e. the fuzzy memberships of individual transitions in FFAs are also parameters in the recurrent networks, and (3) achieving a fuzzy representation by making only minimal changes to the underlying architecture used for encoding DFAs (and crisp FFA representations). Representation of FFAs in recurrent networks requires that the internal representation of FFA states and state transitions be stable for indefinite periods of time. Proofs of the representational properties of AI and machine learning structures are important for a number of reasons. Many users of a model want guarantees about what it can theoretically do, i.e. its performance and capabilities; others need this for use justification and acceptance. The capability of representing DFAs in recurrent networks can be viewed as a foundation for the problem of learning DFAs from examples (if a network cannot represent DFAs, then it certainly will have difficulty in learning them). A stable encoding of knowledge means that the model will give the correct answer (string membership in this case) independent of when the system is used or how long it is used. This can lead to robustness that is noise independent.

2.3 Formal Definition

In this section, we give a formal definition of FFAs [43] and illustrate the definition with an example.

Definition 1. A fuzzy finite-state automaton (FFA) M is defined by an alphabet Σ = {a1, . . . , am}, a set of states Q = {q1, . . . , qn}, a fuzzy start state R ∈ Q,² a finite output alphabet Z, a fuzzy transition map δ : Σ × Q × [0, 1] → Q, and an output map ω : Q → Z.

Weights θijk ∈ [0, 1] define the 'fuzziness' of state transitions, i.e. a FFA can simultaneously be in different states with different degrees of certainty. The particular output mapping depends on the nature of the application. Since our goal is to construct a fuzzy representation of FFA states and their stability over time, we will ignore the output mapping ω for the remainder of this discussion, and not concern ourselves with the language L(M) defined by M. For a possible definition, see [43]. An example of a FFA over the input alphabet {0, 1} is shown in Fig. 3.

Fig. 3. Example of a Fuzzy Finite-State Automaton: A fuzzy finite-state automaton is shown with weighted state transitions. State 1 is the automaton's start state. A transition from state qj to qi on input symbol ak with weight θ is represented as a directed arc from qj to qi labeled ak/θ. Note that the transitions from states 1 and 4 on input symbol '0' are fuzzy (δ(1, 0, .) = {2, 3} and δ(4, 0, .) = {2, 3}).

² In general, the start state of a FFA is fuzzy, i.e. it consists of a set of states that are occupied with varying memberships.
It has been shown that a restricted class of FFAs whose initial state is a single crisp state is equivalent to the class of FFAs described in Definition 1 [43]. The distinction between the two classes of FFAs is irrelevant in the context of this chapter.

3 Representation of Fuzzy States

3.1 Preliminaries

The current fuzzy state of a FFA M is a collection of states {qi} of M which are occupied with different degrees of fuzzy membership. A fuzzy representation of FFA states thus requires knowledge of the membership with which each state qi is occupied. This requirement dictates the representation of the current fuzzy state in a recurrent neural network. Since the method for encoding FFAs in recurrent neural networks is a generalization of the method for encoding DFAs, we will briefly discuss the DFA encoding algorithm. For DFA encodings, we used discrete-time, second-order recurrent neural networks with sigmoidal discriminant functions which update their current state according to the following equations:

Si(t+1) = g(αi(t)) = 1 / (1 + e^(−αi(t))),   αi(t) = bi + Σj,k Wijk Sj(t) Ik(t),   (1)

where bi is the bias associated with hidden recurrent state neuron Si, Wijk is a second-order weight, and Ik denotes the input neuron for symbol ak. The indices i, j, and k run over all state and input neurons, respectively. The product Sj(t) Ik(t) directly corresponds to the state transition δ(qj, ak) = qi. DFAs can be encoded in discrete-time, second-order recurrent neural networks with sigmoidal discriminant functions such that the DFA and the constructed network accept the same regular language [44]. The desired finite-state dynamics are encoded into a network by programming a small subset of all available weights to the values +H and −H; this leads to a nearly orthonormal internal DFA state representation for sufficiently large values of H, i.e. a one-to-one correspondence between current DFA states and recurrent neurons with a high output. Since the magnitudes of all weights in a constructed network are equal to H, the equation governing the dynamics of a constructed network is of the special form

Si(t+1) = g(x, H) = 1 / (1 + e^(H(1−2x)/2)),   (2)

where x is the input to neuron Si.

3.2 Recurrent State Neurons with Variable Output Range

We extend the functionality of recurrent state neurons in order to represent fuzzy states. The main difference between the neuron discriminant function for DFAs and FFAs is that the neuron now receives as inputs the weight strength H, the signal x which represents the collective input from all other neurons, and the transition weight θijk, where δ(qj, ak, θijk) = qi:

Si(t+1) = g̃(x, H, θijk) = θijk / (1 + e^(H(θijk−2x)/(2θijk))).   (3)

The value of θijk is different for each of the states that collectively make up the current fuzzy network state. This is consistent with the definition of FFAs. Compared to the discriminant function g(.) for the encoding of DFAs, the weight H which programs the network state transitions is strengthened by a factor 1/θijk (0 < θijk ≤ 1); the range of the function g̃(.) is squashed to the interval [0, θijk], and it has been shifted towards the origin. Setting θijk = 1 reduces the function (3) to the sigmoidal discriminant function (2) used for DFA encoding. More formally, the function g̃(x, H, θ) has the following important invariant property, which will later simplify the analysis:

Lemma 1. g̃(θx, H, θ) = θ g̃(x, H, 1) = θ g(x, H).
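The two discriminant functions and the scaling property of Lemma 1 are easy to check numerically; the following sketch is a hypothetical illustration (the parameter values are arbitrary), not code from the chapter.

import math

def g(x, H):
    # Sigmoidal discriminant used for DFA encodings, Equation (2).
    return 1.0 / (1.0 + math.exp(H * (1.0 - 2.0 * x) / 2.0))

def g_tilde(x, H, theta):
    # Discriminant with variable output range [0, theta], Equation (3).
    return theta / (1.0 + math.exp(H * (theta - 2.0 * x) / (2.0 * theta)))

H, theta = 8.0, 0.7
for x in (0.1, 0.4, 0.9):
    lhs = g_tilde(theta * x, H, theta)      # Lemma 1, left-hand side
    rhs = theta * g(x, H)                   # Lemma 1, right-hand side
    assert abs(lhs - rhs) < 1e-12
    print(f"x={x:.1f}  g_tilde(theta*x)={lhs:.4f}  theta*g(x)={rhs:.4f}")

assert abs(g_tilde(0.3, H, 1.0) - g(0.3, H)) < 1e-12   # theta = 1 recovers g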
Thus, g̃(x, H, θ) can be obtained by scaling g̃(x, H, 1) = g(x, H) uniformly in the x- and y-directions by a factor θ. This invariant property of g̃ allows a stability analysis of the internal FFA state representation to be carried out along the same lines as the analysis of the stability of the internal DFA state representation.

3.3 Programming Fuzzy State Transitions

Consider state qj of FFA M and the fuzzy state transition δ(qj, ak, {θijk}) = {qi1, . . . , qir}. We assign recurrent state neuron Sj to FFA state qj and neurons Si1, . . . , Sir to FFA states qi1, . . . , qir. The basic idea is as follows: the activation of recurrent state neuron Si represents the certainty θijk with which some state transition δ(qj, ak, θijk) = qi is carried out, i.e. Si(t+1) ≃ θijk. If qi is not reached at time t+1, then we have Si(t+1) ≃ 0. Thus, we program the second-order weights Wijk as follows:

Wijk = +H if qi ∈ δ(qj, ak, θijk), and 0 otherwise;   (4)
Wjjk = +H if qj ∈ δ(qj, ak, θjjk), and −H otherwise;   (5)
bi = −H/2 for all qi ∈ M.   (6)

Setting Wijk to a large positive value will ensure that Si(t+1) is arbitrarily close to θijk, and setting Wjjk to a large negative value will guarantee that the output Sj(t+1) is arbitrarily close to 0. This is the same technique used for programming DFA state transitions in recurrent networks [44] and for encoding partial prior knowledge of a DFA for rule refinement [45].

4 Automata Transformation

4.1 Preliminaries

The above encoding algorithm leaves open the possibility of ambiguities when a FFA is encoded in a recurrent network, as follows: Consider two FFA states qj and ql with transitions δ(qj, ak, θijk) = δ(ql, ak, θilk) = qi, where qi is one of the successor states reached from qj and ql, respectively, on input symbol ak. Further assume that qj and ql are members of the set of current FFA states (i.e., these states are occupied with some fuzzy membership). Then, the state transition δ(qj, ak, θijk) = qi requires that recurrent state neuron Si have dynamic range [0, θijk], while state transition δ(ql, ak, θilk) = qi requires that state neuron Si asymptotically approach θilk. For θijk ≠ θilk, we have an ambiguity in the output range of neuron Si:

Definition 2. We say an ambiguity occurs at state qi if there exist two states qj and ql with δ(qj, ak, θijk) = δ(ql, ak, θilk) = qi and θijk ≠ θilk. A FFA M is called ambiguous if an ambiguity occurs for any state qi ∈ M.

That ambiguity could be resolved by testing all possible paths through the FFA and identifying those states for which the above-described ambiguity can occur. However, such an endeavor is computationally expensive. Instead, we propose to resolve the ambiguity by transforming any FFA M.

4.2 Transformation Algorithm

Before we state the transformation theorem and give the algorithm, it will be useful to define the concept of equivalent FFAs:

Definition 3. Consider a FFA M that is processing some string s = σ1σ2 . . . σL with σi ∈ Σ. As M reads each symbol σi, it makes simultaneous weighted state transitions δ : Σ × Q × [0, 1] → Q according to the fuzzy transition map δ(qj, ak, θijk) = qi. The set of distinct weights {θijk} of the fuzzy transition map at time t is called the active weight set.

Note that the active weight set can change with each symbol σi processed by M. We will define what it means for two FFAs to be equivalent:

Definition 4.
Two FFAs M and M′ with alphabet Σ are called equivalent if their active weight sets are at all times identical for any string s ∈ Σ∗.

We have proven the following theorem [46]:

Theorem 2. Any FFA M can be transformed into an equivalent, unambiguous FFA M′.

The trade-off for making the resolution of ambiguities computationally feasible is an increase in the number of FFA states. We have proven the correctness of the algorithm in [46]. Here, we illustrate the algorithm with an example.

4.3 Example of FFA Transformation

Consider the FFA shown in Fig. 4a with four states and input alphabet Σ = {0, 1}; state q1 is the start state.³ The algorithm initializes the variable 'list' with all FFA states, i.e., list = {q1, q2, q3, q4}. First, we notice that no ambiguity exists for input symbol '0' at state q1 since there are no state transitions δ(., 0, .) = q1. There exist two state transitions which have state q1 as their target, i.e. δ(q2, 1, 0.2) = δ(q3, 1, 0.7) = q1. Thus, we set the variable visit = {q2, q3}. According to Definition 2, an ambiguity exists since θ121 ≠ θ131. We resolve that ambiguity by introducing a new state q5 and setting δ(q3, 1, 0.7) = q5. Since δ(q3, 1, 0.7) no longer leads to state q1, we need to introduce new state transitions leading from state q5 to the target states {q} of all possible state transitions δ(q1, ., .) = {q2, q3}. Thus, we set δ(q5, 0, θ250) = q2 and δ(q5, 1, θ351) = q3 with θ250 = θ210 and θ351 = θ311. One iteration through the outer loop thus results in the FFA shown in Fig. 4b. Consider Fig. 4d, which shows the FFA after 3 iterations. State q4 is the only state left which has incoming transitions δ(., ak, θ4.k) = q4 where not all values θ4.k are identical. We have δ(q2, 0, 0.9) = δ(q6, 0, 0.9) = q4; since these two state transitions do not cause an ambiguity for input symbol '0', we leave them as they are. However, we also have δ(q2, 0, θ420) = δ(q3, 0, θ430) = δ(q7, 0, θ470) = q4 with θ430 = θ470 ≠ θ420 = 0.9. Instead of creating new states for both state transitions δ(q3, 0, θ430) and δ(q7, 0, θ470), it suffices to create one new state q8 and to set δ(q3, 0, 0.1) = δ(q7, 0, 0.1) = q8. States q6 and q7 are the only possible successor states on input symbols '0' and '1', respectively. Thus, we set δ(q8, 0, 0.6) = q6 and δ(q8, 1, 0.4) = q7. There exist no more ambiguities and the algorithm terminates (Fig. 4e).

³ The FFA shown in Fig. 4a is a special case in that it does not contain any fuzzy transitions. Since the objective of the transformation algorithm is to resolve ambiguities for states qi with δ({qj1, . . . , qjr}, ak, {θij1k, . . . , θijrk}) = qi, fuzziness is of no relevance; therefore, we omitted it for reasons of simplicity.

5 Network Architecture

The architecture for representing FFAs is similar to the architecture for DFAs, except that each neuron Si of the state transition module has a dynamic output range [0, θijk], where θijk is the rule weight in the FFA state transition δ(qj, ak, θijk) = qi. Notice that each neuron Si is only connected to pairs (Sj, Ik) for which θijk = θij′k, since we assume that M is transformed into an equivalent, unambiguous FFA M′ prior to the network construction. The weights Wijk are programmed as described in Section 3.3. Each recurrent state neuron receives as inputs the value Sj(t) and an output range value θijk; it computes its output according to Equation (3).

6 Network Stability Analysis

6.1 Preliminaries

In order to demonstrate how the FFA encoding algorithm achieves stability of the internal FFA state representation for indefinite periods of time, we need to understand the dynamics of signals in a constructed recurrent neural network. We define stability of an internal FFA state representation as follows:

Definition 5. A fuzzy encoding of FFA states with transition weights {θijk} in a second-order recurrent neural network is called stable if only state neurons
6 Network Stability Analysis

6.1 Preliminaries

In order to demonstrate how the FFA encoding algorithm achieves stability of the internal FFA state representation for indefinite periods of time, we need to understand the dynamics of signals in a constructed recurrent neural network. We define stability of an internal FFA state representation as follows:

Definition 5. A fuzzy encoding of FFA states with transition weights {θijk} in a second-order recurrent neural network is called stable if only state neurons corresponding to the set of current FFA states have an output greater than θijk/2, where θijk is the dynamic range of recurrent state neurons, and all remaining recurrent neurons have low output signals less than θijk/2, for all possible input sequences.

It follows from that definition that there exists an upper bound 0 < φ− < θijk/2 for low signals and a lower bound θijk/2 < φ+ < θijk for high signals in networks that represent stable FFA encodings. The ideal values for low and high signals are 0 and θijk, respectively.

6.2 Fixed Point Analysis for Sigmoidal Discriminant Function

Here, we summarize without proofs some of the results that we used to demonstrate stability of neural DFA encodings; details of the proofs can be found in [44]. In order to guarantee that low signals remain low, we have to give a tight upper bound for low signals which remains valid for an arbitrary number of time steps:

Lemma 2. The low signals are bounded from above by the fixed point [φ−f]θ of the function

\[
\begin{cases} f^0 = 0 \\ f^{t+1} = \tilde{g}(r \cdot f^t) \end{cases} \tag{7}
\]

where [φ−f]θ represents the fixed point of the discriminant function g̃() with variable output range θ, and r denotes the maximum number of neurons that contribute to a neuron's input.

For reasons of simplicity, we will write φ−f for [φ−f]θ with the implicit understanding that the location of fixed points depends on the particular choice of θ. This lemma can easily be proven by induction on t. It is easy to see that the function to be iterated in Equation (7) is

\[
f(x, H, \theta, r) = \frac{\theta}{1 + e^{H(\theta - 2rx)/2\theta}} .
\]

We will show later in this section that the conditions that guarantee the existence of one or three fixed points are independent of the parameter θ. The function f(x, H, θ, r) has some desirable properties:

Lemma 3. For any H > 0, the function f(x, H, θ, r) has at least one fixed point φ0f.
Lemma 4. There exists a value H0−(r) such that for any H > H0−(r), f(x, H, θ, r) has three fixed points 0 < φ−f < φ0f < φ+f < θ.

Lemma 5. If f(x, H, θ, r) has three fixed points φ−f, φ0f, and φ+f, then

\[
\lim_{t \to \infty} f^t = \begin{cases} \varphi_f^- & x_0 < \varphi_f^0 \\ \varphi_f^0 & x_0 = \varphi_f^0 \\ \varphi_f^+ & x_0 > \varphi_f^0 \end{cases} \tag{8}
\]

where x0 is an initial value for the iteration of f(.).

Lemma 5 can be proven by defining an appropriate Lyapunov function P and showing that P has minima at φ−f and φ+f.⁴

⁴ Lyapunov functions can be used to prove the stability of dynamical systems. For a given dynamical system S, if there exists a function P - we can think of P as an energy function - such that P has at least one minimum, then S has a stable state. Here, we can choose P(xi) = (xi − φf)², where xi is the value of f(.) after i iterations and φf is one of the fixed points. It can be shown algebraically that, for x0 ≠ φ0f, P(xi) decreases with every step of the iteration of f(.) until a stable fixed point is reached.

The basic idea behind the network stability analysis is to show that neuron outputs never exceed or fall below some fixed points φ− and φ+, respectively. The fixed points φ−f and φ+f are only valid upper and lower bounds on low and high signals, respectively, if convergence toward these fixed points is monotone. The following corollary establishes monotone convergence of f^t towards fixed points:

Corollary 1. Let f^0, f^1, f^2, ... denote the finite sequence computed by successive iteration of the function f. Then we have f^0 < f^1 < ... < φf, where φf is one of the stable fixed points of f(x, H, θ, r).

With these properties, we can quantify the value H0−(r) such that, for any H > H0−(r), f(x, H, θ, r) has three fixed points. The low and high fixed points φ−f and φ+f are the bounds for low and high signals, respectively. The larger r, the larger H must be chosen in order to guarantee the existence of three fixed points. If H is not chosen sufficiently large, then f^t converges to a unique fixed point θ/2 < φf < θ. The following lemma expresses a quantitative condition which guarantees the existence of three fixed points:

Lemma 6. The function

\[
f(x, H, \theta, r) = \frac{\theta}{1 + e^{H(\theta - 2rx)/2\theta}}
\]

has three fixed points 0 < φ−f < φ0f < φ+f < θ if H is chosen such that

\[
H > H_0^-(r) = \frac{2\left(\theta + (\theta - x)\log\frac{\theta - x}{x}\right)}{\theta - x}
\]

where x satisfies the equation

\[
r = \frac{\theta^2}{2x\left(\theta + (\theta - x)\log\frac{\theta - x}{x}\right)} .
\]

Even though the location of fixed points of the function f depends on H, r, and θ, we will use [φf]θ as a generic name for any fixed point of a function f. Similarly, we can quantify high signals in a constructed network:

Lemma 7. The high signals are bounded from below by the fixed point [φ+h]θ of the function

\[
\begin{cases} h^0 = 1 \\ h^{t+1} = \tilde{g}(h^t - f^t) \end{cases} \tag{9}
\]

Notice that the above recurrence relation couples f^t and h^t, which makes it difficult, if not impossible, to find a function h(x, θ, r) which, when iterated, gives the same values as h^t. However, we can bound the sequence h^0, h^1, h^2, ... from below by a recursively defined function p^t - i.e. ∀t : p^t ≤ h^t - which decouples h^t from f^t:

Lemma 8. Let [φf]θ denote the fixed point of the recursive function f, i.e. lim_{t→∞} f^t = [φf]θ. Then the recursively defined function p

\[
\begin{cases} p^0 = 1 \\ p^{t+1} = \tilde{g}(p^t - [\varphi_f]_\theta) \end{cases} \tag{10}
\]

has the property that ∀t : p^t ≤ h^t.

Then, we have the following sufficient condition for the existence of two stable fixed points of the function defined in Equation (9):
Lemma 9. Let the iterative function p^t have two stable fixed points and ∀t : p^t ≤ h^t. Then the function h^t also has two stable fixed points.

The above lemma has the following corollary:

Corollary 2. A constructed network's high signals remain stable if the sequence p^0, p^1, p^2, ... converges towards the fixed point θ/2 < [φ+p]θ < θ.

Since we have decoupled the iterated function h^t from the iterated function f^t by introducing the iterated function p^t, we can apply the same technique to p^t for finding conditions for the existence of fixed points as in the case of f^t. In fact, the function that, when iterated, generates the sequence p^0, p^1, p^2, ... is defined by

\[
p(r, \theta, x) = \frac{\theta}{1 + e^{H(\theta - 2(x - [\varphi_f^-]_\theta))/2\theta}} = \frac{\theta}{1 + e^{H'(\theta - 2r'x)/2\theta}} \tag{11}
\]

with

\[
H' = H(1 + 2[\varphi_f^-]_\theta), \qquad r' = \frac{1}{1 + 2[\varphi_f^-]_\theta} . \tag{12}
\]

We can iteratively compute the value of [φp]θ for given parameters H and r. This results in the following lemma for high signals:

Lemma 10. The function

\[
p(x, [\varphi_f^-]_\theta) = \frac{1}{1 + e^{H(\theta - 2(x - [\varphi_f^-]_\theta))/2\theta}}
\]

has three fixed points 0 < [φ−p]θ < [φ0p]θ < [φ+p]θ < 1 if H is chosen such that

\[
H > H_0^+(r) = \frac{2\left(\theta + (\theta - x)\log\frac{\theta - x}{x}\right)}{(1 + 2[\varphi_f^-]_\theta)(\theta - x)}
\]

where x satisfies the equation

\[
\frac{1}{1 + 2[\varphi_f^-]_\theta} = \frac{\theta^2}{2x\left(\theta + (\theta - x)\log\frac{\theta - x}{x}\right)} .
\]

Since there is a collection of fuzzy transition memberships θijk involved in the algorithm for constructing fuzzy representations of FFAs, we need to determine whether the conditions of Lemmas 6 and 10 hold true for all rule weights θijk. The following corollary establishes a useful invariant property of the function H(x, r, θ):

Corollary 3. The value of the minimum of H(x, r, θ) only depends on the value of r and is independent of the particular values of θ:

\[
\inf_x H(x, r, \theta) = \inf_x \frac{2\theta\log\frac{\theta - x}{x}}{\theta - 2rx} = H_0(r) \tag{13}
\]

The relevance of the above corollary is that there is no need to test the conditions for all possible values of θ in order to guarantee the existence of fixed points. We can now proceed to prove stability of low and high signals, and thus stability of the fuzzy representation of FFA states, in a constructed recurrent neural network.

6.3 Network Stability

The existence of two stable fixed points of the discriminant function is only a necessary condition for network stability. We also need to establish conditions under which these fixed points are upper and lower bounds of stable low and high signals, respectively. Before we define and derive the conditions for network stability, it is convenient to apply the result of Lemma 1 of Section 3.2 to the fixed points of the sigmoidal discriminant function:

Corollary 4. For any value θ with 0 < θ ≤ 1, the fixed points [φ]θ of the discriminant function

\[
\frac{\theta}{1 + e^{H(\theta - 2rx)/2\theta}}
\]

have the following invariant relationship: [φ]θ = θ [φ]1.

Thus, we do not have to consider the conditions separately for all values of θ that occur in a given FFA. We now redefine stability of recurrent networks constructed from DFAs in terms of fixed points:

Definition 6. An encoding of DFA states in a second-order recurrent neural network is called stable if all the low signals are less than [φ0f]θi and all the high signals are greater than [φ0h]θi, for all θi of all state neurons Si.

We have simplified θi.. to θi because the output of each neuron Si has a fixed upper limit θ for a given input symbol, regardless of which neurons Sj contribute residual inputs. We note that this new definition is stricter than the one we gave in Definition 5.
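Because, by Corollary 3, the critical weight strength depends only on r, the threshold H0−(r) of Lemma 6 can be evaluated once with θ = 1. The following sketch is not from the paper; it simply solves the defining equation of Lemma 6 by bisection:

```python
import math

def _q(x):
    # q(x) = 2x(1 + (1 - x) ln((1 - x)/x)) for theta = 1; Lemma 6 requires r = 1/q(x).
    return 2.0 * x * (1.0 + (1.0 - x) * math.log((1.0 - x) / x))

def H0_minus(r, tol=1e-12):
    """Smallest H guaranteeing three fixed points of f(x, H, 1, r) (Lemma 6)."""
    lo, hi = 1e-12, 0.5                 # q increases from 0 to q(1/2) = 1 on (0, 1/2]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if _q(mid) < 1.0 / r:
            lo = mid
        else:
            hi = mid
    x = 0.5 * (lo + hi)
    return 2.0 * (1.0 + (1.0 - x) * math.log((1.0 - x) / x)) / (1.0 - x)

print(H0_minus(1))   # 4.0
print(H0_minus(5))   # about 9.73
```

For r = 1 this yields H0− = 4, and for r = 5 it yields H0− ≈ 9.73, which is relevant to the simulations reported in Section 7.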
In order for the low signal to remain stable, the following condition has to be satisfied:

\[
-\frac{H}{2} + H r\, [\varphi_f^-]_{\theta_j} < [\varphi_f^0]_{\theta_j} \tag{14}
\]

Similarly, the following inequality must be satisfied for stable high signals:

\[
-\frac{H}{2} + H [\varphi_h^+]_{\theta_j} - H [\varphi_f^-]_{\theta_i} > [\varphi_h^0]_{\theta_i} \tag{15}
\]

The above two inequalities must be satisfied for all neurons at all times. Instead of testing for all values θijk separately, we can simplify the set of inequalities as follows:

Lemma 11. Let θmin and θmax denote the minimum and maximum, respectively, of all fuzzy transition memberships θijk of a given FFA M. Then, inequalities (14) and (15) are satisfied for all transition weights θijk if the inequalities

\[
-\frac{H}{2} + H r\, [\varphi_f^-]_{\theta_{max}} < [\varphi_f^0]_{\theta_{min}} \tag{16}
\]
\[
-\frac{H}{2} + H [\varphi_h^+]_{\theta_{min}} - H [\varphi_f^-]_{\theta_{max}} > [\varphi_h^0]_{\theta_{max}} \tag{17}
\]

are satisfied. We can rewrite inequalities (16) and (17) as

\[
-\frac{H}{2} + H r\, \theta_{max} [\varphi_f^-]_1 < \theta_{min} [\varphi_f^0]_1 \tag{18}
\]

and

\[
-\frac{H}{2} + H \theta_{min} [\varphi_h^+]_1 - H \theta_{max} [\varphi_f^-]_1 > \theta_{max} [\varphi_h^0]_1 \tag{19}
\]

Solving inequalities (18) and (19) for [φ−f]1 and [φ+h]1, respectively, we obtain conditions under which a constructed recurrent network implements a given FFA. These conditions are expressed in the following theorem:

Theorem 3. For some given unambiguous FFA M with n states and m input symbols, let r denote the maximum number of transitions to any state over all input symbols of M. Furthermore, let θmin and θmax denote the minimum and maximum, respectively, of all transition weights θijk in M. Then, a sparse recurrent neural network with n state and m input neurons can be constructed from M such that the internal state representation remains stable if

\[
(1)\quad [\varphi_f^-]_1 < \frac{1}{r\,\theta_{max}}\left(\frac{\theta_{min}[\varphi_f^0]_1}{H} + \frac{1}{2}\right)
\]
\[
(2)\quad [\varphi_h^+]_1 > \frac{1}{\theta_{min}}\left(\frac{\theta_{max}[\varphi_h^0]_1}{H} + \frac{1}{2} + \theta_{max}[\varphi_f^-]_1\right)
\]
\[
(3)\quad H > \max(H_0^-(r), H_0^+(r)) .
\]

Furthermore, the constructed network has at most 3mn second-order weights with alphabet Σw = {−H, 0, +H}, n + 1 biases with alphabet Σb = {−H/2}, and maximum fan-out 3m.

For θmin = θmax = 1, conditions (1)-(3) of the above theorem reduce to those found for stable DFA encodings [44]. This is consistent with a crisp representation of DFA states.

7 Simulations

In order to validate our theory, we constructed a fuzzy encoding of a randomly generated FFA with 100 states (after the execution of the FFA transformation algorithm) over the input alphabet {0, 1}. We randomly assigned weights in the range [0, 1] to all transitions in increments of 0.1. The maximum indegree was Din(M) = r = 5. We then tested the stability of the fuzzy internal state representation on 100 randomly generated strings of length 100 by comparing, at each time step, the output signal of each recurrent state neuron with its ideal output signal (since each recurrent state neuron Si corresponds to a FFA state qi, we know the degree to which qi is occupied after input symbol ak has been read: either 0 or θijk). A histogram of the differences between the ideal and the observed signals of state neurons for selected values of the weight strength H, over all state neurons and all tested strings, is shown in Fig. 5. As expected, the error decreases for increasing values of H. We observe that the number of discrepancies between the desired and the actual neuron output decreases 'smoothly' for the shown values of H (almost no change can be observed for values up to H = 6).
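The critical weight strength for r = 5 can also be seen directly by iterating the function f from Section 6.2, as in the following sketch (not part of the original experiments; θ = 1 suffices by Corollary 3):

```python
import math

def f(x, H, r, theta=1.0):
    # f(x, H, theta, r) from Section 6.2.
    return theta / (1.0 + math.exp(H * (theta - 2.0 * r * x) / (2.0 * theta)))

def low_signal_limit(H, r, steps=10000):
    """Iterate Equation (7) from f^0 = 0; the limit is the level at which an
    undriven ('low') state neuron settles."""
    x = 0.0
    for _ in range(steps):
        x = f(x, H, r)
    return x

for H in (6.0, 9.0, 9.7, 9.75, 10.0):
    print(H, round(low_signal_limit(H, 5.0), 4))
```

Below a critical value of H near 9.7, the iteration escapes to the single high fixed point (low signals cannot remain low), whereas just above it the low signals settle near 0; this matches the abrupt change in the histograms discussed next.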
The most significant change can be observed by comparing the histograms for H = 9.7 and H = 9.75: The existence of significant neuron output errors for H = 9.7 suggests that the internal FFA representation is highly unstable. For H ≥ 9.75, the internal FFA state representation becomes stable. This discontinuous change can be explained by observing that there exists a critical value H0(r) such that the number of stable fixed points also changes discontinuously from one to two for H < H0(r) and H > H0(r), respectively (see Fig. 5). The 'smooth' transition from large output errors to very small errors for most recurrent state neurons (Fig. 5a-e) can be explained by observing that not all recurrent state neurons receive the same number of residual inputs; some neurons may not receive any residual input for some given input symbol ak at time step t; in that case, the low signals of those neurons are strengthened to g̃(0, H, θi.k) ≃ 0.

Fig. 5. Stability of FFA State Encoding: The histograms show the absolute neuron output error of a network with 100 neurons that implements a randomly generated FFA and reads 100 randomly generated strings of length 100, for different values of the weight strength H. The error was computed by comparing, at each time step, the actual with the desired output of each state neuron. The distributions of neuron output signal errors are for weight strengths (a) H = 6.0, (b) H = 9.0, (c) H = 9.60, (d) H = 9.65, (e) H = 9.70, and (f) H = 9.75.

8 Conclusions

Theoretical work that proves representational relationships between different computational paradigms is important because it establishes the equivalences of those models. Previously it has been shown that it is possible to deterministically encode fuzzy finite-state automata (FFA) in recurrent neural networks by transforming any given FFA into a deterministic acceptor which assigns string membership [42]. In such a deterministic encoding, only the network's classification of strings is fuzzy, whereas the representation of states is crisp. The correspondence between FFA and network parameters - i.e. fuzzy transition memberships and network weights, respectively - is lost in the transformation. Here, we have demonstrated analytically and empirically that it is possible to encode FFAs in recurrent networks without transforming them into deterministic acceptors. The constructed network directly represents FFA states with the desired fuzziness. That representation requires (1) a slightly increased functionality of sigmoidal discriminant functions (it only requires the discriminants to accommodate variable output range), and (2) a transformation of a given FFA into an equivalent FFA with a larger number of states.
In the proposed mapping FFA → recurrent network, the correspondence between FFA and network parameters remains intact; this can be significant if the physical properties of some unknown dynamic, nonlinear system are to be derived from a trained network modeling that system. Furthermore, the analysis tools and methods used to demonstrate the stability of the crisp internal representation of DFA’s carried over and generalized to show stability of the internal FFA representation. We speculate that other encoding methods are possible and that it reamins an open question as to which encoding methods are better. Acknowledgments We would like to acknowledge useful discussions with K. Bollacker, D. Handscomb, and B.G. Horne. References 1. L. Zadeh, “Fuzzy sets,” Information and Control, vol. 8, pp. 338–353, 1965. 2. P.P. Bonissone, V. Badami, K.H. Chiang, P.S. Khedkar, K.W. Marcelle, and M.J. Schutten, “Industrial applications of fuzzy logic at General Electric,” Proceedings of the IEEE, vol. 83, no. 3, pp. 450–465, 1995. 3. S. Chiu, S. Chand, D. Moore, and A. Chaudhary, “Fuzzy logic for control of roll and moment for a flexible wing aircraft,” IEEE Control Systems Magazine, vol. 11, no. 4, pp. 42–48, 1991. 4. J. Corbin, “A fuzzy logic-based financial transaction system,” Embedded Systems Programming, vol. 7, no. 12, pp. 24, 1994. 5. L.G. Franquelo and J. Chavez, “Fasy: A fuzzy-logic based tool for analog synthesis,” IEEE Transactions on Computer-Aided Design of Integrated Circuits, vol. 15, no. 7, pp. 705, 1996. 6. T. L. Hardy, “Multi-objective decision-making under uncertainty fuzzy logic methods,” Tech. Rep. TM 106796, NASA, Washington, D.C., 1994. 7. W. J. M. Kickert and H.R. van Nauta Lemke, “Application of a fuzzy controller in a warm water plant,” Automatica, vol. 12, no. 4, pp. 301–308, 1976. 8. C.C. Lee, “Fuzzy logic in control systems: fuzzy logic controller,” IEEE Transactions on Man, Systems, and Cybernetics, vol. SMC-20, no. 2, pp. 404–435, 1990. 142 C.W. Omlin, L. Giles, and K.K. Thornber 9. C.P. Pappis and E.H. Mamdani, “A fuzzy logic controller for a traffic junction,” IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-7, no. 10, pp. 707–717, 1977. 10. X.M. Yang and G.J. Kalambur, “Design for machining using expert system and fuzzy logic approach,” Journal of Materials Engineering and Performance, vol. 4, no. 5, pp. 599, 1995. 11. L.-X. Wang, Adaptive Fuzzy Systems and Control: Design and Stability Analysis, Prentice-Hall, Englewood Cliffs, NJ, 1994. 12. H.T. Siegelmann and E.D. Sontag, “On the computational power of neural nets,” Journal of Computer and System Sciences, vol. 50, no. 1, pp. 132–150, 1995. 13. V. Gorrini and H. Bersini, “Recurrent fuzzy systems,” in Proceedings of the Third IEEE Conference on Fuzzy Systems, 1994, vol. I, pp. 193–198. 14. L.-X. Wang, “Fuzzy systems are universal approximators,” in Proceedings of the First International Conference on Fuzzy Systems, 1992, pp. 1163–1170. 15. P.V. Goode and M. Chow, “A hybrid fuzzy/neural systems used to extract heuristic knowledge from a fault detection problem,” in Proceedings of the Third IEEE Conference on Fuzzy Systems, 1994, vol. III, pp. 1731–1736. 16. C. Perneel, J.-M. Renders, J.-M. Themlin, and M. Acheroy, “Fuzzy reasoning and neural networks for decision making problems in uncertain environments,” in Proceedings of the Third IEEE Conference on Fuzzy Systems, 1994, vol. II, pp. 1111–1125. 17. H.R. Berenji and P. 
Khedkar, “Learning and fine tuning fuzzy logic controllers through reinforcement,” IEEE Transactions on Neural Networks, vol. 3, no. 5, pp. 724–740, 1992. 18. K.K. Thornber, “The fidelity of fuzzy-logic inference,” IEEE Transactions on Fuzzy Systems, vol. 1, no. 4, pp. 288–297, 1993. 19. K.K. Thornber, “A key to fuzzy-logic inference,” International Journal of Approximate Reasoning, vol. 8, pp. 105–121, 1993. 20. E.S. Santos, “Maximin automata,” Information and Control, vol. 13, pp. 363–377, 1968. 21. L. Zadeh, “Fuzzy languages and their relation to human and machine intelligence,” Tech. Rep. ERL-M302, Electronics Research Laboratory, University of California, Berkeley, 1971. 22. B. Gaines and L. Kohout, “The logic of automata,” International Journal of General Systems, vol. 2, pp. 191–208, 1976. 23. A. Pathak and S.K. Pal, “Fuzzy grammars in syntactic recognition of skeletal maturity from x-rays,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 16, no. 5, pp. 657–667, 1986. 24. S.I. Mensch and H.M. Lipp, “Fuzzy specification of finite state machines,” in Proceedings of the European Design Automation Conference, 1990, pp. 622–626. 25. H. Senay, “Fuzzy command grammars for intelligent interface design,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 22, no. 5, pp. 1124–1131, 1992. 26. J. Grantner and M.J. Patyra, “Synthesis and analysis of fuzzy logic finite state machine models,” in Proc. of the Third IEEE Conf. on Fuzzy Systems, 1994, vol. I, pp. 205–210. 27. J. Grantner and M.J. Patyra, “VLSI implementations of fuzzy logic finite state machines,” in Proceedings of the Fifth IFSA Congress, 1993, pp. 781–784. 28. S.C. Lee and E.T. Lee, “Fuzzy neural networks,” Mathematical Biosciences, vol. 23, pp. 151–177, 1975. Fuzzy Knowledge and Recurrent Neural Networks 143 29. F.A. Unal and E. Khan, “A fuzzy finite state machine implementation based on a neural fuzzy system,” in Proceedings of the Third International Conference on Fuzzy Systems, 1994, vol. 3, pp. 1749–1754. 30. F.E. Cellier and Y.D. Pan, “Fuzzy adaptive recurrent counterpropagation neural networks: A tool for efficient implementation of qualitative models of dynamic processes,” J. Systems Engineering, vol. 5, no. 4, pp. 207–222, 1995. 31. E.B. Kosmatopoulos and M.A. Christodoulou, “Neural networks for identification of fuzzy dynamical systems: an application to identification of vehicle highway systems,” Tech. Rep., Dept. of Electronic and Computer Engineering, Technical U. of Crete, 1995. 32. E.B. Kosmatopoulos and M.A. Christodoulou, “Structural properties of gradient recurrent high-order neural networks,” IEEE Transactions on Circuits and Systems, 1995. 33. E.B. Kosmatopoulos, M.M. Polycarpou, M.A. Christodoulou, and P.A. Ioannou, “High-order neural networks for identification of dynamical systems,” IEEE Transactions on Neural Networks, vol. 6, no. 2, pp. 422–431, 1995. 34. M.P. Casey, “The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction,” Neural Computation, vol. 8, no. 6, pp. 1135–1178, 1996. 35. A. Cleeremans, D. Servan-Schreiber, and J. McClelland, “Finite state automata and simple recurrent neural networks,” Neural Computation, vol. 1, no. 3, pp. 372–381, 1989. 36. J.L. Elman, “Finding structure in time,” Cognitive Science, vol. 14, pp. 179–211, 1990. 37. P. Frasconi, M. Gori, M. Maggini, and G. Soda, “Representation of finite state automata in recurrent radial basis function networks,” Machine Learning, vol. 23, pp. 5–32, 1996. 38. 
C.L. Giles, C.B. Miller, D. Chen, H.H. Chen, G.Z. Sun, and Y.C. Lee, “Learning and extracting finite state automata with second-order recurrent neural networks,” Neural Computation, vol. 4, no. 3, pp. 393–405, 1992. 39. J.B. Pollack, “The induction of dynamical recognizers,” Machine Learning, vol. 7, no. 2/3, pp. 227–252, 1991. 40. R.L. Watrous and G.M. Kuhn, “Induction of finite-state languages using secondorder recurrent networks,” Neural Computation, vol. 4, no. 3, pp. 406, 1992. 41. Z. Zeng, R.M. Goodman, and P. Smyth, “Learning finite state machines with selfclustering recurrent networks,” Neural Computation, vol. 5, no. 6, pp. 976–990, 1993. 42. C.W. Omlin, K.K. Thornber, and C.L. Giles, “Fuzzy finite-state automata can be deterministically encoded into recurrent neural networks,” IEEE Transactions on Fuzzy Systems, vol. 6, no. 1, pp. 76–89, 1998. 43. D. Dubois and H. Prade, Fuzzy sets and systems: theory and applications, vol. 144 of Mathematics in Science and Engineering, pp. 220–226, Academic Press, 1980. 44. C.W. Omlin and C.L. Giles, “Constructing deterministic finite-state automata in recurrent neural networks,” Journal of the ACM, vol. 43, no. 6, pp. 937–972, 1996. 45. C.W. Omlin and C.L. Giles, “Rule revision with recurrent neural networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 1, pp. 183–188, 1996. 46. C. Lee Giles, C.W. Omlin, and K.K. Thornber, “Equivalence in knowledge representation: Automata, recurrent neural networks, and dynamical fuzzy systems,” Proceedings of the IEEE, vol. 87, no. 9, pp. 1623–1640, 1999. Combining Maps and Distributed Representations for Shift-Reduce Parsing Marshall R. Mayberry, III and Risto Miikkulainen Department of Computer Sciences The University of Texas at Austin Austin, TX, 78712 martym,risto@cs.utexas.edu Abstract. Simple Recurrent Networks (Srns) have been widely used in natural language processing tasks. However, their ability to handle long-term dependencies between sentence constituents is rather limited. Narx networks have recently been shown to outperform Srns by preserving past information in explicit delays from the network’s prior output. Determining the number of delays, however, is problematic in itself. In this study on a shift-reduce parsing task, we demonstrate a hybrid localist-distributed approach that yields comparable performance in a more concise manner. A SardNet self-organizing map is used to represent the details of the input sequence in addition to the recurrent distributed representations of the Srn and Narx networks. The resulting architectures can represent arbitrarily long sequences and are cognitively more plausible. 1 Introduction The subsymbolic approach (i.e. neural networks with distributed representations) to processing language is attractive for several reasons. First, it is inherently robust: the distributed representations display graceful degradation of performance in the presence of noise, damage, and incomplete or conflicting input [18,31]. Second, because computation in these networks is constraint-based, the subsymbolic approach naturally combines syntactic, semantic, and thematic constraints on the interpretation of linguistic data [17]. Third, subsymbolic systems can be lesioned in various ways and the resulting behavior is often strikingly similar to human impairments [18,19,23]. 
These properties of subsymbolic systems have attracted many researchers in the hope of accounting for interesting cognitive phenomena, such as role-binding and lexical errors resulting from memory interference and overloading, aphasic and dyslexic impairments resulting from physical damage, and biases, defaults and expectations emerging from training history [20,19,18,24]. Since its introduction in 1990, the simple recurrent network (Srn) [7] has become a mainstay in connectionist natural language processing tasks such as lexical disambiguation, prepositional phrase attachment, active-passive transformation, anaphora resolution, and translation [1,4,21,34]. However, this promising S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 144–157, 2000. c Springer-Verlag Berlin Heidelberg 2000 Combining Maps and Distributed Representations for Shift-Reduce Parsing 145 line of research has been hampered by the Srn’s inability to handle long-term dependencies, which abound in natural language tasks, because the Srn’s hidden layer gradually loses track of earlier words in the input sentence through memory degradation. Another class of recurrent neural networks called Nonlinear AutoRegressive models with eXogenous (Narx) inputs [5,22,15] has been proposed as an alternative to Srns that can better deal with long-term dependencies by latching information earlier in a sequence. They have proven to be good at dealing with the type of long-term dependencies that often arise in nonlinear systems such as system identification [5], time series prediction [6], and grammatical inference [10], but have not yet been demonstrated on Nlp tasks, which typically involve complex representations. This paper describes a hybrid method of extending distributed recurrent networks such as the Srn and Narx with a localist representation of an input sentence. The approach is based on SardNet [11], a self-organizing map algorithm designed to represent sequences. SardNet permits the sequence information to remain explicit, yet generalizable in the sense that similar sequences result in similar patterns on the map. When SardNet is coupled with Srn or Narx, the resulting networks perform better in the shift-reduce parsing task taken up in this study. Even with no recurrency and no explicit delays, such as in a feedforward network to which SardNet has been added, the performance is almost as good. Moreover, we believe that SardNet is a biologically and cognitively plausible way of representing sequences because it is a topological map and does not impose hard memory limits. These results show that SardNet can be used as an effective, concise, and elegant sequence memory in distributed natural language processing architectures. 2 The Task: Shift-Reduce Parsing The task taken up in this study, shift-reduce (SR) parsing, is one of the simplest approaches to sentence processing that nevertheless has the potential to handle a substantial subset of English [33]. Its basic formulation is based on the pushdown automata for parsing context-free grammars, but it can be extended to contextsensitive grammars as well. The parser consists of two data structures: the input buffer stores the sequence of words remaining to be read, and the partial parse results are kept on the stack (Fig. 1). Initially the stack is empty and the entire sentence is in the input buffer. 
At each step, the parser has to decide whether to shift a word from the buffer to the stack, or to reduce one or more of the top elements of the stack into a new element representing their combination. For example, if the top two elements are currently NP and VP, the parser reduces them into S, corresponding to the grammar rule S → NP VP (step 17 in Fig. 1). The process stops when the elements in the stack have been reduced to S, and no more words remain in the input. The reduce actions performed by the parser in this process constitute the parse result, such as the syntactic parse tree (line 18 in Fig. 1).

(Fig. 1 traces the 18-step parse of the sentence "the boy who liked the girl chased the cat", giving the stack contents, the remaining input, and the shift/reduce action at each step.)

Fig. 1. Shift-Reduce Parsing a Sentence. Each step in the parse is represented by a line from top to bottom. The current stack is at left, the input buffer in the middle, and the parsing decision in the current situation at right. At each step, the parser either shifts the input word onto the stack, or reduces the top two elements of the stack into a higher-level representation, such as the boy → [the,boy] (step 3). (Phrase labels such as "NP" and "RC" are only used in this figure to make the process clear; they are not explicitly made a part of the actual distributed compressed representation of the partial parse results.)

The sequential scanning process and incremental forming of partial representations is a plausible cognitive model for language understanding. SR parsing is also very efficient, and lends itself to many extensions. For example, the parse rules can be made more context sensitive by taking more of the stack and the input buffer into account. Also, the partial parse results may consist of syntactic or semantic structures.

The general SR model can be implemented in many ways. A set of symbolic shift-reduce rules can be written by hand or learned from input examples [9,29,35]. It is also possible to train a neural network to make parsing decisions based on the current stack and the input buffer. If trained properly, the neural network can generalize well to new sentences [30]. Whatever correlations there exist between the word representations and the appropriate shift/reduce decisions, the network will learn to utilize them. Another important extension is to implement the stack as a neural network. This way the parser can have access to the entire stack at once, and interesting cognitive phenomena in processing complex sentences can be modeled. The Spec system [19] was a first step in this direction.
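For concreteness, the purely symbolic core of this procedure can be sketched as follows. The sketch is only an illustration: the shift/reduce decisions are supplied by hand here (in the systems discussed below they are what the network must learn to predict), and phrase labels, agreement checks, and the stop action are omitted.

```python
def shift_reduce(words, actions):
    """Replay a parse in the style of Fig. 1: shift moves the next word onto the
    stack; reduce combines the top two stack elements into a nested pair."""
    stack, buffer = [], list(words)
    for action in actions:
        if action == "shift":
            stack.append(buffer.pop(0))
        else:
            right, left = stack.pop(), stack.pop()
            stack.append([left, right])
    return stack[-1]

words = "the boy who liked the girl chased the cat".split()
actions = (["shift"] * 2 + ["reduce"] +            # [the,boy]
           ["shift"] * 4 + ["reduce"] * 4 +        # [[the,boy],[who,[liked,[the,girl]]]]
           ["shift"] * 3 + ["reduce"] * 3)         # full parse, as in step 17 of Fig. 1
print(shift_reduce(words, actions))
```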
The stack was represented as a compressed distributed representation, formed by a Raam (Recursive AutoAssociative Memory) auto-encoding network [25]. The resulting system was able to parse complex relative clause structures. When the stack representation was artificially lesioned by adding noise, the parser exhibited very plausible cognitive performance. Shallow center embeddings were easier to process, as were sentences with strong semantic constraints in the role bindings. When the parser made Combining Maps and Distributed Representations for Shift-Reduce Parsing 147 errors, it usually switched the roles of two words in the sentence, which is what people also do in similar situations. A symbolic representation of the stack would make modeling such behavior very difficult. The Spec architecture, however, was not a complete implementation of SR parsing; it was designed specifically for embedded relative clauses. For general parsing, the stack needs to be encoded with neural networks to make it possible to parse more varied linguistic structures. We believe that the generalization and robustness of subsymbolic neural networks will result in powerful, cognitively valid performance. However, the main problem of limited memory accuracy of the Srn parsing network must first be solved. 3 Parser Architecture A subsymbolic parser is based on a recurrent network such as Srn or Narx. The network reads a sequence of input word representations into output patterns representing the parse results, such as syntactic or case-role assignments for the words. At each time step, a copy of the hidden layer (Srn) or prior outputs (Narx) is saved and used as input during the next step, together with the next word. In this way each new word is interpreted in the context of the entire sequence so far, and the parse result is gradually formed at the output. Recurrent neural networks can be used to implement a shift-reduce parser in the following way: the network is trained to step through the parse (such as that in Fig. 1), generating a compressed distributed representation of the top element of the stack at each step (formed by a Raam network: section 4.1). The network reads the sequence of words one word at a time, and each time either shifts the word onto the stack (by passing it through the network, e.g. step 1), or performs one or more reduce operations (by generating a sequence of compressed representations corresponding to the top element of the stack: e.g. steps 8-11). After the whole sequence is input, the final stack representation is decoded into a parse result such as a parse tree. Such an architecture is powerful for two reasons: (1) During the parse, the network does not have to guess what is coming up later in the sentence, as it would if it always had to shoot for the final parse result; its only task is to build a representation of the current stack in its hidden layer and the top element in its output. (2) Instead of having to generate a large number of different stack states at the output, it only needs to output representations for a relatively small number of common substructures. Both of these features make learning and generalization easier. 3.1 Srn In the simple recurrent network, the hidden layer is saved and fed back into the network at each step during sentence processing. The network is normally trained using the standard backpropagation algorithm [27]. A well-known problem with the Srn model is its low memory accuracy. 
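Before turning to that problem, a single step of the Srn just described might be sketched as follows (a minimal illustration; layer sizes, the logistic nonlinearity, and the weight names are assumptions, and training by backpropagation is not shown):

```python
import numpy as np

def srn_step(word, context, W_in, W_ctx, W_out, b_h, b_o):
    """One parsing step of a simple recurrent network: the new hidden layer is
    computed from the current word and the previous hidden layer (the context),
    and is itself returned to serve as the next context."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    hidden = sigmoid(W_in @ word + W_ctx @ context + b_h)
    output = sigmoid(W_out @ hidden + b_o)   # e.g. the compressed parse representation
    return output, hidden
```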
It is difficult for the Srn to remember items that occurred several steps earlier in the input sequence, especially if the network is not required to produce them in the output layer during the intervening steps [32,19]. The intervening items are superimposed in the hidden layer, obscuring the traces of earlier items. As a result, parsing with an Srn has been limited to relatively simple sentences with shallow structure.

3.2 Narx

Nonlinear AutoRegressive models with eXogenous inputs have been proposed as an alternative to Srns. In Narx networks, previous sequence items are explicitly represented in a predetermined number of input and output delays to help reduce the forgetting behavior of recurrent networks due to the loss of gradient information from much earlier input in a sequence [2,15]. The network is typically trained via BackPropagation Through Time [27], which allows earlier information to be captured and to influence later outputs by propagating gradient information through "jump-ahead" connections that develop when the network is unfolded in time. The performance of Narx networks is strongly dependent on the number of input/output delays, and determining that number is itself a nontrivial question [16].

3.3 SardNet

The solution described in this paper is to use an explicit representation of the input sequence as additional input to the hidden layer. This representation provides more accurate information about the sequence, such as the relative ordering of the incoming words, and can be combined with the distributed hidden layer to generate accurate output that nevertheless retains all the advantages of distributed representations. The sequence representation must be explicit enough to allow such cleanup, but it must also be compact and generalize well to new sequences. The SardNet (Sequential Activation Retention and Decay Network) [11] self-organizing map for sequences has exactly these properties.

SardNet is based on the Self-Organizing Map neural network [12,13], and is organized to represent the space of all possible word representations. As in a conventional self-organizing map network, each input word is mapped onto a particular map node called the maximally-responding unit, or winner. The weights of the winning unit and all the nodes in its neighborhood are updated according to the standard adaptation rule to better approximate the current input. The size of the neighborhood is set at the beginning of the training and reduced as the map becomes more organized. In SardNet, the sentence is represented as a pattern of activation on the map (Fig. 2). For each word, the maximally responding unit is activated to a maximum value of 1.0, and the activations of units representing previous words are decayed according to a specified decay rate (e.g. 0.9). Once a unit is activated, it is removed from competition and cannot represent later words in the sequence. Each unit may then represent different words depending on the context, which allows for an efficient representation of sequences, and also generalizes well to new sequences.

In this parsing task, a localist SardNet representation of the input sentence is formed at the same time as the distributed network hidden layer representation. The map is used along with the next input word and with the previous hidden layer (in the case of the Srn) or previous outputs (in the form of some prespecified number of delays in Narx) (Fig. 2) as extra context that is input into the hidden layer. This architecture allows these networks to perform the shift-reduce parsing task with significantly less memory degradation. The sequence information remains accessible in SardNet, so that the network is able to focus on capturing correlations related to the structure of sentence constituents during parsing rather than having to maintain precise constituent information in the hidden layer.

Fig. 2. The Parser Network. This snapshot shows the network during step 11 of parsing the sentence from Fig. 1. The representation for the current input word, chased, is shown at top left. Each word is input to the SardNet map, which builds a representation for the sequence word by word. In the Srn implementation of the parser, the previous activation of the hidden layer is copied (as indicated by the dotted line labelled Srn) to the Context assembly at each step. In the Narx implementation, a predetermined number of previous output representations constitutes the Context (indicated by the dotted line labelled Narx). The Context, together with the current input word and the current SardNet pattern, is propagated to the hidden layer of the network. As output, the network generates the compressed Raam representation of the top element in the shift-reduce stack at this state of the parse (in this case, line 12 in Fig. 1). SardNet is a map of word representations, and is trained through the Self-Organizing Map (SOM) algorithm [13,12]. All other connections are trained through backpropagation (for Srn) or Bptt (for Narx) [27].

S → NP(n) VP(n,m)
VP(n,m) → Verb(n,m) NP(m)
NP(n) → the Noun(n)
NP(n) → the Noun(n) RC(n)
RC(n) → who VP(n,m)
RC(n) → whom NP(m) Verb(m,n)

Noun(0) → boy    Noun(1) → girl    Noun(2) → dog    Noun(3) → cat

Verb(0,0) → liked, saw    Verb(0,1) → liked, saw    Verb(0,2) → liked    Verb(0,3) → chased
Verb(1,0) → liked, saw    Verb(1,1) → liked, saw    Verb(1,2) → liked    Verb(1,3) → chased
Verb(2,0) → bit           Verb(2,1) → bit           Verb(2,2) → bit      Verb(2,3) → bit, chased
Verb(3,0) → saw           Verb(3,1) → saw           Verb(3,3) → chased

Fig. 3. Grammar. This phrase structure grammar generates sentences with subject- and object-extracted relative clauses. The rule schemata with noun and verb restrictions ensure agreement between subject and object depending on the verb in the clause. Lexicon items are given in bold face.

4 Experiments

4.1 Input Data, Training, and System Parameters

The data used to train and test the parser networks was generated from the phrase structure grammar in Fig. 3, adapted from a grammar that has become common in the literature [8,19]. Since our focus was on shift-reduce parsing, and not on processing relative clauses per se, sentence structure was limited to one relative clause per sentence. From this grammar, training targets corresponding to each step in the parsing process were obtained. For shifts, the target is simply the current input. In these cases, the network is trained to auto-associate, which these networks are good at. For reductions, however, the targets consist of representations of the partial parse trees that result from applying a grammatical rule. For example, the reduction of the sentence fragment who liked the girl would produce the partial parse result [who,[liked,[the,girl]]].
Two issues arise: how should the parse trees be represented, and how should reductions be processed during sentence parsing? The approach taken in this paper is the same as in Spec (Sec. 2), as well as in other connectionist parsing systems [19,3,28]. Compressed representations of all the syntactic parse trees are built up through auto-association of the constituents using the Raam neural network. This training is performed beforehand, separately from the parsing task. Once formed, the compressed representations can be decoded into their constituents using just the decoder portion of the Raam architecture.

In shift-reduce parsing, the input buffer after each "Reduce" action is unchanged; rather, the reduction occurs on the stack. Therefore, if we want to perform the reductions one step at a time, the current word must be maintained in the input buffer until the next "Shift" action. Thus, the input to the network consists of the sequence of words that make up the sentence with the input word repeated for each reduce action, and the target consists of the representation of the top element of the stack (as shown in Fig. 1).

the 10000000      who 01010000      whom 01100000     . 11111111
boy 00101000      dog 00100010      girl 00100100     cat 00100001
chased 00011000   saw 00010010      liked 00010100    bit 00010001

Fig. 4. Lexicon. Each word representation is put together from a part-of-speech identifier (first four components) and a unique ID tag (last four). This encoding is then repeated eight times to form a 64-unit word representation. Such redundancy makes it easier to identify the word.

network  delays  hidden  weights      network    delays  hidden  map size  weights
Ffn      –       500     64000        SardFfn    –       201     144       63888
Srn      –       197     64025        SardSrn    –       134     144       64016
Narx     0       500     64000        SardNarx   0       252     100       63856
Narx     3       200     64000        SardNarx   3       137     100       63940
Narx     6       125     64000        SardNarx   6       94      100       63928

Fig. 5. Network Parameters. In order to keep the network size as consistent as possible, the number of units in the hidden layer was varied according to the size of the inputs. Because the Sard networks included a 100-unit map (144 units in the SardFfn and SardSrn) that was connected to both the input and hidden layers, the size of the hidden layer was proportionally made smaller.

Word representations were hand-coded to provide basic part-of-speech information together with a unique ID tag that identified the word within the syntactic category (Fig. 4). The basic encoding of eight units was repeated eight times to fill out a 64-unit representation. The 64-unit representation length was needed to encode all of the partial parse results formed by Raam. The redundancy in the lexical items facilitates learning in general.

4.2 System Parameters and Training

SardNet maps were added to an Srn and to a Narx network with zero, three, and six delays (common in other comparisons of Narx and Srn [10,15,14]) to yield SardSrn and SardNarx parsing architectures, respectively. Additionally, a SardNet map was added to a feedforward network (Ffn). This SardFfn network provides a baseline for evaluating the map itself in the parsing task. The performances of all the architectures were compared in the shift-reduce parsing task. The size of the hidden layer for each network was determined so that the total number of weights was as close to 64,000 as the topology would permit (Fig. 5).
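The 64-unit word representations of Fig. 4 are simple to construct; the following is a direct transcription of that figure:

```python
# Eight-unit part-of-speech + ID codes from Fig. 4.
CODES = {
    "the": "10000000", "who": "01010000", "whom": "01100000", ".": "11111111",
    "boy": "00101000", "dog": "00100010", "girl": "00100100", "cat": "00100001",
    "chased": "00011000", "saw": "00010010", "liked": "00010100", "bit": "00010001",
}

def word_rep(word):
    """64-unit representation: the eight-unit code repeated eight times."""
    return [float(c) for c in CODES[word] * 8]

assert len(word_rep("boy")) == 64
```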
Four data sets of 20%, 40%, 60%, and 80% of the 436 sentences generated by the grammar were randomly selected and each parser was trained on each dataset four times. Training on all 160 runs was stopped when the error on a 22sentence (5%) validation set began to level off. The same validation set was used for all the simulations and was randomly drawn from a pool of sentences that did not appear in any of the training sets. Testing was then performed on the remaining sentences that were neither in the training set nor in the validation 152 M.R. Mayberry and R. Miikkulainen set. All networks were trained with a learning rate of 0.1, and the maps had a decay rate of 0.9. A map of 100 units was pretrained with a learning rate of 0.6, and then used for all of the SardNarx networks. A slightly larger map with 144 units was used for the SardFfn and SardSrn networks since these networks had otherwise much fewer weights and because a larger hidden layer did not help performance, whereas a larger map did. (This would also be true of the SardNarx networks, but making the map larger constrained the size of the hidden layer to the point where it hurt performance.) Training took about one day on a 400 MHz dual-processor Pentium II workstation for each network. 4.3 Results The average mismatches performance measure reports the average number of leaf representations per sentence that are not correctly identified from the lexicon by nearest match in Euclidean distance. As an example, if the target reduction is [who,[liked,[the,girl]]]], (step 11 of Fig. 1), but the output corresponds to [who,[saw,[the,girl]]]], then a mismatch would occur at the leaf labelled saw once the Raam representation was decoded. Average mismatches provide a measure of the correctness of the information in the Raam representation. It is a true measure of the utility of the network and was, therefore, used in our experiments. Most of the sentences in the training and test datasets were seventeen words long because the same number of steps were needed to reduce both subjectand object-extracted relative clauses. The longest long-term dependency the networks had to overcome was at step three in the parsing process where the first reduction occurred, which was part of the final compressed Raam parse representation for the complete sentence. It was in decoding this final parse representation that even the best networks made errors. The results are summarized in Fig. 6. The main result is that virtually all Sard networks performed better than the purely distributed networks. The only exception was on the 20% dataset, where Narx-6 beat SardFfn and SardNarx-0 and was comparable to the SardSrn and SardNarx-3. These results demonstrate that adding SardNet to a distributed network results in a significant performance gain. Compared to their performance on the larger datasets, the Sard networks were weaker on the 20% dataset. On closer inspection it turned out that the map was not smooth enough to allow as good generalization as in the larger datasets, where there was sufficient data to overcome the map irregularities. It is also interesting to note that adding even a single delay to these networks completely eliminated this problem, bringing the performance in line with the others. There are a number of ways that this problem can potentially be eliminated, including richer lexical representations and tuning of the SardNet parameters. We are studying such methods to improve generalization. 
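For reference, the average-mismatches measure used above can be sketched as follows (the decoding of the Raam parse representation into leaf vectors is assumed to have been done already, and `lexicon` maps each word to its 64-unit representation):

```python
import numpy as np

def nearest_word(vec, lexicon):
    # Identify a decoded leaf by its nearest lexicon entry in Euclidean distance.
    return min(lexicon, key=lambda w: np.linalg.norm(np.asarray(lexicon[w]) - vec))

def average_mismatches(decoded, targets, lexicon):
    """Average number of leaves per sentence whose nearest lexicon match differs
    from the target word; `decoded` and `targets` are per-sentence lists."""
    errors = [sum(nearest_word(np.asarray(leaf), lexicon) != word
                  for leaf, word in zip(leaves, words))
              for leaves, words in zip(decoded, targets)]
    return sum(errors) / len(errors)
```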
The trade-off between the sizes of the map and hidden layer should also be borne in mind. There is a certain point in a distributed network that decreasing the hidden layer size results in measurable performance degradation, whereas increasing it beyond another Combining Maps and Distributed Representations for Shift-Reduce Parsing 153 Fig. 6. Summary of Parsing Performance. Averages over four simulations each for the six types of network tested using the stricter average mismatches per sentence measure on the test data. (The Ffn and Narx-0 results are all above 0.7 and are not shown in this Fig. to allow the results among the remaining networks to be more easily discernible.) With the exception of the Narx-6 network on the 20% dataset, the networks to which SardNet was added clearly outperformed the purely distributed networks. Even the SardNet networks with no recursion (SardFfn and SardNarx0) performed significantly better than the Srn, Narx-3, and Narx-6 networks on the rest of the test datasets. The better performance of SardSrn, SardNarx-3, and SardNarx-6, however, demonstrate the benefits of recursion. point yields diminishing returns. We have found that the same holds for the map. Quantifying these observations will be addressed in future work. 4.4 Example Parse The different roles of the distributed representations in the hidden layer and the localist representation on the SardNet map can be seen clearly by contrasting the performances of SardSrn and the Srn on a typical sentence, such as the one in Fig. 1. Neither SardSrn nor Srn had any trouble with the shift targets. Not surprisingly, early in training the networks would master all the shift targets in the sentence before they would get any of the reductions correct. The first reduction ([the,boy] in our example) also poses no problem for either network. Nor, in general, does the second, [the,girl], because the constituent information is still fresh in memory. However, the ability of the Srn to generate the later reductions accurately degrades rapidly because the information about earlier constituents is smothered by the later steps of the parse. Interestingly, the structural information survives much longer. For example, instead of [who,[liked,[the,girl]]]], 154 M.R. Mayberry and R. Miikkulainen the Srn might produce [who,[bit,[the,dog]]]]. The structure of this representation is correct; what is lost are the particular instantiations of the parse tree. This is where SardNet makes a difference. The lost constituent information remains accessible in the feature map. As a result, SardSrn is able to capture each constituent even through the final reductions. 5 Discussion and Future Work These results demonstrate a practicable solution to the memory degradation problem of distributed networks. When prior constituents are explicitly represented at the input, the network does not have to maintain specific information about the sequence, and can instead focus on what it is best at: capturing structure. Although the sentences used in these experiments are still relatively uncomplicated, they do exhibit enough structure to suggest that much more complex sentences could be tackled with distributed networks augmented with SardNet. These results also show that networks with SardNet can perform as well as Narx networks with many delays. Why is this a useful result? 
The point is that, in the general case, it will be unclear how many delays are needed in a Narx network, whereas SardNet can accommodate sequences of indefinite length (limited only by the number of nodes in the map without reactivation). Indeed, SardNet functions much in the same way as the delays in the Narx networks do, but with the explicit representation of the input tokens replaced by single-unit activations on the map. This relieves the designer from having to specify, by trial and error, the appropriate number of delays. It should also lead to more graceful degradation with unexpectedly long sequences, and therefore would allow the system to scale up better and exhibit more plausible cognitive behavior. The SardNet idea is not just a way to improve the performance of subsymbolic networks; it is an explicit implementation of the idea that humans can keep track of identities of elements, not just their statistical properties [18]. The subsymbolic networks are very good with statistical associations, but cannot distinguish between representations that have similar statistical properties. People can; whether they use a map-like representation or explicit delays (and how many) is an open question, but we believe the SardNet representation suggests an elegant way to capture a lot of the resulting behavior. SardNet is a plausible cognitive approach, and useful for building powerful subsymbolic language understanding systems. SardNet is also in line with the general neurological evidence for topographical representations in the brain. It is also worth noting that the operation of the recurrent networks on the shift-reduce parsing task is a nice demonstration of holistic computation. The network is able to learn how to generate each Raam parse representation during the course of sentence processing without ever having to decompose and recompose the constituent representations. Partial parse results can be built up incrementally into increasingly complicated structures, which suggests that Combining Maps and Distributed Representations for Shift-Reduce Parsing 155 training could be performed incrementally. Such a training scheme is especially attractive given that training in general is still relatively costly. An extension of the hybrid approach taken in this study, currently being investigated by our group, is an architecture where SardNet is combined with a Raam network. Raam, although having many desirable properties for a purely connectionist approach to parsing, has long been a bottleneck during training. Its operation is very similar to the Srn, and it suffers from the same memory accuracy problem: with deep structures the superimposition of higher-level representations gradually obscure the traces of low-level items, and the decoding becomes inaccurate. This degradation makes it difficult to use Raam to encode/decode parse results of realistic language. Preliminary results indicate that the explicit representation of a compressed structure formed on a SardNet feature map, coupled with the distributed representations of Raam, yields an architecture able to encode richer linguistic structure. This approach should readily lend itself to encoding the feature-value matrices used in the lexicalist, constraint-based grammar formalisms of contemporary linguistics theory, such as Hpsg [26], needed to handle realistic natural language. 6 Conclusion We have shown how explicit representation of constituents on a self-organizing map allows recurrent networks to process sequences more effectively. 
We demonstrated that neural networks equipped with SardNet sequence memory achieve much better performance than distributed networks alone, and comparable performance to Narx networks with several delays, on a nontrivial shift-reduce parsing task. SardNet, however, provides a more elegant and cognitively plausible solution to the problem of long-term dependencies in that it does not impose hard limits on the length of the sequences it can process. In future work, we will extend the method to other recurrent neural network architectures such as Raam to allow much richer linguistic structures to be represented in connectionist architectures. Acknowledgments This research was supported in part by the Texas Higher Education Coordinating Board under grant ARP-444. SardSrn demo: http://www.cs.utexas.edu/users/nn/pages/research/nlp.html. References 1. R. B. Allen. Several studies on natural language and back-propagation. In Proceedings of the IEEE First International Conference on Neural Networks (San Diego, CA), volume II, pages 335–341. Piscataway, NJ: IEEE, 1987. 2. Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994. 156 M.R. Mayberry and R. Miikkulainen 3. G. Berg. A connectionist parser with recursive sentence structure and lexical disambiguation. In W. Swartout, editor, Proceedings of the Tenth National Conference on Artificial Intelligence, pages 32–37. Cambridge, MA: MIT Press, 1992. 4. D. J. Chalmers. Syntactic transformations on distributed representations. Connection Science, 2:53–62, 1990. 5. S. Chen, S. Billings, and P. Grant. Non-linear system identification using neural networks. In International Journal of Control, pages 1191–1214, 1990. 6. J. Connor, L. Atlas, and D. Martin. Recurrent networks and narma modeling. Advances in Neural Information Processing Systems, 4:301–308, 1992. 7. J. L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990. 8. J. L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7:195–225, 1991. 9. U. Hermjakob. Learning Parse and Translation Decisions from Examples with Rich Context. PhD thesis, Department of Computer Sciences, The University of Texas at Austin, Austin, TX, 1997. Technical Report UT-AI97-261. 10. B. Horne and C. Giles. An experimental comparison of recurrent neural networks. Advances in Neural Information Processing Systems, 7:697–704, 1995. 11. D. L. James and R. Miikkulainen. SARDNET: A self-organizing feature map for sequences. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 577–584. Cambridge, MA: MIT Press, 1995. 12. T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78:1464–1480, 1990. 13. T. Kohonen. Self-Organizing Maps. Springer, Berlin; New York, 1995. 14. T. Lin, B. G. Horne, and C. L. Giles. How embedded memory in recurrent neural network architectures helps learning long-term temporal dependencies. Neural Networks, 11(5):861–868, 1998. 15. T. Lin, B. G. Horne, and C. L. Giles. Learning long-term dependencies in narx recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329– 1338, 1996. 16. T. Lin, C. L. Giles, B. G. Horne, and S. Y. Kung. A Delay Damage Model Selection Algorithm for NARX Neural Networks. IEEE Transactions on Signal Processing, 45(11):2719-2730, 1997. 17. J. L. McClelland and A. H. Kawamoto. 
Mechanisms of sentence processing: Assigning roles to constituents. In J. L. McClelland and D. E. Rumelhart, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 2: Psychological and Biological Models, pages 272–325. MIT Press, Cambridge, MA, 1986. 18. R. Miikkulainen. Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon, and Memory. MIT Press, Cambridge, MA, 1993. 19. R. Miikkulainen. Subsymbolic case-role analysis of sentences with embedded clauses. Cognitive Science, 20:47–73, 1996. 20. R. Miikkulainen. Dyslexic and category-specific impairments in a self-organizing feature map model of the lexicon. Brain and Language, 59:334–366, 1997. 21. P. Munro, C. Cosic, and M. Tabasko. A network for encoding, decoding and translating locative prepositions. Connection Science, 3:225–240, 1991. 22. K. S. Narendra and K. Parthasarathy. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, 1:4–27, 1990. Combining Maps and Distributed Representations for Shift-Reduce Parsing 157 23. D. C. Plaut. Connectionist Neuropsychology: The Breakdown and Recovery of Behavior in Lesioned Attractor Networks. PhD thesis, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, 1991. Technical Report CMU-CS-91185. 24. D. C. Plaut and T. Shallice. Perseverative and semantic influences on visual object naming errors in optic aphasia: A connectionist account. Technical Report PDP.CNS.92.1, Parallel Distributed Processing and Cognitive Neuroscience, Department of Psychology, Carnegie Mellon University, Pittsburgh, PA, 1992. 25. J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46:77– 105, 1990. 26. C. Pollard and I. A. Sag. Head-Driven Phrase Structure Grammar. University of Chicago Press, Chicago, IL, 1994. 27. D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pages 318–362. MIT Press, Cambridge, MA, 1986. 28. N. E. Sharkey and A. J. C. Sharkey. A modular design for connectionist parsing. In M. F. J. Drossaers and A. Nijholt, editors, Twente Workshop on Language Technology 3: Connectionism and Natural Language Processing, pages 87–96, Enschede, the Netherlands, 1992. Department of Computer Science, University of Twente. 29. R. F. Simmons and Y.-H. Yu. The acquisition and application of context sensitive grammar for English. In Proceedings of the 29th Annual Meeting of the ACL. Morristown, NJ: Association for Computational Linguistics, 1991. 30. R. F. Simmons and Y.-H. Yu. The acquisition and use of context dependent grammars for English. Computational Linguistics, 18:391–418, 1992. 31. M. F. St. John and J. L. McClelland. Learning and applying contextual constraints in sentence comprehension. Artificial Intelligence, 46:217–258, 1990. 32. A. Stolcke. Learning feature-based semantics with simple recurrent networks. Technical Report TR-90-015, International Computer Science Institute, Berkeley, CA, 1990. 33. M. Tomita. Efficient Parsing for Natural Language. Kluwer, Dordrecht; Boston, 1986. 34. D. S. Touretzky. Connectionism and compositional semantics. In J. A. Barnden and J. B. Pollack, editors, High-Level Connectionist Models, volume 1 of Advances in Connectionist and Neural Computation Theory, Barnden, J. A., series editor, pages 17–31. 
Ablex, Norwood, NJ, 1991. 35. J. M. Zelle and R. J. Mooney. Comparative results on using inductive logic programming for corpus-based parser construction. In S. Wermter, E. Riloff, and G. Scheler, editors, Connectionist, Statistical, and Symbolic Approaches to Learning for Natural Language Processing, pages 355–369. Springer, Berlin; New York, 1996.

Towards Hybrid Neural Learning Internet Agents

Stefan Wermter, Garen Arevian, and Christo Panchev
Hybrid Intelligent Systems Group, University of Sunderland, Centre for Informatics, SCET, St Peter's Way, Sunderland, SR6 0DD, UK
stefan.wermter@sunderland.ac.uk
http://www.his.sunderland.ac.uk/

Abstract. The following chapter explores learning internet agents. In recent years, with the massive increase in the amount of available information on the Internet, a need has arisen for being able to organize and access that data in a meaningful and directed way. Many well-explored techniques from the field of AI and machine learning have been applied in this context. In this paper, special emphasis is placed on neural network approaches to implementing a learning agent. First, various important approaches are summarized. Then, an approach for neural learning internet agents is presented, one that uses recurrent neural networks for learning to classify a textual stream of information. Experimental results are presented showing that a neural network model based on a recurrent plausibility network can act as a scalable, robust and useful news routing agent. A concluding section examines the need for a hybrid integration of various techniques to achieve optimal results in the problem domain specified, in particular exploring the hybrid integration of Preference Moore machines and recurrent networks to extract symbolic knowledge.

1 Introduction

The exponential expansion of Internet information has been very apparent; however, there is still a great deal that can be done in terms of improving the classification and subsequent access of the data that is potentially available. The motivation for trying various techniques from the field of machine learning arises from the fact that there is a great deal of unstructured data. Much time is spent on searching for information, filtering information down to essential data, reducing the search space for specific domains, classifying text and so on. The various techniques of machine learning are examined for automating the learning of these processes, and tested to address the problem of an expanding and dynamic Internet [26]. So-called "internet agents" are implemented to address some of these problems. The simplest definition of an agent is that it is a software system, to some degree autonomous, that is designed to perform or learn a specific task [2,30], and which is implemented as either one algorithm or a combination of several. Agents can be designed to perform various tasks including textual classification [34,10], information retrieval and extraction [5,9], routing of information such as email and news [3,18,6,46], automating web browsing [1], organization [36,4], personal assistance [39,20,17,14] and learning for web-agents [28,46]. In spite of a lot of work on internet agents, most systems currently do not have learning capabilities.
In the context of this paper, a learning agent is taken to be an algorithmic approach to a classification problem that allows it to be dynamic, robust and able to handle noisy data, to a degree autonomously, while improving its performance through repeated experience [44]. Of course, learning internet agents can have a variety of definitions as well, and the emphasis within this context is more on autonomously functioning systems that can either classify or route information of a textual nature. In particular, after a summary of various approaches, the HyNeT recurrent neural network architecture will be described, which is shown to be a robust and scalable text routing agent for the Internet.

2 Different Approaches to Learning in Agents

The field of Machine Learning is concerned with the construction of computer programs that automatically improve their performance with experience [33]. A few examples of currently applied machine learning approaches for learning agents are decision trees [37], Bayesian statistical approaches [31], Kohonen networks [24,22] and Support Vector Machines (SVMs) [19]. However, in the following summary, the potential use of neural networks is examined.

2.1 Neural Network Approaches

Many internet-related problems are neither discrete nor are their distributions known, due to the dynamics of the medium. Therefore, internet agents can be made more powerful by employing various learning algorithms inspired by approaches from neural networks. Neural networks have several main properties which make them very useful for the Internet. The information processing is non-linear, allowing the learning of real-valued, discrete-valued and vector-valued examples; they are adaptable and dynamic in nature, and hence can cope with a varying operating environment. Contextual information and knowledge is represented by the structure and weights of a system, allowing interesting mappings to be extracted from the problem environment. Most importantly, neural networks are fault-tolerant and robust, being able to learn from noisy or incomplete data due to their distributed representations. There are many different neural network algorithms; however, bearing in mind the context of agents and learning, several types of neural network are more suitable than others for the task that is required. For a dynamic system like the Internet, an online agent needs to be as robust as possible, essentially to be left to the task of routing, classifying and organizing textual data in an autonomous and self-maintaining way by being able to generalize, and to be fault-tolerant and adaptive. The three approaches so far shown to be most suitable are recurrent networks [46], Kohonen self-organizing maps (SOMs) [24,22] and reinforcement learning [42,43]. All these neural network approaches have properties which are briefly discussed and illustrated below.

Supervised Recurrent Networks

Recurrent neural networks have shown great promise in many tasks. For example, certain natural language processing approaches require that context and time be incorporated as part of the model [8,7]; hence, recent work has focused on developing networks that are able to create contextual representations of textual data which take into account the implicit representation of time, temporal sequencing and the context as a result of the internal representation that is created.
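The kind of network described in this paragraph can be made concrete in a few lines of code. The following is a minimal sketch (Python/NumPy) of a simple recurrent layer in the spirit of Elman [8,7]: the hidden state computed for the previous word is copied into a context layer and combined with the current input, which gives the network an implicit representation of time. All layer sizes, names and the untrained forward pass are illustrative assumptions, not details taken from the chapter; training by backpropagation is omitted.

```python
import numpy as np

class SimpleRecurrentNet:
    """Minimal Elman-style recurrent layer: the previous hidden state is kept
    in a context layer and fed back together with the current input vector."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_in))
        self.W_ctx = rng.normal(0.0, 0.1, (n_hidden, n_hidden))
        self.W_out = rng.normal(0.0, 0.1, (n_out, n_hidden))
        self.context = np.zeros(n_hidden)

    def step(self, x):
        # Hidden state depends on the current word vector and the previous context.
        h = np.tanh(self.W_in @ x + self.W_ctx @ self.context)
        self.context = h  # copy-back: the implicit representation of time
        # Output interpreted as category preferences for the sequence so far.
        return 1.0 / (1.0 + np.exp(-(self.W_out @ h)))
```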
These properties of recurrent neural networks can be useful for creating an agent that is able to derive information from text-based, noisy Internet input. In particular, recurrent plausibility networks have been found useful [45,46]. Also, NARX (Nonlinear Autoregressive with eXogenous inputs) models have been shown to be very effective in learning many problems, such as those that involve long-term dependencies [29]; NARX networks are formalized by [38]:

y(t) = f(x(t − nx), . . . , x(t − 1), x(t), y(t − ny), . . . , y(t − 1))

where x(t) and y(t) are the input and output of the network at time t, nx and ny represent the order of the input and output, and the function f is the mapping performed by the multi-layer perceptron. In some cases, it has been shown that NARX and RNN (Recurrent Neural Network) models are equivalent [40]: under the condition that the neuron transfer function is similar to the NARX transfer function, one may be transformed into the other and vice versa. The benefit is that, if the output dimension of a NARX model is larger than the number of hidden units, training an equivalent RNN will be faster; pruning is also easier in an equivalent NARX, whose stability behavior can be analyzed more readily.
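As a concrete reading of the NARX formulation above, the sketch below runs a static feed-forward mapping f over a sliding window of delayed inputs and previously produced outputs. It is a hedged illustration only: the helper names (make_mlp, narx_predict), the random untrained weights and the chosen delay orders are assumptions made for the example, not details from the chapter or from [38].

```python
import numpy as np
from collections import deque

def make_mlp(n_in, n_hidden, n_out, seed=0):
    """A small static feed-forward mapping f (untrained, random weights),
    standing in for the multi-layer perceptron of the NARX formulation."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))
    W2 = rng.normal(0.0, 0.1, (n_out, n_hidden))
    return lambda z: W2 @ np.tanh(W1 @ z)

def narx_predict(f, xs, n_x, n_y, n_out):
    """y(t) = f(x(t - n_x), ..., x(t - 1), x(t), y(t - n_y), ..., y(t - 1))."""
    dim = len(xs[0])
    x_buf = deque([np.zeros(dim)] * n_x, maxlen=n_x + 1)  # delayed inputs + current
    y_buf = deque([np.zeros(n_out)] * n_y, maxlen=n_y)    # delayed outputs
    outputs = []
    for x in xs:
        x_buf.append(np.asarray(x, dtype=float))
        z = np.concatenate(list(x_buf) + list(y_buf))
        y = f(z)
        y_buf.append(y)
        outputs.append(y)
    return outputs

# Example: 10-dimensional word vectors, input order 2, output order 2, 8 categories.
f = make_mlp(n_in=(2 + 1) * 10 + 2 * 8, n_hidden=16, n_out=8)
seq = [np.random.default_rng(1).normal(size=10) for _ in range(5)]
ys = narx_predict(f, seq, n_x=2, n_y=2, n_out=8)
```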
Unsupervised Models

Recently, applications of Kohonen nets have been extended to the realm of text processing [25,16], to create browsable mappings of Internet-related hypertext data. A self-organizing map (SOM) forms a nonlinear projection from a high-dimensional data manifold onto a low-dimensional grid [24]. The SOM algorithm computes an optimal collection of models that approximates the data by applying a specified error criterion, and it takes into account the similarities, and hence the relations, between the models; this allows the ordering of the reduced-dimensionality data onto a grid. The SOM algorithm [23,24] is formalized as follows. There is an initialization step, where random values for the initial weight vectors wj(0) are set; if the total number of neurons in the lattice is N, wj(0) must be different for j = 1, 2, . . . , N. The magnitude of the weights should be kept small for optimal performance. There is a sampling step where example vectors x from the input distribution are taken that represent the sensory signal. The optimally matched 'winning' neuron i(x) at discrete time t is found using the minimum-distance Euclidean criterion by a process called similarity matching:

i(x) = arg minj ||x(t) − wj(t)||,   for j = 1, 2, . . . , N

The synaptic weight vectors of all the neurons are then adjusted and updated according to:

wj(t + 1) = wj(t) + µ(t)[x(t) − wj(t)]   for j ∈ Λi(x)(t)
wj(t + 1) = wj(t)   otherwise

The learning rate is µ(t), and Λi(x)(t) is the neighborhood function centered around the winning neuron i(x); both µ(t) and Λi(x)(t) are varied continuously. The sampling, matching and update steps are repeated until no further changes are observed in the mappings. In this way, the WEBSOM agent [25] can represent web documents statistically by their word frequency histograms, or by some reduced form of the data as vectors. The SOM here acts as a similarity graph of the data. A simple graphical user interface is used to present the ordered data for navigation. This approach has been shown to be appropriate for the task of learning for newsgroup classification.
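The sampling, matching and update steps above can be sketched compactly in code. The following is a minimal, hedged illustration (Python/NumPy) of one SOM training step on a 2-D lattice; note that it uses a soft Gaussian neighborhood and exponentially decaying schedules for µ(t) and the neighborhood width, which are common choices but are assumptions here rather than the hard neighborhood set Λ of the formulation above. All sizes and schedule constants are illustrative.

```python
import numpy as np

def som_train_step(weights, x, t, lr0=0.5, sigma0=2.0, tau=1000.0):
    """One sampling/matching/update step of a SOM whose weights have shape
    (rows, cols, dim). Returns the lattice coordinates of the winning neuron."""
    rows, cols, _ = weights.shape
    # Similarity matching: winning neuron i(x) by minimum Euclidean distance.
    dists = np.linalg.norm(weights - x, axis=2)
    win = np.unravel_index(np.argmin(dists), dists.shape)
    # Decaying learning rate mu(t) and neighborhood width (illustrative schedules).
    mu = lr0 * np.exp(-t / tau)
    sigma = sigma0 * np.exp(-t / tau)
    # Soft Gaussian neighborhood around the winner, instead of the hard set Lambda.
    rr, cc = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    h = np.exp(-((rr - win[0]) ** 2 + (cc - win[1]) ** 2) / (2.0 * sigma ** 2))
    # Update: w_j(t+1) = w_j(t) + mu(t) * h_j * (x - w_j(t)), in place.
    weights += mu * h[:, :, None] * (x - weights)
    return win

# Example usage with random document vectors (hypothetical sizes).
rng = np.random.default_rng(0)
som = rng.random((10, 10, 50))          # 10x10 lattice of 50-dimensional models
for t, doc_vec in enumerate(rng.random((200, 50))):
    som_train_step(som, doc_vec, t)
```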
Reinforcement Learning Approaches

This is the on-line learning of input-output mappings through a process of exploration of a problem space. Agents that use reinforcement learning rely on the use of training data that evaluates the actions taken. There is active exploration, with an explicit trial-and-error search for the desired behavior [43,12]; evaluative feedback, which is specifically characteristic of this type of learning, indicates how good an action taken is, but not whether it is the best or worst possible. All reinforcement learning approaches have explicit goals, and they interact with and influence their environments. Reinforcement learning aims to find a policy that selects a sequence of actions which is statistically optimal. The probability that a specific environment makes a transition from a state x(t) to y at time t + 1, given that it was previously in states x(0), x(1), ..., and that the corresponding actions a(0), a(1), ..., were taken, depends entirely on the current state x(t) and action a(t), as shown by:

Γ{x(t + 1) = y | x(0), a(0); x(1), a(1); . . . ; x(t), a(t)} = Γ{x(t + 1) = y | x(t), a(t)}

where Γ(·) is the transition probability or change of state. If the environment is in a state x(0) = x, the evaluation function [43,12] is given by:

H(x) = E[ Σ_{k=0}^{∞} γ^k r(k + 1) | x(0) = x ]

Here, E is the expectation operator, taken with respect to the policy used to select actions by the agent. The summation is termed the cumulative discounted reinforcement, and r(k + 1) is the reinforcement received from the environment after action a(k) is taken by the agent. The reinforcement feedback can have a positive value (regarded as a 'reward' signal), a negative value (regarded as 'punishment'), or it can remain unchanged; γ is called the discount-rate parameter and lies in the range 0 ≤ γ < 1, where if γ → 0 the reinforcement is more short term, and if γ → 1 the cumulative actions are weighted for the longer term. Learning the evaluation function H(x) allows the use of the cumulative discounted reinforcement later on. This approach, though not yet fully explored for sequential tasks on the Internet, holds promise for the design of a learning agent system that fulfills the necessary criteria: one that is autonomous, able to adapt, robust, and able to handle noise and sequential decisions.
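To make the discounted quantity above concrete, the short sketch below computes the cumulative discounted reinforcement for a finite reward sequence, and shows one simple temporal-difference step towards the evaluation function H. It is a minimal, hedged example; the dictionary-based state representation, the learning rate and the TD(0) update rule are illustrative assumptions, not methods described in the chapter.

```python
def discounted_return(rewards, gamma=0.9):
    """Cumulative discounted reinforcement: sum over k of gamma**k * r(k+1)."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total

def td0_update(H, x, r, x_next, alpha=0.1, gamma=0.9):
    """One temporal-difference step towards the evaluation function H(x),
    stored here as a dict from states to estimated discounted return."""
    h_x, h_next = H.get(x, 0.0), H.get(x_next, 0.0)
    H[x] = h_x + alpha * (r + gamma * h_next - h_x)

# A reward of +1 arriving three steps in the future is discounted by gamma**3.
print(discounted_return([0.0, 0.0, 0.0, 1.0], gamma=0.9))  # approximately 0.729
```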
3 Analysis and Discussion of a Specific Learning Internet Agent: HyNeT

A more detailed description of one particular learning agent will now be presented. A great deal of recent work on neural networks has shifted from the processing of strictly numerical data towards the processing of various corpora and the huge body of the Internet [35,26,5,19]. Indeed, it has been an important goal to study the more fundamental issues of connectionist systems, the way in which knowledge is encoded in neural networks, and how knowledge can be derived from them [13,32,11,41,15]. A useful example, applicable as it is a real-world task, is the routing and classification of newswire titles, which will now be described.

3.1 Recurrent Plausibility Networks

In this section, a detailed analysis of one such agent called HyNeT (Hybrid Neural/symbolic agents for Text routing on the internet), which uses a recurrent neural network, is presented and experimental results are discussed. The specific neural network explored here is a more developed version of the simple recurrent neural network, namely a Recurrent Plausibility Network [45,46]. Recurrent neural networks are able to map both previous internal states and input to a desired output, essentially acting as short-term incremental memories that take time and context into consideration. Fully recurrent networks process all information and feed it back into a single layer, but for the purposes of maintaining contextual memory for processing arbitrary lengths of input, they are limited. However, partially recurrent networks have recurrent connections between the hidden and context layers [7], or, in Jordan networks, between the output and context layers [21]; these allow previous states to be kept within the network structure. Simple recurrent networks have a rapid rate of decay of information about states. For many classification tasks in general, recent events are more important, but some information can also be gained from information that is more long-term. With sequential textual processing, context within a specific processing time-frame is important, and two kinds of short-term memory can be useful: one that is more dynamic and varying over time, which keeps more recent information, and a more stable memory, the information of which is allowed to decay more slowly to keep information about previous events over a longer time period. In other research [45], different decay memories were introduced by using distributed recurrent delays over the separate context layers representing the contexts at different time steps. At a given time step, the network with n hidden layers processes the current input as well as the incremental contexts from the n − 1 previous time steps. Figure 1 shows the general structure of our recurrent plausibility network.

Fig. 1. General Representation of a Recurrent Plausibility Network (input layer I0(t), hidden layers Hn−1(t) and Hn(t), context layers Cn−2(t−1) and Cn−1(t−1) with recurrent connections, and output layer On(t); feedforward propagation runs from the input to the output layer).

The input to a hidden layer Hn is constrained by the underlying layer Hn−1 as well as the incremental context layer Cn−1. The activation of a unit Hni(t) at time t is computed on the basis of the weighted activation of the units in the previous layer H(n−1)i(t) and the units in the current context of this layer C(n−1)i(t). In a particular case, the following is used:

Lni(t) = f( Σk wki H(n−1)i(t) + Σl wli C(n−1)i(t) )

The units in the two context layers, which carry a one-time-step delay, are computed as follows:

Cni(t) = (1 − ϕn) H(n+1)i(t − 1) + ϕn Cni(t − 1)

where Cni(t) is the activation of a unit in the context layer at time t. The self-recurrency of the context is controlled by the hysteresis value ϕn. The hysteresis value of the context layer Cn−1 is lower than the hysteresis value of the next context layer Cn. This ensures that the context layers closer to the input layer will perform as memory that represents a more dynamic context for small time periods.
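The two update equations above translate almost directly into code. The sketch below is a hedged illustration (Python/NumPy) of a single hidden layer with its context; the function names, the use of a logistic transfer function, the vector sizes and the example hysteresis values are assumptions made for the example (the chapter reports hysteresis values of 0.2 and 0.8 for its two context layers in Section 3.3).

```python
import numpy as np

def hidden_activation(h_below, c_below, W_h, W_c):
    """L_ni(t): logistic function of the weighted sums over the layer below
    and over the current context of that layer."""
    net = W_h @ h_below + W_c @ c_below
    return 1.0 / (1.0 + np.exp(-net))

def context_update(c_prev, h_prev, phi):
    """C_n(t) = (1 - phi) * H_{n+1}(t-1) + phi * C_n(t-1): a small hysteresis phi
    gives a fast-changing context, a large phi a slowly decaying, stable one."""
    return (1.0 - phi) * h_prev + phi * c_prev

# Two context layers with different decay: dynamic (phi = 0.2) and stable (phi = 0.8).
rng = np.random.default_rng(0)
h1_prev, h2_prev = rng.random(16), rng.random(16)
c_dynamic = context_update(np.zeros(16), h1_prev, phi=0.2)
c_stable = context_update(np.zeros(16), h2_prev, phi=0.8)
```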
3.2 Reuters-21578 Text Categorization Test Collection

The Reuters News Corpus is a collection of news articles that appeared on the Reuters newswire; all the documents have been categorized by Reuters into several specific categories. Further formatting of the corpus [27] has produced the so-called ModApte split; some examples of the news titles are given in Table 1.

Table 1. Example titles from the Reuters corpus.

Semantic Category      Example Title
money-fx               Bundesbank sets new re-purchase tender
shipping               US Navy said increasing presence near gulf
interest               Bank of Japan determined to keep easy money policy
economic               Miyazawa sees eventual lower US trade deficit
corporate              Oxford Financial buys Clancy Systems
commodity              Cattle being placed on feed lighter than normal
energy                 Malaysia to cut oil output further traders say
shipping & energy      Soviet tankers set to carry Kuwaiti oil
money-fx & currency    Bank of Japan intervenes shortly after Tokyo opens

All the news titles belong to one or more of eight main categories: Money and Foreign Exchange (money-fx, MFX), Shipping (ship, SHP), Interest Rates (interest, INT), Economic Indicators (economic, ECN), Currency (currency, CRC), Corporate (corporate, CRP), Commodity (commodity, CMD), Energy (energy, ENG).

3.3 Various Experiments Conducted

In order to get a comparison of performance, several experiments were conducted using different vector representations of the words in the Reuters corpus as part of the preprocessing; the variously derived vector representations were fed into the input layer of simple recurrent networks, the output being the desired semantic routing category. The preprocessing strategies are briefly outlined and explained below. The recall/precision results are presented later in Table 2 for each experiment.

Simple Recurrent Network and Significance Vectors

In the initial experiment, words were represented using significance vectors; these were obtained by determining the frequency of a word in different semantic categories using the following operation:

v(w, xi) = (Frequency of w in xi) / (Σj Frequency of w in xj),   for j ∈ {1, · · · , n}

If a vector (x1 x2 . . . xn) represents each word w, and xi is a specific semantic category, then v(w, xi) is calculated for each dimension of the word vector as the frequency of the word w in the semantic category xi divided by the number of times the word w appears in the corpus. The computed values are then presented at the input of a simple recurrent network [8] in the form (v(w, x1), v(w, x2), . . . , v(w, xn)).

Simple Recurrent Network and Semantic Vectors

An alternative preprocessing strategy was to represent vectors as the plausibility of a specific word occurring in a particular semantic category, the main advantage being that they are independent of the number of examples present in each category:

v(w, xi) = (Normalized frequency of w in xi) / (Σj Normalized frequency of w in xj),   for j ∈ {1, · · · , n}

where:

Normalized frequency of w in xi = (Frequency of w in xi) / (Number of titles in xi)

The normalized frequency of appearance of a word w in a semantic category xi (i.e. the normalized category frequency) was thus computed as a value v(w, xi) for each element of the semantic vector, divided by the normalized frequency of appearance of the word w in the corpus (i.e. the normalized corpus frequency).

Recurrent Plausibility Network and Semantic Vectors

In the final experiment, a recurrent plausibility network, as shown in Figure 1, was used; the actual architecture used for the experiment was one with two hidden and two context layers. After empirically testing various combinations of settings for the hysteresis value of the activation function of the context layers, it was found that the network performed optimally with a value of 0.2 for the first context layer and 0.8 for the second.
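As a hedged illustration of the two preprocessing schemes above, the sketch below computes both vector variants from raw category counts. The example counts for the word "bank" and the per-category title totals are invented for illustration only and are not taken from the Reuters corpus; in the actual agent, the values would be laid out as a fixed-order input vector rather than a dictionary.

```python
def significance_vector(counts_by_cat):
    """Significance vector of a word: its frequency in each category divided by
    its total frequency in the corpus (the sum over all categories)."""
    total = float(sum(counts_by_cat.values()))
    return {c: (n / total if total else 0.0) for c, n in counts_by_cat.items()}

def semantic_vector(counts_by_cat, titles_per_cat):
    """Semantic vector of a word: category frequencies are first normalized by
    the number of titles in each category, so the representation does not depend
    on category size, and are then normalized to sum to one."""
    norm = {c: counts_by_cat[c] / titles_per_cat[c] for c in counts_by_cat}
    total = float(sum(norm.values()))
    return {c: (v / total if total else 0.0) for c, v in norm.items()}

# Hypothetical counts for the word "bank": frequent in money-fx and interest titles.
counts = {"money-fx": 40, "interest": 30, "economic": 5, "corporate": 5}
titles = {"money-fx": 800, "interest": 300, "economic": 1000, "corporate": 2000}
print(significance_vector(counts))
print(semantic_vector(counts, titles))
```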
Table 2. Best recall/precision results from various experiments.

                                                         Training set           Test set
Type of Vector Representation Used in Experiment         recall   precision    recall   precision
Significance Vectors and Simple Recurrent Network         85.15     86.99       91.23     90.73
Semantic Vectors and Simple Recurrent Network             88.57     88.59       92.47     91.61
Semantic Vectors with Recurrent Plausibility Network      89.05     90.24       93.05     92.29
"Bag of Words" with Recurrent Plausibility Network            -         -       86.60     83.10

3.4 Results of Experiments

The results in Table 2 show the clear improvement in the overall recall/precision values from the first experiment using the significance vectors to the last using the plausibility network. The experiment with the semantic vector representation showed an improvement over the first. The best performance was shown by the use of the plausibility network. In comparison, a bag-of-words approach, used to test performance on sequences without order, reached 86.6% recall and 83.1% precision; this indicates that the order of significant words, and hence the context, is an important source of information which the recurrent neural network learns, allowing better classification performance. These results demonstrate that a carefully developed neural network agent architecture can deal with significantly large test and training sets. In some previous work [45], recall/precision accuracies of 95% were reached, but the library titles used in that work were much less ambiguous than the Reuters corpus (which had a few main categories and news titles that could easily be misclassified due to the inherent ambiguity), and only 1,000 test titles were used in that approach, while the plausibility network was scalable to 10,000 corrupted and ambiguous titles. For general comparison with other approaches, interesting work on text categorization on the Reuters corpus has been done using whole documents [19] rather than titles. Taking the ten most frequently occurring categories, it has been shown that the recall/precision break-even point was 86% for Support Vector Machines, 82% for k-Nearest Neighbor, and 72% for Naive Bayes. Though a different set of categories and whole documents were used, and therefore the results may not be directly comparable to those shown in Table 2, they do however give some indication of document classification performance on this corpus. Especially for medium text data sets, or when only titles are available, the HyNeT agent compares favorably with the other machine learning techniques that have been tested on this corpus.

3.5 Analysis of the Output Representations

For a clear presentation of the network's behavior, the results are illustrated and analyzed below; the error surfaces show plots of the sum-squared error of the output preferences, plotted against the number of training epochs and each word of a title.

Fig. 2. The error surface of the title "Miyazawa Sees Eventual Lower US Trade Deficit" (sum-squared error plotted against the training epoch and each word of the title).

Figure 2 shows the surface error of the title "Miyazawa Sees Eventual Lower US Trade Deficit". In the Reuters corpus this is classified under the "economic" category; as can be seen, the network does learn the correct category classification. The first two words, "Miyazawa" and "sees", are initially given several possible preferences to other categories, and the errors are high early on in the training.
However, the subsequent words "eventual", "lower", etc. cause the network to increasingly favor the correct classification, and at the end, the trained network has a very strong preference (shown by the low error value) for the incremental context of the desired category. The second example is shown in Figure 3, titled "Bank of Japan Determined To Keep Easy Money Policy" and belonging to the "interest" category. This example shows a more complicated behavior in the contextual learning, in contrast to the previous one. The words beginning "Bank of Japan" are ambiguous and could be classified under different categories such as "money/foreign exchange" and "currency", and indeed the network shows some confused behavior; again, however, the context of the later words such as "easy money policy" eventually allows the network to learn the correct classification.

Fig. 3. The error surface of the title "Bank of Japan Determined To Keep Easy Money Policy" (sum-squared error plotted against the training epoch and each word of the title).

3.6 Context Building in Plausibility Neural Networks

Figures 5 and 7 present cluster dendrograms based on the internal context representations at the end of titles. The test includes 5 representative titles for each category; each title belongs to only one category. All titles are correctly classified by the network. The first observation that can be made from these figures is that the dendrogram based on the activations of the second context layer (closer to the output layer) provides a better distinction between the classes. In other words, it can be seen that the second context layer is more representative of the title classification than the first one. This analysis aims to explore how these contexts are built and what the difference is between the two contexts along a title. Using the data for Figures 5 and 7, the class-activity of a particular context unit is defined with respect to a given category as the activation of this unit when a title from this category has been presented to the network. That is, for example, at the end of a title from the category "economic", the units with the higher activation will be classified as being more class-active with respect to the "economic" category, and the units with lower activation as less class-active. For the analysis of the context building in the plausibility network, the activations of the context units were recorded while processing the title "Assets of money market mutual funds fell 35.3 mln dlrs in latest week to 237.43 billion".

Fig. 4. The activation of the units in the first context layer, word by word along the title; the order of the units is changed according to their class-activity.

Fig. 5. The cluster dendrogram and internal context representations of the first context layer for 40 representative titles.

Fig. 6. The activation of the units in the second context layer, word by word along the title; the order of the units is changed according to their class-activity.

Fig. 7. The cluster dendrogram and internal context representations of the second context layer for 40 representative titles.

This title belongs to the "economic" category, and the data was sorted with a key which is the activity of the neurons with respect to this category. The results are shown in Figures 4 and 6. The most class-active unit for the class "economic" is given as unit 1 in the figures, and the unit with the lowest class-activity as unit 6. Thus, the ideal curve at a given word step, for the title to be classified into the correct category, would be a monotonically decreasing function from the units with the highest class-activity to the units with lower class-activity. As can be seen, most of the units in the first context layer (closer to the input) are more dynamic. They are highly dependent on the current word. Therefore the first context layer does not build a representative context for the required category at the end of the title. It rather responds to the incoming words, building a short dynamic context. However, the second context layer incrementally builds its context representation for the particular category. It is the context layer which is most responsible for a stable output and does not fluctuate so much with the different incoming words.

4 Conclusions

A variety of neural network learning techniques were presented which are considered relevant to the specific problem of classification of Internet texts. A new recurrent network architecture, HyNeT, was presented that is able to route news headlines.
Similar to incremental language processing, plausibility networks also process news titles using previous context as extra information. At the beginning of a title, the network might predict an incorrect category which usually changes to the correct one later on when more contextual information is available. Furthermore, the error of the network was also carefully examined at each epoch and for each word of the training headlines. These surface error figures allow a clear, comprehensive evaluation of training time, word sequence and overall classification error. In addition, this approach may be quite useful for any other learning technique involving sequences. Then, an analysis of the context layers was presented showing that the layers do indeed learn to use the information derived from context. To date, recurrent neural networks have not been developed for a new task of such size and scale, in the design of title routing agents. HyNeT is robust, classifies noisy arbitrary real-world titles, processes titles incrementally from left to right, and shows better classification reliability towards the end of titles based on the learned context. Plausibility neural network architectures hold a lot of potential for building robust neural architectures for semantic news routing agents on the Internet. References 1. M. Balabanovic and Y. Shoham. Learning information retrieval agents: Experiments with automated web browsing. In Proceedings of the 1995 AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, Stanford, CA, 1995. 172 S. Wermter, G. Arevian, and C. Panchev 2. M. Balabanovic, Y. Shoham, and Y. Yun. An adaptive agent for automated web browsing. Technical Report CS-TN-97-52, Stanford University, 1997. 3. W. Cohen. Learning rules that classify e-mail. In AAAI Spring Symposium on Machine Learning in Information Access, Stanford, CA, 1996. 4. R. Cooley, B. Mobasher, and J. Srivastava. Web mining: Information and pattern discovery on the world wide web. In International Conference on Tools for Artificial Intelligence, Newport Beach, CA, November 1997. 5. M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the world wide web. In Proceedings of the 15th National Conference on Artificial Intelligence, Madison, WI, 1998. 6. P. Edwards, D. Bayer, C.L. Green, and T.R. Payne. Experience with learning agents which manage internet-based information. In AAAI Spring Symposium on Machine Learning in Information Access, pages 31–40, Stanford, CA, 1996. 7. J. L. Elman. Finding structure in time. Technical Report CRL 8901, University of California, San Diego, CA, 1988. 8. J. L. Elman. Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7:195–226, 1991. 9. D. Freitag. Information extraction from html: Application of a general machine learning approach. In National Conference on Artificial Intelligence, pages 517– 523, Madison, Wisconsin, 1998. 10. J. Fuernkranz, T. Mitchell, and E. Riloff. A case study in using linguistic phrases for text categorization on the WWW. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorisation, Madison, WI, 1998. 11. L. Giles and C. W. Omlin. Extraction, insertion and refinement of symbolic rules in dynamically driven recurrent neural networks. Connection Science, 5:307–337, 1993. 12. S. Haykin. Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Company, New York, 1994. 13. J. 
Hendler. Developing hybrid symbolic/connectionist models. In J. A. Barnden and J. B. Pollack, editors, Advances in Connectionist and Neural Computation Theory, Vol.1: High Level Connectionist Models, pages 165–179. Ablex Publishing Corporation, Norwood, NJ, 1991. 14. R. Holte and C. Drummond. A learning apprentice for browsing. In AAAI Spring Symposium on Software Agents, Stanford, CA, 1994. 15. V. Honavar. Symbolic artificial intelligence and numeric artificial neural networks: towards a resolution of the dichotomy. In R. Sun and L. A. Bookman, editors, Computational Architectures integrating Neural and Symbolic Processes, pages 351– 388. Kluwer, Boston, 1995. 16. S. Honkela. Self-organizing maps in symbol processing. In Hybrid Neural Systems (this volume). Springer-Verlag, 2000. 17. M.A. Hoyle and C. Lueg. Open SESAME: A look at personal assisitants. In Proceedings of the Interanational Conference on the Practical Applications of Intelligent Agents and Multi-Agent Technology, London, pages pp. 51–56, 1997. 18. D. Hull, J. Pedersen, and H. Schutze. Document routing as statistical classification. In AAAI Spring Symposium on Machine Learning in Information Access, Stanford, CA, 1996. 19. T. Joachims. Text categorization with support vector machines: learning with many relevant features. In Proceedings of the European Conference on Machine Learning, Chemnitz, Germany, 1998. Towards Hybrid Neural Learning Internet Agents 173 20. T. Joachims, D. Freitag, and T. Mitchell. Webwatcher: A tour guide for the world wide web. In Fifteenth International Joint Conference on Artificial Intelligence, Nagoya, Japan, 1997. 21. M. I. Jordan. Attractor dynamics and parallelism in a connectionist sequential machine. In Proceedings of the Eighth Conference of the Cognitive Science Society, pages 531–546, Amherst, MA, 1986. 22. S. Kaski, T. Honkela, K. Lagus, and T. Kohonen. WEBSOM - self-organizing maps of document collections. Neurocomputing, 21:101–117, 1998. 23. T. Kohonen. Self-Organization and Associative Memory. Springer, Berlin, third edition, 1989. 24. T. Kohonen. Self-Organizing Maps. Springer, Berlin, 1995. 25. T. Kohonen. Self-organisation of very large document collections: State of the art. In Proceedings of the International Conference on Ariticial Neural Networks, pages 65–74, Skovde, Sweden, 1998. 26. S. Lawrence and C. L. Giles. Searching the world wide web. Science, 280:98–100, 1998. 27. D. D. Lewis. Reuters-21578 text categorization test collection, 1997. http://www.research.att.com/˜lewis. 28. R. Liere and P. Tadepalli. The use of active learning in text categorisation. In AAAI Spring Symposium on Machine Learning in Information Access, Stanford, CA, 1996. 29. T. Lin, B. G. Horne, P. Tino, and C. L. Giles. Learning long-term dependencies in NARX recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338, November 1996. 30. F. Menczer, R. Belew, and W. Willuhn. Artificial life applied to adaptive information agents. In Proceedings of the 1995 AAAI Spring Symposium on Information Gathering from Heterogeneous, Distributed Environments, 1995. 31. D. Michie, D. J. Spiegelhalter, and C. C. Taylor, editors. Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York, 1994. 32. R. Miikkulainen. Subsymbolic Natural Language Processing. MIT Press, Cambridge, MA, 1993. 33. T. M. Mitchell. Machine Learning. WCB/McGraw-Hill, New York, 1997. 34. K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Learning to classify text from labeled and unlabeled documents. 
In Proceedings of the National Conference on Artificial Intelligence, Madison, WI, 1998. 35. R. Papka, J. P. Callan, and A. G. Barto. Text-based information retrieval using exponentiated gradient descent. In Advances in Neural Information Processing Systems, volume 9, Denver, CO, 1997. MIT Press. 36. M. Perkowitz and O. Etzioni. Adaptive web sites: an AI challenge. In International Joint Conference on Artificial Intelligence, Nagoya, Japan, 1997. 37. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986. 38. H. T. Siegelmann, B. G. Horne, and C. L. Giles. Computational capabilities of recurrent NARX neural networks. Technical Report CS-TR-3408, University of Maryland, College Park, 1995. 39. M. Spiliopoulou, L. C. Faulstich, and K. Winkler. A data miner analyzing the navigational behavior of web users. In ACAI-99 Workshop on Machine Learning in User Modeling, Crete, July 1999. 40. J.P.F. Sum, W.K. Kan, and G.H. Young. A note on the equivalence of NARX and RNN. Neural Computing and Applications, 8:33–39, 1999. 41. R. Sun. Integrating Rules and Connectionism for Robust Commonsense Reasoning. Wiley, New York, 1994. 174 S. Wermter, G. Arevian, and C. Panchev 42. R. Sun and T. Peterson. Multi-agent reinforcement learning: Weighting and partitioning. Neural Networks, 1999. 43. R. S. Sutton and A. G. Barto. Reinforcement Learning: an Introduction. MIT Press, Cambridge, MA, 1998. 44. G. Tecuci. Building Intelligent Agents: An Apprenticeship Multistrategy Learning Theory, Methodology, Tool and Case Studies. Academic Press, San Diego, 1998. 45. S. Wermter. Hybrid Connectionist Natural Language Processing. Chapman and Hall, Thomson International, London, UK, 1995. 46. S. Wermter, C. Panchev, and G. Arevian. Hybrid neural plausibility networks for news agents. In Proceedings of the National Conference on Artificial Intelligence, pages 93–98, Orlando, USA, 1999. A Connectionist Simulation of the Empirical Acquisition of Grammatical Relations William C. Morris1 , Garrison W. Cottrell1 , and Jeffrey Elman2 1 Computer Science and Engineering Department University of California, San Diego 9500 Gilman Dr., La Jolla CA 92093-0114 USA 2 Center for Research in Language Department of Cognitive Science University of California, San Diego 9500 Gilman Dr., La Jolla CA 92093-0114 USA Abstract. This paper proposes an account of the acquisition of grammatical relations using the basic concepts of connectionism and a construction-based theory of grammar. Many previous accounts of firstlanguage acquisition assume that grammatical relations (e.g., the grammatical subject and object of a sentence) and linking rules are universal and innate; this is necessary to provide a first set of assumptions in the target language to allow deductive processes to test hypotheses and/or set parameters. In contrast to this approach, we propose that grammatical relations emerge rather late in the language-learning process. Our theoretical proposal is based on two observations. First, early production of childhood speech is formulaic and becomes systematic in a progressive fashion. Second, grammatical relations themselves are family-resemblance categories that cannot be described by a single parameter. This leads to the notion that grammatical relations are learned in a bottom up fashion. 
Combining this theoretical position with the notion that the main purpose of language is communication, we demonstrate the emergence of the notion of “subject” in a simple recurrent network that learns to map from sentences to semantic roles. We analyze the hidden layer representations of the emergent subject, and demonstrate that these representations correspond to a radially–structured category. We also claim that the pattern of generalization and undergeneralization demonstrated by the network conforms to what we expect from the data on children’s generalizations. 1 Introduction Grammatical relations are frequently a problem for language acquisition systems. In one sense they represent the most abstract aspect of language; subjects transcend all semantic restrictions – virtually any semantic role can be a subject. While semantics is seen as being related to world-knowledge, syntax is seen as existing on a distinct plane. For this reason there are language theories in which grammatical relations are considered the most fundamental aspect of language. S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 175–193, 2000. c Springer-Verlag Berlin Heidelberg 2000 176 W.C. Morris, G.W. Cottrell, and J. Elman One approach to learning syntax has been to relegate grammatical relations and their behaviors to the “innate endowment” that each child is born with. There are a number of theories of language acquisition (e.g., [4, 27, 46, 47]) that start with the assumption that syntax is a separate component of language, and that the acquisition of syntax is largely independent of semantic considerations. Accordingly, in these theories there is an innate, skeletal syntactic system present from the very beginning of multiword speech. Acquiring syntax consists of modifying and elaborating the skeletal system to match the target language. This assumption of innate syntax inevitably leads to a problem, sometimes referred to as the “bootstrapping problem”. How does one start this “purely syntactic” analysis? How does one start making initial assignments of words to grammatical relations (i.e., subject, object, etc.)? A commonly proposed mechanism involves the child tentatively assigning nominals to grammatical relations based on their semantic content by linking rules1 (e.g., Pinker [46, 47]). This implies that these grammatical relations and linking rules are present at the very beginning of the learning process. One problem with this approach is that cross-linguistically the behaviors of grammatical relations differ too much to be accommodated by a single system. Proposals have been put forward [36, 47] that a single parameter with a binary value (“accusative” or “ergative”) is sufficient to account for the extant grammatical systems. This has been shown to be inadequate [29, 40, 51] because there are languages that have neither strictly accusative nor strictly ergative syntax. We propose a language acquisition system that does not rely on innate linguistic knowledge [40]. The proposal is based on Construction Grammar [24, 25] and on the learning mechanisms of PDP-style connectionism [50]. We have hypothesized that abstractions such as “subject” emerge through rote learning of particular constructions, followed by the merging of these “mini-grammars”. The claim is that in using this sort of a language acquisition system it is possible for a child to learn grammatical relations over time, and in the process accommodate to whatever language-specific behaviors his target language exhibits. 
Here we present a preliminary study showing that a neural net that is trained with the task of assigning semantic roles to sentence constituents can acquire grammatical relations. We have demonstrated this in two ways: by showing that this network associates particular subjecthood properties with the appropriate verb arguments, and by showing that the network has gone some distance toward abstracting this nominal away from its semantic content. In the following, we first review the ways in which the grammatical relation “subject” appears in several languages. This gives rise to the notion that grammatical relations do not have, for example, only two patterns of ways in which they control (in the linguistic sense) other categories. Rather, grammatical re1 Linking rules are heuristics (or algorithms, depending on the theory) for making provisional assignments of verb arguments to grammatical relations. The criteria for the assignments are semantic. Because virtually any semantic role can be a subject, the algorithmic variants of these theories are quite complicated. For a recent treatment of linking rules, see Dowty [21]. A Connectionist Simulation of the Empirical Acquisition 177 lations exhibit a variety of patterns of control over syntactic properties. This suggests it would be difficult for the subject relation to be described by a binary innate parameter. Next, we review relevant developmental data on the acquisition of syntax. The evidence we review suggests that 1) syntax is acquired in a bottom up, data-driven fashion, and 2) that there are specific patterns of overand under- generalization that reflect the nature of the linguistic input to the child. We then review the theory proposed by Morris [40] based on this data. Finally, we present a connectionist simulation of one stage of the theory, and demonstrate that the system acquires a notion of “subject” without any innate bias to do so. 2 The Shape of Grammatical Relations While a number of theorists have explored the real complexity of grammatical relations (e.g., [19, 20, 23, 29, 51]), there remains a perception among some theorists (e.g., [34, 35, 36]) that grammatical relations are essentially a binary phenomenon: grammatical relations are deemed to be either accusative or ergative, and hence an “ergative parameter” determines the behaviors. This has been the prevailing view in a number of language acquisition theories [47]. A first-order approximation of the difference between accusative and ergative grammatical relations is that the subject of a syntactically accusative language is typically the agent of an action, while in a syntactically ergative language the “subject”,2 or subject-like grammatical relation, is typically the patient of an action. One potential distinguishing property (indicative, though not decisive) would be which nominal in a sentence controls clause coordination. Thus in the sentence, Max hit Larry and ran away, who ran away? In a strongly syntactically accusative language, it is Max that ran away; in a strongly syntactically ergative language, it is Larry that ran away. For those who regard the accusative/ergative split as being simply binary, the problem becomes merely identifying the subject. If the subject is the agent, then the language is accusative, if it is the patient, it is ergative. But the problem is not that simple. It is not merely the identity of the subject that is the issue, but what properties do the various grammatical relations control? 
In some sense, the question is what “shape” do the grammatical relations in a language take on? We have examined the literature to find the syntactic properties that are associated with subjects cross-linguistically. Perhaps the definitive work in this area is Keenan [29], from which we have extracted a set of six properties that are capable of being associated with subjects (and quasi-subjects) cross-linguistically: 2 Because of the associations with accusative phenomena carried by the term “subject” in a number of theoretical approaches, one might wish to call the primary grammatical relation in syntactically ergative languages something else. The term “pivot” has been used. 178 W.C. Morris, G.W. Cottrell, and J. Elman 1. Addressee of imperatives. 2. Control of reflexivization. E.g., Max shaved himself. (The controller of the reflexive is the subject.) 3. Control of coordination. E.g., Max pinched Lola and fled. (The deleted argument of the second clause is coreferential with the subject of the first clause.) 4. Target of equi-NP deletion. E.g., Max convinced Lola to be examined by the doctor. Max convinced the doctor to examine Lola. (The deleted argument of the embedded clause is the subject.) 5. Ability to launch floating quantifiers. E.g., The boys could all hear the mosquitoes. (The quantifier all refers to the subject, i.e., boys, rather than to the object, i.e., mosquitoes.) 6. Target of relativization deletion. E.g., I know the man who saw Max. I know the man who Max saw. In English the last item is a free property; any nominal that is coreferential with the relativized matrix nominal can be deleted in the embedded clause in relativization. The examples demonstrate two of the cases. The grammatical relations of various languages control various combinations of these (and other) properties. This is what we mean by the “shape” of grammatical relations. We have analyzed these syntactic properties in English and in two other languages, Dyirbal (Australian) [18, 20] and Kapampangan (Philippine), which have rather different constellations of properties from those of English, as well as from each other [40]. Grammatical relations in these languages have shown interesting patterns of behavior. For example, in English the first five of these properties are controlled by the subject, the last is a “free property”, not controlled by any grammatical relation. In Dyirbal, properties 3, 4, & 6 are controlled by an “ergative subject”, or “pivot” [18, 20]. In Kapampangan, one grammatical relation (which tends to be the agent) controls properties 1, 2, & 3, while another (which ranges over all semantic roles) controls properties 5 & 6. Property 4 can be controlled by either of the grammatical relations. Hence English is a highly syntactically-accusative language, Dyirbal is a highly syntactically-ergative language, and Kapampangan appears to be a split language, neither highly ergative nor highly accusative in syntax. This is discussed at some length in [40], but as these languages do not bear directly on the present simulation, we will simply note that this issue is addressed in both the theoretical proposal and in our long-term goals. Our purpose for raising the issue here is to argue that for a language acquisition to be “universal”, i.e., capable of learning any human language, it must be able to accommodate a variety of language types. Simply settling on the identity of the subject is not sufficient. Rather, the various control patterns (“shapes”) described above must be accommodated. 
Our proposal involves a system that can learn a variety of shapes. A Connectionist Simulation of the Empirical Acquisition 3 179 Review of Data from Psycholinguistic Studies There are several avenues of psycholinguistic data that we have explored. One of these is the issue of early abstraction vs. rote behavior. There have been a number of studies that have indicated that children’s earliest multiword utterances have been largely rote or semi-rote behaviors [1, 2, 11, 12, 13, 44, 45, 53]. In a pair of studies Tomasello and Olguin showed an asymmetry between the relative facility with which two-year-old children can manipulate nouns, both in terms of morphology and syntax, and the relative difficulty with which they handle verbs. Tomasello & Olguin [55] demonstrated their productivity with nouns, while Olguin & Tomasello [43] showed their relative nonproductivity with verbs. It appears that the control that children have over verbs very early in the multiword stage is largely rote; there is no systematic relationship between them. That is, there is little or no transfer from knowledge of one verb to the next. There have been a number of studies [26, 41, 42, 56, 57] that have been interpreted as providing evidence of early abstraction. There are several problems with the interpretations of these studies, however. Some of these have interpreted arguably rote behaviors as representing abstraction [54], and others have interpreted small-scale systematic behavior as large scale systematic behavior [3, 15, 44, 45]). That is, it was found that certain systematic behaviors were limited to semantically similar predicates. Despite the fact that an individual child’s developing grammar is a quickly moving target, the issues of systematic and non-systematic behaviors can in certain instances be teased out. Indications of systematic behaviors can be seen in overgeneralization, and indications of the limits of systematic behaviors can be seen in undergeneralization. In numerous studies, Bowerman [5, 6, 7, 8, 9, 10] has investigated instances of overgeneralization in child speech; overgeneralization is the phenomenon of extending rules inappropriately. For example, children exposed to English learn the “lexical causative” alternation, as in the ball rolled∼Larry rolled the ball, and the vase broke∼Max broke the vase. Children inappropriately extend this alternation to verbs such as giggle or sweat to produce such sentences as Don’t giggle me or It always sweats me [9]. Overgeneralizations of this sort are evidence that the child has developed the notion of a class of verbs, such as roll, float, break, sweat, giggle, and disappear, which share a semantic role (patient) in their intransitive forms, and that the child is willing to treat them the same syntactically. The fact that this is inappropriate for the word sweat means that the child is extremely unlikely to have heard this usage before, therefore the child has used systematic behavior to produce this utterance. Another of Bowerman’s studies [10] involved the overgeneralization of linking rules. Children rearranged verb-argument structures in accordance with a linking rule generalization rather than in accordance with some presumed verb-class alternation (e.g., I saw a picture which enjoyed me.). Of particular note here is the timing of these, and other, overgeneralizations. Most of the overgeneralizations that Bowerman has studied, including the lexical causative overgeneralization discussed above, appear starting between two 180 W.C. Morris, G.W. 
Cottrell, and J. Elman and a half and three and a half years of age. The linking rule overgeneralizations started appearing after the age of 6. The former overgeneralizations are presumably learned behaviors—the child must learn what sorts of verb classes exist in a language and what alternations are associated with them before these overgeneralizations can occur. On the other hand, according to many nativist theories, linking rules are innate [46, 47]. Furthermore, linking rules must be active very early in multi-word speech in order for the first tentative assignments of nouns to grammatical relations to be made, a necessary step in breaking into the syntactic system. Yet the overgeneralizations ascribable to linking rules do not appear until the age of six years or later. If we can judge by overgeneralization, it would appear that linking rules are not innate; at the very least it appears that they are not active at a time when they are most needed, i.e., early in multi-word speech. The alternative is that they are not necessary precursors to multiword speech. Rather, they are highly abstract generalizations that first give evidence of existence after a large portion of the grammar of a language has been mastered. Undergeneralization, too, has a role to play in determining the nature of the learning mechanisms. A number of studies have been conducted showing an interesting asymmetry in the learning of the passive construction in English. A study by Maratsos, Kuczaj, Fox, & Chalkley [38] showed that four- and fiveyear-old children could understand both the active and passive voices of action verbs (e.g., drop, hold, shake, wash), but had difficulty understanding the passive voices of psychological or perceptual verbs (e.g., watch, know, like, remember). Maratsos, Fox, Becker, & Chalkley [37] showed that this difficulty appeared to extend until the age of 10. Another studies by de Villiers et al. [17] confirmed the comprehension asymmetry between the two types of verbs, while a study by Pinker et al. [48] showed a similar asymmetry in production. In a preliminary study Maratsos et al. [37] also showed that parental input to children was limited in a similar way: parents used few, if any, experiential verbs in the passive voice. 3 This study is particularly interesting because a common notion of the passive is that its relationship to the active voice is defined in terms of subjects and objects. Whether or not this is true in an adult, it appears that this is not the way that children learn this alternation. It seems that children first acquire this systematic alternation in a semantically-limited arena, in which the active-voice patient is promoted to the passive “subject”. Only later do they extend it to a more “semantically abstract” arena in which it is the active-voice object that is promoted to the subject position. 3 The few experiential verbs that they did find in the passive voice in parental input were of the percept-experiencer type (e.g., frighten, surprise) rather than the experiencer-percept type (e.g., fear, like). Maratsos et al. did not test the children for their comprehension of percept-experiencer verbs. A Connectionist Simulation of the Empirical Acquisition 4 181 A Theoretical Proposal We wish to test a proposal put forward in Morris [40], which describes an approach to learning grammatical relations without recourse to innate, domainspecific, linguistic knowledge. 
This model is based on (i) the Goldberg variation of Construction Grammar [24, 25], and (ii) the learning mechanisms of connectionism [50], inter alia. The proposal is that the acquisition of grammatical relations occurs as a three-stage process. In the first stage a child learns verb argument structures as separate, individual “mini-grammars”. This word is used to emphasize that there are no overarching abstractions that link these individual argument structures to other argument structures. Each argument structure is a separate grammar unto itself. In the second stage the child develops correspondences between the separate mini-grammars; initially the correspondences are based on both semantic and syntactic similarity, later the correspondences are established on purely syntactic criteria. The transition is gradual, with the role that semantics plays decreasing slowly. For example, the verbs eat and drink are quite similar to each other, and will “merge” quickly into a larger grammar. Similarly, the verbs hit and kick will merge early, since their semantics and syntax are similar. While all four of these verbs have agents and patients as verb arguments, there are many semantic differences between the verbs of ingestion and the verbs of physical assault, therefore the merge between these two verb groups will occur later in development. Ultimately, these agent-patient verbs will merge with experiencer-percept verbs (e.g., like, fear, see, remember), percept-experiencer verbs (e.g., please, frighten, surprise), and others, yielding a prototypical transitive construction with an extremely abstract argument structure. The verb-arguments in these abstract argument structures can be identified as “A”, the transitive actor, and “O”, transitive patient (or “object”). In addition there is prototypical intransitive argument structure with a single argument, “S”, the intransitive “subject”. (This schematic description is due to Dixon [19].) In the third stage, the child begins to associate the abstract arguments of the abstract transitive and intransitive constructions with the coindexing constructions that instantiate the properties of, for example, clause coordination, control structures, and reflexivization. So, for example, an intransitive-to-transitive coindexing construction will associate the S of an intransitive first clause with the deleted co-referent A of a transitive second clause. This will enable the understanding of a sentence like Max arrived and hugged everyone. Similarly, a transitive-to-intransitive coindexing construction will associate the A of an initial transitive clause with the S of a following intransitive clause; this will enable the understanding of a sentence like Max hugged Annie and left. Since this association takes place relatively late in the process, necessarily building on layers of abstraction and guided by input, the grammatical relations (of which S, A, and O are the raw material) “grow” naturally into the languageappropriate molds. 182 W.C. Morris, G.W. Cottrell, and J. Elman From beginning to end this is a usage-based acquisition system. It starts with rote-acquisition of verb-argument structures, and by finding commonalities, it slowly builds levels of abstraction. Through this bottom-up process, it accommodates to the target language. (For other accounts of usage based systems, see also Bybee [14] and Langacker [31, 32, 33].) 
5 A Connectionist Simulation

In this section we present a connectionist simulation to test whether a network could build abstract relationships corresponding to “subjects” and “objects” given an English-like language with a variety of grammatical constructions. This was done in such a way that there is no “innate” knowledge of language in the network. In particular, there are no architectural features that correspond to “syntactic elements”, i.e., no grammatical relations, no features that facilitate word displacement, and so forth. The main assumptions are that the system can process sequential data, and that it is trying to map sequences of words to semantic roles. The motivation behind the network is the notion that merely the drive to map input words to output semantics is sufficient to induce the necessary internal abstractions to facilitate the mapping. To test this hypothesis, a Simple Recurrent Network [22] was created and tested using the Stuttgart Neural Network Simulator (SNNS). The network is shown in Figure 1. The network takes in a sequence of patterns representing sentences generated from a grammar. At each time step, a word or end-of-sentence marker is presented. After each sentence, an input representing “reset” is presented, for which the network is supposed to zero out the outputs. The output patterns represent semantic roles in a slot-based representation. The teaching signal for the roles is given as targets starting from the first presentation of the corresponding filler word, and then held constant throughout the rest of the presentation of the sentence.

Fig. 1. Network architecture.

The input vocabulary consists of 56 words (plus end of sentence and reset), represented as 10-bit patterns, with 5 bits on and 5 bits off. Of these 56 words, 25 are verbs, 25 are nouns, and the remaining 6 are a variety of function words. All of the nouns are proper names. Of the verbs, 5 are unergative (intransitive, with agents as the sole arguments, e.g., run, sing), 5 are unaccusative (intransitive, with patient arguments, e.g., fall, roll), 10 are “action” transitives (with agent & patient arguments, e.g., hit, kick, tickle), and 5 are “experiential” transitives (with experiencer & percept arguments, e.g., see, like, remember). In addition there is a “matrix verb”, persuade, which is used for embedded sentence structures. The 5 remaining words are who, was, by, and, and self. The output layer is divided into 6 slots that are 10 units wide. The first slot is the verb identifier; the second through the fifth are the identifiers for the agent, the patient, the experiencer, and the percept. (Note that at most two of these four slots should be filled at one time.) The sixth slot is the “matrix agent” slot, which will be explained below. The representation of the fillers is unrelated to the representation of the words: the slot fillers only have 2 bits set out of 10. Hence the network cannot just copy the inputs to the slots. Using the back-propagation learning procedure [49] the network was taught to assign the proper noun identifier(s) to the appropriate role(s) for a number of sentence structures. Thus for the sentence, Sandy persuaded Kim to kiss Larry, the matrix agent role is filled by Sandy, the agent role is filled by Kim, and the patient role is filled by Larry. In the sentence, Who did Larry see, the experiencer role is filled by Larry and the percept role is filled by who.
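To make the input and output coding concrete, the sketch below (Python with NumPy) shows one way the 10-bit word patterns, the 2-of-10 slot fillers, and the six-slot targets described above could be generated. It is a minimal illustration of the coding scheme only, not the original SNNS setup; the placeholder word list, helper names, and random seed are our own assumptions.

import numpy as np

rng = np.random.default_rng(0)

def random_pattern(width, bits_on):
    """Random binary pattern of length `width` with exactly `bits_on` ones
    (5-of-10 for input words, 2-of-10 for output slot fillers)."""
    p = np.zeros(width, dtype=np.float32)
    p[rng.choice(width, size=bits_on, replace=False)] = 1.0
    return p

# Hypothetical lexicon stand-in: 56 words plus end-of-sentence and reset markers.
vocab = [f"word{i}" for i in range(56)] + ["<eos>", "<reset>"]
input_codes = {w: random_pattern(10, 5) for w in vocab}    # what the SRN sees
filler_codes = {w: random_pattern(10, 2) for w in vocab}   # unrelated 2-of-10 codes

# The 60-unit output layer: six 10-unit slots.
SLOTS = ["verb", "agent", "patient", "experiencer", "percept", "matrix_agent"]

def target_vector(role_fillers):
    """Build the target: each named slot holds its filler's 2-of-10 code, the rest
    stay zero. The same target is presented from the filler word's first
    appearance onward."""
    out = np.zeros(len(SLOTS) * 10, dtype=np.float32)
    for slot, word in role_fillers.items():
        i = SLOTS.index(slot)
        out[i * 10:(i + 1) * 10] = filler_codes[word]
    return out

# E.g., a target for "Sandy persuaded Kim to kiss Larry", with names mapped to
# placeholder vocabulary items:
t = target_vector({"verb": "word0", "matrix_agent": "word30",
                   "agent": "word31", "patient": "word32"})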
Training was conducted for 50 epochs, with 10,000 sentences in each epoch. The learning rate was 0.2, with initial weights set within a range of 1.0. There was no momentum. Examples of the types of sentences and their percentage in the training set are listed below:

1. Simple declarative intransitives (18%). E.g., Sandy jumped (agent role) and Sandy fell (patient role).
2. Simple declarative transitives (26%). E.g., Sandy kissed Kim (agent and patient roles) and Sandy saw Kim (experiencer and percept roles).
3. Simple declarative passives (6%). E.g., Sandy was kissed (patient role).
4. Questions (20%). E.g., Who did Sandy kiss? (agent and patient roles, object is questioned), Who kissed Sandy? (agent and patient roles, subject is questioned), Who did Sandy see? (experiencer and percept roles, object is questioned), and Who saw Sandy? (experiencer and percept roles, subject is questioned).
5. Control (equi-NP) sentences (25%). E.g., Sandy persuaded Kim to run (matrix agent and agent roles), Sandy persuaded Kim to fall (matrix agent and patient roles), Sandy persuaded Kim to kiss Max (matrix agent, agent, and patient roles), and Sandy persuaded Kim to see Max (matrix agent, experiencer, and percept roles).
6. Control (equi-NP) sentences with questions (6%). E.g., Who did Sandy persuade to run/fall? (questioning embedded subject, whether agent or patient, of an intransitive verb), Who persuaded Sandy to run/fall? (questioning matrix agent; note embedded intransitive verb), Who persuaded Sandy to kiss/see Max? (questioning matrix agent; note embedded transitive verb), and Who did Sandy persuade to kiss Max? (questioning embedded agent).

The generalization test involved two systematic gaps in the data presented to the network; both involved experiential verbs. The first was passive sentences with experiential verbs, e.g., Sandy was seen by Max. The second involved questioning embedded subjects in transitive clauses with experiential verbs, e.g., Who did Sandy persuade to see Max? Neither of these sentence types occurred with experiential verbs in the training set. The test involved probing these gaps. The network was not expected to generalize over these two systematic gaps in the same way. The questioning-of-embedded-subjects gap is part of an interlocking group of constructions which “conspire” to compensate for the gap. The “members of the conspiracy” are the transitive sentences (group 2 above), the questions (group 4), and the control sentences (group 5). These sentences are related to each other, and they should cause the network to treat the agents of action verbs and the experiencers of experiential verbs the same. Thus we believe that this gap, which is unattested in parental input, should show some generalization. Our explanation in terms of construction conspiracies would then be the basis for our explanation of many of the overgeneralizations that occur in children. Meanwhile, the passive gap has no such compensating group of constructions. Only the transitive sentences (group 2) provide support for the passive generalization. This gap corresponds to one that actually exists in parental input. If our model is a good one, we would expect that it should not bridge this gap.
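To make the training mixture and the two held-out gaps concrete, the following sketch samples sentence frames in roughly the stated proportions while withholding experiential passives and experiential embedded-subject questions. The verb and name lists are illustrative stand-ins (verb morphology is ignored), not the grammar actually used in the study.

import random

rng = random.Random(0)

# Approximate sentence-type mixture from the text (fractions of the training set).
SENTENCE_TYPES = {"intransitive": 0.18, "transitive": 0.26, "passive": 0.06,
                  "question": 0.20, "control": 0.25, "control_question": 0.06}

# Illustrative stand-ins for the verb classes and proper names.
ACTION_TRANS = ["kiss", "kick", "tickle", "pinch"]
EXPER_TRANS = ["see", "like", "remember", "fear"]
INTRANS = ["jump", "sing", "fall", "roll"]
NAMES = ["Sandy", "Kim", "Max", "Larry"]

def sample_sentence():
    kind = rng.choices(list(SENTENCE_TYPES),
                       weights=list(SENTENCE_TYPES.values()))[0]
    a, b, c = rng.sample(NAMES, 3)
    if kind == "intransitive":
        return f"{a} {rng.choice(INTRANS)}"
    if kind == "transitive":
        return f"{a} {rng.choice(ACTION_TRANS + EXPER_TRANS)} {b}"
    if kind == "passive":
        # Gap 1: passives are generated from action verbs only.
        return f"{a} was {rng.choice(ACTION_TRANS)}"
    if kind == "question":
        return f"who {rng.choice(ACTION_TRANS + EXPER_TRANS)} {a}"
    if kind == "control":
        return f"{a} persuaded {b} to {rng.choice(ACTION_TRANS + EXPER_TRANS)} {c}"
    # Gap 2: questioned embedded subjects are generated with action verbs only.
    return f"who did {a} persuade to {rng.choice(ACTION_TRANS)} {b}"

corpus = [sample_sentence() for _ in range(10_000)]  # roughly one epoch of data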
6 Results

In Table 1 we show the results of testing a variety of constructions, some forms of which were trained, and two were not. Five hundred sentences of each listed type were tested. The results were computed using Euclidean distance decisions: each field in the output vector was compared with all possible field values (including the all-zeroes vector), and the fields assigned the nearest possible correct value. For a sentence to be “correct” all of the output fields had to be correct.

Table 1. Sentence comprehension using Euclidean distance decisions.
  Simple active clauses, action verbs: 97.6%
  Simple active clauses, experiential verbs: 97.6%
  Simple passive clauses, action verbs: 91.8%
  Simple passive clauses, experiential verbs: 6.2%
  Control (equi-NP) structures: 83.6%
  Questioning embedded subjects, action verbs: 91.4%
  Questioning embedded subjects, experiential verbs: 67.4%

The two salient lines are for simple passive clauses with experiential verbs, which had a 6.2% success rate, and questioning embedded subjects with experiential verbs, which had a 67.4% success rate. The near complete failure of generalization for simple passive clauses with experiential verbs showed that the nonappearance of experiential verbs in the passive voice in the training set caused the network to learn the passive voice as a semantically narrow alternation. This is similar to the undergeneralization found by Maratsos et al. [37, 38], discussed above. This gap, as mentioned earlier, has been shown [37] to be one that actually exists in parental input to children. On the other hand, the questioning of embedded subjects with experiential verbs, which likewise did not appear in the training set, showed much greater generalization, in all likelihood because there is a “conspiracy of syntactic constructions” surrounding this gap. As mentioned above, the simple transitive clauses, questioned simple clauses, and control sentences were the prime “conspirators”. Simple transitive clauses established the argument structures for both the agent-patient verbs and the experiencer-percept verbs:
– Roger kissed Susie. (agent–patient argument structure)
– Linda saw Pete. (experiencer–percept argument structure)
Questioned simple clauses established the ability to question the subjects of both argument structures:
– Who pinched Sandy? (questioned agent)
– Who remembered Max? (questioned experiencer)
Control sentences established embedded clauses for both argument structures:
– Fred persuaded Ian to tickle Lynn. (embedded agent–patient argument structure)
– Fred persuaded Sam to hate Terry. (embedded experiencer–percept argument structure)
Questioning embedded agents established the relevant pattern, including the fronting of the embedded, questioned constituent:
– Who did Raul persuade to tickle Sally? (embedded questioned agent)
The interlocking patterns above led to extension of this last pattern to experiencer–percept verbs. The passive gap has no such compensating group of constructions. Only the transitive sentences (group 2) provided support for the passive generalization; as we shall see, these were insufficient to bridge the gap. Simple transitive clauses established the similarity of argument structures:
– Sally tickled Jack. (agent–patient argument structure)
– Jack liked Sally. (experiencer–percept argument structure)
Simple intransitive clauses established patients as subjects:
– Susie fell. (patient–only argument structure)
Passive sentences, which only occurred with agent–patient verbs, established an alternation between active–voice agent–patient argument structures and passive–voice patient–only argument structures with the same verbs:
– Jack was tickled. (patient–only argument structure with a verb that is seen in the active voice)
The gap of the questioned–embedded–experiencer was overcome because there was a sufficient number of overlapping constructions and there was a well-established precedent of experiencer subjects. As a result we are seeing a level of abstraction, with the network able to “define”, in some sense, the gap in terms of the embedded subject rather than merely an embedded agent. In order for the gap of the passive voice for experiencer–percept verbs to be overcome there would have to have been an established precedent of percept-subjects. There were none. There were no percept–only verbs in the data set; indeed, there are arguably no percept–only verbs in English. The gap of the passive voice for experiencer–percept verbs was not overcome because there was an insufficient number of overlapping constructions, and because there was no precedent of percept-subjects in the data set.

6.1 Analysis of Representations in the Hidden Layer

We wanted to probe the way that the network represented subjects internally, i.e., in the hidden layer. This was done by creating and comparing “subject-variance vectors” for combinations of verb classes and syntactic constructions. Subject-variance vectors are vectors representing the variance of the hidden layer units when only the subject is varied. This should show where the subject is being encoded in the hidden layer. Creating the variance vectors is a three-step process. To construct these vectors, we presented the network with 25 sentences varying only in their subject. We saved the 120 hidden unit activations at the end of each presentation, and computed the variance on a per-unit basis. The variances so computed should then represent “where” the subject is being encoded for that verb/construction combination. Next we compared the subject-variance vectors within a verb class to each other. An average subject-variance vector was computed for each verb class (for a given construction); this represented the “prototype” subject representation for the verb class. To test how tightly associated the representations of the subjects of these verb-construction classes were, we computed the average Euclidean distance from the prototype to each of the members of the class. For unaccusative (patient-only) and unergative (agent-only) verbs in simple clauses and in embedded clauses the average distances were about 0.5. For transitive verbs, both agent-patient and experiencer-percept verbs, in simple clauses and in embedded clauses, the averages were about 0.3. For passive-voice agent-patient verbs the average was about 0.4. (The fact that intransitive verb-construction combinations have “less disciplined”, i.e., less tightly associated, subject representations may be explained by the fact that these verbs have only a single verb argument. The network need not “remember” two verb arguments simultaneously; it can therefore be profligate in the manner of the storage of the subject’s identity.) The third step involved looking at the distances between the prototypes. This allowed us to see how similar prototypes were. The results of our comparisons are shown in Table 2, where we use “Intransitive Agent” for unergative subjects (e.g., Sandy jumped) and “Intransitive Patient” for unaccusative subjects (e.g., Sandy fell).

Table 2. Euclidean distances between prototype subject variance vectors.
  Simple clauses vs. other simple clauses:
    Transitive Agent – Intransitive Agent: 0.69
    Transitive Agent – Transitive Experiencer: 0.69
    Intransitive Agent – Transitive Experiencer: 0.82
    Transitive Experiencer – Intransitive Patient: 1.15
    Intransitive Agent – Intransitive Patient: 1.18
    Transitive Agent – Intransitive Patient: 1.40
  Simple clauses vs. their embedded counterparts:
    Intransitive Patient – Embedded Intransitive Patient: 0.70
    Intransitive Agent – Embedded Intransitive Agent: 0.73
    Transitive Experiencer – Embedded Transitive Experiencer: 0.77
    Transitive Agent – Embedded Transitive Agent: 0.81
  Embedded clauses vs. other embedded clauses:
    Embedded Transitive Agent – Embedded Transitive Experiencer: 0.57
    Embedded Transitive Agent – Embedded Intransitive Agent: 0.67
    Embedded Transitive Experiencer – Embedded Intransitive Agent: 0.78
    Embedded Transitive Experiencer – Embedded Intransitive Patient: 1.07
    Embedded Intransitive Agent – Embedded Intransitive Patient: 1.07
    Embedded Transitive Agent – Embedded Intransitive Patient: 1.27
  Active voice vs. passive voice:
    Intransitive Patient – Passive Voice Patient: 0.72
    Transitive Experiencer – Passive Voice Patient: 1.27
    Intransitive Agent – Passive Voice Patient: 1.37
    Transitive Agent – Passive Voice Patient: 1.51
    Passive Voice Percept – Passive Voice Patient: 0.38

In general, one can think of distances less than 1.0 as “close” (although none are as close as the within-class distances mentioned above) and distances greater than 1.0 as “far”. With this in mind, Table 2 shows that there are interesting relationships between the instantiations of subjects in various verb-and-construction groups (recall that all the entries in the Table correspond to subjects). First, considering only the simple clauses, we see that the entries divide into two distinct groups. The intransitive patients are relatively far from the other classes, while the agents and experiencers tend to pattern together. To understand why, consider that in transitive constructions with agent-patient verbs, both agents and patients must be present. Therefore the two semantic roles must be stored simultaneously, and thus their representations must be in somewhat different units. Agents, whether transitive or intransitive, will most likely be represented by the same set of units. Note that experiencers never need to be stored simultaneously with agents. Therefore their representation can overlap agent-subjects much more than can the representations of patient-subjects. Then the question is why experiencers pattern more with agents than with patients. We believe this is because the agent-subjects are simply the most frequent in the training set, and thus have a primacy in “carving out” the location for subjects in the hidden layer. This is also consistent with many linguistic theories where agents are considered the prototypical subjects.
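The three-step variance analysis described above can be summarized in a few lines of NumPy. The sketch below assumes the final hidden activations have already been collected (25 subject-varied sentences by 120 hidden units for each verb/construction combination); the function names are ours and this is an illustration of the analysis, not part of the original SNNS experiments.

import numpy as np

def subject_variance_vector(hidden_states):
    """Step 1. hidden_states: (25, 120) array of final hidden activations for 25
    sentences that differ only in their subject. The per-unit variance indicates
    where in the hidden layer the subject is being encoded."""
    return hidden_states.var(axis=0)

def prototype(variance_vectors):
    """Step 2a. Average subject-variance vector for one verb class within one
    construction: the 'prototype' subject representation."""
    return np.mean(variance_vectors, axis=0)

def within_class_spread(variance_vectors):
    """Step 2b. Mean Euclidean distance from the prototype to each member
    (about 0.3 to 0.5 in the study, depending on verb class)."""
    proto = prototype(variance_vectors)
    return float(np.mean([np.linalg.norm(v - proto) for v in variance_vectors]))

def prototype_distance(class_a, class_b):
    """Step 3. Euclidean distance between two class prototypes
    (the quantity tabulated in Table 2)."""
    return float(np.linalg.norm(prototype(class_a) - prototype(class_b)))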
Second, the distances between the matrix clause subjects and their embedded clause counterparts are also close, and in the same range as the distances between “non-antagonistic” subject types. Third, the embedded clauses essentially replicate the pattern seen in the simple clauses, with agent and experiencer subjects patterning together, and patient subjects at a distance. Fourth, the passive voice patient subjects are far from active voice subjects, with the (not unexpected) exception of active voice intransitive patients. Clearly, the network has drawn a major distinction between patient-subjects and nonpatient subjects. Again, we hypothesize that the network did this simply because of the necessity of storing agents and patients simultaneously. Finally, we see that passive voice patient subjects are very close (within the range of a within-class distance) to passive voice subjects of experiential verbs (percepts). Recall that the network was never trained on experiential verbs in the passive voice and never trained with percept-subjects; the network has basically stored such subjects in the same location as passive voice patient subjects. This is consistent with the failure of the network to correctly process these novel constructions. We conclude from this analysis of the subject-variance vectors that within a syntactically defined class of verbs, the subjects are stored in very nearly the same set of units. These subject patterns are more similar to each other than they are to the subject patterns for the same class of verbs in other constructions, or to the subject patterns of other classes of verbs. Most importantly, though, the representation of “subject” in the network is controlled by two main factors. First, if the subjects of two sentences must fill the same thematic role, they will be stored similarly. Second, representations are pushed apart according to whether the processing requirements force them to compete for representational resources. In the case of our set of sentence types, the effect is that agents and patients are stored separately because they can appear together, and experiencers are stored very close to agents, since they never appear together. The result is that the instantiation of “subject” in the network amounts to a radial category in the manner of Lakoff’s Women, Fire, and Dangerous Things [30]. These relationships are largely in accord with the predictions of the theoretical model sketched out in this paper. A Connectionist Simulation of the Empirical Acquisition 7 189 Discussion and Conclusions This simulation was intended to demonstrate that the most abstract aspects of language are learnable. There are two broad areas in which this is explored: control of “subjecthood” properties and demonstration of relative abstraction. In the area of control of properties, this simulation demonstrated that the network was capable of learning to process equi-NP deletion sentences (also known as “control constructions”). This is shown in the ability of the network to correctly process sentences such as Sandy persuaded Kim to run (these are shown in groups 5 & 6, in section 5 above). As was seen above, the network was able to correctly understand these sentences at a rate of 84%. The network’s ability to abstract from semantics was shown in the ability of the network to partially bridge the artificial gap in the training set, that of the questioned embedded subject of experiential verbs. 
The network was able to define the position in that syntactic construction in terms of a semanticallyabstract entity, that is, a subject rather than an agent. Consistent with developmental data, the network also did not generalize when it should not have. In particular, it did not process passive sentences with perceptual subjects. We have hypothesized that this pattern of generalization and lack of generalization can be explained as a conspiracy of constructions, that bootstrap the processing required for a new construction. Without this scaffolding, the network assimilates the new construction into a known one. As is clear from the examination of the hidden layer, we can see how the network stores a partially-abstract representation of the subject. We can also see the limitations of abstraction; the network’s representation of the subject of a given sentence is also partially specified in semantically loaded units. And, as we have seen in the Maratsos [37] study, this appears to be appropriate to the way that humans learn language. This result is also consistent with Goldberg’s theoretical analysis [25] that predicts this semantically-limited scope to certain syntactic constructions. Of course, we have been preceded by many others in the use of recurrent networks for language comprehension [16, 22, 28, 39, 52]. Most of these previous works impose a great deal of structure on the networks that, in some sense, parallels a preconceived notion of what sentence processing should look like. The previous work to which we owe the greatest debt is that of Elman [22], who developed Simple Recurrent Networks, and St. John & McClelland [52], who applied them to the problem of mapping from sequences of words to semantic representations. There are two main differences between this work and that of St. John & McClelland. In terms of networks, ours is simpler, because we specify in advance an output representation for semantics. While our semantics is simpler, the syntactic constructions used in training are more complex. Indeed, the fact that we focus upon the notion of a grammatical relation and how it could be learned is what differentiates this work from much of the previous work. Such a notion, as shown in the list of characteristic properties, requires a fairly large array of sentence types. Our analysis of the network’s representation of this notion also is novel. 190 W.C. Morris, G.W. Cottrell, and J. Elman One obvious drawback of our work is the impoverished semantics. All of our nouns were glossed as proper names, but they were just simple bit patterns with no inherent structure. The only difference in verb “meanings”, aside from a particular bit pattern for a signature, was the set of thematic roles they licensed. A richer semantics would presumably be required to model the earlier stages of the theory, where verbs with similar meanings merge into larger categories. On the bright side, preliminary studies for future work, as well as similar studies by Van Everbroeck [58], indicate that this sort of network can be scaled up in the size of the vocabulary. In the context of this book, this work demonstrates that a “radical” connectionist approach, that is, one without any additional bells and whistles to force it to be “symbolic”, is indeed able to form categories usually reserved for symbolic approaches to linguistic analysis. Indeed, we believe that this sort of approach will eventually show that syntax as a separate entity from semantic processing is an unnecessary assumption. 
Rather, what we see in our network is that “syntax”, in the usual understanding of that term, is part and parcel of the processing required to map from a sequence of input words to a set of semantic roles. Acknowledgments The authors would like to thank Adele Goldberg and Ezra Van Everbroeck for their comments and suggestions in the course of this study. References [1] N. Akhtar. Characterizing English-speaking children’s understanding of SVO word order. to appear. [2] L. Bloom. One Word at a Time: The Use of Single Word Utterances Before Syntax. Mouton de Gruyter, The Hague, 1973. [3] L. Bloom, K. Lifter, and J. Hafitz. Semantics of verbs and the development of verb inflection in child language. Language, 56:386–412, 1980. [4] H. Borer and K. Wexler. The maturation of syntax. In T. Roeper and E. Williams, editors, Parameter Setting. D. Reidel Publishing Company, Dordrecht, 1987. [5] M. Bowerman. Learning the structure of causative verbs: A study in the relationship of cognitive, semantic and syntactic development. In Papers and Reports on Child Language Development, volume 8, pages 142–178. Department of Linguistics, Stanford University, 1974. [6] M. Bowerman. Semantic factors in the acquisition of rules for word use and sentence construction. In D. M. Morehead and A. E. Morehead, editors, Normal and deficient child language. University Park Press, Baltimore, 1976. [7] M. Bowerman. Evaluating competing linguistic models with language acquisition data: Implications of developmental errors with causative verbs. Quaderni di semantica, 3:5–66, 1982. [8] M. Bowerman. Reorganizational processes in lexical and syntactic development. In E. Wanner and L. R. Gleitman, editors, Language acquisition: The state of the art, pages 319–346. Cambridge University Press, Cambridge, 1982. A Connectionist Simulation of the Empirical Acquisition 191 [9] M. Bowerman. The “no negative evidence” problem: How do children avoid an overgeneral grammar? In John A. Hawkins, editor, Explaining Language Universals, pages 73–101. Basil Blackwell, Oxford (UK), 1988. [10] M. Bowerman. Mapping thematic roles onto syntactic functions: Are children helped by innate linking rules? Linguistics, 28:1253–1289, 1990. [11] M. D. S. Braine. On learning the grammatical order of words. Psychological Review, 70:323–348, 1963. [12] M. D. S. Braine. Children’s first word combinations, volume 41 of Monographs of the Society for Research in Child Development. University of Chicago Press, Chicago, 1976. [13] R. Brown. A First Language: The Early Stages. Harvard University Press, Cambridge MA, 1973. [14] J. Bybee. Morphology: A Study of the Relation between Meaning and Form. John Benjamins, Amsterdam, 1985. [15] E. V. Clark. Early verbs, event-types, and inflections. In C. E. Johnson and J. H. V. Gilbert, editors, Children’s Language, volume 9, pages 61–73. Lawrence Erlbaum Associates, Mahwah, NJ, 1996. [16] Garrison W. Cottrell. A Connectionist Approach to Word Sense Disambiguation. Research Notes in Artificial Intelligence. Morgan Kaufmann, San Mateo, 1989. [17] J. G. de Villiers, M. Phinney, and A. Avery. Understanding passives with nonaction verbs. Paper presented at the Seventh Annual Boston University Conference on Language Development, October 8-10, 1982. [18] R. M. W. Dixon. The Dyirbal Language of North Queensland. Cambridge University Press, Cambridge UK, 1972. [19] R. M. W. Dixon. Ergativity. Language, 55:59–138, 1979. [20] R. M. W. Dixon. Ergativity. Cambridge University Press, Cambridge, 1994. [21] D. Dowty. 
Thematic proto-roles and argument selection. Language, 67(3):547–619, 1991. [22] J. L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990. [23] W. A. Foley and R. D. Van Valin Jr. Functional Syntax and Universal Grammar. Cambridge University Press, Cambridge, 1984. [24] A. E. Goldberg. Argument Structure Constructions. PhD thesis, University of California, Berkeley, 1992. [25] A. E. Goldberg. A Construction Grammar Approach to Argument Structure. University of Chicago Press, Chicago, 1995. [26] K. Hirsh-Pasek and R. M. Golinkoff. The Origins of Grammar: Evidence from Early Language Comprehension. MIT Press, Cambridge MA, 1996. [27] N. M. Hyams. Language acquisition and the theory of parameters. D. Reidel Publishing Company, Dordrecht, 1986. [28] Ajay N. Jain. A connectionist architecture for sequential symbolic domains. Technical Report CMU-CS-89-187, Carnegie Mellon University, 1989. [29] E. L. Keenan. Towards a universal definition of “Subject”. In C. Li, editor, Subject and Topic, pages 303–334. Academic Press, New York, 1976. [30] G. Lakoff. Women, Fire, and Dangerous Things. Chicago University Press, Chicago, 1987. [31] R. W. Langacker. Theoretical Prerequisites, volume 1 of Foundations of Cognitive Grammar. Stanford University Press, Stanford, 1987. [32] Ronald W. Langacker. Concept, Image, and Symbol: The Cognitive Basis of Grammar. Mouton de Gruyter, Berlin, 1991. [33] Ronald W. Langacker. Descriptive Application, volume 2 of Foundations of Cognitive Grammar. Stanford University Press, Stanford, 1991. 192 W.C. Morris, G.W. Cottrell, and J. Elman [34] B. Levin. On the Nature of Ergativity. PhD thesis, MIT, 1983. [35] C. D. Manning. Ergativity: Argument Structure and Grammatical Relations. PhD thesis, Stanford, 1994. [36] A. P. Marantz. On the nature of grammatical relations. MIT Press, Cambridge MA, 1984. [37] M. Maratsos, D. E. C. Fox, J. A. Becker, and M. A. Chalkley. Semantic restrictions on children’s passives. Cognition, 19:167–191, 1985. [38] M. Maratsos, S. A. Kuczaj II, D. E. C. Fox, and M. A. Chalkley. Some empirical studies in the acquisition of transformational relations: Passives, negatives, and the past tense. In W. A. Collins, editor, Children’s Language and Communication: The Minnesota Symposia on Child Psychology, volume 12, pages 1–45. Lawrence Erlbaum Associates, Hillsdale NJ, 1979. [39] Risto Miikkulainen. Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon, and Memory. MIT Press, Cambridge, MA, 1993. [40] W. C. Morris. Emergent Grammatical Relations: An Inductive Learning System. PhD thesis, University of California, San Diego, 1998. [41] L. Naigles. Children use syntax to learn verb meanings. Journal of Child Language, 17:357–374, 1990. [42] L. Naigles, K. Hirsh-Pasek, R. Golinkoff, L. R. Gleitman, and H. Gleitman. From linguistic form to meaning: Evidence for syntactic bootstrapping in the two-yearold. Paper presented at the Twelfth Annual Boston University Child Language Conference, Boston MA, 1987. [43] R. Olguin and M. Tomasello. Twenty-five-month-old children do not have a grammatical category of verb. Cognitive Development, 8:245–272, 1993. [44] J. M. Pine, E. V. M. Lieven, and C. F. Rowland. Comparing different models of the development of the english verb category. MS, Forthcoming. [45] J. M. Pine and H. Martindale. Syntactic categories in the speech of young children: The case of the determiner. Journal of Child Language, 23:369–395, 1996. [46] S. Pinker. Language Learnability and Language Development. 
Harvard University Press, Cambridge MA, 1984. [47] S. Pinker. Learnability and Cognition: The Acquisition of Argument Structure. MIT Press, Cambridge MA, 1989. [48] S. Pinker, D. S. LeBeaux, and L. A. Frost. Productivity and constraints in the acquisition of the passive. Cognition, 26:195–267, 1987. [49] D. E. Rumelhart and J. L. McClelland. On learning the past tenses of English verbs. In J. L. McClelland and D. E. Rumelhart, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 2, pages 216– 271. The MIT Press, Cambridge MA, 1986. [50] D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1. The MIT Press, Cambridge MA, 1986. [51] P. Schachter. The subject in Philippine languages: Topic, Actor, Actor-Topic, or none of the above? In C. Li, editor, Subject and Topic, pages 491–518. Academic Press, New York, 1976. [52] Mark F. St. John and James L. McClelland. Learning and applying contextual constraints in sentence comprehension. Artificial Intelligence, 46:217–257, 1990. [53] M. Tomasello. First verbs: A case study of early grammatical development. Cambridge University Press, Cambridge, 1992. [54] M. Tomasello and P. J. Brooks. Early syntactic development: A construction grammar approach. In M. Barrett, editor, The Development of Language. UCL Press, London, in press. A Connectionist Simulation of the Empirical Acquisition 193 [55] M. Tomasello and R. Olguin. Twenty-three-month-old children do have a grammatical category of noun. Cognitive Development, 8:451–464, 1993. [56] V. Valian. Syntactic categories in the speech of young children. Developmental Psychology, 22:562–579, 1986. [57] V. Valian. Syntactic subjects in the early speech of American and Italian children. Cognition, 40:21–81, 1991. [58] E. Van Everbroeck. Language type frequency and learnability: A connectionist appraisal. In M. Hahn and S. C. Stoness, editors, The Proceedings of the Twenty First Annual Conference of the Cognitive Science Society, pages 755–760, Mahwah NJ, 1999. Lawrence Erlbaum Associates. Large Patterns Make Great Symbols: An Example of Learning from Example Pentti Kanerva RWCP Theoretical Foundation SICS Laboratory Real World Computing Partnership, Swedish Institute of Computer Science SICS, Box 1263, SE-164 29 Kista, Sweden kanerva@sics.se Abstract. We look at distributed representation of structure with variable binding, that is natural for neural nets and that allows traditional symbolic representation and processing. The representation supports learning from example. This is demonstrated by taking several instances of the mother-of relation implying the parent-of relation, by encoding them into a mapping vector, and by showing that the mapping vector maps new instances of mother-of into parent-of. Possible implications to AI are considered. 1 Introduction Distributed representation is used commonly with neural nets, as well as in ordinary computers, to encode a large number of attributes or things with a much smaller number of variables or units. In this paper we assume that the units are binary so that the encodings of things are bit patterns or bit vectors (‘pattern’ and ‘vector’ will be used interchangeably in this paper). 
In normal symbolic processing the bit patterns are arbitrary identifiers or pointers, and it matters only whether two patterns are identical or different, whereas bit patterns in a neural net are somewhat like an error-correcting code: similar patterns (highly correlated, small Hamming distance) mean the same or similar things—which is why neural nets are used as classifiers and as low-level sensory processors. Computer modeling of high-level mental functions, such as language, led to the development of symbolic processing, and for a long time artificial intelligence (AI) and symbolic processing (e.g., Lisp) were nearly synonymous. However, the resulting systems have been rigid and brittle rather than lifelike. It is natural to look to neural nets for a remedy. Their error-correcting properties should make the systems more forgiving. Consequently, considerable research now goes into combining symbolic and neural approaches. One way of doing it is to use both symbolic and neural processing according to what each one does the best. Wermter [15] reviews such approaches for language processing. Another is to encode (symbolic) structure with distributed representation that is suitable for neural nets. The present paper explores this second direction, possibly providing insight into how symbolic processes are realized in the brain. Much has been written and debated about the encoding of structure for neural nets (see, e.g., Sharkey [13]) without reaching a clear consensus. Hinton [4] has discussed S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 194-203, 2000  Springer-Verlag Berlin Heidelberg 2000 1 Large Patterns Make Great Symbols: An Example of Learning from Example 195 it in depth and has introduced the idea of reduced representation. The idea is that both a composed structure and its components are represented by vectors of the same dimensionality, akin to the fixed-width pointers of a symbolic expression, except that the vector for the composed structure is built of the vectors for its components; it is a function of the component vectors. This idea is realized in Pollack’s Recursive AutoAssociative Memory (RAAM) [11], in Kussul, Rachkovskij, and Baidyk’s Associative-Projective Neural Network [8, 12], in Plate’s Holographic Reduced Representation (HRR) [9, 10], and in my binary Spatter Code [6]. I will use here the latter because it allows for particularly simple examples. The research on representation by Shastri and Ajjanagadde [14] and by Hummel and Holyoak [5] demonstrate yet other solutions to problems we share. Most of this paper is devoted to demonstrating a symbol-processing mechanism for neural nets. The point, however, is not to develop a neural-net Lisp but to see what additional properties a system based on distributed representation might have, and how these properties could lead to improved modeling of high-level mental functions. These issues are touched upon at the end of the paper. 2 Binary Spatter-Coding of Structure The binary Spatter Code is a form of Holographic Reduced Representation. It is summarized below in terms applicable to all HRRs, and in traditional symbolic terms using a two-place relation r(x, y) as an example. 2.1 Space of Representations HRRs work with large random patterns, or very-high-dimensional random vectors. All things—variables, values, composed structures, mappings between structures—are points of a common space: they are very-high-dimensional random vectors with independent, identically distributed components. 
The dimensionality of the space, denoted by N, is usually in the thousands (e.g., N = 10,000). The Spatter Code uses dense binary vectors (i.e., 0s and 1s are equally probable). The vectors are written in boldface, and when a single letters stands for a vector it is also italicized so that, for example, x is the N-dimensional vector (‘N-vector’ for short) that represents the variable or role x, and c is the N-vector that represents the value or filler c. 2.2 Item Memory or Clean-up Memory Some operations produce approximate vectors that need to be cleaned up (i.e., identified with their exact counterparts). It is done with an item memory that stores all valid vectors known to the system, and retrieves the best-matching vector when cued with a noisy vector, or retrieves nothing if the best match is no better than what results from random chance. The item memory performs a function that, at least in principle, is performed by an autoassociative neural memory. 2 196 2.3 P. Kanerva Binding Binding is the first level of composition. It combines things that are very closely associated with each other, as when one is used to name or identify the other. Thus a variable is bound to a value with a binding operator that combines the N-vectors for the variable and the value into a single N-vector for the bound pair. The Spatter Code binds with coordinatewise (bitwise) Boolean Exclusive-OR (XOR, ⊗), so that the variable x having the value c (i.e., x = c) is encoded by the N-vector x⊗c whose nth bit is xn ⊗cn (xn and cn are the nth bits of x and c, respectively). An important property of all HRRs is that binding of two random vectors produces a random vector that resembles neither of the two. 2.4 “Unbinding” The inverse of the binding operator decomposes a bound pair into its constituents: it finds the filler if the role is given, or the role if the filler is given. The XOR is its own inverse function, so that, for example, x⊗(x⊗c) = c finds the vector to which x is bound in x⊗c (i.e., what’s the value of x?). 2.5 Merging Merging is the second level of composition in which identifiers and bound pairs are combined into a single entity. It has also been called ‘super(im)posing’, ‘bundling’, and ‘chunking’. It is done by a normalized sum vector, and the merging of G and H is written as 〈G + H〉, where 〈…〉 stands for normalization. The relation r(A, B) can be represented by merging the representations for r, ‘r1 = A’, and ‘r2 = B’, where r1 and r2 are the first and second roles of the relation r. It is encoded by rAB = 〈r + r1 ⊗A + r2 ⊗B〉 . (1) The normalized sum of dense binary vectors is given by bitwise majority rule, with ties broken at random. An important property of all HRRs is that merging of two or more random vectors produces a random vector that resembles each of the merged vectors (i.e., rAB is similar to r, r1 ⊗A, and r2 ⊗B). 2.6 Distributivity In all HRRs, the binding and unbinding operators distribute over the merging operator. For example, x⊗〈G + H + I 〉 = 〈x⊗G + x⊗H + x⊗I 〉 . (2) Distributivity is a key to analyzing HRRs. 2.7 Probing To find out what is bound to r1 in rAB (i.e., what’s the value of r1 in r(A, B)?), we probe rAB (of eqn. 1) with r1 using the unbinding operator. It yields a vector A′ that is recognizable as A (A′ will retrieve A from the item memory). 
The analysis is as follows:

A′ = r1 ⊗ rAB = r1 ⊗ 〈r + r1 ⊗ A + r2 ⊗ B〉 , (3)

which, by distributivity, becomes

A′ = 〈r1 ⊗ r + r1 ⊗ (r1 ⊗ A) + r1 ⊗ (r2 ⊗ B)〉 (4)

and simplifies to

A′ = 〈r1 ⊗ r + A + r1 ⊗ r2 ⊗ B〉 . (5)

Thus A′ is similar to A; it is also similar to r1 ⊗ r and to r1 ⊗ r2 ⊗ B (see Merging, sec. 2.5), but they are not stored in the item memory and thus act as random noise.

2.8 Holistic Mapping and Simple Analogical Retrieval

The functions described so far are sufficient for traditional symbol processing; for example, for realizing a Lisp-like list-processing system. Holistic mapping is a parallel alternative to sequential search and substitution of traditional symbolic processing. Probing is the simplest form of holistic mapping: it approximately maps a composed pattern into one of its bound constituents, as seen above (eqns. 3–5). However, much more than that can be done in a single mapping operation. For example, it is possible to do several substitutions at once, by constructing a mapping vector from individual substitutions (each substitution appears as a bound pair, which then are merged; the kernel map M* discussed at the end of the next section is an example). I have demonstrated this kind of mapping between things that share structure (same roles, different objects) elsewhere [7]. In the next section we do the reverse: we map between structures that share objects (same objects in two different relations). The mappings are constructed from examples, so that this is a demonstration of analogical retrieval or inference.

3 Learning from Example

We will look at two relations, one of which implies the other: ‘If x is the mother of y, then x is the parent of y’, represented symbolically by m(x, y) → p(x, y). We take a specific example m(A, B) of the mother-of relation and compare it to the corresponding parent-of relation p(A, B), to get a mapping M1 between the two. We then use this mapping on another pair (U, V ) for which the mother-of relation holds, to see whether M1 maps m(U, V ) into the corresponding parent-of relation p(U, V ). We encode ‘A is the mother of B’, or m(A, B), with the random N-vector mAB = 〈m + m1 ⊗ A + m2 ⊗ B〉, where m encodes (it names) the relation and m1 and m2 encode its two roles. Similarly, we encode ‘A is the parent of B’, or p(A, B), with pAB = 〈p + p1 ⊗ A + p2 ⊗ B〉. Then

M1 = MAB = mAB ⊗ pAB (6)

maps a specific instance of the mother-of relation into the corresponding instance of the parent-of relation, because mAB ⊗ MAB = mAB ⊗ (mAB ⊗ pAB) = pAB. The mapping MAB is based on one example; is it possible to generalize based on only one example? When the mapping is applied to another instance m(U, V ) of the mother-of relation, which is encoded by mUV = 〈m + m1 ⊗ U + m2 ⊗ V 〉, we get the vector W:

W = mUV ⊗ MAB . (7)

Does W resemble pUV? We will measure the similarity of vectors by their correlation ρ (i.e., by normalized covariance; −1 ≤ ρ ≤ 1). The correlations reported in this paper are exact: they are mathematical mean values or expectations. They have been calculated from composed vectors that are complete in the following sense: they include all possible bit combinations of their component vectors. For example, if the composed vectors involve a total of b “base” vectors, all vectors will be 2^b bits.
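The binding, merging, and probing operations just described, together with the one-example mapping of eqn. (6), can be exercised directly. The following sketch (Python with NumPy, our own helper names, N = 10,000) is a minimal illustration rather than the paper's implementation, and the similarity values mentioned in the comments are approximate expectations.

import numpy as np

N = 10_000
rng = np.random.default_rng(1)

def rand_vec():
    """Dense random binary N-vector (0s and 1s equally probable)."""
    return rng.integers(0, 2, size=N, dtype=np.int8)

def bind(x, y):
    """Binding (and unbinding): bitwise XOR, which is its own inverse."""
    return x ^ y

def merge(*vs):
    """Merging: bitwise majority vote; with three inputs there are no ties."""
    return (np.sum(vs, axis=0) * 2 > len(vs)).astype(np.int8)

def similarity(x, y):
    """1 minus normalized Hamming distance; 0.5 means unrelated, 1.0 identical."""
    return 1.0 - float(np.mean(x ^ y))

# Encode m(A, B) and p(A, B) as in the text, then probe and map.
m, m1, m2, p, p1, p2, A, B = (rand_vec() for _ in range(8))
mAB = merge(m, bind(m1, A), bind(m2, B))
pAB = merge(p, bind(p1, A), bind(p2, B))

A_prime = bind(m1, mAB)      # probing: what fills the first role of m(A, B)?
# A_prime is noisy but much closer to A (similarity near 0.75) than to B or m
# (near 0.5), so a clean-up item memory would return A.
print(round(similarity(A_prime, A), 2), round(similarity(A_prime, B), 2))

MAB = bind(mAB, pAB)                     # one-example mapping, eqn. (6)
print(similarity(bind(mAB, MAB), pAB))   # exactly 1.0: mAB XOR MAB recovers pAB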
If we start with randomly selected (base) vectors m, m1, m2, p, p1, p2, A, B, …, U, V that are pairwise uncorrelated (ρ = 0), we observe first that mAB and pAB are uncorrelated but mAB and mUV are correlated because they both include m in their composition; in fact, ρ(mAB, mUV) = 0.25 and, similarly, ρ(pAB, pUV) = 0.25. When W is compared to pUV and to other vectors, there is a tie for the best match: ρ(W, pUV) = ρ(W, pAB) = 0.25. All other correlations with W are lower: with the related (reversed) parent-of relations pBA and pVU it is 0.125, with an unrelated parent-of relation pXY it is 0.0625, and with A, B, …, U, V, mAB, and mUV it is 0. So based on only one example, m(A, B) → p(A, B), it cannot be decided whether m(U, V ) should be mapped to the original “answer” p(A, B) or should generalize to p(U, V ). Let us now look at generalization based on three examples of mother implying parent: What is m(U, V ) mapped to by M3 that is based on m(A, B) → p(A, B), m(B, C) → p(B, C), and m(C, D) → p(C, D)? This time we will use a mapping vector M3 that is the sum of three binary vectors, M3 = MAB + MBC + MCD , (8) where MAB is as above, and MBC and MCD are defined similarly. Since M3 itself is not binary, mapping mAB or mUV with M3 cannot be done with an XOR. However, we can use an equivalent system in which binary vectors are bipolar, by replacing 0s and 1s with 1s and −1s, and bitwise XOR (⊗) with coordinatewise multiplication (×). Then the mapping can be done with multiplication, vectors can be compared with correlation, and the results obtained with M1 still hold. Notice that now MAB = mAB×pAB, for example (cf. eqn. 6). Mapping with M3 gives the following results: To check that it works at all, consider WAB = mAB×M3; it is most similar to pAB (ρ = 0.71) as expected because M3 contains MAB . Its other significant correlations are with mAB (0.41) and with pUV and pVU (0.18). Thus the mapping M3 strongly supports m(A, B) → p(A, B). It also supports the generalization m(U, V ) → p(U, V ) unambiguously, as seen by comparing WUV = mUV×M3 with pUV. The correlation is ρ(WUV , pUV) = 0.35; the other significant correlations of WUV are with pAB and pVU (0.18) and with pBA (0.15) because they all include the vector p (parent-of). 5 Large Patterns Make Great Symbols: An Example of Learning from Example 199 To track the trend further, we look at generalization based on five examples, m(A, B) → p(A, B), m(B, C) → p(B, C), …, m(E, F) → p(E, F), giving the mapping vector M5 = MAB + MBC + MCD + MDE + MEF . (9) Applying it to mAB yields ρ(mAB×M5, pAB) = 0.63 (the other correlations are lower, as they were for M3), and applying it to mUV yields ρ(mUV×M5, pUV) = 0.40 (again the other correlations are lower). When the individual mappings MXY are analyzed, each is seen to contain the kernel vectors m×p, m1 ×p1, and m2 ×p2, plus other vectors that act as noise and average out as more and more of the mappings MXY are added together. The analysis is based on the distributivity of the (un)binding operator over the merging operator (see sec. 2.6) and it is as follows: MXY = mXY×pXY = 〈m + m1 ×X + m2 ×Y 〉×〈p + p1 ×X + p2 ×Y 〉 = 〈〈m + m1 ×X + m2 ×Y 〉×p + 〈m + m1 ×X + m2 ×Y 〉×p1 ×X + 〈m + m1 ×X + m2 ×Y〉×p2 ×Y 〉 = 〈〈m×p + m1 ×X×p + m2 ×Y×p〉 + 〈m×p1 ×X + m1 ×X×p1 ×X + m2 ×Y×p1 ×X 〉 + 〈m×p2 ×Y + m1 ×X×p2 ×Y + m2 ×Y×p2 ×Y 〉〉 = 〈〈m×p + m1 ×X×p + m2 ×Y×p〉 + 〈m×p1 ×X + m1 ×p1 + m2 ×Y×p1 ×X 〉 + 〈m×p2 ×Y + m1 ×X×p2 ×Y + m2 ×p2〉〉 = 〈〈m×p + noise〉 + 〈m1 ×p1 + noise〉 + 〈m2 ×p2 + noise〉〉 . 
The overstrike indicates cancellation, and the underscore picks out the kernel vectors. The three kernel vectors are responsible for the generalization, and from them we can construct a kernel mapping from mother-of to parent-of:

M* = m × p + m1 × p1 + m2 × p2 . (11)

When mUV is mapped with it, we get a maximum correlation with pUV, as expected, namely ρ(mUV × M*, pUV) = 0.43; correlations with other parent-of relations are ρ(mUV × M*, pXY) = 0.14 (X ≠ U, Y ≠ V) and 0 with everything else.

The results are summarized in Figure 1, which relates the amount of data to the strength of inference and generalization. The data are examples or instances of mother-of implying parent-of, m(x, y) → p(x, y), and the task is to map either an old (Fig. 1a) or a new (Fig. 1b) instance of mother-of into parent-of. The data are taken into account by encoding them into the mapping vector Mk. Figure 1a shows the effect of new data on old examples. Adding examples into the mapping makes it less specific, and consequently the correlation for old inferences (i.e., with pAB) decreases, but it decreases also for all incorrect alternatives. Figure 1b shows the effect of data on generalization. When the mapping is based on only one example, generalization is inconclusive (mUV × M1 is equally close to pAB and pUV), but when it is based on three examples, generalization is clear, as M3 maps mUV much closer to pUV than to any of the others. Finally, the kernel mapping M* represents a very large number of examples, a limit as the number of examples approaches infinity, and then the correct inference is the clear winner.

Figure 1. Learning from example: The mappings Mk map the mother-of relation to the parent-of relation. They are computed from k specific instances or examples of mXY being mapped to pXY. Each map Mk includes the instance 'mAB maps to pAB' and excludes the instance 'mUV maps to pUV'. The mappings are tested by mapping an "old" instance mAB (a) and a "new" instance mUV (b) of mother-of to the corresponding parent-of. The graphs show the closeness of the mapped result (its correlation ρ) to different alternatives for parent-of. The mapping M* is a kernel map that corresponds to an infinite number of examples. Graph b shows good generalization (mUV is mapped closest to pUV) and discrimination (it is mapped much further from, e.g., pAB) based on only three examples (M3). The thickness of the heavy lines corresponds to slightly more than ±1 standard deviation around the expected correlation ρ when N = 10,000.

4 Discussion

We have used a simple form of Holographic Reduced Representation to demonstrate mapping between two information structures, which is a task that has traditionally been in the domain of symbolic processing. Similar demonstrations have been made by Chalmers [2] and by others (e.g., Bodén & Niklasson [1]) using Pollack's Recursive Auto-Associative Memory (RAAM) [11], and by Plate using real-vector HRR [9]. The lesson from such demonstrations is that certain kinds of representations, and operations on them, make it possible to do symbolic tasks with distributed representations suitable for neural nets. Furthermore, when patterns are used as if they were symbols (Gayler & Wales [3]), we do not need to configure different neural nets for different data structures. A general-purpose neural net that operates with such patterns is then a little like a general-purpose computer that runs programs for a variety of tasks.
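As a concrete recap of Section 3, the following sketch (again my own illustration, not the authors' code, using the bipolar system just described with random rather than "complete" vectors) builds the three-example mapping M3 and the kernel map M* and checks generalization to a new pair (U, V):

   import numpy as np

   rng = np.random.default_rng(1)
   N = 10_000

   def vec():                       # random bipolar (+1/-1) vector
       return rng.choice([-1, 1], N)

   def merge(*vs):                  # majority sign; an odd number of terms gives no ties
       return np.sign(np.sum(vs, axis=0))

   def rel(r, r1, r2, x, y):        # encode r(x, y) = <r + r1*x + r2*y>
       return merge(r, r1 * x, r2 * y)

   def corr(a, b):
       return float(np.corrcoef(a, b)[0, 1])

   m, m1, m2, p, p1, p2 = (vec() for _ in range(6))
   A, B, C, D, U, V = (vec() for _ in range(6))

   def M(x, y):                     # one-example mapping m(x,y) -> p(x,y), eqn. 6
       return rel(m, m1, m2, x, y) * rel(p, p1, p2, x, y)

   M3 = M(A, B) + M(B, C) + M(C, D)            # eqn. 8 (sum of three mappings)
   Mstar = m * p + m1 * p1 + m2 * p2           # kernel map, eqn. 11

   mUV = rel(m, m1, m2, U, V)
   pUV = rel(p, p1, p2, U, V)
   pAB = rel(p, p1, p2, A, B)

   print("M3: ", round(corr(mUV * M3, pUV), 2), round(corr(mUV * M3, pAB), 2))
   print("M*: ", round(corr(mUV * Mstar, pUV), 2), round(corr(mUV * Mstar, pAB), 2))
   # exact values quoted in the text: 0.35 vs 0.18 for M3, and 0.43 vs about 0.14 for M*;
   # the random-vector estimates should fall close to these.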
The examples above were encoded with the binary Spatter Code. It uses very simple mathematics and yet demonstrates the key properties of Holographic Reduced Representation. However, the thresholding of the vector sum, which is a part of the Spatter Code's merging operation, discards a considerable amount of information. Therefore the sums themselves, rather than thresholded sums, were used for the mappings Mk and M* (eqns. 8, 9, and 11). This means that the mappings are "outside the system" (they are not binary vectors; in HRRs, all things are points of the same space), and the results have to be measured by correlation rather than by Hamming distance. In fact, the system we ended up with—binding and mapping with coordinatewise multiplication, merging with vector addition—has been used by Ross Gayler (personal communication) to study distributed representation of structure. Plate's real-vector HRR [9] also uses the vector sum for merging, but it uses circular convolution for binding. Circular convolution involves more computation than does coordinatewise binding, but the principles are no more complicated.

The binary system has its peculiarities because it binds and maps with XOR, which is its own inverse function and is also commutative. When we form the mapping from mother-of to parent-of, which is a valid inference, it is the same mapping as from parent-of to mother-of, which is valid as evidence but is not a valid inference. The binary system with XOR obviously is not the final answer to the distributed representation of structure, although by being very simple it makes for a good introduction. Another way to think about the invalid inference from parent-of to mother-of is that these simple mechanisms don't even have to be fully reliable, that they are allowed to, and perhaps should, make mistakes that resemble the mistakes of judgment that people make.

Correlations obtained with HRRs tend to be low. Even the highest correlation for generalization in this paper is only 0.43, and it corresponds to the kernel map (Fig. 1b); we usually have to distinguish between two quite low correlations, such as 0.3 and 0.25. Such fine discrimination is impossible if the vectors have only a few components, but when they have several thousand, even small correlations and small differences in them are statistically significant. This explains the need for very-high-dimensional vectors. For example, with N = 10,000, correlations have a standard deviation no greater than ±0.01, so that the difference between 0.3 and 0.25 is highly significant (it is more than 5 standard deviations). These statistics suggest that N = 10,000 allows very large systems to work reliably and that N = 100,000 would usually be unnecessarily large.

4.1 Toward Natural Intelligence

Our example of learning from example uses a traditional symbolic setting with roles and fillers, with an operation for binding the two, and with another operation for combining bound pairs (and singular identifiers) into representations for new, higher-level compound entities such as relations (eqn. 1). The same binding and merging operations are used also to encode mappings between structures, such as the kernel map between two relations (eqn. 11).
Furthermore, individual instances or examples of the mapping M are encoded with the binding operator—by binding corresponding instances of the two relations (eqn. 6)—and several such examples are merged into an estimate of the kernel map by simply adding them together or averaging (eqns. 8 and 9).

That the averaging of vectors for structured entities should be meaningful is a consequence of the kind of representation used. It has no counterpart in traditional symbolic representation and is, in fact, one of the most exciting and tantalizing aspects of distributed representation. Another is the "holistic" mapping of structures with mapping vectors such as M. They suggest the possibility of learning structured representations from examples without explicitly encoding roles and fillers or relying on high-level rules. This would be somewhat like how people learn. For example, children can learn to use language grammatically without explicit instruction in grammar rules, just as they pick up the sound patterns of their local dialect without being told about vowel shifts and other such things. In modeling that kind of learning we would still use the binding and merging operators, and possibly one or two additional operators, but instead of binding the objects of a new instance to abstract roles, as was done in the examples of this paper, we would bind them to the objects of an old instance—an example—so that the example rather than an abstract frame would serve as the template. Averaging over many such examples could then produce representations from which the abstract notions of role and filler and other high-level concepts could be inferred.

In essence, some traditions of AI and computational linguistics would be turned upside down. Instead of basing our systems on abstract structures such as grammars, we would base them on examples and would discover grammars in the representations that the system produces by virtue of its mechanisms. The rules of grammar would then reflect underlying mental mechanisms the way that the rules of heredity reflect underlying genetic mechanisms, and it is those mechanisms that interest us and that we want to understand. To reach such a goal, we need to discover the right kinds of representations, ones that allow meaningful new "symbols" to be created by simple operations on existing patterns, operations such as XOR and averaging for binding and merging. The point of this paper is to suggest that the goal may indeed be realistic and is a prime motive for the study of distributed representations.

Acknowledgment

The support of this research by Japan's Ministry of International Trade and Industry through the Real World Computing Partnership is gratefully acknowledged.

References

1. Bodén, M.B., Niklasson, L.F.: Features of distributed representation for tree structures: A study of RAAM. In: Niklasson, L.F., Bodén, M.B. (eds.): Current Trends in Connectionism. Erlbaum, Hillsdale, NJ (1995) 121–139
2. Chalmers, D.J.: Syntactic transformations on distributed representations. Connection Science 2, No. 1–2 (1990) 53–62
3. Gayler, R.W., Wales, R.: Connections, binding, unification and analogical promiscuity. In: Holyoak, K., Gentner, D., Kokinov, B. (eds.): Advances in Analogy Research: Integration of Theory and Data from the Cognitive, Computational, and Neural Sciences (Proc. Analogy '98 workshop, Sofia). NBU Series in Cognitive Science, New Bulgarian University, Sofia (1998) 181–190
4. Hinton, G.E.: Mapping part–whole hierarchies into connectionist networks. Artificial Intelligence 46, No. 1–2 (1990) 47–75
5. Hummel, J.E., Holyoak, K.J.: Distributed representation of structure: A theory of analogical access and mapping. Psychological Review 104 (1997) 427–466
6. Kanerva, P.: Binary spatter-coding of ordered K-tuples. In: von der Malsburg, C., von Seelen, W., Vorbrüggen, J.C., Sendhoff, B. (eds.): Artificial Neural Networks (Proc. ICANN '96, Bochum, Germany). Springer, Berlin (1996) 869–873
7. Kanerva, P.: Dual role of analogy in the design of a cognitive computer. In: Holyoak, K., Gentner, D., Kokinov, B. (eds.): Advances in Analogy Research: Integration of Theory and Data from the Cognitive, Computational, and Neural Sciences (Proc. Analogy '98 workshop, Sofia). NBU Series in Cognitive Science, New Bulgarian University, Sofia (1998) 164–170
8. Kussul, E.M., Rachkovskij, D.A., Baidyk, T.N.: Associative-projective neural networks: Architecture, implementation, applications. Proc. Neuro-Nimes '91, The Fourth Int'l Conference on Neural Networks and their Applications (1991) 463–476
9. Plate, T.A.: Distributed Representations and Nested Compositional Structure. PhD thesis. Graduate Department of Computer Science, University of Toronto (1994) (Available by ftp at ftp.cs.utoronto.ca as /pub/tap/plate.thesis.ps.Z)
10. Plate, T.A.: A common framework for distributed representation schemes for compositional structure. In: Maire, F., Hayward, R., Diederich, J. (eds.): Connectionist Systems for Knowledge Representation and Deduction (Proc. CADE '97, Townsville, Australia). Queensland U. of Technology (1997) 15–34
11. Pollack, J.B.: Recursive distributed representations. Artificial Intelligence 46, No. 1–2 (1990) 77–105
12. Rachkovskij, D.A., Kussul, E.M.: Binding and normalization of binary sparse distributed representations by context-dependent thinning. (Manuscript available at http://cogprints.soton.ac.uk/abs/comp/199904008)
13. Sharkey, N.E.: Connectionist representation techniques. AI Review 5, No. 3 (1991) 143–167
14. Shastri, L., Ajjanagadde, V.: From simple associations to systematic reasoning. Behavioral and Brain Sciences 16, No. 3 (1993) 417–494
15. Wermter, S.: Hybrid Approaches to Neural-Network-Based Language Processing. Report ICSI TR-97-030, International Computer Science Institute, Berkeley, California (1997)

Context Vectors: A Step Toward a "Grand Unified Representation"

Stephen I. Gallant
Knowledge Stream Partners, 148 State St., Boston MA 02109, USA, sgallant@ksp.com

Abstract. Context Vectors are fixed-length vector representations useful for document retrieval and word sense disambiguation. Context vectors were motivated by four goals: 1. Capture "similarity of use" among words ("car" is similar to "auto", but not similar to "hippopotamus"). 2. Quickly find constituent objects (e.g., documents that contain specified words). 3. Generate context vectors automatically from an unlabeled corpus. 4. Use context vectors as input to standard learning algorithms. Context Vectors lack, however, a natural way to represent syntax, discourse, or logic. Accommodating all these capabilities into a "Grand Unified Representation" is, we maintain, a prerequisite for solving the most difficult problems in Artificial Intelligence, including natural language understanding.

1 Introduction

We can view many of the fundamental advances in Physics as a series of successful unifications of separate, contradictory theories.
For example, Einstein unified wave and particle theories for light, Minkowski unified space and time into "spacetime", Hawking unified thermodynamics and black holes, and the current 'holy grail' is the attempt to unify the contradictory theories of relativistic gravity and quantum theory.

Just as theories play the central role in Physics, representation plays the central role in machine learning and AI. For example, a fixed-length vector representation is necessary for virtually every neural network learning algorithm. Getting the right features in such a vector can be much more important than the choice of a learning algorithm.

The goal of this paper is to review a representation for text and images, Context Vectors, and to assess the role of Context Vectors as a step toward an eventual "Grand Unified Representation". Finding a Grand Unified Representation is a likely prerequisite for solving the most difficult fundamental problems in Artificial Intelligence, including natural language understanding.

2 Context Vector Representations

Context Vectors (CV's) are fixed-length vectors that are used to represent words, documents and queries [9,10,12,14,15]. Typically the dimension for each vector is about 300, so we represent each of these objects by a single 300-dimensional vector of real numbers. Context Vectors are normalized to have Euclidean length equal to 1. Context Vectors were created with four goals in mind.

2.1 Similarity of Use

The first goal is to capture "similarity of use" among terms. For example, we want a representation where "car" and "auto" are similar (i.e., small Euclidean distance or large dot product), and where "car" and "hippopotamus" are roughly orthogonal (small dot product).

2.2 Quick Searching

A second goal for Context Vectors is "quick sensitivity to constituent objects". For example, we want the dot product of the CV for "car" with CV's of documents containing "car" to be larger than the dot product of the CV for "car" with CV's of documents that do not contain "car". This can be accomplished by defining a document Context Vector to be the normalized sum of the CV's of the words in the document. (In practice, we use a weighted sum of the CV's to emphasize relative importance, i.e., rarity of words, but we will omit many engineering details from this overview.)

The basic principle that makes possible this sensitivity to constituent objects is that the vector sum (superposition) preserves recognition. Suppose we take a sum, S, of 10 random, normalized 300-dimensional vectors. Then it is easy to prove that the expected dot product of a random, normalized vector, V, with S will be about 1 if V was one of the 10 constituent vectors, and about 0 otherwise (see [11], pg. 41). Note that this property only works with high-dimensional vector sums: if we have a sum of 10 integers that equals 93, then we have no information as to whether 16 was one of the 10 constituent integers.

We can apply this Vector Superposition Principle to document CV's, which are sums of CV's from the constituent terms in the document. Therefore, a document Context Vector will tend to have a higher dot product with the CV of a word in that document (or a word whose CV is similar to a word in that document) than with the CV of a random, unrelated document. The task of document retrieval is now quite simple:

1. Form a query Context Vector in the same way as we form document CV's, taking the weighted sum of constituent word Context Vectors.
2. Compute the dot product of the query Context Vector with each document Context Vector (requiring 300 multiplications and additions per document).
3. Select those documents having the largest dot product with the query Context Vector.
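The following sketch (not from the paper; the vocabulary, the use of an unweighted sum, and the random stand-in vectors are illustrative assumptions) shows the Vector Superposition Principle and dot-product retrieval with random normalized 300-dimensional vectors standing in for trained Context Vectors.

   import numpy as np

   rng = np.random.default_rng(0)
   DIM = 300

   def normalize(v):
       return v / np.linalg.norm(v)

   # random word CVs (stand-ins for vectors trained to reflect "similarity of use")
   vocab = ["car", "auto", "road", "hippopotamus", "river", "bank", "loan", "money"]
   word_cv = {w: normalize(rng.standard_normal(DIM)) for w in vocab}

   def doc_cv(words):
       # document CV = normalized (here unweighted) sum of constituent word CVs
       return normalize(np.sum([word_cv[w] for w in words], axis=0))

   docs = {
       "d1": ["car", "road", "auto"],
       "d2": ["hippopotamus", "river"],
       "d3": ["bank", "loan", "money"],
   }
   doc_cvs = {name: doc_cv(ws) for name, ws in docs.items()}

   # superposition preserves recognition: a constituent word scores well above zero,
   # a non-constituent word scores near zero
   print(round(float(word_cv["car"] @ doc_cvs["d1"]), 2),
         round(float(word_cv["car"] @ doc_cvs["d2"]), 2))

   # retrieval: rank documents by dot product with the query CV
   query = doc_cv(["car"])
   ranking = sorted(doc_cvs, key=lambda d: float(query @ doc_cvs[d]), reverse=True)
   print(ranking)   # d1 should come first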
Fig. 1. Learning Context Vectors. The CV for "fox" is modified by adding a fraction (say 0.1) of the CV's for "quick", "brown", "jumped", and "over". This makes the CV for "fox" more like the CV's of these surrounding terms.

2.3 Automated Generation of Context Vectors

A third goal for Context Vectors is their automatic generation from an unlabeled corpus of documents so that the vectors preserve "similarity of use" [13,?]. One simple method for doing this is the following (Figure 1):

1. Initialize a Context Vector for each word to a normalized, random vector.
2. Make several passes through the corpus, changing the Context Vector for each word to be more like the Context Vectors of its immediate neighbor words.
3. Generate the Context Vectors for documents by taking the (weighted) sum of Context Vectors for constituent words.

Although we have skipped a number of important engineering details, the above procedure automatically generates a Context Vector system that embodies "similarity of use" in its vector representations.

This learning algorithm has much the same flavor as Kohonen's Self-Organizing Maps (SOM) [17]. In SOM, a fixed (usually small) set of vectors is repeatedly nudged toward neighbors so that the vectors come to "fill the space of neighboring exemplars". Similarly, the above learning algorithm moves the CV's for words to "fill the space of neighboring word CV's". In both cases, the resulting vectors take on characteristics of their neighbors, because neighbors become part of the vectors. A difference with CV generation, however, is that the neighbors are themselves CV's, and these CV's are themselves changing. One technique to help "stabilize" this learning is described in [2].

Note that all words, documents, queries, and phrases are represented by Context Vectors in exactly the same vector space, where distance between points is strongly related to distance between objects in terms of "usage" (or perhaps even "meaning").

2.4 Learning

The fourth goal for Context Vectors, and a key motivation for creating Context Vector representations, is that CV's be usable for neural networks and other machine learning techniques. Because Context Vectors for all objects are fixed-length vectors (and similar objects have similar representations), this property naturally follows. Thus we can conveniently apply neural networks to problems involving words or documents, for example to improve queries based upon user feedback from retrieved documents.

For example, if we have a set of documents that are judged "relevant" to query Q, and another set of documents that are "not relevant" to Q, we can learn a query context vector that is different from Q. We first append "+1" as an additional feature to CV's of relevant documents, and "-1" as an additional feature to CV's of not-relevant documents. We can then use either regression or other single-cell neural net learning algorithms, such as the Pocket Algorithm [8,11], to generate a set of weights having the same dimensions as the CV's for documents. We can then use these weights directly as a query CV, or modify Q by adding in a fractional multiple of this set of weights. Experiments have shown that both methods can give significant improvement over the original query [15].
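A rough sketch of this query-learning step (my own reading of it, not the paper's implementation: plain ridge-regularized least squares stands in for the regression or Pocket Algorithm options, and the document CVs and relevance judgments are synthetic):

   import numpy as np

   rng = np.random.default_rng(0)
   DIM = 300

   def normalize(v):
       return v / np.linalg.norm(v)

   # stand-in document CVs and relevance judgments for some query Q
   relevant     = [normalize(rng.standard_normal(DIM)) for _ in range(20)]
   not_relevant = [normalize(rng.standard_normal(DIM)) for _ in range(20)]

   X = np.vstack(relevant + not_relevant)
   y = np.array([+1.0] * len(relevant) + [-1.0] * len(not_relevant))

   # ridge-regularized least squares: weights with the same dimension as a document CV
   lam = 1e-2
   w = np.linalg.solve(X.T @ X + lam * np.eye(DIM), X.T @ y)

   # option 1: use the learned weights directly as the new query CV
   q_new = normalize(w)

   # option 2: nudge the original query Q by a fractional multiple of the weights
   Q = normalize(rng.standard_normal(DIM))     # placeholder original query CV
   q_mixed = normalize(Q + 0.5 * normalize(w))

   # relevant documents should now score higher on average than non-relevant ones
   print(round(float(np.mean(X[:20] @ q_new)), 3),
         round(float(np.mean(X[20:] @ q_new)), 3))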
3 Related Work

There are many similarities between Context Vectors and earlier work on Latent Semantic Indexing (LSI) by Dumais and associates [3,4,5,6], as discussed in [1]. In fact, the traditional Salton vector space model [19] also utilizes fixed-length vectors, but these vectors have (in theory) one component for every word stem in the corpus under consideration, and therefore no "similarity of use" property.

Context Vectors can also be used for word-sense disambiguation, for example in differentiating "star in the sky" from "movie star". The basic idea is to create a Context Vector for each word sense, and then use the surrounding context of a word to find the closest word sense. Word senses may be taken from a dictionary (requiring hand-labeling of the training corpus), or generated by clustering the different contexts in which a word appears in a corpus [15].

There has also been work on "Image Context Vectors", but these efforts have so far met with mixed results [16].

Currently, HNC Software has a 50-person division called Aptex that has commercialized Context Vector technology, including Web applications. Although the author is not aware of the technical details of HNC's recent web activities, note that a web page is itself a document. Therefore it is possible to precompute CV's for web pages of interest. Each CV is a fixed-length vector, say 300 numbers, and this representation is a searchable representation for web pages. Furthermore, if we know which web ads a user "clicks through" and which ads he or she ignores, we can use regression or single-cell neural net learning algorithms to generate a query CV, as previously described. Then we can choose which ad, from the as-yet-unpresented ads, to present to the user for an increased likelihood of selection.

Fig. 2. Toward a Grand Unified Representation

All of these operations can be accomplished very quickly, including learning query CV's, so this technology is well-suited to "real time" ad selection applications over the web.

4 Toward a "Grand Unified Representation"

Context Vectors have been successful in achieving their four main goals. However, they are insufficient for representing natural language in several key respects:

– Syntax: There is no representation for syntax. Even worse, most document retrieval systems ignore key syntactical words, such as prepositions and conjunctions.
– Discourse and Logic: There is no Context Vector mechanism to capture the temporal order of a story, or to draw logical conclusions. Presumably, it is necessary to represent syntax as a prerequisite for representing discourse and logic.

Taking a broader view, we see the situation as in Figure 2. In order to accomplish some of the most difficult AI tasks, we need advances in representation that unify the capabilities displayed along the bottom of Figure 2. How might this be done?

One avenue worth exploring is to expand the dimensionality of Context Vector representations to capture syntactic and logical information.
This would be in the same spirit as current efforts in the Physics community aimed at higher-dimensional "string theory". Intuitively, it appears necessary to expand dimensions, rather than use even more superposition, because storing documents appears to already take up most of the storage capacity in Context Vectors. We are less optimistic about several other approaches, such as creating a syntax tree of Context Vectors.

To speculate further on possible new approaches, we might first perform a parse of the text in question, and then build a CV representation that incorporates information from this parse. For example, we might emphasize "heads" of phrases, including verbs and prepositions, to create parse-based CV's for phrases. Doing this would generate two separate CV's for "threw the ball" and "milked the cow", so that "George milked the cow and threw the ball" would end up with a different CV from "George milked the ball and threw the cow". (In the current CV representation, these two sentences produce the same document CV.) Perhaps Plate's Holographic Reduced Representations [18] might prove useful for constructing phrase CV's.

The ultimate representation should contain both the current first-order context vectors and parse-based second-order phrase CV's. The representation could either use different sets of features for the two representations, or superimpose them on the same set of (say 400) features using the Vector Superposition Principle.

One problem is that perfect parsing can be extremely difficult, and can even depend on the semantics of the text. However, restricting ourselves to phrase identification is somewhat easier, and a precise parse may not be critical. We might even consider going one step further and "learning a parse". One approach would be to attempt to replace standard parsing schemes by some Kohonen-style SOM operation that attempts to learn syntax relations.

Of course, why stop there? As long as we are speculating, we could extend to third-order structures involving relations among second-order structures. Higher-order structures would then come to represent discourse, and could be used to attempt to find similar story plots. We might even begin to approach the problem of "understanding" the text.

5 Conclusion

We maintain that the key direction for research is representation, not learning theory. We need to aim for a Grand Unified Representation that can handle the various performance issues in Figure 2. A representation that merely combines aspects of neural and symbolic representations, without giving increased computational capabilities, is of limited value. Rather, we need a series of innovations that result in increasingly more powerful representations, each of which adds to representational ability.

Context Vectors are a step in this "Unification Program". They give a nice way to represent words, word senses, documents, and queries while capturing similarity of meaning. We hope that work with Context Vectors will stimulate the clever ideas and breakthroughs needed to arrive at a Grand Unified Representation, and ultimately to tackle the truly difficult tasks in Artificial Intelligence.

References

1. Caid WR, Dumais ST and Gallant SI. (1995) Learned vector-space models for document retrieval. Information Processing and Management, Vol. 31, No. 3, pp. 419-429.
2. Caid WR & Pu O. System and method of context vector generation and retrieval.
United States Patent 5619709, Nov 21, 1995.
3. Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. Indexing by latent semantic analysis. Journal of the Society for Information Science, 1990, 41(6), 391-407.
4. Dumais, S. T. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers, 1991, 23(2), 229-236.
5. Dumais, S. T. (1993) LSI meets TREC: A status report. In D. Harman (Ed.) The First Text REtrieval Conference (TREC-1). NIST special publication 500-207, 137-152.
6. Dumais, S.T. (1994) Latent Semantic Indexing (LSI) and TREC-2. In D. Harman (Ed.) The Second Text REtrieval Conference (TREC-2). NIST special publication 500-215, 105-115.
7. Furnas, G. W., Deerwester, S., Dumais, S. T., Landauer, T. K., Harshman, R. A., Streeter, L. A., and Lochbaum, K. E. Information retrieval using a singular value decomposition model of latent semantic structure. In Proceedings of SIGIR, 1988, 465-480.
8. Gallant, S. I. (1990) Perceptron-based learning algorithms. IEEE Transactions on Neural Networks 1(2):179-192.
9. Gallant, S. I. (1991) Context vector representations for document retrieval. AAAI-91 Natural Language Text Retrieval Workshop: Anaheim, CA.
10. Gallant, S. I. (1991) A practical approach for representing context and for performing word sense disambiguation using neural networks. Neural Computation 3(3):293-309.
11. Gallant, S. I. (1993) Neural Network Learning and Expert Systems. M.I.T. Press.
12. Gallant, S. I. (1994) Method For Document Retrieval and for Word Sense Disambiguation Using Neural Networks. United States Patent 5,317,507, May 31, 1994.
13. Gallant, S. I. (1994) Method For Context Vector Generation for use in Document Storage and Retrieval. United States Patent 5,325,298, June 28, 1994.
14. Gallant, S. I., Caid, W. R. et al (1992) HNC's MatchPlus system. The First Text REtrieval Conference: Washington, DC. pp. 107-111.
15. Gallant, S. I., Caid, W. R. et al (1993) Feedback and Mixing Experiments With MatchPlus. The Second Text REtrieval Conference (TREC-2). NIST special publication 500-215, 101-104.
16. Gallant, S. I., and Johnston, M. F. Image retrieval using Image Context Vectors: first results. IS&T/SPIE Symposium on Electronic Imaging: Science & Technology, San Jose, CA, Feb. 5-10, 1995. In Niblack and Jain, Eds. Storage and Retrieval for Image and Video Databases III. SPIE Vol 2420, 82-94.
17. Kohonen, T. Self-Organizing Maps. Springer, Berlin, 1995.
18. Plate, T.A. Distributed Representations and Nested Compositional Structure. University of Toronto, Department of Computer Science Ph.D. Thesis, 1994.
19. Salton, G., and McGill, M.J. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

Integration of Graphical Rules with Adaptive Learning of Structured Information

Paolo Frasconi¹, Marco Gori², and Alessandro Sperduti³
¹ Dipartimento di Ingegneria Elettrica ed Elettronica, Università di Cagliari, Piazza d'Armi, 09123 Cagliari, Italy
² Dipartimento di Ingegneria dell'Informazione, Università di Siena, Via Roma 56, 53100 Siena, Italy
³ Dipartimento di Informatica, Università di Pisa, Corso Italia 40, I-56125 Pisa, Italy

Abstract. We briefly review the basic concepts underpinning the adaptive processing of data structures as outlined in [3]. Then, turning to practical applications of this framework, we argue that stationarity of the computational model is not always desirable.
For this reason we introduce very briefly our idea of how a priori knowledge about the domain can be expressed in graphical form, allowing the formal specification of perhaps very complex (i.e., non-stationary) requirements for the structured domain to be treated by a neural network or Bayesian approach. The advantage of the proposed approach is the systematicity in the specification of both the topology and the learning propagation of the adopted computational model (i.e., either neural or probabilistic, or even hybrid by combining both of them).

1 Introduction

Several disciplines have taken advantage of structured representations. These include, among many others, knowledge representation, language modeling, and pattern recognition. The interest in developing connectionist architectures capable of dealing with these rich representations (as opposed to "flat" or vector-based representations) can be traced back to the end of the 1980s. Different approaches have been proposed. Touretzky's BoltzCONS system [15] is an example of how a Boltzmann machine can handle symbolic structures using coarse-coded memories [11] as basic representational elements, and LISP's car, cdr, and cons functions as basic operations. The RAAM model proposed by Pollack [10] is based on backpropagation (backpropagation is only one particular way to implement the concept underlying the RAAM model) to discover compact recursive distributed representations of trees with a fixed branching factor. Recursive distributed representations are an instance of the concept of a reduced descriptor introduced by Hinton [6] to solve the problem of mapping part-whole hierarchies into connectionist networks. Also related to the concept of a reduced descriptor are Plate's holographic reduced representations [9]. A formal characterization of representations of structures in connectionist systems using the tensor product was developed by Smolensky [13].

Examples of application domains where structures are extensively used are medical and technical diagnoses (discovery and manipulation of structured dependencies, constraints, explanations), molecular biology (DNA and protein analysis), chemistry (classification of chemical structures, quantitative structure–property relationship (QSPR), quantitative structure–activity relationship (QSAR)), automated reasoning (robust matching, manipulation of logical terms, proof plans, search space reduction), software engineering (quality testing, modularization of software), geometrical and spatial reasoning (robotics, structured representation of objects in space, figure animation, layout of objects), and speech and text processing (robust parsing, semantic disambiguation, organizing and finding structure in texts and speech).

While algorithms that manipulate symbolic information are capable of dealing with highly structured data, adaptive neural networks are mostly regarded as learning models for domains in which instances are organized into static data structures, like records or fixed-size arrays. Recurrent neural networks, which generalize feedforward networks to sequences (a particular case of dynamically structured data), are perhaps the best known exception. However, during the last few years, neural networks for the representation and processing of structures have been developed [14].
In particular, recursive neural networks are a generalization of recurrent networks for processing sequences (i.e., linear chains from a graphical point of view) to the case of directed acyclic graphs. These kinds of networks are of paramount importance both for structural pattern recognition and for the development of hybrid neural-symbolic systems, since they allow the treatment of structured information very naturally and, in several cases, very efficiently. There are several motivations for using neural networks for the processing of structures: neural networks are universal function approximators, they can perform automatic inference (learning), they have very good classification capabilities, and they can deal with noise and incomplete data. All of these properties are particularly useful when facing classification problems in a structured domain.

Let us consider, for example, the use of neural networks for the classification of labeled graphs which represent chemical compounds. The standard approach with feedforward neural networks consists of encoding each graph as a fixed-size vector, which is then used for feeding the neural network. Unfortunately, the a priori definition of the encoding process has several drawbacks. For example, in chemistry, the encoding is performed through the definition of topological indices (which are properly designed by means of a very expensive trial-and-error approach), and the resulting vectorial representations of graphs may be very difficult to classify. Recursive neural networks face this problem by simultaneously learning both how to represent and how to classify structured patterns. This is a very desirable property which has already been shown to be useful in practice.

It must be noted that automatic inference can also be obtained by using a symbolic representation, such as in Inductive Logic Programming [7]. Recursive neural networks, however, have their own specific peculiarity, since they can approximate functions from a structured domain (possibly with real-valued vectors as labels) to the set of reals. To the best of our knowledge, this cannot be performed by any symbolic system.

In many applications, however, the assumption of stationarity of the computational model, which is typically made when using neural or Bayesian networks, does not fit the nature of the problem at hand. This is especially true when considering complex problems which may involve heterogeneous (i.e., both symbolic and numerical) and structured data. In this chapter, after a brief review of the basic concepts underpinning the adaptive processing of structures as outlined in [3], we show by two examples from practical applications that the assumption of stationarity may not be adequate to the complexity of the application domains. Following this argument, we then discuss some ideas about how to define a graphical formalism for the systematic specification and processing of a priori knowledge about the application domain.

2 Structured Domains

In this chapter, we consider domains of DOAGs (Directed Ordered Acyclic Graphs) where vertices are marked by labels, i.e., subsets of domain variables. The meaning of an edge (v, w) is that the variables in the labels attached to v and w are related. A graph is uniformly labeled if: i) an equivalence relation is defined on the domain variables; ii) every label contains one and only one variable from each equivalence class.
For example, in a pattern recognition domain, when classifying images containing geometrical patterns, Perimeter, Area, and Texture are all possible equivalence classes, since all geometrical patterns can be represented using these features.

Let us assume uniformly labeled graphs (with N equivalence classes). Let Y1, Y2, ..., YN be representative variables for the equivalence classes. Yi is said to be categorical when the realizations belong to a finite alphabet Yi. Yi is said to be numerical when the realizations are real numbers: Yi = ℝ. Moreover, the set of label realizations Y = Y1 × Y2 × ··· × YN is called the label space. The class of DOAGs in which we are interested is formed by directed acyclic graphs such that, for each vertex v, a total order ≺ is defined on the edges leaving from v, e.g., (v, w) ≺ (v, u) ≺ (v, t). A trivial example of a structure is a sequence (see Figure 1), which can be defined as either: i) an external vertex, or ii) an ordered pair (t, h) where the head h is a vertex and the tail t is a sequence.

Fig. 1. Left: Example of a sequence; right: Causality for sequential transductions.

2.1 Sequential Transductions

Here we review very briefly the concept of sequential transduction. Let U and Y be the input and output label spaces. We denote by U* the set of all sequences with labels in U (cf. the classical definition in language theory: U is a set of terminal symbols and U* is the free monoid over U). A general transduction T is a subset of U* × Y*. We shall limit our discussion to functions T : U* → Y*. A transduction T(·) is causal if the output at time t does not depend on future inputs (at time t + 1, t + 2, ...) (see Figure 1). A particular form of transduction can be obtained by having the outputs depend on hidden state variables (label space X), i.e., for each time t:

Xt = f(Xt−1, Ut, t) and Yt = g(Xt, Ut, t) ,

where f : X × U → X is the state transition function, g : X × U → Y is the output function, and the initial state is associated with the external vertex (frontier). A recursive state representation exists only if T(·) is causal (see Figure 2). Moreover, T(·) is stationary if f(·) and g(·) do not depend on t.

2.2 Transductions for Data Structures

Here we generalize causal and stationary transductions to structured domains. Let U# and Y# be two DOAG spaces. A transduction T is a subset of U# × Y#. As in the sequential case, we impose the following restrictions: i) T is a function T : U# → Y#; ii) T is IO-isomorph, i.e., skel(T(U)) = skel(U), where skel() is an operator returning the skeleton of the graph, that is, a graph with the same topology as U but with no labels attached to the vertices; iii) # is an ordered class of DAGs. Given an input graph U, a recursive state representation for a structural transduction can be defined so that, for each vertex v:

Xv = f(Xch[v], Uv) and Yv = g(Xv, Uv) , (1)

where ch[v] are the (ordered) children of v, f : X^m × U → X is the state transition function, and g : X × U → Y is the output function.

Fig. 2. (a): Sequential transduction with hidden recursive state representation; (b): Structural transduction with hidden recursive state representation.
Also for structural transductions, a recursive state representation exists only if T(·) is causal, and T(·) is stationary if f(·) and g(·) do not depend on v (see Figure 2). Note that the update of equations (1) may follow any reverse topological sort of the input graph (see Figure 3). Specifically, some vertices can be updated in parallel according to the preorder defined by the topology of the input graph.

We are particularly interested in supersource transductions, which can be characterized by the following points: i) the input graph U is a DOAG with supersource s; ii) the output is a single label Y, which is either a categorical variable, in case of classification, or a (multivariate) numerical variable, in case of regression; iii) the hidden state Xv is updated by f(Xch[v], Uv), while the output label is "emitted" at s: Y = g(Xs).

In order to graphically describe structured transductions, it is convenient to define generalized shift operators. In the case of sequences, the standard shift operator is defined as q−1 Yt = Yt−1 (unitary time delay). This definition can be extended to DOAGs as follows: qk−1 Yv is the label attached to the k-th child of vertex v. Notice that the composition of these operators is not commutative (see Figure 3).

Fig. 3. Left: Example of reverse topological sort for a DOAG: any total order on the nodes consistent with the numbering of the nodes is admissible (nodes with the same number can be permuted); right: Example showing the non-commutativity of the generalized shift operators.

Having defined generalized shift operators, we can now define a class of graphical models called recursive networks, used to represent structural transductions. We assume that: i) T() admits a recursive state space representation; ii) input, state, and output DOAGs are uniformly labeled. The recursive network of T() is a directed graph where: i) vertices are marked with representative variables; ii) edges are marked with generalized shift operators; iii) an edge (Av, Bv), with label qk−p, means that for each v in the vertex set of the input structure, Bv is a function of qk−p Av.

Given a graph U ∈ U# and a recursive transduction T, it is possible to compute the output of the transduction T by resorting to encoding networks: the encoding network associated with U and T is formed by unrolling the recursive network of T through the input graph U. Note that time-unfolding is a special case of the above concept for the class of sequences (see Figure 4). The encoding network is a graphical representation of the functional dependences between the variables in consideration given a specific DOAG in input. This encoding network can be implemented either by a neural network or a Bayesian network [3]. For example, let us consider neural networks. In this case, both the state transition function f(·) and the output function g(·) are realized by feedforward neural networks, leading to the parametric representation Xv = f(Xch[v], Uv, θf) and Yv = g(Xv, Uv, θg), where θf and θg are connection weights. In the special case of linear chains, the above equations exactly correspond to the general state space equations of recurrent neural networks.

Fig. 4. Example of generation of encoding networks for sequential and structural transductions.
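A minimal sketch (illustrative only, not the authors' implementation; the dimensions, the tanh realization of f, and the tuple encoding of trees are assumptions) of a stationary recursive network for binary trees, i.e., eqn. (1) realized by a single feedforward layer and unrolled bottom-up over the input structure:

   import numpy as np

   rng = np.random.default_rng(0)

   LABEL, STATE, OUT = 4, 8, 1           # label, state, and output dimensions (assumed)

   theta_f = {"W_u": 0.1 * rng.standard_normal((STATE, LABEL)),
              "W_c": 0.1 * rng.standard_normal((STATE, 2 * STATE)),   # two ordered children
              "b":   np.zeros(STATE)}
   theta_g = {"W":   0.1 * rng.standard_normal((OUT, STATE)),
              "b":   np.zeros(OUT)}

   X0 = np.zeros(STATE)                  # frontier state for missing (external) children

   def f(x_children, u, p=theta_f):      # state transition, shared at every vertex
       return np.tanh(p["W_u"] @ u + p["W_c"] @ np.concatenate(x_children) + p["b"])

   def g(x, p=theta_g):                  # output function, applied at the supersource
       return p["W"] @ x + p["b"]

   # a tree is (label, left_subtree, right_subtree); None marks an external vertex
   def encode(tree):
       if tree is None:
           return X0
       u, left, right = tree
       return f([encode(left), encode(right)], u)     # bottom-up "unrolling"

   tree = (rng.standard_normal(LABEL),
           (rng.standard_normal(LABEL), None, None),
           (rng.standard_normal(LABEL),
            (rng.standard_normal(LABEL), None, None), None))

   print(g(encode(tree)))                # supersource transduction: one output per graph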
In the special case of linear chains, the above equations exactly correspond to the general state space equations Integration of Graphical Rules 217 Yv t=2 t=4 t=1 t=3 t=5 q-1 + Sequence (list) Xv = Uv Encoding network: TIME UNFOLDING Recursive network a Yv a b + c q-1L q-1R Xv d e e Data structure (binary tree) b = d c e e Uv Recursive network frontier states Encoding network Fig. 4. Example of generation of encoding networks for sequential and structural transductions. of recurrent neural networks. Moreover, the above representation is stationary, since both f (·) and g(·) do not depend on the node v. The encoding network, in this case, will result in a feedforward neural network obtained by replicating and connecting the feedforward neural networks implementing f (·) and g(·), according to the topology of the input DOAG. Standard learning algorithms for neural networks can be used to train the neural encoding network, taking care of the fact that θ f and θ g are replicated across the network. 3 Need for Non-stationary Transductions The basic ingredients of the framework briefly described in the previous section can be summarized by the following pseudo-equation: Encoding Network(i) = Recursive Network + Input Graph(i) This framework assumes stationarity as long as the Recursive Network is stationary. In the following, we argue that in several application cases it is useful to assume non-stationarity. Non-stationary solutions are also sometimes needed because of the recursive nature of the model. Here we briefly discuss two practical cases where non-stationarity of the model will clearly improve the efficiency and performance of the model. 218 3.1 P. Frasconi, M. Gori, and A. Sperduti A Chemical Application Chemical compounds are usually represented as undirected graphs. Each node of the graph is an atom or a group of atoms, while arcs represent bonds between atoms (see Figure 5). Fig. 5. Typical chemical compound, naturally represented by an undirected graph. One fundamental problem in chemistry is the prediction of the biological activity of chemical compounds. Quantitative Structure-Activity Relationship (QSAR) is an attempt to face the problem relying on compounds’ structures. The biological activity of a drug is fully determined by the micromechanism of interaction of the active molecules with the bioreceptor. Unfortunately, discovering this micromechanism is very hard and expensive. Hence, because of the assumption that there is a direct correlation between the activity and the structure of the compound, the QSAR approach is a way of approaching the problem by comparing the structure of all known active compounds with inactive compounds, focusing on similarities and differences between them. The aim is to discover which substructure or which set of substructures characterize the biomechanism of activity, so as to generalize this knowledge to new compounds. The main requirement for the use of recursive networks consists in finding a representation of molecular structures in terms of DOAGs. The candidate representation should retain the detailed information about the structure of the compound, atom types, bond multiplicity, chemical functionalities, and finally a good similarity with the representations usually adopted in chemistry. The main representational problems are: how to represent cycles, how to give a direction to edges, how to define a total order over the edges. 
Let us consider a specific problem: the prediction of the non-specific activity (affinity) towards the Benzodiazepine/GABAA receptor [1]. An appropriate description of the molecular structures of benzodiazepines can be based on a labeled tree representation. In fact, concerning the first problem, since cycles mainly constitute the common shared template of the benzodiazepine compounds, it is reasonable to represent a cycle (or a set of connected cycles) as a single node whose attached label carries information about its chemical nature. The second problem can be solved by the definition of a set of rules based on the I.U.P.A.C. nomenclature system (the root of a tree representing a benzodiazepine is determined by the common template). Finally, the total order over the edges follows a set of rules mainly based on the size of the sub-compounds. An example of representation for a benzodiazepine is shown in Figure 6.

Fig. 6. Example of representation for a benzodiazepine; the compound is encoded as the term bdz(h,f,h,ph,h,h,c2(o,c3(h,h,h)),h,h).

This representation of benzodiazepines implies that, once a given recursive network has been defined, the sets of parameters corresponding to the pointers are used with different roles: at the level of the root, they are used to implement the target regression, while at the remaining levels they are used to perform the encoding of the substituents (which do not have a target value associated with them). Since there is no strong evidence that these two functionalities should be implemented by the same set of parameters, it is clear that it would have been better to have a different set of parameters for each functionality. Moreover, in this particular case, since the substituents exploit only 3 out of 9 pointers, it turns out that only these 3 pointers are "overloaded", thus introducing a clear arbitrary bias into the model.

3.2 Logo Recognition

Pattern recognition is another source of applications in which one may be interested in the adaptive processing of data structures. This was recognized early with the introduction of syntactic and structural pattern recognition [4,8,5,12], which are based on the premise that the structure of an entity is very important for both classification and description.

Figure 7 shows a logo with a corresponding structural representation based on a tree, whose nodes are components properly described in terms of geometrical features. This representation is invariant with respect to roto-translations and naturally incorporates both symbolic and numerical information.

Fig. 7. A logo with the corresponding representation based on both symbolic and sub-symbolic information.

Of course, the extraction of robust representations from patterns is not a minor problem.
The presence of a significant amount of noise is likely to significantly affect representations that are strongly based on symbols. Hence, depending on the problem at hand, the structured representation that we derive should emphasize either the symbolic or the sub-symbolic information. For example, the logo shown in Figure 7 could be corrupted by so much noise as to make it unreasonable to recognize the word "SUM". In that case, one should just treat the whole word as sub-symbolic information collected in a single node. In general, however, the symbolic information can be exploited as a selector for different sets of parameters: tree nodes with the same symbolic part can be processed by using the same set of parameters, while tree nodes with a different symbolic part can be processed by a different set of parameters. This would allow the reduction of the variance in the input, thus permitting the learning algorithm to focus only on the relevant aspects of the problem. Some work in this direction has already been presented in [2].

4 Representing Non-stationary Transductions

In the previous section, we have argued about the advantages of "using" non-stationary transductions. Non-stationary transductions can be defined within the proposed formalism by implementing f(·) and g(·) through non-stationary models, e.g., non-stationary neural networks. Very often, however, a priori knowledge about this non-stationarity can conveniently be represented in an explicit way at the level of the graphical model, i.e., at the level of the recursive network, independently of the implementation mechanism. The basic idea is to define a sort of graphical language for describing the desired non-stationarity explicitly. The advantage of this approach is to avoid the introduction of distinct parameters for each vertex in the input structures. In fact, the graphical formalism allows the user to specify exactly under which conditions a new set of parameters must be used.

For example, let us consider the case of a transduction for binary trees where both f(·) and g(·) depend on: i) the distance of the node in the tree from the frontier; ii) the value associated with a numerical variable (U) within the labels. A specific instance of this situation could be described by the pseudo-code reported in Figure 8 (a plain-code sketch of the same rule is given after this walkthrough). The first statement declares a variable of type sequence, while the second statement declares a variable of type vertex. The variable of type sequence is then used to contain the sequence of vertices returned by the statement sort_vertices_by(dist_from(frontier), <). This statement returns one of the admissible topological orders (there may be many) where the vertices of the input DOAG are sorted by increasing (<) distance from the frontier. The next foreach(v, Seq) { ... } command iterates the value of v through each element of the sequence Seq, executing the commands contained in the body { ... } after each assignment. Within the foreach, there is an if-then-else statement which tests how far the current vertex is from the frontier. If the distance of the current vertex from the frontier is less than 3, then an additional test is performed on the associated numerical variable U: if the value of U is within the range [0.3, 0.55], the recursive network corresponding to the second then branch is used, otherwise the recursive network corresponding to the first else branch is used. If the distance of the current vertex from the frontier is not less than 3, the recursive network corresponding to the second else branch is used.
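For illustration, here is a plain-code version of the same non-stationary rule (my own sketch, not the authors' formalism; the helper names, dimensions, and the exact definition of distance from the frontier are assumptions), in which the parameter set used at each vertex is selected by the two tests of Figure 8:

   import numpy as np

   rng = np.random.default_rng(0)
   STATE = 8

   def make_params():                      # one independent parameter set per branch
       return {"W_u": 0.1 * rng.standard_normal((STATE, 1)),
               "W_c": 0.1 * rng.standard_normal((STATE, 2 * STATE)),
               "b":   np.zeros(STATE)}

   # three distinct recursive networks, as in Figure 8
   params_near_inrange = make_params()     # dist < 3 and U in [0.3, 0.55]
   params_near_other   = make_params()     # dist < 3 and U outside the range
   params_far          = make_params()     # dist >= 3

   def select(dist, u):
       if dist < 3:
           return params_near_inrange if 0.3 <= u <= 0.55 else params_near_other
       return params_far

   X0 = np.zeros(STATE)                    # frontier state

   def f(x_children, u, p):
       return np.tanh(p["W_u"] @ np.array([u]) + p["W_c"] @ np.concatenate(x_children) + p["b"])

   # a binary tree node is (u, left, right); None marks an external vertex
   def dist_from_frontier(tree):
       if tree is None:
           return -1
       return 1 + max(dist_from_frontier(tree[1]), dist_from_frontier(tree[2]))

   def encode(tree):
       if tree is None:
           return X0
       u, left, right = tree
       p = select(dist_from_frontier(tree), u)    # non-stationary: per-vertex parameters
       return f([encode(left), encode(right)], u, p)

   tree = (0.23, (0.4, (0.5, None, None), (0.43, None, None)), (0.11, None, None))
   print(encode(tree)[:3])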
Please note that the three recursive networks which appear in the "program" should be intended as distinct models, i.e., each network has a different set of parameters associated with its state transition and output functions. All three recursive networks involved in the above description exploit the same set of variables (i.e., U, X, Y), which assures the consistency of the description.

In Figure 9, an example of application of the "program" is reported. On the right side of the figure, the encoding network obtained by applying the program to the input tree on the left side of the figure is shown. Note that vertices 1, 2, 4, 5, and 6 satisfy both the first and second if-then-else conditions, while vertices 3 and 7 only satisfy the first if-then-else condition. Finally, vertex 8 is the only node for which the last recursive network is applied.

Fig. 8. Example of a non-stationary recursive network. In outline, the pseudo-code of the figure is:
   Sequence_of_vertices Seq;
   Vertex v;
   Seq <- sort_vertices_by(dist_from(frontier), <);
   foreach(v, Seq) {
     if (dist_from(frontier) < 3) then {
       if (U in [0.3, 0.55]) then <recursive network 1> else <recursive network 2>
     } else <recursive network 3>
   }
where each <recursive network k> stands for a distinct recursive network over U, X, and Y, drawn graphically in the original figure.

It must be observed that the resulting encoding network is not connected, since the subnetwork corresponding to vertices 3 and 6 is not connected to the remaining part of the network. As a consequence, neither vertex 3 nor vertex 6 will contribute to the definition of the state and output of vertices 7 and 8.

The processing flow, in this case, is sequential, since the vertices of the input graph are sorted according to a given topological order. In general, however, the preorder defined by the topology of the input DOAG admits a degree of parallelism. This form of parallelism can be expressed by a data flow model, where the nodes of the input DOAG are visited in parallel in a bottom-up fashion, from the frontier to the supersource. For example, the nodes of the input DOAG of Figure 9 can be processed according to the following sequence of sets, S1 = {1,2,3,4}, S2 = {5,6}, S3 = {7}, S4 = {8}, where nodes in the same set are processed in parallel, but set Si must be processed strictly after set Si−1.

Fig. 9. Example of generation of the encoding network by using the input on the left side of the figure and the non-stationary recursive network shown in Figure 8.

Such parallelism could be expressed by adopting an event-based environment, where an event can be understood as a condition satisfied by a given vertex of the input DOAG. This event-based environment could be introduced by the statement deal with { ... }, which in general contains a set of (non-contradictory) event(condition): { ... } statements and a default: { ... } statement applied when no condition of the event statements is satisfied. It must be observed that a condition, besides being a hard logical condition on the topology of the DOAG (i.e., whether a node is at a given level of the DOAG or the node has more than 2 children), can also be defined as a fuzzy condition, implemented, for example, by a neural network. So the test on the condition can either activate a symbolic procedure or a pattern recognition subsystem.
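A small sketch of this level-wise scheduling (my own illustration; the children lists are assumed only so as to reproduce the sets S1–S4 quoted above):

   from collections import defaultdict

   # children lists for the input tree of Figure 9 (vertex 8 is the supersource;
   # this topology is an assumption chosen to match the level sets in the text)
   children = {8: [4, 7], 7: [5, 6], 5: [1, 2], 6: [3], 4: [], 1: [], 2: [], 3: []}

   def levels(children):
       """Group vertices into sets S1, S2, ... by distance from the frontier,
       so that each set can be processed in parallel after the previous one."""
       memo = {}
       def depth(v):
           if v not in memo:
               memo[v] = 1 + max((depth(c) for c in children[v]), default=0)
           return memo[v]
       groups = defaultdict(list)
       for v in children:
           groups[depth(v)].append(v)
       return [sorted(groups[d]) for d in sorted(groups)]

   print(levels(children))   # [[1, 2, 3, 4], [5, 6], [7], [8]]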
One interesting aspect of this approach is that, since the "program", together with the input DOAG, is used to define the encoding network, learning is driven by the "program" as well. In fact, the correspondence between parameters and connections within the encoding network is established by the "program". In general, it may be feasible to define statements concerning the learning modality for each recursive network involved in the "program", even considering different (but "compatible") learning algorithms for each recursive network, e.g., different gradient descent algorithms. In Figure 10, an example of a simple if-then-else graphical rule which solves our chemical problem, as stated in Section 3.1, is shown. If the current vertex is the root of the input DOAG, then all the subgraphs (substituents) are considered and an output (predicted biological activity) is generated; otherwise only three sets of parameters are used for the encoding of the subgraphs (chemical fragments) and no output (prediction) needs to be generated.

    if (root(vertex) == TRUE) then
      [recursive network with output Y]
    else
      [recursive network without output]

Fig. 10. Example of non-stationary rule for the chemical application.

5 Conclusion

We have briefly reviewed the basic concepts underpinning the adaptive processing of data structures as outlined in [3]. Concepts such as causality and stationarity have been adapted to the context of learning data structures. In particular, we have argued that in practical applications stationarity is not always desirable. In order to explain the role of non-stationarity, we have discussed two practical examples, i.e., predicting the biological activity of chemical compounds and automatic classification of logos. Finally, we have introduced very briefly our idea on how a priori knowledge about the domain can be expressed in a graphical form, allowing the formal specification of possibly very complex (i.e., non-stationary) requirements for the structured domain to be treated by a neural network or a Bayesian approach. The advantage of the proposed approach is the systematicity in the specification of both the topology and the learning propagation of the adopted computational model (i.e., either neural or probabilistic, or even hybrid by combining both of them). Moreover, the proposed approach would allow the easy definition and reuse of basic modules, besides the possibility of modifying the computational model with a few changes in the formal specification.

References
1. A.M. Bianucci, A. Micheli, A. Sperduti, and A. Starita. Quantitative structure-activity relationships of benzodiazepines by recursive cascade correlation. In IEEE International Joint Conference on Neural Networks, pages 117–122, 1998.
2. M. Diligenti, M. Gori, M. Maggini, and E. Martinelli. Adaptive graphical pattern recognition: the joint role of structure and learning. In Proceedings of the International Conference on Advances in Pattern Recognition, pages 425–432. Springer, 1998.
3. P. Frasconi, M. Gori, and A. Sperduti. A framework for adaptive data structures processing. IEEE Transactions on Neural Networks, 9(5):768–786, 1998.
4. K. S. Fu. Syntactic Pattern Recognition and Applications. Prentice-Hall, Englewood Cliffs, NJ, 1982.
5. R. C. Gonzalez and M. G. Thomason. Syntactic Pattern Recognition. Addison-Wesley, Reading, Massachusetts, 1978.
6. G. E. Hinton. Mapping part-whole hierarchies into connectionist networks.
Artificial Intelligence, 46:47–75, 1990. 7. S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods. Journal of Logic Programming, 19,20:629–679, 1994. 8. T. Pavlidis. Structural Pattern Recognition. Springer-Verlag, Heidelberg, 1977. 9. Tony A. Plate. Holographic reduced representations. IEEE Transactions on Neural Networks, 6(3):623–641, May 1995. 10. J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46(12):77–106, 1990. 11. R. Rosenfeld and D. S. Touretzky. Four capacity models for coarse-coded symbol memories. Technical Report CMU-CS-87-182, Carnegie Mellon, 1987. 12. R. J. Schalhoff. Pattern Recognition: Statistical, Structural and Neural Approaches. John Wiley & Sons, 1992. 13. P. Smolensky. Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46:159–216, 1990. 14. A. Sperduti and A. Starita. Supervised neural networks for the classification of structures. IEEE Transactions on Neural Networks, 8(3):714–735, 1997. 15. D. S. Touretzky. Boltzcons: Dynamic symbol structures in a connectionist network. Artificial Intellicence, 46:5–46, 1990. Lessons from Past, Current Issues, and Future Research Directions in Extracting the Knowledge Embedded in Artificial Neural Networks Alan B. Tickle, Frederic Maire, Guido Bologna, Robert Andrews, and Joachim Diederich Machine Learning Research Centre Queensland University of Technology Box 2434 GPO Brisbane Queensland 4001, Australia {ab.tickle, f.maire, g.bologna, r.andrews, j.diederich}@qut.edu.au Abstract. Active research into processes and techniques for extracting the knowledge embedded within trained artificial neural networks has continued unabated for almost ten years. Given the considerable effort invested to date, what progress has been made? What lessons have been learned? What direction should the field take from here? This paper seeks to answer these questions. The focus is primarily on techniques for extracting rule-based explanations from feed-forward ANNs since, to date, the preponderance of the effort has been expended in this arena. However the paper also briefly reviews the broadening overall agenda for ANN knowledge-elicitation. Finally the paper identifies some of the key research questions including the search for criteria for deciding in which problem domains these techniques are likely to out-perform techniques such as Inductive Decision Trees. 1 Introduction Notwithstanding the proven abilities of Artificial Neural Networks (ANNs) in problem domains such as pattern recognition and function approximation [31], [46], [47], to an end-user the modus operandi of ANNs remain something of a numerical enigma. In particular ANNs inherently lack even a rudimentary capability to explain either the process by which they arrived at a given result or, in general, the totality of “knowledge” actually embedded therein. After an investment of research effort spanning an interval which now exceeds ten years, the point has been reached where this deficiency is all but redressed. An appreciation of the dimensions of this research effort can be gauged both from recent surveys [2,3,48,49,50,51,52,53] and by simply counting the number of papers appearing on this topic at the major international conferences. Apart from revealing a rich and diverse set of approaches to solving the ANNknowledge-extraction problem, such an analysis also highlights how the research S. Wermter and R. 
Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 226–239, 2000. c Springer-Verlag Berlin Heidelberg 2000 Extracting the Knowledge Embedded in Artificial Neural Networks 227 effort has tended to coalesce around particular facets of the problem e.g. the variations in ANN types and, in particular: 1. Techniques for extracting and representing the “knowledge” of trained feedforward ANNs as sets of symbolic rules; 2. Techniques for extracting Finite State Machine (FSM) representations from Recurrent Neural Networks (RNNs); and 3. Techniques for extracting fuzzy rules. The analysis also highlights another important area of research. This is in developing techniques whereby an existing rule-base or similar a priori knowledge is used to initialise an ANN, new data is then used to train the ANN, and an updated set of (refined) rules or knowledge is extracted from the trained ANN [2]. This is the problem domain of rule-refinement [54]. The range and diversity of techniques being developed under the umbrella of ANN-knowledge-extraction underscores the fact that eliciting the knowledge embedded within trained ANNs is both an active and evolving discipline. Hence the purpose of the ensuing discussion is: 1. To review the ADT taxonomy [2,53] for classifying techniques for extracting the knowledge embedded within ANNs; 2. To identify the important lessons which have been learned so far in the area of ANN rule-extraction; 3. To utilise the ADT taxonomy as a basis for highlighting the increasingly rich and diverse range of mechanisms and techniques which have been developed to deal with the more general problem of knowledge-extraction from ANNs; and 4. To identify and discuss some of the key research questions which will have an important bearing on the direction that future development of the field will take. 2 A Review of the ADT Taxonomy for Classifying ANN-Knowledge-Extraction Techniques In 1995, Andrews, Diederich, and Tickle proposed a taxonomy (the so-called ADT taxonomy) for categorising ANN rule-extraction techniques [2]. One of the primary goals in developing this taxonomy was to provide a uniform basis for the systematic comparison of the different approaches which had appeared up to that time. In its original form the ADT taxonomy comprised a total of five primary classification criteria: 1. The expressive power (or, alternatively, the rule format) of the extracted rules; 2. The quality of the extracted rules; 3. The translucency of the view taken within the rule extraction technique of the underlying Artificial Neural Network units; 228 A.B. Tickle et al. 4. The algorithmic complexity of the rule extraction/rule refinement technique; and 5. The extent to which the underlying ANN incorporates specialised training regimes (i.e. a measure of the portability of the rule extraction technique across various ANN architectures). The authors used three basic groupings of rule formats: (i) conventional (Boolean, propositional) symbolic rules; (ii) rules based on fuzzy sets and logic; and (iii) rules expressed in first-order-logic form i.e. rules with quantifiers and variables. They also used a set of four measurements of rule quality: (a) rule accuracy (i.e. the extent to which the rule set is able to classify a set of previously unseen examples from the problem domain correctly); (b) rule fidelity (i.e. the extent to which the rule set mimics the behaviour of the ANN from which it was extracted); (c) rule consistency (i.e. 
the extent to which, under differing training sessions, the ANN generates rule sets which produce the same classifications of unseen examples); and (d) rule comprehensibility (e.g. measuring the size of the rule set in terms of the number of rules and the number of antecedents per rule). For the third criterion (i.e. the translucency criterion) the authors sought to categorise a rule extraction technique based on the granularity of the underlying ANN which was either explicitly or implicitly assumed within the rule extraction technique. In a notional spectrum of such perceived degrees of granularity, the authors used two basic delimiting points to define the extremities of the range of options i.e. decompositional where rule extraction is done at the level of individual hidden and output units within the ANN, and pedagogical where the view of the ANN is at its maximum level of granularity i.e. as a “black box”. The labelling of this latter extremity as “pedagogical” was based on the work of Craven and Shavlik who cast their rule extraction technique as a learning process [9]. However Neumann [32] has commented that this label is perhaps something of a misnomer because it is arguable that the underlying process is more about searching than it is about learning. The label “eclectic” was assigned to the mid-point of this spectrum to accommodate essentially hybrid techniques which analyse the ANN at the individual unit level but which extract rules at the global level. Whilst the ADT taxonomy was conceived in the context of categorising ANN rule-extraction techniques, the five general notions it embodies viz (a) rule format (b) quality (c) translucency (d) algorithmic complexity and (e) portability have been applied to other forms of representing the knowledge extracted from ANNs as well as to other ANN architectures and training regimes. Subsequent work has shown that not only is this taxonomy applicable to a cross-section of current techniques for extracting rules from trained feed-forward ANNs but also that the taxonomy can be adapted and extended to embrace a broader range of ANN types (e.g. recurrent neural networks) and explanation structures [52]. In particular a distinguishing characteristic of techniques for extracting Deterministic Finite State Automata (DFAs) from recurrent neural networks is that the underlying analysis is performed at the level of ensembles of neurons rather than individual neurons. To accommodate such techniques which operate at an Extracting the Knowledge Embedded in Artificial Neural Networks 229 intermediate level of granularity, it was proposed to use the label “compositional” to denote techniques which extract rules from ensembles of neurons rather than individual neurons. Related work on classification schemas for ANN rule-extraction techniques has tended to focus primarily on alternative measures for comparing the quality of the extracted rule set. For example Krishnan et al. [24] have proposed a set of three criteria for assessing rule quality. Under their terminology, rules should be (1) valid i.e. “the rules must hold regardless of the values of the unmentioned variables in the rules”, (2) maximally general i.e. “if any of the antecedents are removed, the rule should no longer be valid”, and (3) complete i.e. “all possible valid and maximally general rules must be extracted”. Similarly Healy [20] also proposes three measures of rule quality incorporating the notions of validity, consistency, and completeness. 
Under his schema, valid rules are those which are correct for the data i.e. the rule If A then B is valid if A ⇒ B is the correct inference for all data cases. Similarly, consistency means that the rules do not allow both B and not B. Completeness means that there are no unreachable conclusions.) At an overall level, the absence of a consistent measure of rule quality has complicated the task of directly comparing the performance of ANN rule-extraction techniques. 3 Seven “Take Home” Messages from ANN Rule Extraction Given the widespread use of trained feed-forward ANNs, it is therefore not surprising that this type of ANN is the one which has attracted the preponderance of attention for the purposes of providing the requisite explanation capability. Moreover, because of the widespread acceptance of symbolic rules as a vehicle for knowledge representation, the dominant form of such explanations is also symbolic rules. Hence rule-extraction from trained feed-forward ANNs now encompasses a rich and diverse range of ideas, mechanisms, and procedures. Consequently it is useful at this juncture to make some general observations and comments on what has been achieved to date. In particular, it is possible to glean at least seven main lessons of potential importance both to those who are in the process of developing knowledge-extraction techniques as well as to a prospective user of such techniques. These are as follows: 1. In the first instance it is useful to observe that the process of extracting rules from an ANN which has been trained to classify a data set from a given problem domain is different from that of the classic induction task where the corresponding set of symbolic classification-rules is gleaned directly from the data. This difference arises because it is possible to utilise the ANN to generate new data about the problem domain e.g. by the ANN as a “black box” - a motif adopted in the so-called “pedagogical” class of rule-extraction techniques. Moreover it is possible to formulate queries of the form “is this region of the input space totally classified as positive”. This is not possible by simply using the raw data. 230 A.B. Tickle et al. 2. There is also a second major difference between the process of rule-extraction from an ANN and symbolic rule-induction. This is that the former case is more appropriately described as a multi-criteria optimisation problem. Not only is the objective to extract symbolic rules from the trained ANN, the agenda is for rules which afford high levels of fidelity, accuracy, consistency, and comprehensibility. For both the ANN itself and a symbolic-induction technique applied directly to the data, the learning objective is simply for a result that generalises well, given a finite training sample. 3. There now exists a substantive body of results which confirms both the effectiveness and utility of the overall ANN rule-extraction approach as a tool for rule-learning in a diverse range of problem domains. In particular Fu [11] has shown that the combination of ANNs and rule-extraction techniques outperforms established rule-learning techniques in situations in which there is noise, inconsistent data, and uncertainty in the datasets. Moreover he has illustrated similar advantages in situations involving multicategory learning. 
Recently Fu [12,13] has also shown that a particular ANN-based technique (CFNet) is better able to learn domain rules from data sets comprising only a small fraction of domain instances in comparison with the decision-treebased rule generator system C4.5 ([36]) (Fu defines domain rules as the set of rules which are able to explain all the domain instances). While earlier results from other authors [33,54] have identified problem domains in which the extracted knowledge representation outperformed the ANN from which they were extracted, this avenue of research does not appear to have been followed-up in any substantive way. 4. Within ANN rule-extraction, there is a strong and continuing predilection towards so-called “decompositional” techniques [2,3,14,53] viz the process of first extracting rules at the individual (hidden and output) unit level within the ANN solution and then aggregating these rules to form global relationships. This is in contrast to the so-called “pedagogical” techniques which treat the ANN as a “black box” and extract global relationships between the inputs and the outputs of the ANN directly, without analysing the detailed characteristics of the underlying ANN solution. While both approaches rely heavily on heuristics to achieve tractability, recent experience [23,27,41] continues to suggest that it is easier to contrive an efficient heuristic to accomplish the key step in decompositional rule extraction (viz “weight pruning” and the systematic elimination of less relevant variables) than the corresponding step in pedagogical rule extraction (viz controlling the manner in which options in the solution space are generated). 5. No clear difference has yet emerged between the performance of specific purpose rule-extraction techniques and those techniques with more general portability [3]. Moreover whilst the utility of specific purpose techniques is attractive to an end-user, there is an on-going demand for techniques which can be applied in situations where an ANN solution already exists. 6. Computational complexity of the rule extraction process may be a limiting factor on what is achievable from such techniques. The complexity depends naturally on the format of the rules (extracting threshold rules is easy). Go- Extracting the Knowledge Embedded in Artificial Neural Networks 231 lea [19] showed that, the worst case computational complexity of extracting minimum DNF rules from trained ANNs and the complexity of extracting the same rules directly from the data are both NP-hard. In the same vein, one potentially promising approach to ANN rule-extraction focuses on extracting from single perceptrons, the best rule within a given class of rules. However, extracting the best M-of-N rule from a single-layer network has also been shown to be NP-hard whilst the task of deciding whether a perceptron is symmetric with respect to two variables is NP-complete [27]. Hence the combination of ANN learning and rule-extraction potentially involves significant additional computational cost over direct rule-learning techniques. 7. The possibility exists of significant divergence between the rules that capture the totality of knowledge embedded within the trained ANN and the set of rules that approximate the behaviour of the network in classifying the training set [50]. 
For example, Tickle [51] reported such a situation in extracting rules from a number of ANNs trained in a problem domain (the Edible Mushroom problem domain) characterised by (a) the availability of only a relatively small amount of data from the domain and (b) the fact that many of the attributes appear to be irrelevant in determining the classification of individual cases. Overall however, the lesson from this result is that it simply behoves an end user to validate any rules extracted either by using further test data from the problem domain or by having the rules scrutinised by a domain expert. 4 The Expanding Horizon of Knowledge-Extraction from ANNs Whilst rule extraction has attracted considerable attention, there have also been important continuing developments in other facets of eliciting knowledge from ANNs. For example, Saito and Nakano [38] demonstrated a technique for extracting scientific laws whereas other authors have explored the extraction of decision trees from ANNs [10] as well as utilising decision trees [22,25], propositional rules [8,35], and rough sets [4] as a means of ANN initialisation. In addition a number of authors e.g. [16,18,34,39,56] have extended previous work [14,15,16,17,33] on both the extraction of Deterministic Finite State Automata (DFAs) from recurrent ANNs as well as showing how a recurrent network can be constructed such that it acts as a Fuzzy Finite State Automata, (FFA), i.e. a recogniser for a fuzzy regular language. In parallel with the development of techniques for extracting Boolean rules from trained ANNs, corresponding techniques for extracting fuzzy rules continue to be synthesised. These are the so-called neurofuzzy systems and their adherents argue that they afford significant benefits in dealing with the inherent uncertainty in many classification problems [26,30]. As well as the extraction of knowledge from ANNs, further progress has also been made on the use of ANNs for rule refinement [35]. In essence, the process of using ANNs for rule refinement involves inserting an initial rule base (i.e. what 232 A.B. Tickle et al. might be termed prior knowledge) into an ANN, training the ANN on available datasets, and then extracting the “refined” rules [2,54]. Another potentially important development in the synthesis of knowledge-extraction techniques applicable to ANNs used in reinforcement learning [43,44,45]. In addition to the extraction of explicit, symbolic rules from ANNs designed for this problem domain, techniques have also been developed for the extraction of explicit plans (open-loop policies). In particular, the intent is to enable explicit reasoning of plans, without any a priori domain knowledge. 5 5.1 Research Questions and the Future Directions in Extracting Knowledge from ANNs A Sample of New Techniques and Ideas in Rule Extraction One of the important lines of development in the area of ANN rule-extraction is the synthesis of techniques which are not specific to a particular ANN regime but can be applied across a broad cross-section of related approaches. As indicated previously, within this area of the overall field of ANN knowledge-elicitation, the focus of attention continues to be on the so-called decompositional class of ANN rule-extraction algorithms. A key criterion for the development of a successful algorithm in this area is the identification and incorporation of a heuristic for limiting the size of the solution space which must be searched (and therefore ensuring tractability). 
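As an illustration of such a heuristic — offered only as a generic sketch, not as a rendition of any of the published algorithms cited below — the following Python fragment extracts sufficient conditions from a single unit with Boolean inputs by searching subsets of the positive weights, smallest subsets first, and discarding supersets of rules already found. All names and the toy weights are invented for the example.

    from itertools import combinations

    def unit_rules(weights, bias, max_antecedents=4):
        # Sufficient conditions for one unit with Boolean (0/1) inputs: find small
        # sets of inputs whose being true guarantees activation (weighted sum + bias > 0)
        # no matter how the remaining inputs are set. Considering only positive weights,
        # small subsets first, and skipping supersets of rules already found is the
        # heuristic that keeps the subset search small.
        pos = sorted(((w, i) for i, w in enumerate(weights) if w > 0), reverse=True)
        worst = bias + sum(w for w in weights if w < 0)   # unmentioned negative inputs set to 1
        rules = []
        for size in range(1, min(max_antecedents, len(pos)) + 1):
            for combo in combinations(pos, size):
                idx = {i for _, i in combo}
                if any(set(r) <= idx for r in rules):
                    continue                              # covered by a more general rule
                if worst + sum(w for w, _ in combo) > 0:
                    rules.append(tuple(sorted(idx)))
        return rules      # each tuple reads: "if these inputs are true, the unit fires"

    # Toy usage: a unit over five Boolean inputs.
    print(unit_rules([3.0, 2.0, 1.5, -0.5, -1.0], bias=-1.0))   # [(0,), (1, 2)]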
Recently Maire [27] and Krishnan [23] have suggested similar decompositional techniques which are applicable to feed-forward networks with Boolean inputs. In particular, Krishnan [23,24] introduced a technique (COMBO) which offers an alternative to randomly trying combinations of the weights at the individual unit level. The initial step in this process is to sort the weights and then generate combinations of all possible sizes. The combinations for any particular size are then arranged in descending order of the sum of the weights in the combination. The authors argue that not only does this technique limit the size of search space but that it did so in a manner which ensures that the extracted rules satisfy their three quality criteria of being valid, maximally general, and complete. Similarly, Maire [27] presented a method to unify M-of-N rules based on a partial order. (The M-of-N concept essentially is a means of expressing rules in the compact form: If M of the following N antecedents are true then ...). In a similar vein Alexander and Mozer [1] have proposed a template-based procedure which expedites the process of reformulating an ANN weight-vector as a symbolic M-of-N expression rather than as normal symbolic rules. (Alexander and Mozer adopt the convention n-of-m to describe their rule-set.) The authors introduce the notion of a weight template as a parameterised region of the total weight space corresponding to a given n-of-m expression, find the template that best fits the actual weights in the trained ANN, and then extract the rules directly. Other approaches include that of Setiono et al. which facilitate the process of rule extraction by pruning redundant connections in the ANN and clustering the activation values of the hidden nodes [42]. Extracting the Knowledge Embedded in Artificial Neural Networks 233 McGarry et al. have extended the rule-extraction process to include Radial Basis Function (RBF) networks [29]. In particular they undertook a comparison of the quality and complexity of the rule sets extracted from RBF networks with those extracted from multi-layer perceptron networks trained on the same data sets. Apart from providing a symbolic representation of the clusters derived from the data sets, they were also able to show that it was possible to represent accurately the boundaries of these clusters using a compact rule set. This confirms similar results reported by Andrews et al. [3]. Bologna et al. have experimented rule extraction from combinations of neural networks [5]. Symbolic rules were extracted from each single neural network and also from the global combination. Rules generated from the global system had the same representation of those generated by each network (i.e. conventional rules). The neural network model used in the experiments was IMLP. This is a particular Multi-Layer Perceptron architecture from which symbolic rules are extracted in polynomial time using an eclectic rule extraction technique [7]. It should be noted that the key step of the rule extraction process is related to a boolean minimisation problem for which the goal is to determine whether an axis-parallel hyper-plane is discriminant in a given region of the input space. As a result, fidelity of rules can be made as high as desired. 5.2 Extracting Knowledge in Other Forms Traditionally, people have been looking for rules of the form Ri ⇒ Ro , which reads if the input of the ANN is in the region Ri , then its output is in the region Ro , where regions are axis parallel hypercubes. 
Recently, a number of rule extraction methods that use polyhedra as regions have been proposed. Setiono [40] and Bologna [6] called them oblique decision rules. Ideally, a rule extraction method should first compute the best possible approximation of the true reciprocal image of a region of the output space, then depending on the application, either use directly this raw region (for software verification), or process the information to make it more comprehensible to a human user. This inversion problem is the core problem of rule extraction. Maire [28] introduced an algorithm for inverting the function realised by a feed-forward network based on the inversion of the vector-valued functions computed by each individual layer. This algorithm back-propagates polyhedral regions from the output layer, which are constraints similar those used in Validity Interval Analysis [47], back to the input layer. These regions are finite unions of polyhedra. A by-product of this backpropagation algorithm is a new rule-extraction method whose fidelity can provably be made as high as desired. The method can be applied to all feed-forward networks with continuous values, and advocates the view that a rule extraction algorithm must first try to find a high fidelity set of rules (best approximation to the true reciprocal image), then perform the necessary post-processing (simplification, translation to other rule formats, projection and visualisation), contrary to most rule extraction methods, which make dramatic approximations from the start, approximations should be postponed to the last moment. 234 A.B. Tickle et al. Previous surveys of ANN rule-extraction techniques together with the critiques presented above, illustrate that there is an increasingly rich and diverse range of schemas being employed to represent the knowledge extracted from ANNs. Nonetheless, there remains considerable scope for further exploration and development. One such area, for example, is where ANNs are used in system control and management functions. In particular a distinguishing characteristic of this application of knowledge-extraction techniques is that the output from the ANN is real-valued (continuous) data as distinct from nominal-valued output used in classification problems. A second such example is in situations where an ANN has been trained on the type of noisy time-series data which characterises the financial markets such as foreign exchanges. Giles et al. have studied such applications using recurrent neural networks and representing the extracted knowledge in the form of a deterministic finite state automata [18]. However a desired goal of the knowledgeextraction process could also be an abstract representation of the market’s dynamical behaviour which can then be used as the basis for decision-making (e.g. whether or not to buy or sell a particular stock or foreign currency). In a similar vein, one of the emerging agenda in the general area of Machine Learning is a determination of the means of representing the results of the learning process in a form which best matches the requirements of the end-user. For example Humphrey et al. [21] discuss the application of knowledge visualisation techniques in Machine Learning and demonstrate how users’ questions can be mapped onto visualisation tasks to create new graphical representations that show the flow of examples through a decision structure. 
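To make the region-oriented view of rules introduced at the beginning of this subsection more concrete, here is a minimal sketch — ours, and not an implementation of Validity Interval Analysis [47] or of the polyhedral back-propagation of [28] — that propagates an axis-parallel input box forward through a small sigmoid network using interval arithmetic. The resulting output box over-approximates the image of the region, so the corresponding Ri ⇒ Ro rule is sound, although not maximally tight; the inversion methods discussed above work in the opposite direction and can achieve higher fidelity.

    import numpy as np

    def interval_forward(layers, lo, hi):
        # Propagate the axis-parallel box [lo, hi] through a feed-forward sigmoid
        # network with interval arithmetic. The returned box over-approximates the
        # true image of the region, so "input in R_i => output in R_o" is sound.
        lo, hi = np.asarray(lo, float), np.asarray(hi, float)
        for W, b in layers:
            Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
            pre_lo = Wp @ lo + Wn @ hi + b          # smallest reachable pre-activation
            pre_hi = Wp @ hi + Wn @ lo + b          # largest reachable pre-activation
            lo = 1.0 / (1.0 + np.exp(-pre_lo))      # the sigmoid is monotonic, so it
            hi = 1.0 / (1.0 + np.exp(-pre_hi))      # maps the bounds to output bounds
        return lo, hi

    # Toy usage: a random 2-3-2 network and the input box [0, 0.2] x [0.4, 0.6].
    rng = np.random.default_rng(0)
    layers = [(rng.normal(size=(3, 2)), rng.normal(size=3)),
              (rng.normal(size=(2, 3)), rng.normal(size=2))]
    print(interval_forward(layers, [0.0, 0.4], [0.2, 0.6]))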
5.3 A Formal Model of Rule Extraction Whilst rule extraction from trained feed-forward ANNs is only one facet of the total problem of extracting the knowledge embedded within ANNs, it is nonetheless an area which continues to attract considerable attention. Golea [19] sees as one of the main challenges in this particular area, the formulation of a consistent theoretical basis for what has been, until recently, a disparate collection of empirical results. To this end he has proposed a formal model for analysing the rule-extraction problem which is an adaptation of the well-known probably approximately correct (PAC) learning model [55]. Applied in the context of extracting Boolean if-then-else rules, this model provides the basis for proving that extracting simple rules from trained feed-forward ANNs is computationally an NP-hard problem. Hence an avenue for further investigation is to determine if this (or a similar) model could be used to analyse for the computational complexity of extracting other forms of rules such as fuzzy rules and probabilistic rules as well as that of extracting Finite State Machines (FSMs) from trained recurrent neural networks. The Support Vector Machines deserve similar attention. Healy [20] also observes that research into the area of rule extraction from ANNs could benefit from the availability of a formal model of the semantics of the rules which expressed the relationship between the application data, the neural network learning model, and the extracted rules. In particular he offers the Extracting the Knowledge Embedded in Artificial Neural Networks 235 proposition that a valid and complete rule base corresponds to a continuous function. His preliminary findings suggest that “to be truly capable, a rule extraction architecture must capture the antecedents of rules with conjunctive consequents as identifiable collections of network nodes and connections”. Apart from the need for more formal models, there also appears to be a need for non-experimental ways to measure the fidelity of extracted rules. Ultimately one of the overall goals of this process is to determine under what conditions, if any, the knowledge-extraction problem is tractable. 5.4 Predicting ANN Performance In the Machine Learning research domain, Inductive Decision Trees (IDTs) are commonly used models for generating symbolic rules from datasets [36]. In this context Quinlan introduced the C4.5 and C5.0 algorithms which actually are the “gold standard” in symbolic rule extraction. In many applications the average predictive accuracy obtained by Artificial Neural Networks and C5.0 is similar, but C5.0 training time is much shorter [37]. Thus, the reader could believe that there are no benefits to using ANNs in classification problems. This is not true. Indeed, there are classification problems in which the IDT average predictive accuracy is clearly worse than those obtained by ANNs [46] and similarly there are problems for which the converse is true. So, it is important to know in which context an ANN could participate in generating better quality rules. Quinlan has analysed and compared the classification mechanism of IDTs and ANNs [37]. He has pointed out that the inherent classification mechanism of an ANN is parallel since in computing the value of each output neuron, all input neurons are used simultaneously. For IDTs, the classification mechanism is sequential as an instance is classified by following a path in the tree of testing attributes. 
Now, suppose that the goal is to learn a classification task in which each attribute is not relevant at the single level, but in which there are relevant combinations of several attributes. In this case an ANN during its learning phase will be more appropriate than an IDT to determine those attribute combinations in its internal representation. To the contrary, with only one or few relevant attributes, a decision tree will be more suitable because in an ANN the high relevance of one attribute could be hidden by the low relevance of all other attributes. Quinlan states in a conjecture that classification problems which are unsuitable for ANNs are denoted to as S-problems, whereas P-problems are those unsuitable for IDTs [37]. One important question therefore that should be answered in the future is how can we estimate whether a given classification task belongs to the P-problem or to the S-problem category. 6 Summary and Conclusion The preceding discussion has highlighted the rich diversity of mechanisms and procedures now available to extract and represent the knowledge embedded within trained Artificial Neural Networks. Whilst the field of ANN knowledgeextraction is one which continues to attract considerable interest, the preceding 236 A.B. Tickle et al. discussion has also argued the case for using an established learning model (such as PAC learning) to provide a consistent theoretical basis for what has been, until now, a disparate collection of empirical results. Furthermore, useful work remains to be done to determine under what conditions, if any, the knowledgeextraction problem is tractable, and, indeed conditions under which it is worth using Artificial Neural Networks models rather than Inductive Decision Trees. Acknowledgements The authors would like to acknowledge the constructive criticisms made by two anonymous reviewers of an earlier version of this paper. References 1. Alexander, J.A., Mozer, M.C.: Template-Based Procedures for Neural Network Interpretation. Neural Networks 12 (1999) 479–498. 2. Andrews, R., Diederich J., Tickle, A.B.: A Survey and Critique of Techniques for Extracting Rules from Trained Artificial Neural Networks. Knowledge-Based Systems 8(6) (1995) 373-389. 3. Andrews, R., Cable, R., Diederich, J., Geva, S., Golea, M., Hayward, R., Ho-Stuart, C., Tickle, A.B.: An Evaluation and Comparison of Techniques for Extracting and Refining Rules from Artificial Neural Networks. Technical report, Queensland University of Technology, Australia (1996). 4. Banerjee, M, Mitra, S., Pal, S. K.: Rough Fuzzy MLP: Knowledge Encoding and Classification. IEEE Transactions on Neural Networks 9(6) (1998) 1203–1216. 5. Bologna, G., Pellegrini, C.: Symbolic Rule Extraction from Modular Transparent Boxes. Proceedings of the Conference of Neural Networks and their Applications (NEURAP), 393–398 (1998). 6. Bologna, G., Pellegrini, C.: Rule-Extraction from the Oblique Multi-layer Perceptron. Proceedings of the Australian Conference on Neural Networks (ACNN’98), University of Queensland, Australia (1998) 260–264. 7. Bologna, G.: Symbolic Rule Extraction from the DIMLP Neural Network. Neural Hybrid Systems, S. Wermter and R. Sun (eds.), Springer Verlag (1999). 8. Carpenter, G.A., Tan, A.W.:Rule Extraction: From Neural Architecture to Symbolic Representation. Connection Science 7(1) (1995) 3–27. 9. Craven, M., Shavlik, J.W.: Using Sampling and Queries to Extract Rules From Trained Neural Networks. 
Machine Learning: Proceedings of the Eleventh International Conference (1994), Amherst MA, Morgan-Kaufmann, 73–80. 10. Craven, M.: Extracting Comprehensible Models from Trained Neural Networks. PhD Thesis, University of Wisconsin, Madison Wisconsin (1996). 11. Fu, L.M.: Rule Generation from Neural Networks. IEEE Transactions on Systems, Man and Cybernetics 28(8) (1994) 1114–1124. 12. Fu, L.M.: A Neural Network Model for Learning Domain Rules Based on its Activation Function Characteristics. IEEE Transactions on Neural Networks, 9(5) (1998) 787–795. 13. Fu, L.M.: A Neural Network for Learning Domain Rules with Precision. Proceedings of the International Joint Conference on Neural Networks (IJCNN99) (1999) (To Appear). Extracting the Knowledge Embedded in Artificial Neural Networks 237 14. Giles, C.L., Miller C.B., Chen, D., Chen, H., Sun, Z., Lee, Y.C.: Learning and Extracting Finite State Automata with Second-order Recurrent Neural Networks. Neural Computation 4 (1992) 393–405. 15. Giles, C.L., Omlin, C.W.: Rule Refinement with Recurrent Neural Networks. Proceedings of the IEEE International Conference on Neural Networks San Francisco CA (1993) 801–806. 16. Giles, C.L., Omlin, C.W.: Extraction, Insertion, and Refinement of Symbolic Rules in Dynamically Driven Recurrent Networks. Connection Science 5(3-4) (1993) 307– 328. 17. Giles, C.L., Omlin, C.W.: Rule Revision with Recurrent Networks. IEEE Transactions on Knowledge and Data Engineering 8(1) (1996) 183. 18. Giles, C.L., Lawrence, S., Tsoi, A.C.: Rule Inference for Financial Prediction using Recurrent Neural Networks. Proceedings of the IEEE/IAFE Conference on Computational Intelligence for Financial Engineering (CIFEr), IEEE Piscataway NJ (1997) 253–259. 19. Golea, M.: On the Complexity of Rule Extraction from Neural Networks and Network Querying. Proceedings of the Rule Extraction From Trained Artificial Neural Networks Workshop, Society For the Study of Artificial Intelligence and Simulation of Behavior Workshop Series (AISB’96) University of Sussex, Brighton, UK (1996) 51–59. 20. Healy, M.J.: A Topological Semantics for Rule Extraction with Neural Networks. Connection Science 11(1) (1999) 91–113. 21. Humphrey, M., Cunningham S.J., Witten I.H.: Knowledge Visualization Techniques for Machine Learning. Intelligent Data Analysis 2(4) (1998). 22. Ivanova, I., Kubat, M.: Initialisation of Neural Networks by Means of Decision Trees. Knowledge Based Systems 8(6) (1995) 333–344. 23. Krishnan, R.: A Systematic Method for Decompositional Rule Extraction From Neural Networks. Proceedings of the NIPS’96 Rule Extraction From Trained Artificial Neural Networks Workshop (1996), Queensland University of Technology 38–45. 24. Krishnan, R., Sivakumar, G., Battacharya, P.: A Search Technique for Rule Extraction from Trained Neural Networks. Pattern Recognition Letters 20 (1999) 273–280. 25. Kubat, M.: Decision Trees Can Initialize Radial-Basis Function Networks. IEEE Transactions on Neural Networks 9(5) (1998) 813–821. 26. Lin, C-T., Lee, C.S.G.: Neural Fuzzy Systems: a Neuro-Fuzzy Synergism to Intelligent Systems. Prentice-Hall Upper Saddle River NJ (1996). 27. Maire, F.: A Partial Order for the M-of-N Rule-Extraction Algorithm. IEEE Transactions on Neural Networks 8(6) (1997) 1542–1544. 28. Maire, F.: Rule-Extraction by Backpropagation of Polyhedra. Journées Francophones sur l’Apprentissage Automatique (JFA’98) (May) Arras France (1998). 29. McGarry, K., Wermter, S., MacIntyre, J.: Knowledge Extraction from Radial Basis Function Networks. 
Proceedings of the International Joint Conference on Neural Networks (IJCNN99) (1999) (To Appear). 30. Meneganti, M., Saviello, S., Tagliaferri, R.: Fuzzy Neural Networks for Classification and Detection of Anomalies. IEEE Transactions on Neural Networks 9(5) (1998) 848–861. 31. Michie, D., Spiegelhalter, D.L., Taylor C.C: Machine Learning, Neural and Statistical Classification. Hertfordshire Ellis Horwood (1994). 238 A.B. Tickle et al. 32. Neumann, J.: Classification and Evaluation of Algorithms for Rule Extraction From Artificial Neural Networks A Review. (Unpublished) Centre for Cognitive Science, University of Edinburgh (1998). 33. Omlin, C.W., Giles, C.L.: Extraction of Rules from Discrete-Time Recurrent Neural Networks. Connection Science 5(3-4) (1993) 307–336. 34. Omlin, C.W., Thornber, K., Giles, C.L.: Fuzzy Finite State Automata Can be Deterministically Encoded Into Recurrent Neural Networks. IEEE Transactions on Fuzzy Systems (To appear). 35. Optiz, D.W., Shavlik, J.W.: Dynamically Adding Symbolically Meaningful Nodes to Knowledge-Based Neural Networks. Knowledge-Based Systems 8(6) (1995) 301– 311. 36. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, California (1993). 37. Quinlan, J.R.: Comparing Connectionist and Symbolic Learning Methods. In Computational Learning Theory and Natural Learning Systems: Constraints and Prospects, ed. R. Rivest, MIT Press (1994), 445–456. 38. Saito, K., Nakano, R.: Law Discovery Using Neural Networks. Proceedings of the NIPS’96 Rule Extraction From Trained Artificial Neural Networks Workshop, Queensland University of Technology (1996) 62–69. 39. Schellhammer, I., Diederich, J., Towsey, M., Brugman, C.: Knowledge Extraction and Recurrent Neural Networks: an analysis of an Elman network trained on a natural language learning task. Queensland University of Technology, NRC Technical Report (1997) 97–151. 40. Setiono, R., Huan, L.: NeuroLinear: From Neural Networks to Oblique Decision Rules. Neurocomputing 17 (1997) 1–24. 41. Setiono, R.: Extracting Rules from Neural Networks by Pruning and Hidden Unit Splitting. Neural Computation 9 (1997) 205–225. 42. Setiono, R., Thong, J.Y.L., Yap, C.S.: Symbolic Rule Extraction from Neural Networks: An Application to Identifying Organisations Adopting IT. Information and Management 34(2) (September) (1998) 91–101. 43. Sun, R., Sessions, C.: Extracting Plans from Reinforcement Learners. Proceedings of the International Symposium on Intelligent Data Engineering and Learning (IDEAL’98) (October) Springer-Verlag (1998). 44. Sun, R., Peterson, T.: Autonomous Learning of Sequential Tasks: Experiments and Analyses. IEEE Transactions on Neural Networks, 9(6) (1998) 1217–1234. 45. Sun, R.: Knowledge Extraction from Reinforcement Learning. Proceedings of the International Joint Conference on Neural Networks (IJCNN99) (1999) (To Appear). 46. Thrun, S.B., Bala, J., Bloedorn, E., Bratko, I., Cestnik, B., Cheng, J., De Jong, K., Dzeroski, S., Fahlman, S. E., Fisher, D., Hamann, R., Kaufman, K., Keller, S., Kononenko, I., Kreuziger, J., Michalski, R.S., Mitchell, T., Pachowicz, P., Reich, Y., Vafaie, H., Van de Welde, K., Wenzel, W., Wnek, J. and Zhang, J.: The MONK’s Problems: a Performance Comparison of Different Learning Algorithms. Carnegie Mellon University, Technical report CMU-CS-91-197 (1991). 47. Thrun, S.B.: Extracting Provably Correct Rules From Artificial Neural Networks. Technical Report IAI-TR-93-5 Institut fur Informatik III Universitat Bonn (1994). 48. 
Tickle, A.B., Hayward, R., Diederich, J.: Recent Developments in Techniques for Extracting Rules from Trained Artificial Neural Networks. Herbstschule Konnektionismus (HeKonn 96) (October) Munster (1996). Extracting the Knowledge Embedded in Artificial Neural Networks 239 49. Tickle, A.B., Andrews, R., Golea, M., Diederich, J.: Rule Extraction from Trained Artificial Neural Networks. In A. Browne (Ed) Neural Network Analysis, Architectures and Applications Institute of Physics Publishing Bristol UK (1997) 61–99. 50. Tickle, A.B.,, Golea, M., Hayward, R., Diederich, J.: The Truth is in There: Current Issues in Extracting Rules from Trained Feed-Forward Artificial Neural Networks. Proceedings of the 1997 IEEE International Conference on Neural Networks (ICNN’97), 4 (1997) 2530–2534. 51. Tickle, A.B.: Machine Learning, Neural Networks and Information Security: Techniques for Extracting Rules from Trained Feed-Forward Artificial Neural Networks and their Application in an Information Security Problem Domain. PhD Dissertation, Queensland University of Technology (1997). 52. Tickle, A.B., Andrews, R., Golea, M., Diederich, J.: The Truth will come to Light: Directions and Challenges in Extracting the Knowledge Embedded within Trained Artificial Neural Networks. IEEE Transactions on Neural Networks 9(6) (1998) 1057–1068. 53. Tickle, A.B., Maire F., Bologna, G., Diederich, J.: Extracting the Knowledge Embedded within Trained Artificial Neural Networks: Defining the Agenda. Proceedings of the Third International ICSC Symposia on Intelligent Industrial Automation (IIA’99), and Soft Computing (SOCO’99) (1999) 732–738. 54. Towell, G., Shavlik, J.: The Extraction of Refined Rules From Knowledge Based Neural Networks. Machine Learning 131 (1993) 71–101. 55. Valiant, L.G.: A theory of the learnable. Communications of the ACM 27(11) (1984) 1134–1142. 56. Williams, R.J., Zipser, D.: A Learning Algorithm for Continually Running Fully Recurrent Neural Networks. Neural Computation 1(2) (1989) 270–280. Symbolic Rule Extraction from the DIMLP Neural Network Guido Bologna Machine Learning Research Centre Queensland University of Technology Box 2434 GPO Brisbane Queensland 4001, Australia bologna@fit.qut.edu.au Abstract. The interpretation of neural network responses as symbolic rules is actually a difficult task. Our first approach consists in characterising the discriminant hyper-plane frontiers. More particularly, we point out that the shape of a discriminant frontier built by a standard multilayer perceptron is related to an equation with two terms. The first one is linear, and the second is logarithmic. This equation is not sufficient to easily generate symbolic rules. So, we introduce the Discretized Interpretable Multi Layer Perceptron (DIMLP) model that is a more constrained multi-layer architecture. From this special network, rules are extracted in polynomial time and continuous attributes do not need to be binary transformed. We apply DIMLP to three applications of the public domain in which it gives better average predictive accuracy than C4.5 algorithm and discuss rule quality. 1 Introduction The lack of validation tools is often one of the reasons for not using neural systems in practice. For instance, physicians cannot trust a diagnosis system without explanation of its responses. The difficulty of justification of neural network responses is due to its distributed internal representation. 
More particularly, the overall network decision mechanism is represented over a space of connection weights and activation values which has an exponential size, so in practice it cannot be entirely explored. Researchers in the field of symbolic rule extraction from neural networks have proposed several algorithms. They are concisely described by the taxonomy proposed by Andrews et al. [1]. Briefly, symbolic rule extraction methods belong to three categories: pedagogical, decompositional, and eclectic. In the pedagogical approach symbolic rules are generated by an inductive symbolic algorithm which globally analyses expressions related to the input and the output layer. In the decompositional approach, symbolic rules are determined by analysing the weights at the level of each hidden neuron and each output neuron. Finally, the eclectic approach is a combination of the pedagogical and decompositional strategies. Inductive Decision Trees (IDT) such as C4.5 [2] and CART [4] are popular algorithms for data classification and symbolic rule extraction. In the Machine Learning community it is well known that in many applications neural networks and inductive decision trees reach similar predictive accuracy. However, Quinlan pointed out that a special category of problems is more favourable to neural networks and another one to decision trees [3]. In this chapter we investigate the shape of the discriminant frontiers built by a multi-layer perceptron in order to choose a strategy for the extraction of symbolic rules. As a result, we introduce the Discretized Interpretable Multi Layer Perceptron (DIMLP) model. In this context, rules are directly generated by determining discriminant hyper-plane frontiers. Moreover, a noticeable feature of our rule extraction algorithm is that it scales in polynomial time with respect to the number of hidden neurons and to the number of examples. We compare DIMLP, standard MLP networks and C4.5 decision trees on three classification problems from the public domain. The first one is the well-known Monk-2, the second one is related to the Sonar dataset, and the last one concerns the classification of handwritten digits.

2 Discriminant Frontier Determination in a Standard Multi-layer Perceptron

The problem of discriminant frontier determination in the general case of multi-layer perceptrons is very complex. Hence, we simplify our investigations by first considering only two different classes and at most one hidden layer of neurons. As a notation, the symbols x_i, h_j, y_k will denote input neurons, hidden neurons, and output neurons, respectively (bias virtual neurons are also used). Finally, the activation function of hidden and output neurons is the sigmoid function σ given as

σ(x) = 1 / (1 + e^{−x}) .   (1)

Without loss of generality we assume that an instance is classified as class1 when y_1 > y_2 and as class2 otherwise, with y_1 and y_2 corresponding to the two output neurons. It follows that the discriminant frontier between the two classes lies in the set of points where y_1 = y_2.

2.1 One Hidden Neuron

In this case the outputs of the network are

y_1 = σ( v_{10} + v_{11} · σ( Σ_i w_{1i} x_i ) )   (2)

and

y_2 = σ( v_{20} + v_{21} · σ( Σ_i w_{1i} x_i ) ) ;   (3)

where w_{1i} are the weights between the input layer and the hidden neuron, and v_{1i}, v_{2j} are the weights between the hidden neuron and the output layer.
As the sigmoid is a monotonic function, the set of points where y_1 = y_2 is determined by

(v_{10} − v_{20}) + (v_{11} − v_{21}) · σ( Σ_i w_{1i} x_i ) = 0 .   (4)

At this point it is important to note that equation (4) has a solution if and only if

0 < −(v_{10} − v_{20}) / (v_{11} − v_{21}) < 1  and  v_{11} ≠ v_{21} .   (5)

Equation (4) is equivalent to

(v_{11} − v_{21}) = −(v_{10} − v_{20}) · ( 1 + e^{−Σ_i w_{1i} x_i} ) .   (6)

Further,

( (v_{11} − v_{21}) + (v_{10} − v_{20}) ) / (v_{10} − v_{20}) = −e^{−Σ_i w_{1i} x_i} .   (7)

Assuming that (5) holds, we obtain

Σ_i w_{1i} x_i + ln( −1 − (v_{11} − v_{21}) / (v_{10} − v_{20}) ) = 0 .   (8)

In the simple perceptron model a hyper-plane can be shifted with respect to the origin by the bias unit. In this case a discriminant hyper-plane is also shifted by the weights between the hidden neuron and the output neurons (the logarithmic term).

2.2 Two Hidden Neurons

Let us consider an MLP with two hidden neurons denoted h_1 and h_2. In this case y_1 and y_2 are given as

y_1 = σ( Σ_i v_{1i} h_i )   (9)

and

y_2 = σ( Σ_i v_{2i} h_i ) .   (10)

We define ΔV_i as the difference between the weights v_{1i} and v_{2i} (ΔV_0 designates the difference of the weights related to the bias neuron). Hence, y_1 = y_2 is equivalent to

ΔV_0 + ΔV_1 h_1 + ΔV_2 h_2 = 0 .   (11)

Equation (11) is equivalent to

h_1 = −(ΔV_0 + ΔV_2 h_2) / ΔV_1 .   (12)

Equation (12) has a solution if and only if

0 < −(ΔV_0 + ΔV_2 h_2) / ΔV_1 < 1  and  ΔV_1 ≠ 0 .   (13)

Rewriting (12) leads to

1 + e^{−Σ_i w_{1i} x_i} = −ΔV_1 / (ΔV_0 + ΔV_2 h_2) .   (14)

Finally, applying the logarithm we obtain

Σ_i w_{1i} x_i + ln( −1 − ΔV_1 / (ΔV_0 + ΔV_2 h_2) ) = 0 .   (15)

Equation (15) is linear if h_2 is constant. So, when h_2 is saturated (the values of the sigmoid vary very slowly below −3 and above 3) we again have almost linear hyper-plane frontiers. Otherwise, when the logarithmic term varies, the linear frontier is curved.

2.3 More than Two Hidden Neurons

We suppose that we have one hidden layer with l hidden neurons and two output neurons. The hypothesis y_1 = y_2 is equivalent to

Σ_{k=0}^{l} ΔV_k h_k = 0 .   (16)

After several steps (cf. eq. (15)) we obtain

Σ_j w_{ij} x_j + ln( −1 − ΔV_i / Σ_{k=0, k≠i}^{l} ΔV_k h_k ) = 0 ;   (17)

with the conditions

0 < −( Σ_{k=0, k≠i}^{l} ΔV_k h_k ) / ΔV_i < 1  and  ΔV_i ≠ 0 .   (18)

We denote the logarithmic term of (17) as the shift-curving term, while the other one is called the hyper-plane term.

2.4 Discussion on Discriminant Frontiers

The weighted sum related to a hidden neuron represents a possible discriminant hyper-plane frontier that will be shifted and/or curved by the weights between the hidden layer and the output layer. In other terms, a neuron defines a virtual hyper-plane frontier which may turn into a real hyper-plane frontier. The bending of a frontier appears when the shift-curving term varies. A linear approximation of MLP frontiers is given in [5]. At this point we may wonder about the creation of a special multi-layer perceptron model in which there is neither a curving phenomenon nor a shifting factor. This kind of network would have only oblique hyper-plane frontiers. However, when many oblique frontiers are built, the rules are not understandable (their antecedents would be linear combinations of the input attributes). So, the question arising is whether it is possible to define a multi-layer perceptron model with hyper-plane frontiers parallel to the axes of the input variables. This is precisely the matter of the next section.

3 The IMLP Model

3.1 The Architecture

The IMLP network [6] has an input layer, one or two hidden layers and an output layer (see Figure 1).
In this model, each neuron of the first hidden layer is connected to only one input neuron and to the first virtual bias unit. All the other layers are fully connected.

Fig. 1. An IMLP network with two hidden layers. Each neuron of the first hidden layer is connected to only one input neuron and to the first bias unit. All the other layers are fully connected.

As we will see in Section 3.2, this special connectivity pattern is the basic idea that will permit us to extract symbolic rules. The output h_i of the i-th neuron of the first hidden layer is given by the threshold function:

h_i = 1 if Σ_j w_{ij} x_j > 0, and h_i = 0 otherwise.   (19)

When the model has two hidden layers, the second hidden layer responses are not given by the threshold function but rather by the sigmoid function. Hence, the IMLP architecture differs from the standard MLP architecture in three main ways:
1. Each neuron in the first hidden layer is connected to only one input neuron.
2. The activation function used by the neurons of the first hidden layer is the threshold function instead of the sigmoid function.
3. The size of the first hidden layer is determined experimentally as a multiple of the size of the input layer (thus, during the training phase all input neurons have the same opportunity to affect network responses).

3.2 Symbolic Rule Extraction

The key idea for symbolic rule extraction is hyper-plane determination. As we will see in the next paragraph, each hyper-plane expression containing only one input variable represents a possible antecedent in a symbolic rule.

Examples with One Hyper-Plane. Figure 2 shows an example of an IMLP network with only one hidden layer and only one hidden neuron. The weighted sum of inputs and weights related to a hidden neuron represents a virtual hyper-plane frontier that will be effective depending on the weights related to successive layers (v_0 and v_1 in Figure 2). A real hyper-plane frontier will be located at −w_0/w_1. In fact, the points where h_1 = 1 are defined by w_1 x_1 + w_0 ≥ 0, which is equivalent to x_1 ≥ −w_0/w_1. The points where h_1 = 0 are defined by w_1 x_1 + w_0 < 0, which is equivalent to x_1 < −w_0/w_1. The weights above those between the input layer and the first hidden layer determine whether a hyper-plane frontier is virtual or not. Assuming that we have class black circle when y_1 ≥ 0.5 and white square otherwise, we obtain a real hyper-plane frontier when

v_0 + v_1 ≥ 0  (h_1 = 1)  and  v_0 < 0  (h_1 = 0) .   (20)

If (20) holds we obtain two symbolic rules:
1. If (x_1 ≥ −w_0/w_1) ⇒ black circle.
2. If (x_1 < −w_0/w_1) ⇒ white square.

Fig. 2. The concept of virtual hyper-plane (dashed line) and real hyper-plane (plain line) in an IMLP network. The virtual hyper-plane located at −w_0/w_1 is effective when v_0 and v_1 are in a particular configuration (cf. (20)).

An Example with Two Perpendicular Frontiers. We create two perpendicular hyper-plane frontiers by connecting h_1 to x_1 and h_2 to x_2. The IMLP network and the corresponding input space partition are shown in Figure 3.

Fig. 3.
An example with perpendicular hyper-plane frontiers (plain lines) in an IMLP network with two hidden neurons. Symbolic Rule Extraction from the DIMLP Neural Network 247 A configuration of weights v0 , v1 , and v2 which enables h1 and h2 to define two real hyper-plane frontiers is  v0 < 0    v0 + v1 < 0 . (21)  v0 + v2 < 0   v0 + v1 + v2 ≥ 0 With respect to the hidden as     neurons h1 and h2 the input space partition is given h¯1 h¯2 h¯1 h2 h1 h¯2    h1 h 2 =⇒ white square =⇒ white square ; =⇒ white square =⇒ black circle (22) where a bar denotes the negation, and multiplication denotes conjunction. This is equivalent to  (h¯1 h¯2 + h¯1 h2 + h1 h¯2 ) =⇒ white square ; (23) h1 h2 =⇒ black circle where the addition denotes the disjunction. Using a Karnaugh map (see figure 4) we simplify (23) into   h¯1 =⇒ white square h¯2 =⇒ white square . (24)  h1 h2 =⇒ black circle Finally, transcribing (24) into the input space we obtain three symbolic rules. ) =⇒ white square. 1. If (x1 < − ww10 1 ) =⇒ white square. 2. If (x2 < − ww20 2 3. If (x1 ≥ − ww10 ) and (x2 ≥ − ww20 ) =⇒ black circle. 1 2 Once again, the real hyper-plane frontiers are not shifted with respect to the virtual hyper-plane frontiers. In general in an IMLP network virtual hyper-planes are not shifted because threshold functions related to the neurons of the first hidden layer create subregions of the input space in which network responses are constant. By contrary, in a standard multi-layer perceptron even if two input examples are as close as we desire they will give a slightly different network response. h2 h1 Fig. 4. A Karnaugh map characterising the input space partition of the IMLP network illustrated in figure 3. 248 G. Bologna Rule Extraction in the General Case. The key idea behind rule extraction is the simplification of boolean expressions formed by binary values of the first hidden layer and the classification provided by the output layer. With respect to the taxonomy introduced by Andrews at al. [1] our rule extraction algorithm is eclectic. Weights between the input layer and the first hidden layer achieve a binary transformation of each input attribute. Moreover, at the level of the first hidden layer, input signals are not combined together. So, it is as if we had the input layer again. Now if we apply a rule extraction method to analyse the associations between the input layer (in this case the first hidden layer) and the output layer it is a pedagogical method. Now, as we use the weights between the input layer and the first hidden layer to determine rule antecedents we obtain a pedagogical and decompositional rule extraction technique. Hence, it is eclectic. The general algorithm is given as: 1. Propagate all available examples in the network. 2. Form logical expressions composed of boolean values of the first hidden layer and the class given by the output layer. 3. Simplify all boolean expressions by a covering algorithm. 4. Rewrite simplified expressions in terms of input attributes. In the third step of the general algorithm, Karnaugh maps cannot be used for more than 6 variables. The larger cost operation of the rule extraction algorithm is mainly in the covering phase. If we search for the minimal covering, this calculation is exponential with the number of neurons of the first hidden layer and the number of examples. However, using heuristic covering algorithms as Espresso [7] or C4.5 [2] we obtain quasi-optimal solutions in polynomial time. 
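The four steps above can be traced on the perpendicular-frontier example of figure 3. The sketch below is a minimal illustration rather than the author's implementation: the first-layer weights (w10, w1, w20, w2) are invented, only v0, v1, v2 are the values quoted in the figure, and the covering step is a trivial single-literal check standing in for Espresso or C4.5; it recovers the three rules derived above.

```python
import numpy as np

# Toy IMLP from the two-perpendicular-frontier example of figure 3.
w1, w10 = 1.0, -0.3            # h1 tests x1 >= -w10/w1 = 0.3 (invented weights)
w2, w20 = 1.0, -0.6            # h2 tests x2 >= -w20/w2 = 0.6 (invented weights)
v0, v1, v2 = -10.0, 5.0, 6.0   # output weights quoted in figure 3, cf. eq. (21)

def first_hidden(x):
    """Step 1: threshold units of the first hidden layer, eq. (19)."""
    return int(w1 * x[0] + w10 > 0), int(w2 * x[1] + w20 > 0)

def predict(h):
    """Class 'circle' when the output weighted sum is non-negative."""
    return 'circle' if v0 + v1 * h[0] + v2 * h[1] >= 0 else 'square'

# Step 2: propagate examples and record the boolean expression (h1, h2) -> class.
grid = [(a, b) for a in np.linspace(0, 1, 11) for b in np.linspace(0, 1, 11)]
table = {first_hidden(x): predict(first_hidden(x)) for x in grid}

# Step 3 (toy covering): keep a single literal h_i = 0 as a rule whenever it
# alone determines the class; this reproduces the Karnaugh simplification (24).
rules = []
for i in range(2):
    classes = {c for h, c in table.items() if h[i] == 0}
    if len(classes) == 1:
        rules.append((f"h{i+1} = 0", classes.pop()))
rules.append(("h1 = 1 and h2 = 1", table[(1, 1)]))

# Step 4: rewrite antecedents in terms of the input thresholds -w_i0 / w_i.
print("thresholds:", {"x1": -w10 / w1, "x2": -w20 / w2})
for antecedent, cls in rules:
    print("If", antecedent, "=>", cls)
```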
It is always possible to generate rules that exactly mimic IMLP responses, nevertheless, if the number of generated rules is too large a partial covering may give a more comprehensible rule set. 3.3 Learning The use of threshold functions in the first hidden layers makes the error function non-differentiable. For the learning phase of IMLP networks we use an adapted back-propagation algorithm [8] and/or a simulated annealing algorithm [9]. The key idea of the former algorithm resides in gradually increasing the slope of the sigmoidal functions to obtain a network with threshold functions, whereas for simulated annealing the error function does not require to be differentiable. 4 The DIMLP Model The Discretized Interpretable Multi Layer Perceptron (DIMLP) model is a generalisation of the IMLP model. Neurons of the first hidden layer transform continuous inputs into discrete values instead of binary ones. Symbolic Rule Extraction from the DIMLP Neural Network 4.1 249 The Architecture The architectures of DIMLP and IMLP are fundamentally the same ones. The only difference resides in the activation function of the neurons of the first hidden layer. In DIMLP networks this activation function is a staircase function as shown by figure 5. So, continuous input values are transformed into several discrete levels. The staircase function is given as  If x < Rmin   S(x) = σ(Rmin ) h i Rmax −Rmin x−Rmin )) If Rmin ≤ x ≤ Rmax (25) S(x) = σ(Rmin + d · Rmax −Rmin · ( d   S(x) = σ(Rmax ) If x > Rmax where σ is the sigmoid function, Rmin and Rmax forms a range where the staircase is adapted to the sigmoid function, d is the number of discretized levels, and ”[ ]” denotes the integer part notation. Fig. 5. Discretization of the sigmoid function. 4.2 Learning There are two distinct training phases. During the first one, DIMLP has sigmoid functions in the first hidden layer (and also in the other layers) and the model is trained with a back-propagation algorithm. During the second training phase, all weights are frozen and a staircase function is adapted by simulated annealing to approximate the response of each neuron of the first hidden layer. 4.3 Rule Extraction From a geometrical point of view, each neuron of the first hidden layer virtually defines a number of parallel hyper-plane frontiers that is equal to the number of stairs of its staircase function. Let us clarify this situation by the example illustrated in figure 6. In practice, we show how to transform a DIMLP network into an IMLP network by creating for each hidden neuron of the former network as many hidden neurons as the number of stairs in the latter network. As we can extract 250 G. Bologna y S(x) y h v1 h v4 v3 v2-v1 h1 s1 v3-v2 h2 s2 v4-v3 h3 s3 h4 s4 1 v2 v1 0 x s1 s2 s3 s4 x DIMLP -1 -1 -1 -1 1 x IMLP Fig. 6. Transformation of an DIMLP network into an IMLP network. Note that in the IMLP network one more hidden layer is created. symbolic rules from an IMLP network, we can also extract rules from a DIMLP network. Rule extraction is performed by the same algorithm given in paragraph 3.2. Nevertheless, this time a covering algorithm is applied to discrete attributes instead of binary ones. 4.4 DIMLP Versus IMLP The DIMLP architecture is more compact than the IMLP architecture. To be more clear let us consider an IMLP network with 10 input neurons 100 neurons in the first hidden layer, 5 neurons in the second hidden layer, and two neurons in the output layer (10-100-5-2). 
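Two small helpers make the comparison that follows easy to reproduce. The staircase of eq. (25) is reconstructed here with the integer part taken as a floor, and Rmin, Rmax and d are merely illustrative defaults; the connection counter assumes the (D)IMLP pattern in which only the first hidden layer is sparsely connected (one input plus the bias per neuron) and every later layer is fully connected with a bias unit.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def staircase(x, r_min=-5.0, r_max=5.0, d=10):
    """Staircase activation of eq. (25): the sigmoid is sampled on d discrete
    levels inside [r_min, r_max] and clamped to sigmoid(r_min) / sigmoid(r_max)
    outside that range."""
    x = np.asarray(x, dtype=float)
    step = (r_max - r_min) / d
    level = np.floor(d * (x - r_min) / (r_max - r_min))
    inside = sigmoid(r_min + step * level)
    return np.where(x < r_min, sigmoid(r_min),
                    np.where(x > r_max, sigmoid(r_max), inside))

def n_connections(layers):
    """Connections of a (D)IMLP: 2 per first-hidden neuron (one input + bias),
    then (previous size + 1) * next size for the fully connected layers."""
    return layers[1] * 2 + sum((layers[i] + 1) * layers[i + 1]
                               for i in range(1, len(layers) - 1))

print(n_connections([10, 100, 5, 2]))   # IMLP 10-100-5-2
print(n_connections([10, 10, 5, 2]))    # DIMLP 10-10-5-2 with 10 stairs
```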
Therefore, we have defined 10 virtual hyper-planes for each input variable. This IMLP network has 100·2+(100+1)·5+(5+1)·2 = 717 connections. Now, imagine that we define a 10-10-5-2 DIMLP network with 10 stairs for each staircase function. So, for each input variable we have again defined 10 virtual hyper-planes. However, the number of connections is smaller. We have 10 · 2 + (10 + 1) · 5 + (5 + 1) · 2 = 87 connections. 5 5.1 Experiments Datasets The three datasets used for the applications are: 1. Monk-2 [10]; 2. Sonar [11]; 3. Pen Based Handwritten Digits [12]. All datasets are obtainable from the University of California-Irvine data repository for Machine Learning (via anonymous ftp from ftp.ics.uci.edu). The summary of these datasets, their representations, and how each dataset is used in experiments are given below. Symbolic Rule Extraction from the DIMLP Neural Network 251 Monk-2. The dataset consists of 432 examples belonging to two classes (142 cases for the Monk class and 290 cases for the non-Monk class). Each example is described by 6 discrete attributes that have been transformed into 17 binary attributes. The training and testing sets have been fixed and contain 169 and 263 examples, respectively. Symbolic rules generating the dataset are defined by rules of the type: If exactly two between 6 (discrete) attributes have the first possible value then the class is Monk. Otherwise; the class is non-Monk. The positive ”compact” rule gives rise to 15 symbolic rules having 6 antecedents. Sonar. The dataset contains 208 examples described by 60 continuous attributes in the range 0.0 to 1.0. Examples belong to the class cylinder (111) and rock (97). The training and testing sets have been fixed. Each set contains 104 examples. Pen Based Handwritten Digits. The dataset contains 10992 examples. Each example is described by 16 integer attributes in the range 0 to 100. There are 10 classes, each corresponding to one possible digit. The frequency of each class is between 9% and 11%. The training and testing sets have been fixed and contain 7494 and 3498 examples, respectively. 5.2 Neural Architectures and C4.5 Parameters Monk-2. DIMLP: 17-17-10-2 (number of stairs in the staircase function: 2); the training phase is stopped when all training examples are learned. For the MLP architecture and C4.5 parameters see [10]. Sonar. DIMLP: 60-60-14-2 (number of stairs in the staircase function: 150). MLP: 60-12-2. The training phase of neural networks is stopped when all training examples are learned. For C4.5 parameters, m and c are set to 1 and 99, respectively5 . Pen Based Handwritten Digits. DIMLP: 16-48-40-10 (number of stairs in the staircase function: 150). MLP: 16-40-10. The training phase of neural networks is stopped when more than 99.6% of training examples are learned. Using previous C4.5 parameters the average accuracy obtained on training examples is 99.5%. 5.3 Results of the Monk-2 Problem Average predictive accuracies are given in table 1. The predictive accuracy obtained by C4.5 is only 64.8% [10]. Thrun reports that the majority of inductive 5 With these values the obtained average predictive accuracy is slightly better than those obtained using default parameters. 252 G. Bologna Table 1. Average predictive accuracy on 10 trials with fixed training and testing sets. DIMLP MLP 100% C4.5 100% 64.8% symbolic algorithms gives worse predictive accuracies than neural networks in the Monk-2 problem. 
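The Monk-2 target concept quoted in Section 5.1 is easy to reproduce, which helps explain the gap discussed here. The sketch below assumes the standard UCI attribute arities (3, 3, 2, 3, 4, 2), which are consistent with the 432 cases and 17 binary inputs mentioned above, and confirms the 142/290 class split and the 15 six-antecedent rules.

```python
from itertools import combinations, product

ARITIES = (3, 3, 2, 3, 4, 2)          # assumed UCI arities: 432 cases, 17 bits

def is_monk(case):
    """Monk-2 target: exactly two of the six attributes take their first value."""
    return sum(1 for v in case if v == 1) == 2

cases = list(product(*(range(1, a + 1) for a in ARITIES)))
monk = [c for c in cases if is_monk(c)]
print(len(cases), len(monk), len(cases) - len(monk))     # 432 142 290

# The positive concept unfolds into C(6, 2) = 15 conjunctive rules with six
# antecedents each: a_i = 1, a_j = 1 and a_k != 1 for the remaining four.
print(len(list(combinations(range(6), 2))))              # 15
```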
The average predictive accuracies obtained by algorithms such as ID3, ID5R, AQR, CN2 and Prism are 69.1%, 69.2%, 79.7%, 69.0% and 72.7%, respectively [10]. The rules extracted from DIMLP for class Monk are exactly the ones that generate the dataset, whereas the rule set produced by C4.5 is notably larger.

5.4 Results of the Sonar Problem

Average predictive accuracies are shown in table 2. The average number of rules and the average number of antecedents per rule are given in table 3. The best average accuracy is obtained by standard multi-layer perceptrons. Moreover, DIMLP networks are more accurate than C4.5 decision trees on average. However, fewer rules and antecedents per rule are generated by C4.5.

Table 2. Average predictive accuracy on 10 trials with fixed training and testing sets.

           DIMLP    MLP      C4.5
           89.1%    90.5%    76.3%

Table 3. Average number of rules and average number of antecedents per rule on 10 trials with fixed training and testing sets.

                     DIMLP    C4.5
        rules        24.7     6.9
        ant./rule    3.6      2.7

5.5 Results of the Pen Based Handwritten Digits Problem

Table 4 gives the average predictive accuracies. Table 5 illustrates the average number of rules and the average number of antecedents per rule. As in the Sonar problem, DIMLP networks are more accurate than C4.5 decision trees on average. Again, fewer rules and antecedents per rule are generated by C4.5.

Table 4. Average predictive accuracy on 10 trials with fixed training and testing sets.

           DIMLP    MLP      C4.5
           96.7%    96.7%    93.2%

Table 5. Average number of rules and average number of antecedents per rule on 10 trials with fixed training and testing sets.

                     DIMLP    C4.5
        rules        251.9    115.3
        ant./rule    6.0      6.2

5.6 Discussion of Results

In all three classification problems DIMLP networks were more accurate than C4.5 decision trees. The reason for this difference could reside in the fact that the inherent classification mechanism of neural networks is parallel, whereas that of decision trees is sequential. Therefore, during the training phase a decision tree may miss rules involving multiple attributes which are weakly predictive separately but become strongly predictive in combination. On the other hand, a neural network may fail to discern a strongly relevant attribute among several irrelevant ones. As a consequence, we speculate that for these three classification problems, and especially for the first and the second, the inherent multi-variate search technique of neural networks is better suited than the inherent uni-variate search technique of decision trees. Let us clarify this conjecture with the Monk-2 dataset. Recall that this classification problem is generated by 15 symbolic rules having 6 antecedents. We would like to estimate the probability of finding one of these rules using a uni-variate search algorithm. Let us define Ai as "find antecedent ai in a given rule" and P(A1, ..., A6) as the probability of finding the 6 correct antecedents of a rule. We suppose P(Ai) < α < 1, with α ∈ ℝ. If there is a subset of k antecedents ai1, ..., aik for which Ai1, ..., Aik are independent6, then P(A1, ..., A6) < α^k. It is worth noting that the larger k, the smaller the probability of finding one correct rule. Hence, for a uni-variate search algorithm the probability of finding one correct rule in the Monk-2 problem could be very small if k is close to 6.
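A worked instance of this bound, with an illustrative α that is not taken from the chapter, shows how quickly the chance of hitting a complete rule collapses:

```latex
P(A_1,\dots,A_6) \;\le\; P(A_{i_1},\dots,A_{i_k})
                 \;=\; \prod_{j=1}^{k} P(A_{i_j})
                 \;<\; \alpha^{k},
\qquad \text{e.g. } \alpha = 0.7,\; k = 6 \;\Rightarrow\; \alpha^{k} \approx 0.118 .
```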
6 Conclusion The use of staircase activation functions in the first hidden layer simplifies the determination of discriminant hyper-plane expressions. When only one neuron of 6 Imagine for instance the black and white classes of a chess board. 254 G. Bologna the first hidden layer is connected to an input neuron, discriminant hyper-plane expressions correspond to antecedents of symbolic rules. On three datasets DIMLP has shown to be close to standard MLP predictive accuracy with the advantage of being directly interpretable with symbolic rules. Moreover, on the same datasets DIMLP is more accurate than C4.5, but it also creates on the two last applications more rules and more antecedents. Finally, DIMLP represents an improvement with respect to IMLP as it has a more powerful internal representation. Acknowledgement This research is funded by a fellowship for young researchers provided by the Swiss National Science Foundation. References 1. Andrews R., Diederich J., Tickle A.B.: Survey and Critique of Techniques for Extracting Rules from Trained Artificial Neural Networks. Knowledge-Based Systems, vol. 8, no. 6, 373–389 (1995). 2. Quinlan J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993). 3. Quinlan J.R.: Comparing Connectionist and Symbolic Learning Methods. In Computational Learning Theory and Natural Learning, R. Rivest (eds.), 445–456 (1994). 4. Breiman L., Friedmann J.H., Olshen R.A., Stone J.: Classification and Regression Trees. Wadsworth and Brooks, Monterey, California (1984). 5. Maire F.: Rule Extraction by Backpropagation of Polyhedra. Neural Networks (12) 4–5, 717–725 (1999). 6. Bologna G.: Rule Extraction from the IMLP Neural Network: a Comparative Study. In Proceedings of the Workshop of Rule Extraction from Trained Artificial Neural Networks (after the Neural Information Processing Conference), 13–19 (1996). 7. Rudel R., Sangiovanni-Vincentelli A.: Espresso-MV: Algorithms for MultipleValued Logic Minimisation. In Proceedings of the Custom International Circuit Conference (CICC’85), Portland, 230–234 (1985). 8. Corwin E., Logar A., Oldham W.: An Iterative Method for Training Multi-layer Networks with Threshold Functions. In IEEE Transactions on Neural Networks Journal, vol 5, no 3, 507–508 (1994). 9. Aarts E.H.L., Laarhoven P.J.M.: Simulated Annealing: Theory and Applications. Kluwer Academic (1987). 10. Thrun S.B.: The Monk’s Problems: a Performance Comparison of Different Learning Algorithms. Technical Report, Carnegie Mellon University, CMU-CS-91-197 (1991). 11. Gorman R.P., Sejnowski T.J.: Analysis of Hidden Units in a Layered Network Trained to Classify Sonar Targets. Neural Networks, 1(1) 75–88 (1988). 12. Alimoglu F., E. Alpaydin E.: Combining Multiple Representations and Classifiers for Pen-based Handwritten Digit Recognition. In Proceedings of the Fourth International Conference on Document Analysis and Recognition (ICDAR 97), Ulm, Germany (1997). Understanding State Space Organization in Recurrent Neural Networks with Iterative Function Systems Dynamics Peter Tiňo1,2 , Georg Dorffner1,3 , and Christian Schittenkopf1 1 2 Austrian Research Institute for Artificial Intelligence, Schottengasse 3, A-1010 Vienna, Austria {petert, georg, chris}@ai.univie.ac.at Department of Computer Science and Engineering, Slovak University of Technology, Ilkovicova 3, 812 19 Bratislava, Slovakia 3 Dept. of Medical Cybernetics and Artificial Intelligence, University of Vienna, Freyung 6, A-1010 Vienna, Austria Abstract. 
We study a novel recurrent network architecture with dynamics of iterative function systems used in chaos game representations of DNA sequences [16,11]. We show that such networks code the temporal and statistical structure of input sequences in a strict mathematical sense: generalized dimensions of network states are in direct correspondence with statistical properties of input sequences expressed via generalized Rényi entropy spectra. We also argue and experimentally illustrate that the commonly used heuristic of finite state machine extraction by network state space quantization corresponds in this case to variable memory length Markov model construction. 1 Introduction The correspondence between iterative function systems (IFS) [1] and recurrent neural networks (RNNs) has been recognized for some time [13,28]. Because of the non-linear nature of RNN dynamics a deeper insight into RNN state space structure has been lacking. Also, even though there is a strong empirical evidence supporting usefulness of extracting finite state machines from recurrent networks trained on symbolic sequences [23,7], we do not have a deeper understanding of what the machines actually represent. We address these issues in the context of a novel recurrent network architecture, that we call iterative function system network1 (IFSN). Dynamics of IFSNs corresponds to iterative function systems used in chaos game representations of symbolic sequences [16,11]. Using tools from multifractal theory and statistical 1 Recently, we discovered that Tabor [28] independently investigated similar types of recurrent networks. However, while we are mainly interested in learning and representational issues in recurrent networks with IFS dynamics, Tabor’s view is more general, with an emphasis on metric relations between the network representations of various forms of automata. S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 255–269, 2000. c Springer-Verlag Berlin Heidelberg 2000 256 P. Tiňo, G. Dorffner, and C. Schittenkopf mechanics we establish a rigorous relationship between statistical properties of sequences driving the network input and scaling behavior of IFSN states. We also analyse the structure of the IFSN state space and interpret the state space quantization leading to machine extraction in a Markovian context. We finish the paper by a detailed study of finite state machine extraction from IFSNs driven by the chaotic Feigenbaum sequence. 2 Formal Definitions Consider a finite alphabet A = {1, 2, ..., A}. The sets of all finite2 and infinite sequences over A are denoted by A+ and Aω respectively. The set of all sequences consisting of a finite, or an infinite number of symbols from A is then A∞ = A+ ∪ Aω . The set of all sequences over A with exactly n symbols (i.e. of length n) is denoted by An . Let S = s1 s2 ... ∈ A∞ and i ≤ j. By Sij we denote the string si si+1 ...sj , with Sii = si . 2.1 Geometric Representations of Symbolic Sequence Structure In this section we describe iterative function systems (IFSs) [1] acting on the N -dimensional unit hypercube X = [0, 1]N , where3 N = ⌈log2 A⌉. To keep the notation simple, we slightly abuse mathematical notation and, depending on the context, regard the symbols 1, 2, ..., A, as integers, or as referring to maps on X. The maps i = 1, 2, ..., A, constituting the IFS are affine contractions i(x) = kx + (1 − k)ti , ti ∈ {0, 1}N , ti 6= tj for i 6= j, (1) 1 2 ]. 
with contraction coefficients k ∈ (0, The attractor of the IFS (1) is the unique set K ⊆ X, known as the Sierpinski SA sponge [12], for which K = i=1 i(K) [1]. For a string u = u1 u2 ...un ∈ An and a point x ∈ X, the point u(x) = un (un−1 (...(u2 (u1 (x)))...)) = (un ◦ un−1 ◦ ... ◦ u2 ◦ u1 )(x) (2) is considered a geometrical representation of the string u under the IFS (1). For a set Y ⊆ X, u(Y ) is then {u(x)| x ∈ Y }. Denote the center { 21 }N of the hypercube X by x∗ . Given a sequence S = s1 s2 ... ∈ A∞ , its (generalized) chaos game representation is formally defined as the sequence of points4  CGRk (S) = S1i (x∗ ) i≥1 . (3) When k = 21 and A = {1, 2, 3, 4}, we recover the IFS used by Jeffrey and others [11,22,25] to construct the chaos game representation of DNA sequences. 2 3 4 excluding the empty word for x ∈ ℜ, ⌈x⌉ is the smallest integer y, such that y ≥ x the subscript k in CGRk (S) identifies the contraction coefficient of the IFS used for the geometric sequence representation Understanding State Space Organization in Recurrent NNs 2.2 257 Statistics on Symbolic Sequences Let S = s1 s2 ... ∈ A∞ be a sequence generated by a stationary information source. Denote the (empirical) probability of finding an n-block w ∈ An in S by Pn (w). A string w ∈ An is said to be an allowed n-block in the sequence S, if Pn (w) > 0. The set of all allowed n-blocks in S is denoted by [S]n . A measure of n-block uncertainty (per symbol) in S is given by the entropy rate 1 X Pn (w) log Pn (w). (4) hn (S) = − n w∈[S]n If information is measured in bits, then log ≡ log2 . The limit entropy rate h(S) = limn→∞ hn (S) quantifies the predictability of an added symbol (independent of block length). The entropy rates hn are special cases of Rényi entropy rates [24]. The βorder (β ∈ ℜ) Rényi entropy rate hβ,n (S) = X 1 log Pnβ (w) n(1 − β) (5) w∈[S]n computed from the n-block distribution reduces to the entropy rate hn (S) when β = 1 [9]. The formal parameter β can be thought of as the inverse temperature in the statistical mechanics of spin systems [6]. In the infinite temperature regime, β = 0, the Rényi entropy rate h0,n (S) is just a logarithm of the number of allowed n-blocks, divided by n. The limit h(0) (S) = limn→∞ h0,n (S) gives the asymptotic exponential growth rate of the number of allowed n-blocks, as the block length increases. The entropy rates h(S) = h(1) (S) = limn→∞ h1,n (S) and h(0) (S) are also known as the metric and topological entropies respectively. Varying the parameter β amounts to scanning the original n-block distribution Pn – the most probable and the least probable n-blocks become dominant in the positive zero (β = ∞) and the negative zero (β = −∞) temperature regimes respectively. Varying β from 0 to ∞ amounts to a shift from all allowed n-blocks to the most probable ones by accentuating still more and more probable subsequences. Varying β from 0 to −∞ accentuates less and less probable n-blocks with the extreme of the least probable ones. 2.3 Scaling Behavior on Multifractals Loosely speaking, a multifractal is a fractal set supporting a probability measure [2]. The degree of fragmentation of the fractal support M is usually quantified through its fractal dimension D(M ) [1]. Denote by N (ℓ) the minimal number of hyperboxes of side length ℓ needed to cover M . The fractal (box-counting) dimension D(M ) relates the side length ℓ with N (ℓ) via the scaling law N (ℓ) ≈ ℓ−D(M ) . 258 P. Tiňo, G. Dorffner, and C. 
Schittenkopf For 0 < k ≤ 12 , the n-th order approximation Dn,k (M ) of the fractal dimension D(M ) is given by the box-counting technique with boxes of side ℓ = k n : −Dn,k (M ) N (k n ) = (k n ) . Just as the Rényi entropy spectra describe (non-homogeneous) statistics on symbolic sequences, generalized Rényi dimensions Dβ capture multifractal probabilistic measures µ [15]. Generalized dimensions Dβ (M ) of an object M describe a measure µ on M through the scaling law X µβ (B) ≈ ℓ(β−1)Dβ (M ) , (6) B∈Bℓ , µ(B)>0 where Bℓ is a minimal set of hyperboxes with sides of length ℓ disjointly5 covering M. In particular, for 0 < k ≤ 12 , the n-th order approximation Dβ,n,k (M ) of Dβ (M ) is given by X µβ (B) = ℓ(β−1)Dβ,n,k (M ) , (7) B∈Bℓ , µ(B)>0 where ℓ = k n . The infinite temperature scaling exponent D0 (M ) is equal to the box-counting fractal dimension D(M ) of M . Dimensions D1 and D2 are respectively known as the information and correlation dimensions [2]. Of special importance are the limit dimensions D∞ and D−∞ describing the scaling behavior of regions where the probability is most concentrated and rarefied respectively. 2.4 Chaos Game Representation of Single Sequences In [16] we established a relationship between the Rényi entropy spectra of a sequence S ∈ A∞ and the generalized dimension spectra of its chaos game representations. Theorem 1 [16]: For any sequence S ∈ A∞ , and any n = 1, 2, ..., the n-th order approximations of the generalized dimensions of its game representations are equal (up to a scaling constant log k −1 ) to the sequence n-block Rényi entropy rate estimates: hβ,n (S) , (8) Dβ,n,k (CGRn,k (S)) = log k1 where CGRn,k (S) is the sequence CGRk (S) without the first n − 1 points. Furthermore, for each S ∈ Aω , Dβ,n,k (CGRk (S)) = 5 at most up to Lebesgue measure zero borders hβ,n (S) . log k1 (9) Understanding State Space Organization in Recurrent NNs 259 Hence, for infinite sequences S ∈ Aω , when k = 21 , the generalized dimension estimates of geometric chaos game representations exactly equal the corresponding sequence Rényi entropy rate estimates. In particular, given an infinite sequence S ∈ Aω , as n grows, the box-counting fractal dimension and the information dimension estimates D0,n, 21 and D1,n, 21 of the original Jeffrey chaos game representation [11,22,25] tend to the sequence topological and metric entropies respectively. Another nice property of the chaos game representation CGRk is that it codes the suffix structure of allowed subsequences in the distribution of subsequence geometric representations (2) [16]. In particular, if v ∈ A+ is a suffix of length |v| of a string u = rv, r, u ∈ A+ , then u(X) ⊂ v(X), where v(X) is an N dimensional hypercube of side length k |v| . Hence, the longer is the common suffix v shared by two subsequences rv and rvqv of a sequence S = rvqvw, r, q, v ∈ A+ , w ∈ A∞ , the closer lie the corresponding points rv(x∗ ) and rvqv(x∗ ) in the chaos game representation of S, √ (10) dE (rv(x∗ ), rvqv(x∗ )) ≤ k |v| N . Here dE denotes the Euclidean distance. 3 Recurrent Neural Network Recurrent neural network (RNN) with adjustable recurrent weights presented in figure 1 was shown to be able to learn mappings that can be described by finite state machines [20], or produce symbolic sequences closely approximating (with respect to the information theoretic entropy and cross-entropy measures) the statistical structure in long chaotic training sequences [21,19]. 
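Theorem 1 can be verified numerically on a toy sequence before turning to the network itself. The sketch below is an illustration, not the authors' code: it builds the chaos game representation of eqs. (2)-(3) for a binary sequence in which the block 22 is forbidden, estimates the Rényi entropy rates of eq. (5) and the generalized-dimension estimates of eq. (7) with k = 1/2, and checks the identity D_{β,n,k} = h_{β,n}/log(1/k) of eqs. (8)-(9).

```python
import numpy as np
from collections import Counter

def cgr(seq, k=0.5):
    """Chaos game representation (eqs. 2-3) over the alphabet {1, 2}, using the
    1-D IFS i(x) = k*x + (1 - k)*t_i with t_1 = 0, t_2 = 1, started at x* = 0.5."""
    t = {1: 0.0, 2: 1.0}
    x, points = 0.5, []
    for s in seq:
        x = k * x + (1 - k) * t[s]
        points.append(x)
    return np.array(points)

def renyi_entropy_rate(seq, n, beta):
    """n-block Rényi entropy rate h_{beta,n}(S) of eq. (5), in nats."""
    blocks = Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    p = np.array(list(blocks.values()), dtype=float)
    p /= p.sum()
    if beta == 1:
        return -np.sum(p * np.log(p)) / n
    return np.log(np.sum(p ** beta)) / (n * (1 - beta))

def dimension_estimate(points, n, beta, k=0.5):
    """n-th order generalized dimension D_{beta,n,k} of eq. (7), estimated with
    grid boxes of side k**n; the first n-1 points are dropped (CGR_{n,k})."""
    ell = k ** n
    boxes = Counter(int(b) for b in np.floor(points[n - 1:] / ell))
    mu = np.array(list(boxes.values()), dtype=float)
    mu /= mu.sum()
    if beta == 1:
        return -np.sum(mu * np.log(mu)) / (n * np.log(1 / k))
    return np.log(np.sum(mu ** beta)) / ((beta - 1) * n * np.log(k))

# A binary sequence in which the 2-block "22" never occurs.
rng = np.random.default_rng(0)
seq = [1]
for _ in range(20000):
    seq.append(1 if seq[-1] == 2 else int(rng.integers(1, 3)))

for beta in (0, 1, 2):
    h = renyi_entropy_rate(seq, n=6, beta=beta)
    d = dimension_estimate(cgr(seq), n=6, beta=beta)
    print(beta, h / np.log(2), d)    # Theorem 1: d = h / log(1/k) = h / log 2
```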
(t) (t) The network has an input layer I (t) = (I1 , ..., IA ) with A neurons (to which the one-of-A codes of input symbols from the alphabet A = {1, ..., A} are presented, one at a time), a hidden non-recurrent layer H (t) , a hidden recurrent layer R(t+1) (RNN state space), and an output layer O(t) having the same number of neurons as the input layer. Activations in the recurrent layer are copied with a unit time delay to the context layer R(t) that forms an additional input. Using second-order hidden units, at each time step t, the input I (t) and the context R(t) determine the output O(t) and the future context R(t+1) by (t) Hi   X (t) (t) H Qi,j,k Ij R + Ti,j = g , k (11) j,k (t) Oi   X (t) Vi,j Hj + TiO  , = g j (12) 260 P. Tiňo, G. Dorffner, and C. Schittenkopf (t) O unit delay V (t) (t+1) R H Q W (t) I R (t) Fig. 1. Recurrent neural network (RNN) architecture. When recurrent weights W and thresholds T R are fixed prior to the training process and activation functions in the recurrent layer R(t+1) are linear so that the recurrent part [I (t) + R(t) → R(t+1) ] of the network implements the IFS (1), the architecture is referred as iterative function system network (IFSN). (t+1) Ri   X (t) (t) R Wi,j,k Ij R + Ti,j = g . (13) k j,k H R are , Ti,j Here, g is the standard logistic sigmoidal function. Wi,j,k , Qi,j,k and Ti,j O second-order real valued weights and thresholds, respectively. Vi,j and Ti are the weights and thresholds, respectively, associated with the hidden to output layer connections. When first-order hidden units are used, eqs. (11) and (13) change to   X X (t) (t) (t) H QIi,j Ij + QR (14) Hi = g  i,k Rk + Ti j and (t+1) Ri respectively. 3.1  = g X j k (t) I Wi,j Ij + X k (t)  R Wi,k Rk + TiR  , (15) Previous Work We briefly describe our previous experiments [21,19] with the RNN introduced above. The network was trained (via Real Time Recurrent Learning [10]) on single long chaotic symbolic sequences S = s1 s2 ... to predict, at each point in time, the next symbol. To start the training, the initial network state R(1) was Understanding State Space Organization in Recurrent NNs 261 randomly generated and the network was reset with R(1) at the beginning of each training epoch. After the training, the network was seeded with the initial state R(1) and code of the first symbol s1 in S. Then the network acted as a (t) (t) stochastic source. We transformed the RNN output O(t) = (O1 , ..., OA ) into a probability distribution over symbols ŝt+1 that will appear at the net input at the next time step: (t) O P rob (ŝt+1 = i) = PA i j=1 (t) Oj , i = 1, 2, ..., A. (16) We observed that trained RNNs produced sequences closely mimicking (in the information theoretic sense) the training sequence. Next we extracted from trained RNNs stochastic finite state machines by identifying clusters in recurrent neurons’ activation space (RNN state space) with states of the extracted machines. The extracted machines provide a compact and easy-to-analyse symbolic representation of the knowledge induced in RNNs during the training. Finite machine extraction from RNNs is a commonly used heuristic especially among recurrent network researchers applying their models in grammatical inference tasks [23]. The extracted machines often outperform the original networks on longer test strings. For an analysis of this phenomenon see [4,18]. In our experiments we found that these issues translate to the problem of training RNNs on single long chaotic sequences. 
With sufficient number of states the extracted stochastic machines do indeed replicate the entropy and cross-entropy performance of their mother RNNs. We considered two principal ways of machine extraction. In the test mode extraction we let the RNN generate a sequence in an autonomous mode and code the transitions between quantized network states driven by the generated symbols as stochastic6 machines MRN N . In the training sequence driven construction we drive the RNN with the training sequence S and code the transitions between quantized RNN states on symbols in S as stochastic machines MRN N (S) . We found that the machines MRN N (S) achieved considerably better modeling performance than their mother RNNs [19]. 3.2 Recurrent Neural Network with IFS Dynamics We propose an alternative RNN architecture, which we call iterative function system network (IFSN). IFSNs share the architecture with RNNs described above (see figure 1), with the exception that the recurrent neurons’ activation function is linear and the weights W (W I , W R ) and thresholds T R are fixed, so that the network dynamics in eq. (13) (eq. (15)) is given by (1): given the current state R(t) , i(R(t) ) is the next state R(t+1) , provided the input symbol at time t is i. Such a dynamics is equivalent to the dynamics of the IFS (1) driven by the 6 with each state transition in the machine we associate its empirical probability by counting how often was that transition evoked during the extraction process 262 P. Tiňo, G. Dorffner, and C. Schittenkopf symbolic sequence appearing at the network input. The trainable parts of IFSNs form a feed-forward architecture [I (t) + R(t) → H (t) → O(t) ]. In our recent experimental study [17] on two long chaotic sequences with different degrees of “complexity” measured by Crutchfield’s ǫ-machines [5], we have (interestingly enough) found that even though IFSNs have fixed non-trainable recurrent parts, they achieved performances comparable with those of RNNs having adjustable recurrent weights and thresholds. Moreover, the extracted machines MIF SN (S) actually outperformed the corresponding machines MRN N (S) . 4 Understanding State Space Organization in IFSNs Although previous approaches to the analysis of RNN state space organization did point out the correspondence between IFSs and RNN recurrent part [I (t) + R(t) → R(t+1) ] [14,13], due to nonlinearity of recurrent neurons’ activation function, they did not manage to provide a deeper insight into the RNN state space structure (apart from observing an apparent fractal-like clusters corresponding to nonlinear IFS attractor). Also, despite strong empirical evidence supporting the usefulness of extracting symbolic finite state representations from trained recurrent networks [23,7] a deeper understanding of what the machines actually represent is still lacking. The results summarized in section 2.4 enable us to formulate the principles behind coding, in the recurrent part of IFSNs, of the temporal/statistical structure in symbolic sequences appearing at the network input. Theorem 1 tells us that the fractal dimension of states in IFSN driven by a sequence S directly corresponds to the allowed subsequence variability in S expressed through the topological entropy of S. An analogical relationship holds between the information dimension of IFSN states and the metric entropy of S. 
Generally, multifractal characteristics of IFSN states measured via spectra of generalized dimensions directly correspond to Rényi entropy spectra of the input sequence S. The input sequence S feeding the IFSN is translated in the network recurrent part into the chaos game representation CGRk (S). The CGRk (S) forms clusters of state neuron activations, where (as explained in section 2.4) points lying in a close neighborhood code histories with a long common suffix (e.g. histories that are likely to produce similar continuations), whereas histories with different suffices (and potentially different continuations) are mapped to activations lying far from each other. When quantizing the IFSN state space to extract the network finite state representation, densely populated areas (corresponding to contexts with long common suffices) are given more attention by the vector quantizer. Consequently, more information processing states of the extracted machines are devoted to these potentially “problematic” contexts. This directly corresponds to the idea of variable memory length Markov models [26,27], where the length of the past history considered in order to predict the future is not fixed, but context dependent. Understanding State Space Organization in Recurrent NNs 5 263 Extracting Finite State Stochastic Machines from IFSNs relative frequency We report an experiment with a well-known chaotic sequence, called the Feigenbaum sequence [8]. Feigenbaum sequence is a binary sequence generated by the logistic map yt+1 = ryt (1 − yt ), y ∈ [0, 1], with the control parameter r set to the period doubling accumulation point value7 [15]. The iterands yt are partitioned into two regions8 [0, 21 ) and [ 12 , 1], corresponding to symbols 1 and 2, respectively. The training sequence S used in this study contained 260.000 symbols. The topological structure of the sequence (i.e. the structure of allowed nblocks not regarding their probabilities) can only be described using a context sensitive tool – a restricted indexed context-free grammar [6]. The metric structure of the Feigenbaum sequence is organized in a self-similar fashion [8]. The transition between the ranked distributions for block lengths 2g → 2g+1 , 3 2g−1 → 3 2g , g ≥ 1, is achieved by rescaling the horizontal and vertical axis by a factor 2 and 21 , respectively. Plots of the Feigenbaum sequence n-block distributions, n = 1, 2, ..., 8, can be seen in figure 2. Numbers above the plots indicate the corresponding block lengths. The arrows connect distributions with the (2, 21 )-scaling self-similarity relationship. 12 1 2 3 4 5 6 7 16 8 rank Fig. 2. Plots of self-similar rank-ordered block distributions of the Feigenbaum sequence for different block lengths (indicated by the numbers above the plots). The self similarity relates block distributions for block lengths 2g → 2g+1 , 3 2g−1 → 3 2g , g ≥ 1 (connected by arrows). We chose to work with the Feigenbaum sequence because Markovian predictive models on this sequence need deep prediction contexts. Classical fixed-order Markov models (MMs) cannot succeed and the power of admitting a limited number of variable length contexts can be fully exploited. First, we built a series of variable memory length Markov models (VLMMs) of growing size. For construction details see [26,27]. Then, we quantized the 7 8 r=3.56994567... this partition is a generating partition defined by the critical point 264 P. Tiňo, G. Dorffner, and C. 
Schittenkopf one-dimensional9 IFSN state space10 using dynamic cell structures (DCS) technique [3]. This way we obtained a series of extracted machines MIF SN (S) with increasing number of machine states. Each model was used to generate 10 sequences G of length equal to the length of the training sequence. Since the Feigenbaum sequence n-block distributions have just one or two probability levels, we measure the disproportions between the Feigenbaum P and model generated distributions through the L1 distances, dn (S, G) = w∈An |PS,n (w) − PG,n (w)|, where PS,n and PG,n are the empirical n-block frequencies in the training and model generated sequences, respectively. A modeling horizon n(M) of a model M is the longest block length, such that, for all 10 sequence generation realizations and for all block lengths n ≤ n(M), dn (S, G) is below a small threshold ∆. We set ∆ = 0.005, since in this experiment, either dn (S, G) ∈ (0, 0.005], or δn (M) >> 0.005. Figure 3 interprets the growing ability of VLMMs and machines MIF SN (S) to model the metric structure of allowed blocks in the Feigenbaum sequence S. The classical MM (stars) totally fails in this experiment, since the context length 5 is far too small to enable the MM to mimic the complicated subsequence structure in S. On the other hand VLMMs (squares) and machines MIF SN (S) n−block modeling performance (Feigenbaum sequence, Delta=0.005) 25 M−IFSN(S) VLMM MM 20 n(M) 15 10 5 5 10 15 20 # machine states 25 30 Fig. 3. Modeling horizons n(M) of models M built on the Feigenbaum sequence as a function of the number machine states in M. 9 10 Chaos representation of binary sequences is defined by a one-dimensional IFS (1). In this experiment we set the contraction ratio k of maps in the IFS to 0.5. Note that since the training sequence driven extraction of machines MIF SN (S) uses only recurrent part of IFSN (which is fixed prior to the training), no network training is needed and the finite state representations MIF SN (S) can be readily constructed. Understanding State Space Organization in Recurrent NNs 265 (triangles) quickly learn to explore a limited number of deep prediction contexts and perform comparatively well. The jumps in the modeling horizon graph of machines MIF SN (S) on figure 3 can be understood through their state transition diagrams. While the machine M4 in figure 4a can model only blocks of length 1,2 and 3, the introduction of an additional transition state in the machine M5 shown in figure 4b enables the latter machine to model blocks of length up to 6. Only three consecutive 2’s are allowed in the Feigenbaum training sequence S. The loop on symbol 2 in the state 1 of the machine M4 is capable of producing blocks of consecutive 2’s of any length. So, the n-block distribution, n ≥ 4, cannot be properly modeled by the machine M4 . The state 1 in the machine M4 is split into two machine M5 states 1.a and 1.b. Any number of 4-blocks 2212 can be followed by any number of 2-blocks 12 and vice versa. This is fine as long as we study structure of the 6-block distribution. Moving to higher block lengths, we find that once the 4-block 2212 is followed by the 2-block 12, another copy of the 2-block 12 followed by the 4-block 2212 must appear. This 12-block rule is implemented by the machine M8 in figure 5b. The machine M8 is created from the machine M7 in figure 5a by splitting the state 3.a into two states 3.a and 3.c. 
The machine M7 with 7 states is equivalent to the machine M5 (figure 4a) with 5 states: states 2.a, 2.b and 3.a, 3.b in M7 are equivalent to states 2 and 3, respectively, in M5 . State splitting responsible for the third jump in the modeling horizon graph between the extracted machines M22 and M23 with 22 and 23 states, respectively, is illustrated in figure 6. Symbols A and B stand for the 4-blocks 1212 and 2212, respectively. The machine M22 is equivalent to the machine M8 . State splitting in the middle left branch of the machine M22 removes the two lower cycles BAB, B, and creates a single larger cycle BBAB in the machine M23 . This machine correctly implements the training sequence block distributions for blocks of length up to 24. 2 2 1 2 1 4 2 3 a 2 2 1 2 2 1.a 1.b 1 4 2 2 1 3 b Fig. 4. State transition diagrams of the machines MIF SN (S) . The machines M4 (a) and M5 (b) were obtained by quantizing IFSN state space via dynamic cell structures with 4 and 5 centers respectively. State transitions are labeled only with the corresponding symbols, since the transition probabilities are uniformly distributed, i.e. for all states i, the probability associated with each arc leaving i is equal to 1/Ni , where Ni is the number of arcs leaving state i. 266 P. Tiňo, G. Dorffner, and C. Schittenkopf 2 2 1 2 1 4 2 2 2 1 2 2 1.a 3 1.b a 1 4 2 2 1 3 b Fig. 5. Machines M7 and M8 extracted from IFSN driven by the Feigenbaum sequence S. The network state space is quantized into 7 (a) and 8 (b) compartments, respectively. Construction details are described in caption to the previous figure. AB AB A=1212 B=2212 AB B AB BBAB B B M8 M 22 M 23 Fig. 6. Schematic representation of state transition structure in machines MIF SN (S) . Symbols A and B stand for the 4-blocks 1212 and 2212, respectively. The machine M22 , obtained from a codebook with 22 centers, is equivalent to the machine M8 (see also the previous figure). State splitting in the middle left branch of the machine M22 (dashed line) removes the two lower cycles BAB, B, and creates a single larger cycle BBAB in the machine M23 . Variable memory length Markov models implement the same subsequence constraints as the machines MIF SN (S) . Figures 7a and 7b present VLMMs N5 and N11 with 5 and 11 prediction contexts, respectively. The VLMMs are shown as probabilistic suffix automata with states labeled by the corresponding suffices. The VLMM N5 is isomorphic to the machine M5 in figure 4b, and the VLMM N11 is equivalent to the machine M8 in figure 5b. Although not shown here, the VLMM with 23 prediction contexts is isomorphic to the machine M23 schematically presented in figure 6. 6 Conclusion We introduced a novel recurrent network architecture, that we call iterative function system network (IFSN), with dynamics corresponding to iterative function systems used in chaos game representations of symbolic sequences [16,11]. In our previous work on modeling long chaotic sequences we empirically compared recurrent networks having adjustable recurrent weights and non-linear sigmoid activations in the recurrent layer with IFSNs and showed that introducing fixed IFS dynamics into recurrent networks does not degradate the network per- Understanding State Space Organization in Recurrent NNs 2122 267 2121 2 2 2 1 21222 2 212221 1 212 a 2122212121222 1 2 212221212122 1 2122212221 2 21222121212 2 212221222 2 21222121212221 2 2 2122212121 21222122 2 2122212 1 21222121 2 1 212221212 b Fig. 7. VLMMs N5 (a) and N11 (b) built on the Feigenbaum sequence. 
The VLMMs are are shown as probabilistic suffix automata with states labeled by the corresponding variable length prediction contexts. As with machines MIF SN (S) in this experiment, the state transition probabilities are uniformly distributed. formance. Even more surprisingly, we found that finite state stochastic machines extracted from IFSNs outperform machines extracted from “fully adjustable” RNNs. In this contribution we formally study state space organization in IFSNs. It appears that IFSNs reflect in their states the temporal and statistical structure of input sequences in a strict mathematical sense. The generalized dimensions of IFSN states are in direct correspondence with statistical properties of input sequences expressed via generalized Rényi entropy spectra. We also argued and experimentally illustrated that the commonly used heuristic of finite state machine extraction from RNNs by network state space quantization corresponds in case of IFSNs to variable memory length Markov model construction. Acknowledgements This work was supported by the Austrian Science Fund (FWF) within the research project “Adaptive Information Systems and Modeling in Economics and Management Science” (SFB 010). The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal Ministry of Science and Transport. References 1. M.F. Barnsley. Fractals everywhere. Academic Press, New York, 1988. 2. C. Beck and F. Schlogl. Thermodynamics of chaotic systems. Cambridge University Press, Cambridge, UK, 1995. 268 P. Tiňo, G. Dorffner, and C. Schittenkopf 3. J. Bruske and G. Sommer. Dynamic cell structure learns perfectly topology preserving map. Neural Computation, 7(4):845–865, 1995. 4. M.P. Casey. The dynamics of discrete-time computation, with application to recurrent neural networks and finite state machine extraction. Neural Computation, 8(6):1135–1178, 1996. 5. J.P. Crutchfield and K. Young. Inferring statistical complexity. Physical Review Letters, 63:105–108, July 1989. 6. J.P. Crutchfield and K. Young. Computation at the onset of chaos. In W.H. Zurek, editor, Complexity, Entropy, and the physics of Information, SFI Studies in the Sciences of Complexity, vol 8, pages 223–269. Addison-Wesley, Reading, Massachusetts, 1990. 7. P. Frasconi, M. Gori, M. Maggini, and G. Soda. Insertion of finite state automata in recurrent radial basis function networks. Machine Learning, 23:5–32, 1996. 8. J. Freund, W. Ebeling, and K. Rateitschak. Self-similar sequences and universal scaling of dynamical entropies. Physical Review E, 54(5):5561–5566, 1996. 9. P. Grassberger. Information and complexity measures in dynamical systems. In H. Atmanspacher and H. Scheingraber, editors, Information Dynamics, pages 15– 33. Plenum Press, New York, 1991. 10. J. Hertz, A. Krogh, and R.G. Palmer. Introduction to the Theory of Neural Computation. Addison–Wesley, Redwood City, CA, 1991. 11. J. Jeffrey. Chaos game representation of gene structure. Nucleic Acids Research, 18(8):2163–2170, 1990. 12. R. Kenyon and Y. Peres. Measures of full dimension on affine invariant sets. Ergodic Theory and Dynamical Systems, 16:307–323, 1996. 13. J.F. Kolen. Recurrent networks: state machines or iterated function systems? In M.C. Mozer, P. Smolensky, D.S. Touretzky, J.L. Elman, and A.S. Weigend, editors, Proceedings of the 1993 Connectionist Models Summer School, pages 203– 210. Erlbaum Associates, Hillsdale, NJ, 1994. 14. P. Manolios and R. Fanelli. 
First order recurrent neural networks and deterministic finite state automata. Neural Computation, 6(6):1155–1173, 1994. 15. J.L. McCauley. Chaos, Dynamics and Fractals: an algorithmic approach to deterministic chaos. Cambridge University Press, 1994. 16. P. Tiňo. Spatial representation of symbolic sequences through iterative function system. IEEE Transactions on Systems, Man, and Cybernetics Part A: Systems and Humans, 29(4):386–393, 1999. 17. P. Tiňo and G. Dorffner. Recurrent neural networks with iterated function systems dynamics. In International ICSC/IFAC Symposium on Neural Computation, pages 526–532, 1998. 18. P. Tiňo, B.G. Horne, C.L. Giles, and P.C. Collingwood. Finite state machines and recurrent neural networks – automata and dynamical systems approaches. In J.E. Dayhoff and O. Omidvar, editors, Neural Networks and Pattern Recognition, pages 171–220. Academic Press, 1998. 19. P. Tiňo and M. Koteles. Extracting finite state representations from recurrent neural networks trained on chaotic symbolic sequences. IEEE Transactions on Neural Networks, 10(2):284–302, 1999. 20. P. Tiňo and J. Sajda. Learning and extracting initial mealy machines with a modular neural network model. Neural Computation, 7(4):822–844, 1995. 21. P. Tiňo and V. Vojtek. Modeling complex sequences with recurrent neural networks. In G.D. Smith, N.C. Steele, and R.F. Albrecht, editors, Artificial Neural Networks and Genetic Algorithms, pages 459–463. Springer Verlag Wien New York, 1998. Understanding State Space Organization in Recurrent NNs 269 22. J.L. Oliver, P. Bernaola-Galván, J. Guerrero-Garcia, and R. Román Roldan. Entropic profiles of dna sequences through chaos-game-derived images. Journal of Theor. Biology, (160):457–470, 1993. 23. C.W. Omlin and C.L. Giles. Extraction of rules from discrete-time recurrent neural networks. Neural Networks, 9(1):41–51, 1996. 24. A. Renyi. On the dimension and entropy of probability distributions. Acta Math. Hung., (10):193, 1959. 25. R. Roman-Roldan, P. Bernaola-Galvan, and J.L. Oliver. Entropic feature for sequence pattern through iteration function systems. Pattern Recognition Letters, 15:567–573, 1994. 26. D. Ron, Y. Singer, and N. Tishby. The power of amnesia. In Advances in Neural Information Processing Systems 6, pages 176–183. Morgan Kaufmann, 1994. 27. D. Ron, Y. Singer, and N. Tishby. The power of amnesia. Machine Learning, 25:117–150, 1996. 28. W. Tabor. Dynamical automata. Technical Report TR98-1694, Cornell University, Computer Science Department, 1998. Direct Explanations and Knowledge Extraction from a Multilayer Perceptron Network that Performs Low Back Pain Classification 1 1 2 2 Marilyn L. Vaughn , Steven J. Cavill , Stewart J. Taylor , Michael A. Foy , 2 and Anthony J.B. Fogg 1 Knowledge Engineering Research Centre, Cranfield University (RMCS), Shrivenham, Swindon. SN6 8LA, UK M.L.Vaughn@rmcs.cranfield.ac.uk 2 Princess Margaret Hospital, Okus Road, Swindon. SN1 4JU, UK Abstract. Using a new method published by the first author, this chapter shows how knowledge in the form of a ranked data relationship and an induced rule can be directly extracted from each training case for a Multi-layer Perceptron (MLP) network with binary inputs. The knowledge extracted from all training cases can be used to validate the MLP network and the ranked data relationship for any input case provides direct user explanations. The method is demonstrated for example training cases from a real-world MLP that classifies low back pain patients into three diagnostic classes. 
In using the method to validate the network a number of test cases apparently mis-classified by the network were found to have most likely been incorrectly classified by the clinicians. The method uses a direct approach which does not depend on combinatorial search and is thus applicable to real-world networks with large numbers of input features, as demonstrated in this current study. 1 Introduction Artificial neural networks are being increasingly used as decision support tools in a wide range of applications [1-4] but are currently undermined by their inability to explain or justify their output activations. In domains such as medical diagnosis it is especially important for clinicians to understand and to have confidence in a system’s prediction [5]. The multilayer perceptron (MLP) network is one of the most widely used neural networks in the field and most approaches for extracting knowledge from these networks have used a combinatorial search based approach [6-10]. However, a limitation of these methods is that obtaining all possible combinations of rules is NPhard [11,12] and the methods are not universally applicable to arbitrary MLP networks. Such techniques are thus not feasible for achieving the goal of readily interpreting trained neural networks that solve real-world problems with a large number of input features [5], as in the current study. Using a new method published by the first author [13, 14], it is shown in Section 3 of this chapter how to directly interpret an input case and extract knowledge from a S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp.270-285, 2000. © Springer-Verlag Berlin Heidelberg 2000 Direct Explanations and Knowledge Extraction from a Multilayer Perceptron Network 271 standard MLP network with binary inputs. For any MLP input case the method directly finds the key inputs, both positive and negated, used by the network to classify the case. The top ranked key inputs for the input case represent a data relationship which provides an explanation of the classification of the case for the domain user. The process of discovering the key inputs provides an explanation for the network developer of how the MLP uses the hidden layer neurons to classify the input case. In Section 3 it is shown how the knowledge learned by the MLP from a training example can also be represented as a rule which is directly induced from the ranked data relationship. With the assistance of domain experts the validation of the knowledge extracted from all of the training examples provides a method for validating the MLP network. The interpretation and knowledge extraction method is demonstrated in Section 4, for selected example training cases from a MLP network that classifies low back pain (LBP) patients into three diagnostic classes. A ranked data relationship is found for each example case and a rule is directly induced from each of the ranked data relationships. In Section 5 it is shown how the average highest key input rankings for each diagnostic class can be used to represent the network’s knowledge for validation purposes. Results are presented of the average highest ranked key inputs that the low back pain MLP uses to classify all training cases for each diagnostic class. It is shown how the validation of this knowledge by the domain experts leads to the validation of the low back pain MLP network both during training and testing. 
Finally, the interpretation and knowledge discovery method is compared with other methods of rule extraction in Section 6 and the chapter is summarised in Section 7. 2 The Low Back Pain MLP Network Low back pain is one of the most common medical problems encountered in healthcare, with 60-80% of the population suffering some form within their life-span [15,16]. It has been estimated that only 15% of patients with low back pain obtain an accurate diagnosis of their problem with any degree of certainty, and many receive care that is less than optimal [16,17]. Low back pain is a difficult multi-factorial problem that includes physical, psychological and social aspects of illness [15]. For this study, low back pain is classified into three diagnostic classes: simple low back pain (SLBP) - mechanical low back pain, minor scoliosis and old spinal fractures; root pain (ROOTP) - nerve root compression due to either disc, bony entrapment or adhesions; and abnormal illness behaviour (AIB) - mechanical low back pain, degenerative disc or bony changes with signs and symptoms magnified as a sign of distress in response to chronic pain. A data set of 198 actual cases was collected from patient questionnaires, physical findings and clinical findings. In this preliminary study, the MLP network is a fully connected, feed-forward network with 92 binary encoded input neurons corresponding to 39 patient attributes, and 3 output layer neurons, each representing a diagnostic 272 M.L. Vaughn et al. class. Using the sigmoidal activation function and generalised delta learning rule the network was trained with 99 randomly selected patient cases. The MLP network architecture with the lowest test error was found with 10 hidden layer neurons at 1100 cycles when the training set had a 96% classification accuracy and a test set, with 99 patient cases, had a 67% classification accuracy. 3 The Interpretation and Knowledge Extraction Method The method first finds the hidden layer neurons which positively activate the classifying output neuron. This leads to the discovery of the key positive inputs in the MLP input case which positively drive the output classification, as follows. 3.1 Discovery of the Feature Detector Neurons For an MLP input case, the method [13] defines the feature detector neurons as the hidden neurons which positively activate the classifying output neuron. For sigmoidal activations the hidden layer neuron activation is always positive and, hence, the feature detectors are hidden neurons connected to the classifying neuron with positive weights. The hidden layer bias (which fixes the output neuron’s threshold at zero) also makes a positive contribution when the bias connection weight is positive. In performing a classification task, the first author has shown [18] that the MLP network finds sufficiently many hidden layer feature detector neurons with activation >0.5 which positively activate the classifying output neurons. These neurons also play a major role in negatively activating the non-classifying output neurons. The relative contribution of the feature detectors to the MLP output classification can be found by ranking the detectors in order of decrease in classifying output activation when selectively switched off at the hidden layer. 3.2 Discovery of the Significant Inputs The interpretation and knowledge extraction method defines the significant inputs as the inputs in an MLP input case which positively activate the classifying output neuron. 
Thus the significant inputs positively activate the feature detector neurons and, for binary encoded inputs, are positive inputs connected to the feature detectors with positive weights. The relative contribution of the significant inputs to the MLP output classification can be found by ranking the significant inputs in order of decrease in activation at the classifying neuron when each is selectively switched off in turn at the MLP input layer. There is evidence [14,18] that the most significant inputs are the MLP inputs that most uniquely discriminate the input case.

3.3 Discovery of the Negated Significant Inputs

The interpretation and knowledge extraction method defines the negated significant inputs in the MLP input case as the significant inputs for another class which deactivate the feature detectors for that class when not active at the MLP input layer. For binary inputs these are zero-valued inputs connected with positive weights to hidden neurons which are not feature detectors. The relative contribution of the negated significant inputs to the MLP output classification can be found by ranking the negated significant inputs in order of decrease in activation at the classifying neuron when each is selectively switched on in turn at the MLP input layer.

3.4 Knowledge Learned by the MLP from the Training Data

Using the knowledge extraction method, the knowledge learned by the MLP from the training data can be expressed as a non-linear data relationship and as an induced rule which is valid for the training set, as follows.

Data Relationships. The knowledge learned by the network from an input training case is embodied in the data relationship between the significant (and negated significant) inputs and the associated network outputs:

significant (and negated significant) training inputs => associated network outputs

The data relationship is non-linear due to the effect of the sigmoidal activation functions at each processing layer of neurons. The most important inputs in the relationship can be ranked in order of decrease in classifying neuron activation, as discussed above. The ranked data relationship is generally exponentially decreasing and embodies the graceful degradation properties of the MLP network. For the domain expert the ranked data relationship represents the explanation of the case; for the network developer it can be enhanced with the information about the feature detector neurons.

Induced Rules. For an input training example, a rule which is valid for the MLP training set can be directly induced from the training data relationship in order of significant and negated significant input rankings [14,18]. The most general rule induced from each training example represents the key knowledge that the MLP network has learned from the training case and other similar training examples. From the most general rule for each training example a general set of rules can be induced which defines the key knowledge that the MLP network has learned from the training set. However, rules do not embody the fault-tolerant and graceful degradation properties of the MLP network and are not a substitute for the network. The rules, nonetheless, are useful for validating the network knowledge and the completeness of the knowledge.
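To make the ranking procedures of Sections 3.1–3.3 concrete, the sketch below shows the core switch-off step: finding the feature detectors for a case and ranking its candidate significant inputs by the resulting drop in the classifying output activation. This is an illustrative reconstruction only, not the implementation used in this study; the layer sizes (92 binary inputs, 10 hidden units, 3 outputs, as in Section 2), the weight matrices `W_ih`/`W_ho` and the random toy case are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W_ih, b_h, W_ho, b_o):
    """Forward pass of a one-hidden-layer MLP with sigmoidal units."""
    h = sigmoid(W_ih @ x + b_h)
    o = sigmoid(W_ho @ h + b_o)
    return h, o

def rank_significant_inputs(x, W_ih, b_h, W_ho, b_o):
    """Rank the inputs of a binary input case by the decrease in the
    classifying output activation when each is switched off in turn."""
    _, o = forward(x, W_ih, b_h, W_ho, b_o)
    c = int(np.argmax(o))                        # classifying output neuron
    detectors = np.where(W_ho[c] > 0)[0]         # hidden units driving it positively
    drops = []
    for i in np.where(x == 1)[0]:                # candidate significant inputs
        if not np.any(W_ih[detectors, i] > 0):   # must excite at least one detector
            continue
        x_off = x.copy()
        x_off[i] = 0
        _, o_off = forward(x_off, W_ih, b_h, W_ho, b_o)
        drops.append((int(i), o[c] - o_off[c]))  # decrease at the classifying neuron
    return sorted(drops, key=lambda t: -t[1]), detectors

# Toy usage with random placeholder weights.
rng = np.random.default_rng(0)
W_ih, b_h = rng.normal(size=(10, 92)), rng.normal(size=10)
W_ho, b_o = rng.normal(size=(3, 10)), rng.normal(size=3)
case = rng.integers(0, 2, size=92).astype(float)
ranking, detectors = rank_significant_inputs(case, W_ih, b_h, W_ho, b_o)
```

The same loop, with zero-valued inputs switched on instead of off, yields the ranking of the negated significant inputs; ranking the feature detectors themselves follows the same pattern at the hidden layer.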
3.5 MLP Network Validation and Verification The validation of the MLP network can be undertaken, with the assistance of the domain experts, by validating the data relationships and induced rules that the network has learned from all the training examples. The validation process can lead to the discovery of previously unknown data relationships and the analysis leading to this discovery provides a method for data mining [19]. In MLP network testing the test data relationships can be used to verify that the network is correctly generalising the knowledge learned by the network from the training process. Testing is not expected to reveal new significant (and negated significant) input data relationships since network generalisation is inferred from the network knowledge acquired during training. 4 Explanations and Knowledge Extraction from LBP Training Cases The interpretation and knowledge extraction method is demonstrated for three low back pain example cases: a class SLBP training case, a class ROOTP training case, and a class AIB training case, which have classifying output neuron activations of 0.96, 0.93 and 0.91 respectively when presented to the low back pain MLP network. 4.1 Discovery of the Feature Detectors for Example Training Cases Using the knowledge extraction method, as presented in Section 3, the feature detectors are discovered to be hidden neurons H3, H4, H6, and H8 for the SLBP case, hidden neurons H1, H3, H7, and H10 for the ROOTP case, and hidden neurons H1, H2, H5, and H8 for the AIB case. The feature detectors are ranked in order of contribution to the classifying output neuron when the detector is switched off at the hidden layer. These contributions are shown in Tables 1a, 1b and 1c. Table 1a. Hidden layer feature detectors - SLBP training case Hidden layer feature detector H3 H4 H6 H8 Activation of feature detector +1.0000 +0.9998 +0.9999 +0.9552 Connection weight to classifying neuron +0.5797 +1.5108 +0.8979 +0.5704 Positive input Negative input Total input Sigmoid output Contribution to output activation +0.5797 +1.5104 +0.8979 +0.5448 +3.5327 -0.3984 +3.1344 +0.9583 Feature detector rank 3 (- 3.2%) 1 (-12.8%) 2 (- 5.7%) 4 (- 2.9%) Direct Explanations and Knowledge Extraction from a Multilayer Perceptron Network 275 Table 1b. Hidden layer feature detectors - ROOTP training case Hidden layer feature detector H1 H3 H7 H10 Activation of feature detector +0.9999 +1.0000 +1.0000 +0.7355 Connection weight to classifying neuron +1.4073 +0.3424 +1.8821 +0.4446 Positive input Negative input Total input Sigmoid output Contribution to output activation +1.4072 +0.3424 +1.8821 +0.3270 +3.9587 -1.3657 +2.5930 +0.9304 Feature detector rank 2 (-17.7%) 3 (- 2.8%) 1 (-27.9%) 4 (- 2.6%) Table 1c. Hidden layer feature detectors - AIB training case Hidden layer feature detector H1 H2 H5 H8 Activation of feature detector +0.8475 +0.9967 +0.9901 +0.1648 Connectionweight to classifying neuron +0.0393 +1.7291 +1.7100 +0.1861 Positive input Negative input Total input Sigmoid output Contribution to output activation +0.0333 +1.7234 +1.6932 +0.0307 +3.4805 -1.1554 +2.3250 +0.9109 Feature detector rank 3 (- 0.3%) 1 (-29.1%) 2 (-28.3%) 4 (- 0.3%) 4.2 Discovery of the Significant Inputs for Example Training Cases The low back pain MLP inputs are binary encoded and thus the significant inputs have value +1 and a positive connection weight to the feature detectors. The significant negated inputs have value +0 and a positive connection weight to hidden neurons that are not feature detectors. 
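For binary encodings, the candidate sets just described reduce to simple weight-sign tests before any switch-off ranking is performed. The helper below is a minimal sketch under that reading (variable names are illustrative; `detectors` is assumed to have been found as in Section 3.1):

```python
import numpy as np

def candidate_inputs(x, W_ih, detectors, n_hidden):
    """Screen a binary input case for candidate significant and negated
    significant inputs using only the signs of the input-to-hidden weights."""
    non_detectors = np.setdiff1d(np.arange(n_hidden), detectors)
    significant = [i for i in np.where(x == 1)[0]
                   if np.any(W_ih[detectors, i] > 0)]     # active, excites a detector
    negated = [i for i in np.where(x == 0)[0]
               if np.any(W_ih[non_detectors, i] > 0)]     # inactive, would excite a non-detector
    return significant, negated
```

Each candidate is then ranked exactly as in Section 3, by switching significant inputs off (or negated inputs on) and recording the change at the classifying output neuron.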
For the SLBP training case, there are 24 significant inputs, of which 16, including the input bias, show a decrease in the classifying output activation when selectively removed from the input layer. Similarly, the ROOTP training case has 26 significant inputs, of which 15 show negative changes at the output neuron when selectively removed from the input layer. Finally, the AIB training case has 33 significant inputs, with 17 showing a decrease in the output activation when selectively removed from the input layer. 276 M.L. Vaughn et al. 4.3 Data Relationships/Explanations for Example Cases For the SLBP case and the ROOTP case the top ten combined ranked significant and negated inputs account for a total decrease in the classifying neuron activation of 95% and 96% respectively when switched off together at the MLP input layer. For the AIB case the top four combined ranked inputs account for a 97% drop in classifying activation. The ranked input data relationships for each example training case and the accumulated decrease at the respective classifying neurons are shown in Tables 2a, 2b and 2c, and in [19]. The ranked data relationships represent the direct explanations of the example cases for the orthopaedic surgeons. The relationships show that the nonlinear data relationship for each example case is exponentially decreasing with respect to the ranked key inputs. In a similar way explanations can be provided automatically on a case-by-case basis for any input case presented to the MLP network. Table 2a. SLBP case training data relationship Rank 1 2 3 4 5 6 7 8 9 10 SLBP training case (−95%) pain brought on by bending over back pain worse than leg pain not low Zung depression score not normal DRAM no leg pain symptoms not leg pain worse than back pain straight right leg raise ≥70 ° back pain aggravated by sitting back pain aggravated by standing straight right leg raise not limited Accumulated classifying activation 0.94 0.67 0.33 0.13 0.07 0.06 0.05 0.05 0.05 0.04 Accumulated decrease −1.6% −29.6% −65.9% −86.0% −92.8% −93.6% −94.5% −94.6% −94.6% −95.2% Table 2b. ROOTP case training data relationship Rank 1 2 3 4 5 6 7 8 9 10 ROOTP training case (−96%) not back pain worse than leg pain not lumbar extension < 5° not pain brought on by bending over not no leg pain symptoms not straight right leg raise ≥ 70° back pain aggravated by coughing. loss of reflexes lumbar extension (5 to 14 °) not straight right leg raise not limited not high MSPQ score Accumulated classifying activation 0.91 0.90 0.75 0.42 0.34 0.27 0.23 0.15 0.07 0.04 Accumulated decrease −1.6% −3.3% −19.8% −54.4% −63.3% −70.9% −74.9% −84.2% −92.6% −96.4% Direct Explanations and Knowledge Extraction from a Multilayer Perceptron Network 277 Table 2c. AIB case training data relationship Rank 1 2 3 4 AIB training case (−97%) not straight left leg raise limited by leg pain straight left leg raise ≤45 ° claiming invalidity/disability benefit not straight left leg raise (46 to 69°) Accumulated classifying activation 0.87 0.41 0.06 0.02 Accumulated decrease −4.7% −54.8% −93.1% −97.6% 4.4 Discussion of the Key Inputs for Example Cases From Table 2a it can be seen that in the SLBP case the patient presents back pain symptoms as worse than leg pain (rank 2) which supports no leg pain symptoms (rank 5) and not leg pain worse than back pain (rank 6). Also, a straight leg raise (SLR) ˜ 70 • (rank 7) is substantiated by no limitation on the SLR test (rank 10). 
This particular patient case indicates possibly a not normal psychological profile (ranks 3, 4). In the ROOTP case, as shown in Table 2b, high ranked inputs are negated SLBP key inputs indicating leg pain symptoms: not back pain worse than leg pain (rank 1), not no leg pain symptoms (rank 4) and not SLR ≥ 70° (rank 5). In the AIB case, as shown in Table 2c, three key inputs are associated with the left leg (ranks 1, 2, 4) and the other key input is claiming invalidity/disability benefit (rank 3).

4.5 Induced Rules from Training Example Cases

Using the knowledge extraction method, a rule which is valid for the MLP training set can be directly induced from the data relationship for an input training example in combined ranked order of significant and negated significant inputs. This is demonstrated for each of the example training cases, as follows.

SLBP Example Training Case. The following rule, which is valid for the MLP training set, is induced in ranked order from the SLBP example training data relationship shown in Table 2a:

IF pain brought on by bending over
AND back pain worse than leg pain
AND NOT low Zung depression score
AND NOT normal DRAM
AND no leg pain symptoms
THEN class SLBP

This rule represents the key knowledge that the MLP network has learned from this training case. The most general rule that can be found (from the above rule) for this example case which is valid for the MLP training set is given by:

IF no leg pain symptoms THEN class SLBP

This rule is verified by a frequency analysis of the ‘no leg pain symptoms’ attribute in the training set, which shows that eight SLBP cases are the only training cases with this attribute present.

ROOTP Example Training Case. The following rule, which is valid for the MLP training set, is induced in ranked order from the ROOTP example training data relationship shown in Table 2b:

IF NOT back pain worse than leg pain
AND NOT lumbar extension < 5°
AND NOT pain brought on by bending over
AND NOT no leg pain symptoms
AND NOT straight right leg raise ≥ 70°
AND back pain aggravated by coughing
THEN class ROOTP

This is the most general rule that can be found (from the above rule) for this example case which is valid for the MLP training set.

AIB Example Training Case. The following rule, which is valid for the MLP training set, is induced in ranked order from the AIB example training data relationship shown in Table 2c:

IF NOT straight left leg raise limited by leg pain
AND straight left leg raise ≤ 45°
AND claiming invalidity benefit
AND NOT straight left leg raise (46 to 69°)
AND smoker
THEN class AIB

This is the most general rule that can be found (from the above rule) for this example case which is valid for the MLP training set.

4.6 Comparison of Knowledge Representations

The ranked data relationship, as shown in Tables 2a, 2b and 2c for each example case, embodies the graceful degradation properties of the MLP network and shows the exponentially decreasing relationship between the most important inputs and the classifying output activation for the input case. In general use the ranked data relationship represents the explanation of any input case presented to the MLP for the domain expert or network user. In comparison with the ranked data relationship, the rule induced from a training input case is brittle and does not indicate the relative importance of the attributes used by the MLP in classifying the case.
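Before the comparison is taken further, the induction procedure of Section 4.5 can be made concrete. The sketch below is an assumed reading of that procedure rather than the authors' implementation: premises are added in combined ranked order until the conjunction covers only training cases of the target class, and redundant premises are then dropped to obtain the most general valid rule. `ranked` is a list of (attribute index, truth value) pairs, and `X`, `y` are the binary training inputs and class labels.

```python
import numpy as np

def matches(rule, X):
    """Boolean mask of training cases satisfying every premise of the rule."""
    mask = np.ones(len(X), dtype=bool)
    for attr, val in rule:
        mask &= (X[:, attr] == val)
    return mask

def induce_valid_rule(ranked, X, y, target):
    """Add premises in ranked order until the rule covers only `target` cases."""
    rule = []
    for attr, val in ranked:
        rule.append((attr, val))
        covered = matches(rule, X)
        if covered.any() and np.all(y[covered] == target):
            break
    return rule

def most_general_rule(rule, X, y, target):
    """Greedily drop premises while the rule remains valid for the training set."""
    general = list(rule)
    for premise in list(rule):
        trial = [p for p in general if p != premise]
        if trial:
            covered = matches(trial, X)
            if covered.any() and np.all(y[covered] == target):
                general = trial
    return general
```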
For example, the rule induced from the AIB example training case in Section 4.5 is valid only if the 5th ranked attribute ‘smoker’ is included in the rule yet the data relationship shows that the attribute is of extremely low importance in the classification of the case. The advantage of the rule, however, is that it is valid for the training set and the more general the rule the easier it is to understand what the network has learned. This is important for network validation since all of the extracted rules represent the knowledge that the MLP has learned from the training set. Direct Explanations and Knowledge Extraction from a Multilayer Perceptron Network 279 5 Knowledge Extraction from all LBP Training Cases Using the knowledge extraction method, the ranked significant inputs and the ranked negated significant inputs were discovered separately for each of the 99 training cases in the low back pain MLP training set. With the aim of aggregating this knowledge in a meaningful way for validation by the clinicians, the average of the ranking values of the significant inputs was taken by class, resulting in a ranked significant input profile for each diagnostic class, as shown in Table 3 and in [19]. Similarly, the ranked negated input class profiles for each diagnostic class are shown in Table 4 and in [19]. 5.1 Discussion of the Ranked Class Profiles In summary, the significant input class profiles in Table 3 indicate that a typical SLBP patient presents with back pain worse than leg pain and a good range of lumbar flexion, whereas a typical ROOTP patient presents with leg pain greater than back pain and limited lumbar movements. A typical AIB patient claims invalidity/disability benefit, high Waddell’s inappropriate signs and a psychological profile indicating distress. Some indicators are significant to both SLBP and AIB classes which is to be expected since AIB patients are SLBP patients that develop signs of illness behaviour manifested by distress. The negated input profiles in Table 4 indicate that a typical SLBP patient does not present with many of the key significant inputs of class ROOTP patients and vice versa. There is an indication that some SLBP patients may not have a normal psychological profile, possibly because patients with chronic back pain have the potential to develop signs of AIB. Many of the key negated AIB inputs support the key significant inputs of class AIB patients. 5.2 Validation of the LBP Network The ranking profiles for each diagnostic class were used to generally validate the low back pain MLP network, with the assistance of the domain experts, as follows. Validation of the Training Cases. The ranked class profiles, as shown in Tables 3 and 4, indicate to the domain experts that the low back pain MLP network has largely determined relevant attributes as typical characteristics of patients belonging to the 3 diagnostic classes. However, for the SLBP class, the 8th ranked attribute ‘no back pain’ is contradictory to the top ranked attribute ‘worse back pain’. Further investigation revealed that two SLBP training cases with attribute ‘no back pain’ had been incorrectly included in the training set. These were the only two cases in the training set with this attribute present which demonstrates the sensitivity of both the MLP network and the knowledge extraction method. As a result of the two incorrect training cases the training performance of the low back pain MLP network was reassessed at 98% classification accuracy. 280 M.L. Vaughn et al. Table 3. 
Top ten averaged ranked significant inputs for all training cases in each diagnostic class

Rank  All SLBP training cases                All ROOTP training cases            All AIB training cases
1     back pain worse than leg pain          back pain aggravated by coughing    claiming invalidity/disability benefit
2     lumbar flexion ≥ 45°                   leg pain worse than back pain       straight left leg raise ≤ 45°
3     straight right leg raise ≥ 70°         lumbar flexion < 30°                high Waddell's inappropriate signs
4     no leg pain symptoms                   lumbar extension (5 to 14°)         distressed depressive
5     pain brought on by bending over        normal DRAM                         pain brought on by bending over
6     minimal ODI score                      low MSPQ score                      straight left leg raise ltd by hamstrings
7     lumbar extension < 5°                  low Zung depression score           pain brought on by falling over
8     no back pain symptoms                  use of walking aids                 back pain aggravated by standing
9     straight left leg raise (46 to 69°)    leg pain aggravated by coughing     high Zung depression score
10    straight right leg raise not limited   pain brought on by lifting          back pain worse than leg pain

Table 4. Top ten averaged ranked negated inputs for all training cases in each diagnostic class

Rank  All SLBP training cases                All ROOTP training cases            All AIB training cases
1     back pain aggravated by coughing       back pain worse than leg pain       straight left leg raise ltd by leg pain
2     lumbar flexion < 30°                   lumbar flexion ≥ 45°                low Waddell's inappropriate signs
3     lumbar extension (5 to 14°)            pain brought on by bending over     normal DRAM
4     equal back and leg pains               no leg pain symptoms                low Zung depression score
5     smoker                                 minimal ODI score                   leg pain worse than back pain
6     normal DRAM                            straight right leg raise ≥ 70°      leg pain aggravated by walking
7     loss of reflexes                       lumbar extension < 5°               chronic back pain
8     low Zung depression score              acute back pain                     straight left leg raise (46° to 69°)
9     recurring back pain                    high Zung depression score          back pain aggravated by coughing
10    leg pain worse than back pain          high MSPQ score                     straight right leg raise limited by back pain

Validation of the Test Cases. Of the 99 test cases, 35 were apparently mis-classified by the low back pain MLP network. By directly interpreting each of the mis-classified test cases using the knowledge extraction method it was possible to compare the top ranked significant and negated inputs of each test case with the class profiles shown in Table 3 and Table 4. Many of the mis-classified test case rankings indicated a high degree of commonality with the ranked class profiles. On further investigation it was agreed by the domain experts that 16 of the apparently mis-classified cases were likely to have been correctly classified by the low back pain MLP and incorrectly classified by the clinicians, based on the evidence of the average class rankings. Of these, 13 cases (out of 19) were correctly classified by the network as class AIB, 2 (out of 11) were correctly classified by the network as class ROOTP and 1 (out of 5) was correctly classified by the network as class SLBP. The difficulty experienced by clinicians in diagnosing AIB patients has been observed in other studies [20]. As a result of the 16 test cases correctly classified by the MLP network, the test performance of the low back pain MLP network was re-assessed at 81% classification accuracy, which is similar to the results of other researchers [21,22].
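The aggregation behind Tables 3 and 4 can be sketched as follows. This is an illustrative reconstruction only: the data layout (one dictionary of attribute ranks per interpreted training case, with 1 the most important) and the handling of attributes absent from a case are assumptions rather than details given in the chapter.

```python
from collections import defaultdict

def class_profiles(case_rankings, case_classes, top_n=10):
    """Average the per-case attribute ranks within each class and return the
    top_n attributes with the best (lowest) average rank for each class."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for ranks, cls in zip(case_rankings, case_classes):
        for attr, rank in ranks.items():
            sums[cls][attr] += rank
            counts[cls][attr] += 1
    profiles = {}
    for cls in sums:
        averaged = {a: sums[cls][a] / counts[cls][a] for a in sums[cls]}
        profiles[cls] = sorted(averaged, key=averaged.get)[:top_n]
    return profiles

# usage sketch: significant and negated rankings are aggregated separately,
# e.g. profiles = class_profiles(significant_rankings, case_classes)
```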
6 Comparison with other Methods Most approaches for extracting knowledge in the form of rules from trained neural networks use a search based approach [6-10] where the MLP network is regarded as a collection of two-layer perceptrons with activation functions which approximate step threshold units. A decompositional [10] approach is taken where rules are extracted for each hidden and output unit separately. The rule extraction algorithms search for combinations of input values which when satisfied guarantee that a given unit is maximally active. A limitation of these methods is that obtaining all possible combinations of rules is NP-hard [11,12] and the methods are not universally applicable to arbitrary MLP networks [5]. Such methods are thus not feasible for extracting rules from real-world networks with a large number of input features. To reduce the search space in the rule extraction process some approaches incorporate techniques such as specialised training procedures [9,23] and network pruning [2427,12] or special network topologies [28,29]. The Partial-RE method in [12] uses weight ordering to reduce the search space and can be used for larger size problems if a small number of premises per rule is sufficient, in which case the method is polynomial in n, where n is the number of input features. An approach that enables the extraction of rules that directly map inputs to outputs for feedforward networks is DEDEC [30] which, as in the current study, also extracts rules by ranking the inputs of the MLP according to their importance. However, the ranking process is done by first examining the weight vectors of the network and then clustering the ranked inputs. Each cluster is used to generate a set of optimal binary rules that describes the functional dependencies between the attributes of a cluster and the network outputs [12]. 282 M.L. Vaughn et al. Another approach that enables the extraction of rules that directly map inputs to outputs for arbitrary MLP networks is validity-interval analysis (VIA) [31] which uses linear programming to determine if a set of constraints on a network’s activation values is consistent. However, the method is computationally expensive since it requires multiple runs of linear programming per rule. A further drawback is that the method assumes that activations of the hidden layer units are independent. The rule extraction method described in the current study also enables the extraction of rules that directly map inputs to outputs for arbitrary MLP networks. However, the method in this study uses a direct, holistic approach which is not computationally expensive since the complexity of the method is linear in the number of hidden layer and output layer neurons. Unlike VIA, the approach makes use of the dependent hidden layer activations to interpret and discover knowledge from an input case. The method aims to extract rules only from the training set because the network knowledge is expected to embody what it has learned from the training examples. As a result, the maximum number of rules that can be extracted is limited by the size of the training set. However, duplication of the most general rules is expected, especially for training examples lying in the same hidden layer decision region [18]. Use of the method thus far, as indicated in this study and [14,18], indicates good rule comprehensibility with a small number of premises per extracted rule due to the exponentially decreasing relationships learned by the network. 
In general use the interpretation and knowledge extraction method described in this study directly provides an explanation for any input case presented to the network by discovering the ranked data relationship for the case, as demonstrated in Section 4. Potentially novel inputs beyond the network knowledge bounds, as defined by the hidden layer decision regions with training examples, can be detected and prevented, as shown in [18]. Since the method does not use combinatorial search or specialised training procedures it potentially represents a significant advance towards the goal of readily interpreting trained neural networks (for which solutions already exist) that solve real-world problems with a large number of input features [5,11], as in the current study. 7 Summary and Conclusions Using a new interpretation and knowledge extraction method it is shown in this chapter how to directly explain the classification of any MLP network input case by discovering the most important inputs in the input case in the form of a ranked data relationship. This reveals the non-linear exponentially decreasing relationship between the most important inputs and the classifying output activation for the input case, and embodies the graceful degradation properties of the MLP network. The knowledge that the MLP network learns from a training case can be represented as a ranked data relationship and as an induced general rule which is valid for the training set. In this study, the knowledge that the MLP network learns from all of the 99 training cases is represented by the ranked class profiles of the averaged highest input rankings for the significant inputs and for the negated inputs. Direct Explanations and Knowledge Extraction from a Multilayer Perceptron Network 283 In validating the ranked class profiles the domain experts considered that the preliminary network had determined largely valid attributes as typical characteristics of each diagnostic class of patients. One evidently invalid SLBP attribute, however, revealed that two training cases had been incorrectly included in the training set which illustrates the sensitivity of both the MLP network and the knowledge extraction method. By directly interpreting 19 mis-classified AIB test cases it was agreed by the domain experts that 13 cases were likely to have been correctly classified by the low back pain MLP based on the evidence of the class characteristics. This demonstrates the greater consistency by the MLP network in classifying the AIB patients when compared with the clinicians. Since the interpretation and knowledge extraction method uses a direct, holistic approach which does not use combinatorial search or specialised training procedures it is concluded that the method potentially represents a significant advance towards the goal of readily interpreting trained feed-forward neural networks that solve real-world problems with a large number of input features, as in the current study. Future Work Future research work will seek to automatically induce a valid rule for each training input case to further enhance the MLP network knowledge validation process. Studies will also be made to discover if the knowledge learned by the network is invariant with respect to parameters in the learning environment such as network architecture, input/output data encoding, activation function, selection of initial weights and learning rules. The results of this research will be presented subsequently. 
Acknowledgements The financial support for this research was provided by the Ridgeway Hospital, Swindon and Compass Health Care. References 1. Spiegelhalter, D.J., Taylor, C.C., (eds) : Machine Learning, Neural and Statistical Classification. Ellis Horwood, Chichester (1994) 2. Patterson D.: Artificial Neural Networks Theory and Applications. Prentice Hall: Singapore (1996) 3. Dillon T., Arabshahi P., Marks R.J.: Everyday Applications of Neural Networks. IEEE Trans. Neural Networks, Vol. 8. (1997) 4. Looney C.G.: Pattern Recognition Using Neural Networks. Oxford University Press: New York (1997) 284 M.L. Vaughn et al. 5. Craven M.W., Shavlik J.W.: Using Sampling and Queries to Extract Rules from Trained Neural Networks, in Machine Learning. Proceedings of the Eleventh International Conference on Machine Learning, Amherst, MA,USA. Morgan Kaufmann (1994) 73-80 6. Ourston D., Mooney R.J.: Changing the Rules: A Comprehensive Approach to Theory Refinement. Proceedings of the Eighth National Conference on Artificial Intelligence (1990) 815-820 7. Gallant S.I.: Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA. (1993) 8. L.Fu: Neural Networks in Computer Intelligence, McGraw-Hill, London (1994) 9. Towell, G.G., & Shavlik, J.W.: Extracting Refined Rules from Knowledge Based Neural Networks. Machine Learning, Vol.13. (1993) 71-101 10. Andrews R., Diederich J., Tickle A.B.: Survey and Critique of Techniques for Extracting Rules from Trained Artificial Neural Networks. Knowledge-Based Systems, Vol. 8 (6) (1995) 373-389 11. Tickle A.B., Andrews R., Golea M., Diederich J.: The Truth Will Come to Light: Directions and Challenges in Extracting the Knowledge Embedded Within Trained Artificial Neural Networks. IEEE Trans. Neural Networks, Vol. 9 (6) (1998) 1057-1067 12. Taha I.A., Ghosh J.: Symbolic Interpretation of Artificial Neural Networks. IEEE Trans. Neural Networks, Vol.11 (3) (1999) 448-463 13. Vaughn M.L.: Interpretation and Knowledge Discovery from the Multilayer Perceptron Network: Opening the Black Box. Neural Computing & Applications, Vol. 4 (2) (1996) 7282 14. Vaughn M.L., Ong E., Cavill S.J.: Interpretation and Knowledge Discovery from the Multilayer Perceptron Network that Performs Whole Life Assurance Risk Assessment. Neural Computing & Applications, Vol. 6 (4). (1997) 203-213 15. Waddell G.: A New Clinical Model for the Treatment of Low-Back Pain. Spine, Vol. 12 (7) (1987) 632-644 16. Jackson D., Llewelyn-Phillips H., Klaber-Moffett J.: Categorization of Back Pain Patients Using an Evidence Based Approach. Musculoskeletal Management, Vol. 2 (1996) 39-46 17. Bigos S., Bowyer O., Braen G.: Acute Low Back Problems in Adults. Clinical Practice Guideline No. 14. AHCPR Publication No. 95-0642. U.S. DHHS. (1994) 18. Vaughn M.L.: Derivation of the Multilayer Perceptron Weight Constraints for Direct Network Interpretation and Knowledge Discovery. To be published in Neural Networks (1999) 19. Vaughn M.L., Cavill S.J., Taylor S.J., Foy M.A., Fogg A.J.B.: Direct Knowledge Discovery and Interpretation from a Multilayer Perceptron Network that Performs Low-Back-Pain Classification. In Bramer M. (ed.), Knowledge Discovery and Data Mining: Theory and Practice. IEE Press (1999) 160-179 20. Waddell G., Bircher M., Finlayson D., Main C.: Symptoms and Signs: Physical Disease or Illness Behaviour. BMJ, Vol. 289 (1984) 739-741 21. Bounds D. G., Lloyd P. J., Mathew B. G., Waddell G.: A Multi Layer Perceptron Network for the Diagnosis of Low Back Pain. 
Proceedings of IEEE International Conference on Neural Networks, San Diego, California (1988) 481-489 22. Bounds D. G., Lloyd P. J., Mathew B.: A Comparison of Neural Network and Other Pattern Recognition Approaches to the Diagnosis of Low Back Disorders. Neural Networks, Vol. 3. (1990) 583-591 Direct Explanations and Knowledge Extraction from a Multilayer Perceptron Network 285 23. Craven, M.W., & Shavlik, J.W.: Learning symbolic rules using artificial neural networks, Proceedings of the Tenth International Conference on Machine Learning. Amherst MA, USA. Morgan Kauffman (1993) 73-80 24. Viktor, H.L., Engelbrecht, A.P. & Cloete, I.: Reduction of Symbolic Rules from Artificial Neural Networks Using Sensitivity Analysis, Proceedings of the 1995 IEEE International Conference on Neural Networks, Perth, Western Australia (1995) 25. Krishnan R.: A Systematic Method for Decompositional Rule Extraction from Neural Networks. Proceedings of NIPS’96 Workshop of Rule Extraction from Trained Artificial Neural Networks. Queensland Univ. Technol. (1996) 38-45 26. Maire F. : A Partial Order for the M-of-N Rule Extraction Algorithm. IEEE Trans. Neural Networks, Vol.8 (1997) 1542-1544 27. Setiono R.: Extracting Rules from Neural Networks by Pruning and Hidden Unit Splitting. Neural Comput, Vol 9. (1997) 205-225 28. Bologna, G.: Rule Extraction from the IMLP Neural Network: a Comparative Study. Proceedings of NIPS’96 Workshop of Rule Extraction from Trained Artificial Neural Networks. Queensland Univ. Technol. (1996) 29. Saito K., Nakano R.: Law Discovery Using Neural Networks. Proceedings of NIPS’96 Workshop of Rule Extraction from Trained Artificial Neural Networks. Queensland Univ. Technol. (1996) 62-69 30. Tickle A.B., Orlowski M., Diederich J.: DEDEC: A Methodology for Extracting Rules from Trained Artificial Neural Networks. Proceedings of NIPS’96 Workshop of Rule Extraction from Trained Artificial Neural Networks. Queensland Univ. Technol. (1996) 90-102 31. Thrun, S.B.: Extracting rules from artificial neural networks with distributed representations. In: Advances in Neural Information Processing Systems, Vol. 7. Tesauro G., Touretzky D., Leen T. (eds.). MIT Press (1995) High Order Eigentensors as Symbolic Rules in Competitive Learning Hod Lipson1 and Hava T. Siegelmann2 1 Mechanical Engineering Department Industrial Engineering and Management Department Technion – Israel Institute of Technology, Haifa 32000, Israel Current e-mail: hlipson@mit.edu, iehava@ie.technion.ac.il 2 Abstract. We discuss properties of high order neurons in competitive learning. In such neurons, geometric shapes replace the role of classic ‘point’ neurons in neural networks. Complex analytical shapes are modeled by replacing the classic synaptic weight of the neuron by high-order tensors in homogeneous coordinates. Such neurons permit not only mapping of the data domain but also decomposition of some of its topological properties, which may reveal symbolic structure of the data. Moreover, eigentensors of the synaptic tensors reveal the coefficients of polynomial rules that the network is essentially carrying out. We show how such neurons can be formulated to follow the maximum-correlation activation principle and permit simple local Hebbian learning. We demonstrate decomposition of spatial arrangements of data clusters including very close and partially overlapping clusters, which are difficult to separate using classic neurons. 
1 Introduction A phase diagram contains data points representing measurements of a phenomenon plotted in multi-dimensional space. If the measured phenomenon follows no rules, the data points will uniformly fill the data space. However, if there are any governing principles to the measured phenomenon, the data points will not fill the space uniformly, but rather will form some kind of structure or pattern. Tools for exploratory data analysis such as neural networks may then be used to map this structure, and so create a model for the behavior of the phenomenon. In many cases, however, mapping the phenomenon, i.e. determining what areas in the data space it is more likely to occupy, is not sufficient. In order to understand the laws that govern the phenomenon, it is necessary to decompose the mapped volume into its components and derive relationships among them. It is convenient to associate the simpler mapping of the phenomenon with determination of its metrics, and the “deeper” understanding of the governing principles or symbolic structure with determination of its topology. In this paper we take the view where in order to extract symbolic meaning from an observed phenomenon it is necessary to use neuronal units that can individually account for a complete symbolic rule. In practice, we use geometric shapes: the shape itself is the symbolic rule and its parameters (say S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp.286-297, 2000. © Springer-Verlag Berlin Heidelberg 2000 High Order Eigentensors as Symbolic Rules in Competitive Learning 287 curvatures and gradients) are the tunable parameters. We describe an augmentation to classic competitive neurons that permits them to decipher these aspects. The ability to determine the topological structure of the data domain is considered a primary capacity of Kohonen’s self-organizing map (SOM) (Kohonen, 1997). Under certain conditions, a SOM network may self-organize so that each neuron relocates to the center of a cluster of input points which it represents. Due to the connectivity of the net, the topology of the net is also mapped onto the topology of the data domain, thus revealing topological properties of the structure such as cluster proximity. However, the topological structure of the net may also act as a constraint on the arrangement of the neurons, impeding its ability to capture certain configurations. When this topological constraint is released by adopting full connectivity, a general form of a vector quantization (VQ) clustering algorithm is obtained. Such algorithms (Pal et al, 1993) can map any data arrangement, but they do not provide information regarding the topological structure. Similarly, most other networks acquire topological information implicitly into their weights; this information cannot be directly extracted. In this paper we explore an alternative approach to modeling the topology of the data domain. Modeling is achieved by using neurons that not only map the location of their corresponding data, but also explicitly map its local topological and geometrical properties. These ‘geometric neurons’ are of higher dimensionality than their input domain, and may therefore track features of the activation area that might correspond to local symbolic properties. They use high-order ‘synaptic tensors’ instead of classic synaptic weight vectors, where weights correspond also to combinations of inputs of various orders. 
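The 'combinations of inputs of various orders' can be generated mechanically as repeated outer (Kronecker) products of the homogeneous input vector; this is how the synaptic tensors introduced later in the paper are indexed. A small illustrative sketch (not the authors' code):

```python
import numpy as np

def homogeneous(x):
    """Append the constant 1 so that lower-order terms are carried along."""
    return np.append(np.asarray(x, dtype=float), 1.0)

def input_monomials(x, order):
    """All products of `order` homogeneous coordinates, flattened via
    repeated Kronecker products (order 1 returns x_H itself)."""
    xh = homogeneous(x)
    out = xh
    for _ in range(order - 1):
        out = np.kron(out, xh)
    return out

# For x = (x1, x2) and order 2 this yields x1*x1, x1*x2, x1, x2*x1, ..., 1.
print(input_monomials([2.0, 3.0], 2))
```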
When these neurons are used in a network configuration, local topological properties are accumulated to explicitly reveal the global topological arrangement of the data. These neurons obey the simple Hebbian-type learning rule, and, depending on their shape and base functions, can reveal configurations even among close and partially overlapping clusters. The paper first outlines the concept of the proposed enhancement in light of existing work, and then provides a mathematical formulation for analytic geometric neurons with polynomial base functions. We show that a classic neuron is a first-order case of the geometric neuron, and the second-order neuron corresponds to the wellestablished ellipsoidal (Mahalanobis) metric neuron. We describe the neuron itself, its activation, its learning scheme and then demonstrate its functionality within a net. 2 Shape-Sensitive Neurons The fundamental property of a shape sensitive neuron is that it is capable of mapping the local topological and geometrical properties of a data volume because, unlike a point neuron, it has topological and geometric properties of its own. These properties are parametric and hence adaptive. Unlike classic networks, the topological structure of the data is then directly accessible, since the topology of each neuron is known and simple. A schematic illustration of a network with geometric neurons is shown in Figure 1(a), where point neurons are replaced by higher-order ‘blob’ neurons which can take the form of various topological components such as links, forks and volumes, 288 H. Lipson and H.T. Siegelmann as well as of ordinary point neurons. Four actual shape sensitive neurons are shown in Figure 1(b). High order neurons are defined as neurons which accept input not only from single inputs, but also from combinations of inputs, such as sets of inputs multiplied to various orders. The use of high order neurons in general is not new. High order neurons are generally associated with more degrees of freedom rather than explicit topological properties. Explicit geometric properties have been introduced for specific cases of prototype-based clustering and competitive neural networks. Gustafson and Kessel (1979) used the covariance matrix to capture ellipsoidal properties of clusters. Davé (1989) used fuzzy clustering with a non-Euclidean metric to detect lines in images. This concept was later expanded, and Krishnapuram et al (1995) used general second-order shells such as ellipsoidal shells and surfaces. For an overview and comparison of these methods see Frigui and Krishnapuram (1996). Incorporation of Mahalanobis (elliptical) metrics in neural networks was addressed by Kavuri and Venkatasubramanian (1993) for fault analysis applications, and by Mao and Jain (1996) as a general competitive network with embedded principal component analysis units. Kohonen uses adaptive tensorial weights (Kohonen, 1997) to capture significant variances in the components of input signals, thereby introducing a weighted Euclidean distance in the matching law. Abe et al (1997) attempt to extract fuzzy rules using ellipsoidal units and compare successfully to other rule-extracting methods. In this paper we explore the possibility of using neurons with general and explicit geometric properties under direct Hebbian learning. These neurons are not limited to ellipsoidal shapes. Å (a) (b) Fig. 1. (a) A schematic illustration of a network with geometric neurons. 
Point neurons are replaced by higher-order ‘blob’ neurons that can take the form of various topological (symbolic) components such as links, forks and volumes, as well as of ordinary point neurons. (b) Examples of four high order geometric neurons developed in this work, with different geometric and topological structures, and the data points they represent. In the following sections we adopt a notation where: x, w are column vectors, W, D are matrices; xH, WH denote vectors and matrices in homogeneous representation (j) (j) th th (described later), x , D - the j vector/matrix corresponding to the j neuron/class; x1, Dij - element of a vector/matrix; m - the order of neuron; d - the dimensionality of the input; N - the size of the layer (the number of neurons). High Order Eigentensors as Symbolic Rules in Competitive Learning 3 289 A High-Order Neuron with Polynomial Base Functions In classical self-organizing networks, each neuron j is assigned a synaptic weight (j) vector w . The winning neuron j(x) in response to an input x is the one showing the (j)T highest correlation with the input, i.e. neuron j for which w x is the largest. Note that (j) when the synaptic weights w are normalized to a constant Euclidean length, then the above criterion becomes identical to the minimum Euclidean distance matching criterion. However, the use of a minimum-distance matching criterion incorporates several difficulties. The minimum distance criterion implies that the features of the input domain are spherical, i.e., matching deviations are considered equally in all directions, and distances between features must be larger than the distances between points within a feature. These aspects preclude the ability to detect higher order, complex or ill-posed feature configurations and topologies, as these are based on higher geometrical properties such as directionality and curvature. This constitutes a major difficulty, especially when the input is of high dimensionality, where such configurations are difficult to visualize. Complex clusters may require complex metrics for separation. The modeling constraints imposed by the maximum correlation matching criterion (j) stem from the fact that the neuron’s synaptic weight w has the same dimensionality as the input x , i.e., the same dimensionality as a single point in the input domain, while in fact the neuron is modeling a cluster of points which may have higher order attributes such as directionality and curvature. We shall therefore refer to the classic neuron as a ‘first-order’ (zero degree) neuron, due to its correspondence to a point in multidimensional space. Second Order To circumvent this restriction, we augment the neuron with the capacity to map additional geometric and topological information. For example, as the first-order is a point neuron, the second-order case will correspond to orientation and size components, effectively attaching a local oriented coordinate system with nonuniform scaling to the neuron center and using it to define a new distance metric. Thus each second-order neuron will represent not only the mean value of the data points in the cluster it is associated with, but also the principal directions of the cluster and the variance of the data points along these directions. Intuitively, we can say that rather than defining a sphere, the second-order distance metric now defines a multidimensional oriented ellipsoid. 
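As a baseline for the derivation that follows, the classic first-order competitive step described at the start of this section can be sketched in a few lines; the weight matrix `W` (one weight vector per point neuron) and the toy input are placeholders:

```python
import numpy as np

def winner_first_order(x, W):
    """Classic competitive matching: the winner is the neuron whose
    (length-normalised) weight vector is most correlated with x."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)   # constant Euclidean length
    return int(np.argmax(Wn @ x))

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 2))          # three point neurons in a 2-D input space
print(winner_first_order(np.array([0.5, -1.0]), W))
```

With normalised weights this coincides with the minimum Euclidean distance criterion discussed above; the second-order neuron developed next replaces the implied spherical metric with an oriented ellipsoidal one.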
After some mathematical manipulation (Gustafson and Kessel, 79), the second order information (orientation and scaling) can be shown to reside entirely in the correlation matrix R of the zero-mean data, and the matching criterion becomes

i(\mathbf{x}) = \arg\min_j \left(\mathbf{x} - \mathbf{w}^{(j)}\right)^T R^{-1} \left(\mathbf{x} - \mathbf{w}^{(j)}\right), \qquad j = 1, 2, \ldots, N     (1)

This criterion also corresponds to the term used in the maximum likelihood Gaussian classifier (Duda and Hart, 73). As it stands, Eq. (1) requires tracking the orientation and size information, in R^{(j)}, separately from the position in w^{(j)}. For a more systematic treatment, we combine these two components into one expanded covariance denoted R_H^{(j)} in homogeneous coordinates (Faux and Pratt, 81), as

R_H^{(j)} = \sum \mathbf{x}_H \mathbf{x}_H^T     (2)

where in homogeneous coordinates

\mathbf{x}_H = \begin{pmatrix} \mathbf{x} \\ 1 \end{pmatrix}

so that now R_H^{(j)} \in \mathbb{R}^{(d+1) \times (d+1)} and \mathbf{x}_H \in \mathbb{R}^{(d+1) \times 1}. When working in homogeneous coordinates, the constant unit 1 is appended to the vector, so that when the vector is multiplied, it carries lower orders as well. Thus, a single representation encapsulates both first degree (linear) and zero degree (constant) elements. In analogy to the ‘correlation matrix memory’ and ‘autoassociative memory’ (see Haykin 1994), the extended matrix R_H^{(j)}, in its general form, can be viewed as a ‘homogeneous autoassociative tensor’. Now the new matching criterion becomes simply

i(\mathbf{x}) = \arg\min_j \left\| R_H^{(j)^{-1}} \mathbf{x}_H \right\|, \qquad j = 1, 2, \ldots, N     (3)

This representation retains the notion of maximum correlation, and for convenience, we now call R_H^{-1} the synaptic tensor. Note that this step is based on the fact that the eigenstructure of the extended correlation matrix R_H^{(j)} in homogeneous coordinates corresponds to the principal directions and average (i.e. both second-order and first-order) properties of the cluster accumulated in R_H^{(j)}. This property permits extension to higher order tensors, where direct eigenstructure analysis is not well defined. The transition into homogeneous coordinates also dispensed with the need to zero-mean the correlation data.

Digressing for a moment, we recall that the classic neuron possesses a synaptic weight vector w^{(j)} which corresponds to a point in the input domain. The synaptic weight w^{(j)} can be seen to be the first order average of its signals, i.e., w^{(j)} = \sum \mathbf{x}_H (the last element x_H(d+1) = 1 of the homogeneous coordinates has no effect in this case). The shape-sensitive neurons hold information regarding the linear correlations among the coordinates of data points represented by the neuron, by using R_H^{(j)} = \sum \mathbf{x}_H \mathbf{x}_H^T. Each element of R_H is thus a proportionality constant relating two specific dimensions of the cluster.

Higher Orders

The second-order neuron can be regarded as a second-order approximation of the corresponding data distribution. We may consequently introduce an m-th order shape sensitive neuron capable of modeling a d-dimensional data cluster to an m-th order approximation. For example, a third-order neuron is capable of storing not only the principal directions and size of the d-dimensional data cluster, but also its curvatures along each of these axes; hence, it is capable of matching the topology of, say, a Y-shaped fork. In order to obtain m-th order components of the data cluster, we use an analogy between eigenstructure decomposition and least-squares fitting.
The analogy holds that the principal directions of a data cluster (its eigenvectors) correspond to the normals of the orthogonal set of best-fit hyperplanes through the data set, and the eigenvalues correspond to the variances of the data from those hyperplanes. In homogeneous coordinates, the hyperplanes contain also an offset, and hence each homogeneous eigenvector corresponds to the coefficients of the corresponding hyperplane equation. Extending this analogy to higher orders, the higher principal components of the cluster (say, the principal curvatures) correspond to the eigentensors of higher-order correlation tensors in homogeneous coordinates. The neuron is therefore represented by a 2(m-1)-dimensional tensor of rank d+1, denoted (j) (d+1)™...™(d+1) . The factor of 2 is introduced by the squared error used by the leastby Z ³§ squares method. The tensor is created by successive ‘outer products’ of the homogeneous vector xH by itself. ZH ( j) = xH 2 ( m −1) (4) In practice, in order to extract the winner neuron it is only necessary to determine the amount of correlation between an input and the tensors. As in Eq. (3), in higher orders too this amounts to multiplication of the input xH by the tensor. The exponent consists of the factor (m-1), which is the degree of the approximation, and a factor of 2 since we are performing auto-correlation so the function is multiplied by itself. In analogy to reasoning that lead to Eq. (3), each eigenvector (now an eigentensor of order m-1) corresponds to the coefficients of a principal curve, and multiplying it by an input points produces an approximation of the distance of that input point from the curve. Consequently, the inverse of the tensor (j) ZH can then be used to compute the high-order correlation of the signal with the nonlinear shape neuron, by simple tensor multiplication: i (x ) = arg j min Z (Hj ) ⊗ x Hm −1 , −1 j = 1,2,K, N (5) where © denotes tensor multiplication. Note that the covariance tensor can only be inverted if its order is an even number, as satisfied by Eq. (4). Note also that amplitude operator is carried out by computing the root of the sum of the squares of the elements of the argument. The computed metric is now not necessarily spherical and may take various other forms. In practice however, high-order tensor inversion is not directly required. To make this analysis simpler, we use a Kronecker notation for tensor products (see Graham, 1981). Kronecker tensor product ‘flattens out’ the elements of X©Y into a large matrix formed by taking all possible products between the elements of X and those of Y. For example, if X is a 2 by 3 matrix, then X©Y is 292 H. Lipson and H.T. Siegelmann  Y ⋅ X 1,1 X ⊗Y =  Y ⋅ X 2,1 Y ⋅ X 1, 2 Y ⋅ X 2,2 Y ⋅ X 1, 3  Y ⋅ X 2,3  (6) where each block is a matrix of the size of Y. In this notation, the internal structure of higher-order tensors is easier to perceive and their correspondence to linear regression of principal polynomial curves is revealed. Consider, for example, a fourth order covariance tensor of the vector x={x,y}. The fourth-order tensor corresponds to the simplest non-linear neuron according to Eq. 
(4), and takes the form of the 2™2™2™2 tensor x 2 x⊗x⊗x⊗x = R⊗R =  xy xy   x 2 ⊗ y 2   xy  x4  xy   x 3 y = 2 y   x3 y  2 2 x y x3 y x3 y x2 y2 x2 y2 xy 3 x2 y2 x2 y2 xy 3 x2 y2   xy 3  3 xy   y4  (7) The homogeneous version of this tensor includes also all lower-order permutations of the coordinates of xH={x,y,1}, namely, the 3™3™3™3 tensor Z H ( 4) = x H , x H , x H , x H = R H , R H x 2  =  xy x  xy y 2 y x x 2   y  ⊗  xy 1   x xy y 2 y x  y 1  (8) Extracting Symbolic Rules It is immediately apparent that the above matrix corresponds to the matrix to be solved for finding a least squares fit of a conic section equation 2 2 ax +by +cxy+dx+ey+f=0 to the data points. Moreover, the set of eigenvectors of this matrix corresponds to the coefficients of the set of mutually orthogonal best-fit conic section curves that are the principal curves of the data. This notion adheres with Gnanadesikan’s method for finding principal curves (Gnanadesikan, 1977). Now, substitution of a data point into the equation of a principal curve yields an approximation of the distance of the point from that curve, and the sum of squared distances amounts to the term evaluated in Eq. (5). Note that each time we increase complexity, we are seeking a set of principal curves of one degree higher. This implies that the least-squares matrix needs to be two degrees higher (because it is minimizing the squared error), thus yielding the coefficient 2 in the exponent of Eq. (4) for the general case. Figure 2 shows a cluster of data points and one of their eigentensors. Rule extraction is performed as follows: First, the synaptic tensor of each neuron is analyzed to extract its eigentensors, which are half the order of the synaptic tensor. The terms of each eigentensor define the coefficients of a polynomial curve, such as the one shown in Figure 2 (b). Distance from this polynomial curve is one of the clustering metrics used by this neuron, and hence can be used as a symbolic rule for High Order Eigentensors as Symbolic Rules in Competitive Learning 293 classification of the associated data points. The algebraic rule can thus be directly extracted analytically. (a) (b) (c) rd Fig. 2. (a) The cluster of points, (b) one of the six eigentensors, and (c) the best-fit 3 -order 2D volume corresponding to a unit circle in the space spanned by the six orthogonal eigentensors. 4 Hebbian Learning for High-Order Neurons In order to show that higher-order shapes are a direct extension of classic neurons, we show that they are also subject to simple Hebbian learning. Following an interpretation of Hebb’s postulate of learning, synaptic modification (i.e., learning) occurs when there is a correlation between presynaptic and postsynaptic activities. We have already shown in Eq. (3) above that (a) the presynaptic activity xH and (j)-1 postsynaptic activity of neuron j coincide when the synapse is strong, i.e. RH xH is minimum. We now proceed to show that, in accordance with Hebb’s postulate of learning, (b) it is sufficient to incur self-organization of the neurons by increasing synapse strength when there is a coincidence of presynaptic and postsynaptic signals. In order to provide quality (b) above, we need to show how self organization is (j) obtained merely by increasing RH , where j is the winning neuron. 
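A compact sketch ties the second-order pieces of this section together: the homogeneous correlation matrix of Eq. (2) is accumulated from the data, its eigenvectors act as the (second-order) eigentensors whose components are the coefficients of the best-fit hyperplanes, and matching follows the minimum-norm criterion of Eq. (3). The code is an illustration under these stated readings, not the published implementation; the toy clusters and the use of a plain linear solve are assumptions, and the higher-order case of Eqs. (4)–(5) would replace x_H by its Kronecker powers.

```python
import numpy as np

def homogeneous_rows(X):
    """Append a column of ones: each row becomes an x_H vector."""
    return np.hstack([X, np.ones((len(X), 1))])

def synaptic_matrix(X):
    """Second-order homogeneous autocorrelation R_H = sum x_H x_H^T (Eq. 2)."""
    XH = homogeneous_rows(X)
    return XH.T @ XH

def eigen_rules(RH):
    """Eigenvectors of R_H: each holds the coefficients of a best-fit
    hyperplane a1*x1 + ... + ad*xd + a0 = 0, i.e. a readable linear rule."""
    vals, vecs = np.linalg.eigh(RH)
    return vals, vecs.T                    # one coefficient vector per row

def winner(x, RH_list):
    """Assign x to the neuron minimising ||R_H^(-1) x_H|| (cf. Eq. 3)."""
    xH = np.append(x, 1.0)
    scores = [np.linalg.norm(np.linalg.solve(RH, xH)) for RH in RH_list]
    return int(np.argmin(scores))

# Toy usage: two elongated clusters, one second-order neuron each.
rng = np.random.default_rng(2)
A = rng.normal([0.0, 0.0], [2.0, 0.2], size=(200, 2))
B = rng.normal([5.0, 5.0], [0.2, 2.0], size=(200, 2))
neurons = [synaptic_matrix(A), synaptic_matrix(B)]
print(winner(np.array([4.8, 4.0]), neurons))   # expected to favour the second neuron
```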
As each new data point arrives at a specific neuron j in the net, the synaptic weight of that neuron is adapted by the incremental corresponding to Hebbian learning, ZH ( j) (k + 1) = Z H ( j ) (k ) + η (k )x (Hj ) 2( m−1) (9) where k is the iteration counter and h(k) is the iteration-dependent learning rate (j) coefficient. It should be noted that in Eq. 12, ZH becomes a weighted sum of the input 2(m-1) (unlike RH in Eq. 10, which is a uniform sum). The eigenstructure signals xH analysis of a weighted sum still provides the principal components under the assumption that the weights are uniformly distributed over the cluster signals. This assumption holds true if the process generating the signals of the cluster is stable over time - a basic assumption of all neural network algorithms (Bishop, 1997). In practice this assumption is easily acceptable, as will be demonstrated in the following sections. In principle, continuous updating of the covariance tensor may create instability in a competitive environment, as a winner neuron becomes more and more dominant. To force competition, the covariance tensor can be normalized using any of a number of 294 H. Lipson and H.T. Siegelmann factors dependent on the application, such as the number of signals assigned to the neuron so far, or the distribution of data among the neuron (forcing uniformity). Some basic criteria are discussed in (Lipson and Siegelmann, 1999). 5 Implementation The proposed neuron has been implemented both in an unsupervised and in a supervised setup. High order tensor multiplications have been performed in practice by ‘flattening out’ the tensors using Kroneker’s tensor product notation. Briefly, unsupervised learning is attained by letting randomly initialized neurons compete over input data, using the Hebbian learning and 'winner takes all' principle as described earlier. Supervised learning is attained by training one neuron per input class with inputs of that class only, and then cross-validating the results. The precise implementation is described in (Lipson and Siegelmann, 1999). Below we demonstrate some results. nd Figure 3(a) shows three 2 order neurons (ellipsoids) that self organized to decompose a point set into three natural groups. Note the correct decomposition despite the significant proximity and partial overlap of the cluster, a factor which usually ‘confuses’ classic networks. Figures 3(b,c) show decompositions of point sets rd using 3 order neurons, capable of modeling the data with more complex shapes than mere ellipses. Note how the direct determination of the area of overlap permits explicit modeling of uncertainty or ambiguity in the data domain, a factor which is crucial for symbolic understanding. Figure 4 shows three instances of different data nd topologies, modeled using three 2 order neurons. (a) (b) (c) nd rd Fig. 3. Self classification of point clusters, (a) 2 (elliptical), (b,c) 3 order. High Order Eigentensors as Symbolic Rules in Competitive Learning (a) (b) 295 (c) Fig. 4. Point clusters with various topologies and their interpretation, (a) string, (b) hole, (c) fork. Analysis shown was completed after single pass. The geometric neurons have also been tested on the 4-dimensional IRIS Data benchmark (Anderson, 1939) both in unsupervised and in supervised modes. Out of 150 flowers, the networks misclassified 3 (unsupervised) and 0 flowers (supervised), 1 respectively . Supervised learning was achieved by training each neuron on its designated class separately with 20% cross validation. 
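The unsupervised procedure just outlined combines winner-take-all matching with the Hebbian update of Eq. (9). The single-pass loop below is a schematic second-order sketch; the initialisation of the synaptic matrices and the 1/k learning-rate schedule are assumptions made for the illustration, since the chapter specifies only the update rule and the competition itself.

```python
import numpy as np

def hebbian_competitive(X, n_neurons, eta0=1.0, seed=0):
    """Single pass of winner-take-all learning with the Hebbian update
    Z_winner <- Z_winner + eta(k) * x_H x_H^T (second-order case of Eq. 9)."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    Z = [np.eye(d + 1) + 0.01 * rng.normal(size=(d + 1, d + 1)) for _ in range(n_neurons)]
    Z = [z @ z.T for z in Z]                    # keep each tensor symmetric positive definite
    for k, x in enumerate(X, start=1):
        xH = np.append(x, 1.0)
        scores = [np.linalg.norm(np.linalg.solve(z, xH)) for z in Z]
        j = int(np.argmin(scores))              # winner takes all
        eta = eta0 / k                          # iteration-dependent learning rate
        Z[j] = Z[j] + eta * np.outer(xH, xH)    # Hebbian reinforcement of the winner
    return Z

# Toy usage: two 2-D clusters, two neurons.
rng = np.random.default_rng(3)
data = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)), rng.normal(6.0, 1.0, size=(100, 2))])
tensors = hebbian_competitive(data, n_neurons=2)
```

Normalising each tensor by the number of signals assigned to its neuron, as suggested above, is one way to keep the competition from being dominated by an early winner.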
Tables 1 and 2 summarize the results. Table 1. Comparison of self-classification results for IRIS data. (Blank cells correspond to unavailable data) Method Epochs # misclassified or unclassified 25 200 17 Super Paramagnetic (Blatt et al, 1996) LVQ / GLVQ (Pal et al, 1993) K-Means (Mao and Jain, 1996) 16 HEC (Mao and Jain, 1996) 5 nd 20 4 rd 30 3 2 -order Unsupervised 3 -order Unsupervised 296 H. Lipson and H.T. Siegelmann Table 2. Supervised classification results for IRIS data. Our results after one training epoch, with 20% cross validation. Averaged results for 250 experiments. (Blank cells correspond to unavailable data) Order 3-Hidden-Layer N.N. (Abe et al, 1997) Epochs 1000 # misclassified Average Best 2.2 1 Fuzzy Hyperbox (Abe et al, 1997) Fuzzy Ellipsoids (Abe et al, 1997) 2 1000 1 3.08 1 rd 1 2.27 1 th 1 1.60 0 th 1 1.07 0 th 1 1.20 0 th 1 1.30 0 3 -order Supervised 4 -order Supervised 5 -order Supervised 6 -order Supervised 7 -order Supervised 6 1 nd 2 -order Supervised Conclusions and Further Research In this paper, we discussed the use of high-order geometric neurons and demonstrated their practical use for modeling the principal structure of spatial distributions. Although high-order neurons do not directly correspond to neurobiological details, we believe that they can provide powerful symbolic modeling capabilities. In particular, they exhibit useful properties for correctly handling partially overlapping clusters, an occurrence that may represent a key symbolic property. Moreover, when an entire shape or ‘rule’ is encoded into a single neuron its easy to find a minimal set of ‘key examples’ that can be used to induce the rule in a learning system. For example, the eigentensors of the synaptic tensor appear to be such a set. The use of geometric neurons raises some further practical questions, which have not been addressed in this paper, such as selecting the number of neurons for a particular task (network size), the cost implication of the increased number of degrees of freedom, and network initialization. It is also necessary to investigate the relationship between the values of the synaptic tensor and the shapes it may acquire. Acknowledgements We thank Eitan Domani for his helpful comments and insight. This work was supported in part by the U.S.-Israel Binational Science Foundation (BSF), by the Israeli ministry of arts and sciences, and by the Fund for Promotion of Research at the Technion. Hod Lipson acknowledges the generous support of the Charles Clore Foundation and the Fischbach Fellowship. Download further information, MATLAB demos and implementation details of this work from http://www.cs.brandeis.edu/~lipson/papers/geom.htm High Order Eigentensors as Symbolic Rules in Competitive Learning 297 References Abe S., Thawonmas R., 1997, “A fuzzy classifier with ellipsoidal regions”, IEEE Trans. On Fuzzy Systems, Vol. 5, No. 3, pp. 358-368 Anderson E., “The Irises of the Gaspe Peninsula,” Bulletin of the American IRIS Society, Vol. 59, pp. 2-5, 1939. Bishop, C. M., 1997, Neural Networks for Pattern Recognition, Clarendon press, Oxford Blatt, M., Wiseman, S. and Domany, E., 1996, “Superparamagnetic clustering of data”, Physical Review Letters, 76/18, pp. 3251-3254 Davé R. N., 1989, “Use of the adaptive fuzzy clustering algorithm to detect lines in digital images”, in Proc. SPIE, Conf. Intell. Robots and Computer Vision, SPIE Vol. 1192, No. 2, pp. 600-611 Duda R. O., and Hart, P. E., 1973, Pattern classification and scene analysis, New York, Wiley. Faux I. D., Pratt M. 
J., 1981, Computational Geometry for Design and Manufacture, John Wiley & Sons, Chichester Frigui, H. and Krishnapuram, R., 1996, “A comparison of fuzzy shell-clustering methods for the detection of ellipses”, IEEE Transactions on Fuzzy Systems, 4/2, pp. 193-199 Geoffrey J. Mclachlan, Thriyambakam Krishnan, 1997, The EM algorithm and extensions, Wiley-interscience, New York Gnanadesikan, R., 1977, Methods for statistical data analysis of multivariate observations, Wiley, New York Graham A., 1981, Kronecker products and Matrix Calculus: with Applications, Wiley, Chichester Gustafson E. E. and Kessel W. C., 1979, “Fuzzy clustering with fuzzy covariance matrix”, in Proc. IEEE CDC, San Diego, CA, pp. 761-766 Haykin, S., 1994, Neural Networks, A comprehensive foundation, Prentice Hall, New Jersey Kavuri, S.N. and Venkatasubramanian, V., 1993, “Using fuzzy clustering with ellipsoidal units in neural networks for robust fault classification”, Computers Chem. Eng., 17/8, pp. 765-784 Kohonen, T., 1997,“Self organizing maps”, Springer Verlag, Berlin Krishnapuram, R., Frigui, H. and Nasraoui, O., 1995, “Fuzzy and probabilistic shell clustering algorithms and their application to boundary detection and surface approximation - Parts I and II”, IEEE Transactions on Fuzzy Systems, 3/1, pp. 29-60. Lipson H., Siegelmann H. T., 1999, “Clustering Irregular Shapes Using High-Order Neurons”, Neural Computation, accepted for publication Mao, J. and Jain, A., 1996, “A self-organizing network for hyperellipsoidal clustering (HEC), IEEE Transactions on Neural Networks, 7/1, pp. 16-29. Pal, N., Bezdek, J.C. and Tsao, E.C.-K., 1993, “Generalized clustering networks and Kohonen’s self-organizing scheme”, IEEE Transactions on Neural Networks, 4/4, pp. 549557 Pan, J.S., McInnes, F.R. and Jack, M.A., 1996, “Fast clustering algorithms for vector quantization” Pattern Recognition, 29:3, pp. 511-518. Holistic Symbol Processing and the Sequential RAAM: An Evaluation James A. Hammerton1 and Barry L. Kalman2 1 School of Computer Science⋆ ⋆ ⋆ , The University of Birmingham, UK james.hammerton@ucd.ie 2 Department of Computer Science, Washington University, St Louis barry@cs.wustl.edu Abstract. In recent years connectionist researchers have demonstrated many examples of holistic symbol processing, where symbolic structures are operated upon as a whole by a neural network, by using a connectionist compositional representation of the structures. In this paper the ability of the Sequential RAAM (SRAAM) to generate representations that support holistic symbol processing is evaluated by attempting to perform Chalmers’ syntactic transformations using it. It is found that the SRAAM requires a much larger hidden layer for this task than the RAAM and that it tends to cluster its hidden layer states close together, leading to a representation that is fragile to noise. The lessons for connectionism and holistic symbol processing are discussed and possible methods for improving the SRAAM’s performance suggested. 1 Introduction Recently, several connectionist techniques have been developed for representing compositional structures, such as Pollack’s RAAM [15], Plate’s HRRs [14] and Callan and Palmer-Brown’s (S)RAAM [3]. Connectionist researchers have demonstrated that some of these techniques support holistic symbol processing1 [4, 5, 8], where a symbol structure can be operated upon holistically in contrast to the symbol by symbol manipulations performed by traditional symbolic systems. 
For example, Chalmers [4] has performed the transformation of simple active sentences into their passive equivalents, holistically. Other examples of holistic symbol processing include Niklasson and Sharkey’s work [13] on transforming logical implications into their equivalent disjunctions, Blank et al.’s work [2] on transforming sentences of the form X chases Y to the form Y flees X and Weber’s constant-time term unification [19]. ⋆⋆⋆ 1 The first author has now moved to the Department of Computer Science, University College Dublin, Ireland Much of the literature refers to this as holistic computation. Hammerton [7] argues that the focus is on holistic symbol processing rather than holistic computation per se. S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 298–312, 2000. c Springer-Verlag Berlin Heidelberg 2000 Holistic Symbol Processing and the Sequential RAAM: An Evaluation 299 For holistic symbol processing to progress from an interesting possibility to a phenomenon that can be exploited to enhance the performance of intelligent systems, it is necessary to develop an understanding of how the techniques work, what their limitations are and what it is that enables them to support holistic computation. Developing such an understanding may indicate ways of overcoming any limitations and may hold lessons for the development of other techniques. This paper2 presents an evaluation of the ability of the Sequential RAAM (SRAAM) [12, 15] to generate representations that support holistic symbol processing, by attempting to recreate Chalmers’ active to passive transformations [4] using the SRAAM in place of the RAAM, and then analysing its performance to understand how it is performing the task and whether the technique has any limitations. 2 Syntactic Transformations with the SRAAM 2.1 Why the SRAAM was Chosen Many of the connectionist representations that have been developed could be used for this work as they support holistic symbol processing to some degree. The SRAAM was chosen for various reasons: – Its ability to generate representations that support holistic symbol processing was not clear from the literature and thus required clarification. The more complex tasks which it has been used for involved confluent inference [5, 9]. Hammerton [8, 7] argues that confluent inference is not a form of holistic symbol processing but rather a separate phenomenon. The work of Blank et al. [2] does involve holistic symbol processing but the tasks involved are simpler than many of the tasks for which the RAAM has been used. Kwasny and Kalman [12] train SRAAMs to encode relatively complex structures but their attempts to perform holistic operations on those structures met with ambiguous results. – Kwasny and Kalman’s work [12] suggests that the SRAAM is easier to train, offers higher levels of generalisation and is more flexible in the range of structures it can represent than the RAAM. It is easier to train because the training scheme is a simple modification of the algorithm used to train simple recurrent networks (SRNs) and does not require an external memory to store intermediate results. The SRAAM achieved higher levels of generalization. When trained on 30 sequences out of a set of 183, it could correctly encode the entire set. Chalmers [4] trained a RAAM to encode a set of simple active and passive sentences. When trained on 80 of these and tested on a further 80 novel sentences, it made errors on 13 of the novel sentences. 
Finally the SRAAM is more flexible because it can represent arbitrarily branching trees rather than the fixed branching trees of the RAAM. If these advantages 2 This paper forms a summary of the work reported in Chapters 3,4 and 5 of Hammerton’s thesis [7]. 300 J.A. Hammerton and B.L. Kalman could be combined with effective support for holistic symbol processing then the SRAAM would be a strong candidate as a vehicle for holistic symbol processing. – The SRAAM is closer to standard connectionist models than the (S)RAAM [3] or HRRs [14], both of which appear to support holistic symbol processing at least as effectively as the RAAM if not more so. Neither (S)RAAM or HRRs employ error minimisation, nor do they utilise the networks of interconnected units favoured by connectionists. The SRAAM can be trained with standard techniques and is a simple variation on the SRN [12]. – Finally the Labelling RAAM [17, 18] is rather different in its operation compared to other methods due to the practice of turning it into a bi-directional associative memory. The production of reduced descriptions of symbol structures for use with other networks is thus not as natural with the LRAAM as with other methods. Furthermore there has already been extensive investigation of its properties and it is thus well understood. 2.2 Syntactic Transformations: A Benchmark Chalmers’ task involving active to passive transformations was chosen as this is a simple task which the RAAM performed well, and it provides a benchmark for other techniques. Where the sentences used by Chalmers were represented by ternary trees to be encoded by the RAAM, here they are encoded in the SRAAM as sequences. In the experiments reported here, 3 training regimes were used: – Back-propagation with sigmoidal units. Back-propagation is used to train SRAAMs employing sigmoidal units and the sum of squares error function (SSE). – Back-propagation with hyperbolic tangent units. Back-propagation is used to train SRAAMs employing units that use the hyperbolic tangent activation function. The Kalman-Kwasny error function (KKE) [10, 12] is employed with these networks. This leads to higher error values for the same data. Kalman and Kwasny claim improved performance with this form of training compared to standard back-propagation. – Kwasny and Kalman’s training regime with hyperbolic tangent units. Kwasny and Kalman’s work employs a variant of conjugate gradient training with the error derivatives recomputed to take into account the recurrence in the SRAAM and the fact that the target output patterns change over time. They claim superior performance of this training method over back-propagation. Additionally with each training regime networks with hidden layers of 12 or 39 hidden units were employed and the experiments were performed using both the representation of symbols used by Chalmers (Table 1) and referred to here as the “2 unit representation” and a set of orthogonal patterns (Table 2) referred to as the “1 unit representation”. Holistic Symbol Processing and the Sequential RAAM: An Evaluation 301 Table 1. The 2 unit representation of the symbols used by Chalmers. For SRAAMs employing hyperbolic tangent units, replace “0”s with “-1”s. 
john michael helen diane chris love hit betray kill hug is by nil 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Table 2. The 1 unit representation of the symbols used by Chalmers. For SRAAMs employing hyperbolic tangent units, replace “0”s with “-1”s. john michael helen diane chris love hit betray kill hug is by nil 2.3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Results Tables 3 and 4 summarise the results of training the SRAAM to encode and decode 130 of the sentences used by Chalmers, 65 active sentences and their passive equivalents. The remaining 120 sentences were used as a testing set to indicate generalisation. The results reported here for back-propagation are the best achieved for each network after a range of values for the momentum and learning rate were tried. The ranges used for both values was 0.01 to 0.9, and different combinations of values in these ranges were tried. Kwasny and Kalman’s simulator employs an adaptive step size so these parameters do not exist in their training. The “Method” column indicates which training regime was used. BP SSE is back-propagation with sigmoidal units and the sum of squares error. BP KKE is 302 J.A. Hammerton and B.L. Kalman Table 3. Various attempts to learn to encode and decode 130 sequences from Chalmers’ data set, using the 2 unit representation. Method BP SSE BP SSE BP KKE BP KKE KK KK Network Iterations 25-12-25 1350 52-39-52 1520 25-12-25 570 52-39-52 320 25-12-25 D:8310 F:80587 52-39-52 D:6950 F:71384 Error % train E/seq Train 98.87 0.8 1.80 71.28 42.3 0.96 492.80 3.8 2.65 243.71 47.7 0.59 1.23 46.9 0.53 0.04 85.3 0.15 % test E/seq Test 1.7 1.83 21.7 1.19 0.0 2.85 50.8 0.64 45 0.55 87.5 0.12 Table 4. Various attempts to learn to encode and decode 130 sequences from Chalmers’ data set, using the 1 unit representation. Method BP SSE BP SSE BP KKE BP KKE KK KK Network Iterations 25-12-25 2400 52-39-52 1820 25-12-25 120 52-39-52 270 25-12-25 D:7679 F:92470 52-39-52 D:7153 F:71545 Error % train E/seq Train 123.05 3.1 1.81 131.38 32.3 0.92 535.85 0.0 2.37 243.71 23.8 0.90 0.81 31.5 0.85 0.06 85.3 0.15 % test E/seq Test 0.0 2.09 17.5 1.22 0.0 2.40 15.8 1.16 20.8 0.88 50.0 0.50 back-propagation with hyperbolic tangent units and the Kalman-Kwasny error. KK is Kwasny and Kalman’s training method which employs hyperbolic tangent units. The “Network” column gives the network architecture for each SRAAM. The “Iterations” column indicates how many iterations were used in training. With Kwasny and Kalman’s method there are two types of iteration, one where derivatives are computed (D) and one where a line search is employed (F). The “Error” column shows the minimum error achieved during training. The “% train” column indicates the percentage of the training set correctly encoded and decoded at the end of training. The “E/seq Train” shows the number of errors per sequence produced by the encoding and decoding process for the training set. 
The “% test” column indicates the percentage of the testing set correctly encoded and decoded. The “E/seq Test” column shows the errors per sequence produced by encoding and decoding the testing set. None of the networks learned to encode and decode the entire training set without error, despite hidden layers of up to 39 units, 3 times larger than that used with Chalmers’ RAAM, being employed. The error levels in the testing set were generally higher than in the training set, although the error levels for the testing set were closer to those for the training set when Kwasny and Kalman’s training was used, suggestive of a greater power of generalisation with this method compared to back-propagation. Kwasny and Kalman’s method consistently outperforms back-propagation confirming their claims for superior training. The use of the hyperbolic tangent function improved the performance for back-propagation when using the 2 unit representation, but made it worse when Holistic Symbol Processing and the Sequential RAAM: An Evaluation 303 using the 1 unit representation. The convergence of back-propagation was faster with the hyperbolic tangent units than with sigmoidal units as Kwasny and Kalman suggested. Finally the 2 unit representation generally lead to better performance than the 1 unit representation, suggesting that the use of units to indicate where a symbol is a verb, noun, or “is” or “by” in the 2 unit representation may have helped training. To test this hypothesis SRAAMs with 39 hidden units employing a randomized 2 unit representation, where 2 units out of the 13 were selected at random and set to “1”, were trained. The errors were higher than in either of the above 2 cases, as would fit the hypothesis. Attempts to extend training beyond the points reported in these tables resulted in the error rising and then oscillating chaotically with back-propagation. With Kwasny and Kalman’s training restarting the training at the point where it stops with lower tolerances simply results in the training terminating immediately, suggesting it cannot improve things any further. Finally it was found that a transformation network could be trained to perform the active to passive transformations on the imperfectly learned encodings from the above experiment without inducing extra errors on top of those produced by the encoding and decoding process itself. This suggests that holistic symbol processing can be supported if the resources are available to achieve error free encoding and decoding of the sequences. 3 Analysing the Performance of the SRAAM 3.1 Determining the Cause of Failure The failure of the SRAAM to learn to encode and decode Chalmers’ sentences correctly was unexpected given the success other researchers have had. This failure may have occurred for any of several reasons. It may be that the solution does not exist. A proof it does exist in this case for SRAAMs of 20 or more hidden units is presented in Chapter 4 of Hammerton’s thesis [7]3 . It may be that the error landscape is rugged, making it difficult for back-propagation or conjugate gradient to find the global optimum. Another possibility is that the representations being developed interfere with each other, hindering training. The analysis summarised here was aimed at indicating whether the problem lay in features of the error landscape or in the way the SRAAM was organising its hidden-layer states. 
Thus analyses were performed on both the weights and the representations developed by the SRAAM in order to try and get a complete picture of what is going on. The analysis consisted of the following parts: – Determining the difficulty of finding the solution. This was done by determining how much noise needed to be added to solutions derived by hand from the proof that they exist to prevent subsequent training from finding them again and attempting to train the SRAAM using simulated annealing, a more global optimisation method. 3 This does not mean a solution does not exist for smaller networks. 304 J.A. Hammerton and B.L. Kalman – Determining the sensitivity of the network to noise in the weights. Every weight in the network has a percentage of its value either added or subtracted at random, unless it is zero in which case a small constant is added or subtracted instead. Then the errors produced in encoding and decoding are compared to the errors produced without the noise being added. – Determining the sensitivity of the network to noise in the hidden-layer patterns. The method used is an adaptation of the single-unit lesioning employed by Balogh [1] to determine the distributedness of the representations produced by Chalmers’ RAAM. Balogh’s method takes the encoding of a sentence, sets one of the units in the encoding to zero, and then decodes the sentence. The errors are then compared with the errors produced with the original encoding. This is adapted here so that instead of setting the value of the unit to zero, the unit is set to a percentage of the original value. By varying this percentage, one can determine both the sensitivity to noise and the nature of the representations developed by the SRAAM. – Determining how the SRAAM encodings cluster. Hierarchical Cluster Analysis (HCA) is performed on the hidden-layer encodings produced for each sentence. – Determining how the encoding and decoding trajectories are organised. The method employed here is to train a self-organising map [11] on the hiddenlayer states produced during encoding and decoding the sequences and then plot the trajectories taken when encoding and decoding the sequences on the map. – Determining the range of activations generated during encoding and decoding. This indicates the hypercube into which all the hidden-layer states are packed and thus the extent to which the available hyperspace is utilised. – Determining the range, average and average magnitude of the weights. If the magnitude of the weights is very high this may indicate a local optimum which is difficult to escape from due to large weight changes being required. – The HCA, analysis of the range, average and magnitude of the weights, the plotting of the encoding and decoding trajectories and the range of activations generated were all performed at three points during training to see if any trends would occur. 3.2 Results This section summarises the results of the analysis. More detail can be found in Chapters 4 and 5 of Hammerton’s thesis. Using Simulated Annealing to Train. Initially intended as a preliminary run, a 26-13-26 SRAAM was trained on 60 of the sequences using simulated annealing for 1.5 million iterations, taking 2 weeks to run on a 300 MHz Sun Ultra 2. The network had failed to find a solution, and an attempt to take the resulting network and train it further using back-propagation also failed to find a solution. 
As this was a smaller network and data set than used above, it was Holistic Symbol Processing and the Sequential RAAM: An Evaluation 305 felt that there was not enough time to repeat this with the full data set or larger networks given the length training involved. Thus it is only indicative of a solution that is difficult to find. Starting from Hand-Derived Solutions with Noise Added. With the hand-derived solutions, it was found that 10% noise was usually sufficient to prevent back-propagation finding them again and that 50% noise was sufficient in the case of Kwasny and Kalman’s method. This suggests that the handderived solutions are in a fairly small area of weight space with back-propagation and a somewhat larger, but apparently still inaccessible area with Kwasny and Kalman’s method. Adding Noise to Partial Solutions Found in Training. The SRAAMs trained by back-propagation showed serious degradation in performance when 10% noise was added and complete failure to encode and decode the sequences when 50% noise was added. With Kwasny and Kalman’s method, 1% noise was sufficient to cause serious degradation in the performance, with 10% resulting in complete failure. Thus it can be seen that the partial solutions developed by the SRAAMs were highly sensitive to noise in the weights, with Kwasny and Kalman’s method being more sensitive. This is suggestive of the networks being caught in a local optimum. Representational Lesioning. Figures 1 and 2 show the results of perturbing any single unit by 10% and 50% respectively, for the network trained by backpropagation using sigmoidal units employing the 1 unit representation and a 39 unit hidden layer. The errors are counted for each “slot”. For active sentences, slots 0 to 2 are the first, second and third words respectively. For passive sentences, slot 0 is the first word, slot 1 is the second and third word and slot 2 is the third and fourth word. This corresponds to the slots used in Balogh’s analysis of the RAAM representations [1] developed in Chalmers’ experiments, except Balogh numbers them 1 to 3 (where here it is 0 to 2), thus allowing comparison with the results he found. Slot 3 here corresponds to the length. As can be seen, perturbing any single unit in the encodings of the sequences with 10% noise led to a noticeable degradation in performance with upto 5% or so errors occurring in any slot and errors usually occur in multiple slots for any particular unit. Recall (Section 3.1) that these are errors produced in addition to those produced in the encoding and decoding process without perturbing the hidden layer patterns. 50% noise led to serious degradation, with upto 20% error occurring in any slot and errors usually occurring across all the slots regardless of which unit was lesioned. The pattern of induced errors was roughly similar regardless of which unit was perturbed, with more deeply embedded symbols being more likely to be corrupted than less deeply embedded symbols. This was suggestive of a distributed representation being developed that was sensitive to noise. This pattern of errors was repeated for the networks trained with the 2 unit representation and also for the networks trained with the hyperbolic tangent units. 306 J.A. Hammerton and B.L. Kalman 100 "lesion.seq130.9.slot.0" "lesion.seq130.9.slot.1" "lesion.seq130.9.slot.2" "lesion.seq130.9.slot.3" Percentage error 80 60 40 20 0 1 3 5 7 9 11 13 15 17 19 21 23 Unit lesioned 25 27 29 31 33 35 37 39 Fig. 1. 
The results of perturbing each unit in the encodings developed by the SRAAM trained using back-propagation, sigmoidal units, the 1 unit representation and 10% noise.

(Plot: percentage error in each slot against the unit lesioned, units 1-39.)

Fig. 2. The results of perturbing each unit in the encodings developed by the SRAAM trained using back-propagation, sigmoidal units, the 1 unit representation and 50% noise.

Figures 3 and 4 show the results of perturbing any single unit by 1% and 10% respectively in the networks trained by Kwasny and Kalman's training regime, a 20 unit hidden layer and the 1 unit representation. As can be seen, 1% perturbation of any single unit can lead to as much as 40% error across multiple slots, with 10% perturbations sometimes yielding 100% error across multiple slots. Again most of the time errors occur in multiple slots and more deeply embedded information is the worst affected; however, it is clear that the representations developed here are far more sensitive to error than those developed with back-propagation. This suggests that Kwasny and Kalman's method produced highly fragile representations, despite the better encoding and decoding performance it achieved. Finally, a point worth noting here is that the networks were developing distributed yet fragile representations, in contradiction to the common assumption that distributed representations are robust representations. These results show that the assumption need not necessarily hold.

Hierarchical Cluster Analysis. The results from the hierarchical cluster analysis showed that the networks clustered the sequences by two strategies: they clustered by the value of the last symbol to be encoded and by whether the sequence was an active or a passive sentence. With back-propagation networks the latter strategy would, however, dominate. With Kwasny and Kalman's networks the former strategy would dominate. There seemed to be no particular trend during training, suggesting that the strategies are settled on early in training and then refined.

(Plot: percentage error in each slot against the unit lesioned, units 1-19.)

Fig. 3. The results of perturbing each unit in the encodings developed by the SRAAM trained using back-propagation, sigmoidal units, the 1 unit representation and 1% noise.

(Plot: percentage error in each slot against the unit lesioned, units 1-19.)

Fig. 4. The results of perturbing each unit in the encodings developed by the SRAAM trained using back-propagation, sigmoidal units, the 1 unit representation and 10% noise.

Encoding and Decoding Trajectories. Figure 5 shows the trajectories produced for the SRAAM trained on 130 sequences with back-propagation, sigmoidal units and the 1 unit representation. The trajectories are plotted on a 25x25 Kohonen map. The numbers labelling each point indicate which position in a sequence that point corresponds to, i.e. 0 being the start, 1 the next symbol and so on.
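The trajectory analysis can be reproduced along the following lines (Python/NumPy; a hand-rolled self-organising map with illustrative grid size and annealing schedules rather than the settings used in the thesis, and stand-in random data in place of the real hidden-layer states).

```python
import numpy as np

def train_som(states, grid=(25, 25), iters=5000, lr0=0.5, sigma0=8.0, seed=0):
    """Minimal self-organising map over hidden-layer state vectors.

    `states` is an (N, H) array of hidden activations collected while encoding
    and decoding the sequences; the map is a (grid x H) weight array trained
    with the usual winner plus Gaussian-neighbourhood update."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    W = rng.uniform(-0.1, 0.1, size=(rows * cols, states.shape[1]))
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)
    for t in range(iters):
        x = states[rng.integers(len(states))]
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))          # best-matching unit
        frac = t / iters
        lr = lr0 * (1.0 - frac)
        sigma = sigma0 * (1.0 - frac) + 1.0
        d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        h = np.exp(-d2 / (2.0 * sigma ** 2))                 # neighbourhood kernel
        W += lr * h[:, None] * (x - W)
    return W, coords

def trajectory(W, coords, sequence_states):
    """Map each hidden state of one sequence to its winning grid position, so
    the i-th returned coordinate can be plotted with the label i (0 = start)."""
    return [tuple(coords[np.argmin(((W - s) ** 2).sum(axis=1))].astype(int))
            for s in sequence_states]

# Hypothetical usage with stand-in data: 130 sequences of 5 hidden states each
states = np.random.default_rng(1).standard_normal((130 * 5, 39))
W, coords = train_som(states)
print(trajectory(W, coords, states[:5]))   # grid positions labelled 0..4
```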
Figure 5 exemplifies the general finding for the back-propagation networks, which is that the hidden-layer patterns for the same position in different sequences cluster together; this provides an explanation for the sensitivity of the patterns to noise in the hidden layer: the final encodings are clustered together. With Kwasny and Kalman's method, the Kohonen maps would show just 2 points, suggesting that the hidden-layer states are clustered into 2 tightly packed groups. This gives an explanation for the greater sensitivity to noise in the hidden-layer patterns on the part of the network trained by Kwasny and Kalman's method. There was no identifiable trend in the organisation of the trajectories during training.

Range of Activations Produced. The range of activations produced during encoding and decoding was wider for back-propagation than for Kwasny and Kalman's method. There was no identifiable trend in this during training, however.

(Plots of the encoding trajectories and the decoding trajectories on the 25x25 map, points labelled by sequence position.)

Fig. 5.
This is backed up by the superior performance of Kwasny and Kalman’s method on finding the solution when noise has been added to a hand-derived network. To test the hypothesis, the range of activations was computed for the hand-derived solution with 10% noise after training with Kwasny and Kalman’s method. This run had found an error free solution again and the range of activations produced was far wider than normal training produced. Unfortunately time was not available for further investigation of this hypothesis such as a direct attempt to create a training method that would try and keep the hidden-layer patterns well separated. 5 Conclusion and Further Work The main conclusion to draw from this work is that the SRAAM in its current form is not an effective vehicle for holistic symbol processing, though it may be that confluent inference can improve its performance judging by the success others have had with it [5, 9]. Not only did the SRAAM fail to learn a task the RAAM has little trouble with, despite being given a large hidden layer, but it has been found that there are features of the SRAAM’s behaviour that may be problematic for training generally. The close clustering of hidden-layer patterns can lead to interference and sensitivity to noise, whilst the sensitivity to noise in the weights is suggestive of solutions and partial solutions that exist in small areas of weight space making them difficult to find. However if a way can be found of keeping the hidden-layer patterns separated the performance may improve accordingly. It may be for example that Robert French’s context-biasing technique [6], which aims to do exactly this in feedforward networks could be adapted for use with the SRAAM. There are further more general lessons however. The main one is that the techniques for producing connectionist representations of compositional structures need to be thoroughly analysed in order to see whether problems exist in the way they create their representations. Also it is important that such analyses look at both the weights produced during training and the encodings produced, rather than to look at one or other alone, since there may be indications of trouble in either area. Much of the work on the RAAM and its derivatives looks solely at the representations produced, though there are exceptions to this (such as the work on LRAAM). Further work is thus needed to develop a full understanding of these techniques, and should include formal analyses where appropriate, and should look at factors such as how the use of confluent inference affects the behaviour of the techniques, and how the nature of the task affects the performance and behaviour of the techniques. Acknowledgements The authors wish to thank Peter Hancox, the first author’s supervisor for his guidance in this work; Russell Beale, Ela Claridge and Riccardo Poli for their help, advice and discussions of this work; Stan Kwasny for answering questions Holistic Symbol Processing and the Sequential RAAM: An Evaluation 311 about his work and for reading a draft of this paper; Jean Hammerton for proofreading a draft of this paper, and John Barnden for his interest in and discussions of this work. The back-propagation simulations and the analyses presented here were all run with the PDP++ neural network simulator and the authors wish to thank the maintainers of PDP++ and the contributors to the PDP++ discussion list for help in the setting up and usage of PDP++. 
The training runs employing Kwasny and Kalman’s training schedule were performed using the simulator written by Barry Kalman. This work was supported by a research studentship from the School of Computer Science, The University of Birmingham, UK. References [1] I. L. Balogh. An analysis of a connectionist internal representation: Do RAAM networks produce truly distributed representations? PhD thesis, New Mexico State University, 1994. [2] D. S. Blank, L. A. Meeden, and J. B. Marshall. Exploring the symbolic/subsymbolic continuum: A case study of RAAM. In J. Dinsmore, editor, Symbolic and Connectionist Paradigms: Closing the Gap, pages 113–148. Lawrence Erlbaum Associates, Hillsdale, NJ, 1992. [3] R.E. Callan and D. Palmer-Brown. (S)RAAM: An analytical technique for fast and reliable derivation of connectionist symbol structure representations. Connection Science, 9(2):139–159, 1997. [4] D. J. Chalmers. Syntactic transformations on distributed representations. Connection Science, 2(1–2):53–62, 1990. Reprinted in [16], pages 46–55. [5] L. Chrisman. Learning recursive distributed representations for holistic computation. Connection Science, 3(4):345–366, 1991. [6] R. M. French. Dynamically constraining connectionist networks to produce distributed, orthogonal representations to reduce catastrophic interference. In Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, Atlanta, GA, August 1994, pages 335–340, Hillsdale, NJ, 1994. Lawrence Erlbaum Associates. [7] J. A. Hammerton. Exploiting Holistic Computation: An evaluation of the Sequential RAAM. PhD thesis, School of Computer Science, The University of Birmingham, UK, 1998. [8] J. A. Hammerton. Holistic computation: Reconstructing a muddled concept. Connection Science, 10(1):3–19, 1998. [9] E. K. S. Ho and L. W. Chan. Confluent preorder parser as a finite state automata. In Proceedings of International Conference on Artificial Neural Networks ICANN’96, Bochum, Germany, July 16–19 1996, pages 899–904, Berlin, 1996. Springer-Verlag. [10] B. L. Kalman and Kwasny S. C. High performance training of feedforward and simple recurrent networks. Neurocomputing, 14:63–83, 1997. [11] T. Kohonen. The self-organising map. Proceedings of the IEEE, 78(9):1464–1480, 1990. [12] S. C. Kwasny and B. L. Kalman. Tail-recursive distributed representations and simple recurrent networks. Connection Science, 7(1):61–80, 1995. [13] L. Niklasson and N. E. Sharkey. Systematicity and generalization in compositional connectionist representations. In G. Dorffner, editor, Neural Networks and a New Artificial Intelligence, pages 217–232. Thomson Computer Press, London, 1997. 312 J.A. Hammerton and B.L. Kalman [14] T. A. Plate. Distributed Representations and Nested Compositional Structure. PhD thesis, University of Toronto, 1994. [15] J. B. Pollack. Recursive distributed representations. Artificial Intelligence, 46(1– 2):77–105, 1990. [16] N. E. Sharkey, editor. Connectionist Natural Language Processing:Readings from Connection Science. Intellect, Oxford, 1992. [17] A. Sperduti. Labeling RAAM. Technical Report TR-93-029, International Computer Science Institute, Berkeley, California, 1993. [18] A. Sperduti. On Some Stability Properties of the LRAAM Model. Technical Report TR-93-031, International Computer Science Institute, Berkeley, California, 1993. [19] V. Weber. Connectionist unification with a distributed representation. 
In Proceedings of the International Joint Conference on Neural Networks – IJCNN ’92, Beijing, China, pages 555–560, Piscataway, NJ, 1992. IEEE. Life, Mind, and Robots The Ins and Outs of Embodied Cognition Noel Sharkey1 and Tom Ziemke2,1 1 University of Sheffield Dept. of Computer Science Sheffield S1 4DP, UK noel@dcs.shef.ac.uk 2 University of Skövde Dept. of Computer Science 54128 Skövde, Sweden tom@ida.his.se Abstract. Many believe that the major problem facing traditional artificial intelligence (and the functional theory of mind) is how to connect intelligence to the outside world. Some turned to robotic functionalism and a hybrid response, that attempts to rescue symbolic functionalism by grounding the symbol system with a connectionist hook to the world. Others turned to an alternative approach, embodied cognition, that emerged from an older tradition in biology, ethology, and behavioural modelling. Both approaches are contrasted here before a detailed exploration of embodiment is conducted. In particular we ask whether strong embodiment is possible for robotics, i.e. are robot “minds” similar to animal minds, or is the role of robotics to provide a tool for scientific exploration, a weak embodiment? We define two types of embodiment, Loebian and Uexküllian, that express two different views of the relation between body, mind and behaviour. It is argued that strong embodiment, either Loebian or Uexküllian, is not possible for present day robotics. However, weak embodiment is still a useful way forward. 1 Introduction Cognitive science and artificial intelligence (AI) have been dominated by computationalism and functionalist theories of mind since their inception. The development of computers and their quick increase in information processing power during the 1940s and 50s led many theorists to assert that the relation between brain/body and mind in humans was the same as or similar to the relation between hardware and software in computers. The functionalist theory of mind seemed to solve in an elegant fashion the dispute between dualists and materialists on the relation between mind and matter. Thus, cognitive science and AI emerged as new disciplines, both of them initially based on the computer metaphor for mind. S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 313–332, 2000. c Springer-Verlag Berlin Heidelberg 2000 314 N. Sharkey and T. Ziemke Serious doubts about the appropriateness of the computer metaphor were formulated in the arguments of Dreyfus [17] and Searle [37] around 1980, although for a long time they left most cognitive scientists and AI researchers cold. More attention was paid to the re-emergence of connectionism during the 1980s which, although to some degree concerned with neural hardware and thus taking a subsymbolic view on representation, did not question the computationalist framework in general. Ignoring the problems of purely computational theories of mind did not, however, make them disappear and theorists began to respond to the AI “mind in a vacuum” problem in one of two main approaches. Firstly, there was the approach of those, such as Harnad [22], who recast the problem as the symbol grounding problem and suggested it could be solved by using a hybrid combination of the respective strengths of symbolic and connectionist theories. This was an attempt to maintain a computational theory of cognition augmented with robotic capacities supposed to connect internal representations to the world they are supposed to represent. 
The relationship between functional theories of mind, AI, connectionism, and the hybrid response is examined in Section 2. We focus on the separation of functionalism from the world and why this is seen as a weakness by opponents of strong AI. This leads us to a discussion of theoretical attempts to connect the physical symbol system to the world using connectionist nets as a front end. The second approach, which is the main focus of this chapter, represents a more radical conception of cognition and AI arising out of biology and the study of animal behaviour. Led by a ‘new robotics’ that rejected traditional AI and representationalism altogether, it embraced the tenets of a situated and embodied cognition and challenged core assumptions of the funtionalist theory of mind. Two different types of embodiment are defined and discussed in Section 3 in relation to their biological underpinnings: Loebian or mechanistic embodiment and Uexküllian or phenomenal embodiment. Section 4 then moves on to critically examine two questions. Firstly, to what extent can the concept of embodiment be applied to artefacts such as robots? The arguments revolve around distinctions between living systems and machines that are made by humans, e.g. autopoiesis versus allopoiesis. Secondly, to what extent can meaning be attributed to the behaviour of a robot? It is argued that strong embodiment, either Loebian or Uexküllian, is not possible for present day robotics. However, weak embodiment is still a useful way forward. 2 Connectionism, Strong AI, and the Hybrid Response In this section, we discuss some of the criticisms faced by proponents of the functional theory of mind and computationalism as instantiated in strong AI. Searle [38] states that the thesis of strong AI is that, “the implemented program, by itself, is constitutive of having a mind. The implemented program, by itself, guarantees mental life”. The functional theory of mind is the view that mental states are physical states that are defined as functional states because of their causal relations. Thus, the functional role that a representation plays in a given Life, Mind, and Robots 315 computational system as a whole, imbues it with meaning (hence Function Role Semantics (FRS), [3]). Connectionism is an attempt to get closer to the physical basis of mind by viewing representations as brain states. However, as Smolensky (1988) [50] pointed out, the subsymbolic representations of connectionists are closer to the symbolic realm than to physical brain states. The main debates between the cognitivists and the connectionists were about the form that representation should take, and how they are structured in order to realise their functional roles [44]. The cognitivists stood for syntactically structured compositional representations [18]. For them it was the syntactic relations between the elements of the representational system as a whole that allowed for systematic inference. In contrast, the connectionists argued for non-concatenative subsymbolic representations that were spatially rather than syntactically structured [45]. Several connectionists argued this case by building connectionist models that demonstrated systematicity of subsymbolic representation in some form or other [12,14,31,32]. So far as we have stated it, the debate was one within the remit of the functional theory of mind. But the connectionists also had mathematical learning techniques that were based on abstract models of neural computation. 
Thus the representations and their spatial organisation could be learned “from the world” by extensional programming, self-organisation, reinforcement learning, or, more lately, genetic algorithms. Although most of the representation research was carried out in worlds that mainly consisted of symbols of one form or another (but see [40]), the theory was that connectionist networks were hardwired physical machines (like brains) with connections that change in response to the outside world [39]. Connectionism had opened questions about the physical basis of representation and how neural adaptation can create and change representations and their spatial organisation through interaction with the world. There were two ways in which the more philosophical connectionists could jump. One way was to go with eliminative materialists and see connectionism as the reduction of mental states to physical states, e.g. patterns of brain activation. It is here that the connectionists separate from ‘run of the mill’ cognitivists. It does not matter what machine the cognitivist runs the mind program on as long as it is capable of running it. For the connectionists, particular hardware was required, the brain. The other way to jump was to go for what Searle refers to as “causal emergence” [38], i.e. cognition emerges from a brain, [47]. There may be another type of strong AI lurking in these two positions. Stated in its most general form, if we were to physically implement a neural net system that had the causal powers of a brain, it would have mental states in the same way that we do. This does not sound much like strong AI. However, it could be argued that the properties of an artificial neural net are sufficient to provide a system with the causal power of a brain. Or, alternatively, the states of a neural net are sufficient to represent states of mind. There were also those who did not want to ‘go the whole hog’ with connectionism and, instead, decided to occupy the middle ground by proposing hybrid systems that attempted to integrate the best features of both connectionism 316 N. Sharkey and T. Ziemke and symbolic computation (see [44]). Some of these researchers simply wanted to bring connectionism into the mainstream by embedding artificial neural networks in larger systems. Others, with commitments to some of the tenets of cognitivism, used the connectionist learning to “hook” symbols onto the world [22] (see also [46,61]), in an attempt to rescue functionalism from what they saw as one of its major weaknesses, i.e. minds are semantic, AI programs are purely syntactic, therefore AI programs cannot be minds [37]. Lloyd (1989) points out that a problem with such an internalist conception is that: “...some kernel of the system must win representationality by other means.” [26]. He questions how a person might learn the meaning of a new word, say aham, without ostension, simply by using the context in which it occurs, just as is postulated to occur in a functional role semantics. This might be possible for a few terms, Lloyd argues, but there must come a critical point where there are just too many new terms to be able to allow an individual to learn them all without ostension (and to be able to provide ostensive definitions). Lloyd’s argument has similarities with John Searle’s famous 1980 Chinese Room argument [37]. 
In brief, Searle imagines himself placed in a room where questions written in Chinese are passed to him and he uses a sort of lookup table combined with transformation rules to write down the correct answer in Chinese and pass it out of the room. The point is that Searle does not understand Chinese; he has no idea of the meaning relation between the questions and the answers; it is just a matter of manipulating meaningless symbols. Searle’s point was that the operation of AI programs resembles what goes on in the Chinese room. Although the running programs manipulate symbols and provide outputs that are appropriate for their inputs, the programs or the machine do not understand what they are processing. There is no intrinsic meaning; the only meaning is for external observers. Therefore, AI programs are not intelligent in the strong sense. Harnad (1990), taking Searle’s criticisms on board, suggested a way to save strong AI. In what is termed a Hybrid response, the Chinese Room argument was recast as an argument against symbolic functionalism in which mind is held to be intrinsically, and solely, symbolic. Harnad proposed a robotic functionalism that extends symbolic functionalism to include some components that are intrinsically robotic. Robotic capacity, in this context, is a label for a range of sensorimotor interactions with the external environment, of which Harnad cites the discrimination and identification of objects, events, and states of affairs in the world as principal [22]. Harnad’s view appears to be that sensory transduction is sufficient to foil the thrust of the Chinese Room argument. Harnad is quite explicit about the nature of the sensory transduction, or robotic capacity, that serves to ground a symbolic capacity: Specifically he names the ability to discriminate and identify nonsymbolic representations. Discrimination involves the machine making similarity judgements about pairs of inputs, and constructing iconic representations. These icons are nonsymbolic transforms of distal stimuli. However, Harnad argues that merely being able to discriminate inputs does not make for a robotic capacity. Life, Mind, and Robots 317 In addition, the machine must also be able to identify the iconic representations. That is, it must be able to extract those invariant features of the sensory icons that will serve as a basis for reliably distinguishing one icon from another and for forming categorical representations. But Searle is not just arguing that AI programs need a way to point at and categorise objects in the world. While Harnad sees the need to link the physical symbol system to the world to give the symbols a referential meaning in the same way as Lloyd op cit. has argued, he is still committed to functionalism. Searle, on the other hand, is arguing for much more; a machine cannot have intrinsic semantics because it is not intentional and it has no consciousness. To get the idea of what he means by consciousness, as opposed to the functionalist’s representational states, Searle [38] asks us to pinch our skin sharply on the left forearm. He then lists a number of different things that will happen as a result and includes a list of neural and brain events. However, the most important event for Searle happens a few hundred milliseconds after you pinched your skin: “A second sort of thing happened, one that you know about without professional assistance. You felt pain. ... 
This unpleasant sensation had a certain particular sort of subjective feel to it, a feel which is accessible to you in a way that is not accessible to others around you.” What he is getting at here are is that the feeling of pain is one of the many qualia. But Searle is not a dualist, he asks, “How is it possible for physical, quantitatively describable neuron firings to cause qualitative, private subjective experiences?”. Searle’s view is biological. He holds that the phenomenal mind is caused by a real living brain. 3 Embodied Cognition Behaviour-based robotics represents an alternative approach to AI that has been gaining ground over the last decade (for overviews see [11,1,60,41]) . Although the basis for the approach has been around in biology for nearly a century and in robotics for over fifty years, it entered mainstream AI in the 1980s through the work and ideas of Rodney Brooks [6,7,8,9,10,11]. His approach to the study of intelligence was through the construction of physical robots embedded in, and interacting with, their environment by means of a number of behavioural modules working in parallel. The modules receive input from the sensors on the robot and the behaviour is controlled by the appropriate module for the situation. But there is no central controller. The modules are connected to each other in a way that allows certain modules to subsume the activity of others, hence this type of architecture is referred to as subsumption architecture. This approach differs from both symbolic and robotic functionalism. Unlike classical AI, the modules do not make use of explicit internal representations in the sense of representational entities that correspond to entities in the ‘real’ world. Here, there is no need to talk about grounding symbols or representations in the world1 . There are no internal representations to ground and there are 1 Brooks is not entirely consistent with this point as he does write about about physical grounding of representations in [7,10]. 318 N. Sharkey and T. Ziemke no world models or planners; the world is its own world model for the robot. Intelligence is found in the interaction of the robot with its environment. In terms of theory of mind, the literature appears to suggest a re-emergence of behaviourism in that there are no mental representations, just behaviour and predispositions to behaviour; sensing and acting in the world are necessary and sufficient for intelligence. Such a view takes form from considering a less anthropocentric universe than cognitivism. Much recent behaviour-based robotics takes inspiration from simple intelligent life forms such as arthropods, e.g. [2,57,25]. This provides a way to approach the fundamental building blocks of intelligence through a consideration of its biological basis. But there is an unexpected turn in the theoretical picture that we have been painting about the new AI. Brooks (1991) proposed that the two cornerstones of the new approach to Artificial Intelligence are situatedness and embodiment [8]. Situatedness means that the physical world directly influences the behaviour of the robot - it is embedded in the world and deals with the world as it is, rather than as an abstract description. Embodiment commits the research to robots rather than to computer programs, as the object of study. This is consistent with the anti-mentalist stance of the new robotics. 
However, the way in which embodiment is discussed has at least two radically different interpretations within the theory of mind, and behaviour-based roboticists appear to move between them with impunity. In the next two subsections, we examine these contrasting positions in relation to the work of the biologists, Charles Sherrington (1857-1952), Jacques Loeb (1859-1924) and Jakob von Uexküll (1864-1944), who, in different ways, developed theories of a biological basis for behaviour. 3.1 Loebian Embodiment The first type of embodiment follows from the mechanistic or behaviourist line that Brooks appeared to be following. In this view one would expect embodiment to mean that cognition is embodied in the mechanism itself. That is, cognition is embodied in the control architecture of a sensing and acting machine. There is nothing else. There is no cognitive apparatus separate from the mechanism itself. There is no need for symbol grounding in that there are no symbols to ground. It is the behaviour or the dispositions to behaviour that are grounded in the interaction between agent and environment. This is similar to notions from physicalism in which the physical states of a machine are considered to be its mental states, i.e. there is no subjectivity. Movement is “forced” by the environment. This form of robot control relates mostly to the work of Sherrington on reflexes [49] and Loeb on tropisms [27]. Sherrington [49] focused on the nervous system and how it constitutes the behaviour of the multicellular animal as a “social unit in the natural economy”. He proposed that it was nervous reaction that produces an animal individual from a mere collection of organs. The elementary unit of integration and behaviour was the simple reflex consisting of three separable structures: an effector organ, a conducting nervous pathway leading to that organ, and a receptor to initiate the reaction. This is the reflex arc and it is this simple reflex which exhibits the first grade of coordination. However, Sherrington admitted that the simple reflex is most likely to be a purely abstract conception. Since, in his view, all parts of the nervous system are connected together, no part may react without affecting and being affected by other parts. Nonetheless, the idea of chains of reflexes to form coherent behaviour became part of mainstream psychological explanation beginning with the work of the Russian physiologist Ivan Pavlov, who extended the study of reflex integration into the realm of animal learning, classical conditioning [33]. Loeb [27] had no patience with physiology and argued against the possibility of expressing the conduct of the whole animal as the “algebraic sum of the reflexes of its isolated segments”. His concern was with how the whole animal reacted in response to its environment. Like the later behaviourists, Loeb was interested in how the environment “forced” or determined the movement of the organism. He derived his theory of tropisms (directed movement towards or away from stimuli) from the earlier scientific study of plant life, e.g. the directed movement through geotropism [24] and phototropism [16]. Thus for Loeb, animals were Cartesian puppets whose behaviour was determined by the environmental puppeteer (see also [43,59]). Loeb would certainly have been very interested in today’s biologically inspired robotics and was quick to see the implications of the artificial heliotropic machine built by J.J. Hammond.
Loeb claimed that construction of the heliotropic machine, which was based on his theories, supported his mechanistic conception of the volitional and instinctive actions of animals [27]. Loeb used tropism2 rather than taxis to stress what he saw as the fundamental identity of the curvature movements of plants and the locomotion of animals in terms of forced movement enabled by body symmetry. Although his specific theory about animal symmetry eventually fell under the weight of counter experimental evidence, major parts of his general theory of animal behaviour were taken up by later biologists and psychologists using the term taxis for directed animal movement. Fraenkel and Gunn [19], sympathising with Loeb’s stance on the objective study of animal behaviour, heralded behaviour-based robotics by proposing that the behaviour of many organisms can be explained as a combination of taxes working together and in opposition. However, it was the ideas of both reflex and taxis that first inspired what is now called artificial life research. Grey Walter (1950, 1953) built two “electronic tortoises” in one of the earliest examples of a physical implementation of intelligent behaviour. Each of these electromechanical robots was equipped with two ‘sense reflexes’; a very small artificial nervous system built from a minimum of miniature valves, relays, condensers, batteries and small electric motors, and these reflexes were operated from two ‘receptors’: one photoelectric cell, giving the tortoises sensitivity to light, and an electrical contact which served as a touch receptor. The artificial tortoises were attracted towards light of moderate intensity, repelled by obstacles, bright light and steep gradients, and never stood 2 Nowadays the term reflex is reserved for movements that are not directed towards the source of stimulation whereas taxis and tropism are used to denote movements with respect to the source of stimulation. 320 N. Sharkey and T. Ziemke still except when re-charging their batteries. They were attracted to the bright light of their hutch only when their batteries needed re-charging. Grey Walter made many claims from his observations of these robots which included saying that the ‘tortoises’, exhibited hunger, sought out goals, exhibited self-recognition and mutual recognition. He also carried out the first artificial research on classical conditioning with his CORA system [21]. Grey Walter’s work combined and tested ideas from a mixture of Loeb’s tropisms and Sherrington’s reflexes. Although Loeb is not explicitly mentioned in the book, the influence is clear, not least from the terms positive and negative tropisms instead of taxes. These same ideas turn up again in recent robotics research and form the basis of much of modern robotics and Alife work (see also [48]). There is also considerable activity aimed at specific biological modelling that continues the line of mechanistic modelling of animal behaviour started by Hammond’s heliotrope machine. Sherrington believed that there was much more to mind than the execution of reflexes as we shall see later. Loeb is more representative of the antimentalistic mechanistic view of intelligence and thus we dub this view Loebian embodiment. 3.2 Uexküllian Embodiment Brooks [8] also espouses quite a different type of embodiment saying that, “The robots have bodies and experience the world directly - their actions are part of a dynamic with the world and have immediate feedback on their own sensations”. 
He was anxious to take on board von Uexküll’s concept of Merkwelt or perceptual world according to which each animal species, with its own distinctly non-human sensor suites and body, has it own phenomenal world of perception [5,9]. This notion of embodied cognition has its roots in von Uexküll’s idea of bringing together an organism’s perceptual and motor worlds in its Umwelt (subjective or phenomenal world) and hence we call it Uexküllian embodiment. Von Uexküll tried to capture the seemingly tailor-made fit or solidarity between the organism’s body and its environment in his formulation of a theoretical biology [53], and his Umwelt theory [52,54,56]. Von Uexküll criticized the mechanistic doctrine “that all living beings are mere machines” for the reason that it overlooked the organism’s subjective nature, which integrates the organism’s components into a purposeful whole. Thus his view is to a large degree compatible with Sherrington and Loeb’s ideas of the organism as an integrated unit of components interacting in solidarity among themselves and with the environment. However, he differed from them in suggesting a rudimentary non-anthropomorphic psychology in which subjectivity acts as an integrative mechanism for coherent action: We no longer regard animals as mere machines, but as subjects whose essential activity consists of perceiving and acting ... for all that a subject perceives becomes his perceptual world and all that he does, his effector world. Perceptual and effector worlds together form a closed unit, the Umwelt. [54] Life, Mind, and Robots 321 Although he strongly contradicted purely mechanistic/materialistic conceptions of life, and in particular the work of Loeb (e.g. [55]), von Uexküll was not a dualist. He did not deny the material nature of the organism’s components. For example, he discussed how a tick’s reflexes are “elicited by objectively demonstrable physical or chemical stimuli” [54]. However, for von Uexküll, the organism’s components are forged into a coherent unit that acts as a behavioural entity. It is a subject that, through functional embedding, forms a “systematic whole” with its Umwelt. Similar ideas have begun to emerge in robotics as part of a new enactive cognitive science approach, [51], and there is broad support in the field of embodied cognition where there is a reassessment of the relevance of life and biological embodiment for the study of cognition and intelligent behaviour, e.g. [15,13,36,58,34,61,62]. Two of the principals of the new approach, Maturana and Varela [29,30], have proposed that cognition is first and foremost a biological phenomenon. For them, “all living systems are cognitive systems, and living as a process is a process of cognition” [29]. In this framework, cognition is viewed as embodied action by which Varela et al. [51] mean “...first, that cognition depends upon the kinds of experience that come from having a body with various sensorimotor capacities, and second, that these individual sensorimotor capacities are themselves embedded in a more encompassing biological, psychological, and cultural context”. Thus, this view, like that of von Uexküll, emphasises the organism’s embedding in not only its physical environment, but also the context of its own phenomenal world (Umwelt), and the tight coupling between the two. In the words of Varela et al., “cognition in its most encompassing sense consists in the enactment or bringing forth of a world by a viable history of structural coupling” [51]. 
This is very different from Loebian embodiment where the mechanisms underlying behaviour are themselves controlled by the environment and where the organism is a mere Cartesian puppet (cf. also [43,59]). What we have called Uexküllian embodiment is the notion of a phenomenal cognition as an intrinsic part of a living body; as a process of that body. 4 Weak but Not Strong Embodiment In this section we take a closer look at the two types of embodiment proposed in Section 3. First, Uexküllian embodiment is revisited to ask in what sense a robot can be a subjective embodiment. Then Loebian embodiment is discussed in terms of mechanistic minds and observer errors. For both types of embodiment we inquire as to whether they exist in the strong sense of equating “robot minds” with animal minds or whether they exist only in the weak sense of scientific or engineering models for investigating or exploiting knowledge about living systems. 4.1 Uexküllian Embodiment Revisited A critical question facing Loeb, Sherrington, and von Uexküll concerned how a large number of living cells could act together in an integrated manner to produce a unitary behaving organism; in other words, to create a unitary living body. Loeb’s approach was entirely behaviouristic and anti-mentalistic while von Uexküll proposed a subjective world as the method of integration. Sherrington was slightly different. He worked on the nervous mechanisms of the reflexes as an integrative mechanism. But he also recognised the limitations of the nonsubjective. Writing about a decerebrate dog that was used in his research, Sherrington states that “... it contains no social reactions. It evidences hunger by restlessness and brisker knee jerks; but it fails to recognize food as food: it shows no memory, it cannot be trained to learn... The mindless body reacts with the fatality of a multiple penny-in-the-slot machine, physical, and not psychical.” [49] There can be no Uexküllian embodiment on existing robots. Uexküllian embodiment requires a living body. It is the embodiment of cognition or Umwelt as living processes. A robot does not have a body in this sense. It is a collection of inanimate mechanisms and non-moving parts that form a loosely integrated physical entity; a robot body is more like a car body and not at all like a living body. According to von Uexküll there are principal differences between the construction of a mechanism and a living organism, see e.g. [55,62]. He uses the example of a pocket watch to illustrate how machines are centripetally constructed: the individual parts of the watch, such as its hands, springs, wheels, and cogs must be produced first, so that they may be added to a common centerpiece. In contrast, the construction of an animal starts centrifugally; animal organs grow outwards from single cells. Von Uexküll was clear about the machine vision of animal life: “Now we might assume that an animal is nothing but a collection of perceptual and effector tools [like microphones and motor cars], connected by an integrating apparatus which though still a mechanism, is yet fit to carry on the life functions. This is indeed the position of all mechanistic theories, whether their analogies are in terms of rigid mechanics or more plastic dynamics. They brand animals as mere objects. The proponents of such theories forget that, from the first, they have overlooked the most important thing, the subject which uses the tools, perceives and functions with their aid.”
[54] More recently, Maturana and Varela [29] have also made a distinction between the organisation of living systems and machines made by humans. They were not interested in listing the static properties of living systems or the properties of the components of such systems. Rather they were concerned with the problem of how a living system, such as a single cell, can create and maintain its own identity despite a continual flux of perturbations and the continual changing of its components through destruction and transformation. As Boden [4], points out, this is universally accepted as one of the core problems of biology and for Maturana and Varela it is the problem. Life, Mind, and Robots 323 Maturana and Varela ibid attacked the problem by concentrating on the organisation of matter in living systems. In his essay, Biology of Cognition in [29], Maturana proposed that the organisation of a cell is circular because the components that specify it are also the very components whose production and maintenance the organisation secures. It is this circularity that must be maintained for the system to retain its identity as a living system. Maturana and Varela [29] described the maintenance of the circularity with the new term autopoiesis, meaning self- (auto) -creating, -making, or -producing (poiesis). An autopoietic machine, such as a living system, is a special type of homeostatic machine for which the fundamental variable to be maintained constant is its own organisation. This is unlike regular homeostatic machines which typically maintain single variables, such as temperature or pressure. The structure of an autopoietic system is the concrete realisation of the actual components (all of their properties) and the actual relations between them. Its organisation is constituted by the relations between the components that define it as a unity of a particular kind. These relations are a network of processes of production that, through transformation and destruction, produce the components themselves. It is the interactions and transformations of the components that continuously regenerate and realize the network of processes that produced them. Although these definitions apply to many systems, such as social systems, that may also be autopoietic, living systems are physical in a particular way. Moreover, autopoiesis is held to be necessary and sufficient to characterise the organisation of living systems. Living systems are not the same as machines made by humans as some of the mechanistic theories would suggest. In Maturana and Varela’s formulation, machines made by humans, including cars and robots, are allopoietic. Unlike an autopoietic machine, the organisation of an allopoietic machine is given in terms of a concatenation of processes. These processes are not the processes of production of the components that specify the machine as a unity. Instead, its components are produced by other processes that are independent of the organisation of the machine. Thus the changes that an allopoietic machine goes through without losing its defining organisation are necessarily subordinated to the production of something different from itself. In other words, it is not truly autonomous. In contrast, a living system is an autopoietic machine whose function it is to create and maintain the unity that distinguishes it from the medium in which it exists. It is truly autonomous. To be a living autonomous entity requires a unity that distinguishes an organism from its environment. 
At the core of autopoiesis lies the autonomy of the individual cells, which Maturana and Varela refer to as “first-order autopoietic units”, similar to what von Uexküll meant by the term Zellautonome for autonomous cellular unities [53]. The individual cells’ solidarity which constitutes the organism as an integrated behavioural entity and “second-order autopoietic unit” is due to the fact that “the structural changes that each cell undergoes in its history of interactions with other cells are complementary to each other, within the constraints of their participation in the metacellular unity they comprise” [30]. 324 N. Sharkey and T. Ziemke The chemical, mechanical, and integrating mechanisms of living things are missing from robots. Consequently, there can be no notion of multicellular solidarity or even a notion of a cell in a current robot. Although some may argue that the messaging between sensors, controllers and actuators, is a primitive type of integration, this is very different from the dependency relationship between living cells in real neural networks. Artificial neural nets can be used as a ‘stand-in’ integrative mechanism between sensors and actuators. However, they are not themselves integrated into the body of the robot; most of the body is a container for the controller, a stand to hang the sensors on, and a box for the motors and wheels. There is no interconnectivity or cellular communication. In multicellular creatures, solidarity of different types of cells in the body is required for survival. Cells need oxygen and so living bodies need to breathe, they need nutrition and so bodies need to behave in a way that enables ingestion of appropriate nutrients. Furthermore, like von Uexküll [53], Maturana and Varela point out that living systems, cannot be properly analyzed at the level of physics alone, but require a biological phenomenology: ... autopoietic unities specify biological phenomenology as the phenomenology proper to those unities with features distinct from physical phenomenology. This is so, not because autopoietic unities go against any aspect of physical phenomenology - since their molecular components must fulfill all physical laws - but because the phenomena they generate in functioning as autopoietic unities depend on their organization and the way this organization comes about, and not on the physical nature of their components (which only determine their space of existence). [30] Boden [4] has also recently argued that current metal robots cannot be living systems, or, in her words, “strong artificial life”. Although she goes along with much of Maturana and Varela’s characterisation of the organisation of the living, she stops short of accepting that all living processes are cognitive processes. Instead, Boden found it sufficient to argue from the weaker, but more straighforward, position that biochemical metabolism is necessary for life. Robots, like Grey Walter’s turtles, that can recharge their batteries do not have metabolic processing that involves closely interlocking biochemical processes. They only receive packets of energy that do not produce and maintain the robot’s body. It may be possible to produce an artificial life form, writes Boden, but it will probably have to be a biochemical one. Even unicellular organisms with no nervous system exhibit more bodily integration and intimacy with their environment than current robots. In no sense does a ‘situated’ and ‘embodied’ robot actually experience the world. 
Its ‘experience’ is no different than that of an electronic tape measure. An electronic tape measure uses sonar to take measurements of room dimensions and ceiling heights in houses. You can point it at a wall and get a readout of say 54.4 cm. This is similar to the sonar sensing used for robots. The output from the robot sensors is a voltage proportional to distance. We could mount two of these tape Life, Mind, and Robots 325 measures on the front of a three wheeled platform with motors for the two rear wheels. The motors could be controlled directly by the tape measure outputs by wiring the left sensor to the right motor and vice versa. After a little tinkering, this device should make a reasonable job of avoiding boxes and walls. It is thus a situated robot in the sense described above. However, it does not have its “own sensations” nor does it have “a body” in the sense discussed above that could enable it to ”experience the world directly”. Thus, to repeat ourselves, strong Uexküllian embodiment is not possible on current robots. Weak Uexküllian embodiment is, of course possible, in the sense of simulating embodied cognition with a physical robot. This would mean writing programs to capture aspects of cognition, but in a different way and with a different notion of cognition than used in disembodied AI. It would be a simulation of enactive cognition that could provide a ‘wedge in the door’ for biological and psychological research. In this sense it can be useful to view an autopoetic machine as an allopoietic machine. Although this has scientific value, according to Maturana and Varela, it will not reveal the autopoietic organisation. 4.2 Loebian Embodiment and the Clever Hans Error Weak Loebian embodiment is not in question here. Robots have already been applied usefully as physical tests of the plausibility of mechanistic hypotheses about particular animal behaviour, e.g. [57,25]. Weak embodiment also incorporates the use of the robot as a thought tool for engineering or modelling applications [6,42]. However, why should strong embodiment be in question? After all, we have already said that the mechanistic or Loebian approach treats animals as mere Cartesian puppets. But they are puppets that have been co-evolved with their environments and their own niche that encapsulates the meaning of their existence. The intimate relationship between the body of a living organism and its environment as described in the previous section is missing, and this is important for strong embodiment regardless of one’s views of mentality. Strong embodiment implies that the robot is integrated and connected to the world in the same way as an animal. However, the identity of an allopoietic machine, like a robot, is not determined by its operation. Rather, because the products of the machine are independent of its organisation, its identity depends entirely on the observer. The meaning of the robot’s “actions” is also in the observer’s world and not in the “robot world”. The robot’s behaviour has meaning only for the observer. To think otherwise is to commit what is known in biology as the Clever Hans error. We shall expand on the nature and implications of such errors of attribution for the notion of strong Loebian embodiment in the remainder of this section. Clever Hans was a horse whose performance startled citizens and scientists in the early part of the 20th century with its ability to perform simple arithmetic operations. A sum was written on a poster and displayed so that the horse could see it. 
Then the horse, to the amazement of the audience, would tap out the numerical solution with a hoof. There were even “scientific” theories about how it was using mental arithmetic. Hans even passed a test set up by Stumpf, the 326 N. Sharkey and T. Ziemke director of the Berlin Institute of Psychology, with a panel including a zoologist, a vet, and a circus trainer. However, eventually the horse was submitted to an objective psychological assessment [35] and all was revealed. The horse always got the answer correct whenever there was an observer who knew the correct answer. With an observer who did not know the answer, Hans tapped an arbitrary number of times. It turned out that people who knew the answer to the sum were giving Hans a subtle cue at the right moment and he stopped hoofing. Such was Hans’ ability at cue detection that it worked even if the observer knew what Hans was up to. Hans did not know or care much about human arithmetic; it was not part of a horse’s natural world. The exact number of hoof taps bore no relevance to Hans. If he responded appropriately to arbitrary start and stop cues, he was rewarded. Hans’ cleverness was in being able to spot the stop cues without, apparently, having been trained. His owner, an Austrian aristocrat named von Osten, did not realise that he was being ‘tricked’. Otherwise, the behaviour is not that much different to pigeons pecking in a Skinner box. It is as simple as that. However, in the case of Clever Hans, it just so happens that, for human observers, the start sign, the sum, etc. have a ruleful (arithmetic) relationship with the behaviour of the horse. Thus the observers attributed causal significance to Hans’ behaviour. Hans might have impressed them even more if they had given him questions such as, how many days are there in a week? Or, how many miles is it from Sheffield to Rotheram? Or, how many times has Brazil won the world cup? And, of course, he would also need a clever audience. The point is that the meaning that the observer received from Clever Hans did not originate from the horse. Rather, it was merely a distorted reflection of the observer’s own meaning. The meaning for Hans is less clear but it probably had nothing to do with arithmetic. At first blush, the tale of Clever Hans seems to have similarities with Searle’s Chinese room argument. Like Clever Hans, the person in the room was operating within the meaning domain of the observer. It was the observer who brought the meaning of the questioning and answering to the configuration. However, there is a major difference; Hans was simply responding to human cues to meet his own agenda. He was a living system as well. The observer was, in a sense, both in the ‘room’ of the system (as a cue provider) and outside of it at the same time (as the observer). However, to invoke the systems argument, the most popular rebuttal of the Chinese Room argument, for Clever Hans would be inappropriate. The argument would run that although Hans did not know the meaning of the human arithmetic task, the system as a whole, consisting of the observer, the problem, Hans and his hoof, understood the human arithmetic problem. But this would be an oddly redundant system since one of its components, “the observer”, can understand the whole problem so why have more components that understand nothing about it? The situation had an entirely different meaning for Hans that the observers did not understand. 
It would be a folly, however, to say that the ‘system as a whole’ understood the meaning of Hans’ behaviour in terms of the horse world, for the same reasons as with the human observer. Life, Mind, and Robots 327 Bringing the Clever Hans error to bear on robotics research, suppose that we have a light sensing robot, AutoHans, that has to perform the same task as its living counterpart. A screen with the sum on it is lighted and the robot begins moving back and forth (sadly there are no hoofs). When AutoHans has moved the correct number of times, the screen is darkened and AutoHans stops. These events are meaningful only to the observer and not to the machine. To think otherwise would be a Clever AutoHans error. Going a step further, suppose that we now “rewire” the light sensing robot so that it can follow a light gradient. Now we can stand above it, call the sum out loud, and wave a torch around in circles until the correct number of circles has been achieved by the robot. This is such obvious trickery that it would not be worth mentioning except to highlight that it is no different than putting well placed lights in the environment to manipulate the behaviour of vehicles. For example, Grey Walter’s 1950s tortoise robots would be repelled by the bright light in their hutch (battery charger) and moved only towards moderate light in the room. When their batteries were low, they moved only towards the bright light in the hutch. This was a predesigned world and a robot with a carefully crafted controller. If this argument is sound, then to attribute “hunger”, “attraction”, or other anthropomorphic labels to the behaviour of these devices is again making the Clever AutoHans error. They can only exhibit weak Loebian embodiment. Such systems, and this applies to all behavior-based robots, are designed by humans such that their movement in interaction with “cues” in the environment, e.g. lights of a particular intensity, looks, to human observers, like the behaviour of an organism. Even if the system adapts by neural and/or evolutionary methods, the goals or purpose of the system, by necessity, are designed by the researcher as the part that will make the work comprehensible to other human observers. These goals implicitly direct the themes of research papers and are what give the devices their credibility as autonomous agents. Essentially they search through “cue” space to find appropriate cues that lead to the satisfaction of the observer’s/experimenter’s goals. These goals are not the robot’s but the observer’s goals instantiated. Thus the robot’s interactions with the world carry meaning only for the observer. Take for example, the case of Webb’s cricket studies [57,28] in which a wheeled robot uses an auditory system capable of selectively localising the sound of a male cricket stridulating (rubbing its wings together rapidly to produce a sound that attracts potential mates). Comparisons of the intensity of auditory information from two microphones was used to directly control the motors on a mobile robot and drive it towards the sound source. Let us be clear about what this robot system tells us. It is a physical demonstration that mate selection can be mechanised through direct auditory-motor connections. No intermediate recognition or decision processes are necessary. This supports a biological model of mate selection as a taxis. Given the correspondence with some of the data on mate preferences in the male cricket, this is a useful piece of biological modelling. 
Sound localisation and its use in control are also useful from an engineering perspective. However, care must be exercised not to attribute strong Loebian embodiment to the physical instantiation of this robotic system. Webb is very careful with her claims. It is inappropriate, except perhaps in a popular science magazine headline, to call this a “robotic cricket”. Yes, it can respond to certain sound frequencies and combinations of frequencies in a way that has some resemblances to the cricket; but only at a distal level of description and only for one taxis. This is at best a simulated partial embodiment. Clearly at the proximal level of description there is little similarity between the real cricket and the robot behaviours. Obviously the instant-to-instant behavioural output of a rigid platform on wheels will be very different from a behavioural output of a legged insect body [23]. Both will behave quite differently towards a blade of grass on their trek to the male. The robot is actually a single-function device: a tracker to find the most “attractive” male crickets inside a given sensory perimeter within a constrained physical environment. Out in the wild, what chance would the robot have to locate the sound of a male cricket? And if it did, what are the chances that it would reach the male? (Beware of mud patches, swampland, and rocks.) Moving to the end game of mate selection, what if the robot did reach the male? What would the robot do when the cricket stopped stridulating? The robot would be a less than amorous companion. The cricket would certainly not recognise it as a mate even if he did notice it. Or, for that matter, what would the robot do if the cricket stridulated a few inches above it or even sat on top of it stridulating? 5 Conclusions We began by examining one of the dominant criticisms that has dogged AI since the 1980s; namely that the symbolic realm of the AI program is connected to the world only through the knowledge of its human designers. Nor was connectionism free from this criticism, although connectionist systems can be more readily connected to the outside world by “learning” a mapping between sensors and actuators. Indeed, many who agreed with the criticism but were still committed to the functional theory of mind developed a hybrid response to the problem by using connectionist nets to ‘hook’ the symbolic or representational domain onto the world. At the same time, through the 1980s and 1990s, a new movement emerged in AI whose focus was primarily on robotics, i.e. physical machines which are, unlike computer programs, capable of interacting with their environment. Terms like “embodiment”, “physical grounding” and “cognitive robotics” have become central in recent theories of mind, although they are used by cognitive theorists in at least two different ways. Firstly, as discussed in Section 2, there are those who maintain the functionalist view of a computational mind and thus see the robot body as a means of “hooking” internal representations to the world. Secondly, as discussed in Section 3, there are those who reject the traditional notion of representation, and believe that cognition will emerge from the sensorimotor interaction of robot and environment. We identified two significantly different notions of embodiment, Loebian (or mechanistic) and Uexküllian (or phenomenal), and discussed them in terms of their relationship to theories of body, mind and behaviour.
In defining Loebian embodiment, we showed how the mechanistic notions of an early behaviourism, which pictures organisms as mechanisms more or less ‘puppeteered’ by their environment, have found their way into robotic studies of cognition and behaviour. This was contrasted with Uexküllian embodiment which encapsulates the view that living cognition relies on a phenomenal subjective interaction with the world. Cognition is ‘embodied’ in each organism’s body and, in the extreme, all biological processes, including those of the single cell, are cognitive (e.g. Maturana and Varela [29,30]). However, because the new AI is still finding its (theoretical) feet, many of its practitioners move freely between these two notions of embodiment despite their differences. It has been argued here that strong embodiment of either type is, in principle, not possible for current robots. Despite their apparent ‘autonomy’ during operation, robots remain allopoietic machines, which ultimately derive meaning only from their designers and observers. However, weak embodiment of both types is possible: There have been a number of successful robotic studies of mechanistic theories of animal behaviour. Similarly, robots can certainly be used to study and simulate how artificial agents can enact or bring forth, by means of adaptive techniques, their own ‘phenomenal’ environmental embedding in the form of interactive representations and behavioural structures. Thus, studying an allopoietic machine, such as a robot, as if it was physically autopoietic (living), can yield useful scientific insights and can bring much that is new into engineering. The limitations of such work, however, should be kept in mind. Two of those, dealt with here, are that, (i) current robots cannot experience a phenomenal world that arises directly out of having a living body, and (ii) the study of an allopoietic machine cannot reveal the organisation of the underlying autopoietic machine. Acknowledgements The authors would like to warmly thank Amanda Sharkey for helpful comments on an earlier draft of this paper. References 1. Alan Arkin. Behavior-Based Robotics. Intelligent Robotics and Autonomous Agents Series. MIT Press, Cambridge, MA, 1998. 2. Randall D. Beer. Intelligence as Adaptive Behavior - An Experiment in Computational Neuroethology. Academic Press, San Diego, CA, 1990. 3. N. Block. Are absent qualia impossible? Philosophical Review, 89:257–272, 1980. 4. Margaret Boden. Is metabolism necessary? Technical Report CSPR 482, School of Cognitive and Computing Sciences, University of Sussex, Brighton, UK, January 1998. 5. Rodney A. Brooks. Achieving artificial intelligence through building robots. Technical Report Memo 899, MIT AI Lab, 1986. 6. Rodney A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, 2:14–23, 1986. 330 N. Sharkey and T. Ziemke 7. Rodney A. Brooks. Elephants don’t play chess. Robotics and Autonomous Systems, 6(1–2):1–16, 1990. 8. Rodney A. Brooks. Intelligence Without Reason. In Proceedings of the Twelfth International Joint Conference on Artificial Intelligence (IJCAI-91), pages 569– 595, San Mateo, CA, 1991. Morgan Kauffmann. 9. Rodney A. Brooks. Intelligence without representation. Artificial Intelligence, 47:139–159, 1991. 10. Rodney A. Brooks. The Engineering of Physical Grounding. In Proceedings of the Fifteenth Annual Conference of the Cognitive Science Society, pages 153–154, Hillsdale, NJ, 1993. Lawrence Erlbaum. 11. Rodney A. Brooks. 
Cambrian Intelligence: The Early History of the New AI. MIT Press, Cambridge. MA, 1999. 12. D.J. Chalmers. Syntactic transformations on distributed representations. Connection Science, 2:53–62, 1990. 13. Hillel J. Chiel and Randall A. Beer. The brain has a body: Adaptive behavior emerges from interactions of nervous system, body, and environment. Trends in Neurosciences, 20:553–557, 1997. Dec. 1997. 14. L Chrisman. Learning recursive distributed representations for holistic computation. Connection Science, 3:345–366, 1991. 15. Andy Clark. Being There - Putting Brain, Body and World Together Again. MIT Press, Cambridge, MA, 1997. 16. De Candolle. Reported in Frankel & Gunn (1940) The Orientation of Animals: Kineses, Taxes and Compass Reactions, Clarendon Press: Oxford, UK, 1832. 17. H.L. Dreyfus. What computers can’t do: The limits of artificial intelligence. Harper & Row, New York, 2nd revised edition, 1979. 18. Jerry A. Fodor and Zenon W. Pylyshyn. Connectionism and cognitive architecture: A critical analysis. Cognition, 28:3–71, 1988. 19. G.S. Fraenkel and D.L. Gunn. The Orientation of Animals: Kineses, Taxes and Compass Reactions. Clarendon Press, Oxford, 1940. 20. William Grey Walter. An imitation of life. Scientific American, 182:42–54, 1950. 21. William Grey Walter. The living brain. Norton, New York, 1953. 22. Stevan Harnad. The symbol grounding problem. Physica D, 42:335–346, 1990. 23. Fred A. Keijzer. Some Armchair Worries about Wheeled Behavior. In From animals to animats 5 - Proceedings of the Fifth International Conference on Simulation of Adaptive Behavior, pages 13–21, Cambridge, MA, 1998. MIT Press. 24. Knight. Reported in Frankel & Gunn (1940) The Orientation of Animals: Kineses, Taxes and Compass Reactions, Clarendon Press: Oxford, UK, 1806. 25. D. Lambrinos, M. Marinus, H. Kobayashi, T. Labhart, R. Pfeifer, and R. Wehner. An autonomous agent navigating with a polarized light compass. Adaptive Behavior, 6(1):131–161, 1997. 26. D. E. Lloyd. Simple Minds. MIT Press, Cambridge, MA, 1989. 27. Jacques Loeb. Forced movements, tropisms, and animal conduct. Lippincott Company, Philadelphia, 1918. 28. H.H. Lund, B. Webb, and J. Hallam. Physical and temporal scaling considerations in a robot model of cricket calling song preference. Artificial Life, 4(1):95–107, 1998. 29. H. R. Maturana and F. J. Varela. Autopoiesis and Cognition - The Realization of the Living. D. Reidel Publishing, Dordrecht, Holland, 1980. 30. H. R. Maturana and F. J. Varela. The Tree of Knowledge - The Biological Roots of Human Understanding. Shambhala, Boston, MA, 1987. Revised edition, 1992. Life, Mind, and Robots 331 31. L. Niklasson and N.E. Sharkey. The systematicity and generalisation of connectionist compositional representations. In R.Trappl, editor, Cybernetics and Systems. Kluwer Academic Press, Dordrecht, NL, 1992. 32. L. Niklasson and T. van Gelder. On being systematically connectionist. Mind and Language, 3:288–302, 1994. 33. Ivan Petrovich Pavlov. Conditioned Reflexes. Oxford University Press, London, UK, 1927. 34. Rolf Pfeifer and Christian Scheier. Understanding Intelligence. MIT Press, Cambridge, MA, 1999. 35. P. Pfungst. Clever Hans (The horse of Mr. von Osten): A contribution to experimental animal and human psychology. Henry Holt, New York, 1911. 36. Erich Prem. Epistemic autonomy in models of living systems. In Proceedings of the Fourth European Conference on Artificial Life, pages 2–9, Cambridge, MA, 1997. MIT Press. 37. John Searle. Minds, brains and programs. 
Behavioral and Brain Sciences, 3:417– 457, 1980. 38. J.R. Searle. The Mystery of Consciousness. Granta Books, London, 1997. 39. Noel E. Sharkey. Connectionist representation techniques. Artificial Intelligence Review, 5:143–167, 1991. 40. Noel E. Sharkey. Neural networks for coordination and control: The portability of experiential representations. Robotics and Autonomous Systems, 22(3-4):345–359, 1997. 41. Noel E. Sharkey. The new wave in robot learning. Robotics and Autonomous Systems, 22(3-4), 1997. 42. Noel E. Sharkey. Learning from innate behaviors: A quantitative evaluation of neural network controllers. Autonomous Robots, 5:317–334, 1998. Also appeared in Machine Learning, 31, 115-139. 43. Noel E. Sharkey and Jan Heemskerk. The neural mind and the robot. In A. J. Browne, editor, Neural Network Perspectives on Cognition and Adaptive Robotics, pages 169–194. IOP Press, Bristol, UK, 1997. 44. Noel E. Sharkey and Stuart A. Jackson. Three horns of the representational trilemma. In V. Honavar and L. Uhr, editors, Symbol Processing and Connectionist Models for Artificial Intelligence and Cognitive Modeling: Steps towards Integration, pages 155–189. Academic Press, Cambridge, MA, 1994. 45. Noel E. Sharkey and Stuart A. Jackson. An internal report for connectionists. In R. Sun and L. Bookman, editors, Computational architectures integrating neural and symbolic processes: A perspective on the state of the art, pages 223–244. Kluwer Academic Press, Boston, MA, 1995. 46. Noel E. Sharkey and Stuart A. Jackson. Grounding computational engines. Artificial Intelligence Review, 10(10):65–82, 1996. 47. Noel E. Sharkey and Amanda J.C. Sharkey. Emergent cognition. In J. Hendler, editor, Handbook of Neuropsychology. Vol. 9: Computational Modeling of Cognition, pages 347–360. Elsevier Science, Amsterdam, The Netherlands, 1994. 48. Noel E. Sharkey and Tom Ziemke. A consideration of the biological and psychological foundations of autonomous robotics. Connection Science, 10(3–4):361–391, 1998. 49. Charles Scott Sherrington. The integrative action of the nervous system. C. Scribner’s Sons, New York, 1906. 50. P. Smolensky. On the proper treatment of connectionism. Behavioral and Brain Sciences, 11:1–74, 1988. 332 N. Sharkey and T. Ziemke 51. F. Varela, E. Thompson, and E. Rosch. The Embodied Mind - Cognitive Science and Human Experience. MIT Press, Cambridge, MA, 1991. 52. Jakob von Uexküll. Umwelt und Innenwelt der Tiere. Springer, Berlin, Germany, 1921. 53. Jakob von Uexküll. Theoretische Biologie. Suhrkamp, Frankfurt/Main, Germany, 1928. 54. Jakob von Uexküll. A stroll through the worlds of animals and men - a picture book of invisible worlds. In Claire H. Schiller, editor, Instinctive Behavior - The Development of a Modern Concept, pages 5–80. International Universities Press, New York, 1957. Originally appeared as von Uexküll (1934) Streifzüge durch die Umwelten von Tieren und Menschen. Springer, Berlin. 55. Jakob von Uexküll. The Theory of Meaning. Semiotica, 42(1):25–82, 1982. 56. Jakob von Uexküll. Environment [Umwelt] and inner world of animals. In G. M. Burghardt, editor, Foundations of Comparative Ethology, pages 222–245. Van Nostrand Reinhold, New York, 1985. 57. Barbara Webb. Using robots to model animals: A cricket test. Robotics and Autonomous Systems, 16(2–4):117–134, 1995. 58. Michael Wheeler. Cognition’s coming home: The reunion of life and mind. In Phil Husbands and Inman Harvey, editors, Proceedings of the Fourth European Conference on Artificial Life, pages 10–19, Cambridge, MA, 1997. 
MIT Press. 59. Tom Ziemke. The ‘Environmental Puppeteer’ Revisited: A Connectionist Perspective on ‘Autonomy’. In Proceedings of the 6th European Workshop on Learning Robots (EWLR-6), pages 100–110, Brighton, UK, 1997. 60. Tom Ziemke. Adaptive Behavior in Autonomous Agents. Presence, 7(6):564–587, 1998. 61. Tom Ziemke. Rethinking Grounding. In Alexander Riegler, Markus Peschl, and Astrid von Stein, editors, Understanding Representation in the Cognitive Sciences. Plenum Press, New York, 1999. 62. Tom Ziemke and Noel E. Sharkey. A stroll through the worlds of robots and animals: Applying Jakob von Uexküll’s theory of meaning to adaptive robots and artificial life. Semiotica, special issue on the work of Jakob von Uexküll, to appear in 2000. Supplementing Neural Reinforcement Learning with Symbolic Methods Ron Sun CECS Department University of Missouri – Columbia Columbia, MO 65211, USA Abstract. Several different ways of using symbolic methods to enhance reinforcement learning are identified and discussed in some detail. Each demonstrates to some extent the potential advantages of combining RL and symbolic methods. Different from existing work, in combining RL and symbolic methods, we focus on autonomous learning from scratch without a priori domain-specific knowledge. Thus the role of symbolic methods lies truly in enhancing learning, not in providing a priori domain-specific knowledge. These discussed methods point to the possibilities and the challenges in this line of research. 1 Introduction Reinforcement learning is useful in situations where there is no exact teacher input (but there is sparse feedback), especially in sequential decision making. In sequential decision making tasks, an agent needs to perform a sequence of actions to reach some goal states. It may learn to perform such tasks, from scratch, without teacher input, but using reinforcements provided externally. However, it suffers from a few shortcomings (such as slow learning; discussed later). In this paper, I will identify four major ways with which reinforcement learning can benefit from incorporating symbolic methods, in terms of either improving learning processes or improving learning results. I will focus on autonomous learning without requiring additional a priori domain-specific knowledge when incorporating symbolic methods, which constitutes a more difficult task (compared with providing additional symbolic domain-specific knowledge to a reinforcement learner; cf. Maclin and Shavlik 1994). The benefit of such a focus is that our methods are more likely applicable to changing environments, unknown environments, or other environments in which a priori knowledge is hard or costly to obtain, and to tasks in which coding a priori domain-specific symbolic structures is difficult by hand. The important point is that, although in our methods symbolic structures are generated based on reinforcement learning, generated symbolic structures can in turn help with reinforcement learning in various ways (Sun and Peterson 1998). S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 333–347, 2000. c Springer-Verlag Berlin Heidelberg 2000 334 1.1 R. Sun Review of Reinforcement Learning One popular reinforcement learning algorithm for dealing with sequential decision making tasks is Q-learning (Watkins 1989). In a Q-learning model, a Q-value is an evaluation of the “quality” of an action in a given state: Q(x, a) indicates how desirable action a is in state x (which consists of sensory input). 
We can choose an action based on Q-values. One easy way of choosing an action is to choose the one that maximizes the Q-value in the current state; that is, we choose a if Q(x, a) = max_i Q(x, i). To ensure adequate exploration, a stochastic decision process, for example based on the Boltzmann distribution, can be used, so that different actions can be tried in accordance with their respective probabilities, to ensure various possibilities are all looked into. Using a Boltzmann distribution, we have

$$p(a|x) = \frac{e^{Q(x,a)/\alpha}}{\sum_i e^{Q(x,a_i)/\alpha}}$$

where α controls the degree of randomness (temperature) of the decision-making process. To acquire the Q-values, we can use the Q-learning algorithm of Watkins (1989). In the algorithm, Q(x, a) estimates the maximum discounted cumulative reinforcement that the model will receive from the current state x on, $\max\left(\sum_{i=0}^{\infty} \gamma^i r_i\right)$, where γ is a discount factor that favors reinforcement received sooner relative to that received later, and r_i is the reinforcement received at step i (which may be 0). The updating of Q(x, a) is based on minimizing r + γe(y) − Q(x, a); that is,

$$\Delta Q(x, a) = \alpha(r + \gamma e(y) - Q(x, a))$$

where α is a learning rate, γ is a discount factor, e(y) = max_a Q(y, a), and y is the new state resulting from action a. Thus, the updating is based on the temporal difference in evaluating the current state and the action chosen.1 Through successive updates of the Q function, the model can learn to take into account future steps in longer and longer sequences (Watkins 1989). The agent may eventually converge to a stable function or find an optimal sequence that maximizes the reinforcement received. Hence, the agent learns to deal with sequential decision making tasks.

1 In the above formula, Q(x, a) estimates, before action a is performed, the discounted cumulative reinforcement to be received if action a is performed, and r + γe(y) estimates the discounted cumulative reinforcement that the agent will receive, after action a is performed; so their difference (the temporal difference in evaluating an action) enables the learning of Q-values that approximate the discounted cumulative reinforcement.

(Fig. 1. The Q-learning method — diagram omitted; components: state, Q-values, action selection, critic.)

1.2 Problems with Reinforcement Learning

A major problem facing reinforcement learning is the large size of the state space of any problem that is realistic and interesting. The large size of the state space leads to the storage problem of a huge lookup table. Therefore, function approximation or decomposition/aggregation methods are called for. The lookup table implementation of Q-learning may even be out of the question because of a continuous input space and, when discretized, the resulting huge state space (e.g., in our navigation task experiments, there are more than 10^12 states; Sun and Peterson 1998). Function approximators may have to be used in such cases (Sutton 1990, Lin 1992, Tesauro 1992). The large size of the state space also leads to the problem of slow learning, due to the need to explore a sufficiently large portion of the space in order to find optimal or near optimal solutions (Whitehead 1993). Again, we need to use function approximation or decomposition/aggregation methods to remedy the problem. The most often used function approximators are backpropagation neural networks.
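For reference, the update rule and the Boltzmann action selection above can be written down directly in tabular form. The following is a minimal sketch only, not the implementation used in this work (which, as described next, uses backpropagation networks as function approximators); the environment interface (reset/step) and all parameter values are assumptions made for illustration.

```python
import numpy as np

def boltzmann_action(Q, x, temperature, rng):
    """Sample an action a with probability proportional to exp(Q(x, a) / temperature)."""
    prefs = Q[x] / temperature
    prefs -= prefs.max()                       # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(len(probs), p=probs)

def q_learning_episode(env, Q, lr=0.1, gamma=0.9, temperature=0.5, rng=None):
    """Run one episode of Watkins-style Q-learning on a tabular Q (states x actions).

    env is assumed to provide reset() -> state and step(action) -> (state, reward, done)."""
    rng = rng or np.random.default_rng()
    x = env.reset()
    done = False
    while not done:
        a = boltzmann_action(Q, x, temperature, rng)
        y, r, done = env.step(a)
        e_y = 0.0 if done else Q[y].max()               # e(y) = max_a Q(y, a)
        Q[x, a] += lr * (r + gamma * e_y - Q[x, a])     # Delta Q(x,a) = lr*(r + gamma*e(y) - Q(x,a))
        x = y
    return Q
```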
To implement Q-learning, we can use a four-layered backpropagation network (see Figure 1), in which the first three layers form a backpropagation network for computing Q-values and the fourth layer (with only one node) performs stochastic decision making. The output of the third layer (i.e., the output layer of the backpropagation network) indicates the Q-value of each action (represented by an individual node), and the node in the fourth layer determines probabilistically the action to be performed based on a Boltzmann distribution (Watkins 1989). The training of the backpropagation network is based on minimizing the following:

$$err_i = \begin{cases} r + \gamma e(y) - Q(x, a) & \text{if } a_i = a \\ 0 & \text{otherwise} \end{cases}$$

where i is the index for an output node representing the action a_i. The backpropagation procedure is then applied as usual to adjust weights. However, using function approximators leads to the lack of guarantee of convergence of reinforcement learning. In practice, learning performance can be arbitrarily bad, as has been demonstrated by Boyan and Moore (1995) and others. Thus, we will examine ways (with symbolic methods) that remedy this problem to some extent. It should be noted that it is generally the case that decision-theoretic methods for decomposition and aggregation are too costly computationally. They not only need to perform decomposition and aggregation, but also decision-theoretic computation in deciding when and where to do so. Given that, good heuristic methods are called for that can avoid such costly deliberative computation and achieve reasonably good outcomes nevertheless, which we will look into in this paper. Another problem is that reinforcement learning algorithms only lead to closed-loop action (or control) policies, which require moment-to-moment sensing to discern current states in order for an agent to make action decisions. This type of policy is unusable in situations where there is no feedback or feedback is unreliable, in which case an open-loop policy (or a semi-open-loop policy; more later) is much more desirable.

1.3 Remedies

Here are the outlines of a few possible remedies that we proposed, which include the creation of spatial regions for simplifying a learning task by decomposition, the creation of sequential segments for simplifying a learning task by forming a hierarchical structure of sequences, the extraction of symbolic rules to help with the learning of function approximators, and the extraction of symbolic plans to make learning results more usable. The benefits of these methods lie in either (1) improving learning processes or (2) improving results. They enhance reinforcement learners without requiring additional a priori domain-specific knowledge (as opposed to Maclin and Shavlik 1994). First, in trying to speed up reinforcement learning and to improve learning results, we may want to form a coalition of multiple (modular) reinforcement learners (agents). In other words, we may partition a state space to form regions, each of which is assigned to a different learner to handle. Incorporating symbolic methods can be very useful here. This is because fully on-line gradient descent and similar methods for partitioning suffer from the problem of slow learning due to the enlarged space to be explored (enlarged by combining partitioning and learning).
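The coalition of modular learners just mentioned — a hard, non-overlapping partition of the state space with one learner per region — might be organized as in the sketch below. This is an illustrative arrangement only: tabular modules stand in for the backpropagation-network modules used in the actual model, and the region tests, interfaces, and parameter values are assumptions.

```python
import numpy as np

class QModule:
    """One tabular Q-learner responsible for a single region (a stand-in for a
    backpropagation-network module)."""
    def __init__(self, n_states, n_actions, lr=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.lr = lr
    def best(self, x):
        return self.Q[x].max()
    def learn(self, x, a, target):
        self.Q[x, a] += self.lr * (target - self.Q[x, a])

class ModularQLearner:
    """A coalition of modules, each exclusively handling one region of a hard partition.

    Regions are given as predicates over states; each state should satisfy exactly one."""
    def __init__(self, region_tests, modules, gamma=0.9):
        self.region_tests, self.modules, self.gamma = region_tests, modules, gamma
    def module_for(self, x):
        for test, module in zip(self.region_tests, self.modules):
            if test(x):
                return module
        raise ValueError("state falls outside every region of the partition")
    def update(self, x, a, r, y, done):
        # The target uses the module responsible for the new state y, matching the
        # error definition g + gamma * max_a' Q_k'(x', a') - Q_k(x, a) given in Section 2.1.
        target = r if done else r + self.gamma * self.module_for(y).best(y)
        self.module_for(x).learn(x, a, target)
```

For instance, a one-dimensional state index split at 50 would use region_tests = [lambda x: x < 50, lambda x: x >= 50] with one QModule per region; how such splits are chosen automatically is the subject of the region-splitting algorithm described next.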
Our region-splitting algorithm (Sun and Peterson 1999) addresses partitioning in complex reinforcement learning tasks by splitting regions on-line but separately using symbolic descriptions (i.e., semi-on-line), to facilitate the overall process. Splitting regions separately reduces the learning complexity, as well as simplifies individual modules (and their function approximators), thus facilitating overall learning. Second, in terms of speeding up reinforcement learning, incorporating symbolic rule learning methods can be very useful. This is because reinforcement learning methods such as Q-learning (Watkins 1989) are known to be extremely slow. They require exponential time to explore state spaces and can be too costly without function approximators. However, with function approximators, it is difficult to ensure proper generalization (Boyan and Moore 1995). Clarion has been developed as a framework for addressing this problem (Sun and Peterson 1998). It is based on the two-level approach proposed in Sun (1997). That is, the model utilizes both continuous and discrete knowledge (in neural and symbolic representations respectively), tapping into the synergy of the two types of processes. The model goes from neural to symbolic representations in a gradual fashion. It extracts symbolic rules to supplement neural networks (as function approximators) to ensure that improper generalizations are corrected. Supplementing Neural Reinforcement Learning with Symbolic Methods 337 Third, beside improving learning, we can improve the usability of results from reinforcement learning, using symbolic methods. For instance, we can improve usability by extracting explicit plans that can be used in an open-loop fashion from closed-loop policies resulting from reinforcement learning. Different from pure reinforcement learning that generates only reactive plans (closed-loop policies), as well as existing probabilistic planning that requires a substantial amount of a priori knowledge to begin with, we devise a two-stage bottom-up process, in which first reinforcement learning is applied, without the use of a priori domainspecific knowledge, to acquire a closed-loop policy and then explicit plans are extracted (Sun and Sessions 1998). The extraction of symbolic knowledge improves the applicability of learned policies, especially when environmental feedback is unavailable or unreliable. Fourth, yet another method, SSS, involves learning to segment sequences based on reinforcement received during task execution (Sun and Sessions 1999). It segments sequences to create hierarchical structures to reduce non-Markovian temporal dependencies in order to facilitate the learning of the overall task. Note that none of these above methods require a priori domain-specific knowledge to begin with, the benefit of which was stated earlier. (There are, however, a few parameters that need to be set, as will be discussed below.) 2 Details of Several Methods Let us look into some details of these methods. 2.1 Space Partitioning We developed the region-splitting method, which addresses automatic partitioning in complex reinforcement learning tasks with multiple modules (or “agents”), without a priori domain knowledge regarding task structures (Sun and Peterson 1999). 
Partitioning a state/input space into multiple regions helps to exploit differential characteristics of regions and differential characteristics of modules, thus facilitating learning and reducing the complexity of modules, especially when function approximators are used. Usually local modules turn out to be much simpler than a monolithic, global module. We adopt a learning/partitioning decomposition — separating the two issues and optimizing them separately, which facilitates the whole task (as opposed to the on-line methods of Jacobs et al 1991, Jordan and Jacobs 1994). What is especially important is that we can use hard partitioning, because we do not need to calculate gradients of partitioning, and hard partitioning is easier to do.

In the region-splitting algorithm, a region in a partition (non-overlapping and with hard boundaries) is handled exclusively by a single module in the form of a backpropagation network. The algorithm attempts to find better partitionings by splitting existing regions incrementally when certain criteria are satisfied. The splitting criterion is based on the total magnitude of the errors incurred in a region during training and also on the consistency of the errors (the distribution of the directions of the errors, either positive or negative); these two considerations can be combined. Specifically, in the context of Q-learning (Watkins 1989), error is defined as the Q-value updating amount (i.e., the Bellman residual; see Sun and Peterson 1997). That is,

error_x = g + γ max_{a′} Q_{k′}(x′, a′) − Q_k(x, a)

where x is a (full or partial) state description, a is the action taken, x′ is the new state resulting from a in x, g is the reinforcement received, k is the module (agent) responsible for x, and k′ is the module (agent) responsible for x′. We select those regions to split that have high sums of absolute errors (or alternatively, sums of squared errors), which are indicative of a high total magnitude of the errors (Bellman residuals), but low sums of errors, which together with high sums of absolute errors are indicative of low error consistency (i.e., that Q-updates/Bellman residuals are distributed in different directions). That is, our combined criterion is

consistency(r) = | Σ_x error_x | − Σ_x |error_x| < threshold1

where x ranges over the data points encountered during previous training that are within the region r to be split.

Next we select a dimension to be used in splitting within each region to be split. Instead of choosing at random, we again use the heuristics of high sums of absolute errors but low error consistency. Since we have already calculated the sum of absolute errors and it remains the same regardless of how we split, what we can do is to split a dimension so as to best increase the overall error consistency, i.e., the sums of errors (analogous to CART; see Breiman et al 1984). Specifically, we compare for each dimension i in the region r the following measure, the increase in consistency if the dimension is optimally split:

∆consistency(r, i) = max_{v_i} ( | Σ_{x: x_i < v_i} error_x | + | Σ_{x: x_i ≥ v_i} error_x | ) − | Σ_x error_x |

where v_i is a split point for dimension i, and x ranges over the points within region r on one side or the other of the split point, when projected onto dimension i. This measure indicates how much we can increase the error consistency if we split dimension i optimally. The selection of dimension i is contingent upon ∆consistency(r, i) > threshold2.
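The two measures just defined can be computed directly from the Bellman residuals recorded for the training points that fell into a region. In the sketch below, the maximization over split points v_i is realized by an exhaustive scan over midpoints between consecutive sample values along the chosen dimension; that search strategy and the toy data are assumptions of the sketch.

```python
import numpy as np

def consistency(errors):
    """consistency(r) = |sum of errors| - sum of |errors|.
    It is <= 0, and strongly negative when large residuals point in
    conflicting directions (low error consistency)."""
    errors = np.asarray(errors)
    return abs(errors.sum()) - np.abs(errors).sum()

def delta_consistency(xs, errors, dim):
    """Best achievable gain in |sum of errors| when splitting along one dimension:
    max over split points v of (|sum left| + |sum right|) - |sum over all points|.
    Returns (best split value, gain)."""
    xs, errors = np.asarray(xs), np.asarray(errors)
    order = np.argsort(xs[:, dim])
    vals, errs = xs[order, dim], errors[order]
    total = abs(errs.sum())
    best_v, best_gain = None, -np.inf
    for k in range(1, len(errs)):
        if vals[k] == vals[k - 1]:
            continue                        # no boundary between identical values
        v = 0.5 * (vals[k - 1] + vals[k])   # candidate split point
        gain = abs(errs[:k].sum()) + abs(errs[k:].sum()) - total
        if gain > best_gain:
            best_v, best_gain = v, gain
    return best_v, best_gain

# Example: residuals of mixed sign that separate cleanly along dimension 0.
xs = np.array([[0.1, 0.5], [0.2, 0.4], [0.8, 0.6], [0.9, 0.5]])
errors = np.array([0.4, 0.5, -0.4, -0.5])
print(consistency(errors))                 # -1.8: low consistency, candidate for splitting
print(delta_consistency(xs, errors, 0))    # split at 0.5 with gain 1.8
```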
Among those dimensions that satisfy ∆consistency(r, i) > threshold2, we choose the one with the highest ∆consistency(r, i). For a selected dimension i, we then optimize the selection of a split point v_i′ by maximizing the sum of the absolute values of the total errors on the two sides of the split point:

v_i′ = argmax_{v_i} ( | Σ_{x: x_i < v_i} error_x | + | Σ_{x: x_i ≥ v_i} error_x | )

where v_i′ is the chosen split point for dimension i. Such a point is optimal in the exact sense that error consistency is maximized (Breiman et al 1984). Then, we split the region r using a boundary created by the split point. We create a split hyper-plane using the selected point: spec = x_j < v_j. We then split the region using the newly created boundary: region1 = region ∩ spec and region2 = region ∩ ¬spec, where region is the specification of the original region.

The algorithm is as follows:
1. Initialize one partition to contain only one region that covers the whole input space
2. Train an agent on the partition
3. Further split the partition
4. Train a set of agents (with each region assigned to a different agent)
5. If no more splitting can be done, stop; else, go to 3

Further splitting a partition: for each region r that satisfies consistency(r) < threshold1, do:
1. Select a dimension j in the input space that maximizes ∆consistency, provided that ∆consistency(r, j) > threshold2
2. In the selected dimension j, select a point (a value v_j) lying within the region and maximizing ∆consistency(r, j)
3. Using the selected value in the selected dimension, create a split hyper-plane: spec = x_j < v_j
4. Split the region using the newly created hyper-plane: region1 = region ∩ spec and region2 = region ∩ ¬spec, where region is the specification of the original region; create two new agents for handling these two new regions by replicating the agent for the original region
5. If the number of regions exceeds R, keep combining regions until the number is right: randomly select two regions (preferring two adjacent regions) and merge the two; keep one of the two agents responsible for these two regions and delete the other

The method was experimentally tested and compared to a number of other algorithms. As expected, we found that the multi-module automatic partitioning method outperformed single-module learning and on-line gradient descent partitioning methods (Sun and Peterson 1999).

2.2 Rule Extraction

We developed a two-level model for rule extraction (i.e., the Clarion model) (Sun 1997, Sun and Peterson 1997, 1998, Sun et al 1999). The bottom level implements the usual Q-learning (Watkins 1989) using a backpropagation network. The top level learns rules. In the top level, we devised a novel rule learning algorithm based on neural reinforcement learning. The basic idea is as follows: we perform rule learning (extraction and subsequent revision) at each step, which is associated with the following information: (x, y, r, a), where x is the state before action a is performed, y is the new state entered after action a is performed, and r is the reinforcement received after action a. If some action decided by the bottom level is successful, then the agent extracts a rule that corresponds to the decision and adds the rule to the rule network.
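The rule form used at the top level (a conjunction of per-dimension value constraints mapped to an action, as spelled out in the following paragraphs) and the extraction step can be sketched as follows. The dictionary-of-value-sets encoding of a condition, the helper names, and the stand-in Q-function in the example are illustrative assumptions, not the original data structures.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """condition -> action, where the condition is a conjunction over input
    dimensions; each dimension is constrained to a set of allowed values."""
    condition: dict   # dim index -> set of allowed (discretized) values
    action: int

    def matches(self, x):
        return all(x[d] in allowed for d, allowed in self.condition.items())

def maybe_extract(rules, x, a, r, y, q, gamma=0.9, threshold=0.1):
    """Extract a rule for the step (x, y, r, a) if the step was successful
    (sufficiently positive Bellman residual) and no existing rule covers it.
    q(state) is assumed to return the vector of Q-values for that state."""
    residual = r + gamma * max(q(y)) - q(x)[a]
    covered = any(rule.matches(x) and rule.action == a for rule in rules)
    if residual > threshold and not covered:
        # the new condition specifies the values of all input dimensions exactly as in x
        rules.append(Rule(condition={d: {v} for d, v in enumerate(x)}, action=a))

# Toy usage with a constant Q-function standing in for the bottom level:
rules = []
maybe_extract(rules, x=(1, 0, 2), a=1, r=1.0, y=(1, 1, 2),
              q=lambda s: [0.0, 0.2, 0.1])
print(rules[0].matches((1, 0, 2)), rules[0].matches((2, 0, 2)))  # True False
```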
Then, in subsequent interactions with the world, the agent verifies the extracted rule by considering the outcome of applying the rule: if the outcome is not successful, then the rule should be made more specific and exclusive of the current case (“shrinking”); if the outcome is successful, the agent may try to generalize the rule to make it more universal (“expansion”). Rules are of the following form: conditions −→ action, where the left-hand side is a conjunction of individual conditions, each of which refers to a primitive: a value range or a value in a dimension of the (sensory) input state.

At each step, we update the following statistics for each rule condition and each of its minor variations (i.e., the rule condition plus/minus one value), with regard to the action a performed: PM_a (i.e., Positive Match) and NM_a (i.e., Negative Match). Here, positivity/negativity is determined by the Bellman residual (the Q-value updating amount), which indicates whether or not the action is reasonably good. Based on these statistics, we calculate the information gain measure:

IG(A, B) = log2 [ (PM_a(A) + 1) / (PM_a(A) + NM_a(A) + 2) ] − log2 [ (PM_a(B) + 1) / (PM_a(B) + NM_a(B) + 2) ]

where A and B are two different conditions that lead to the same action a. The measure essentially compares the percentage of positive matches under the different conditions A and B (with the Laplace estimator; Lavrac and Dzeroski 1994). If A can improve the percentage to a certain degree over B, then A is considered better than B. In the algorithm, if a rule is better compared with the match-all rule (i.e., the rule with the condition that matches all inputs), then the rule is considered successful (for the purpose of deciding on expansion or shrinking operations).

We decide whether or not to extract a rule based on a simple success criterion which is fully determined by the current step (x, y, r, a):

– Extraction: if r + γ max_b Q(y, b) − Q(x, a) > threshold, where a is the action performed in state x, r is the reinforcement received, and y is the resulting new state (that is, if the current step is successful), and if there is no rule that covers this step in the top level, set up a rule C −→ a, where C specifies the values of all the input dimensions exactly as in x.

The criterion for applying the expansion and shrinking operators, on the other hand, is based on the afore-mentioned statistical test. Expansion amounts to adding an additional value to one input dimension in the condition of a rule, so that the rule will have more opportunities of matching inputs; shrinking amounts to removing one value from one input dimension in the condition of a rule, so that it will have fewer opportunities of matching inputs. Here are the detailed descriptions of these operators:

– Expansion: if IG(C, all) > threshold1 and max_{C′} IG(C′, C) ≥ 0, where C is the current condition of a matching rule, all refers to no condition at all (with regard to the same action specified by the rule), and C′ is a modified condition such that C′ = C plus one value (i.e., C′ has one more value in one of the input dimensions) (that is, if the current rule is successful and the expanded condition is potentially better), then set C′′ = argmax_{C′} IG(C′, C) as the new (expanded) condition of the rule. Reset all the rule statistics.
– Shrinking: if IG(C, all) < threshold2 and maxC ′ IG(C ′ , C) > 0, where C is the current condition of a matching rule, all refers to no condition at all (with regard to the same action specified by the rule), and C ′ is a modified condition such that C ′ = C minus one value (i.e., C ′ has one less value in one of the input dimensions) (that is, if the current rule is unsuccessful, but the shrunk condition is better), then set C ′′ = argmaxC ′ IG(C ′ , C) as the new (shrunk) condition of the rule. Reset all the rule statistics. If shrinking the condition makes it impossible for a rule to match any input state, delete the rule. For making the final decision of which action to take, we combine the corresponding values for each action from the two levels by a weighted sum; that is, if the top level indicates that action a has an activation value v (which should be 0 or 1 as rules are binary) and the bottom level indicates that a has an activation value q (the Q-value), then the final outcome is w1 ∗ v + w2 ∗ q. Stochastic decision making with Boltzmann distribution based on the weighted sums is then performed to select an action out of all the possible actions. Relative weights or percentages of the two levels are automatically set based on the relative performance of the two levels. Although it does not reduce learning complexity by order of magnitude, Clarion does help to speed up learning, empirically, in all our experiments in various sequential decision task domains (see Sun and Peterson 1998 for details). Note that this kind of rule extraction is very different from rule extraction at the end of training a backpropagation network (as in Towell and Shavlik 1993), in which costly search is done to find rules without benefiting learning processes per se. 2.3 Plan Extraction The above rule extraction method generates isolated rules, focusing on individual states, but not the chaining of these rules in accomplishing a sequential decision tasks. In contrast, the following plan extraction method generates a complete explicit plan that can by itself accomplish a sequential task (Sun and Sessions 1998a). By an explicit plan, we mean a control policy consisting of an explicit sequence of action steps, that does not require (or requires little) environmental feedback during execution (compared with a completely closed-loop control policy). When no environmental feedback (sensing) is required, an explicit plan amounts to an open-loop policy. When a small amount of feedback (sensing) is required, it amounts to a semi-open-loop policy. In either case, an explicit plan can lead to “policy compression” (i.e., it can lead to fewer specifications for fewer states, through explication of the closed-loop policy). For example, instead 342 R. Sun of a closed-loop policy specifying actions for all of the 1012 states in a domain (see Sun and Sessions 1998a), an explicit plan may contain a sequence of 10 actions each of which either does not rely on sensing states, or relies only on distinguishing 5 different states at a time. Our plan extraction method (Sun and Sessions 1998 a, b) turns a set of Q values (and the corresponding policy resulting from these values) into a plan that is in the form of a sequence of steps (in accordance with the traditional AI formulation of planning). The basic idea is that we use beam search, to find the best action sequences (or conditional action sequences) that achieve the goal with a certain probability. 
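Before the plan-extraction algorithm is presented in detail, here is a sketch of the information-gain test of Section 2.2 that drives the expansion and shrinking operators, computed from the Laplace-estimated positive and negative match counts described above. The threshold values, the function names, and the counts in the example are illustrative assumptions of this sketch.

```python
import math

def info_gain(stats_a, stats_b):
    """IG(A, B) with the Laplace estimator, where stats_* = (PM, NM) are the
    positive/negative match counts of two conditions for the same action."""
    pm_a, nm_a = stats_a
    pm_b, nm_b = stats_b
    return (math.log2((pm_a + 1) / (pm_a + nm_a + 2))
            - math.log2((pm_b + 1) / (pm_b + nm_b + 2)))

def decide_operator(stats_current, stats_all, expanded, shrunk,
                    threshold1=0.2, threshold2=-0.2):
    """Return ('expand' | 'shrink' | None, chosen variant) for one rule.
    `expanded` / `shrunk` map candidate conditions (the current condition plus /
    minus one value) to their (PM, NM) counts."""
    ig_vs_all = info_gain(stats_current, stats_all)

    def best_of(candidates):
        if not candidates:
            return None, -math.inf
        c = max(candidates, key=lambda k: info_gain(candidates[k], stats_current))
        return c, info_gain(candidates[c], stats_current)

    c_exp, ig_exp = best_of(expanded)
    c_shr, ig_shr = best_of(shrunk)
    if ig_vs_all > threshold1 and ig_exp >= 0:
        return "expand", c_exp   # rule is successful and a wider condition is no worse
    if ig_vs_all < threshold2 and ig_shr > 0:
        return "shrink", c_shr   # rule is unsuccessful and a narrower condition is better
    return None, None

# Example: the current condition clearly beats the match-all rule, and one
# expanded variant looks at least as good, so the rule would be expanded.
op, cond = decide_operator(stats_current=(8, 2), stats_all=(50, 50),
                           expanded={"C plus one value": (12, 3)},
                           shrunk={"C minus one value": (4, 2)})
print(op, cond)
```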
We recognize that the optimal Q-value learned through Q-learning represents the total future probability of reaching the goal (Sun and Sessions 1998a). Thus Q-values can be used as a guide in searching for explicit plans. We employ the following data structures in plan extraction. The current state set, CSS, consists of multiple pairs of the form (s, p(s)), in which the first item indicates a state s and the second item p(s) indicates the probability of that state. For each state in CSS, we find the corresponding best action. In so doing, we have to limit the number of branches at each step, for the sake of the time efficiency of the algorithm as well as the representational efficiency of the resulting plan. The set thus contains up to a fixed number n of pairs, where n is the branching factor in beam search. In order to calculate the best default action at each step, we include a second set of states, CSS′, which covers a certain number (m) of possible states not covered by CSS. The algorithm is as follows:

Set the current state set CSS = {(s_0, 1)} and CSS′ = {}
Repeat until the termination conditions are satisfied (e.g., step > D):
- For each action u, compute the probabilities of transitioning to all possible next states (for all s′ ∈ S) from each of the current states (s ∈ CSS): p(s′, s, u) = p(s) · p_{s,s′}(u)
- For each action u, compute its estimated utility with respect to each state in CSS:
  Ut(s, u) = Σ_{s′} p(s′, s, u) · max_v Q(s′, v)
  That is, we calculate the probability of reaching the goal after performing action u from the current state s.
- For each action u, compute the estimated utility with respect to all the states in CSS′:
  Ut(CSS′, u) = Σ_{s∈CSS′} Σ_{s′} p(s) · p_{s,s′}(u) · max_v Q(s′, v)
- For each state s in CSS, choose the action u_s with the highest utility Ut(s, u): u_s = argmax_u Ut(s, u)
- Choose the best default action u with regard to all the states in CSS′: u = argmax_{u′} Ut(CSS′, u′)
- Update CSS to contain the n states that have the highest n probabilities, i.e., with the highest p(s′)’s:
  p(s′) = Σ_{s∈CSS} p(s′, s, u_s)
  where u_s is the action chosen for state s.
- Update CSS′ to contain the m states that have the highest m probabilities calculated as follows, among those states that are not in the new (updated) CSS:
  p(s′) = Σ_{s∈CSS∪CSS′} p(s′, s, u_s)
  where u_s is the action chosen for state s (either a conditional action in case s ∈ CSS or the default action in case s ∈ CSS′), and the summation is over the old CSS and CSS′ (before updating).²

In the measure Ut, we take into account the probability of reaching the goal in the future from the current states (based on the Q-values; see Theorem 1 in Sun and Sessions 1998a), as well as the probability of reaching the current states based on the history of the paths traversed (based on the p(s)’s). This is because what we are aiming at is an estimate of the overall success probability of a path. In the algorithm, we select the best actions (for s ∈ CSS and CSS′) that are most likely to succeed based on these measures, respectively. Note that (1) as a result of incorporating nonconditional default actions, nonconditional plans are special cases of conditional plans; (2) if we set n = 0 and m > 0, we in effect have a nonconditional plan extraction algorithm and the result of the algorithm is a nonconditional plan; (3) if we set m = 0, then we have a purely conditional plan (with no default action attached).
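A sketch of the extraction loop above, assuming the transition probabilities p_{s,s'}(u) and the learned Q-values are available as arrays. The tabular representation, the merged probability update used for both CSS and CSS', and the fixed step limit used as the termination condition are simplifications of this sketch rather than features of the original algorithm.

```python
import numpy as np

def extract_plan(P, Q, s0, n=2, m=2, max_steps=10):
    """Beam-search plan extraction.
    P[u, s, s'] = transition probability p_{s,s'}(u); Q[s, u] = learned Q-values.
    Returns, per step, the conditional actions for the states kept in CSS and
    the default action chosen for CSS'."""
    n_states, n_actions = Q.shape
    css = {s0: 1.0}        # state -> p(s)
    css_prime = {}         # additional states covered by the default action
    plan = []
    for _ in range(max_steps):
        v = Q.max(axis=1)                                   # max_v Q(s', v)
        # Ut(s, u) = sum_{s'} p(s) * p_{s,s'}(u) * max_v Q(s', v)
        ut = {s: np.array([p * (P[u, s] @ v) for u in range(n_actions)])
              for s, p in css.items()}
        best = {s: int(np.argmax(u_vals)) for s, u_vals in ut.items()}
        ut_prime = np.zeros(n_actions)                      # utility of the default action
        for s, p in css_prime.items():
            ut_prime += np.array([p * (P[u, s] @ v) for u in range(n_actions)])
        default = int(np.argmax(ut_prime)) if css_prime else None
        plan.append((best, default))
        # propagate state probabilities one step under the chosen actions
        nxt = np.zeros(n_states)
        for s, p in {**css_prime, **css}.items():
            u = best.get(s, default)
            if u is not None:
                nxt += p * P[u, s]
        top = np.argsort(nxt)[::-1]
        css = {int(s): float(nxt[s]) for s in top[:n] if nxt[s] > 0}
        css_prime = {int(s): float(nxt[s]) for s in top[n:n + m]
                     if nxt[s] > 0 and int(s) not in css}
        if not css and not css_prime:
            break
    return plan

# Tiny example: 3 states, 2 actions; state 2 is an absorbing goal state.
P = np.zeros((2, 3, 3))
P[0] = [[0.2, 0.8, 0.0], [0.1, 0.1, 0.8], [0.0, 0.0, 1.0]]  # action 0
P[1] = [[0.9, 0.1, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]  # action 1
Q = np.array([[0.6, 0.2], [0.8, 0.3], [1.0, 1.0]])          # e.g. from prior Q-learning
print(extract_plan(P, Q, s0=0, n=1, m=1, max_steps=3))
```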
An issue is how to determine the branching factor, i.e., the number of conditional actions at each step. We can start with a small number, say 1 or 2, and gradually expand the search by adding to the number of branches, until a certain criterion, a termination condition, is satisfied. We can terminate the search when p(G) > δ (where δ is specified a priori) or when a time limit is reached (in which case failure is declared). The advantage of extracting plans is that, instead of closed-loop policies that have to rely on moment-to-moment sensing, extracted plans can be used in an open-loop fashion, which is useful when feedback is not available or unreliable. In addition, extracting plans usually leads to the savings of sensing and storage costs (Sun and Sessions 1998 a). 2 For both CSS and CSS ′ , if a goal state or a state of probability 0 is selected, we may remove it and, optionally, reduce the beam width of the corresponding set by 1. 344 2.4 R. Sun Segmentation The SSS method (which stands for Self-Segmentation of Sequences; Sun and Sessions 1999) involves learning to segment sequences to create hierarchical structures, based on reinforcement received during task execution, with different levels of control communicating with each other through sharing reinforcement estimates obtained by each others. In SSS, there are three types of learning modules: – Individual action module Q: Each performs actions and learns through Q-learning to maximize its reinforcements. – Individual controller CQ: Each CQ learns when a Q module (corresponding to the CQ) should continue its control and when it should give up the control, in terms of maximizing reinforcements. The learning is accomplished through (separate) Q-learning. – Abstract controller AQ: It performs and learns abstract control actions, that is, which Q module to select under what circumstances. The learning is accomplished through (separate) Q-learning to maximize reinforcements. The model works as follows: 1. Observe the current state s. 2. The currently active Q/CQ pair takes control. If there is no active one (when the system first starts), go to step 5. 3. The active CQ selects and performs a control action based on CQ(s, ca) for different ca. If the action chosen by CQ is en , go to step 5. Otherwise, the active Q selects and performs an action based on Q(s, a) for different a. 4. The active Q and CQ performs learning. Go to step 1. 5. AQ selects and performs an abstract control action based on AQ(s, aa) for different aa, to select a Q/CQ pair to become active. 6. AQ performs learning. Go to step 3. For the details of learning of the three types of modules, see Sun and Sessions (1999). Through the interaction of the three types of modules, SSS segments sequences to create a hierarchy of subsequences, in an effort to maximize the overall reinforcement (see Sun and Sessions 1999 for detailed analyses). It distinguishes global and local contexts (i.e., non-Markovian dependencies) and treats them separately. This is done through automatically determining segments (subsequences) or, in other words, automatically seeking out proper configurations of temporal structures (i.e., global and local non-Markovian dependencies), which leads to more reinforcements. Note that such segmentation is different from reinforcement learning using pre-given hierarchical structures. In the latter work, a vast amount of a priori domain-specific knowledge has to be worked out by hand and engineered into a learner. 
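The control regime of SSS can be sketched as three interacting learners, as below. The environment interface, the tabular stand-in modules, and especially the reward bookkeeping are placeholders of this sketch; the actual update rules for the Q, CQ, and AQ modules are those given in Sun and Sessions (1999), not the simplified ones here.

```python
import random

class QModule:
    """A minimal tabular learner standing in for each Q, CQ, and AQ module."""
    def __init__(self, actions, alpha=0.1, gamma=0.9, eps=0.1):
        self.q, self.actions = {}, list(actions)
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def value(self, s, a):
        return self.q.get((s, a), 0.0)

    def select(self, s):
        if random.random() < self.eps:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.value(s, a))

    def learn(self, s, a, r, s2):
        best = max(self.value(s2, b) for b in self.actions)
        self.q[(s, a)] += 0.0  # ensure key exists before update
        self.q[(s, a)] = self.value(s, a) + self.alpha * (r + self.gamma * best - self.value(s, a))

def sss_episode(env, q_modules, cq_modules, aq, n_steps=50):
    """Control flow of SSS: AQ picks a Q/CQ pair; at each step the active CQ
    decides whether its Q module keeps control ('continue') or gives it up
    ('end'), in which case AQ selects a new pair; the active Q module acts."""
    s = env.observe()
    active = aq.select(s)                       # abstract control action
    for _ in range(n_steps):
        decider = active
        ca = cq_modules[active].select(s)       # 'continue' or 'end'
        if ca == "end":
            active = aq.select(s)               # control returns to AQ
        a = q_modules[active].select(s)         # the active Q module acts
        r, s2 = env.step(a)
        # simplified bookkeeping: the real updates share reinforcement
        # estimates between the levels (Sun and Sessions 1999)
        q_modules[active].learn(s, a, r, s2)
        cq_modules[decider].learn(s, ca, r, s2)
        aq.learn(s, active, r, s2)
        s = s2

class ToyEnv:
    """Stand-in environment with two states; reward 1 in state 1."""
    def __init__(self):
        self.s = 0
    def observe(self):
        return self.s
    def step(self, a):
        self.s = (self.s + a) % 2
        return (1.0 if self.s == 1 else 0.0), self.s

env = ToyEnv()
q_mods = {k: QModule([0, 1]) for k in ("A", "B")}
cq_mods = {k: QModule(["continue", "end"]) for k in ("A", "B")}
aq = QModule(["A", "B"])
sss_episode(env, q_mods, cq_mods, aq)
```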
Supplementing Neural Reinforcement Learning with Symbolic Methods 3 345 Discussion We can characterize the afore-discussed symbolic methods along several different dimensions. First of all, in terms of the timing of applying symbolic methods, we have the following possibilities: – off-line: symbolic methods are applied at the end (e.g., in extracting plans) – semi-on-line: symbolic methods are applied along the way but separately (e.g., in extracting rules and in creating regions) – on-line: symbolic methods are applied within a unified mathematical or computational framework (e.g., in creating hierarchies) Second, in terms of the techniques used for creating symbolic structures, we have the following possibilities: – – – – beam search IG-based incremental search modular competition incremental splitting Many other possibilities that we have not yet explored exist. Finally, we can have the following varieties of resulting symbolic structures: – small pieces of knowledge (e.g., rules) – large chunks of knowledge (e.g. plans) – global structures (e.g. hierarchies and regions) These dimensions together define a very large space which we have not yet fully explored. There are many possibilities out there. There are many challenges that we need to address in order to further develop hybrid reinforcement learning methods along the line outlined above. First of all, we need to develop better extraction algorithms that generate useful and comprehensible regions, rules, plans, and hierarchical structures. We can further explore the space outlined above. We shall also look into other mathematical frameworks for these processes. Second, we need to test these algorithms on largescale real-world application problems, in order to fully validate them. Third, we shall also look into other ways of incorporating symbolic processes and structures into reinforcement learning. Finally, we need to investigate ways of mixing these techniques to achieve even better performance. In sum, there are a variety of ways of improving reinforcement learning in practice, through combining it with symbolic methods, without losing its characteristics of being autonomous and requiring no a priori domain-specific knwoledge to begin with (beside reinforcements). This paper highlights a few possibilities and challenges of this line of work, and hopes to stimulate further research in this area. 346 R. Sun Acknowledgements This work was supported in part by Office of Naval Research grant N00014-951-0440. The work was done in collaboration with Todd Peterson, Chad Sessions, and Ed Merrill. References 1. J. Boyan and A. Moore, (1995). Generalization in reinforcement learning: safely approximating the value function. in: J. Tesauro, and D. Touretzky, and T. Leen, (eds.) Neural Information Processing Systems, 369-376, MIT Press, Cambridge, MA. 2. L. Breiman, L. Friedman, and P. Stone, (1984). Classification and Regression. Wadsworth, Belmont, CA. 3. R. Jacobs, M. Jordan, S. Nowlan, and G. Hinton, (1991). Adaptive mixtures of local experts. Neural Computation. 3, 79-87. 4. M. Jordan and R. Jacobs, (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation. 6, 181-214. 5. N. Lavrac and S. Dzeroski, (1994). Inductive Logic Programming. Ellis Horword, New York. 6. L. Lin, (1992). Self-improving reactive agents based on reinforcement learning, planning, and teaching. Machine Learning. Vol.8, pp.293-321. 7. R. Maclin and J. Shavlik, (1994). Incorporating advice into agents that learn from reinforcements. Proc. 
of the National Conference on Artificial Intelligence (AAAI94). Morgan Kaufmann, San Meteo, CA. 8. S. Singh, (1994). Learning to Solve Markovian Decision Processes. Ph.D Thesis, University of Massachusetts, Amherst, MA. 9. R. Sun, (1992). On variable binding in connectionist networks. Connection Science, Vol.4, No.2, pp.93-124. 1992. 10. R. Sun, (1997). Learning, action, and consciousness: a hybrid approach towards modeling consciousness. Neural Networks, 10 (7), pp.1317-1331 11. R. Sun and T. Peterson, (1997). A hybrid model for learning sequential navigation. Proc. of IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA’97). Monterey, CA. pp.234-239. IEEE Press, Piscateway, NJ. 12. R. Sun and T. Peterson, (1998). Autonomous learning of sequential tasks: experiments and analyses. IEEE Transactions on Neural Networks, Vol.9, No.6, pp.12171234. 13. R. Sun and T. Peterson, (1999). Multi-agent reinforcement learning: weighting and partitioning. Neural Networks, Vol.12 No.4-5. pp.127-153. 14. R. Sun, T. Peterson, and E. Merrill, (1999). A hybrid architecture for situated learning of reactive sequential decision making. Applied Intelligence, in press. 15. R. Sun and C. Sessions, (1998a). Extracting plans from reinforcement learners. Proceedings of the 1998 International Symposium on Intelligent Data Engineering and Learning (IDEAL’98). pp.243-248. eds. L. Xu, L. Chan, I. King, and A. Fu. Springer-Verlag, Heidelberg. 16. R. Sun and C. Sessions, (1998b). Learning to plan probabilistically from neural networks. Proceedings of IEEE International Joint Conference on Neural Networks, pp.1-6. IEEE Press, Piscataway, NJ. Supplementing Neural Reinforcement Learning with Symbolic Methods 347 17. R. Sun and C. Sessions, (1999). Self segmentation of sequences. Proceedings of IEEE International Joint Conference on Neural Networks, IEEE Press, Piscataway, NJ. 18. R. Sutton, (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proc.of Seventh International Conference on Machine Learning. Morgan Kaufmann, San Meteo, CA. 19. T. Tesauro, (1992). Practical issues in temporal difference learning. Machine Learning. Vol.8, 257-277. 20. G. Towell and J. Shavlik, (1993). Extracting refined rules from Knowledge-Based Neural Networks, Machine Learning. 13 (1), 71-101. 21. C. Watkins, (1989). Learning with Delayed Rewards. Ph.D Thesis, Cambridge University, Cambridge, UK. 22. S. Whitehead, (1993). A complexity analysis of cooperative mechanisms in reinforcement learning. Proc. of the National Conference on Artificial Intelligence (AAAI’93), 607-613. Morgan Kaufmann, San Francisco, CA. Self-Organizing Maps in Symbol Processing Timo Honkela Media Lab, University of Art and Design, Hämeentie 135 C, FIN-00560 Helsinki, Finland Timo.Honkela@uiah.fi http://www.mlab.uiah.fi/˜timo/ Abstract. A symbol as such is disassociated from the world. In addition, as a discrete entity a symbol does not mirror all the details of the portion of the world that it is meant to refer to. Humans establish the association between the symbols and the referenced domain — the words and the world — through a long learning process in a community. This paper studies how Kohonen self-organizing maps can be used for modeling the learning process needed in order to create a conceptual space based on a relevant context with which the symbols are associated. 
The categories that emerge in the self-organizing process and their implicitness are considered as well as the possibilities to model contextuality, subjectivity and intersubjectivity of interpretation. 1 Introduction Models of natural language may test the background assumptions of the developers, or, at least, reflect them. In the predominant approaches among computerized models of language, the linguistic categories and rules are predetermined and coded by hand explicitly as symbolic representations. The field of connectionist natural language processing, based on the use of artificial neural networks, may be characterized to take an opposite stand. The critical view on symbolic representations is based on the idea that the symbolic and discrete nature of written expressions in natural language does not imply that symbolic descriptions of linguistic phenomena are sufficient as such. This view appears to be relevant especially when semantic and pragmatic issues are considered. A traditional, logic-based analysis studies examples like the ones given below (from [35]). The emphasis lies in phenomena that are suitable to be explained in the framework of predicate logic, e.g., propositional forms, connectives, quantifiers, truth values, presuppositions, and logical ambiguity. – “Each one of Mozart’s works is a masterpiece.” – “If butter is heated, it melts.” However, if one considers the sentences and conversations given below, it should be apparent that there is need for formalisms and tools that enable modeling, for instance, of adaptation, vagueness, contextuality, subjectivity of interpretation, and the relationship between discrete symbols and continuous spaces in the domain under consideration. S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 348–362, 2000. c Springer-Verlag Berlin Heidelberg 2000 Self-Organizing Maps in Symbol Processing 349 – “Please, show me some pictures with beautiful Finnish lake sceneries.” – “Do you see that small woman there?” “Actually, I don’t consider her small while people in her country are usually much shorter than here.” A radically connectionist natural language processing approach is based on the following assumptions. The ability to understand natural language utterances can be learned via examples. The categories necessary in the interpretation emerge in the self-organizing learning processes and they may be implicit rather than explicit as will be shown later in this article. An implicit category can be used during interpretation even if it is not named. The processing mechanisms are mainly statistical rather than rule-like. The process of symbol grounding, i.e., associating symbols with continuous multi-dimensional spaces and dynamic processes as well as the assumptions outlined above are discussed in this paper. The methodological basis is Kohonen’s self-organizing map algorithm. 2 Self-Organizing Map The basic self-organizing map (SOM) [22,24] can be visualized as a sheet-like neural-network array, the cells (nodes, units) of which become specifically tuned to various input signal patterns or classes of patterns in an orderly fashion. The learning process is competitive and unsupervised, meaning that no teacher is needed to define for an input the correct output, i.e., the cell into which the input is mapped. The locations of the responses in the map tend to become ordered in the learning process as if some meaningful nonlinear coordinate system for the different input features were being created over the network [24]. 
2.1 Self-Organizing Map Algorithm Assume that some sample data sets (such as in Table 1) have to be mapped onto a 2-dimensional array. The set of input samples is described by a real vector x(t) ∈ Rn where t is the index of the sample, or the discrete-time coordinate. Each node i in the map contains a model vector mi (t) ∈ Rn , which has the same number of elements as the input vector x(t). The initial values of the components of the model vector, mi (t), may even be selected at random. In practical applications, however, the model vectors are more profitably initialized in some orderly fashion, e.g., along a two-dimensional subspace spanned by the two principal eigenvectors of the input data vectors [24]. Any input item is thought to be mapped into the location, the mi (t) of which matches best with x(t) in some metric (e.g. Euclidean). The self-organizing algorithm creates the ordered mapping as a repetition of the following basic tasks: 1. An input vector x(t) is compared with all the model vectors mi (t). The bestmatching unit (node) on the map, i.e., the node where the model vector is most similar to the input vector in some metric is identified. This bestmatching unit is often called the winner. 350 T. Honkela Table 1. Three-dimensional input data in which each sample vector x consists of the red-green-blue values of the color shown in the rightmost column. 250 165 222 210 255 184 189 255 233 ... 235 042 184 105 127 134 183 140 150 ... 215 42 135 30 80 11 107 0 122 ... antique white brown burlywood chocolate coral dark goldenrod dark khaki dark orange dark salmon ... 2. The model vectors of the winner and a number of its neighboring nodes in the array are changed towards the input vector according to the learning principle specified below. The basic idea in the SOM learning process is that, for each sample input vector x(t), the winner and the nodes in its neighborhood are changed closer to x(t) in the input data space. During the learning process, individual changes may be contradictory, but the net outcome in the process is that ordered values for the mi (t) emerge over the array. If the number of available input samples is restricted, the samples must be presented reiteratively to the SOM algorithm. Adaptation of the model vectors in the learning process takes place according to the following equations: mi (t + 1) = mi (t) + α(t)[x(t) − mi (t)] for each i ∈ Nc (t), mi (t + 1) = mi (t) otherwise, where t is the discrete-time index of the variables, the factor α(t) ∈ [0, 1] is a scalar that defines the relative size of the learning step, and Nc (t) specifies the neighborhood around the winner in the map array. At the beginning of the learning process the radius of the neighborhood is fairly large, but it is made to shrink during learning. This ensures that the global order is obtained already at the beginning, whereas towards the end, as the radius gets smaller, the local corrections of the model vectors in the map will be more specific. The factor α(t) also decreases during learning. The resulting map is shown in Figure 1. The experiment was conducted using SOM PAK software [25]. Perhaps the most typical notion of the SOM is to consider it as an artificial neural network model of the brain, especially of the experimentally found ordered “maps” in the cortex. There exists a lot of neurophysiological evidence to support the idea that the SOM captures some of the fundamental processing principles of the brain [27]. 
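A compact sketch of the learning steps just described: a rectangular grid of model vectors, Euclidean best-matching, and a neighborhood and learning rate that shrink over time. Random initialization and the linear decay schedules are simplifications of this sketch; the color data in the example are red-green-blue triplets of the kind shown in Table 1.

```python
import numpy as np

def train_som(data, rows=10, cols=10, epochs=50, alpha0=0.5, radius0=None, seed=0):
    """Basic SOM: find the best-matching unit for each input and move it and
    its grid neighbors toward the input; radius and learning rate decrease."""
    rng = np.random.default_rng(seed)
    dim = data.shape[1]
    m = rng.random((rows, cols, dim))                   # model vectors m_i
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), -1)
    radius0 = radius0 or max(rows, cols) / 2
    n_iter = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            alpha = alpha0 * (1 - t / n_iter)           # decreasing learning rate
            radius = 1 + radius0 * (1 - t / n_iter)     # shrinking neighborhood
            d = np.linalg.norm(m - x, axis=2)
            winner = np.unravel_index(np.argmin(d), d.shape)
            grid_dist = np.linalg.norm(grid - np.array(winner), axis=2)
            within = grid_dist <= radius                # N_c(t): winner and neighbors
            m[within] += alpha * (x - m[within])        # m_i(t+1) = m_i(t) + alpha [x - m_i(t)]
            t += 1
    return m

def best_matching_unit(m, x):
    d = np.linalg.norm(m - x, axis=2)
    return np.unravel_index(np.argmin(d), d.shape)

# Example: a small map of colors from RGB triplets such as those in Table 1.
colors = np.array([[0.98, 0.92, 0.84], [0.65, 0.16, 0.16], [0.87, 0.72, 0.53],
                   [0.82, 0.41, 0.12], [1.00, 0.50, 0.31], [0.72, 0.53, 0.04]])
som = train_som(colors, rows=6, cols=6, epochs=100)
print(best_matching_unit(som, np.array([0.65, 0.16, 0.16])))  # node that 'brown' labels
```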
Other early artificial-neural-network models of self-organization have been presented, e.g., in [1], [3], and [54]. The SOM can also be viewed as a model of unsupervised machine learning, and as an adaptive knowledge representation scheme. The traditional knowledge representation formalisms – semantic networks, frame systems, predicate logic, to provide some examples – are static, and the reference relations of their elements are determined by a human. Moreover, those formalisms are based on the tacit assumption that the relationship between natural language and the world is one-to-one: the world consists of objects and the relationships between the objects, and these objects and relationships have a straightforward correspondence to the elements of language.

Fig. 1. A map of colors based on their red-green-blue values. The color symbols in the rightmost column of Table 1 are used in labeling the map: the best-matching unit is searched for each input sample and that node is labeled accordingly.

An alternative point of view is that the pattern recognition process must be taken into account: expressions of language refer to patterns and distributions of patterns in the concrete perceptual domain and often also in the abstract domain. One does not need to question the existence of the world in order to be critical towards the notion of entities or objects as a basis for epistemological considerations. Both anticipation and context influence perception and the naming process. This view is adopted, e.g., in constructivism. One early constructivist, Heinz von Foerster, has stated that objects and events are not primitive experiences but representations of relations. The construction of these relations is subjective, constrained by anatomical and cultural factors. The postulate of an external (objective) reality gives way to a reality that is determined by modes of internal computations [7]. This relativity or subjectivity of the interpretation of symbols does not, however, lead to arbitrariness, since language users can refine their interpretation models closer to each other through communication. The use of the SOM in modeling such a learning process is considered in [12].

3 SOM-Based Symbol Processing

The self-organizing map can be used in several ways when symbol processing is considered. One can create a map of symbols by associating each label with a numerical vector and finding the corresponding best-matching location on the map. Another approach is to encode each symbol with, e.g., a unique random vector and use this coding as the basis of the learning process [47].
The order of the map is based on presenting the encoded word with its context during the learning. The context can, e.g., be textual [13,47], or numerical measurements and representations [11]. The latter can originate from a visual source. An overview of connectionist, statistical and symbolic approaches in natural language processing and an introduction to several articles is given in [55]. 3.1 Maps of Words Contextual information has widely been used in statistical analysis of natural language corpora. Charniak [4] presents the following scheme for grouping or clustering words into classes that reflect the commonality of some property. 1. Define the properties that are taken into account and can be given a numerical value. 2. Create a vector of length n with n numerical values for each item to be classified. 3. Cluster the points that are near each other in the n-dimensional space. The open questions are: what are the properties used in the vector, the distance metric used to decide whether two points are close to each other, and Self-Organizing Maps in Symbol Processing 353 the algorithm used in clustering. The SOM does both vector quantization and clustering at the same time. Moreover, it produces a topologically ordered result. Word encoding Handling computerized form of written language rests on processing of discrete symbols. One useful numerical representation of written text can be obtained by taking into account the sentential context in which the words occur. Before utilization of the context information, however, the numerical value of the code should not imply any order to the words. Therefore, it will be necessary to use uncorrelated vectors for encoding. The simplest method to introduce uncorrelated codes is to assign a unit vector for each word. When all different word forms in the input material are listed, a code vector can be defined to have as many components as there are word forms in the list. As an example related to Table 1 shown earlier, the color symbols of Table 2 are here replaced by binary numbers that encode them. One vector element (column in the table) corresponds to one unique color symbol. Table 2. A simple example of input data for the SOM algorithm in order to obtain map of symbols. The three first columns correspond to red-green-blue values and the rest of the columns are used to code the color symbols as binary values. 0.250 0.165 0.222 0.210 0.255 0.184 ... 0.235 0.042 0.184 0.105 0.127 0.134 ... 0.215 0.042 0.135 0.030 0.080 0.011 ... 1 0 0 0 0 0 . 0 1 0 0 0 0 . 0 0 1 0 0 0 . 0 0 0 1 0 0 . 0 0 0 0 1 0 . 0 0 0 0 0 1 . ... ... ... ... ... ... ... The component of the vector where the index corresponds to the order of the word in the list is set to the value “1”, whereas the rest of the components are “0”. This method, however, is only practicable in small experiments. With a vocabulary picked from a even reasonably large corpus the dimensionality of the vectors would become intolerably high. If the vocabulary is large, the word forms can be encoded by quasi-orthogonal random vectors of a much smaller dimensionality [47]. Such random vectors can still be considered to be sufficiently dissimilar mutually and not to convey any information about the meaning of the words. Mathematical analysis of the dimensionality reduction and the random encoding of the word vectors is presented in [47] and [21]. The random encoding can also be motivated from the linguistic point of view. The appearance of a word does not usually correlate with its meaning. 
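The two encoding schemes discussed above can be sketched as follows: exact one-hot unit vectors for a small vocabulary, and quasi-orthogonal random codes of a much smaller, fixed dimensionality for a large one. The dimensionality of 90 and the toy vocabulary are arbitrary illustrative choices of this sketch.

```python
import numpy as np

def one_hot_codes(vocabulary):
    """One unit vector per word form: exact, but dimension = vocabulary size."""
    eye = np.eye(len(vocabulary))
    return {w: eye[i] for i, w in enumerate(vocabulary)}

def random_codes(vocabulary, dim=90, seed=0):
    """Random codes of fixed, much smaller dimensionality; normalized Gaussian
    vectors are nearly orthogonal in high dimensions (quasi-orthogonality)."""
    rng = np.random.default_rng(seed)
    codes = {}
    for w in vocabulary:
        v = rng.normal(size=dim)
        codes[w] = v / np.linalg.norm(v)
    return codes

vocab = ["the", "cat", "sat", "on", "mat"]
oh = one_hot_codes(vocab)
rc = random_codes(vocab)
print(oh["cat"] @ oh["sat"])                     # 0.0: strictly orthogonal
print(round(float(rc["cat"] @ rc["sat"]), 2))    # near 0: quasi-orthogonal
```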
However, it may be interesting to consider an encoding scheme in which the form of the words is taken into account to some extent. Motivation for such an experiment may stem, for instance, from the attempt to model aphasic phenomena. 354 T. Honkela Map Creation and Implicit Categories The basic steps for creating maps of words are given in the following. 1. A unique random vector is created for each word form in the vocabulary. 2. All the instances of the word under consideration, so-called key words, are found in the text collection. The average over the contexts of each key word is calculated. The random codes formed in step 1 are used in the calculation. The context may consist of, e.g., the preceding and the succeeding word, or some other window over the context. As a result each key word is associated with a contextual fingerprint. 3. Each vector formed in step 2 is input to the SOM. The resulting map is labeled after the training process by inputing the input vectors once again and by naming the best-matching neurons according to the key word part of the input vector. The averaging process is well motivated when the computational point of view is considered. The number of training samples is reduced considerably in the averaging. The areas on a map of words can be considered as implicit categories or classes that have emerged during the learning process. Consider, for instance, Figure 2 in which some syntactic classes have emerged on a map of words. The overall organization of this map of words reflects syntactical categories. The context of the analysis has consisted of the immediate neighboring words. Single nodes can be considered to serve as adaptive prototypes. Each prototype is involved in the adaptation process in which the neighbors influence each other and the map is gradually finding a form in which it can best represent the input. Local organization of a map of words seems to follow semantic features. Often a node becomes labeled by several symbols that are synonyms, antonyms or otherwise belong to a closed class (see, e.g., [6,19]). Often the class borders can be detected by analyzing the distances between the prototype vectors in the original input space [51,52]. The prototype theory of concepts involves that concepts have a prototype structure and there is no delimiting set of necessary and sufficient conditions for determining category membership that can also be fuzzy. Instances of a concept can be ranked in terms of their typicality. Membership in a category is determined by the similarity of an object’s attributes to the category’s prototype. The development of prototype theory is based on the works by, e.g., Rosch [48] and Lakoff [29]. MacWhinney [32] discusses the merits and problems of the prototype theory. He mentions that prototype theory fails to place sufficient emphasis on the relations between concepts. MacWhinney also points out that prototype theory has not covered the issue of how concepts develop over time in language acquisition and language change, and, moreover, it does not provide a theory of representation. MacWhinney’s competition model has been designed to overcome these deficits. Recently, MacWhinney has presented a model of emergence in language based on the SOM [33]. His work is closely related to the adaptive prototypes of the maps of words. In MacWhinney’s experiments the SOM is used to encode auditory and semantic information about words. 
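To make step 2 of the map-of-words procedure above concrete, the sketch below builds the averaged context fingerprint of each key word from the random codes of its immediate neighbors in a token stream; each fingerprint (together with the key word part) is then input to the SOM, and the best-matching nodes are labeled with the key words. The window of one preceding and one succeeding word follows the description above; the concatenation order and the toy corpus are assumptions of this sketch.

```python
import numpy as np

def context_fingerprints(tokens, codes):
    """Average context (code of preceding word, code of succeeding word) for
    every key word; `codes` maps each word form to its random code (step 1)."""
    dim = len(next(iter(codes.values())))
    sums, counts = {}, {}
    for i in range(1, len(tokens) - 1):
        key = tokens[i]
        ctx = np.concatenate([codes[tokens[i - 1]], codes[tokens[i + 1]]])
        sums[key] = sums.get(key, np.zeros(2 * dim)) + ctx
        counts[key] = counts.get(key, 0) + 1
    return {w: sums[w] / counts[w] for w in sums}

# Each fingerprint, optionally concatenated with the key word's own code, is fed
# to the SOM; after training, the best-matching node of each fingerprint is
# labeled with its key word (step 3).
tokens = "the cat sat on the mat and the cat sat on the dog".split()
rng = np.random.default_rng(0)
codes = {w: rng.normal(size=30) for w in set(tokens)}
prints = context_fingerprints(tokens, codes)
print(sorted(prints), prints["cat"].shape)   # one 60-dimensional fingerprint per key word
```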
Self-Organizing Maps in Symbol Processing am be PAST PARTICIPLE AND PAST when TENSE now VERBS MODAL VERBS PRESENT TENSE VERBS 355 what where been was is ADVERBS, PRONOUNS, PREPOSITIONS, ETC. INANIMATE NOUNS how PERSONAL PRONOUNS ANIMATE NOUNS Fig. 2. Emergent implicit classes on a map of words. The input consisted of the English translations of fairy tales collected by the Grimm brothers. The 150 most frequent words were mapped. The area in the middle consists of words in classes of adverbs, pronouns, prepositions only in a partial order. Some of the individual words have been shown. Detailed results are presented in [13]. Also Gärdenfors’ recent work (e.g., [9,10]) is very closely related to the issue of adaptive prototypes. Mitra and Pal have studied the relationship between fuzziness, self-organization and inferencing [42] and the use of the SOM as a fuzzy classifier [41]. Handling Ambiguity The main disadvantage of the averaging process described earlier seems to be that information related to the varying use of single words is lost. However, it is entirely possible to use the SOM to cluster the contexts of a word to obtain information about the potential ambiguity of a word. Such a study has been conducted in [45]. Gallant [8] has presented a disambiguation method based on neural networks. In a study with similar objectives, [50] used co-occurence information to create lexical spaces. The dimensionality reduction was based on singular value decomposition. An automatic method for word sense disambiguation was developed: a training set of contexts is clustered, each cluster is assigned a sense, and the sense of the closest cluster is assigned to the new occurrences. Schütze used two clustering methods to determine the sense clusters. Related Work Miikkulainen has widely used the SOM to create a model of story comprehension. The SOM is used to make conceptual analysis of the words 356 T. Honkela appearing in the phrases [37,38,40]. A model of aphasia based on the SOM is presented in [39]. The model consists of two main parts: the maps for the lexical symbols in the different input and output modalities, and the map for the lexical semantics. The SOM appears to be appealing model for aphasia. Consider, for instance, a situation in which the word “lion” is used instead of “tiger”, i.e., a case of neighboring items in a semantic map. On the other hand, the use of “sing” instead of “sink” corresponds to a small error in a phonetic map. Scholtes has used the SOM to parsing and several other natural language processing tasks such as filtering in information retrieval [49]. Several authors have presented methods for creating maps of documents rather than maps of words, e.g., [20,30,36]. The basic idea is to provide a visual display for exploration of text database. On the display two documents appear close to each other if they are similar in content. In [14,15], a map of words is used as a filtering preprocessor of the documents to be mapped. 3.2 Modeling Gradience and Non-symbolic Representations The world is continuous and changing, and, thus, the language is a medium of abstraction rather than a tool to create an exact “picture” of selected portions of the world. In the abstraction process, the relationship between a language and the world is one-to-many in the sense that a single word or expression in language is most often used to refer to a set or to a continuum of situations in the world. 
In order to be able to model the relationship between language and world, the mathematical apparatus of the predicate logic, for instance, does not seem to provide enough representational power. One way of enhancing the representation is to take into account the unclear boundaries between different concepts. Many names have been used to refer to this phenomenon such as gradience, fuzziness, impreciseness, vagueness, or fluidity of concepts. The possibility of abandoning the predetermined discrete, symbolic features is worth consideration. However, a remark on the notion of ’symbol’ may be necessary: the basic idea is to consider the possibility of grounding the symbols based on the unsupervised learning scheme. The symbols are used on the level of communication and may be used as the labels for the usually continuous multidimensional conceptual spaces. Symbols serve also as a means for compressing information. Figure 3c outlines a scheme for “breaking up” the symbols. If suitable “raw data” are used, it may not be necessary to use any intermediate levels in facilitating interpretation. For example, de Sa [5] proposes a model of learning in a cross-modal environment based on the SOM. The SOM is used to associate input from different modalities (pictures, language). De Sa’s practical experiments use, however, input for which the interpretation of the features is given beforehand. Nenov and Dyer [43,44] present an ambitious model and experiments on perceptually grounded language learning which is based on creating associations between linguistic expressions and visual images. Hyötyniemi [16] discusses the semantic considerations when dealing with mental models presenting three levels: features based on raw observations, patterns, and categories. Self-Organizing Maps in Symbol Processing 357 A mathematical framework for continuous formal systems is given in [31]; it can be based on, e.g., Gabor filters. Traditionally, the filters have to be designed manually. To facilitate automatic extraction of features, the ASSOM method [23,26] could be used. For instance, the ASSOM is able to learn Gabor-like filters automatically from the input image data. It remains to be seen, though, what kinds of practical results can be acquired by aiming at still further autonomy in the processing, e.g., by combining uninterpreted speech and image input. (a) degree of membership 1 1 .75 (b) degree of membership .75 "tallness" .5 .5 time in history sex .25 .25 "tallness" 0 0 height height etc. degree of membership age (c) specification of membership a non−symbolic feature space with reduced dimensionality nonlinear mapping based on adaptation original multidimensional space Fig. 3. Three stages of modeling continuity in the relation between linguistic expressions and the referenced continuous phenomena. In (a) a traditional view on fuzzy set theory is provided: the fuzziness of a single feature is represented by a curve in a one-dimensional space. The second alternative (b) points out the need to consider multidimensional cases: the degree of membership related to “tallness” of a person is not only based on the size of the person. The furthest scheme (c) is based on processing of continuous, uninterpreted “raw data” [15]. 358 3.3 T. Honkela Contextuality, Subjectivity and Intersubjectivity Considerable number of person years have been spent in coding the knowledge representations by means of traditional AI in terms of entities, rules, scripts, etc. 
It seems, however, that the qualitative problems are not satisfactorily solved by quantitative means. The world is changing all the time, and, perhaps still more importantly, the symbolic descriptions are not grounded. Symbol grounding, embodiment and their connectionist modeling is a central topic, e.g., in [53] and [46]. The contextuality of interpretation is easily neglected, being, nevertheless, a very commonplace phenomenon in natural language (see, e.g., [18]). The selforganizing map is a suitable method for contextual modeling: instead of handling symbols or variables separately the SOM can be used in associating them with the relevant context, or even in evaluating the relevancy of the context. It seems that the SOM can be used in modeling the individual use of language: to create maps of subjective use of language based on examples, and, furthermore, to model intersubjectivity, i.e., to have a map that also models the contents of other maps (see, e.g., [12]). Two persons may have different conceptual or terminological “density” of the topic under consideration. A layman, for instance, is likely to describe a phenomenon in general terms whereas an expert uses more specific terms. However, in communication individuals tune in to each other’s language use. De Boer has simulated the emergence of realistic vowel systems in a population of agents that try to imitate each other as well as possible. The agents start with no knowledge of the sound system at all. Through communication a coherent vowel system emerges [2]. Similar thorough experiment using the SOM for conceptual emergence in an agent community still remains to be conducted. Honkela [12] proposes a model that adds a third vector element in addition to symbol part and context part: the specification of the identity of the utterer of the expressions. Thus, the network would be selective according to the “listener”. This kind of enhanced map would provide a means for selecting the terms that can be used in communication. The detailed map could, for instance, include many expressions for different versions of a certain color among which a general term or a specific term would become selected based on the listener. This kind of use of the SOM provides a model of subjectivity and intersubjectivity: strictly speaking every “SOM-agent” has a symbol mapping of its own but the mappings are adapted through interactions to become similar enough to enable meaningful communication. 4 Conclusions Natural language understanding as a field of artificial intelligence requires, for instance, means to model emergence of conceptual systems, intersubjectivity in communication, interpretation of expressions with contextual cues, and symbol grounding into perceptual spaces. If such means are available the methods can be considered to be relevant also in cognitive science and epistemology. In this Self-Organizing Maps in Symbol Processing 359 paper, critical view on the traditional means to model natural language understanding and symbol processing has been given. Especially, the “purely symbolic methods”, e.g., systems based on predicate logic or semantic networks, appear not to be sufficient to tackle the phenomena mentioned above. As an alternative, the self-organizing map (SOM) has been considered in this paper. There are several results based on the SOM that cover relevant areas of the overall phenomena and thus the SOM seems to be a promising alternative for the more traditional approaches. 
There are, of course, related methods that can be used for similar purposes but the SOM covers several of the modeling requirements at the same time and, moreover, serves as a neurophysiologically grounded cognitive model. However, the basic SOM as such does not suit very well for processing structured information but one can combine the SOM with, e.g., recurrent networks to obtain better coverage of both structural and contentrelated linguistic phenomena (see, e.g., [34]). Moreover, the SOM principle can be generalized through adoption of principles of evolutionary computation which gives possibility of finding spatial order based, e.g., on functional similarity of potentially highly structured input samples [17,28]. Acknowledgements This work is mainly based on the results gained during my three-year stay at Academy Professor Teuvo Kohonen’s Neural Networks Research Center in Helsinki University of Technology. I wish to thank Professor Kohonen for his most invaluable advice and guiding. I am also grateful to Dr. Samuel Kaski, Dr. Jari Kangas, Ms Krista Lagus, Dr. Dieter Merkl, Prof. Erkki Oja, Mr. Ville Pulkki, and many others. References 1. Amari, S.-I.: A theory of adaptive pattern classifiers. IEEC, 16:299–307 (1967) 2. Boer, B. de: Investigating the Emergence of Speech Sounds. Proceedings of IJCAI’99, International Joint Conference on Artificial Intelligence. Dean, T. (ed.), Morgan Kaufmann, Vol. I, pp. 364–369 (1999) 3. Carpenter, G. and Grossberg, S.: A massively parallel architecture for a selforganizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37:54–115 (1987) 4. Charniak, E.: Statistical Language Learning. MIT Press, Cambridge, MA (1993) 5. de Sa, V.: Unsupervised Classification Learning from Cross-Modal Environmental Structure. PhD thesis, University of Rochester, Department of Computer Science, Rochester, New York (1994) 6. Finch, S. and Chater, N.: Unsupervised methods for finding linguistic categories. I. Aleksander and J. Taylor, eds., Artificial Neural Networks, vol. 2, pp. II-1365-1368, North-Holland (1992) 7. Foerster, H. von: Notes on an epistemology for living things. Observing Systems, Intersystems publications, pp. 258-271 (1981) 360 T. Honkela 8. Gallant, S. I.: A practical approach for representing context and for performing word sense disambiguation using neural networks. ACM SIGIR Forum, 3(3):293– 309 (1991) 9. Gärdenfors, P.: Mental representation, conceptual spaces and metaphors. Synthese, 106:21–47 (1996) 10. Gärdenfors, P.: Philosophy and Cognitive Science, chapter Conceptual spaces as a framework for cognitive semantics, pp. 159–180. Kluwer, Dordrecht (1996) 11. Honkela, T. and Vepsäläinen, A.M.: Interpreting imprecise expressions: experiments with Kohonen’s Self-Organizing Maps and associative memory. Proceedings of ICANN-91, International Conference on Artificial Neural Networks, T. Kohonen, K. Mäkisara, O. Simula and J. Kangas (eds.), North-Holland, vol. I, pp. 897-902 (1991) 12. Honkela, T.: Neural Nets that Discuss: A General Model of Communication Based on Self-Organizing Maps. Proceedings of ICANN-93, International Conference on Artificial Neural Networks, Amsterdam, Gielen, S. and Kappen, B. (eds.), SpringerVerlag, London, pp. 408-411 (1993) 13. Honkela, T., Pulkki, V. and Kohonen, T.: Contextual relations of words in Grimm tales analyzed by self-organizing map. Proceedings of ICANN-95, International Conference on Artificial Neural Networks, F. Fogelman-Soulié and P. Gallinari (eds), vol. 
2, EC2 et Cie, Paris, pp. 3-7 (1995) 14. Honkela, T., Kaski, S., Lagus, T., and Kohonen, T.: Newsgroup exploration with WEBSOM method and browsing interface. Technical report A32, Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo, Finland (1996) 15. Honkela, T.: Self-Organizing Maps in Natural Language Processing. PhD Thesis, Helsinki University of Technology, Espoo, Finland (1997) See http : //www.cis.hut.f i/∼tho/thesis/ 16. Hyötyniemi, H.: On mental images and ’computational semantics’. Proceedings of Finnish Artificial Intelligence Conference, Finnish Artificial Intelligence Society, Espoo, pp. 199–208 (1998) 17. Nissinen, A.S., and Hyötyniemi, H.: Evolutionary Self-Organizing Map Proceedings of EUFIT’98: European Congress on Intelligent Techniques and Soft Computing, pp. 1596-1600 (1998) 18. Hörmann, H.: Meaning and Context. Plenum Press, New York (1986) 19. Kaski, S., Honkela, T., Lagus, K., and Kohonen, T.: Creating an order in digital libraries with self-organizing maps. In Proceedings of WCNN’96, World Congress on Neural Networks (1996) 20. Kaski, S., Honkela, T., Lagus, K., and Kohonen, T.: WEBSOM—Self-Organizing Maps of Document Collections. Neurocomputing, 21:101-117 (1998) 21. Kaski, S.: Dimensionality reduction by random mapping: Fast similarity computation for clustering. Proceedings of IJCNN’98, International Joint Conference on Neural Networks (1998) 22. Kohonen, T.: Self-organizing formation of topologically correct feature maps. Biological Cybernetics, 43(1):59–69 (1982) 23. Kohonen, T.: The Adaptive-Subspace SOM (ASSOM) and its use for the implementation of invariant feature detection. Fogelman-Soulié, F. and Gallinari, P., editors, Proceedings of ICANN’95, International Conference on Artificial Neural Networks, volume I, pp. 3–10, Nanterre, France. EC2 (1995) 24. Kohonen, T.: Self-Organizing Maps. Springer, Berlin, Heidelberg (1995) Self-Organizing Maps in Symbol Processing 361 25. Kohonen, T., Hynninen, J., Kangas, J., and Laaksonen, J.: SOM PAK: The SelfOrganizing Map program package. Report A31, Helsinki University of Technology, Laboratory of Computer and Information Science (1996) 26. Kohonen, T., Kaski, S., Lappalainen, H., and Salojärvi, J.: The adaptivesubspace self-organizing map (ASSOM). Proceedings of WSOM’97, Workshop on Self-Organizing Maps, Espoo, Finland, June 4-6, pp. 191–196. Helsinki University of Technology, Neural Networks Research Centre, Espoo, Finland (1997) 27. Kohonen, T., and Hari, R.: Where the abstract feature maps of the brain might come fromo. Trends In Neurosciences, 22(3): 135-139 (1999) 28. Kohonen, T.: Fast Evolutionary Learning with Batch-Type Self-Organizing Maps. Neural Processing Letters, Kluwer, 9:153–162 (1999) 29. Lakoff, G.: Women, Fire and Dangerous Things. University of Chicago Press, Chicago (1987) 30. Lin, X., Soergel, D., and Marchionini, G.: A self-organizing semantic map for information retrieval. Proceedings of 14th. Ann. International ACM/SIGIR Conference on Research & Development in Information Retrieval, pp. 262–269 (1991) 31. MacLennan, B.: Continuous formal systems: A unifying model in language and cognition. Proceedings of the IEEE Workshop on Architectures for Semiotic Modeling and Situation Analysis in Large Complex Systems, August 27-29, 1995, Monterey, CA (1995) 32. MacWhinney, B.: Linguistic categorization, chapter Competition and Lexical Categorization. Benjamins, New York (1989) 33. 
MacWhinney, B.: Cognitive approaches to language learning, chapter Lexical Connectionism. MIT Press (1997) 34. Mayberry, M.R., and Miikkulainen, R.: SardSrn: A Neural Network Shift-Reduce Parser. Proceedings of IJCAI’99, International Joint Conference on Artificial Intelligence. Dean, T. (ed.), Morgan Kaufmann, Vol. II, pp. 820–825 (1999) 35. McCawley, J.D.: Everything that Linguists have always Wanted to Know about Logic but were ashemed to ask. Basil Blackwell, London (1981) 36. Merkl, D.: Self-Organization of Software Libraries: An Artificial Neural Network Approach. PhD thesis, Institut für Angewandte Informatik und Informationssysteme, Universität Wien (1994) 37. Miikkulainen, R.: DISCERN: A Distributed Artificial Neural Network Model of Script Processing and Memory. PhD thesis, Computer Science Department, University of California, Los Angeles, Tech. Rep UCLA-AI-90-05 (1990) 38. Miikkulainen, R.: Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon, and Memory. MIT Press, Cambridge, MA (1993) 39. Miikkulainen, R.: Self-organizing feature map model of the lexicon. Brain and Language, 59:334–366 (1997) 40. Miikkulainen, R. and Dyer, M. G.: Natural language processing with modular neural networks and distributed lexicon. Cognitive Science, 15:343–399 (1991) 41. Mitra, S. and Pal S.: Self-organizing neural network as a fuzzy classifier. IEEE Transactions on Systems, Man and Cybernetics, 24(3):385–99 (1994) 42. Mitra, S. and Pal, S.: Fuzzy self-organization, inferencing, and rule generation. IEEE Transactions on Systems, Man & Cybernetics, Part A [Systems & Humans], 26(5):608–20 (1996) 43. Nenov, V. I. and Dyer, M. G.: Perceptually grounded language learning: Part 1 A neural network architecture for robust sequence association. Connection Science, 5(2):115–138 (1993) 44. Nenov, V. I. and Dyer, M. G.: Perceptually grounded language learning: Part 2 DETE: a neural/procedural model. Connection Science, 6(1):3–41 (1994) 362 T. Honkela 45. Pulkki, V.: Data averaging inside categories with the self-organizing map. Report A27, Helsinki University of Technology, Laboratory of Computer and Information Science, Espoo, Finland (1995) 46. Regier, T.: A model of the human capacity for categorizing spatial relations. Cognitive Linguistics, 6(1):63–88 (1995) 47. Ritter, H. and Kohonen, T.: Self-organizing semantic maps. Biological Cybernetics, 61(4):241–254 (1989) 48. Rosch, E.: Studies in cross-cultural psychology, vol. 1, chapter Human categorization, pp. 3–49. Academic Press, New York (1977) 49. Scholtes, J. C.: Neural Networks in Natural Language Processing and Information Retrieval. PhD thesis, Universiteit van Amsterdam, Amsterdam, Netherlands (1993) 50. Schütze, H.: Dimensions of meaning. Proceedings of Supercomputing, pp. 787–796 (1992) 51. Ultsch, A.: Self-organizing neural networks for visualization and classification. Opitz, O., Lausen, B., and Klar, R., editors, Information and Classification, pp. 307–313, London, UK. Springer (1993) 52. Ultsch, A. and Siemon, H.: Kohonen’s self organizing feature maps for exploratory data analysis. Proceedings of INNC’90, International Neural Network Conference, pp. 305–308, Dordrecht, Netherlands. Kluwer (1990) 53. Varela, F. J., Thompson, E., and Rosch, E.: The Embodied Mind: Cognitive Science and Human Experience. MIT Press, Cambridge, Massachusetts (1993) 54. von der Malsburg, C.: Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14:85–100 (1973) 55. 
Wermter, S., Riloff, E., and Scheler, G.: Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing. Springer Verlag, New York (1996) Evolution of Symbolisation: Signposts to a Bridge between Connectionist and Symbolic Systems Ronan G. Reilly Department of Computer Science University College Dublin Belfield Dublin 4 Ireland Ronan.Reilly@ucd.ie http://www.cs.ucd.ie/staff/rreilly Abstract. This paper describes recent attempts to understand the evolution of language in humans and argues that useful lessons can be learned from this analysis by designers of hybrid symbolic/connectionist systems. A specification is sketched out for a biologically grounded hybrid system motivated by our understanding of both the evolution and development of symbolisation in humans. 1 Introduction If one does not subscribe to the theory of a serendipitous emergence of language in our ancestors several hundred thousand years ago, one is faced with a conundrum. If our language capacity has been subject to the forces of evolution, what were its intermediate stages, what foundation was it constructed upon? An answer to this question would help us understand how language is achieved in contemporary brains, and go some way to helping us build more effective natural language processing (NLP) systems. In particular it would help us overcome what I believe are the significant obstacles confronting pure connectionist approaches to NLP, and lead to better motivated hybrid designs. The question of intermediate stages in the emergence of language has recently been addressed by Terrence Deacon [4] in his book The Symbolic Species. Deacon argues for the existence of an evolutionary path in the development of the relationship between sign and signified that proceeds from an iconic relationship, through a process of temporal and spatial indexicalisation (i.e., association), to an arbitrary and culturally licensed symbolic relationship. The purpose of this paper is to highlight possible lessons to be learned from the evolution of symbolisation in early hominids, as described by Deacon, that might support our efforts to improve the symbolic capabilities of connectionist networks. S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 363–371, 2000. c Springer-Verlag Berlin Heidelberg 2000 364 2 R.G. Reilly An Evolutionary Perspective on Symbolisation Similarly to C.S. Pierce, Deacon distinguishes between three categories of sign: iconic, indexical, and symbolic. Iconic signs have some physical similarity with what they signify, indexical signs are related to their referent either spatially or temporally, and symbolic signs have an arbitrary relationship with their referent. In keeping with an evolutionary perspective, Deacon makes the case that symbol systems emerged in human species through a process that moved from iconic through indexical to symbolic sign usage. Each succeeding level of sign usage subsumed the preceding one. Thus indexical signs are built upon spatio– temporal relationships between icons, and symbols are constructed upon relationship between indices and most importantly on relationship between other symbols. So, for example, a child learning to read, first encounters printed words as iconic of print in general, much in the same way as someone who doesn’t read Chinese might look upon a book of Chinese characters. Each character is equally iconic (trivially) of written language. 
As the child learns the writing system she proceeds to a stage where the written signs index the spoken language and her perceptual environment in systematic ways. Finally, the relationships among these indices allow her to access the symbolic aspects of the words. However, the written words acquire their symbolic status not simply by standing for something in an associationstic sense, but from the mesh of inter-symbolic relationships in which they are embedded. This contrasts with the naive view of symbolisation that sees the acquisition of indexical reference as the essence of symbolisation. According to Deacon, this is a necessary stage in the development of symbolisation, but by no means the full story. One of Deacon’s central points is that there is an isomorphism between the stages leading to symbol development and the transition from perception (of icons), through associationistic learning (of indices), through cognising (of symbols). Only humans have reached the final stage, though Deacon argues that the bonobo chimpanzee, Kanzi [19], has attained a level of sophistication in the symbolic domain that poses a major challenge to language nativists. Overall, Deacon’s thesis is an alternative to the nativists’ arguments for the origins of language [1][12]. According to Deacon, symbolic cognition, of which language is just one manifestation, emerged as a result of an evolutionary dynamic between the selectional advantages provided by simple symbol usage and pre–frontal cortical changes that favoured and supported symbol usage. Deacon supports his case by looking at the comparative neural anatomy of apes and humans. He demonstrates that, contrary to conventional wisdom, overall brain size is not the key inter–species difference. Rather it is the disproportional growth of the pre–frontal cortex relative to other cortical areas. If we were to scale up a modern ape’s brain size to the correct proportions for a human body, the pre–frontal region in humans would be twice that of the scaled–up ape. More importantly, this scaling–up is not just of cortical mass, but of connectivity and the capacity to influence other cortical regions. An example of the role of the pre–frontal cortex in a natural primate environment would be its use in foraging behaviour. In foraging, an effective strategy is not to return to the locations that one has most recently visited, since they Evolution of Symbolisation 365 are least likely to provide a food reward. Thus, one needs to be able to suppress the more basic drive to return to where one has previously had one’s behaviour reinforced. In humans pre–frontal enlargement biases us to a specific style of learning. This involves, among other things, the ability to detach from the immediate perceptual demands of a task, to switch between alternative courses of action, to delay an immediate response to a stimulus, to tune into higher–order features of the stimulus environment, and so on. Deacon argues that these capabilities are essential pre–requisites for a facility with symbols. They provide the necessary attentional and mnemonic resources for the task of symbol learning. Damage to the frontal areas of the cortex in humans gives rise to subtle impairments in tasks involving so–called executive function. The deficits appear to lie in an inability to start, stop or modulate an action. In neuropsychology, this can often be observed in tests of fluency or card sorting. 
Patients with pre–frontal damage, for example, have difficulty in performing tasks involving the random generation of words of a given semantic category. Studies involving the teaching of language skills to chimpanzees [19], have shown that they could with the aid of external symbolic devices (e.g., boards of lexigrams) improve their symbol acquisition and manipulation skills considerably beyond those of non–symbol using chimpanzees. What appears to have occurred in human evolution, and specifically as a result of the enlargement of the pre– frontal area is the internalisation of the functional equivalents of these external mnemonic aids. It is a mistake, however, to view the pre–frontal area as a brain region that assumes its functional role at the same time as all other regions. Indeed, there is a rather specific pattern to the way in which all of the cortical regions develop. In the next section I will argue that the pattern of maturation plays an important role in the unfolding of the functional capabilities of the pre–frontal and other brain regions. 3 Maturational Wave One of the paradoxes of artificial neural network research is that the capabilities of artificial neural networks fall far short of those of the real thing, yet the learning algorithm(s) employed by real neural networks may well be considerably simpler and less powerful than, say, error backpropagation [17]. The evidence to date suggests that some variant of Hebb’s rule, possibly mediated by NMDA–based long term potentiation, may be the dominant learning rule in natural nervous systems [3]. The problem is that Hebb’s rule cannot be used to learn even a trivial higher–order function such as XOR, at least not directly. So this presents the evolutionary account of the emergence of symbolisation with a problem. If the symbolic system is indeed built upon layers of indexical and iconic relationships, we don’t have, prima facie, a biological learning algorithm that is up to the job of detecting the type of second–order features of the environment necessary for symbol use. Not only do we have an impoverishment of stimulus 366 R.G. Reilly if the nativist position is to be believed, but we also have an impoverishment of learning mechanism. Shrager and Johnson [20], however, in an elegant computational study demonstrated that the inclusion of a wave of learning plasticity passing through a model cortex permitted the acquisition of higher–order functions using only simple Hebbian learning. There is good evidence that just such a modulation of plasticity occurs during cortical development [21]. An important feature of the process is that it affects the sensory–motor regions of the cortex initially and then moves through to regions more distal from sensory–motor areas. In Shrager and Johnson’s model, this gives rise to the emergence of higher–order feature detectors in these distal areas that use outputs of low–order feature detectors in the sensory areas. There are obvious implications in this process for the role of the pre–frontal cortex in the acquisition of symbolisation. By virtue of its remove from sensory regions — it receives no direct sensory inputs — and the fact that it is one of the last regions to mature, the pre–frontal area can be considered to be a configurable resource with the capacity to be sensitive to higher–order features of input from other cortical regions. This makes it an obvious candidate for mediating the acquisition of symbolic relationships. 
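To make the maturational-wave idea concrete, the following is a minimal sketch, not Shrager and Johnson's actual model: plain Hebbian updates whose learning rates are gated by a plasticity window that reaches a "sensory" stage early and a more distal stage only later. The layer sizes, schedules and tanh nonlinearity are illustrative assumptions.

```python
import numpy as np

def hebbian_step(w, pre, post, lr):
    # Plain Hebbian update: strengthen weights between co-active units.
    return w + lr * np.outer(post, pre)

def plasticity(t, onset, width):
    # A window of heightened plasticity centred on `onset`;
    # stands in for the maturational wave moving across the cortex.
    return np.exp(-((t - onset) ** 2) / (2 * width ** 2))

rng = np.random.default_rng(0)
w_sensory = rng.normal(0, 0.1, (4, 2))   # stimulus -> low-order features
w_distal  = rng.normal(0, 0.1, (2, 4))   # low-order -> higher-order features

for t in range(200):
    x = rng.integers(0, 2, 2).astype(float)   # binary stimulus
    h = np.tanh(w_sensory @ x)                # "sensory" feature activity
    y = np.tanh(w_distal @ h)                 # "distal" (e.g. pre-frontal) activity
    # Sensory weights are plastic early, distal weights only later, so the
    # distal stage learns over already-formed low-order features.
    w_sensory = hebbian_step(w_sensory, x, h, 0.05 * plasticity(t, onset=50, width=20))
    w_distal  = hebbian_step(w_distal,  h, y, 0.05 * plasticity(t, onset=150, width=20))
```

Plain Hebbian learning on its own captures only pairwise correlations; the point of the staggered schedule is that the later-maturing stage can then pick up higher-order structure over features the earlier stage has already formed.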
In a sense, therefore, there may be nothing especially “symbolic” about the capabilities of the pre–frontal cortex; it was a relatively uncommitted computational resource that got exploited by evolution. Moreover, other factors in addition to its uncommittedness, may also have ensured it took on a central role in the emergence of symbolisation. Evidence for this comes from research by Patricia Greenfield [9] on the evolution of Broca’s area, which suggests that its prior computational function provided a useful building block in its subsequent evolution as a centre for handling language syntax (I will explore this idea in more detail in a later section). It is my contention that something similar may also have happened with respect to the pre–frontal area, and a good candidate for a re–usable computational resource in the pre–frontal cortex was the foraging behaviour of our primate ancestors. 4 Language Development Rebotier and Elman [14] have argued that there are suggestive parallels between the results of Shrager and Johnson’s model [20] and the language learning modelling of Elman and his colleagues [5]. Elman has demonstrated an interaction between developing cognitive capacity and the acquisition of rule–like behaviour of increasing complexity. In particular, he showed that a simple recurrent network (SRN) could not learn a complex context–free grammar without an initial limitation in its memory capacity. Elman’s experiments involved training an artificial neural network to learn a small artificial language with many of the characteristics of real English. The sample sentences he used in training comprised only grammatical sentences. When he first tried to train a network to learn this language it failed to do so. However, he discovered that if he limited the network’s memory capacity early in training and then gradually increased it, the network could ultimately master the grammar. This gradual increase in capacity is analogous to what happens to a child’s cognitive abilities as he or Evolution of Symbolisation 367 she develops. Elman’s network succeeded in learning the complex grammar by first acquiring a simpler version of it. What is particularly interesting about this finding is that, counter-intuitively, a capacity limitation early in development turns out to be an advantage. Elman’s finding has, I believe, significant implications for the nativist position on language acquisition. Given that a child learning a language is exposed to predominantly grammatical utterances, Chomsky [1] and others have argued that some innate knowledge of language is needed to circumvent the obstacle to learning identified by Gold [7]. The latter proved that grammars of the complexity of natural language are unlearnable from positive examples alone. With Elman’s finding we can discern the shape of a response to Chomsky and Gold’s position. If we take account of the fact that language is acquired by a child with certain cognitive capacity limitations, these limitations act as a filter on the complexity of the language, transforming it into one that is learnable from positive instances alone. The incremental expansion of these capacities allows additional features of the language to be constructed on a simpler foundation. There appears to be some similarity between the arguments marshalled by those in favour hybrid approaches to NLP, and those of the linguistic nativists. 
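A schematic sketch of the "starting small" regime might look as follows. It shows only the forward pass of an Elman-style simple recurrent network together with a growing memory window implemented as periodic context resets; the class name, sizes and reset schedule are illustrative assumptions, not Elman's original setup, and weight training is omitted entirely.

```python
import numpy as np

class TinySRN:
    """Minimal Elman-style simple recurrent network (forward pass only)."""
    def __init__(self, n_in, n_hid, n_out, seed=0):
        rng = np.random.default_rng(seed)
        self.Wxh = rng.normal(0, 0.1, (n_hid, n_in))
        self.Whh = rng.normal(0, 0.1, (n_hid, n_hid))
        self.Why = rng.normal(0, 0.1, (n_out, n_hid))
        self.h = np.zeros(n_hid)

    def step(self, x):
        self.h = np.tanh(self.Wxh @ x + self.Whh @ self.h)
        return self.Why @ self.h

    def reset(self):
        self.h[:] = 0.0

def memory_span(epoch):
    # "Starting small": a short window at first, widening as training proceeds.
    return 3 + 2 * epoch

net = TinySRN(n_in=10, n_hid=20, n_out=10)
rng = np.random.default_rng(1)
for epoch in range(5):
    span = memory_span(epoch)
    for t, w in enumerate(rng.integers(0, 10, 200)):
        if t % span == 0:
            net.reset()   # wipe the context: dependencies longer than `span` are invisible
        _ = net.step(np.eye(10)[w])
```

The sketch is only meant to show how an effective memory limitation can be imposed early on and then relaxed; in Elman's experiments it is precisely this filtering of long-range structure that lets the complex grammar be acquired at all.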
Elman’s results suggest that employing the principle of starting small and paying attention to developmental issues may be one way to push further the capabilities of the connectionist component, and postpone the introduction of a hybrid mechanism to a stage where it is really needed. 5 Symbol Emergence Further support for Elman’s position comes from a study of language learning in chimpanzees [19]. Kanzi, a bonobo chimpanzee, acquired the ability to use a limited artificial language involving a board of lexigrams while being cared for by his mother. One of the more interesting features of this case is that Kanzi was not explicitly taught how to use the lexigrams himself, but acquired his ability incidentally while his mother was going through a training regime that ultimately proved ineffective for her. This appears to be analogous to the phenomenon described by Elman [5], where a complex grammar could only be acquired by a connectionist network by “filtering” it through an initially limited memory capacity, akin to the capacity limitations we find with children. What is important from an evolutionary point of view, however, is not the means by which Kanzi acquired a significant facility with symbols, but that he was able to do so at all. This suggests that somewhere back in hominid evolution conditions prevailed that facilitated and favoured the expression of an incipient symbol ability. This then set in train a process of dynamical co-evolution between brain structure and cognitive abilities that led to the emergence of the complex cognitive and linguistic skills that we manifest as a species. The evolutionary jump to symbolisation from an association-based indexical system, argued for by Deacon, requires an ability to discern certain key relationships between symbols. Although not referred to by Deacon [4] as such, these relationships are primarily those of compositionality and systematicity. These 368 R.G. Reilly are the very features identified by Fodor and Pylyshyn [6] as notably absent from connectionist approaches to language and cognition. While the compostionality criticism has been positively addressed to most people’s satisfaction by Van Gelder [22], systematicity still remains a keenly debated issue [10]. In the case of connectionist NLP models, it boils down to their ability (or lack of it) to generalise their behaviour to new words in new positions in test sentences. Notwithstanding some positive indications [2], current connectionist models do not as yet demonstrate the strong systematicity characteristic of a human language user. There are hybrid symbolic/connectionist solutions to the systematicity problem [11], but a more satisfactory solution would be a biologically grounded hybrid system motivated by our understanding of the evolution and development of symbolisation in humans. As yet, nothing of this sort exists, and all one can do at present is sketch out a specification for such a system. In the next section, I will summarise its main features. 6 Evolutionary and Developmental Re-Use As was discussed earlier, Greenfield [9] proposed that both language production and high–level motor programming are initially subserved by the same cortical region, which subsequently differentiates and specialises. This homology arises from the exploitation during the evolution of the human language capacity of the motor–planning capabilities of what is now Broca’s area. 
The object–assembly specialisation in this incipient Broca’s area effectively provided a re–usable computational resource for the evolution of a language production system. Greenfield [9] based her argument for the evolution of Broca’s region in part on developmental evidence. She observed that there are parallels in the developmental complexity of speech and object manipulation. In studying the object manipulation of children aged 11–36 months, she noted that the increase in complexity of their object combination abilities mirrored the phonological and syllabic complexity of their speech production. There are two possible explanations for this phenomenon: (1) It represents analogous and parallel development mediated by separate neurological bases; or (2) the two processes are founded on a common neurological substrate. Greenfield used evidence from neurology, neuropsychology, and animal studies to support her view that the two processes are indeed built upon an initially common neurological foundation, which then divides into separate specialized areas as development progresses. A significant part of Greenfield’s argument centered on the results of an earlier study [8] in which children were asked to nest a set of cups of varying size. The very young children could do little more than pair cups. Slightly older children showed a capacity to nest the cups, but employed what Greenfield referred to as a “pot” strategy as their dominant approach. This entailed the child only ever moving one cup at a time to carry out the task. Still older children tended to favor a “sub–assembly” strategy in which the children moved pairs or triples of stacked cups. At a given age one could detect a combination of different nesting strategies being employed, but with one strategy tending to dominate. Evolution of Symbolisation 369 In [15], I sought to explore some aspects of Greenfield’s thesis within a computational model. I described an abstract connectionist characterization of a language processing and object manipulation task involving the recursive auto–associative memory (RAAM) technique developed by Jordan Pollack [13]. RAAMs were choosen because they are a general representational scheme for hierarchical structures, and thus can cope equally well with both linguistic and motoric representations. They are neurally plausible on several counts. The representations are distributed, and neural representations appear also to be distributed. They preserve similarity relationships: RAAM encodings of structures that are similar, are themselves similar. This also appears to be a pervasive feature of neural representations [18]. The fact that RAAM representations are of fixed-width also gives them some degree of neural plausibility, since the neural pathways (i.e., cortico–cortical projections) along which their neural counterparts are transmitted, are also of fixed width. The simulations described in [15] involved training a simple recurrent network (SRN) to generate, on the basis of an input RAAM “plan”, a sequence of either object–assembly actions or phonemes. The results of the simulations indicated an advantage in terms of rate of error decline for networks that had prior training on a simulated object assembly task, when compared with various control conditions. This was taken as support for Greenfield’s view of a functional homology underlying both language and object assembly tasks. 
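As a rough illustration of the representational scheme, not of the trained networks used in [15], a RAAM composes two fixed-width child vectors into a parent vector of the same width and can expand a parent back into approximations of its children. The toy class below uses random, untrained weights purely to show the fixed-width recursive composition; Pollack's RAAM learns the encoder and decoder jointly by auto-association with backpropagation.

```python
import numpy as np

class TinyRAAM:
    """Schematic recursive auto-associative memory (untrained weights)."""
    def __init__(self, width, seed=0):
        rng = np.random.default_rng(seed)
        self.enc = rng.normal(0, 0.5, (width, 2 * width))   # (left, right) -> parent
        self.dec = rng.normal(0, 0.5, (2 * width, width))   # parent -> (left, right)

    def compose(self, left, right):
        return np.tanh(self.enc @ np.concatenate([left, right]))

    def decompose(self, parent):
        out = np.tanh(self.dec @ parent)
        return out[:len(parent)], out[len(parent):]

raam = TinyRAAM(width=8)
the, dog, barked = (np.random.default_rng(i).normal(size=8) for i in range(3))
np_phrase = raam.compose(the, dog)            # [the dog]
sentence = raam.compose(np_phrase, barked)    # [[the dog] barked], still only 8 units wide
left, right = raam.decompose(sentence)        # approximate reconstruction of the children
```

Because every constituent, however deeply nested, is squeezed into the same fixed-width vector, such encodings can be handed to an SRN as a "plan", which is roughly the role they play in the simulations described above.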
To understand why the object assembly task provided a significant training advantage, it is useful to look at the control condition in which prior training was given first on the language task and then on the object assembly task. In this case, the prior language training did not benefit the SRN in learning the object assembly task. This suggests that the advantage provided by initial training on object assembly may be another case of the ”importance of starting small” in the sense of Elman [5], and already discussed above. The tree structures associated with object assembly were relatively simple, with few deep embeddings. The language RAAM structures were more complex in this respect. Elman [5] demonstrated in his grammar-learning studies that if a complex grammar is to be learned by an SRN, the network must first be trained on a simpler version of the grammar. In the model described in [15], when initial training was provided on the language task, the network was being trained on complex structures first and then simpler ones, the reverse of the approach that Elman found to be effective. It is unsurprising, therefore, that training is retarded when the complexity ordering is reversed. As well as demonstrating the viability of the re–use position, the model also successfully reproduced the relative difficulty that children have in producing the “pot” and “sub-assembly” strategies. In addition, there appeared to be a divergence in the exploitation of hidden-unit space as the complexity of the language task increased. This partitioning of the hidden unit space was analogous to what Greenfield argues occurs during the development of Broca’s region in children; early in development the one cortical region subsumes both object assembly and language tasks, but as development progresses the upper part of the region specialises for object assembly and the lower part for language. What 370 R.G. Reilly the model demonstrated was that this divergence could result simply from an increase in complexity of the two tasks, rather than a genetically programmed differentiation, as Greenfield herself has proposed [8]. In summary, the evidence from the modelling described above supports the argument that there are good computational reasons for building a language production system on an object-assembly foundation, and lends support to the more general argument that a process of re–utilisation of cortical computation may provide a mechanism for the construction of high–level cognitive processes on lower–level sensory–motor ones. 7 Signposts to a Connectionist/Symbolic Bridge In the preceding sections I have described a possible route taken in the evolution of symbolic capacity in humans. The account relies heavily on Deacon’s analysis of the role of the pre–frontal cortex in the emergence of symbolisation. Deacon focuses on a neurobiological basis for this capacity, and its possible origins in evolution. I have augmented this with a description of two general computational mechanisms that might underpin the changes he identified. The first of these is the role of maturation in conjunction with a neurally plausible learning mechanisms. I have also proposed that something akin to software re–use may be at work in biasing the selection of one cortical region rather than another as the substrate for symbolisation. The jump to symbolisation represents a significant discontinuity in functional capabilities, but I believe that it may be a small underlying difference that makes all the difference. 
In other words, the change in computational mechanism supporting this jump may be quite subtle. So, while we may have a functionally hybrid system, appearing to combine associationistic and symbolic capabilities, its computational infrastructure may not be that hybrid at all, merely an extension of existing associationistic mechanism. The challenge, then, is to find that subtle difference that makes the difference. Some of the indications are that: – we should aim to build complex, functionally hybrid, systems from simpler connectionist components; – we should explore the use of simple learning rules, such as Hebbian learning, in conjunction with a maturation dynamic; – we should try to resolve the limitations in systematicity of connectionist models by focussing on the overall role of the pre–frontal cortex in language and cognition. Explorations in the space of connectionist models constrained by the above guidelines will, I believe, yield fruitful results. References 1. Chomsky, N. (1986). Knowledge of language. New York: Praeger. 2. Christiansen, M.H., & Chater, N. (1994). Generalisation and connectionist language learning. Mind and Language, 9, 273–287. Evolution of Symbolisation 371 3. Cruikshank, S.J., & Weinberger, N.M. (1996). Evidence for the Hebbian hypothesis in experience-dependent physiological plasticity of neocortex: A critical review. Brain Research Reviews, 22, 191–228. 4. Deacon, T. (1997). The symbolic species: The co-evolution of language and the human brain. London, UK: The Penguin Group. 5. Elman, J.L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71–99. 6. Fodor J. A., & Pylyshyn, Z.W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3–71. 7. Gold, E.L. (1967). Language identification in the limit. Information and Control, 16, 447–474. 8. Greenfield, P., Nelson, K, & Saltzman, E. (1972). The development of rule-bound strategies for manipulating seriated cups: A parallel between action and grammar. Cognitive Psychology, 3, 291–310. 9. Greenfield, P. (1991). Language, tool and brain: The ontogeny and phylogeny of hierarchically organized sequential behavior. Behavioral and Brain Sciences, 14, 531–595. 10. Hadley, R.F. (1994). Systematicity in connectionist language learning. Mind and Language, 9, 247–271. 11. Hadley, R.F. & Hayward, M.B. (1997). Strong semantic systematicity from Hebbian connectionist learning. Mind and Machines, 7, 1–37. 12. Pinker, S. (1994). The language instinct: how the mind creates language. New York: William Morrow. 13. Pollack, J.B. (1990). Recursive distributed representations. Artificial Intelligence, 46, 177–105. 14. Rebotier, T.P., & Elman, J.L., (1996). Explorations with the dynamic wave model. In D. Touretsky, M. Mozer, and M. Haselmo (Eds.), Advances in Neural Information Processing Systems 8. Cambridge, MA: MIT Press. 15. Reilly, R.G. (in press). The relationship between object manipulation and language development in Broca’s area: A connectionist simulation of Greenfield’s hypothesis. Behavioral and Brain Sciences. 16. Reilly, R.G. (1998). Cortical software re–use: A neural basis for creative cognition. In T. Veale (Ed.), Computational Models of Creative Computation, pp. 36-42. 17. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and The PDP Research Group (Eds.), Parallel distributed processing. 
Explorations in the microstructure of cognition. Volume 1: Foundations. Cambridge, MA: MIT Press, pp. 318–362. 18. Sejnowski, T.J. (1986). Open questions about computation in the cerebral cortex. In J. L. McClelland, D. E. Rumelhart, & The PDP Research Group (Eds.), Parallel distributed processing. Explorations in the microstructure of cognition. Volume 2: Psychological and biological models. Cambridge, MA: MIT Press, pp. 372–389. 19. Savage–Rumbaugh, E.S., & Lewin, R. (1994). Kanzi: The ape at the brink of the human mind. New York: John Wiley. 20. Shrager, J. & M.H. Johnson (1996). Dynamic plasticity influences the emergence of function in a simple cortical array. Neural Networks, 9, 1119–1129. 21. Thatcher, R.W. (1992). Cyclic cortical reorganization during early childhood. Brain and Cognition, 20, 24–50. 22. Van Gelder, T. (1990). Compositionality: A connectionist variation on a classical theme. Cognitive Science, 14, 355–384. A Cellular Neural Associative Array for Symbolic Vision Christos Orovas and James Austin Advanced Computer Architectures Group Computer Science Department, University of York York, YO10 5DD, UK [christos|austin]@minster.york.ac.uk http://www.cs.york.ac.uk/arch/ Abstract. A system which combines the descriptional power of symbolic representations with the parallel and distributed processing model of cellular automata and the speed and robustness of connectionist symbol processing is described. Following a cellular automata based approach, the aim of the system is to transform initial symbolic descriptions of patterns to corresponding object level descriptions in order to identify patterns in complex or noisy scenes. A learning algorithm based on a hierarchical structural analysis is used to learn symbolic descriptions of objects. The underlying symbolic processing engine of the system is a neural based associative memory (AURA) which enables the system to operate in high speed. In addition, the use of distributed representations allow both efficient inter-cellular communications and compact storage of rules. 1 Introduction One of the basic features of syntactic and structural pattern recognition systems is the use of the structure of the patterns in order to classify them. This approach is the most appropriate when the patterns under consideration are characterized by complex structural relationships [1]. In these systems each pattern is represented using a symbolic data structure (strings, arrays, trees, graphs) where symbols represent basic pattern primitives. Structural methods rely on prototype matching which compares the unknown pattern with a set of models. Syntactic systems follow concepts from formal language theory aiming to represent each class of patterns with a corresponding grammar. Although these methods have been successfully applied in a number of cases, they have their drawbacks such as the computational complexity of the algorithms, sensitivity to noise and errors in the patterns and the lack of generality and robust learning abilities [2]. In relation to syntactical systems, the problem of grammatical inference combined with the representational abilities of the grammar used and the complexity and sensitivity of parsing are important obstacles to be addressed [3,2]. In order to deal with the problem of high dimensionality which occurs when the entire pattern is treated as the entity to be recognized, a decentralized and S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 372–386, 2000. 
c Springer-Verlag Berlin Heidelberg 2000 A Cellular Neural Associative Array for Symbolic Vision 373 distributed approach can be followed where the description of the patterns is achieved by the co-operation of a set of simple and homogeneous symbolic processing units. This approach follows a computational framework similar to cellular automata [4]. The basic idea in cellular automata is that a cellular array of relatively simple processing units exists and at each time instant the state of each cell is determined by its previous state and the previous states of its direct neighbours using a common set of simple rules. Albeit the homogeneous and simple units and the local neighbourhood connectivity, this model can demonstrate examples of complex behaviour and propagation of information [5] and can be used to simulate several physical systems containing many discrete elements with local interactions [6]. Apart from Von Neumann’s twenty nine states and four neighbours connectivity cellular automaton which was capable of simulating a Turing machine [7] and the well known Life game [4], cellular automata have found applications in simulating physical systems [8,9] and also in image processing [4,10,11]. The system in [8] is also an example where the rules are extracted directly from the experimental data that have to be simulated and this is achieved by using a genetic algorithm to search in the rule space for these rules that represent the data best. However, this is also one of the rare examples where automatic generation of rules for a cellular automata system is performed. Using this paradigm of parallel, distributed and ‘virtual’ multilayered processing, the problem of dimensionality is overcome by portioning the object into segments using processing nodes which communicate with each other. By allowing the states that each cell can be in to represent information at the different stages of interpretation of low level features towards world models, the cells can individually decide whether or not the segments they hold are parts of the same object. Effectively, a bottom up parsing is performed in a decentralized manner; each cell is trying to build its derivation tree upwards obeying at the same time at orders set by its neighbours. The questions that need to be answered for such an approach is how the grammar used to recognize each pattern is held in each node, how the rules of the grammar can be automatically created, how this set is handled efficiently and how generalization under noise tolerance is provided. Our approach is based on Cellular Associative Neural Networks (CANNs) [12] which learn the underlying grammar of the objects in a scene and operate as outlined above. A similar idea of ‘communicating grammars’ is also followed by the parallel grammar systems [13]. The network of language processors (NLP) is a typical example of such a system [14]. This consists of several language identifying devices which are called language processors and are associated with the nodes of a graph. The processors operate on strings (which can represent data and/or programs) by performing rewriting and communication steps. Although the potential of this approach is revealed by the extensive study of its properties in [14] there still remain the problems of generality, complexity of parsing and inferring of grammars. In our model the idea is extended by the use of associative neural networks in order to efficiently handle large sets of simple symbolic rules 374 C. Orovas and J. 
Austin and by the use of a hierarchical approach to learn the structure of the patterns and produce the sets of rules which make up the communicating grammar. Associative memories are characterized by their ability to address data by their content [15,16]. The neural associative memory model which is used in our system, AURA, is also capable of symbolic and sub-symbolic processing in terms that symbolic information is handled sub-symbolically; symbols are converted to patterns of activity which are managed in a distributed manner. By doing that we have the advantage of having the ability to perform symbolic processing and at the same time benefit from the generalization, noise and error tolerance and adaptability of neural information processing. The integration of neural (sub-symbolic) and symbolic processing is the main characteristic of the hybrid systems [17,18]. Connectionist symbolic processing is also a term related with these systems. Depending on the way in which neural and symbolic processing interact, hybrid systems can be classified in various categories without necessarily very sharp boundaries [18]. In CANNs, the overall architecture is a symbolic one; a cellular array of communicating symbolic processors. Neural associative memories with symbolic processing capabilities are used by each unit in the system in order to meet the requirements for high speed of store and recall, noise tolerance and generalization. The neural associative memories are basic elements and are embedded in each processor. This characteristic can classify CANNs as an embedded hybrid system according to the taxonomy in [18]. The novelty of CANNs lays not only in the combination of ideas from structural pattern recognition with the computational paradigm of cellular automata and the use of connectionist symbolic processing. It is also in the existence of a simple yet effective learning algorithm which, using a hierarchical approach, produces the necessary rules for the operation of the system. In this paper we present the basic architecture of AURA, the 2D CANN, the learning method and initial results of simple evaluations. 2 Cellular Associative Neural Networks A CANN is a cellular array of associative processors. This is a new concept based on both cellular automata and associative neural networks. As mentioned earlier, the main characteristics are the cellular automata like operation and the connectionist symbolic processing using the AURA model of neural associative memory. The initial labelling of the image is a 2D array of symbols produced by labelling the raw pixel image using a feature recognition stage. During recognition each iteration of the cellular array corresponds to a higher level of representation, indicated by symbols associated with higher level structures (e.g. whole edges, corner parts and then whole objects). After recognition is complete, each node in the 2D array will contain a label indicating what the feature ‘beneath’ that cell represents. A Cellular Neural Associative Array for Symbolic Vision 375 During processing, messages are exchanged between each cell in the array and after every iteration each cell is aware of the state of more distant cells. An example of the operation in a CANN is depicted in figure 1. time: (1) t1 t2 (1) 1 f1 t1 f2 0 (2) t1 (3) t1 t2 (2) 2 (4) O t2 (3) t2 (4) O 3 4 5 Fig. 1. The recognition of object O through the operation of a CANN. 
Object O consists of features f1 and f2 and ti(n) is the intermediate state at time n of a cell initially having feature label fi. All the cells have the same structure and they all perform the same sets of symbolic rules. Thus, their operation is location invariant. As we will see later, each cell is made up of a number of AURA based associative memories. Recognition is performed by an iterative process. Symbols from three alphabets are used to form the states of the cells in the array and the messages used to communicate between cells. The symbolic rules that govern the convergence properties of the system are produced by a learning algorithm which uses a hierarchical approach to describe the structure of the patterns. Recall uses these rules and, by employing a constraint relaxation, is able to generalize. Following is a more detailed description of these aspects of the system. 2.1 AURA The Advanced Uncertain Reasoning Architecture (AURA) is a set of methods for building associative memory systems capable of symbolic and sub-symbolic processing. Each cell in the CANN described in this paper is made up of a number of modules. Each module is a rule matching engine constructed using the AURA methods [19], the details of which are outside the scope of this paper. An outline is presented here. The architecture used here has two stages. The first one uses a set of binary correlation matrix memories (CMM) [20] used to determine which rule will fire given a set of pre-conditions and the second is an enhanced indexing database used to store the post conditions of a rule. A basic example of a CMM and a schematic diagram of the operation of the AURA model are depicted in figure 2. Fig. 2. a) Example of a CMM. b) Schematic diagram of AURA. AURA can handle symbolic rules of the form preconditions → postcondition. The preconditions are sets of variable : value pairs connected either with logical AND or OR operators. For each variable and value there is a unique binary vector and the input to the CMMs is formed after superimposing the corresponding tensor products of these vectors. This input is directed to the appropriate CMM according to the number of preconditions in the rule (called arity). To identify each rule there is a unique binary pattern with a constant number of bits set to one sparsely distributed in it. This pattern is called separator and is associated with the pre-processed input at the relevant CMM. The separator is also used as the key in order to store the relevant postcondition to the database. In recall, the input is applied to the corresponding CMM and a vector of summed values is produced at the output. This is then converted to one or a number of superimposed separators using the L-max thresholding method [21,22]. This method sets the L highest sums to ones and the rest to zeroes. L is the number of bits initially set in the separator patterns. For each separator identified in the recovered binary pattern a postcondition is then retrieved from the database. AURA provides a powerful symbolic processing engine and it is an essential part of the system.
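The following toy sketch mirrors the storage and recall cycle just described: precondition tokens are superimposed, bound to a sparse separator in a binary correlation matrix, and recovered at recall time with L-max thresholding before a postcondition lookup. The vector widths, token generation and dictionary lookup are illustrative assumptions, not the AURA implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
N_IN, N_SEP, L = 256, 128, 4          # input width, separator width, bits per separator

def sparse_code(width, bits):
    v = np.zeros(width, dtype=np.int64)
    v[rng.choice(width, bits, replace=False)] = 1
    return v

tokens = {}                            # symbol -> binary token vector
def token(sym):
    if sym not in tokens:
        tokens[sym] = sparse_code(N_IN, 8)
    return tokens[sym]

cmm = np.zeros((N_SEP, N_IN), dtype=np.int64)   # one binary correlation matrix memory
postconditions = {}                              # separator (as bytes) -> postcondition symbol

def store_rule(pre_syms, post_sym):
    # Superimpose the precondition tokens and bind them to a fresh separator.
    x = np.bitwise_or.reduce([token(s) for s in pre_syms])
    sep = sparse_code(N_SEP, L)
    cmm[:] = np.maximum(cmm, np.outer(sep, x))   # binary (clipped Hebbian) storage
    postconditions[sep.tobytes()] = post_sym

def recall(pre_syms):
    x = np.bitwise_or.reduce([token(s) for s in pre_syms])
    sums = cmm @ x                               # vector of summed values
    sep = np.zeros(N_SEP, dtype=np.int64)
    sep[np.argsort(sums)[-L:]] = 1               # L-max: keep the L highest sums
    return postconditions.get(sep.tobytes())     # None if no exact separator is recovered

store_rule(["corner_NW", "edge_E"], "t17")       # hypothetical symbols, for illustration only
print(recall(["corner_NW", "edge_E"]))           # -> 't17'
```

The exact-match lookup on the recovered separator is the fragile part of this toy version; the relaxation mechanism described in Sect. 2.4 addresses exactly this by lowering the threshold and accepting partial matches.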
AURA allows on-line learning and high speed in both learning and recalling modes, it can perform in parallel on the data (multiple queries will give multiple responses) and it is capable of partial and combinatorial matching of rules. It also allows direct hardware implementation of the system using the PRESENCE [23] architecture for very high speed processing. 2.2 The Cell Structure Each cell in the CANN consists of three kinds of modules: spreader, passer and combiner. They perform the following three tasks: (a) convert the input to a form suitable for spreading in each direction away from the cell, (b) combine incoming information with information to be passed to the neighbouring units and operate as a symbolic gate, and (c) combine the current state and information from neighbours to produce the new state of the processor. An example of a two dimensional processor communicating with four neighbours is depicted in figure 3a. The spreaders incorporate a gating function, which can prevent symbols from being passed to other units if the cell has no image features on its input. This can prevent information pathways between passers from becoming saturated. The passers operate to ‘count’ the distance a message has been sent, akin to allowing the system to state that ‘there is a corner feature five hops away’. The combiners unite the information being passed by the passers about distant features with the current information about what part of the object the cell may represent, and translate this to a higher level description of the feature. The spreaders allow input symbols to be translated to features suitable for the passers. If they are not used, all the passers obtain the same symbols to be passed in all directions. In operation, the output from the combiners becomes the input to the spreaders on the next iteration of the array. Fig. 3. a) A 2D associative processor and b) the patterns used for training. Messages and Rules Information in a CANN is represented by symbolic messages consisting of one or more symbols. These symbols belong to three sets, or alphabets. The first is the input alphabet, the set of symbols representing primitive pattern features. This can be thought of as the set of terminals in a grammar system [24]. The second is the set of transition symbols used during the evolution process of the CANN. These symbols correspond to the non-terminals in a grammar system. They represent either various combinations of symbols, sub-patterns or formations of sub-patterns. The third set is the output alphabet; it can be thought of as the set of starting symbols in a grammar system, and its symbols represent complete patterns and are the object level symbols of the system. Each module has its own set of symbolic rules. However, these sets are the same for modules of the same type. Thus, the operation is location independent. The rules are of the form input conditions → output, where the input conditions are combinations of messages and the output is either a transition symbol or a symbol from the output alphabet. Connection Schemata. These determine the connection pattern which is followed, both inter-cell and intra-cell in the array. Cells are connected directly to four of their immediate neighbours. By the exchange of messages which takes place, cells become aware of the states of more distant cells after a number of iterations.
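Purely as a sketch of how one update step of a single cell might be organised, the passer and combiner roles described above could be written as below; the dictionaries, rule-table names and message symbols are hypothetical stand-ins for the associative modules.

```python
DIRS = ("N", "E", "S", "W")

def cell_update(state, incoming, passer_rules, combiner_rules):
    """One iteration of a single cell connected to four neighbours.

    state          : current state symbol of the cell
    incoming       : dict direction -> message that arrived from that neighbour
    passer_rules   : dict (incoming message, state) -> message to forward
    combiner_rules : dict (state, N, E, S, W messages) -> new state
    """
    # Passers gate/relabel what is forwarded onwards in each direction.
    outgoing = {d: passer_rules.get((incoming[d], state), incoming[d]) for d in DIRS}
    # The combiner turns the current state plus neighbour messages into the new state.
    key = (state,) + tuple(incoming[d] for d in DIRS)
    new_state = combiner_rules.get(key, state)   # state is kept if no rule matches
    return new_state, outgoing

# Toy usage with made-up symbols:
rules = {("f1", "t2(1)", "0", "0", "0"): "t1(2)"}
new_state, out_msgs = cell_update("f1", {"N": "t2(1)", "E": "0", "S": "0", "W": "0"}, {}, rules)
```

In the real system each of these lookup tables is, roughly speaking, an AURA-based module, so the matching tolerates partial preconditions and can return several candidate symbols at once.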
For the intra-cell case the connection schema defines which messages form the input to a module and where the output of a module is directed. There can be equivalent forms of connections using different types of modules and numbers of them. It is important that there is always communicating and state determining units within the processor. These connection schemata are uniformly applied for all the processors. The types which have been followed for the current experiments derive from the one depicted in figure 3a. For the results presented in this paper no spreader modules were used, as the effect of spreading different messages in each direction has little effect in the problems addressed here. 2.3 Learning Session The sets of rules for the operation of a CANN are produced during the learning session. These rules allow the transition from the initial symbolic image to the object level description. The training set which is used is a set of patterns along with the corresponding high level symbol (object label) for each pattern. Two inputs are given to the system for each pattern: (a) a symbolic description of the pattern using labels representing primitive pattern features placed in an array where each location corresponds to each cell in the CANN and (b) an object label which is the ‘name’ of the pattern. The algorithm which is followed for every processor in the cellular array when the processor has a non null initial state is depicted in figure 4. Initially, a preliminary stage (step 1) prepares the system for operation by placing the state of each processor at the input points of its neighbours. Then, the main part of the algorithm begins and it is applied for all the modules in all the processors A Cellular Neural Associative Array for Symbolic Vision 379 Step 1 Place the symbols representing the initial state of the processor to the input channels of its neighbours. Step 2 For all modules: Check if the input or the combination of inputs is recognizable. If recognizable Retrieve the answer and place it at the location for the output of the module. else Assign a new transition symbol to represent this input or combination of inputs, store the new association to the module and place the new symbol at the location for the output of the module. Step 3 If all states are unique goto step 4 else goto step 2. Step 4 Associate the current inputs to the combiner module with the object level label provided and store the association to the module. Fig. 4. The algorithm used in the learning session. It is reminded that the spreader module takes only one input whereas the passer and the combiner modules have more than one input. with a non null state. Processors with null (empty) states do not participate in this process 1 . For every module a ‘test and set’ approach is followed. If the combination of the symbols at the input of a module is recognizable then the corresponding postcondition is retrieved. If not, a new transition symbol is created and assigned as a postcondition to the current formation of inputs. Then, the new rule is stored at the relevant CMM. Rules previously produced are used and the new ones are appended to the corresponding sets. As mentioned earlier, after every iteration, due to the propagation of messages, each cell becomes aware of the states of more distant cells. This is reflected in its state. Within a finite number of iterations (the number of which depends on the size of the input pattern) all cells have a state which is unique in the array. 
Unique states indicate unique formations of subpatterns in which the input pattern has been divided into. This is the termination condition for the learning algorithm and corresponds to the configuration in which the entropy of the cellular array is the maximum. At this stage, for all the non-empty cells, the preconditions which exist at the input of the combiner modules are associated with the symbol from the output alphabet specifying the pattern presented. The beauty of this learning process is its simplicity, allowing complex patterns to be learned without manual intervention. As is presented next, generalization of the resulting rules to allow recognition in the presence of image noise is dealt with using a relaxation method. 1 An option exists for the passers of these processors to participate when they have an incoming signal. 380 2.4 C. Orovas and J. Austin Recall Session For recall, a symbolic image is presented to the CANN. As in the learning session, a bottom-up approach is followed. A ‘parsing’ using a universal grammar is performed transforming the input symbols to the nearest object level description or descriptions. As mentioned in the beginning, each cell is trying to build its derivation tree in a bottom-up manner, obeying at the same time at the orders arriving from its neighbours. The initial states of the cells represent simple features which could exist in all objects that could be recognized by a CANN. As messages from neighbours arrive, the cells are forced to alter their states to new ones which represent feature formations that can be found in a reduced number of objects. This process is repeated and leads to an ever decreasing number of possible objects that the cell can belong to. If a pattern used for training is presented then the system converges to the corresponding object level symbols for that pattern. If there are similar patterns stored, the corresponding object level symbols also appear at the common areas. When an unknown pattern is presented, the system tries to label those formations of pattern primitives which are recognized. However, it is not always possible for all cells to give a higher or an object level label due to corruption or differences in the input pattern compared to the trained examples. To allow generalization, relaxation is used. If a postcondition for a combination of inputs cannot be found then the constraints are relaxed and responses with incomplete precondition matching are accepted (the system has an increased tolerance). This is achieved by accessing more than one CMM in the relevant module and by reducing the threshold used for determining whether a valid separator can be retrieved from the CMM’s output. The decision for increased tolerance is taken either when none of the cells can alter their state (global) or a given cell fails to output a known separator (local). In the second case we have a completely decentralized operation. Due to partial matching it is possible to output more than one symbol from a module. This is the reason that messages and states can consist of more than one symbol. There are two ways in which multiple symbols for the same precondition are presented to the modules. Consecutive and simultaneous. The former presents inputs one by one and trades speed for size of the CMMs while the latter presents the inputs superimposed and is faster but needs larger CMMs. For example, consider rules with two preconditions. 
Symbols A and B are both present for the first precondition while symbol C is the second precondition. Using the consecutive presentation, the CMM of arity two is accessed two times. One for the combination A and C and one for the combination B and C. The results are then combined to form a multiple symbol message. On the other hand, using the simultaneous approach the CMM is accessed only once. In that case the binary patterns corresponding to symbols A and B are superimposed. We can see that using the consecutive approach the number of times that the CMMs are accessed depends on the number of symbols existing as pre-conditions while the simultaneous method always accesses the CMMs only once. This is one of the major advantages of using the AURA methods for rule matching in the CANN. A Cellular Neural Associative Array for Symbolic Vision 381 The recalling session stops when there are no alterations to the configuration of the system, or the number of state altering cells is less than a threshold or a preset maximum number of iterations has been reached. 3 Experiments The recognition of objects in multi-object and noisy scenes is the aim of the system. Applications can be found in specific areas such as analysis of electronic and mechanical drawings although the long term aim is to build an adaptable and generic pattern recognition system. Various parameters of the system have been tested using a set of prototype patterns. Among these parameters are included the intra-processor connection schemata, the effect of global/local relaxation, the simultaneous/consecutive presentation and parameters of the AURA modules. At the same time, the behaviour of the system with complex patterns (combinations of the training patterns), noise, and scale variations of the training patterns have been tested. Some initial results with complex patterns were presented in [25]. In this paper, results of experiments where symbolic noise was injected into the training patterns and also with scale variations of them are presented. The patterns used for training (R1-R6) can be seen in figure 3b. Their size is of the order of 6×12 and 12×6 symbols. For training, five repetitions for each of the patterns R1 to R6 were needed. For each pattern, a decreasing number of rules for the combiner module were produced, starting at 153 (for R1) and ending at 82 (for R6). As mentioned earlier, the number of iterations needed to complete the learning session depends on the size of the patterns while the number of rules produced depends on the level of similarity among the patterns since common formations are represented with the same rules. Symbolic noise can be inserted into the input image during the initial labelling process. For example, consider a symbol, A, at position (x, y) of a symbolic image. There are three categories of noise: (a) absence of symbol A from (x, y), (b) replacement of A by one or more different symbols, and, (c) addition of one or more symbols at (x, y). For the testing purposes, random noise was injected into the training set at various levels according to predefined values for P (x/α) which is the probability of having noise of type x when symbol α, α representing a primitive pattern feature, should be present and given that noise exists in that position. Ten versions at each noise level were tested for each pattern and the results for pattern R4 can be seen in the graphs in figures 5 and 6. These graphs show the average recognition success while recalling noisy versions of pattern R4. 
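The consecutive and simultaneous presentation modes can be illustrated with a small correlation matrix memory sketch. The code below is a simplified stand-in for an AURA module, not the actual AURA implementation: symbols are sparse binary vectors, the two preconditions are simply concatenated rather than bound, and a relaxed threshold emulates the tolerance mechanism; all names and parameter values are assumptions made only for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
BITS, ACTIVE = 256, 4                      # sparse binary symbol vectors

def symbol():
    v = np.zeros(BITS, dtype=np.uint8)
    v[rng.choice(BITS, ACTIVE, replace=False)] = 1
    return v

class BinaryCMM:
    """Willshaw-style binary correlation matrix memory for arity-two rules.
    The two preconditions are concatenated here (AURA instead binds them)."""
    def __init__(self):
        self.W = np.zeros((2 * BITS, BITS), dtype=np.uint8)

    def store(self, pre1, pre2, post):
        x = np.concatenate([pre1, pre2])
        self.W |= np.outer(x, post)

    def recall(self, pre1, pre2, tolerance=0):
        x = np.concatenate([pre1, pre2])
        sums = x.astype(np.int32) @ self.W
        # threshold = number of set input bits, lowered by the relaxation tolerance
        return (sums >= int(x.sum()) - tolerance).astype(np.uint8)

A, B, C = symbol(), symbol(), symbol()
post_AC, post_BC = symbol(), symbol()
cmm = BinaryCMM()
cmm.store(A, C, post_AC)
cmm.store(B, C, post_BC)

# Consecutive presentation: one CMM access per symbol combination,
# the retrieved separators are superimposed into a multiple-symbol message.
consecutive = cmm.recall(A, C) | cmm.recall(B, C)

# Simultaneous presentation: A and B are superimposed in the first precondition
# slot and the CMM is accessed only once; the threshold is relaxed accordingly.
simultaneous = cmm.recall(A | B, C, tolerance=ACTIVE)

print(np.array_equal(consecutive & post_AC, post_AC),   # post_AC recovered
      np.array_equal(consecutive & post_BC, post_BC))   # post_BC recovered
```

The sketch also makes the trade-off mentioned above visible: the simultaneous mode needs only one matrix access, but superimposing symbols increases crosstalk, which is why larger CMMs are required in practice.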
The standard deviations of the averages for pattern R4 itself are also shown. Figure 7 has similar results for pattern R6. The graph in figure 5a shows the recognition results when no tolerance was allowed for either the combiner or the passer modules. We can see that the percentage for R4 is always the highest, although its value decreases to almost 5% at 50% noise. A tolerance of 1/5 for the combiner modules is allowed in graph 5b. That means that one out of five preconditions is allowed not to match in the inputs to the combiner modules. The effect is a direct improvement in recognition success, as more cells are able to alter their states and end up with object level labels even when not all the necessary preconditions are present. This improvement is even greater in graph 6b, where a tolerance of 1/2 is also allowed for the passer modules. This enables the cells to communicate and exchange messages even when there are empty spaces or erroneous conditions between them.

Fig. 5. Percentages of object level labels while recalling noisy versions of pattern R4. Tolerance for the combiner modules is 0/5 (see text) in (a) and 1/5 in (b). Tolerance for the passer modules is 0/2 in both (a) and (b).

Fig. 6. Percentages of object level labels while recalling noisy versions of pattern R4. Tolerance for the combiner modules is 0/5 in (a) and 1/5 in (b), while tolerance for the passer modules is 1/2 in both (a) and (b).

Fig. 7. Percentages of object level labels while recalling noisy versions of pattern R6. Tolerance for the combiner modules is 2/5 for both (a) and (b), while no tolerance is allowed for the passer modules in graph (a).

The effect of relaxing the constraints of the system is obvious from these graphs. Even with 50% noise, object level symbol R4 has an average occurrence of almost 80% in 6b. Graph 6a shows the results when tolerance is allowed only for the passer modules. These results show indications of improvement, but it is only when both module types are relaxed that the best performance is achieved. Comparing 5b and 6a we can see that the behaviour of the system is influenced to a greater degree by the combiner units than by the passer units. This is justified since the combiner modules are the ones that decide the next state of the processor, while the passer modules are only responsible for passing this information to the neighbouring units. The effect of the relaxation of the passer modules is more obvious in graphs 7a and 7b.
We can see there that although the correct behaviour is followed in 7a, it needs to be augmented by relaxing the passer modules (graph 7b). This has the effect of 'fine-tuning' the recognition levels. We can notice from all the graphs that the next highest percentages are for patterns which are similar to the actual pattern. Thus we have a sort of 'multilevel classification' where recognition percentages are distributed according to the level of similarity with the actual pattern at the input. We can also notice that as the level of noise increases, so does the percentage of symbols output for patterns with no similarity to the actual one. This is because as noise increases there are more randomly created subpatterns belonging to these patterns. The behaviour with increased tolerance demonstrates both the capabilities of AURA for uncertain reasoning and partial matching and the robustness of the distributed processing approach. Relaxing the combiners allows the units to produce states even if some messages are missing, while relaxing the passers enables messages to overcome obstacles such as empty spaces and erroneous conditions. It is important to note, however, that even with no relaxation the behaviour of the system is still good.

Fig. 8. Recalling scaled versions of pattern R1 (a) and pattern R4 (b). The combiner modules are allowed a tolerance of 1/5 while no tolerance is allowed for passers.

Another set of experiments was performed in order to examine the behaviour with scale variations in patterns. The current results demonstrate that the system can cope very well with scaling of up to 120% in all cases, and in some cases up to 150%. The graphs in figure 8 give the results in two cases. The results in 8a are characteristic for patterns R1 and R2, while the ones in 8b are typical for the rest of the patterns. Increasing the tolerance of the passer modules provides better results, although not at the same level as with noise. Ideas which will allow a more 'distance free' recognition, i.e. recognition of a pattern with more weight given to the nature of the existing subpatterns than to their distance, are currently being considered. Local relaxation and consecutive presentation were used for these experiments. Recognition of the patterns was achieved by the 6th iteration in most cases, with only small fluctuations in the percentages after that. As with the learning session, the number of iterations before the first recognition is achieved depends on the size of the patterns. A software simulation of the AURA model running on Silicon Graphics workstations was used. We plan to extend these experiments using the dedicated hardware platform for the AURA system, which is estimated to provide a speed-up of up to 200 times [23], thus allowing real-time image processing since the initial labelling stage can also be performed using the dedicated hardware.
A Cellular Neural Associative Array for Symbolic Vision 4 385 Summary A cellular system combining the enhanced representational power of symbolic descriptions and the robustness of associative information processing has been presented. Homogeneous associative processors are placed in a cellular array and they interact in a parallel, distributed and decentralized fashion. Each processor has communicating and state determining modules and together they process the rules of a global and universal grammar. These rules are produced during a simple, but powerful, learning method which uses a hierarchical approach. A basic ‘test and set’ principle is followed for this and the grammar which is produced describes the structure of the patterns and the relations among their constituent parts in all levels of abstraction. Starting with initial states representing primitive pattern features, a bottomup approach is followed by each cell in the system in its effort to end up having an object level state. This will indicate the object(s) which the cell is part of. This process is directed by the messages arriving from the neighbouring cells. The parallel operation of the units relieves the model from the complexity which is associated with parsing. This is because each cell operates in an autonomous way, however, leading to a global solution. This characteristic also enables the model to tackle erroneous conditions and noise at a local level. An essential component of the system is the underlying associative symbolic processing engine, AURA. The latter endows the system with high speed management of large sets of symbolic rules, generalization, partial matching and noise tolerance. The descriptional and computational potential of parallel grammar systems has been well studied and demonstrated in similar approaches [13,14]. The system presented in this paper also demonstrates how the incorporation of a powerful connectionist symbolic processing mechanism along with the use of a simple yet effective learning algorithm can result in a flexible pattern recognition system with direct hardware implementation. Acknowledgements The funding for this project came from the State Scholarships Foundation of Greece while C. Orovas’s participation to the workshop was subsidized by the TMR PHYSTA (Principled Hybrid Systems: Theory and Applications) research project of the EC. References 1. H.Bunke. Structural and syntactic pattern recognition. In C.H.Chen, L.F.Pau, and P.S.P Wang, editors, Handbook of Pattern Recognition & Computer Vision, pages 163–209. World Scientific, 1993. 2. K. Tombre. Structural and syntactic methods in line drawing analysis: To which extend do they work ? In P.Perner, P.Wang, and A.Rozenfeld, editors, Advances in Syntactic and Structural Pattern Recognition, 6th Int. Workshop, SSPR’96, pages 310–321, Germany, 1996. 386 C. Orovas and J. Austin 3. E. Tanaka. Theoretical aspects of syntactic pattern recognition. Pattern Recognition, 28(7):1053–1061, 1995. 4. K. Preston and M. Duff, editors. Modern Cellular Automata: Theory and Applications. Plenum Press, 1984. 5. S. Wolfram. Computation theory of cellular automata. Communications in Mathematical Physics, 96:15–57, 1984. 6. S. Wolfram. Statistical mechanics of cellular automata. Reviews of Modern Physics, 55(3):601–643, Jul 1983. 7. A.W. Burks, editor. Essays on Cellular Automata. University of Illinois Press, 1970. 8. F.C. Richards, P.M. Thomas, and N.H. Packard. Extracting cellular automaton rules directly from experimental data. 
Physica D, 45:189–202, 1990. 9. Y. Takai, K. Ecchu, and K. Takai. A cellular automaton model of particle motions and its applications. Visual Computer, 11, 1995. 10. T. Pierre and M. Milgram. New and efficient cellular algorithms for image processing. CVGIP: Image Understanding, 55(3):261–274, 1992. 11. M.J.B. Duff and T.J. Fountain, editors. Cellular Logic Image Processing. Academic Press, 1986. 12. C. Orovas. Cellular Associative Neural Networks for Pattern Recognition. PhD thesis, University of York, 1999. (copies are available). 13. G. Paun and A. Salomaa, editors. New Trends in Formal Languages. LNCS 1218. Springer, 1997. 14. E. Csuhaj-Varju and A. Salomaa. Networks of parallel language processors. In Paun and Salomaa [13], pages 299–318. 15. T. Kohonen. Content-Addressable Memories. Springer-Verlag, 1980. 16. Austin J. Associative memory. In Fiesler E. and Beale R., editors, Handbook of Neural Computation. Oxford University Press, 1996. 17. Geoffrey E. Hinton, editor. Connectionist Symbol Processing. MIT/Elsevier, 1990. 18. R. Sun and L.A. Bookman, editors. Computational architectures integrating neural and symbolic processing. Kluwer Academic Publishers, 1995. 19. J. Austin and K. Lees. A neural architecture for fast rule matching. In Proceedings of the Artificial Neural Networks and Expert Systems Conference, Dunedin, New Zealand, Dec 1995. 20. D.J. Willshaw, O.P. Buneman, and H.C. Longuet-Higgins. Non-holographic associative memory. Nature, 222(7):960–962, Jun 1969. 21. D. Casasent and B. Telfer. High capacity pattern recognition associative processors. Neural Networks, 5:687–698, 1992. 22. J. Austin and T.J. Stonham. Distributed associative memory for use in scene analysis. Image and Vision Computing, 5(4):251–261, 1987. 23. J.Kennedy and J.Austin. A parallel architecture for binary neural networks. In Proceedings of the 6th International Conference on Microelectronics for Neural Networks, Evolutionary & Fuzzy Systems (MICRONEURO’97), pages 225–232, Dresden, Sep. 1997. 24. G.Rozenberg and A.Salomma, editors. Handbook of Formal Languages, Vol.I-II-III. Springer-Verlag, 1997. 25. C. Orovas and J. Austin. A cellular system for pattern recognition using associative neural networks. In 5th IEEE International Workshop on Cellular Neural Networks and their Applications, pages 143–148, London, April 1998. Application of Neurosymbolic Integration for Environment Modelling in Mobile Robots Gerhard Kraetzschmar, Stefan Sablatnög, Stefan Enderle, and Günther Palm Neural Information Processing, University of Ulm James-Franck-Ring, 89069 Ulm, Germany, {gkk,stefan,steve,palm}@neuro.informatik.uni-ulm.de, WWW:http://www.informatik.uni-ulm.de/ni/staff/gkk.html Abstract. We present an architecture for representing spatial information on autonomous robots. This architecture integrates several kinds of representations each of which is tailored for different uses by the robot control software. We discuss various issues regarding neurosymbolic integration within this architecture. For one particular problem – extracting topological information from metric occupancy maps – various methods for their solution have been evaluated. Preliminary empirical results based on our current implementation are given. 
1 Introduction and Motivation A close investigation of strengths and weaknesses of both symbolic, classical AI methods and subsymbolic, so-called soft computing methods suggests that a successful integration of both promises to yield more powerful methods that exhibit the stengths and avoid the weaknesses of either component approach. This idea is the underlying assumption of the majority of efforts to build hybrid systems over the past few years. Kandel and Langholz [8], Honovar and Uhr [7], Bookman and Sun [14], Goonatilake and Khebbal [5], Medskers [10] and Wermter [18] all have edited collections of papers describing a wide variety of approaches to integrate symbolic and subsymbolic computation. Hilario [6] provides a suitable classification and defines the notion of neurosymbolic integration for combining neural networks with symbolic information processing. The study of neurosymbolic integration is also a central research topic in our long-term, multi-project research effort SMART.1 In this project, the topic is studied in the context of adaptive mobile systems, such as service robots. In our group, we particularly investigate neurosymbolic integration for representing and reasoning about space. In this paper, we first give a very brief overview on the most common approaches to represent spatial concepts in robotics. An analysis of their respective strengths and weaknesses motivates the development of integrated, neurosymbolic systems for spatial representations on autonomous mobile robots. The Dynamo Architecture is then suggested as a framework for this endeavor, 1 See http://www.uni-ulm.de/SMART for more informtion. S. Wermter and R. Sun (Eds.): Hybrid Neural Systems, LNAI 1778, pp. 387–401, 2000. c Springer-Verlag Berlin Heidelberg 2000 388 G. Kraetzschmar et al. and its potential for applying neurosymbolic integration is outlined. One of the candidate problems for applying neurosymbolic integration is the extraction of topological spatial concepts from metric occupancy maps. We discuss various approaches we have pursued so far and report on their respective strengths and weaknesses based on experimental results. 2 Overview on Representations of Space Reasoning about space is a widely occurring problem. Application areas range from Geometric Information Systems (GIS), Computer-Aided Design (CAD), and architectural interior design all the way to industries like clothings, shoes, and logistics. In most of these problems, reasoning about space is used either as an optimization tool (minimizing waste of raw material, maximizing use of given storage capacity) or as a structural aid to solve problems later in the work process (providing precise specs for parts manufacturing or interior design work on a building). In these applications, a global view on the problem can be taken, and once a suitable representation for the relevant spatial concepts has been chosen, the representation remains fixed and static. The situation in robotics, even for classical industrial robots, is quite different. Either the whole robot, or at least parts of it, like a multiple-joint robot arm, move through space, thereby changing their spatial relations to their surroundings over time. 
Moreover, most intellectually and economically attractive applications of autonomous mobile robots usually work in environments exhibiting dynamics of some sort: various kinds of manipulatory objects (trash cans etc.), objects relocatable by other agents like chairs and crates, objects with changing spatial state like doors, and both human and non-human agents all introduce a certain dynamic aspect. Thus, a fundamental, but often neglected question to ask about spatial representations used in robotics is how they can deal with dynamic aspects of the world. Any evaluation of spatial representations and the respective reasoning capabilities in robotics has to take into account the particular tasks the robots are supposed to perform. Some typical tasks are briefly overviewed in the next few paragraphs. Navigating Through Space: One of the most investigated task is the pick-up-anddelivery task for office robots (see cf. [1]). It requires autonomous navigation in office environments. Leonard and Durrant-Whyte [9] have framed the problem concisely into three questions: 1) Where am I? 2) Where am I going? and 3) How should I get there? Almost all approaches to solve the navigation task for indoor robots are based on some kind of two-dimensional spatial representation. The range of representations varies widely and includes at least polygonial representations/computational geometry, relational representations/logics of space [12], topological representations/graph theory, and probabilistic metric occupancy grids/Bayesian inference [2], [16]. In all cases, allocentric representations of the environment are most commonly used, i.e. a global perspective Neurosymbolic Integration for Environment Modelling 389 is taken. Most approaches, however, are best suited for modelling static environments and require substantial effort for dealing with dynamic world aspects, especially if the environment is very large. Several biologically-inspired approaches have received closer attention, e.g. Mallot’s view graphs [3] or Tani’s sensory-based navigation system [15]. Experiments done in biology and psychophysics also initiated increased interest in egocentric representations (see c.f. [13]), where the environment is modelled relative to the robot. Egocentric representations have strong implicit dynamics (almost everything changes whenever the robot moves) and are often transient, i.e. they are constructed from the sensory information currently available and are not stored permanently. As a consequence, information tends to be noisy (few sensory input data yield limited reliability) but timely (changes in the environment are immediately seen, if sensors can detect them). Following People: A seemingly similar, yet different robot task is following a person. Spatially speaking, the robot is supposed to track a person by keeping the distance between the person and itself within given bounds. The task requires to visually track the person to be followed and to extract information about the spatial relation (angular position, distance) between the robot and the tracked person. Having information on the dynamics of the tracked person (speed, acceleration) can simplify the task. An essential requirement is the automatic avoidance of obstacles by the robot. Having a model of the environment maybe of help, but does in no way solve the problem by itself. A robust people following system would have to compensate for temporary interruption of the tracking capability due to occlusions. 
For this kind of task, egocentric models of the immediate environment seem to be suited quite well. Manipulating Objects: Manipulating objects by robots involves even more challenging problems of reasoning about space. Contrary to navigation tasks, threedimensional representations are considered essential for manipulation tasks by most roboticists. However, most 3D representations require substantially higher computational efforts, which make it even more difficult to model dynamic world aspects than in the 2D case. Also, there is psychological and psychophysical evidence, that humans also use just 2D representations even for manipulatory tasks. Speaking with Humans: Interacting with humans is a robot task that recently received significantly increased interest; many researchers believe it to be the field attracting most robotics research over the next decade or so. Using speech is both the most natural and appealing way for interacting with a human. Spatial representations and reasoning about space are essential requirements for providing powerful and natural and speech interfaces. Even the specification of the simplest robot tasks (“go and get me coffee”, “pick up the blue box next to the trash can in the kitchen and bring it to Peter’s room”) involved substantial and complex kinds of reasoning about space. Logics of space and topological approaches are most common in these areas, where precise metric distances can often be neglected. 390 G. Kraetzschmar et al. Reviewing the various kinds of representations for the different tasks, it is immediately plausible that only the integration of several of the discussed approaches is likely to provide the necessray spatial reasoning capabilities for mobile robots performing the whole range of tasks. Thus, hybrid spatial representations are of great interest in robotics. Several hybrid approaches, e.g. by Thrun [16] or Kuipers [11] are already known, but these systems use a combination of several representations for navigation only and do not specifically support other robot tasks. 3 The Dynamo Architecture We present the Dynamo architecture (see Figure 1) as a framework for the integration of symbolic and subsymbolic spatial representations. Basically, Dynamo features three layers of spatial representations: – At the lowest level, we use egocentric occupancy maps for representing the local environment of the robot. For each sensor modality, a neural network interprets the most recent sensory data and approximates a continuous 2D occupancy landscape. Locality is determined by the range of the particular sensor. An integrated view of the local environment is produced by fusing the continuous representations of the different sensor modalities. This so-called multimodal egocentric map is continuous as well. If needed, e.g. for implementing collision avoidance during trajectory execution, discrete equidistant occupancy grid maps can be produced for each continuous map by sampling it in an anytime manner. All egocentric maps are transient. – At the medium level, we use allocentric occupancy maps to represent a global view of the environment. These maps (usually one, but several for different floor levels) are discrete, usually equidistant (but other representations are possible and under consideration), and constructed automatically by integrating egocentric occupancy maps over time. The integration requires an estimate of the global position of the robot; the quality of position estimates strongly influences overall allocentric map quality. 
– At the top level, we basically use a relational representation, which has been extended by topological and polygonal elements. The relational representation, itself integrated into the knowledge representation used by the robot, provides the means to represent relevant spatial concepts like rooms, furniture, doors, etc., including all necessary attributes. Each such concept is associated with a polygonal description (currently a set of rectangular regions). The defining characteristic of a region is that it has an attribute which is spatially invariant over the region. Thus, regions are typed. Examples are the region of a table, the region of a rug, or a region representing an abstract spatial concept like a "forbidden zone" (cf. stairs). Depending upon their type, regions may overlap (Bob's room and free space) or not (wall and free space). Traversability of and between regions is topologically represented by a graph.

Fig. 1. The Dynamo Architecture.

The Dynamo architecture provides many opportunities to study neurosymbolic integration in various forms:
1. Fusing the multiple modalities of egocentric maps can be done symbolically (priority rules, majority rules) or subsymbolically (learning context-dependent fusion functions with neural networks).
2. In order to reduce the dependency on reliable position estimates for temporal map integration, a matching process can try to compute the position where the current (fused) egocentric map matches the allocentric map best. Methods to do this in a subsymbolic manner are based e.g. on expectation maximization. Problems arise when the world changes dynamically. Including expectations generated from symbolically available knowledge may improve and simplify this process.
3. The segmentation of grid maps links sets of grid cells to symbolic concepts like rectangular regions or polygons. This process is considered a central element in integrating subsymbolic and symbolic representations and will be discussed in more detail below.
4. By map annotation we mean associating low-level symbolic concepts like regions with higher-level concepts present in the knowledge base, thereby augmenting the information available for the region under consideration. This association can be derived either by using object classification (we can see that this region is a table) or by matching low-level and higher-level concepts based on spatial occupancy (we already know that the room contains a single table).
5. The extraction of topological information can be done based on a polygonal representation (regions) by computing various spatial relationships (connectedness, overlap, etc.). This process can be very expensive.
Extracting topological information directly from the allocentric map could exploit the contextual spatial information that is inherent in the allocentric map, but must be inferred when using a polygonal map.
6. Projecting topological information into allocentric maps is the reverse idea to the previous issue. It is useful for exploiting knowledge about the environment that is available a priori or acquired through communication with humans, and can significantly ease exploration.
7. Probably the most interesting long-term aspect is modelling dynamic world aspects. The basic idea is to have a symbolic model of the spatial dynamics of certain objects like humans or robots. These models are applied to update the position and orientation of the regions associated with these objects. The regions can be projected onto the allocentric map, e.g. for path planning purposes.

4 From Grid Maps to Regions: Map Segmentation

The annotation of subsymbolic spatial representations with symbolic information is considered a key element in Dynamo. Annotating grid cells directly is computationally expensive, given typical grid map resolutions of 5–15 cm, and poses a severe problem of maintaining map consistency. A simple table would require the annotation of hundreds of grid cells. The segmentation of grid maps into a set of regions, which then serve as basic elements for annotation, can reduce the required computational effort for annotation enormously. We applied several methods to solve this problem. Two of them, the Colored Kohonen Map [17] and Growing Neural Gas [4], are based on self-organizing maps. The latter method only segments free space. Common to both methods is that the grid map is sampled during the adaptation process to retrieve single data points. Each data point is handled in an isolated manner and neighboring grid cells are not evaluated. Thus, these methods are based on the exploitation of isolated data points. The third method was first applied to the problem by Thrun and Bücken [16]. It is based on Voronoi diagrams and computes a segmentation of free space. The fourth method, developed in our local group, applies ideas from computer vision to detect wall boundaries, which are then used to generate rectangles representing either free space or occupied space. Both of these methods take into account contextual spatial information.

4.1 Methods Based on Isolated Data Point Exploitation

Self-organizing maps lend themselves naturally to neurosymbolic integration: they define a mapping from a high-dimensional input space (usually subsymbolic data with unknown structure) to a low-dimensional topological output space. In our application to map segmentation, the topology of the output space defines a Voronoi tessellation of the input space, thereby providing a segmentation of the grid map.

Colored Kohonen Map

Vleugels et al. [17] developed a colored variant of Fritzke's Growing Cell Structures, which themselves are a variant of Kohonen maps with adaptive topology, for mapping the configuration space of a robot.

The Algorithm: Two different types of neurons are used: one is specially designed to settle on obstacle boundaries, the other is tuned to approximate the Voronoi diagram of free space. This behavior is achieved by applying a repulsive force to the neurons modelling free space from inputs inside obstacles, and an attractive force to the neurons modelling obstacles from inputs inside free space. Neurons are not allowed to change color, i.e.
the control algorithm prevents obstacle neurons from being drawn into free space and vice versa. The size of the Kohonen map adapts automatically as new neurons are inserted when needed. The amount of data is dramatically reduced, especially in high dimensional configuration spaces. After the learning process, neurons in free space and their topological connections are interpreted as road maps, which are used for solving path planning problems.

Fig. 2. Update of colored neurons.

Application To Our Problem: Figure 3 shows an example of a grid map that represents the configuration space of an omnidirectional point robot in an artificial environment and the corresponding colored Kohonen map. The input space is only two dimensional; data points were acquired by random sampling of the grid map.

Fig. 3. Example of a colored Kohonen map.

Results: A problem we often met with this method was that the map tends to move outside of narrow free space pockets, leaving free space which is not appropriately represented by neurons (see top left of Figure 3). Due to the dynamic insertion of neurons, the robot is able to easily increase the size of the grid map and to adapt the size of the Kohonen map accordingly. Also, the incremental creation and adaptation process enables permanent adaptation to slowly changing environments.

Growing Neural Gas

The colored Kohonen map does not try to fully represent free space; it just builds road maps. For autonomous robots navigating in two dimensional space (ignoring holonomic restrictions), we are interested in a more complete representation of free space. We tried to achieve this by using the Growing Neural Gas method by Fritzke [4].

The Algorithm: The algorithm described in [4] uses a set of units A, each of which is associated with a reference vector w_c ∈ R^n (also referred to as the position of the unit). Units keep track of a local error variable, which is incremented by the actual error for the best matching unit. Pairs of such units c ∈ A can be connected by edges e_{c1,c2}. Edges are equipped with a counter, age, which is incremented for every input. The algorithm usually starts with two units connected to each other, every unit being assigned a random position. Then, for each input, the nearest unit s1 and the second nearest unit s2 are computed using the Euclidean distance measure. The age of every edge emanating from s1 is incremented and the distance between the input and the best matching unit is added to the local error variable. The unit s1 is moved towards the input, as are the neighbors directly connected to s1 by an edge. If there is already an edge between s1 and s2, its age is set to zero; else the edge is inserted. All edges are then checked as to whether they have reached a maximum age; if so, they are removed. Every λ steps a new unit is inserted halfway between the unit with the maximum error, q, and its neighbor with the largest error variable. The error of the two units is decreased and the new unit is assigned the new error value of q. At every step all error variables are decreased by a factor d. This is repeated until a specific criterion is reached.

Application To Our Problem: We used this method to create a map of free space by repeatedly presenting free space position vectors sampled from the grid.
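The Growing Neural Gas procedure summarized above can be sketched compactly. The code below is a minimal, illustrative implementation of Fritzke's algorithm [4] as described in the text, not the authors' code; the toy occupancy grid, the parameter values (eps_b, eps_n, max_age, λ, α, d) and the sampling scheme are assumptions chosen only to make the sketch self-contained.

```python
import numpy as np

def growing_neural_gas(samples, max_units=100, eps_b=0.05, eps_n=0.006,
                       max_age=50, lam=100, alpha=0.5, d=0.995, seed=0):
    """Minimal Growing Neural Gas sketch; `samples` is an iterable of 2D points."""
    rng = np.random.default_rng(seed)
    pos = [rng.random(2), rng.random(2)]        # unit positions (reference vectors)
    err = [0.0, 0.0]                            # local error variables
    edges = {}                                  # frozenset({i, j}) -> age

    for t, x in enumerate(samples, 1):
        dist = [np.linalg.norm(x - w) for w in pos]
        s1, s2 = np.argsort(dist)[:2]           # nearest and second nearest unit
        err[s1] += dist[s1] ** 2                # accumulate error at the winner
        for e in list(edges):                   # age edges emanating from s1
            if s1 in e:
                edges[e] += 1
        pos[s1] += eps_b * (x - pos[s1])        # move winner towards the input
        for e in edges:
            if s1 in e:                         # ... and its direct neighbours
                n = next(iter(e - {s1}))
                pos[n] += eps_n * (x - pos[n])
        edges[frozenset({s1, s2})] = 0          # refresh or create edge s1-s2
        edges = {e: a for e, a in edges.items() if a <= max_age}
        if t % lam == 0 and len(pos) < max_units:
            q = int(np.argmax(err))             # unit with maximum error
            nbrs = [next(iter(e - {q})) for e in edges if q in e]
            if nbrs:
                f = max(nbrs, key=lambda n: err[n])
                pos.append((pos[q] + pos[f]) / 2)        # insert new unit halfway
                err[q] *= alpha; err[f] *= alpha
                err.append(err[q])
                r = len(pos) - 1
                edges.pop(frozenset({q, f}), None)
                edges[frozenset({q, r})] = 0
                edges[frozenset({f, r})] = 0
        err = [e * d for e in err]              # decay all error variables
    return np.array(pos), list(edges)

# Usage sketch: sample free-space cells of a hypothetical occupancy grid.
grid = np.zeros((50, 50)); grid[20:30, :] = 1            # a wall of occupied cells
free = np.argwhere(grid == 0)
rng = np.random.default_rng(1)
pts = (free[rng.integers(len(free), size=5000)] + rng.random((5000, 2))) / 50.0
units, topology = growing_neural_gas(pts)
print(len(units), "units,", len(topology), "edges")
```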
Results: After a short time, units were equally distributed over the free space and, as soon as their density allowed it, connections crossing occupied space were removed. The neural gas automatically develops a topology preserving structure, so that the resulting graph can be used for path planning purposes. This structure is not perfect, but improves with the time spent on presenting examples from the input space. The number of neurons needed to represent a map depends on the required final density of neurons and on the structural complexity inherent in the particular grid map. The algorithm allows for dynamic extension of the map as well as for adaptation to dynamic changes in the environment. This method needs many neurons, especially for modelling at a very detailed level (Figure 4). The result given by this method is usually not free of topological errors, e.g. connections crossing a wall. Longer training can reduce this, but not avoid it totally, as we do not want to make the structure of regions as detailed as the original grid map itself.

Fig. 4. Neural gas approximating free space.

In the bottom left part of Figure 4, it can be seen that artefacts in the grid map tend to create artefact regions as well, if they are big enough. In order to avoid this one should either choose an appropriate subpart of the map or cross-check topological consistency.

4.2 Methods Based on Contextual Data Point Exploitation

The methods described so far do not use any information about the neighborhood of a grid cell; every input to the self-organizing map is interpreted as an isolated point in the configuration space. The following methods try to also exploit contextual information.

Voronoi Skeletization and Segmentation

The Algorithm: Thrun and Bücken [16] try to extract topological information with image analysis methods. They skeletize free space by approximating a Voronoi diagram, which in turn is used to find locations with minimal clearance to obstacles. The underlying hypothesis is that such locations often mark doors or narrow pathways. Points on the Voronoi diagram with minimal clearance are used to define lines separating regions of free space. These free space regions are then topologically represented by nodes, while region adjacency yields the edges. The resulting graph can be pruned using a few simple pruning rules to further reduce the number of nodes.

Results: We implemented the method as described in [16]. Our results show that the method is quite sensitive to noise and map distortions, which result in very complex Voronoi diagrams and, consequently, in large numbers of small regions. Figure 5 shows an example of a segmentation obtained by our implementation. We experimented with various methods of filtering the grid map before applying the Voronoi algorithm, which led to much better results. It should be noted that the regions generated by this approach do not match well with the regions intuitively associated with knowledge-level concepts such as rooms, etc. How a connection between regions and such concepts can be achieved is an unsolved problem for this approach.

Fig. 5. Example of Voronoi skeletization and segmentation.

Also, as this method is not incremental in nature it is not as easily extended to dynamic environments as the self-organizing maps. Whenever significant changes to the grid map occur, the whole segmentation process must be repeated, possibly leading to completely different topologies.
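As a rough illustration of the skeletization idea, the fragment below approximates the Voronoi diagram of free space by the medial axis and reports skeleton cells whose clearance to the nearest obstacle is a local minimum, i.e. candidate door or narrow-passage locations. This is only a hedged sketch of the first step of the method in [16]; the subsequent drawing of separating lines and the construction of the region adjacency graph are omitted, and the toy map and function name are assumptions.

```python
import numpy as np
from scipy import ndimage
from skimage.morphology import medial_axis

def narrow_passage_candidates(grid):
    """Skeletonize free space (0 = free, 1 = occupied) via the medial axis and
    return skeleton cells whose obstacle clearance is a local minimum along
    the skeleton, i.e. candidate separator (door / narrow passage) locations."""
    free = grid == 0
    skeleton, clearance = medial_axis(free, return_distance=True)
    on_skel = np.where(skeleton, clearance, np.inf)       # clearance restricted to skeleton
    local_min = ndimage.minimum_filter(on_skel, size=3)   # min over 3x3 neighbourhood
    candidates = skeleton & (on_skel <= local_min)        # no skeleton neighbour is closer to an obstacle
    return np.argwhere(candidates)

# Toy map: two 'rooms' connected by a narrow doorway.
grid = np.ones((40, 60), dtype=int)
grid[5:35, 5:28] = 0; grid[5:35, 32:55] = 0; grid[18:22, 28:32] = 0
print(len(narrow_passage_candidates(grid)), "candidate separator cells")
```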
Wall Histograms

The SMART project uses an autonomous system in an office environment. Most office environments are built with mainly orthogonal walls. The wall histogram method makes this assumption and considers only rectangular regions. The regions have an additional orientation parameter, though, which usually reflects the overall rotation of the robot's coordinate system against the direction of the main wall axis in the environment. Given a grid-based occupancy map G taken from G^{r×c}, the set of all grid maps consisting of r rows and c columns, we want to apply the segmentation function S : G^{r×c} → 2^R to obtain a set of regions R consisting of a number of disjoint regions R_i, i = 1 ... n. As mentioned, we chose to restrict ourselves to sets of rectangular regions, each of which should ideally be either fully occupied or free. Note that an equidistant grid-based occupancy map is itself a rectangular segmentation of the environment, and thus r × c is an upper bound for n. Usually, we are interested in segmentations yielding small numbers of regions, i.e. n ≪ r × c. However, some properties of G make it more difficult to achieve grid map segmentation into small region sets: i) The orientation of the environment with respect to the axes of the map is unknown. Thus, rectangles collinear with the map axes do not lead to small region sets. ii) The map may contain noise due to wrong sensor measurements. Thus, boundaries of regions representing walls, objects, etc. may appear to be curved and it is hard to select line segments for separation. Regions that are neither fully free nor occupied will result. iii) The environment is not necessarily fully explored. Unexplored or sporadically explored regions will add additional noise. Under these considerations, we restrict ourselves to partitions defined by an angle and two orthogonal sets of segmenting lines (cuts) which define a set of disjoint rectangles. Even with all restrictions imposed so far, most grid-based occupancy maps representing non-trivial or real-world environments still allow a very large number of rectangular segmentations R. We need at least an informal criterion to measure the quality of a segmentation. Given our overall goals, the set of regions R should provide an intuitive segmentation, which can be more easily matched to symbolic concepts like rooms, tables, walls, etc.

The Algorithm: We developed an algorithm that tries to deal with the problems described above. The algorithm, called the wall histogram method, is based on methods developed in computer vision. It consists of the following steps:
– Identification of the main orientation α ∈ [0; π/2) of the environment within the coordinate system of the grid map G.
– Application of morphologic operations on the grid map G to extract the obstacle boundaries.
– Identification of long edges by building histograms over the extracted boundaries.
The segmentation R is then constructed by cutting the grid map G at the positions found in the previous step, at the angle α. The details of the steps above are as follows: i) blurring using a Gauss filter (Figure 4.2a); ii) application of a Sobel operator in the x and y directions (Figures 4.2b and 4.2c); iii) recombination of the gradient direction and length information from the two Sobel filtered images (Figures 4.2d and 4.2e); iv) depending on the length of the gradient vectors, the directions are accumulated in a histogram, which is used to determine the most frequent direction of edges in the image part (Figure 4.2f).
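Steps i)–iv) amount to a gradient-weighted orientation histogram. The following sketch reproduces that pipeline with standard image-processing operators; the exact filters and parameters used by the authors are not specified here, so the choices below (scipy operators, 1-degree bins, the toy rotated map) are assumptions for illustration only.

```python
import numpy as np
from scipy import ndimage

def dominant_wall_orientation(grid, sigma=1.0, bins=90):
    """Estimate the main orientation alpha in [0, 90) degrees of an occupancy
    grid (1 = occupied, 0 = free), following steps i)-iv) above."""
    blurred = ndimage.gaussian_filter(grid.astype(float), sigma)   # i) Gauss filter
    gy = ndimage.sobel(blurred, axis=0)                            # ii) Sobel in y
    gx = ndimage.sobel(blurred, axis=1)                            # ii) Sobel in x
    magnitude = np.hypot(gx, gy)                                   # iii) gradient length
    angle = np.degrees(np.arctan2(gy, gx)) % 90.0                  # iii) direction, folded to [0, 90)
    hist, edges = np.histogram(angle, bins=bins, range=(0.0, 90.0),
                               weights=magnitude)                  # iv) length-weighted histogram
    return edges[np.argmax(hist)]

# Usage sketch on a toy map whose single wall is rotated by 20 degrees.
grid = np.zeros((200, 200))
grid[90:110, 20:180] = 1                                           # a thick horizontal wall
grid = ndimage.rotate(grid, 20, reshape=False, order=1)            # rotate the whole map
print(dominant_wall_orientation(grid, sigma=2.0))                  # expected to peak near 20
```

Folding the gradient directions modulo 90 degrees groups the two orthogonal wall axes into one peak, which matches the restriction of α to [0; π/2) above.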
Wall hypotheses are created by first smoothing the image using a morphologic operator which is 'wall shaped'. Afterwards, the edges are extracted by growing the image with another morphologic operator and subtracting the original from the result. Then simple projections of the edge image along the dominant direction, as extracted above, and orthogonal to it are accumulated in histograms, which have peaks in places which are good hypotheses for wall or obstacle boundaries (Figure 6). These peaks are extracted and their positions mark the boundaries where the grid map should be split into adjacent regions.

Figure 4.2: a) Blurred grid map; b) Horizontal Sobel directions; c) Vertical Sobel directions; d) Recombined directions; e) Confidence values; f) Angle histogram.

Preliminary Results: The method was tested on maps created by simulations and by real robots using a preliminary version of the lower-level Dynamo mapping architecture. To test the robustness of the direction extraction algorithm the images were rotated artificially. This is important, as the robot is not guaranteed to start map building with its coordinate system aligned to walls. The method is also robust against rectangular rooms that are placed in a non-rectangular manner.

Fig. 6. Histogram generation.

As shown in Figure 7, the extraction of wall hypotheses already works quite well. The remaining ambiguous hypotheses are reduced by doing the calculation only on parts of the map that have a reasonable size, which reduces the hypotheses to the few relevant ones in the area in question.

Fig. 7. Some results on grid maps with different orientations.

5 Conclusions

We presented a framework for hybrid spatial representations in mobile robotics, which provides rich opportunities to study neurosymbolic integration. For one of these problems, the segmentation of occupancy grid maps into sets of regions, several approaches have been described. Currently, no single approach produces satisfactory results for all spatial aspects, even for static environments: the approaches based on self-organizing maps are well-suited for generating topological representations for free space, but have severe problems with occupied space. Both methods, as well as Voronoi skeletization, result in unintuitive tessellations of free space. Our own method gives better results in this regard, but provides no suitable topologies for navigating in free space. However, we believe that suitable variations and combinations of these methods will produce much better results and are therefore working on such implementations.

References

[1] Michael Beetz and Drew McDermott. Declarative goals in reactive plans. In James Hendler, editor, Proceedings of AIPS-92: Artificial Intelligence Planning Systems, pages 3–12, San Mateo, CA, 1992. Morgan Kaufmann.
[2] Alberto Elfes. Dynamic control of robot perception using stochastic spatial models.
In Proceedings of the International Workshop on Information Processing in Mobile Robots, March 1991.
[3] Matthias O. Franz, Bernhard Schölkopf, Philipp Georg, Hanspeter A. Mallot, and Heinrich H. Bülthoff. Learning view graphs for robot navigation. In W. Lewis Johnson and Barbara Hayes-Roth, editors, Proceedings of the 1st International Conference on Autonomous Agents, pages 138–147, New York, February 5–8 1997. ACM Press.
[4] Bernd Fritzke. A growing neural gas network learns topologies. Advances in Neural Information Processing Systems, 7, 1995.
[5] S. Goonatilake and S. Khebbal, editors. Intelligent Hybrid Systems. John Wiley & Sons, Ltd, 1995.
[6] Melanie Hilario. An overview of strategies for neurosymbolic integration. In Ron Sun and Frederic Alexandre, editors, IJCAI-95 Workshop on Connectionist-Symbolic Integration: From Unified to Hybrid Approaches, pages 1–6, Montreal, August 1995.
[7] Vasant Honovar and Leonard Uhr, editors. Artificial Intelligence and Neural Networks: Steps toward Principled Integration. Academic Press, 1994.
[8] Abraham Kandel and Gideon Langholz, editors. Hybrid Architectures for Intelligent Systems. CRC Press, 1992.
[9] J. Leonard and H. F. Durrant-Whyte. Mobile robot localization by tracking geometric beacons. IEEE Transactions on Robotics and Automation, 7(3):376–382, 1991.
[10] Larry R. Medskers, editor. Hybrid Intelligent Systems. Kluwer Academic Publishers, Norwell, 1995.
[11] David Pierce and Benjamin Kuipers. Map learning with uninterpreted sensors and effectors. Technical Report AI96-246, University of Texas, Austin, 1997.
[12] D.A. Randell, Z. Cui, and A.G. Cohn. A spatial logic based on regions and connection. In Proceedings of the Third International Conference on Knowledge Representation and Reasoning, pages 165–176, Cambridge, MA, USA, October 1992.
[13] Michael Recce and K. D. Harris. Memory for places: A navigational model in support of Marr's theory of hippocampal function. Hippocampus, 6:735–748, 1996.
[14] Ron Sun and L. Bookman, editors. Computational Architectures Integrating Neural and Symbolic Processes. Kluwer Academic Publishers, Boston, 1995.
[15] Jun Tani and Naohiro Fukumura. Learning goal-directed sensory-based navigation of a mobile robot. Neural Networks, 7(3):553–563, 1994.
[16] Sebastian Thrun and Arno Bücken. Learning maps for indoor mobile robot navigation. Technical Report CMU-CS-96-121, Carnegie Mellon University, April 1996.
[17] Jules M. Vleugels, Joost Kok, and Mark H. Overmars. Motion planning using a colored Kohonen network. Technical Report RUU-CS-93-39, Department of Computer Science, Utrecht University, 1993.
[18] Stefan Wermter, editor. Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language Processing: IJCAI-95 Workshop, volume 1040 of Lecture Notes in Computer Science. Springer Verlag, New York, 1996.