Finite-state Methods for Multimodal Parsing and Integration

Michael Johnston and Srinivas Bangalore
AT&T Labs - Research, Shannon Laboratory, 180 Park Ave, Florham Park, NJ 07932, USA
johnston@research.att.com, srini@research.att.com

1 Introduction

Finite-state machines have been extensively applied to many aspects of language processing, including speech recognition (Pereira and Riley, 1997; Riccardi et al., 1996), phonology (Kaplan and Kay, 1994; Karttunen, 1991), morphology (Koskenniemi, 1984), chunking (Abney, 1991; Joshi and Hopely, 1997; Bangalore, 1997), parsing (Roche, 1999), and machine translation (Bangalore and Riccardi, 2000). In Johnston and Bangalore (2000) we showed how finite-state methods can be employed in a new and different task: parsing, integration, and understanding of multimodal input. Our approach addresses the particular case of multimodal input to a mobile device where the modes are speech and gestures made on the display with a pen, but it has far broader application. The approach uses a multimodal grammar specification which is compiled into a finite-state device running on three tapes. This device takes as input a speech stream and a gesture stream and outputs their combined meaning. The approach overcomes the computational complexity of unification-based approaches to multimodal processing (Johnston, 1998), enables tighter coupling with speech recognition, and enables straightforward composition with other kinds of language processing, such as finite-state translation (Bangalore and Riccardi, 2000).

In this paper, we present a revised and updated finite-state model for multimodal language processing which incorporates a number of significant advancements to our approach. We show how gesture symbols can be decomposed into attributes in order to reduce the alphabet of gesture symbols and to enable underspecification of required gestures. We present a new mechanism for abstracting over gestural content that cannot be captured in the finite-state machine; this overcomes a number of difficulties with the buffering/variable mechanism we used previously. We address the problems relating to deictic numerals (Johnston, 2000) by introducing a new mechanism for aggregation of adjacent gestures. The examples we use are drawn from a new, more sophisticated multimodal application which provides mobile access to city information such as the locations of restaurants and theatres (we will demonstrate this application as part of our presentation). We will also draw examples from other applications as needed. In addition to addressing multimodal rather than unimodal input, another novel aspect of our approach is that we use the finite-state representation to build the meaning representation. We first present the basics of the finite-state approach and then discuss each of the innovations in turn.

2 Finite-state models for multimodal processing

Multimodal integration involves merging semantic content from multiple streams to build a joint interpretation for a multimodal utterance. We employ a finite-state device to parse multiple input streams and to combine their content into a single semantic representation. For an interface with n modes, a finite-state device operating over n+1 tapes is needed. The first n tapes represent the input streams and the (n+1)th is an output stream representing their composition.
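To make the n+1-tape arrangement concrete, the following minimal Python sketch (our own illustration for this discussion, not code from the paper or an existing toolkit) walks a three-tape machine, jointly consuming a speech stream and a gesture stream while emitting the meaning tape. The arcs are loosely modeled on the grammar fragment and machines presented in the figures below.

```python
# A minimal 3-tape finite-state machine: each arc carries one symbol per
# tape (speech W, gesture G, meaning M); "eps" consumes/emits nothing.
EPS = "eps"

# (state, w, g, m, next_state): an illustrative linear fragment covering
# "show chinese restaurants in this area" with an area gesture.
ARCS = [
    (0, "show", EPS, "<show>", 1),
    (1, EPS, EPS, "<restaurant>", 2),
    (2, EPS, EPS, "<cuisine>", 3),
    (3, "chinese", EPS, "chinese", 4),
    (4, EPS, EPS, "</cuisine>", 5),
    (5, "restaurants", EPS, EPS, 6),
    (6, EPS, EPS, "<location>", 7),
    (7, "in", EPS, EPS, 8),
    (8, "this", "G", EPS, 9),
    (9, "area", "area", EPS, 10),
    (10, EPS, "location", EPS, 11),
    (11, EPS, "SEM", "SEM", 12),
    (12, EPS, EPS, "</location>", 13),
    (13, EPS, EPS, "</restaurant>", 14),
    (14, EPS, EPS, "</show>", 15),
]
FINAL = {15}

def parse(speech, gesture, state=0, i=0, j=0, meaning=()):
    """Yield the meaning tape for every joint parse of the two input streams."""
    if state in FINAL and i == len(speech) and j == len(gesture):
        yield meaning
    for src, w, g, m, dst in ARCS:
        if src != state:
            continue
        if w != EPS and (i >= len(speech) or speech[i] != w):
            continue
        if g != EPS and (j >= len(gesture) or gesture[j] != g):
            continue
        yield from parse(speech, gesture, dst,
                         i + (w != EPS), j + (g != EPS),
                         meaning + ((m,) if m != EPS else ()))

speech = "show chinese restaurants in this area".split()
gesture = ["G", "area", "location", "SEM"]
print(next(parse(speech, gesture), None))
# ('<show>', '<restaurant>', '<cuisine>', 'chinese', '</cuisine>',
#  '<location>', 'SEM', '</location>', '</restaurant>', '</show>')
```

Note that the specific gestural content surfaces here only as the placeholder SEM; Section 4 describes how the actual content is reinserted.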
In the case of speech and pen input there are three tapes: one for speech, one for pen gesture, and a third for their combined meaning. In the city information application used for examples in this paper, users issue spoken commands such as "tell me about these two restaurants" while gesturing at icons on a dynamically generated map display, or "show cheap Italian restaurants in this neighborhood" while drawing an area on the display. The structure and interpretation of multimodal commands of this kind is captured declaratively in a multimodal context-free grammar; Figure 1 presents a fragment capable of handling such commands.

The nonterminals in the multimodal grammar are atomic symbols. The multimodal aspects of the grammar become apparent in the terminals. Each terminal contains three components W:G:M corresponding to the n+1 tapes, where W is the spoken language stream, G is the gesture stream, and M is the combined meaning. The epsilon symbol (eps) is used to indicate when one of these is empty in a given terminal. The symbols in W are words from the speech stream. The symbols in G are of two types: sequences of symbols such as G area location indicate the presence of a particular kind of gesture in the gesture stream, while symbols like SEM are used as references to entities referred to by the gesture. In our previous application, the semantic representation consisted of predicates. In the current application we are experimenting with the use of XML as the meaning representation: the meaning tape contains symbols which, when concatenated together, form coherent XML expressions.

Our approach makes certain simplifying assumptions with respect to temporal constraints. In multi-gesture utterances the primary function of temporal constraints is to force an order on the gestures. If you say "move this here" and make two gestures, the first gesture corresponds to "this" and the second gesture to "here". Our multimodal grammars encode order but do not impose explicit temporal constraints; however, general temporal constraints between speech and the first gesture can be enforced before the FSA is applied.

  S         → COMMAND
  COMMAND   → show:eps:<show> NP eps:eps:</show>
  COMMAND   → tell:eps:<info> me:eps:eps about:eps:eps DEICTICNP eps:eps:</info>
  NP        → eps:eps:<restaurant> CUISMOD restaurants:eps:eps LOCMOD eps:eps:</restaurant>
  DEICTICNP → DDETSG SELECTION eps:1:eps RESTSG eps:eps:<restaurant> Entry eps:eps:</restaurant>
  DEICTICNP → DDETPL SELECTION NUM RESTPL eps:eps:<restaurant> Entry eps:eps:</restaurant>
  SELECTION → eps:area:eps eps:selection:eps
  CUISMOD   → eps:eps:<cuisine> CUISINE eps:eps:</cuisine>
  CUISINE   → italian:eps:italian
  CUISINE   → chinese:eps:chinese
  LOCMOD    → eps:eps:<location> LOCATION eps:eps:</location>
  LOCMOD    → eps:eps:eps
  LOCATION  → in:eps:eps this:G:eps area:area:eps eps:location:eps Entry
  LOCATION  → along:eps:eps this:G:eps route:line:eps eps:location:eps Entry
  DDETSG    → this:G:eps
  DDETPL    → these:G:eps
  NUM       → two:2:eps
  NUM       → three:3:eps
  RESTSG    → restaurant:restaurant:eps
  RESTPL    → restaurants:restaurant:eps
  Entry     → eps:SEM:SEM

Figure 1: Multimodal grammar fragment

A multimodal CFG (MCFG) can be defined formally as a quadruple ⟨N, T, P, S⟩, where N is the set of nonterminals; P is the set of productions of the form A → γ, with A ∈ N and γ ∈ (N ∪ T)*; S is the start symbol for the grammar; and T is the set of terminals of the form (W ∪ {eps}) : (G ∪ {eps}) : M*, where W is the vocabulary of speech and G is the vocabulary of gesture: G = GestureSymbols ∪ EventSymbols, with GestureSymbols = {G, area, location, restaurant, 1, ...} and EventSymbols = {SEM}. M is the vocabulary used to represent meaning and includes the event symbols (EventSymbols ⊆ M).
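The definition above translates directly into a data representation. The sketch below, our own illustrative encoding rather than code from the paper, writes a few of the Figure 1 productions with terminals as (W, G, M) triples and reads the three tapes off one derivation:

```python
EPS = "eps"

# Nonterminals are strings; terminals are (W, G, M) triples.
P = {
    "S":       [["COMMAND"]],
    "COMMAND": [[("show", EPS, "<show>"), "NP", (EPS, EPS, "</show>")]],
    "NP":      [[(EPS, EPS, "<restaurant>"), "CUISMOD",
                 ("restaurants", EPS, EPS), "LOCMOD",
                 (EPS, EPS, "</restaurant>")]],
    "CUISMOD": [[(EPS, EPS, "<cuisine>"), "CUISINE", (EPS, EPS, "</cuisine>")]],
    "CUISINE": [[("chinese", EPS, "chinese")], [("italian", EPS, "italian")]],
    "LOCMOD":  [[(EPS, EPS, EPS)]],
}

def expand(symbol):
    """Leftmost expansion taking the first alternative everywhere;
    returns the terminal (W, G, M) sequence of one derivation."""
    if isinstance(symbol, tuple):
        return [symbol]
    return [t for part in P[symbol][0] for t in expand(part)]

terminals = expand("S")
for k, name in enumerate(("speech:", "gesture:", "meaning:")):
    print(name, [t[k] for t in terminals if t[k] != EPS])
# speech:  ['show', 'chinese', 'restaurants']
# gesture: []
# meaning: ['<show>', '<restaurant>', '<cuisine>', 'chinese',
#           '</cuisine>', '</restaurant>', '</show>']
```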
In general, a context-free grammar can be approximated by an FSA (Pereira and Wright, 1997; Nederhof, 1997). The transition symbols of the approximating FSA are the terminals of the context-free grammar, and in the case of the multimodal CFG defined above these terminals contain three components, W, G, and M. The multimodal CFG fragment in Figure 1 translates into the FSA in Figure 2, a three-tape finite-state device capable of composing two input streams into a single output semantic representation stream.

While a three-tape finite-state automaton is feasible in principle (Rosenberg, 1964), currently available tools for finite-state language processing (Mohri et al., 1998; van Noord, 1997) only support finite-state transducers (FSTs), which have two tapes. Furthermore, speech recognizers typically do not support the use of a three-tape FSA as a language model. In order to implement our approach, we convert the three-tape FSA (Figure 2) into an FST by decomposing each transition symbol into an input component G × W and an output component M, resulting in a function T : (G × W) → M. This corresponds to a transducer in which gesture symbols and words are on the input tape and the meaning is on the output tape. The domain of this function T can be further curried to yield a transducer R : G → W. This transducer captures the constraints that gesture places on the speech stream, and we use it as a language model for constraining the speech recognizer based on the recognized gesture string. In the following sections we discuss several advancements in our finite-state approach to understanding multimodal input.

3 Gesture symbol complexes

In our original approach, atomic symbols were used in the multimodal grammar and the corresponding machine to represent different types of gestures. For example, Gp was used for a gestural reference to a person, Go for a reference to an organization, 2Gp for a gestural reference to a set of two people, and so on. In our current application we are exploring the idea of decomposing the gesture symbols into sequences of symbols, each of which conveys a specific attribute of the content such as type or number. This facilitates reference to sets of specific symbols and limits the number of symbols needed for gesture. It also plays an important role in enabling the storage of specific gesture contents (see Section 4) and aggregation (see Section 5).

The gesture symbol complexes follow the basic form G FORM MEANING (NUMBER TYPE) SEM. FORM indicates the physical form of the gesture and has values such as area, point, line, and arrow. MEANING indicates the specific meaning of that form: for example, an area can be either a location or a selection. NUMBER and TYPE are found only with selections; they indicate the number of entities selected (1, 2, 3, many) and the specific type of entity (restaurant, theatre). The TYPE mixed is used for gestures at collections of entities of varied types. In order to facilitate abstraction over and recomposition of specific gestural content (see Section 4), the specific content is mapped to a distinguished symbol SEM, while the other attributes of the gesture are mapped to themselves.
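This decomposition is easy to mechanize. The sketch below is our own illustrative code, not an API from the paper: it maps a recognized gesture to its symbol complex and pairs each symbol with its counterpart on the interpretation side of the I:G transducer described in the next section.

```python
def gesture_symbols(form, meaning, content, number=None, etype=None):
    """Build the symbol complex G FORM MEANING (NUMBER TYPE) SEM together
    with the parallel interpretation tape of the I:G transducer; content is
    the specific semantic content (entity ids or points) abstracted to SEM."""
    g = ["G", form, meaning]
    if meaning == "selection":      # NUMBER and TYPE occur only in selections
        g += [str(number), etype]
    g.append("SEM")
    # Every I symbol matches its G symbol, except that the specific content
    # sits opposite the reserved symbol SEM.
    i = g[:-1] + [content]
    return list(zip(i, g))

# The selection path of Figure 3: an area gesture around two restaurants.
print(gesture_symbols("area", "selection", "[id1,id2]", 2, "restaurant"))
# [('G', 'G'), ('area', 'area'), ('selection', 'selection'),
#  ('2', '2'), ('restaurant', 'restaurant'), ('[id1,id2]', 'SEM')]
```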
As an example, if the user draws an area on the screen containing two restaurants with identifiers id1 and id2, the resulting gesture lattice is as in Figure 3. If the speech is "show me chinese restaurants in this neighborhood", the first path is chosen when the multimodal finite-state device is applied; if the speech is "tell me about these two restaurants", the second, selection, path is chosen. If instead the user circles a restaurant and a theatre, the lattice is as in Figure 4: if they say "tell me about this theatre" the third path is taken, but if they say "tell me about these two" the fourth path is taken. This approach allows for cases where a user circles several entities and selects a specific one by type. If we did not split the symbols, we would need a G area location symbol, a G area selection 1 restaurant symbol, a G area selection 2 restaurant symbol, and so on, significantly increasing the alphabet of gesture symbols. The split symbols also allow a more perspicuous representation of more general categories: for example, if "place" in "tell me about this place" can refer to either a restaurant or a theatre, then it can be assigned both arcs in the lattice. As shown in the next two sections, the splitting of gesture symbols also plays an important role in the abstraction and recovery of specific gestural content and in the process of aggregation.

4 Recovering gesture content by composition

In order to capture multimodal integration using finite-state methods, it is necessary to abstract over certain aspects of the gestural content. For example, it is not possible to capture all of the different possible sequences of coordinates that occur in gestures so that they can be copied from the gesture input tape to the meaning output tape. In our previous approach we assigned the specific content of gestures to a series of numbered variables e1, e2, e3, .... There are a number of limitations to this approach. The number of gestural inputs that can be handled is limited by the number of variables used, and if a large number are used, the resulting multimodal finite-state device increases significantly in size, since an arc is needed for each variable in every place where gesture content can be copied onto the meaning tape. This becomes a significant problem when aggregation of gestures is considered (see Section 5 below), since a new variable is needed for each combination of gestures.

We have developed a new approach to this problem. The gestural input is represented as a transducer S : I → G, where G are gesture symbols and I are the specific interpretations (see Figure 4). The G side indicates the type of gesture and its properties. In any place where there is specific content, such as a list of entities or points, in the gesture stream, the symbol on G is the reserved symbol SEM, and the specific content is placed on the I side opposite SEM. All G symbols other than SEM match an identical symbol on I. In order to carry out the multimodal composition with the transducers R : G → W and T : (G × W) → M, the output projection G of S : I → G is taken. After composition we take a projection U : G → M of the resulting G_W:M machine; essentially, we factor out the speech information W. We then compose S and U, yielding an I:M transducer. To read off the meaning, we concatenate the symbols on the M side; if the M symbol is SEM, we instead take the I symbol for that arc.
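A minimal sketch of this read-off step, assuming the composed I:M machine has already been reduced to a single path represented as a list of (I, M) arc-label pairs (the representation and labels are our own illustration):

```python
def read_meaning(path):
    """Concatenate the M side of an I:M path; wherever M is the placeholder
    SEM, splice in the I symbol, i.e. the specific gestural content."""
    out = []
    for i_sym, m_sym in path:
        if m_sym == "eps":
            continue
        out.append(i_sym if m_sym == "SEM" else m_sym)
    return "".join(out)

# A path of the I:M machine obtained by composing the gesture transducer of
# Figure 3 with the G_W:M machine of Figure 6 (speech already factored out):
path = [("eps", "<show>"), ("eps", "<restaurant>"), ("eps", "<cuisine>"),
        ("eps", "chinese"), ("eps", "</cuisine>"), ("eps", "<location>"),
        ("(...points...)", "SEM"), ("eps", "</location>"),
        ("eps", "</restaurant>"), ("eps", "</show>")]
print(read_meaning(path))
# <show><restaurant><cuisine>chinese</cuisine>
#   <location>(...points...)</location></restaurant></show>
```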
As an example, when the user says "show chinese restaurants in this area" and the gesture is as in Figure 3, the resulting G_W:M machine is as in Figure 6. In order to compose this with Figure 3, the W component is removed. Reading off the meaning from the resulting I:M machine, we obtain the representation in Figure 7.

5 Aggregation for deictic numerals

Johnston (2000) examines the problems posed for the unification-based approach to multimodal parsing (Johnston, 1998) by deictic numeral expressions such as "these four restaurants". The problem is that there is a multitude of different possible sequences of gestures that are compatible with, and should be integrated with, this spoken phrase: the user might circle a set of four restaurants, circle four individual restaurants, circle two sets of two, circle one and then circle three, and so on. Capturing all of these possibilities in the spoken language grammar significantly increases its size and complexity, and any plural expression is made massively ambiguous. The alternative suggested in Johnston (2000) is to have the deictic numeral subcategorize for a plurality of the appropriate number and to predictively apply a set of gesture combination rules in order to combine elements of the gestural input into the appropriate pluralities.

In the finite-state approach this is achieved using a method we term aggregation, which serves as a pre-processing phase on the gesture input lattice. Additional branches representing combinations of adjacent elements are added to the lattice. This process needs to combine the specific semantic contents (SEM values) and so is carried out outside of the finite-state representation. An expression such as "these three restaurants" is assigned the gesture stream G 3 restaurant SEM by the multimodal grammar. (We are simplifying the gesture symbol complex here for ease of exposition; in each case, the area and selection symbols would come between the G and the number.) This will directly combine with a single area gesture containing three restaurants. In order to combine with a sequence of three separate gestures on single restaurants, aggregation must apply. The gesture lattice for three such gestures in a row is:

  G:G 1:1 restaurant:restaurant ([id1]):SEM
  G:G 1:1 restaurant:restaurant ([id2]):SEM
  G:G 1:1 restaurant:restaurant ([id3]):SEM

In pre-processing, the gesture lattice is parsed, adjacent selections of identical type are composed, and new branches are added to the lattice. In this case, three more gestures are added, as in Figure 5. This can be thought of as the closure over the gesture lattice of a function which combines adjacent gestures of identical type. When this gesture lattice is combined with the spoken input by the multimodal finite-state device, the speech picks out the G 3 restaurant path in the gesture lattice. We term this kind of aggregation type-specific. We also use type non-specific aggregation to generate aggregates of mixed type. For example, in the case where the user says "tell me about these two" and circles a restaurant and then a theatre, type non-specific aggregation applies to combine the two gestures into an aggregate of mixed type, G:G 2:2 mixed:mixed ([id4,id5]):SEM, which is able to combine with "these two". In order to preserve an indication of the original sequence of gestures, the newly added paths can be assigned lower weights.
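The following sketch illustrates the aggregation step on a simplified representation: here the gesture stream is a flat sequence of selection gestures rather than a lattice, and the names and data structures are our own assumptions, not the paper's implementation.

```python
def aggregate(gestures):
    """Given a sequence of adjacent selection gestures as (number, type, ids)
    triples, add every aggregate of two or more adjacent gestures: runs of a
    single type keep that type (type-specific aggregation); mixed runs get
    type 'mixed' (type non-specific aggregation)."""
    added = []
    n = len(gestures)
    for start in range(n):
        for end in range(start + 2, n + 1):      # runs of length >= 2
            run = gestures[start:end]
            types = {t for _, t, _ in run}
            etype = types.pop() if len(types) == 1 else "mixed"
            number = sum(k for k, _, _ in run)
            ids = [i for _, _, idlist in run for i in idlist]
            added.append((number, etype, ids))
    return gestures + added    # in practice, added paths get lower weights

# Three single-restaurant gestures in a row (cf. Figure 5):
gs = [(1, "restaurant", ["id1"]),
      (1, "restaurant", ["id2"]),
      (1, "restaurant", ["id3"])]
for g in aggregate(gs):
    print(g)
# adds (2, 'restaurant', ['id1', 'id2']), (2, 'restaurant', ['id2', 'id3']),
# and (3, 'restaurant', ['id1', 'id2', 'id3'])
```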
6 Finite-state semantic representation

A significant aspect of our approach is that, in addition to capturing the structure of the input, we also build the semantic representation within the finite-state framework. In our original application (Johnston and Bangalore, 2000), we generated a Prolog-like logical representation, for example: email([person(id1),organization(id2)]). In our city information application we are exploring the use of an XML-based meaning representation. The symbols on the meaning tape can be concatenated together to form coherent XML expressions, which are then evaluated by the underlying application. For example, "show chinese restaurants in this neighborhood" yields the XML in Figure 7.

7 Conclusion

We have presented an approach to multimodal language understanding in which speech and gesture inputs are assigned a combined meaning by a finite-state device. This work is novel not just in its application to multimodal input but also in that it assigns the semantic representation using a finite-state device. We have highlighted a number of recent advancements in the approach since Johnston and Bangalore (2000). Gesture symbols have been split into gesture symbol complexes which allow for a broader range of expression. The specifics of gestural input are abstracted over in a gesture lattice and reinserted into the selected multimodal interpretation without the use of variables or buffers. We have addressed the problem of deictic numerals raised by Johnston (2000) using an aggregation mechanism which acts as a pre-processing stage on the gesture input, and we have explored the use of XML as an output meaning representation. In ongoing work we are exploring methods for automatically generating the multimodal grammar from a feature-based representation, and examining how the machine can be used in reverse for multimodal generation tasks.

References

Steven Abney. 1991. Parsing by chunks. In Robert Berwick, Steven Abney, and Carol Tenny, editors, Principle-Based Parsing. Kluwer Academic Publishers.

Srinivas Bangalore and Giuseppe Riccardi. 2000. Stochastic finite-state models for spoken language machine translation. In Proceedings of the Workshop on Embedded Machine Translation Systems.

Srinivas Bangalore. 1997. Complexity of Lexical Descriptions and its Relevance to Partial Parsing. Ph.D. thesis, University of Pennsylvania, Philadelphia, PA.

Michael Johnston and Srinivas Bangalore. 2000. Finite-state multimodal parsing and understanding. In Proceedings of COLING 2000, Saarbruecken, Germany.

Michael Johnston. 1998. Unification-based multimodal parsing. In Proceedings of COLING-ACL 1998, pages 624-630, Montreal, Canada.

Michael Johnston. 2000. Deixis and conjunction in multimodal systems. In Proceedings of COLING 2000, Saarbruecken, Germany.

Aravind Joshi and Philip Hopely. 1997. A parser from antiquity. Natural Language Engineering, 2(4).

Ronald M. Kaplan and Martin Kay. 1994. Regular models of phonological rule systems. Computational Linguistics, 20(3):331-378.

Lauri Karttunen. 1991. Finite-state constraints. In Proceedings of the International Conference on Current Issues in Computational Linguistics, Universiti Sains Malaysia, Penang, Malaysia.

Kimmo Koskenniemi. 1984. Two-level morphology: a general computational model for word-form recognition and production. Ph.D. thesis, University of Helsinki.

Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. 1998. A rational design for a weighted finite-state transducer library. Lecture Notes in Computer Science 1436. Springer, Berlin/New York.
Mark-Jan Nederhof. 1997. Regular approximations of CFLs: a grammatical view. In Proceedings of the International Workshop on Parsing Technology, Boston.

Fernando C. N. Pereira and Michael D. Riley. 1997. Speech recognition by composition of weighted finite automata. In E. Roche and Y. Schabes, editors, Finite State Devices for Natural Language Processing, pages 431-456. MIT Press, Cambridge, Massachusetts.

Fernando C. N. Pereira and R. Wright. 1997. Finite-state approximation of phrase structure grammars. In E. Roche and Y. Schabes, editors, Finite State Devices for Natural Language Processing. MIT Press, Cambridge, Massachusetts.

Giuseppe Riccardi, R. Pieraccini, and E. Bocchieri. 1996. Stochastic automata for language modeling. Computer Speech and Language, 10(4):265-293.

Emmanuel Roche. 1999. Finite state transducers: parsing free and frozen sentences. In András Kornai, editor, Extended Finite State Models of Language. Cambridge University Press.

A. L. Rosenberg. 1964. On n-tape finite state acceptors. In FOCS, pages 76-81.

Gertjan van Noord. 1997. FSA Utilities: a toolbox to manipulate finite-state automata. In Darrell Raymond, Derick Wood, and Sheng Yu, editors, Automata Implementation, Lecture Notes in Computer Science 1260. Springer Verlag.

Figure 2: Multimodal three-tape FSA compiled from the grammar fragment in Figure 1 (state diagram not reproduced).

Figure 3: Gesture lattice for an area gesture containing two restaurants, with I:G paths:
  G:G area:area location:location (..points..):SEM
  G:G area:area selection:selection 2:2 restaurant:restaurant ([id1,id2]):SEM

Figure 4: Gesture lattice for an area gesture containing a restaurant and a theatre, with I:G paths:
  G:G area:area location:location (..points..):SEM
  G:G area:area selection:selection 1:1 restaurant:restaurant ([id1]):SEM
  G:G area:area selection:selection 1:1 theatre:theatre ([id2]):SEM
  G:G area:area selection:selection 2:2 mixed:mixed ([id1,id2]):SEM

Figure 5: Aggregated lattice for three adjacent single-restaurant gestures, with the original paths
  G:G 1:1 restaurant:restaurant ([id1]):SEM
  G:G 1:1 restaurant:restaurant ([id2]):SEM
  G:G 1:1 restaurant:restaurant ([id3]):SEM
and the added aggregate branches
  G:G 2:2 restaurant:restaurant ([id1,id2]):SEM
  G:G 2:2 restaurant:restaurant ([id2,id3]):SEM
  G:G 3:3 restaurant:restaurant ([id1,id2,id3]):SEM

Figure 6: G_W:M machine, a linear transducer with the transitions:
  eps_show:<show> eps_eps:<restaurant> eps_eps:<cuisine> eps_chinese:chinese
  eps_eps:</cuisine> eps_restaurants:eps eps_eps:<location> eps_in:eps
  G_this:eps area_area:eps location_eps:eps SEM_eps:SEM eps_eps:</location>
  eps_eps:</restaurant> eps_eps:</show>

Figure 7: Sample XML output:
  <show>
    <restaurant>
      <cuisine>chinese</cuisine>
      <location>(...points...)</location>
    </restaurant>
  </show>