Publicly available. Published by De Gruyter, February 25, 2016.

Remembering a Conversation – A Conversational Memory Architecture for Embodied Conversational Agents

  • Miguel Elvir, Avelino J. Gonzalez, Christopher Walls and Bryan Wilder

Abstract

This paper addresses the role of conversational memory in Embodied Conversational Agents (ECAs). It describes an investigation into developing such a memory architecture and integrating it into an ECA. ECAs are virtual agents whose purpose is to engage in conversations with human users, typically through natural language speech. While several works in the literature seek to produce viable ECA dialog architectures, only a few authors have addressed the episodic memory architectures in conversational agents and their role in enhancing their intelligence. In this work, we propose, implement, and test a unified episodic memory architecture for ECAs. We describe a process that determines the prevalent contexts in the conversations obtained from the interactions. The process presented demonstrates the use of multiple techniques to extract and store relevant snippets from long conversations, most of whose contents are unremarkable and need not be remembered. The mechanisms used to store, retrieve, and recall episodes from previous conversations are presented and discussed. Finally, we test our episodic memory architecture to assess its effectiveness. The results indicate moderate success in some aspects of the memory-enhanced ECAs, as well as some work still to be done in other aspects.

1 Introduction and Background

Oral communication – a conversation – may be considered one of the most personal forms of interpersonal communication. It can facilitate the transfer of information and knowledge from one individual to another when the information/knowledge being transferred is not overly voluminous. Most humans have the ability to store and retrieve from memory information or knowledge that was obtained from a conversation.

Chatbots are intelligent computer systems that can engage in conversation with humans. If a chatbot is embodied as an animated entity in the form of a virtual character, it is then called an Embodied Conversational Agent (ECA). Chatbots and ECAs attempt to anthropomorphize the discourses between computers and humans. By virtue of their embodiment, ECAs can be more lifelike than disembodied chatbots. Virtual humans are ECAs designed to closely resemble human beings – often, specific individuals. Therefore, ECAs and specifically virtual humans are the focus of our discussion here. We refer to both generically (although technically incorrectly) as ECAs.

Although dramatically improving over the last few years, ECAs do not yet, in general, effectively communicate with us via spoken natural language, nor do they effectively store and manage information collected from such interactions and retrieve this information for use in future dialogs. Under the overall theme of building natural ECAs, we seek here to develop conversational memory – a means for the ECA to remember information told to the ECA by a human conversant earlier in the same conversation or in previous conversations held in the not too distant past. Therefore, conversational memory in this discussion refers to the representation, storage, and retrieval of information and/or knowledge acquired during a multiparty oral conversation. The context of these conversations is associated with the conduct of business, whether formal (“How much is that car?”) or informal (“Where are we going for dinner tonight?”). In this context, remembering prior information mentioned by the human in the course of the conversation can be critical to the effectiveness as well as the “naturalness” of the conversation. For example, if the human tells the ECA “my flight home is at 4:00 this afternoon,” then the ECA should not later in the same conversation ask the human to attend a meeting at 4:30 PM.

1.1 The Temporal and Functional Aspects of Memory

Memory has two aspects of interest to us here – the temporal and the functional. From the temporal standpoint, memory can be long-term or short-term. Short-term memory often takes the form of working or belief memory that serves to temporarily store information/knowledge relevant to the current situation. It has a relatively short life (minutes, up to a few days). Long-term memory represents (semi-)permanent memory.

From a functional perspective, humans are said to have three forms of memory: procedural, declarative/semantic, and episodic. These can be briefly summarized as follows: (i) procedural – stores the knowledge necessary to perform tasks, e.g. how to solve a math problem, how to drive a car; (ii) declarative/semantic – consists of information about the world, e.g. “Columbus discovered America in 1492”; (iii) episodic – sometimes referred to as autobiographical memory – maintains a record of the agent’s experiences as perceived by that agent. Autobiographic memory is often used interchangeably with episodic memory in the literature. However, Conway [10] specifically distinguishes the two, stating that episodic and autobiographical memory form two separate memory systems. Episodic memory is contextually dependent and maintains knowledge about specific events [39, 54], for example, “I saw the Mona Lisa when I visited the Louvre last year.” We focus on short-term episodic memory in this work, although we define short term as lasting for days.

1.2 Our Objective and General Approach

Our objective is to create a conversational memory architecture for ECAs that can permit the ECA to remember important aspects (the gist) of a conversation and retrieve them for use when contextually appropriate. Our episodic memory architecture tackles recollection across one or more episode instances. There are three issues related to this task: (i) knowing what to remember, (ii) how to store it efficiently, and (iii) knowing what to retrieve in the context of a conversation and when to do so appropriately. Our architecture uses a combination of existing openly available software and our own special-purpose software to create an effective memory system for conversational memory in ECAs. We focus on the technical aspects of building an episodic memory system for ECAs rather than on the relevant cognitive issues.

It is commonly accepted that a memory system – whether human or electronic – cannot keep in memory all events experienced in the exact form they were experienced. This would overwhelm our storage capacity as well as our ability to retrieve the memories. Therefore, we subconsciously select what information to keep and what to discard. Sun [49] asserts that the more extraordinary an experience is, the more likely we are to retain it permanently in memory. The emotional intensity of an experience also affects the likelihood of remembering it. This is why we all remember where we were when we heard traumatic news. The same goes for conversations. We do not remember conversations verbatim, but rather extract the important parts of a conversation and remember those. In effect, we get the gist of a conversation as well as important factoids about it (time of appointment, names of new contacts, etc.). The basis of our approach is to identify the important elements of a conversation, store them, and index them according to the context in which they were experienced. This context is remembered, and when it resurfaces (later in the same conversation or in a different future conversation), the memory is recalled and questions related to it can be answered.

We define context as highly tiered. That is, the overall context of a conversation is one thing (e.g. the topic, the conversant(s), the location), but we also recognize the presence of microcontexts throughout a conversation. For example, the overall context might be a discussion between two colleagues about a particular contract, while a microcontext within the same conversation could be the client’s love of golf and a reminder to take her for a round of golf soon.

1.3 Assumptions

We assume that all communications between the human and the ECA are verbal, rather than through visual observation or non-verbal sounds. Furthermore, we neglect variations in the tone of the words that may connote a specific affective state by the human (e.g. anger, frustration, supplication, etc.). While these are undeniably important aspects of conversational transactions, we leave them for future research. Furthermore, we leave for others the investigation of memory in such ECAs when applied to serve as long-term companions (companion agents), empathic agents, and game-based action agents. Moreover, we do not address forgetting, as alas, it is much too often a problem in human-to-human business conversations. Finally, although inspired by existing memory models, we do not claim that our memory architecture necessarily reflects or works in a manner akin to human memory. For these reasons, we refer to our work as an architecture and not as a model.

In Section 2, we briefly examine episodic memory models and architectures reported in the literature. In Section 3, we define our unified episodic memory architecture for ECAs that will enhance the naturalness and effectiveness of a conversation between human and ECA. Section 4 describes our experiments, while Section 5 concludes and summarizes.

2 Related Works

Memory architectures are typically integrated into conversational agent architectures. The former seek to enhance the latter by making available information obtained by the ECA as part of the same or a prior conversation. Many architectures for conversational (dialog) systems have been reported in the literature – too many to discuss here. Nevertheless, some recent works of note include IVELL by Hassani et al. [18] and its application to learning language skills [19], a multifunctional system by Planells et al. [45], AIDA by Banchs et al. [3], CLARA by D’Haro et al. [11], a multimodal system by Nio et al. [37], the SARA system by Niculescu et al. [36], an unnamed system by Shibata et al. [47], and the FLoReS system by Morbini et al. [35]. However, none of these architectures include any kind of memory about the conversation.

The literature contains descriptions of many models of conversational memory. These models mainly seek to reflect how the human brain implements memory, and they have evolved over the years as research has revealed more about human memory. While building a human-like model of memory was not our objective in this research, our treatment of memory would not be complete without discussing the recent literature on this subject. However, a complete review of the literature on memory models is beyond the scope of this article. See Norman et al. [38] for an excellent, although somewhat dated, review of the literature on episodic memory models.

2.1 Conversational Memory Models

Sun [49] discussed the CLARION model of memory, which includes episodic memory. Sun makes a case that there are three distinctions between types of memory – (i) implicit vs. explicit memory, (ii) action-centered vs. non-action-centered memory, and (iii) episodic vs. semantic. Lim [29] provides a review of the literature on memory models for companion agents. Lim et al. [30] proposed a means to retrieve autobiographical memories with what they call “spreading association,” all in the context of companion agents. However, no evaluation is provided. Campos and Paiva [8] presented their memory architecture called May, which was integrated into an agent. The main thrust of their work was for the agent to develop a sense of “… intimacy and companionship” with its human user. Brom et al. [7] proposed an episodic memory model for what they refer to as intelligent virtual agents that live and interact with a virtual world and must move about this world. However, they did not address conversational memory. Brom and Lukavsky [6] focused on defining what they call full episodic memory, and asked whether full episodic memory is truly necessary. They did not answer the question but proposed some interesting ideas on the subject. Faltersack et al. [13] proposed an episodic memory model (Ziggurat) as a step toward a general episodic memory model. Ziggurat made use of context for memory retrieval but the application is for action-based games, not conversation. Wang et al. [56] reported their use of neural networks to learn traces (episodes) from sensory information and environmental feedback. Their model (EM-ART) included forgetting. Their application was strictly for game agents, not conversation. Ho et al. [20] described a symbolic autobiographic memory model for use in mission-based environments. An interesting aspect of their work is that their agent was able to tell a story to other agents about its experiences stored in its autobiographical memory. Kim et al. [23] reported a memory system that recognizes and saves elements of the conversation to memory for later use in the same conversation. However, their memory model appears to be declarative/semantic and not episodic.

2.2 Memory in Cognitive Architectures

SOAR [26] and ACT-R [1] are two cognitive architectures commonly used today. Their respective creators correlate the use of episodic memory to beneficial behavior in agents [39]. SOAR and ACT-R both advocate the distinction between episodic and semantic memories [39, 48], as originally suggested by Tulving [53, 54], who opined that non-trivial tasks require aspects of both semantic and episodic memories. SOAR and ACT-R incorporate elements of episodic memory into existing declarative memory [17, 39, 48]. In effect, both efforts aim to store autobiographical data for factual declarative knowledge. SOAR and ACT-R operate in a generally similar fashion, although with some important differences whose discussion is tangential to our objectives.

2.3 Memory in Chatbots and ECAs

Kopp et al. [24] presented a multimodal ECA (Max) capable of utilizing short- and long-term memory structures. Max serves as a museum guide that interacts with users in an enjoyable and informative manner, and aims to engage them in mixed-initiative conversations [28].

Bernsen and Dybkjær [4] described HCA, a multimodal conversational agent that interacts with users to discuss Hans Christian Andersen’s works. While all three types of memory functions can be observed in HCA, its episodic memory influenced the conversational agenda by retaining records of the interactions with a given user. It affects HCA’s ability to maintain prolonged dialogs by keeping track of knowledge visited, and preventing repetitive statements [4].

2.4 Memory in Question Answering Characters

Traum and Rickel [51] showed that Steve-based agents are capable of answering questions about the state of their environment. Steve-based agents employ episodic memory as a record-keeping structure. Three virtual question-and-answer (Q&A) agents were presented by Leuski et al. [27] (Sergeant Blackwell), Artstein et al. [2] (Sergeant Star), and Traum et al. [52] (Hassan). All three characters implemented derivations of a similar architecture. These Q&A characters take the form of virtual humans that inhabit virtual worlds and interact with users through some model of dialog management.

2.5 Comparison to Existing Works

The episodic memory architecture used in our research strives to overcome constraints evident in the systems described above. While we do not presume to account for all the requirements for episodic memory architectures in general, our work aims to further develop conversational agents with realistic human-like behavior.

By providing an analytic review of previous research efforts in ECAs above, we demonstrated that some common features exist across several episodic memory implementations. Namely, we can reuse the concepts of a centralized memory, design for decentralized systems, conversation log feature tracking, and domain modeling. Additionally, our architecture attempts to emulate the characteristics of concurrent and decentralized modules advocated by development tools such as Microsoft’s Robotics Developer Studio (RDS) [34].

Unlike Max, Sgt. Blackwell, and HCA, our episodic memory tackles recollection across one or more episode instances. Prior works [24, 51] attempt to establish coherence in a single episode of interaction with users. We argue that a more realistic collaboration between humans and conversational agents can be established by extending the episodic memory through various encounters with the user. To accomplish this, our memory architecture relies on a greater number of interactions with these users. Furthermore, we designed the episodic memory structures to store several types of information and record more details about the interaction than has been accomplished previously. Hence, in regard to storage, it can be considered more generalizable while being more expressive.

3 Our Episodic Memory Architecture for Conversational Agents

In this section, we present our unified episodic memory architecture and prototype. Broadly speaking, memory architectures in ECAs fall into two categories: centralized and decentralized. In centralized memory schemes, a repository contains most, if not all, of the agent’s knowledge. In decentralized approaches, memory is specialized and found embedded within the individual agent components. Between these two extremes, we find a range of hybrid systems that centralize major portions of memory while displaying aspects of decentralization. One example is the Max agent discussed above, which centralizes the episode storage but decentralizes the knowledge acquired.

We derive our memory model from such a hybrid memory architecture. Our model centralizes critical elements of memory, while allowing the system to complement real-time operations with distributed memory. The centralized memory element includes the knowledge acquired by the ECA from conversations. Furthermore, our model supports extensible components.

3.1 System Architecture: The Pipeline Approach

Figure 1 illustrates a stack diagram of the four core components in the episodic memory model as implemented in our architecture. These four modules consist of the memory interfaces, a back-end database, a process of contextualization, and some analysis services. The general purpose of each of these components is described in the next few paragraphs.

Figure 1: Episodic Memory Architecture Stack with Four Core Components.

The memory interfaces occupy the topmost layer of the stack; through them, the architecture responds to requests to recall events that have been contextualized and stored in the database. Our implementation of memory interfaces is in the form of loosely coupled services, which allow communication to occur between the episodic memory and the ECAs that use it. A back-end database running on a server is the storage medium for episodes and the internal processes that manipulate conversational information.

The contextualization process is responsible for managing the input of interaction data, storing them in the episodic memory, indexing the episodes, and deciding which episodes are relevant for retrieval as a result of the conversation. The analysis services are used by the contextualization process to extract information from the dialog. They are inherently extensible in that several custom services can be concurrently deployed to analyze data.
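To make the division of labor concrete, the following C# sketch models the layers as minimal interfaces. All names and signatures here are illustrative assumptions for exposition; they do not correspond to the actual implementation's API.

```csharp
// Illustrative sketch of the architecture's layered roles;
// names and signatures are assumptions, not the system's API.
using System.Collections.Generic;

// Topmost layer: loosely coupled services through which an ECA
// stores utterances and requests recall of stored episodes.
public interface IMemoryInterface
{
    void StoreUtterance(string conversationId, string speaker, string text);
    IList<string> RecallEpisodes(string contextQuery);
}

// Extensible analysis services invoked during contextualization,
// e.g. named entity recognition or key phrase extraction.
public interface IAnalysisService
{
    IDictionary<string, string> Analyze(string utterance);
}

// The contextualization process: stores raw data in the back-end
// database, runs the analysis services, and indexes episodes.
public interface IContextualizer
{
    void Contextualize(string conversationId, string utterance,
                       IEnumerable<IAnalysisService> services);
}
```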

3.2 Capturing the Conversation

Creating conversational memory begins with capturing information from conversations between the ECA and a human. In a multiparty dialog system, users may interact with several client-side tools. A user may interact with our episodic memory in either of two ways, as shown in the sequence diagram of Figure 2. First, he/she may provide sensory input in the form of speech, which will be processed in memory. Alternatively, he/she may request information on previously stored episodes. In either case, access to memory functions occurs through the memory interfaces.

Figure 2: Acquiring Conversational Data.

In the first of the interaction scenarios, a client-side utility acquires raw data from the users – we use speech here. Our episodic memory architecture creates conversational episodes based on transcripts obtained from a speech recognition engine. The contents of these transcripts, as well as the audio data, are uploaded to the back-end database through memory interfaces. We created a simple, graphical user interface (GUI)-based client program that uses Microsoft’s SAPI 5 speech recognition engine to transcribe spoken data. Our interface accepts transcribed text, information identifying the initiating party, binary data associated with the event, and a unique conversation identifier.
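The client-to-memory handoff can be sketched as a single call whose parameters mirror the inputs just listed; the method name and types below are illustrative assumptions, not the published interface.

```csharp
// Illustrative sketch of the upload call from client to memory;
// names and types are assumptions mirroring the inputs described above.
using System;

public interface ITranscriptUpload
{
    void UploadTurn(
        Guid conversationId,  // unique conversation identifier
        string speakerName,   // party that initiated the utterance
        string transcript,    // text from the speech recognition engine
        byte[] binaryData,    // audio or other data tied to the event
        DateTime timestamp);  // time of the utterance
}
```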

Figure 2 depicts two timelines of events that originate from the GUI and the memory interfaces. The processes highlighted, beginning with “Receiving Audio/Video,” pertain to the acquisition of conversational data. By way of example, we can describe the flow using spoken interactions. Events generated by the speech recognition engine are collected into transcripts and uploaded to a centralized memory database system. This system preprocesses the transcripts to identify topics related to the conversation, using tools available on the Internet as well as our own algorithm. The resulting information is parsed into a contextual structure that is accessed at query time and used to present relevant topical information to the user. Topics are deemed relevant based on the frequency with which they appear in conversations.

We next present the process by which our system selects the details of an episode that will be stored into memory. We expand on this database and the contextualization process in the next section.

3.3 Filtering and Storing Conversational Memories

As memory interfaces collect data from spoken conversations, they, in effect, serve as short-term memory buffers for the audio data and transcripts. The contents of these buffers are transmitted to the back-end database server for processing in episodic memory. Once the server receives the data, it performs three primary functions: (i) stores the raw data, (ii) filters and stores information generated through analysis of the episode, and (iii) indexes the raw and generated data. Storing raw data occurs in a straightforward fashion by using a conversation log similar to that used by Bernsen and Dybkjær [4], Traum and Rickel [51], and Traum et al. [52]. As they occur, the utterances in a conversation are recorded in a table that serves as a conversation log.

A second function provided by the back-end server for managing episodic memories concerns the generation, filtering, and storage of episodic information. The information generated by the analysis services identifies key information from the transcripts. We discuss the features selected when describing each of the analysis services. Of importance to our current discussion, however, is the identification of topical information in the form of key phrases.

Two of the services process the conversational utterances by querying online services that have been trained on large textual corpora to identify named entities and categorize text. Named entities consist of textual elements that can be tagged as belonging to an agreed-upon taxonomy of categories such as person, location, monetary value, etc. We use the Yahoo! [57] and OpenCalais [41] application program interfaces (APIs) to perform the named entity recognition. Because these services have been developed by others and have been widely described, we focus instead on a third service that we developed, which identifies key phrases in addition to the entities recognized by Yahoo! and OpenCalais. While these online services can process large volumes of information and extract topics and terms as recognized by their usage in broader news corpora, our algorithm is tailored to capture snippets from short utterances. It captures short, relevant key phrases from conversational utterances and is implemented in five stages:

  1. Service input stage – this service obtains the conversational utterance data from the contextualization process.

  2. Part-of-speech (POS) tagging – we use a Maximum Entropy POS tagger to extract the part of speech for every word in the utterance.

  3. Windowed phrase parsing – phrase groups, consisting of up to seven words, are parsed by using OpenNLP’s Treebank parser [42]. Natural language processing (NLP) parsing assigns a structural description to a sequence of words. In our case, the structure is a tree representation of a phrase’s syntax.

  4. Thresholding of complex clauses – we calculate the depth of each clause in the parsed tree and extract the deepest ones.

  5. Return complex clauses – complex clauses are returned to the contextualization process. This process concludes the key phrase extraction by storing phrases in a table of candidate topics.

The five steps outlined above correspond to the function of certain methods in our system. Please refer to Ref. [12] for details on these methods, including their pseudo code. An example of extracting phrase groups from an utterance follows:

  • Phrase: I’d like to know more about the renewals part of the program.

  • Groups: (i) I’d like to know more about the (ii) the renewals part of the program.

As can be observed from the above phrase, two groups are created from the original phrase. The first window groups the seven leftmost words using a window size of seven. A second window creates a group from the remaining words in the sentence using a window size of six. For each resulting group of words, the method extracts candidate key phrases. This method returns a list of identified key phrases (to be explained shortly). Finally, another method analyzes the collection of key phrases of the original utterance for noun and verb groups and stores these in the XML object to be returned. Another method is tasked with extracting candidate key phrases from windowed word groups. NLP methods are used to preprocess the word groups. By parsing the word groups and finding co-references, a tree structure of the syntax that constitutes the word group is created.
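The windowing step can be sketched as follows. From the worked example above, we assume that adjacent windows share one boundary word (note the repeated “the”); the authors' exact overlap policy is not stated, so this is a reconstruction.

```csharp
// Minimal sketch of windowed phrase grouping, assuming a one-word
// overlap between adjacent windows (inferred from the example above).
using System;
using System.Collections.Generic;

public static class PhraseWindower
{
    public static List<string[]> Window(string utterance, int windowSize = 7)
    {
        var words = utterance.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        var groups = new List<string[]>();
        int start = 0;
        while (start < words.Length)
        {
            int len = Math.Min(windowSize, words.Length - start);
            groups.Add(words[start..(start + len)]);
            if (start + len >= words.Length) break;
            start += windowSize - 1;  // step back one word so windows overlap
        }
        return groups;
    }
}

// Window("I'd like to know more about the renewals part of the program")
// yields the groups "I'd like to know more about the" and
// "the renewals part of the program", matching the example above.
```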

We now describe a case study that will help clarify the implementation and usage of our algorithm, as a prelude to the functional description of the architecture. To further focus this discussion, we provide a sample exchange from a conversation in Table 1. This segment is an extract from a larger episode in which the system acts as the episodic memory for an ECA that converses with a human user and a moderator via synchronous turn taking. By running our algorithm over each of the utterances from the participants, we expect to extract items of relevance to the overall episode. We can deduce from Table 1 that the user shows interest in learning about renewals in the institution’s program that the ECA references at the beginning of the conversation. Hence, we also expect the list of key phrases extracted to include items with the word “renewal”.

Table 1

Term and Topic Extraction Case Study.

SPEAKER_NAME and TIME_STAMP | SPEAKER_PHRASE
ECA – 11/3/2009 14:08 | I think I misheard you. Folks like to talk to me about renewals or memberships. Is there anything you want to know about the program?
User – 11/3/2009 14:08 | I’d like to know more about the renewals part of the program
ECA – 11/3/2009 14:08 | You can send your enquiries to Doctor John Smith through e-mail. Be as specific as possible to ensure an efficient and proper response, or call at 703 292 8492. There is some useful information on the screen next to me. Please let me know when you’re finished
User – 11/3/2009 14:08 | Okay

To accomplish the extraction of terms from Table 1, we adapt certain NLP techniques to unstructured, shared-initiative conversations as follows. First, we perform POS tagging over each utterance received. In structured text, tagging might occur at the sentence level. However, speech recognition engines do not provide grammatical punctuation. Instead, utterances appear to behave in a stream-of-thought fashion similar to long sentences. Maximum-Entropy POS taggers, such as that developed by the OpenNLP group [42], which estimate probability distributions given a set of constraints, may perform erratically over longer transcript sentences because of the increased likelihood of a recognition error. Liu et al. [31] demonstrate that Maximum Entropy classifier performance degrades with increased errors in speech recognition output, which is likely to happen over longer and more complex speech. In view of this, we adopt a windowing technique through which we tag and analyze user utterances over five- to seven-word window segments.

As a second step, our term extraction algorithm performs phrase parsing over the windows using OpenNLP’s statistical parser. The reader should note that during parsing, the sentence or phrase is analyzed to determine the constituent groupings of the sentence (i.e. noun groups, verb groups, etc.). These groupings, in turn, may be represented in a tree-like structure.

During the third step, our system creates a structure representative of the tree. From the example of Table 1, the tree results from processing the user’s request for “renewal” information. The target key phrases of the sentence that we wish to extract are found in the tree at a position where the complexity increases. The complexity of a clause in a sentence is determined by the depth in the tree of the grouping in which it is contained.

We now proceed to the final stage of our algorithm. In this stage, the parser’s output is used to determine the depths of all nodes within the tree. Nodes with a complexity level above a threshold are retained. In other words, we place a lower limit on the structural complexity level of the clauses that are considered for term extraction. We expect that the complexity of the clauses will support the notion that the terms extracted are items of significant relevance to the conversation. Our approach to extracting relevant terms can be compared to the more complex fact extraction algorithms presented by Pasca et al. [44]. While we exploit both POS patterns and structural complexity, Pasca et al. use only the POS tags to generalize patterns for fact extraction, generating an initially low-quality set of candidate facts. In our implementation, however, we analyze patterns and complexity at the clause level and do not generalize patterns because the dataset available per conversation is limited.
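The thresholding stage can be sketched over a generic tree type as follows; the actual implementation walks the output of OpenNLP's Treebank parser, so the node type here is an illustrative stand-in.

```csharp
// Simplified sketch of depth-based clause thresholding; ParseNode is
// an assumed stand-in for the parser's tree output.
using System.Collections.Generic;

public class ParseNode
{
    public string Label = "";                 // e.g. S, NP, VP
    public string Text = "";                  // span of words covered
    public List<ParseNode> Children = new();
}

public static class ClauseExtractor
{
    // Retain clauses whose depth in the tree exceeds the threshold;
    // deeper clauses are treated as the structurally complex ones.
    public static List<string> ComplexClauses(ParseNode root, int threshold)
    {
        var results = new List<string>();
        Walk(root, 0, threshold, results);
        return results;
    }

    private static void Walk(ParseNode node, int depth, int threshold,
                             List<string> results)
    {
        if (depth > threshold && node.Children.Count > 0)
            results.Add(node.Text);
        foreach (var child in node.Children)
            Walk(child, depth + 1, threshold, results);
    }
}
```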

The analysis service that runs our algorithm generates a set of key phrases for each utterance that are returned to the contextualization process. As part of this process, the resulting phrases are stored in a table of candidate topics. The table maintains an index for the records based on a conversation identifier, speaker, and time to facilitate future retrieval.

A summary listing of the topics occurring in an episode can be obtained by filtering the table that stores the candidate topics. Filtering occurs by processing the topics in two steps as follows:

  1. Linguistic filter – first, a filter is applied on the table to return key phrases that contain one or more noun or verb groups, excluding prepositional phrases, single adjectives, or single adverbs.

  2. Ranking – these groups are then ranked by the number of times they occur in a conversation. While this ranking method uses only a simple frequency heuristic, using a linguistic filter, such as our clause extraction algorithm, in combination with frequency ranking has been shown to be effective in similar studies [14, 15, 21] (a minimal sketch of both steps follows this list).
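The two-step filter can be sketched as follows; the isNounOrVerbGroup predicate is an assumed stand-in for the linguistic filter, whose actual implementation relies on the POS and parse features described above.

```csharp
// Minimal sketch of the two-step topic filter; the linguistic filter
// is abstracted as a caller-supplied predicate.
using System;
using System.Collections.Generic;
using System.Linq;

public static class TopicRanker
{
    public static List<(string Phrase, int Count)> Rank(
        IEnumerable<string> candidatePhrases,
        Func<string, bool> isNounOrVerbGroup)
    {
        return candidatePhrases
            .Where(isNounOrVerbGroup)               // step 1: linguistic filter
            .GroupBy(p => p.ToLowerInvariant())
            .Select(g => (Phrase: g.Key, Count: g.Count()))
            .OrderByDescending(t => t.Count)        // step 2: frequency ranking
            .ToList();
    }
}
```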

It may be possible to produce a ranking by using more complex measures of interrelationships between extracted key phrases. For instance, Pointwise Mutual Information (PMI) [9] provides a well-known measure for determining word collocations. Word collocation refers to sequences of words, such as key phrases, with group-based syntactic and semantic characteristics whose meaning derives from the grouping rather than from the individual words. Additionally, PMI may be used as a mechanism to gauge how much one word or group of words is related to another. However, PMI may perform badly in a sparse environment for which relatively few words or word collocations are available, and it shows bias toward infrequent terms [43]. Such a scenario can occur, for instance, in short conversations.
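For reference, the standard PMI formulation for a word pair is sketched below; as noted above, our implementation does not use PMI, so this is included only to make the discussion concrete.

```csharp
// Standard Pointwise Mutual Information for a word pair:
//   PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
// For reference only; not part of our implementation.
using System;

public static class Collocation
{
    public static double Pmi(int countXY, int countX, int countY, int totalTokens)
    {
        double pXY = (double)countXY / totalTokens;
        double pX  = (double)countX  / totalTokens;
        double pY  = (double)countY  / totalTokens;
        // For rare words, small counts inflate PMI, which is the bias
        // toward infrequent terms noted above [43].
        return Math.Log2(pXY / (pX * pY));
    }
}
```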

3.4 Episodic Memory Structures

With respect to the theory behind our architectural design, we now analyze the individual components with a high-level discussion of the memory structures. A definition of the requirements, rationale, motivation, and merits of these structures helps differentiate our system from previous implementations by other authors. Furthermore, it provides a sound basis on which to discuss memory representation in the episodic memory space.

3.4.1 Dialog Tables

A critical component of our episodic database consists of the tables that store autobiographic data. These tables provide the means for maintaining chronological and contextual coherence for a subscribing agent. As such, they contain the following:

  • A list of encountered users (T_USERS);

  • A conversation log (T_DIALOGUE);

  • A storage medium for complex binary data and user actions such as audio, video, images, etc. (T_DIALOGUE_DATA);

  • A repository of episode segments and features (T_DIALOGUE_CONTEXT);

  • A listed summary of topics in memory (T_DIALOGUE_TOPICS).

The sections that follow provide explanations for the implementation and purpose of each dialog table. Two additional sections explain the relationships between dialog tables, and pursue a discussion on two approaches to modeling memory.
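As a compact preview, the following C# records sketch one plausible row layout per table. The field names and types are assumptions drawn from the descriptions in this section, not a published schema.

```csharp
// Illustrative row layouts for the five dialog tables; fields are
// assumptions based on the descriptions in this section.
using System;

public record UserRecord(                      // T_USERS
    int UserId, string Moniker, string UserModelXml);

public record DialogueRecord(                  // T_DIALOGUE
    Guid ConversationId, int UserId, DateTime Timestamp,
    string Utterance, int ContextId, int TopicId);

public record DialogueDataRecord(              // T_DIALOGUE_DATA
    Guid ConversationId, DateTime Timestamp,
    byte[] BinaryData, string MediaType);

public record DialogueContextRecord(           // T_DIALOGUE_CONTEXT
    int ContextId, Guid ConversationId,
    string SpeakerName, string ContextXml);

public record DialogueTopicRecord(             // T_DIALOGUE_TOPICS
    int TopicId, Guid ConversationId,
    DateTime Timestamp, string Phrase, int Frequency);
```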

3.4.2 User Information

The most trivial of these tables, T_USERS, contains the list of known users. It contains the moniker used to identify a user, such as a name, as well as the user model. For each user, a user model is formed and is composed of two data structures: the data and the object. A user model’s data follow the definition of a user model provided by Kopp et al. [24]:

“A user model contains all information that is gained throughout the dialogue. This includes information about the user (name, age, place of residence, etc.), his/her preferences and interests (determined by topics the user selected or rejected), and his/her previous behavior (cooperativeness, satisfaction, etc.).”

3.4.3 Conversation Log

A conversational log forms part of our episodic memory system and provides several characteristics commonly observed in ECAs as described earlier. It identifies the initiator of the turn, the spoken phrase, an index into the table of episode contexts, an index into the list of relevant topics, the duration of the phrase, NLP features, and a description of the source from which the turn data were obtained. The T_DIALOGUE table maintains one entry per user utterance. Readers may relate this approach to that of capturing discourse acts and creating Kopp et al.’s [24] discourse model.

Our conversational log contains most of the information attributed to the Max system, with the exception of goal- or planning-oriented fields. As noted earlier in this paper, the implementation of conversational logs in ECAs often results in many characteristics shared across different systems. In our architecture, those common practices are observed, but we also extend the reach of the conversational log by providing indices into the remaining table-based structures of our episodic memory. These indices uniquely identify the conversation to which an utterance belongs, the speaker, the time of the event, and related metadata. We can thus track utterances by their binary data and semantic information as provided by context or topical contribution. The reverse relation can also be obtained by simple manipulation.

3.4.4 Contextualized Records

A contextual process in our episodic architecture maintains a repository of episode segments and features in the T_DIALOGUE_CONTEXT table. This table contributes a fundamental component to our architecture, the dynamic context. Within this scope, we refer to a conversational context as the set of topics suggested by the utterances of all parties involved in the dialog. Moreover, we specify a dynamic context to be an abstract construct with a predefined structure, but whose possible range of attributes is not known a priori. The sections to follow discuss the storage medium and the members of this structure, as well as the role of the dynamic structure in maintaining conversational memory.

Three fields in the T_DIALOGUE_CONTEXT table serve as indices to link the various structures in the episodic architecture. The remaining fields identify the speaker name, unique identifier for the context, and the context metadata, respectively.

Because the dynamic context depends on the existence of an utterance or factoid entry in the dialog table to trigger its creation, its structure is designed to link back to the dialog table. The user ID, conversation ID, and timestamp are recorded within the structure. All information is stored in XML in a prescribed format. To cope with the loose and unpredictable dimensionality of conversation from user to user or context to context, we devised the context structure to fit a variable amount of information. At the highest level, the context structure relies on three previously defined tags: a resources list, a set of attributes, and a set of production rule descriptions.

Of relevance to our discussion are the three primary tags that together form the context structure. These consist of

  • <resources/>

  • <context_attributes/>

  • <production_rules/>

Under the resources tag, we provide a listing of features obtained during the transcript collection. Even though these features can normally be found in other tables within the database, they are included in the context structure to maintain the atomic property of episode segments. By this, we suggest that each episode segment contains all necessary and sufficient information to provide a relevant contribution to the episodic memory. Each of the resources listed can potentially provide features for analysis in processes that operate over the context structure. These include, for instance, the utterance, POS information, phrase-parsing result, and template matches for the user model. Functional definitions for the latter three features were provided earlier in the section within the context of NLP.

The context attributes tag contains the result of analysis services performed by the episodic memory pipeline. We point out that the sources of these results remain invisible to the context structure. In other words, from the point of view of the structure itself, it is not relevant whether the results are provided by a service that is internal or external to the architecture. This reinforces our notion that the dynamic context attributes are not known a priori. Other processes provided by the architecture, or contributed by end-user defined services in the ECA, maintain responsibility for utilizing the context attribute results in a coherent fashion. Examples of services that may provide results to the context attributes include domain classifiers and web crawlers. These may serve the purpose of categorizing utterances by topic or expanding the information known about topics by searching through web content related to user statements.

A final tag, for production rules, concedes space for reasoning mechanisms. The listing inside the tag may be exploited by cognitive architectures, the episodic memory, or dialog managers. Concisely stated, it provides space for services that analyze the context attributes and resources to record the names of production rules to execute as a result of new episode segments. It may be worth noting that the Max agent in Kopp et al. [24] maintains a planner and is inherently goal oriented. We do not explicitly encode a goal-oriented structure in our episodic memory, but rather provide space for such a structure. While our system also encodes certain internal rules in this space, its intended purpose should be to engage processes in cognitive architectures or dialog managers by providing a whiteboard common to disjoint systems.
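To make the structure concrete, the following sketch assembles a context segment around the three prescribed tags. Only the top-level tags are prescribed by the format; the child elements under resources are illustrative examples.

```csharp
// Illustrative construction of a dynamic context segment around the
// three prescribed tags; child elements are examples only.
using System.Xml.Linq;

public static class ContextSegment
{
    public static XElement Build(string utterance, string posTags)
    {
        return new XElement("context",
            new XElement("resources",            // features from transcript collection
                new XElement("utterance", utterance),
                new XElement("pos", posTags)),
            new XElement("context_attributes"),  // filled by analysis services
            new XElement("production_rules"));   // whiteboard for rule names
    }
}
```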

The entries for dynamic context structures maintained in T_DIALOGUE_CONTEXT form part of the episodes. Each episode can be identified by its conversation index. As is noted later, an episode segment occurs from the analysis of each dynamic context entry. In fact, the dynamic context structure plays a fundamental role in the contributions of this research. Many of the analysis services, protocols, and retrieval mechanisms discussed later in this section directly or indirectly utilize information from the episodes (or collection of dynamic contexts per conversation) to produce results.

3.4.5 Dialog Topics

A final comment on the episodic memory tables concerns the tracking of dialogue topics or domains. Earlier in the section, we described the steps that our key phrase extraction algorithm performs to detect candidate topics. This section, hence, addresses the storage and tracking of those topics. We derive the notion of such tracking from various domain-oriented agents. In particular, HCA’s conversational history [5] provides a well-defined structure to accomplish this task based on constructing tree structures and recording the last domain and topic visited. Unlike their implementation, we deviate from tree structures because we do not assume a well-defined domain. Instead, as the topics are visited, they are recorded into a table of topics. We expect that this will cause a loss of information about the relationship between the individual topics as topical phrases are extracted from their source utterances. Thus, supporting clauses may be discarded and semantic relationships overlooked. To compensate for this, we include indices and timestamps that order the topics as they appear. In this fashion, we provide a means for the construction of structured trees or ontologies, should the need arise.

3.4.6 Using the Relationships between Dialog Tables for Gisting

The following paragraphs explain the manner by which the table relationships contribute to the process of “gisting” conversations.

The value of our tables, relationships, and topic tracking algorithms can be assessed in a conversational setting. In such a setting, it is often important to determine the context of a conversation as a whole, because various speakers, including the ECA, may contribute to the direction of the conversation. Therefore, we established procedures at the server level that an agent or dialog manager may call to obtain the gist of a conversation quickly. By gist, we refer to the set of prevalent topics and possibly a prediction of related or future topics. Simply stated, the gist contains only the list of active contexts given by a determination heuristic such as topic frequency, classification, or otherwise. We have previously explained the algorithm utilized to extract relevant topics, which serve as the active contexts in our architecture. Although gisting does not specify directly what the agent’s next statement should be, it helps to narrow down the scope of what it can be. As a result, the set of factoids returned is a shallow source of information. Our intention in equipping the episodic memory architecture with the ability to gist a conversation is to emulate the natural mental “note taking” a human being might undertake when engaging in daily conversations.

Because of the indexing and grouping choices previously mentioned for each of the dialog tables, we are able to collect contexts not only for active conversations, but also through a broader temporal and social span. For instance, we can merge information about the contexts of previous conversations with a particular user to a live conversation with the same person. If we extrapolate this even further, the social or group memory for a set of related individuals can be synthesized into a restricted collective pool of information available in the same form of the gist results set. To our knowledge, this form of collective analysis has not been performed at the group level in previous dialog systems that incorporate memory. It may provide useful insights about multiparty social interactions as a noisy channel for dynamic context building and about the value of providing users with possibly important information for decision making. Whether the knowledge should degrade over time (i.e. forgetting) or not may also be a point of future research. For the system summarized in the preceding section, we restrict ourselves to a perfect memory model. All information remains in the system unless the system filters it out as noise. The following sections detail the manner through which episodic information is retrieved and presented after being gisted.

3.5 Retrieving Episodic Memories

Before episodes become useful to any agent in need of recalling previous experiences, our episodic memory architecture must first be able to select and rank the stored information by its relevance to the particular context. We emphasize the distinction between episode selection and ranking. Selection of episodes calls for browsing stored memories to form a candidate pool of episodes when a request is received by the episodic memory architecture. To support the real-time operation of an ECA, this process must also return results in real time. The selected episodes should form a considerably smaller set than the total episodic memory space. In addition to selecting the episodes, the architecture must assign ranking or confidence levels to each episode as it prepares the set of selected episodes for presentation.

Our implementation utilizes simple mechanisms of episode selection and ranking. The selection of episodes may occur through one of two criteria: (i) temporal vicinity or (ii) contextual relevance. Selection by temporal vicinity arranges episodes deemed relevant to a context by their position in the sequence of events. The reader may visualize, for instance, a request for episodes based on temporal vicinity as a call for episodes that require playback. As a case study, if a user should need information about the presidential elections, then the architecture would first find episodes with relevant information. It then ranks these based on their chronological occurrence in order for the user to receive a temporally ordered view of events. In contrast to the sequential arrangement, our second selection mechanism introduces ranking by contextual relevance. Contextual relevance exploits information gathered as an agent interacts with human(s) to reason about which episodes should be selected for recall. Ranking in searches that perform selection by contextual relevance assigns relevance to each episode based on the similarities between recalled episodes and the current interaction with the user. The frequency-based algorithm for topic selection that was previously introduced currently serves as the ranking mechanism for contextual relevance.
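The two criteria can be sketched as follows; the Episode type and its fields are illustrative assumptions, and topic overlap stands in for the frequency-based relevance measure described above.

```csharp
// Minimal sketch of the two selection/ranking criteria; Episode and
// its fields are assumptions for illustration.
using System;
using System.Collections.Generic;
using System.Linq;

public record Episode(Guid ConversationId, DateTime Timestamp,
                      ISet<string> Topics);

public static class EpisodeRetrieval
{
    // (i) Temporal vicinity: relevant episodes in chronological order.
    public static List<Episode> ByTemporalVicinity(
        IEnumerable<Episode> candidates) =>
        candidates.OrderBy(e => e.Timestamp).ToList();

    // (ii) Contextual relevance: rank by overlap between an episode's
    // topics and the topics active in the current interaction.
    public static List<Episode> ByContextualRelevance(
        IEnumerable<Episode> candidates, ISet<string> activeTopics) =>
        candidates
            .OrderByDescending(e => e.Topics.Intersect(activeTopics).Count())
            .ToList();
}
```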

3.6 Accessing Episodic Memory

In order for an ECA to request information about an episode and trigger the retrieval processes, it must possess the means of communicating with the episodic memory server. Previously, we defined the episodic memory structures and selection methods without considering how the episodes should be accessed. Accessing episodic memory from our architecture occurs through simple public interfaces into the memory.

Our work approximates communication and access models between memory and system components in a similar manner to well-known tools for developing large, complex systems. As a case in point, we allude to Microsoft’s RDS. Microsoft’s RDS permits developers and hobbyists to “create robotic applications across a wide variety of hardware” [34]. It facilitates communication across components by enabling the user to build modules that are loosely coupled [32, 33]. These loosely coupled components can be developed independently and make minimal assumptions about each other or their runtime environment [42].

Following the communication model of Microsoft RDS, our memory-access interfaces instantiate episodic memory as a service application that uses techniques commonly employed in RDS. However, RDS does not incorporate an episodic memory model on its own. Episodes can be communicated through message-passing mechanisms. More importantly, we attempt to address the requirements in Ref. [33] for Decentralized Software Services (DSS) modules. From the perspective of our memory architecture, DSS modules promote robustness, composability, and observability within complex systems, as follows:

  • Robustness – protocols isolate the memory component from the remainder of the system and thus aid in limiting the impact of partial failures arising from episodic memory faults.

  • Composability – episodic memory as a service appears to be “created” and “wired up” at runtime based on the system’s needs.

  • Observability – it is possible, through our protocols, to determine what the memory is doing, what state it currently holds, and how it arrived at that state.

The episodic memory client is tasked with three responsibilities while in the presence of a dialog: collect user speech and deliver it for storage, recognize user interjections or interruptions, and display episode summaries by retrieving episodic information. Several elements are contained in the structure delivered by the memory interfaces for episode summaries. These elements are as follows:

  • episodeGuid – the unique identifier of the conversation being summarized.

  • participants – the names of the participants or individuals identified in the conversation.

  • timestamp – the time at which the conversation was last updated.

  • text – a textual summary of topics relevant to the conversation. At this moment, the summary is provided in a straightforward fashion that identifies the high-level topics first, followed by supporting keywords.

  • imagepreviews – a listing of URLs for images that may be representative of themes in the conversation. This aspect is included for demonstrative purposes only and may form part of a future expansion.

Through the data provided by the episode summary interfaces, we populate a chronological view of the dialogs in which the ECA has participated.
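Transcribed into a C# record for illustration, the summary structure delivered by the memory interfaces might look as follows; the field types are assumptions, while the field names follow the list above.

```csharp
// Episode summary structure from the field list above; types are
// assumptions for illustration.
using System;
using System.Collections.Generic;

public record EpisodeSummary(
    Guid EpisodeGuid,            // conversation being summarized
    IList<string> Participants,  // participants identified in the conversation
    DateTime Timestamp,          // time of the last update
    string Text,                 // high-level topics first, then keywords
    IList<Uri> ImagePreviews);   // demonstrative image URLs
```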

Although in the preceding paragraphs we alluded to analysis services, we deferred listing them. In the following section, we complete our exposition of the analysis services exploited by the contextualization process.

3.7 Analysis Services

The contextualization process exploits two principal sets of modules: contextualizer methods and analysis services. Contextualizer methods act as interfaces between the database and analysis services. The database itself interacts with the contextualizer in two ways: it provides notifications of dialog events, which begin the contextualization process when new user input is received, and it stores the resulting context segments. Under normal circumstances, the complexity of the operations occurring while the contextualizer awaits analysis service results does not hinder the interaction between the memory interfaces and the database. These interactions occur asynchronously. In other words, the user may continue interacting with the ECA while the back-end database services perform contextualization for a recent bulk of statements.

Analysis services provide the contents of the context attributes tags. In our architecture, we demonstrate the use of up to five services per context segment. Briefly stated, these services stem from several toolkits and APIs:

  • SharpNLP – C#-based API used to extract NLP features. It performs tagging, chunking, and parsing of sentences based on a Maximum Entropy algorithm trained on the English Treebank corpus [31].

  • AIML – C# implementation of an Artificial Intelligence Markup Language chatterbot based on Program#. It primarily performs template matching on new utterances [50].

  • Weka – Java-based API that implements many common classifier algorithms [55].

  • OpenCalais – web service that automatically creates semantic annotations for unstructured text based on NLP, machine learning, and other methods [40].

  • Yahoo! Search – web service to search web content including news, site, image, and local search content [57].

Only the first of these services above does not provide results to the context attributes tag. Because NLP features are extracted at an earlier stage than that which builds the context attributes, these features are included as part of the resources tag. The remaining four services provide user-model information, domain classification, semantic tagging, and web expansion.

As a variety of user statements can take the form of factoids, we employ an agent based on AIML to identify information in these statements. Program#, our API of choice for this task, can process >30,000 categories in under a second according to Tollervey [50]. Additionally, we can readily expand the AIML domain coverage by augmenting it with template documents. We exploit the template-matching methods to recognize factoids but do not require that subscribing agents act on the suggested response. Instead, the intended benefit of AIML template matching is the ease of updating user models and recognizing factoids.
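As an illustration of the kind of template matching involved, the following AIML category (our own example, not one of the deployed templates) would capture a factoid such as the flight time from the example in Section 1.

```csharp
// Our own illustrative AIML category, embedded as a string; Program#
// loads such categories from AIML documents. The wildcard binding
// (e.g. "4:00 this afternoon") can be stored as a factoid.
public static class AimlExample
{
    public const string Category = @"
<category>
  <pattern>MY FLIGHT HOME IS AT *</pattern>
  <template>
    Noted, your flight home is at <star/>.
  </template>
</category>";
}
```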

We implemented a Weka-based service that accepts transcripts as inputs and attempts to classify the type of statement into domain-specific categories. Essentially, it performs analysis for domain-specific applications. In our architecture prototype, we demonstrate the use of domain-specific classifiers by targeting a topic particular to the interests of the US National Science Foundation. The approach of mapping user utterances to likely ECA responses by using domain classification resembles that of Sgt. Blackwell [27].

The remaining two services require an active Internet connection to analyze data. OpenCalais, the first of the external services, performs textual analysis and semantic annotation of the dialogue in progress. OpenCalais [40] identifies a list of 38 entity types and 69 event or fact types that it can recognize. Moreover, it allows us to perform document classification over a range of episode segments in order to classify conversations by their broader topics. Yahoo! Search performs a similar analysis as OpenCalais and identifies certain named entities or key phrases in unstructured text. In addition to this service, it permits us to expand the text and resources available for context analysis by providing an API to perform mining of web documents and multimedia.

4 Testing and Evaluation of our Episodic Memory Architecture

Two experiments were performed that incrementally challenge the architecture’s capabilities and evaluate our hypotheses. While our memory system is designed to work with speech as the medium of choice, to avoid the complication of introducing errors with a speech recognition engine, all interactions in these experiments were done via text.

4.1 Experiment #1 – Passive Participation

The first set of tests measures the relevance to human users of the phrases gisted by our algorithm. We devised a test in which the episodic memory system and human judges act as passive observers of two different prerecorded conversations between an interviewer and an interviewee, both human. The episodic memory system “listens” to the conversations and generates key phrases representing the gist of each conversation. Textual input and output were used. A GUI later displays the video to nine human judges and presents the topics identified as relevant by the episodic memory system, alongside a Likert scale for each judge to independently assess the relevance of these topics vis-à-vis the video he/she just saw. The human judges thus describe the level of agreement they find with the key phrases generated by the episodic memory model. Zero (0) on the scale denotes a total lack of relationship between the phrases and the video; 10 signifies complete agreement.

To minimize bias, a set of unrelated phrases was included among the topics selected by the episodic memory system; these phrases were expected to be rated by the judges as having low relevance. At the beginning of each interview, the subject is asked his/her name, followed by a series of background questions. The interviewer is responsible for introducing the topics of conversation. Occasionally, however, the interviewee expands the responses by adding personal experiences to the conversation. Table 2 shows the key phrases detected (gisted) by our episodic memory system.

Table 2

Topical Phrases Extracted by the Episodic Memory System from Interviews.

Key phrases

Video clip 1      Video clip 2
The family car    The computer
The states        Two sisters
Your interests

A total of 202 phrases were presented to the nine test subjects as candidate topics. Of these, 54 phrases corresponded to phrases from the topics gisted by our algorithm (six to each subject), while 148 corresponded to phrases unrelated to the video clips (the noise set).

A statistical summary of the results for the topical phrases is shown in Table 3. As the mean (5.43) suggests, the human judges did not generally perceive a strong relationship between the key phrases gisted by the episodic memory system and the videos. The range (10) and the large standard deviation (3.75) support this assertion.

Table 3

Statistics for Topical Phrases of User-Given Values.

User ID   Topics   Mean   Std. dev.   Min   Max
1         6        8.17   4.02        0     10
2         6        6.17   4.31        1     10
3         6        3.67   2.88        1     8
4         6        4.67   3.88        0     9
5         6        2.83   2.64        0     7
6         6        8.50   1.76        6     10
7         6        4.00   2.10        2     8
8         6        2.50   1.76        1     5
9         6        8.33   4.08        0     10
Summary   54       5.43   3.75        0     10

Table 4 summarizes the same statistics for the noise set. The mean value of the noise set (1.43) is substantially lower than the topical mean (5.43), as expected, indicating that the judges correctly saw these phrases as irrelevant.

Table 4

Statistics for Noise Phrases of User-Given Values.

User ID   Topics   Mean   Std. dev.   Min   Max
1         19       2.05   3.85        0     10
2         17       1.18   1.29        0     4
3         20       0.20   0.89        0     4
4         16       0.69   1.74        0     6
5         16       0.13   0.34        0     1
6         16       3.31   3.11        0     10
7         14       0.43   1.16        0     4
8         16       1.13   1.54        0     5
9         14       4.21   4.63        0     10
Summary   148      1.43   2.72        0     10
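The summary rows of Tables 3 and 4 can be reproduced from the per-judge rows. Because each judge rated a different number of phrases, the overall mean is the topic-count-weighted average of the per-judge means; the short check below uses the values exactly as printed in the tables.

```python
# Recompute the summary means of Tables 3 and 4 as weighted averages of the
# per-judge means (weights = number of phrases each judge rated).
topical = [(6, 8.17), (6, 6.17), (6, 3.67), (6, 4.67), (6, 2.83),
           (6, 8.50), (6, 4.00), (6, 2.50), (6, 8.33)]
noise   = [(19, 2.05), (17, 1.18), (20, 0.20), (16, 0.69), (16, 0.13),
           (16, 3.31), (14, 0.43), (16, 1.13), (14, 4.21)]

def weighted_mean(rows):
    total = sum(n for n, _ in rows)
    return sum(n * m for n, m in rows) / total

print(round(weighted_mean(topical), 2))  # 5.43, matching Table 3
print(round(weighted_mean(noise), 2))    # 1.43, matching Table 4
```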

The mean of the responses from the judges for topical phrases indicates that they are (slightly) biased toward 10, while the responses of the noise set are (strongly) biased toward 0. We use these biases to our advantage by further grouping our data and comparing it to similar works.

Other studies, such as those of Kanayama and Nasukawa [22] and Ku et al. [25], have employed human judges (annotators) as the gold standard for rating opinions or topics extracted from textual corpora. Ku et al. [25] demonstrate that inter-annotator agreement is about 54% at word-level extraction and 52% at sentence-level extraction from web blogs.

We compare our results to those of Ku et al. [25] to see whether the agreement between the judges and the gisting algorithm resembles the levels obtained from inter-annotator evaluations. To do so, we categorize the user responses into two segments: mostly agree and less likely to agree. The mostly agree category accumulates all user responses with values >5; the less likely to agree category accumulates responses ≤5. Using the Topics set, we obtain 51.9% agreement between the human judges and the gisting algorithm. Our algorithm therefore performs at a level comparable to that of human annotators in the context of these interviews, consistent with the results of Ku et al. [25].
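The binning just described is straightforward to express in code. The sketch below shows the categorization rule and the resulting agreement fraction; the sample ratings are invented, since the raw per-phrase responses are not reproduced here (on the real Topics set, this fraction came out to 51.9%).

```python
# Bin judge ratings (0-10) into the two categories used above and compute
# the fraction of "mostly agree" responses. The sample ratings are invented.
def agreement_rate(ratings, threshold=5):
    mostly_agree = [r for r in ratings if r > threshold]
    return len(mostly_agree) / len(ratings)

sample_ratings = [8, 6, 3, 9, 2, 7, 5, 10, 1, 6]
print(f"{agreement_rate(sample_ratings):.1%}")  # 60.0% on this invented sample
```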

4.2 Experiment #2 – Episodic Memory Integrated into ECA

This experiment sought to evaluate the episodic memory system exactly as we intend it to be used – integrated into an ECA. Figure 3 shows the graphic representation of the ECA used, which we refer to as the Lifelike Avatar [16]. The experiment compares the capabilities of the Lifelike avatar with the episodic memory system incorporated within it to those of the same avatar without memory. This was done by designing a set of scripts – sequences of questions that a user might ask. We constructed a total of six scripts addressing two different knowledge bases. Three of the six scripts addressed a knowledge base about planets – we labeled these scripts P-1, P-2, and P-3. The other three addressed a knowledge base about natural disasters: N-1, N-2, and N-3. Each script contained either six or seven questions, and each of the three scripts for a given knowledge base was designed to test a different aspect of the ECA’s performance. Textual input and output were used to avoid the errors introduced by automated speech recognition engines.

Figure 3: The Lifelike Avatar.

The first scripts (P-1 and N-1) asked straightforward questions that even the memory-less ECA should be able to answer correctly. Each question was entirely self-contained; it was not necessary to infer its meaning from previous context in the conversation. These were control questions used to ensure that our ECA with episodic memory preserved all the functionality of the original (memory-less) Lifelike avatar.

The second scripts (P-2 and N-2) contained indirectly phrased questions that required the use of memory. These scripts focused on a single topic, such as whether various planets have water. Some of the questions were imprecisely phrased (e.g. “How about Mercury?” after asking whether there is water on Mars, as opposed to “Is there water on Mercury?”), requiring the ECA to remember the thread of the conversation.

The last set of scripts (P-3 and N-3) was the most challenging. In these scripts, several questions were again phrased indirectly, but instead of staying on a single topic, the topic changed partway through the script. The ECA had to detect the shift in context and respond accordingly. The last question of each script was completely outside the ECA’s knowledge base (“Do you have any friends that are avatars?”); the ECA should recognize that it does not know the answer and respond accordingly. The scripts also contained questions repeated at different points in the conversation; the ECA should indicate in some way that it was aware of the repetition.

We analyze the results by placing each question into a category describing the skill it was designed to test. The categories are as follows (a minimal scoring sketch in Python follows the list):

  • Direct: the question is entirely self-contained.

  • Reference: interpreting the question requires memory of previous events in the conversation.

  • Repeat: the question was asked previously in the script. These questions were also counted under another category that measures whether the content of the ECA’s answer is correct; the repeat category accounts only for awareness of the repetition.

  • Unknown: the question is outside the scope of the knowledge base. The ECA’s answer should indicate that it does not know the answer.
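As referenced above, the following is a minimal sketch of how such category-based scoring could be tallied. The question objects, categories, and correctness flags stand in for the testers' manual correct/incorrect judgments; this is an illustration, not the actual harness used in the experiment.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class Question:
    text: str
    category: str   # "direct", "reference", "repeat", or "unknown"
    correct: bool   # the testers' manual judgment of the ECA's answer

def score(questions):
    """Tally correct answers per category and overall, as in Table 5."""
    total, correct = Counter(), Counter()
    for q in questions:
        total[q.category] += 1
        if q.correct:
            correct[q.category] += 1
    overall = sum(correct.values()) / sum(total.values())
    return total, correct, overall

# Invented mini-script for illustration only:
results = [
    Question("Is there water on Mars?", "direct", True),
    Question("How about Mercury?", "reference", True),
    Question("Is there water on Mars?", "repeat", True),
    Question("Do you have any friends that are avatars?", "unknown", False),
]
_, _, overall = score(results)
print(f"{overall:.0%}")  # 75% on this invented mini-script
```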

The answers had to be correct. Given that the questions were relatively simple and unambiguous (“Is there water on Mars?”), the answers were judged as correct or incorrect by the testers themselves, based on the contents of the knowledge base. Furthermore, if a question was repeated, the ECA had to respond in a way that recognized that it had already been answered (“I already told you that Mars has no water”). We define the performance of each ECA (with and without episodic memory) as the total number of questions answered correctly, summed across all categories, divided by the total number of questions. The breakdown by category and the total performance are shown in Table 5.

Table 5

Comparison of Performance of ECA with and without Memory.

Category        Total number of questions   Correct answers by ECA with memory   Correct answers by ECA without memory
Direct          21                          21                                   21
Reference       10                          10                                   0
Repeated        10                          10                                   0
Unknown         6                           3                                    6
Total correct   47                          44                                   27
Percentage                                  93%                                  62.7%

The overall performance of the ECA equipped with our episodic memory was substantially superior to that of the ECA without memory. The results show that the addition of our memory architecture preserves the strength of the original Lifelike avatar – its ability to correctly parse direct, self-contained queries: both ECAs achieved a perfect score in this category. The advantage of the ECA with our memory architecture comes in the categories that require context from elsewhere in the conversation. Here, the ECA with memory correctly answered every question, while the one without memory answered none.

Interestingly, we did observe a performance decrease for the ECA with memory when asked unknown questions. All three failures occurred with the natural disasters knowledge base. Essentially, the additional information to which the memory-equipped ECA had access led it to find false positives: it thought the query actually matched some answer in its knowledge base. This issue is particular to that knowledge base and the phrasing used; it never occurred in the planets knowledge base and did not occur when a different (but equally unknown) question was asked. Eliminating false positives generated by the spurious use of context is an area for future work.

5 Conclusions and Future Work

As the experiments in the previous section show, we succeeded in devising an algorithm that extracts important sentence-level information from a conversation and that, when presented to human judges, elicits a weak level of agreement. We also showed that an ECA with memory answers questions more effectively than the same ECA without memory.

We note again that our architecture does not incorporate a forgetting mechanism. While forgetting would arguably add naturalness to the conversation, speaking to a forgetful ECA could also frustrate the user. As our stated goal was not to emulate human memory but rather to provide a natural (but not too natural) communication experience with an avatar, we omitted the representation of forgetfulness.

Some features of the system remain to be explored, such as performance under extreme multimodal input. Future research could expand the interactive capabilities of the ECA to use speech as input and to include visual records of the user’s identity (such as a face) or behavioral gestures (such as pointing to objects). The system might also benefit from planning modules that track the state of the user model in order to gather identifying or otherwise useful data about the user. Finally, tests with more than one human interlocutor will be performed after the above issues have been addressed.

Finally, we should note that we do not believe that having episodic memory alone will make an ECA believable. Many other issues still need to be solved before that happens. However, we think it is an important step in this pursuit.


Corresponding author: Avelino J. Gonzalez, Computer Science Department, University of Central Florida, PO Box 162362, 4000 Central Florida Boulevard, HEC 346, Orlando, FL 32816-2362, USA

Funding: Division of Computer and Network Systems, NSF (Grant/Award Number: ‘CNS0703927’).

Bibliography

[1] J. R. Anderson, ACT – a simple theory of complex cognition, Am. Psychol. 51 (1996), 355–365. doi:10.1037/0003-066X.51.4.355.

[2] R. Artstein, S. Gandhe, A. Leuski and D. R. Traum, Field testing of an interactive question answering character, in: Proceedings of the ELRA Workshop on Evaluation, Marrakech, Morocco, 2008.

[3] R. E. Banchs, R. Jiang, S. Kim, A. Niswar and K. H. Yeo, Aida: artificial intelligent dialogue agent, in: Proceedings of the SIGDIAL 2013 Conference, pp. 145–147, 2013.

[4] N. O. Bernsen and L. Dybkjær, Domain-oriented conversation with HC Andersen, in: Affective Dialogue Systems, pp. 142–153, Springer, Berlin, Heidelberg, 2004. doi:10.1007/978-3-540-24842-2_14.

[5] N. O. Bernsen, M. Charfuel, A. Corradini, L. Dybkjær, T. Hansen, S. Kiilerich, M. Kolodnytsky, D. Kupkin and M. Mehta, First prototype of conversational H.C. Andersen, in: Proceedings of the Working Conference on Advanced Visual Interfaces, 2004. doi:10.1145/989863.989951.

[6] C. Brom and J. Lukavsky, Towards virtual characters with a full episodic memory II: the episodic memory strikes back, in: Proceedings of the AAMAS Workshop on Empathic Agents, 2009.

[7] C. Brom, J. Lukavský and R. Kadlec, Episodic memory for human-like agents and human-like agents for episodic memory, International Journal of Machine Consciousness 2 (2010), 227–244. doi:10.1142/S1793843010000461.

[8] J. Campos and A. Paiva, MAY: my memories are yours, in: J. Allbeck, N. Badler, T. Bickmore, C. Pelachaud and A. Safonova, eds., Intelligent Virtual Agents, Lecture Notes in Computer Science, vol. 6356, pp. 406–412, Springer, Berlin, 2010. doi:10.1007/978-3-642-15892-6_44.

[9] K. W. Church and P. Hanks, Word association norms, mutual information, and lexicography, Comput. Linguist. 16 (1990), 22–29. doi:10.3115/981623.981633.

[10] M. A. Conway, Memory and the self, J. Mem. Lang. 53 (2005), 594–628. doi:10.1016/j.jml.2005.08.005.

[11] L. F. D’Haro, S. Kim, K. H. Yeo, R. Jiang, A. I. Niculescu, R. E. Banchs and H. Li, Clara: a multifunctional virtual agent for conference support and touristic information, 2014. doi:10.1007/978-3-319-19291-8_22.

[12] M. Elvir, Episodic memory model for embodied conversational agents, Master’s thesis, Computer Engineering, University of Central Florida, 2009.

[13] Z. Faltersack, B. Burns, A. Nuxoll and T. L. Crenshaw, Ziggurat: steps toward a general episodic memory, in: Proceedings of the AAAI Fall Symposium Series on Advanced Cognitive Systems, pp. 106–111, 2011.

[14] K. T. Frantzi, Incorporating context information for the extraction of terms, in: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics, 1997. doi:10.3115/976909.979682.

[15] R. C. Atkinson and R. M. Shiffrin, The control of short-term memory, Sci. Am. 225 (1971), 82–90. doi:10.1038/scientificamerican0871-82.

[16] A. J. Gonzalez, R. F. DeMara, V. C. Hung, C. Leon-Barth, M. Elvir, J. R. Hollister, L. Soros, S. Kobosko, J. Leigh, A. Johnson, S. Jones, G. Carlson, J. Lee, L. Renambot and M. Brown, Passing an enhanced Turing test – interacting with lifelike computer representations of specific individuals, J. Intell. Syst. 22 (2013), 365–415. doi:10.1515/jisys-2013-0016.

[17] N. A. Gorski and J. E. Laird, Learning to use episodic memory, in: Proceedings of the 9th International Conference on Cognitive Modeling (ICCM-09), Manchester, UK, 2009.

[18] K. Hassani, A. Nahvi and A. Ahmadi, Architectural design and implementation of intelligent embodied conversational agents using fuzzy knowledge base, J. Intell. Fuzzy Syst. 25 (2013), 811–823. doi:10.3233/IFS-120687.

[19] K. Hassani, A. Nahvi and A. Ahmadi, Design and implementation of an intelligent virtual environment for improving speaking and listening skills, Interact. Learn. Envir. (2013), 1–20. doi:10.1080/10494820.2013.846265.

[20] W. C. Ho, K. Dautenhahn and C. L. Nehaniv, Computational memory architectures for autobiographic agents interacting in a complex virtual environment: a working model, Connect. Sci. 20 (2008), 21–65. doi:10.1080/09540090801889469.

[21] J. S. Justeson and S. M. Katz, Principled disambiguation: discriminating adjective senses with modified nouns, Comput. Linguist. 21 (1995), 1–27. doi:10.1017/S1351324900000048.

[22] H. Kanayama and T. Nasukawa, Fully automatic lexicon expansion for domain-oriented sentiment analysis, in: Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, 2006. doi:10.3115/1610075.1610125.

[23] Y. Kim, J. Bang, J. Choi, S. Ryu, S. Koo and G. G. Lee, Acquisition and use of long-term memory for personalized dialog systems, in: Multimodal Analyses Enabling Artificial Agents in Human-Machine Interaction, pp. 78–87, 2014. doi:10.1007/978-3-319-15557-9_8.

[24] S. Kopp, L. Gesellensetter, N. C. Krämer and I. Wachsmuth, A conversational agent as museum guide – design and evaluation of a real-world application, Intelligent Virtual Agents, Lecture Notes in Computer Science 3661 (2005), 329–343. doi:10.1007/11550617_28.

[25] L. W. Ku, Y. T. Liang and H. H. Chen, Opinion extraction, summarization and tracking in news and blog corpora, in: Proceedings of the AAAI-2006 Spring Symposium on Computational Approaches to Analyzing Weblogs, 2006.

[26] J. E. Laird, A. Newell and P. S. Rosenbloom, SOAR: an architecture for general intelligence, Artif. Intell. 33 (1987), 1–64. doi:10.21236/ADA205407.

[27] A. Leuski, R. Patel, D. R. Traum and B. Kennedy, Building effective question answering characters, in: Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, Sydney, Australia, 2006. doi:10.3115/1654595.1654600.

[28] E. Levin, S. Narayanan, R. Pieraccini, K. Biatov, E. Bocchieri, G. Di Fabbrizio, W. Eckert, S. Lee, A. Pokrovsky, M. Rahim, P. Ruscitti and M. Walker, The AT&T DARPA communicator mixed initiative spoken dialogue system, in: Proceedings of the International Conference on Spoken Language Processing, 2000. doi:10.21437/ICSLP.2000-224.

[29] M. Y. Lim, Memory models for intelligent social companions, in: M. Zacarias and J. Valente de Oliveira, eds., Human-Computer Interaction: The Agency Perspective, Studies in Computational Intelligence, vol. 396, pp. 241–262, Springer, Berlin, 2012.

[30] M. Y. Lim, W. C. Ho and R. Aylett, Spreading activation – an autobiographic memory retrieval mechanism for social companions, in: Proceedings of Intelligent Virtual Agents, 2010.

[31] Y. Liu, E. Shriberg, A. Stolcke and M. Harper, Comparing HMM, maximum entropy, and conditional random fields for disfluency detection, in: Proceedings of the European Conference on Speech Communication and Technology, 2005. doi:10.21437/Interspeech.2005-851.

[32] Microsoft, CCR introduction, MSDN, retrieved 24 September 2009, from http://msdn.microsoft.com/en-us/library/bb648752.aspx, 2009.

[33] Microsoft, DSS introduction, MSDN, retrieved 24 September 2009, from http://msdn.microsoft.com/en-us/library/bb483056.aspx, 2009.

[34] Microsoft, Microsoft Robotics Developer Studio, retrieved 24 September 2009, from http://www.microsoft.com/robotics/#About, 2009.

[35] F. Morbini, D. DeVault, K. Sagae, J. Gerten, A. Nazarian and D. Traum, FLoReS: a forward looking, reward seeking, dialogue manager, in: Natural Interaction with Robots, Knowbots and Smartphones, pp. 313–325, Springer, Berlin, 2014. doi:10.1007/978-1-4614-8280-2_28.

[36] A. I. Niculescu, R. Jiang, S. Kim, K. H. Yeo, L. F. D’Haro, A. Niswar and R. E. Banchs, Sara: Singapore’s automated responsive assistant, a multimodal dialogue system for touristic information, in: Mobile Web Information Systems, pp. 153–164, Springer, Berlin, 2014. doi:10.1007/978-3-319-10359-4_13.

[37] L. Nio, S. Sakti, G. Neubig, T. Toda and S. Nakamura, Combination of example-based and SMT-based approaches in a chat-oriented dialog system, in: Proceedings of ICE-ID, 2013.

[38] K. A. Norman, G. J. Detre and S. M. Polyn, Computational models of episodic memory, in: R. Sun, ed., The Cambridge Handbook of Computational Psychology, pp. 189–224, Cambridge University Press, Cambridge, 2008. doi:10.1017/CBO9780511816772.011.

[39] A. M. Nuxoll and J. E. Laird, Extending cognitive architecture with episodic memory, in: Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI), 2007.

[40] OpenCalais, API metadata – English, retrieved 12 November 2009, from http://opencalais.com/documentation/calais-web-service-api/api-metadata, 2009.

[41] OpenCalais, How does Calais work?, retrieved 12 November 2009, from http://www.opencalais.com/about, 2009.

[42] OpenNLP, retrieved 27 April 2009, from http://opennlp.sourceforge.net/, 2009.

[43] P. Pantel and D. Lin, Discovering word senses from text, in: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. doi:10.1145/775047.775138.

[44] M. Pasca, D. Lin, J. Bigham, A. Lifchits and A. Jain, Names and similarities on the web: fact extraction in the fast lane, in: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics, 2006. doi:10.3115/1220175.1220277.

[45] J. Planells, L. F. Hurtado, E. Segarra and E. Sanchis, A multi-domain dialog system to integrate heterogeneous spoken dialog systems, INTERSPEECH (2013), 1891–1895. doi:10.21437/Interspeech.2013-459.

[46] SharpNLP, retrieved 27 April 2009, from http://www.codeplex.com/sharpnlp, 2009.

[47] T. Shibata, Y. Egashira and S. Kurohashi, Chat-like conversational system based on selection of reply generating module with reinforcement learning, in: Proceedings of the 5th International Workshop Series on Spoken Dialog Systems, pp. 124–129, 2014.

[48] C. R. Sims and W. D. Gray, Episodic versus semantic memory: an exploration of models of memory decay in the serial attention paradigm, in: Proceedings of the 6th International Conference on Cognitive Modeling (ICCM2004), Pittsburgh, PA, 2004.

[49] R. Sun, Memory systems within a cognitive architecture, New Ideas in Psychology 30 (2012), 227–240. doi:10.1016/j.newideapsych.2011.11.003.

[50] N. H. Tollervey, Welcome to Program#, retrieved 9 November 2009, from http://aimlbot.sourceforge.net/, 2006.

[51] D. Traum and J. Rickel, Embodied agents for multi-party dialogue in immersive virtual worlds, in: Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems: Part 2, 2002. doi:10.1145/544862.544922.

[52] D. Traum, A. Roque, A. Leuski, P. Georgiou, J. Gerten, B. Martinovski, S. Narayanan, S. Robinson and A. Vaswani, Hassan: a virtual human for tactical questioning, in: Proceedings of the Eighth SIGdial Workshop on Discourse and Dialogue, Antwerp, Belgium, September 2007, pp. 71–74.

[53] E. Tulving, Episodic and semantic memory, in: E. Tulving and W. Donaldson, eds., Organization of Memory, pp. 381–403, Academic Press, San Diego, CA, 1972.

[54] E. Tulving, Elements of Episodic Memory, vol. 1, Oxford University Press, Oxford, 1983.

[55] The University of Waikato, Weka 3 – Data Mining with Open Source Machine Learning Software in Java, 2008.

[56] W. Wang, B. Subagdja, A. H. Tan and J. A. Starzyk, Neural modeling of episodic memory: encoding, retrieval, and forgetting, IEEE Transactions on Neural Networks and Learning Systems 23 (2012), 1574–1586. doi:10.1109/TNNLS.2012.2208477.

[57] Yahoo! Developer Network – Developers Resources, retrieved 12 November 2009, from http://developer.yahoo.com/everything.html, 2009.

Received: 2015-9-1
Published Online: 2016-2-25
Published in Print: 2017-1-1

©2017 Walter de Gruyter GmbH, Berlin/Boston
