Anthropomorphism refers to treating an artefact as if it had human-like characteristics even though it does not possess them. In this section, I suggest a method for tracking anthropomorphizing behavior in interaction over time.
3.1 Tracking Anthropomorphizing Behavior in Interaction
In order to describe how people anthropomorphize robots over the course of an interaction, we need a model that can track whether and, if so, in which ways or to what extent, people respond to a robot in a human-like way. Such a procedure would have to rely on people's
behavior, since their beliefs and attitudes are not directly available, and stopping them to fill out a questionnaire would disrupt the very interaction the analysis is setting out to explain. Similarly, we cannot rely on robots’ anthropomorphic design or behavior since it is the responses to these designs and behaviors that are at issue. Thus, any method that aims to investigate anthropomorphism in interaction has to rely on the observation of people's behaviors, and not just on single behaviors, such as whether or not a secret is kept [Kahn et al.
2015] or whether or not people agree to the robot's memory being erased [Seo et al.
2015], but potentially on all behaviors observable in interaction. That is, since anthropomorphism may express itself in facial expressions, gestures, polite verbal behavior and in many other ways, we would want to account for anthropomorphic responses with respect to all of these modalities, not just single actions. A method to track anthropomorphizing behavior in interaction over time thus has to rely on observing and classifying a broad range of people's behaviors towards robots in interaction.
Now, a minimalistic classification would simply categorize people's behavior as either anthropomorphizing or not, whereas a maximalist version of such a classification might specify the type or extent of anthropomorphization. One might, for instance, identify, based on each of a person's observable behaviors, what kind of trait is attributed, how uniquely human it is and, correspondingly, how easy or difficult it is to attribute this trait to the artifact in question [cf. Ruijten et al. 2019]. This would, however, require that specific attributions can be identified and intersubjectively classified, even though all we have to go on is how people behave. The question is whether such a classification is feasible and how the relevant judgements can be made on the basis of what is observable; it therefore seems safer to (a) keep the number of distinctions small, and (b) allow an easy identification of the level oriented to.
The proposal made here is to use a model of depiction [Clark
2016] because it allows us to describe people's responses to robots without presupposing that we know what causes this behavior. For instance, Nass & Moon [2000] suggest that people respond to technological artifacts such as robots as they would to other humans out of mindlessness; Nass [2004] argues that anthropomorphizing behavior is basically a mistake, rooted in our evolutionary history, in which the only social actors we encountered were human. In contrast, Ruijten et al. [2019] regard anthropomorphism as a predisposition, whereas social-constructivist approaches may take it to be an instance of social practice, such that anthropomorphizing behavior would just be ‘how things get done’ [cf. Edwards
1994; Hutchby
2001]. In order to avoid circularity, a model that describes observable behaviors should not presuppose one or the other explanation and should thus be largely theory-neutral. Therefore, I analyze the robot's behavior based on Clark's three levels of depiction [Clark
2016], and the person's behavior in terms of orientation to these levels. From this perspective (Clark & Fischer under revision), the robot's design is viewed as an act of staging, in which the robot is depicted as a social actor.
We understand by depiction a very common strategy of human interaction in which one person stages a behavior for another person in order to evoke a temporally and spatially removed situation. This strategy is most obvious in plays and other kinds of fictional staging (in movies, video games, pretend play etc.), but it also occurs frequently in conversation in general when people report on events involving other people, for instance, a person telling her friend about her tyrannical boss. Clark's model involves (a) a base scene, which consists of the concrete means by which the depiction is carried out, e.g. gesture, facial expressions, a loud voice etc., i.e. the respective ‘trigger’; (b) the proximal scene, which comprises the scene depicted, e.g. an angry person with a loud voice and exaggerated gestures; and (c) the distal scene, for instance, the speaker's boss getting outraged about a mistake someone made. Robots are special kinds of depictions since the scene evoked is not necessarily spatially or temporally removed, but rather concerns a fictitious being (cf. Clark & Fischer under revision for more details on the full model).
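To make the three levels concrete, the following minimal sketch represents a depiction as a record with its base, proximal and distal scenes, instantiated with the two examples just discussed. The class and field names are my own illustrative assumptions, not part of Clark's model.

```python
from dataclasses import dataclass

@dataclass
class Depiction:
    base_scene: str      # the concrete means of the depiction (the 'trigger')
    proximal_scene: str  # the scene depicted, i.e. the encoded meaning
    distal_scene: str    # the removed or fictional scene evoked

# The conversational example of the tyrannical boss:
boss_story = Depiction(
    base_scene="speaker uses a loud voice and exaggerated gestures",
    proximal_scene="an angry person with a loud voice and exaggerated gestures",
    distal_scene="the speaker's boss getting outraged about a mistake someone made",
)

# A robot as a special kind of depiction: the distal scene concerns a fictitious being.
robot_wave = Depiction(
    base_scene="robot moves its gripper from left to right and back",
    proximal_scene="the robot waves",
    distal_scene="a friendly character greeting me and initiating an interaction",
)
```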
Concerning human-robot interaction, the base scene is relatively straightforward to define: it concerns the mechanical properties of a robot as a machine. Less straightforward is the definition of the proximal scene, which can be understood as the encoded meaning of a robot's behavior. That is, the relationship between base and proximal scene can be understood as one of signaling [Peirce
1998], such that the base scene (e.g. a robot moving its actuators) may evoke a particular meaning (e.g. the robot waving) by means of indexical, iconic or symbolic relations between the movement and its interpretation. An index is a sign in which the form consists of a pointer to a particular referent; an example is the English word “I,” which indexes the speaker but does not itself ‘mean’ John, Mary, Herb or Kerstin. For instance, directing the robot's head towards a participant's position
indicates attention towards him or her. An icon is a sign where the form is connected to its meaning by means of similarity; for instance, the robot moving its gripper from left to right and back is similar to a human waving gesture, i.e. it can be taken to
iconically represent waving. Finally, a symbol is a sign with an arbitrary, conventional relationship between form and interpretation, the prime example being language, where the sequence of letters c-a-t is conventionally taken to mean ‘cat’. When a robot says “hello,” it conventionally
symbolizes a greeting.
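The three sign relations that can link a base-scene behavior to its proximal-scene meaning can be summarized in a small sketch; the enum and the example mappings below are assumptions for exposition, not an implemented system.

```python
from enum import Enum

class SignType(Enum):
    INDEX = "index"    # the form points to its referent (head directed at a person)
    ICON = "icon"      # the form resembles its meaning (gripper motion ~ waving)
    SYMBOL = "symbol"  # the form is conventionally linked to its meaning ("hello")

# (base scene behavior, proximal scene meaning, sign relation)
ENCODINGS = [
    ("robot directs its head towards the participant", "attention towards him or her", SignType.INDEX),
    ("robot moves its gripper from left to right and back", "waving", SignType.ICON),
    ("robot says 'hello'", "a greeting", SignType.SYMBOL),
]

for base, proximal, sign in ENCODINGS:
    print(f"{base} -> {proximal} ({sign.value})")
```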
In contrast, the distal scene goes beyond the encoded meaning and concerns the pragmatic interpretation of an action. The definition of the distal scene is thus also relatively unproblematic since it concerns the completely fictional, lifelike character evoked by the robot, for instance, as a companion, friend or collaborator. For example, the robot moving its hand encodes a waving gesture, but the interpretation that it sees me, recognizes me, wants to attract my attention and initiate an interaction by waving at me goes beyond what is encoded. Responses to both the proximal and the distal scene thus involve anthropomorphizing; however, while the distal scene evokes the robot as a character, i.e. as a social actor, the proximal scene only assumes that the robot is engaged in meaningful action, i.e. that its behavior means something.
The two levels of depiction assumed thus differ in the underlying processes: whereas the proximal scene relies on semiotic processes, i.e. on the interpretation of signs, the distal scene is essentially fictional and relies on imagination [e.g. Walton 1993] and thus involves a certain degree of pretending on the part of the recipient [Clark 1999]. The distinction is useful for our analysis because it helps distinguish between features that are encoded in the (behavior) design of the robot and features that go beyond this; for instance, using a voice with features characteristic of a human female indexes a female speaker; that is, there is an objective, indexical relationship between certain voice characteristics and human gender. In contrast, if people associate gender stereotypes with a (human or robotic) speaker with these voice characteristics, then these stereotypes are not encoded in, but associated with, the voice [e.g. Tay et al.
2014; Bryant et al.
2020; Law et al.
2020]. That is, by means of its speech characteristics, the robot is depicted as female, but an expectation of higher competence in a certain occupation is not part of the robot's (behavior) design; it is rather “in the eye of the beholder”, i.e. part of the distal scene evoked.
In our model (Clark & Fischer under revision), a robot is thus similar to Mickey Mouse in Disneyland: The man in the costume constitutes the basis for the depiction; the mouse-shaped costume evokes the depiction proper, a large mouse with eyes, ears and a mouth. The distal scene is then that Mickey Mouse, a character one is familiar with from numerous comics, is present in the here and now. Regarding Mickey Mouse, kids can orient to the man in the costume (e.g. “is it hot in there?”), to the depiction of the mouse (e.g. “you have a large head!”) or to the character (e.g. “Mickey, where is Minnie?”). To which of the three scenes they orient can thus be identified from their behavior.
Similarly, the robot has a mechanical base that enables the robot's functionality; the robot's design, functionality and presentation in context suggest that it can be interacted with by talking to it. And finally, the staging of the robot evokes a distal scene in which the robot can be understood as a human-like character. As with kids' interactions with Mickey Mouse, people's behavior in response to a robot's actions reveals which of the three scenes they orient to.
Table 1 illustrates how the depiction model applies to various kinds of depictions [see also Clark 2016].
Thus, the depiction model assumes that depictions allow attention to three different perspectives or scenes, which exist in parallel, and all three can be in focus at different points in time. For instance, concerning the conversational depiction of the angry boss, the listener can focus on the base scene by saying ‘not so loud’, on the proximal scene by saying ‘you sound more cynical than angry’, or on the distal scene by saying ‘your boss is really not a nice person’.
Thus, what we are proposing is an anatomy of the ‘social cue’: on the side of the robot, we understand a social cue as the staging of a design choice or behavior that is intended to signal a certain human-like behavior, which in turn is meant to evoke a particular character or fictional being (for the explanatory power of this model, see Clark & Fischer under revision). In a second step, I employ qualitative methods to analyze to which of these three levels people's behavior is oriented (see Section 3.3 below).
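As a hedged illustration of this anatomy of a social cue, one might represent a cue as a record with one field per scene; the class and the example below (drawing on the gendered-voice discussion above) are my own sketch, not an implementation from Clark & Fischer.

```python
from dataclasses import dataclass

@dataclass
class SocialCue:
    staging: str    # base scene: the design choice or behavior that is staged
    signal: str     # proximal scene: the human-like behavior it is meant to signal
    character: str  # distal scene: the character or fictional being it is meant to evoke

# Hypothetical example based on the gendered-voice discussion above:
female_voice_cue = SocialCue(
    staging="speech synthesized with pitch and timbre characteristic of a human female voice",
    signal="a female speaker",
    character="a female persona, to which listeners may attach gender stereotypes",
)
```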
3.2 Describing Robot Behavior
For the description of robots’ behavior, we can use the depiction model to account for the discrepancy between their mechanical nature as artefacts, the behaviors designed, and the intended effects. In particular, the base scene corresponds to the fact that robots are artefacts that operate mechanically or are preprogrammed or remotely controlled; it describes the robot's actions on the mechanical level, such as moving its joints. Responses to the base scene are thus clearly non-anthropomorphic. The proximal scene corresponds to the recognition of this scene as a purposeful action by the robot, such as lifting its arm. Responses to this scene are anthropomorphic since they rely on the attribution of certain properties that the robot does not have, but which are signaled by the design or behavior of the robot, such as having eyes or looking at something. The distal scene, finally, accounts for the possible function of the respective behavior in context, which may in this case be that the robot is trying to reach for something. It concerns a fictional scene in which the robot is a character, for instance, a human-like being.
Let us consider some examples from human-robot interaction (see Table
2); for instance, Nakata et al. [
1998] introduced subjects to a fish-shaped experimental robot which is sensitive to touch. In one of the three behaviors implemented, the robot lifts its ‘head’ when sensing human touch. The authors define this behavior as “a repelling tactile reaction” and label it “rebellious.” Using the depiction model, we can distinguish between the base scene, in which the robot's motors push up the upper part of the robot when the input to the touch sensor exceeds a threshold; the proximal scene, in which the robot ‘pushes back;’ and the distal scene, in which the robot's behavior is understood as expressing a rebellious intention.
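A minimal sketch of how such a behavior might be implemented and then described at the three levels is given below; the threshold value and function name are assumptions for illustration, not Nakata et al.'s actual implementation.

```python
TOUCH_THRESHOLD = 0.5  # hypothetical sensor threshold (not from Nakata et al.)

def repelling_tactile_reaction(touch_sensor_value: float) -> str:
    """Base scene: motors push up the robot's upper part when touch input exceeds a threshold."""
    if touch_sensor_value > TOUCH_THRESHOLD:
        return "lift_head"  # proximal scene: the robot 'pushes back'
    return "rest"

# Distal scene: an observer may understand the lifted head as expressing a rebellious intention.
print(repelling_tactile_reaction(0.8))  # -> lift_head
```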
As another example, consider Tielmann et al.’s [2014] model of robot emotional expression in child-robot interaction, in which the robot's behavior varies with respect to features such as speech volume, pitch, speech rate, head pose, eye color, and arm and trunk pose. The robot is designed to match the child's measured emotional state, which is analyzed in terms of arousal, valence and level of extroversion. The authors describe the relationship as follows: “The voice of the robot will be influenced by its arousal. The higher arousal, the louder the robot will speak, the higher pitched its voice will be and the higher the speech rate. Speech volume is also influenced by extroversion, the higher the extroversion, the louder the voice.” Thus, according to our model, the base scene is constituted by a machine producing speech in a loud and high-pitched voice, the proximal scene is the signaling of arousal, and the distal scene is the indication to the child that the robot is as excited about something as the child is.
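The relationship the authors describe can be sketched roughly as follows; the concrete numbers, ranges and function name are my own illustrative assumptions, not Tielmann et al.'s parameters.

```python
def voice_parameters(arousal: float, extroversion: float) -> dict:
    """Map arousal and extroversion (assumed in 0..1) to speech parameters."""
    volume = 0.4 + 0.4 * arousal + 0.2 * extroversion  # louder with higher arousal and extroversion
    pitch_hz = 180 + 120 * arousal                      # higher pitched with higher arousal
    rate_wpm = 140 + 60 * arousal                       # faster speech with higher arousal
    return {"volume": round(volume, 2), "pitch_hz": round(pitch_hz), "rate_wpm": round(rate_wpm)}

# Base scene: a machine producing loud, high-pitched, fast speech;
# proximal scene: signaling arousal; distal scene: a robot as excited as the child.
print(voice_parameters(arousal=0.9, extroversion=0.7))
```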
As a last example, consider the implementation of different types of gestures in a narrating robot [Huang & Mutlu
2013]. The robot's gestures were carefully modeled after human gestures previously elicited in human narrations and coordinated with particular lexical material in the robot's speech stream. The authors find, for instance, that the “robot's use of iconic gestures significantly predicted males’ perceptions of the robot's competence”. Using our model, we can distinguish between the robot producing a previously demonstrated and recorded posture of its ‘hand’ at the same time as a particular word is used, which constitutes the base scene; the interpretation of the hand posture as an iconic gesture, which constitutes the proximal scene; and the impression of a competent robot that can accompany its words with appropriate gestures that indicate that it understands what it is saying, which constitutes the distal scene.
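A rough sketch of this kind of gesture-speech coordination is given below; the scheduling logic, data format and names are assumptions for illustration rather than Huang & Mutlu's implementation.

```python
def schedule_gestures(words_with_times, gesture_lexicon):
    """Return (onset_time, gesture) pairs so that each gesture starts with its affiliate word."""
    schedule = []
    for word, onset in words_with_times:
        if word in gesture_lexicon:
            schedule.append((onset, gesture_lexicon[word]))  # base scene: replay a recorded posture
    return schedule

# Proximal scene: the posture is read as an iconic gesture; distal scene: a competent
# robot that appears to understand what it is saying.
words = [("the", 0.0), ("big", 0.4), ("mountain", 0.7)]
print(schedule_gestures(words, {"mountain": "iconic_mountain_shape"}))  # -> [(0.7, 'iconic_mountain_shape')]
```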
By conceptualizing the robot's behavior as an act of depiction, we can analyze in the next step what level of the robot's behavior users rely on while interacting with the robot, moment by moment. We thus apply the depiction model for describing the robot's behavior, and in a second step use it to describe the ways in which people respond to the robot's actions.
3.3 Describing Human Behavior in Interaction with Robots
For the analysis of participants’ responses to the robot in interaction, we take the micro-analytic perspective on participants’ actions developed in the framework of ethnomethodological conversation analysis [cf. Garfinkel
1972; Sacks
1992]. In this methodological framework [cf. Sacks et al.
1974], like in other collaborative models of communication [such as Clark
1996; Bangerter & Mayor
2013], it is assumed that in order for interaction to succeed, participants need to continuously display to each other how they understand each other's contributions and what they consider the relevant context of the current action to be [cf. Clark & Schaefer
1989]. For instance, if speaker A says “sit down”, this utterance
per se is highly ambiguous; it could be a suggestion, an invitation, or a command, among many other things. However, once speaker B has responded to it, for instance by saying “thank you”, the interpretation of A's utterance is specified as an invitation. In the next turn, speaker A may reject this interpretation, for instance, by saying “this was an order!”; however, if she does not challenge speaker B's interpretation, ‘invitation’ is the intersubjectively established interpretation of A's utterance [cf. Sacks et al.
1974]. Because speakers in interaction need to display to each other their interpretations of each other's utterances in order to arrive at intersubjectively available common ground [cf. Clark & Schaefer 1989], their interpretations of each other's behaviors are also available to the observer [see Sacks
1984].
Correspondingly, in the method proposed, we look at the ways in which the participants display how they make sense of the robot's behavior.
Our micro-analysis of the users’ responses to the robot's depiction of human social behavior can indicate to which scene the user attends, in the sense that the human displays an orientation to a certain aspect of the situation; for instance, in the example above, by saying “thank you,” speaker B displays her orientation to speaker A's utterance as an invitation. While in this example the display concerns the interpretation of the function of the utterance, people also design their behaviors specifically for their current interaction partners. That is, each utterance is designed such that the particular recipient can understand it [Levinson
2006]; who the speaker takes the recipient of the utterance to be, for example, whether the speaker assumes that he or she is talking to a machine or to a lifelike character, can thus also be identified from the way the utterance is designed [i.e. by means of ‘reverse recipient design’, cf. Fischer et al.
2012]. Using this approach, we can analyze participants’ behaviors towards robots by classifying them according to (a) whether they display an orientation to the base scene by treating the robot as a mechanical tool; (b) whether they display an orientation to the proximal scene by displaying an understanding of the robot's behavior as being intentional and meaningful; or (c) whether they attend to the distal scene by displaying an understanding of the robot's behavior as human-like.
In particular, for each response to a robot's actions, we identify which of the three scenes proposed it is a response to. For instance, a robot's greeting “Hello, how are you?” can be responded to in several different ways [see Fischer 2011]: One possibility is to reciprocate the greeting by replying “I'm fine, thanks, how are you?”. This is a response at the distal level, where two people are having a polite conversation. A second possibility is to treat the robot's utterance as canned speech produced by a machine to indicate the beginning of the interaction. In this case, people may say “drive forward”. This utterance signals an orientation towards the base scene. Furthermore, people may acknowledge that the robot was asking a question and producing a greeting by responding “fine” or “hello”. In this case, people orient towards the proximal scene, in which the robot is taken to produce some type of social action. However, the human interlocutor does not enter into the joint pretense [cf. Clark
1999] of having a polite conversation, but acknowledges that the robot's behavior may have a social function, such as opening the interaction (‘hello’) and asking a question (‘fine’). These responses would therefore be located at the proximal level.
We can thus classify people's behavior as being oriented to the base, proximal or distal scene, where responses to the proximal and distal scene are anthropomorphizing since they involve sense-making by drawing on the human domain. While the distal scene clearly evokes a fictional, human-like being, the proximal scene also involves human-like categories, such as the attribution of human functions (e.g. the robot “sees” something), and can be (and has been) considered anthropomorphism [e.g. Fussell et al. 2008]; yet the mechanisms underlying these two scenes are most likely quite different (that is, semiotic versus inferential processes [cf. Fischer 2016]), and we therefore keep them distinct in our classification.
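For the analysis itself, this classification can be thought of as a simple coding scheme. The sketch below codes the responses to the robot's greeting discussed above; the enum, dictionary and helper function are my own illustrative names, not a tool used in the study.

```python
from enum import Enum

class Scene(Enum):
    BASE = "base"          # robot treated as a machine producing canned speech
    PROXIMAL = "proximal"  # robot's behavior treated as a meaningful social action
    DISTAL = "distal"      # robot treated as a human-like character

# Example codings for responses to the greeting "Hello, how are you?" [see Fischer 2011]:
CODED_RESPONSES = {
    "I'm fine, thanks, how are you?": Scene.DISTAL,
    "hello": Scene.PROXIMAL,
    "fine": Scene.PROXIMAL,
    "drive forward": Scene.BASE,
}

def is_anthropomorphizing(scene: Scene) -> bool:
    """Responses oriented to the proximal or distal scene count as anthropomorphizing."""
    return scene in (Scene.PROXIMAL, Scene.DISTAL)
```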
Given the multimodal nature of interaction, users’ displays of how they understand the partner's actions may potentially comprise a very broad range of signals. For instance, we can look at what people say; in addition, the timing of responses to the partner's behavior, i.e. whether responses fall within the response time typical of human interaction of about 300 ms [see Jefferson 2004], can indicate to the analyst that the person is orienting to the robot in ways similar to how they would orient to a human communication partner [cf. Hutchby
2001]. Similarly, people may or may not gaze towards robots at times when they would also gaze towards human interaction partners [e.g. Andrist et al.
2014]. Anthropomorphizing responses may consequently become evident in all kinds of communicative behaviors.
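One such timing-based indicator could be operationalized as in the following sketch, assuming time-stamped turns; the tolerance value and function names are assumptions, while the 300 ms reference value follows the text above.

```python
HUMAN_LIKE_GAP = 0.3  # seconds; response time typical of human interaction [see Jefferson 2004]

def human_like_timing(robot_turn_end: float, user_turn_start: float,
                      tolerance: float = 0.2) -> bool:
    """Return True if the user's response latency is close to human conversational timing."""
    gap = user_turn_start - robot_turn_end
    return abs(gap - HUMAN_LIKE_GAP) <= tolerance

print(human_like_timing(robot_turn_end=12.40, user_turn_start=12.71))  # -> True
```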
The method proposed thus targets anthropomorphism in terms of anthropomorphizing behavior, i.e. behavior that attributes human-like qualities to a robot. It is consequently in line with work investigating behavioral responses, such as keeping a robot's secret [e.g. Kahn et al.
2015], saving a computer's face [Nass
2004] or describing a robot in intentionalistic ways [Fussell et al.
2008]. However, the method proposed does not make claims regarding people's subjective understandings of robots, i.e. the psychological basis for the observable behavior. For instance, the method proposed does not make assumptions about psychological mechanisms such as mindless transfer [Nass & Moon
2000], automatic trait attribution [Roebroek
2014], or a genuine belief that the robot under consideration has a set of human-like characteristics. That is, our study is neutral with respect to possible explanations for the observable anthropomorphizing behavior. Nevertheless, as we shall see (Section
6), the results of tracking people's anthropomorphizing behaviors in interaction over time will have considerable consequences for explanatory models of anthropomorphism in human-robot interaction. Bridging the gap between people's cognitive processes and their observable behavior unfortunately goes far beyond the scope of this paper (but see Clark & Fischer under revision). What the method proposed here does allow us to do, however, is to describe people's responses to robotic action and track anthropomorphizing behavior dynamically over the course of an interaction.