
1 Introduction

Nonverbal cues play important roles in everyday communication. Studies have examined their importance not only for the affective and attitudinal aspects of communication (Mehrabian and Ferris 1967; Mehrabian and Wiener 1967), but also for the coordination of communication and “grounding”, i.e. constructing a shared understanding of the communication context (Clark and Brennan 1991; Clark 1996; Clark and Krych 2004), suggesting that the potential exists to expand and augment HCI (human-computer interaction) systems.

Among nonverbal modalities in communication, gazing activities have been considered fundamental, attracting considerable attention from researchers working in the area of multi-modal communication. Studies have reported that gaze has important communicative functions including expressing emotional states, exercising social control, highlighting the informational structure of speech, and organizing the speech floor (Argyle et al. 1968; Duncan 1972; Holler and Kendrick 2015; Kendon 1967). From the viewpoint of interaction organization in communication, studies have reported that gaze can be a cue for speech floor coordination not only in dyadic (Kendon 1967), but also in multi-party conversations (Kalma 1992; Lerner 2003). Although the findings of other studies are not necessarily entirely consistent with those of the studies mentioned above (Beattie 1978; Rutter et al. 1978), this can probably be attributed to the multi-functional nature of gaze in communication (Kleinke 1986), and recent studies have confirmed the speech floor coordination function of gaze for dyadic (Ho et al. 2015) and multi-party conversations (Jokinen et al. 2013; Ishii et al. 2016; Vertegaal et al. 2001; Ijuin et al. 2018). Another study indicated that gaze can be a collaborative signal that serves as a cue to coordinate the insertion of responses (Bavelas et al. 2002). Furthermore, another study reported that even uninvolved observers of dyadic interactions followed the interactants’ speaking turns with their gaze (Hirvenkari et al. 2013).

Inspired by and based on studies that examined the functions of social gaze, system studies that incorporate gaze modalities have been proposed in the HCI and CSCW (computer-supported cooperative work) fields. Such studies have covered not only conversational agents (Cassell et al. 1994; Vertegaal et al. 2001; Garau et al. 2001; Heylen et al. 2005; Rehm et al. 2005) but also robots and devices with simulated gaze expression (Sidner et al. 2004; Bennewitz et al. 2005; Kuno et al. 2007; Foster et al. 2012; Lala et al. 2019; Jaber et al. 2019; McMillan et al. 2019).

Although HCI, HRI (human-robot interaction) and CSCW systems have to some extent been able to take gazing cues into account and integrate gaze functions, they have been less successful in incorporating linguistic proficiency that may affect gazing activities. A remote work study in the HCI field argued that video transmission of facial information and gesture helped non-native pairs to negotiate a common ground, whereas this did not provide significant help for native pairs (Veinott et al. 1999). An analysis of second language conversation reported that eye gazes and facial expressions play an important role in monitoring both partners’ understanding in the repair process (i.e. a modification to the content or presentation of the current proposition under consideration (Schegloff et al. 1977; Traum 1994)) where participants with different levels of linguistic proficiency are involved (Hosoda 2006).

Some quantitative studies have also examined the effect of linguistic proficiency on the speech floor coordination function of gaze. Analyses of the duration of the listener’s gaze during utterances have shown that when other participants are looking at the speaker in a second language (L2) conversation, the duration is significantly longer than in a first language (L1) conversation (Yamamoto et al. 2013; Umata et al. 2013; Yamamoto et al. 2015). These studies, however, have not considered the effects of communicative context. Kleinke pointed out that the conditions of a conversational setup may affect the relative importance of the multiple functions of gaze in communication (Kleinke 1986). Holler and Kendrick analyzed three-party conversations among native English speakers, showing that unaddressed participants were able to anticipate next turns in question-response sequences involving just two of the participants (Holler and Kendrick 2015). There are also studies that have shown the effects of interaction contexts on gazing behavior in social interactions (Rossano 2013; Kendrick and Holler 2017; Rossano et al. 2009; Stivers and Rossano 2010). The role of gaze in communication is affected by the context, and it is important to analyze the function of gaze during utterances while taking their communicative function into consideration.

The current study examined the effects of linguistic proficiency on the listener’s gaze in triadic communication, considering the communicative function of utterances from the viewpoint of grounding. Each utterance was categorized according to its grounding act in the dialogue, and the gazing activities of the listeners were compared between native and second language conversations. We anticipated that conversation topics could also affect a listener’s gazing activities, and included the topic factor in our analysis. The results suggest that both the language proficiency and topic factors independently affect the duration of a listener’s gaze during utterances in which the speaker provides new pieces of information, but not during utterances that just acknowledge the previous speaker’s utterance.

2 Corpus

Our analysis is based on a multimodal triadic interaction corpus with eye-gaze data collected and analyzed in previous studies (Yamamoto et al. 2015; Ijuin et al. 2018; Umata et al. 2018).

The corpus consists of triadic conversations in a mother tongue (L1) and in a second language (L2) conducted by the same interlocutors in the same groups (for details, refer to Yamamoto et al. 2015). For the current study, all utterances were newly labeled with grounding act tags (details are provided below in this section), and all the conversation data were subjected to analysis. A total of 60 subjects (23 females and 37 males: 20 groups) between the ages of 18 and 24 participated in data collection, and each conversational group consisted of three participants. All participants were native Japanese speakers.

The participants’ seats were placed about 1.5 m apart from each other in a triangular formation around a round table (see Fig. 1 and Fig. 2). The corpus covers two conversation types, allowing us to examine whether differences in conversation type affect interaction behavior.

The first type is free-flowing, natural chatting that ranges over various topics such as hobbies, weekend plans, studies, and travels. The other type is goal-oriented, in which participants collaboratively decided what to take with them on trips to uninhabited islands or mountains. All the participants would be under pressure to contribute to the conversation to reach an agreement in the goal-oriented conversations, whereas such pressure would not be so strong in free-flowing conversations where reaching an agreement was not obligatory.

Fig. 1.

Seating positions of the three participants.

Fig. 2.

Seating positions of the three participants.

We expected that conversational flow would be more predictable in the goal-oriented conversations where the vocabulary was more limited and the domain of the discourse was defined more narrowly by the task than in the free-flowing conversations.

The order of the conversation types was arranged randomly to counterbalance any order effect, as was the order of the languages used in the conversations. Each group had approximately six-minute conversations of the two types in both Japanese and English, so all twenty groups engaged in all four conversation conditions. We thus collected multimodal data from 80 three-party conversations in L1 (Japanese) and L2 (English) (20 free-flowing in Japanese, 20 free-flowing in English, 20 goal-oriented in Japanese, and 20 goal-oriented in English). All the participants except those in the first three groups answered a questionnaire evaluating their conversation after each conversation condition; these questionnaire data are analyzed in other studies (see Umata et al. 2013).

Their eye gazes and voices were recorded via three sets of NAC EMR-9 head-mounted eye trackers and headsets with microphones. The viewing angle of the EMR-9 was 62° and the sampling rate was 60 frames per second. We used the EUDICO Linguistic Annotator (ELAN) developed by the Max Planck Institute as the tool for gaze and utterance annotation (see Fig. 3). Each utterance was segmented from speech at pauses of more than 500 ms, and the corpus was manually annotated in terms of the time spans for utterances, backchannels, laughing, and eye movements. The corpus already had grounding act tags, assigned according to the categories established by Traum (1994), for the 20 groups’ goal-oriented conversations. For the current study, we trained a university student to annotate the 20 groups’ free-flowing conversations according to the same categories. She annotated the tags using ELAN with video, gaze, and utterance transcription data in the same manner as in the previous study (Umata et al. 2019). Table 1 shows the grounding act tags and their descriptions, and Fig. 1 shows the frequency of grounding acts in L1 and L2 conversation.
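The pause-based utterance segmentation described above can be sketched as follows. This is a hypothetical illustration, not the corpus tooling itself; the function name and the representation of voiced stretches as (start, end) tuples are assumptions.

```python
def segment_utterances(voiced, max_pause=0.5):
    """Merge voiced (start, end) intervals, in seconds, into utterances.

    A new utterance begins whenever the silent gap between two
    consecutive voiced stretches exceeds `max_pause` seconds (500 ms
    in the corpus described above).
    """
    utterances = []
    for start, end in sorted(voiced):
        if utterances and start - utterances[-1][1] <= max_pause:
            # Short pause: extend the current utterance to this stretch.
            utterances[-1] = (utterances[-1][0], end)
        else:
            # Pause longer than 500 ms (or first stretch): new utterance.
            utterances.append((start, end))
    return utterances
```

For example, voiced stretches at 0.0–1.2 s, 1.5–2.0 s, and 3.0–4.0 s yield two utterances, because only the first gap (0.3 s) falls under the 500 ms threshold.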

Table 1. Traum’s grounding acts

3 Analyses of Gazes in Utterances

We analyzed the gazing activities of listeners in triadic conversations, taking the factors of linguistic proficiency, topic, and grounding into account. Previous studies of the listener’s gaze during utterances have shown that when other participants are looking at the speaker in a second language (L2) conversation, the gaze is significantly longer than in a first language (L1) conversation (Yamamoto et al. 2013; Umata et al. 2013; Yamamoto et al. 2015), suggesting that listeners use visual information to compensate for their lack of linguistic proficiency in an L2 conversation. We assumed that the linguistic proficiency factor would affect the listener’s gazing activity. We also assumed that listeners would rely more heavily on visual information in a collaborative task where the requirement for communication organization is strong. The grounding act factor was also expected to affect the gazing activity of listeners; i.e. they would rely more heavily on visual information during an utterance in which new information is presented. Our hypotheses are as follows:

H1: The linguistic proficiency factor would affect the duration of a listener’s gaze: listeners would gaze at the speaker for longer in second language conversations, where they compensate for their lack of linguistic proficiency with gazing cues.

H2: The topic factor would affect the duration of a listener’s gaze: listeners would gaze at the speaker for longer in goal-oriented conversations, where the requirement for communication organization is stronger because an agreement has to be reached.

H3: The grounding act factor would affect the duration of a listener’s gaze: listeners would gaze at the speaker for longer during utterances presenting new information (namely, init, cont, and ack init) than during utterances just acknowledging the previous utterance.

We compared the duration of each listener’s gaze during the four major categories of grounding acts (i.e., init, ack init, cont, and ack) between L1 and L2 conversations. We used the average of the listeners’ gazing ratios to analyze how long the speaker was gazed at by the other participants. The average of the listeners’ gazing ratios (LGR) for the ith utterance was defined as:

LGR(i) = Σj DLGj(i) / (2 · D(i))

Here, D(i) is the duration of the ith utterance and DLGj(i) is the total gaze duration of the jth participant (j = 1, 2, 3) in each group gazing at the speaker in the ith utterance; dividing by two averages the ratio over the two listeners in each utterance.
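The per-utterance ratio defined above can be computed as in this minimal sketch; the function name and argument layout are assumptions for illustration, not the study’s actual analysis code.

```python
def listener_gazing_ratio(d_i, dlg, n_listeners=2):
    """Average listeners' gazing ratio LGR(i) for one utterance.

    d_i : duration D(i) of the i-th utterance, in seconds
    dlg : iterable of DLGj(i) values, i.e. the total time each
          non-speaking participant spent gazing at the speaker
          during the utterance
    """
    # Sum the listeners' gaze durations and normalize by the number
    # of listeners times the utterance duration.
    return sum(dlg) / (n_listeners * d_i)

# Example: a 4 s utterance during which the two listeners gazed at
# the speaker for 3 s and 2 s, respectively.
print(listener_gazing_ratio(4.0, [3.0, 2.0]))  # 0.625
```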

We expected that the topic factor would affect the duration of the listener’s gaze: the listeners would gaze at the speaker for longer in goal-oriented conversations, where they collaboratively decided what to take with them on a trip to a deserted island or to the mountains. We also expected that the linguistic proficiency factor would affect the duration of the listener’s gaze: the listeners would gaze at the speaker for longer in second language conversations, where they compensate for their lack of linguistic proficiency with gazing cues, especially in speech turn organization. We conducted an analysis of variance (ANOVA) with language difference, topic difference, and grounding act as within-subject factors. The results revealed significant main effects of language (F(1, 113) = 45.875, p < .001), topic (F(1, 113) = 16.416, p < .001), and grounding act (F(2.589, 292.612) = 204.8, p < .01), and the multiple comparison analysis showed that the differences among the four major grounding acts were all significant (p < .001). We also observed significant first-order interactions between language and grounding act (F(2.702, 305.327) = 24.551, p < .001) and between topic and grounding act (F(3, 339) = 4.516, p < .005). Sub-effect tests showed significant simple main effects of language in the grounding acts “init” (F(1, 113) = 15.81, p < .001), “cont” (F(1, 113) = 78.20, p < .001), and “ack-init” (F(1, 113) = 12.20, p < .01); significant simple main effects of topic in the grounding acts “cont” (F(1, 113) = 6.22, p < .05) and “ack-init” (F(1, 113) = 12.20, p < .01); and a marginally significant simple main effect of topic in the grounding act “init” (F(1, 113) = 3.18, p < .1). There was no significant simple main effect of either language or topic in the grounding act “ack”. The distribution of listeners’ gazing ratios (LGRs) is shown in Fig. 3.
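The 2 (language) × 2 (topic) × 4 (grounding act) within-subject design behind this ANOVA can be summarized by the mean LGR in each cell before the statistical tests are run. The sketch below illustrates that summarization step only; the record layout and function name are assumptions, not the study’s actual analysis pipeline.

```python
from collections import defaultdict

def condition_means(records):
    """Mean LGR per (language, topic, act) cell of the 2 x 2 x 4
    within-subject design.

    `records` is an iterable of dicts with keys 'language' ('L1'/'L2'),
    'topic' ('free'/'goal'), 'act' ('init'/'ack_init'/'cont'/'ack'),
    and 'lgr' (a float in [0, 1]).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in records:
        key = (r['language'], r['topic'], r['act'])
        sums[key] += r['lgr']
        counts[key] += 1
    # Average within each design cell.
    return {key: sums[key] / counts[key] for key in sums}
```

Feeding the resulting cell means (or the per-utterance records themselves) into a repeated-measures ANOVA routine would then reproduce the kind of analysis reported above.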

Fig. 3.
figure 3

The distribution of listeners’ gazing ratios (LGRs)

As shown in Fig. 3, listeners gazed at the speaker for longer in L2 conversations than in L1 conversations. This was also the case in goal-oriented conversations compared to free-flowing conversations. Moreover, listeners gazed at the speaker for longer during init, cont and ack init utterances.

4 Discussion

We compared the duration of the listener’s gaze in triadic conversations to examine the effects of linguistic proficiency, topic and grounding on gazing activities. The results of ANOVA revealed significant main effects of language difference, topic and grounding, supporting our hypotheses H1, H2, and H3: the duration of a listener’s gaze is longer in L2 conversations, in goal-oriented conversations, and in init, cont and ack init utterances. The grounding factor had the greatest effect, followed by that of language proficiency.

The multiple comparison analysis showed that the differences among the four major grounding acts were all significant, and cont showed the longest duration of the listener’s gaze among all the grounding act categories. With cont utterances, the speakers were adding new pieces of information to their own previous utterances, and in doing so, they were sometimes observed using a filled pause to hold the speech floor while organizing their ideas. Such characteristics of cont utterances might have drawn the listener’s attention to the speaker. In contrast, ack showed the shortest duration of the listener’s gaze, suggesting that utterances just acknowledging the previous utterance without adding new information did not draw the listener’s visual attention to the speaker.

We observed significant first-order interaction between language difference and grounding acts, and sub-effect tests showed significant simple main effects of language difference in all the major grounding act categories except ack. The results suggest that linguistic proficiency affected the listener’s gazing activities only for utterances presenting new pieces of information. Similarly, we observed significant first-order interaction between topic and grounding acts, and sub-effect tests showed significant or marginally significant simple main effects of topic in all the major grounding act categories except ack. The results suggest that the topic also affected the listener’s gazing activities but only in the case where utterances presented new pieces of information.

Another interesting finding is that there was no significant interaction between the factors of language difference and topic. This suggests that, in the current corpus settings, linguistic proficiency and topic independently affected the listener’s gazing activities.

These findings suggest that linguistic proficiency, conversation topic, and grounding all affect the listener’s gazing activities, and that these factors should be considered when attempting to design better HCI, HRI, and CSCW systems. It is also likely that the effects of these factors may not be just simple and independent but rather interlaced: our experimental results suggest that linguistic proficiency and grounding factors affect each other, and so do topic and grounding factors. Further detailed analyses are necessary to establish system design guidelines that reflect these factors.

5 Summary

We analyzed the effect of linguistic proficiency and conversation topic on the listener’s gaze in four major grounding acts. The results showed that the duration of a listener’s gaze is longer in second language (L2) conversations, in goal-oriented conversations, and during utterances presenting new information. The results also showed that both language proficiency and topic independently affect the duration of the listener’s gaze in utterances presenting new information. These results suggest that linguistic proficiency, conversation topic, and grounding factors all affect a listener’s gazing activities, supporting our hypotheses. The results are expected to contribute to HCI, HRI, and CSCW system design that reflects the interaction context and the linguistic proficiency of users.