
Enabling Social Robots to Perceive and Join Socially Interacting Groups Using F-formation: A Comprehensive Overview

Published: 23 October 2024

Abstract

Social robots in our daily surroundings, like personal guides, waiter robots, home helpers, assistive robots, and telepresence/teleoperation robots, are increasing day by day. Their usability and acceptability largely depend on their explicit and implicit interaction capability with fellow human beings. As a result, social behavior is one of the most sought-after qualities that a robot can possess. However, there is no specific aspect and/or feature that defines socially acceptable behavior, and it largely depends on the situation, application, and society. In this article, we investigate one such social behavior for collocated robots. Imagine a group of people interacting with each other, and we want to join the group. We as human beings do this in a socially acceptable manner, i.e., we position ourselves within the group in such a way that we can participate in the group activity without disturbing/obstructing anybody. To possess such a quality, a robot first needs to determine the formation of the group and then determine a position for itself, which we humans do implicitly. There are many theories that study group formations and proxemics; one such theory is f-formation, which can be utilized for this purpose. As the types of formations can be very diverse, detecting social groups is not a trivial task. In this article, we provide a comprehensive survey of the existing work on social interaction and group detection using f-formation for robotics and other applications. We also put forward a novel holistic survey framework combining some of the arguably more important concerns and modules relevant to this problem. We define taxonomies based on methods, camera views, datasets, detection capabilities and scale, evaluation approaches, and application areas. We discuss certain open challenges and limitations in the current literature along with possible future research directions based on this framework. In particular, we discuss the existing methods/techniques with their relative merits and demerits and their applications, and we provide a set of unsolved but relevant problems in this domain. The official website for this work is available at: https://github.com/HrishavBakulBarua/Social-Robots-F-formation

1 Introduction

Human group [146] and activity detection [2] has been a hot topic in computer/machine vision, Artificial Intelligence (AI), and robotics research. When humans interact with each other in a group (two or more people), they use implicit social intuition to position themselves with respect to each other, which facilitates easy interaction in the situation. The same applies when a new person wants to join a group. That person also assesses (implicitly) which place is best for joining so that the people in the group do not face any inconvenience. The person also considers her role in the group, according to the organization she is currently part of, in order to position herself [46]. Nowadays, robots are widely used in our daily surroundings [164] for many purposes. One such popular application is using a robot to attend meetings/conferences/discussions remotely as a telepresence medium [21, 30, 121, 139, 140, 142]. In such scenarios, the robot has to join a group of people. For robots to fluidly participate in groups, they need to know how the groups are formed, how they are shaped, and how they have evolved [84, 85, 146]. There can be many kinds of groups that differ in dimension, situation, organization, and so on, and their spatial arrangements are generally referred to as “f-formations.”
Facing formation (f-formation) is defined as the set of spatial patterns formed during social interactions of two or more people. A robot can join an existing group or approach a single person and form a new group [15, 16, 54, 63, 114, 116, 136, 138, 182]. Three social spaces are associated with an f-formation: O-space, P-space, and R-space. The O-space is the joint transaction space, i.e., the interaction space between participants. The P-space is the space in which the active participants stand. The R-space is the area that surrounds the participants and lies outside the interaction radius, as shown in Figure 1 (details in Section 2). In social science, Kendon (1990) proposed four standard formations: vis-a-vis/face-to-face, side-by-side, L-shaped, and circular. Apart from these, there are many other kinds of formations such as semi-circular, rectangular, triangular, v-shaped, and spooning [112, 161]. By categorizing a formation into one of these types, a robot can understand how people are standing in the discussion and accordingly decide a position for itself to join the group. While joining, the robot should ensure that the people already standing are neither disturbed nor obstructed. Several methods exist for detecting groups, such as determining the position and orientation of people, graph-cut methods, and Hough-voting systems. One major problem in f-formation detection is occlusion [47]. People in a group may stand in positions such that some of them occlude others, or the camera may be placed at an angle from which the complete group is not visible. In such cases, a robot cannot determine the formation because it may not be able to detect the bodies of some people, so we have to decide what kind of formation the robot should assume in order to continue its work. For the purposes of this article, we use group, interaction, and f-formation synonymously, as our primary focus is on detecting a group of interacting people who follow a particular formation, for various applications.
Fig. 1.
Fig. 1. The three social spaces corresponding to human proxemics.
Motivation and Research Objective. Social group and interaction detection is a non-trivial task in computer vision and is of importance to the social robotics research community. Many research groups across the globe have concentrated their studies in this area. The idea of social groups and interactions was first proposed in 1990 by Kendon [84]; however, the first mention of human proxemics dates back to 1963 [63]. These concepts of f-formation and human proxemic behavior form the foundation for studying and detecting physically situated interactions and groups of people, robots, or both. The research has gained pace since 2010 (see Table 3), delivering many rule-based and learning-based methods and techniques for detecting interaction and f-formation in social setups for various applications, mostly in robotics and vision. However, few surveys, reviews, or tutorials exist in this domain that provide a good overall impression of the research, the state of the art, and future opportunities. This survey article is a 360° view into the problem of social group and interaction detection using f-formation, covering almost all the concerns and aspects comprehensively. The aim is to discuss the domain in a comprehensive manner and to help scientists, researchers, and computer engineers get a fair idea of the area and conduct fruitful research in the future. We also list a few related surveys in Table 1 and compare them with our work based on the various concern areas of our survey framework (discussed in Section 4.1 and Figure 8).
Table 1.
Existing Surveys and Reviews (concern areas (1)–(11))
Adriana et al. [157]
Francesco et al. [146] ✓✓
Sai Krishna et al. [117]
This survey: ✓✓ in each of the eleven concern areas
Table 1. Comparison of Existing and Related Surveys/Reviews With Our Work
A blank entry signifies no treatment in the paper, ✓ signifies some mention exists, and ✓✓ means comprehensive treatment of the concern area. (1) Comprehensive f-formation list and tutorial on social spaces, (2) camera views and sensors, (3) datasets, (4) detection capability/scale, (5) evaluation methodology, (6) rule-based AI methods/techniques, (7) ML-based AI methods/techniques, (8) applications, (9) limitations, challenges and future directions, (10) generic survey framework for group and interaction detection, (11) discussion on ethical concerns and considerations.
Uniqueness of the Survey. To the best of our knowledge, this survey is the first of its kind in this subject area. Our survey presents the idea of social groups from the perspective of f-formation in comprehensive detail. We also discuss the natural joining position for a robot to enable human–robot interaction (HRI) after successful detection of the formation using computer vision techniques. Additionally, we propose a holistic framework that captures the various concern areas in the detection and prediction of social groups. Various taxonomies are discussed regarding the camera view of the environment for collecting scenes for detection, datasets for training machine/deep learning (ML/DL) models, detection capability and scale, and evaluation methods. We discuss and categorize all the detection methods, particularly rule-based and ML-based ones. Furthermore, we discuss the application areas of such detection and recognition, with a primary focus on robotics. We also include some discussion on ethical concerns and considerations [73] in data collection, distribution, and application. Finally, we detail the challenges, limitations, and future research directions in this area.
Organization of the Survey. This survey article is organized into the following sections. Section 2 gives a comprehensive perspective on the social spaces involved in group interaction. Questions such as the meaning of f-formation, the types of f-formation, and the evolution of an f-formation from one type to another when a new member joins a group are answered, along with pictorial depictions for the reader's understanding. Then, we present a year-wise compilation of research performed in this domain with analysis in Section 3. In Section 4.1, we propose a generic and holistic framework for group and interaction detection using formations, which also becomes the basis for categorizing the literature in the survey according to the various concern areas and modules. Section 4.2 shows the article discovery and selection methodology, with the inclusion/exclusion criteria in Section 4.3. Section 5 discusses the various input methods for detection, such as cameras and other sensors, and also focuses on the various camera views and positions. Section 6 summarizes the methods, techniques, and algorithms for detection, covering both rule-based static AI methods and learning-based (data-driven) methods. In Section 7, we briefly discuss the various datasets available for training and testing purposes. Then, we talk about detection capabilities and scale in Section 8. Section 9 presents the various evaluation strategies and methodologies from the perspective of algorithmic computational complexity and application areas like robotics and vision. The various application areas are stated in Section 10. Finally, we discuss the limitations and challenges in the existing state-of-the-art literature and methods, and propose some future research directions and prospects for each of the modules (of the survey framework) in Section 11. We conclude the survey in Section 12.

2 Social Spaces in Group Interaction

In this section, we describe the theory of f-formation, how it can be leveraged to study groups of interacting people, and how a robot can utilize it to perceive and display social awareness during interaction with a group of people. Apart from that, we also identify a set of formations that do not fall under the exact definition of f-formation but are still prevalent in social gatherings and environments [112].

2.1 What Is F-formation?

An f-formation, or any formation, arises when two or more people sustain a spatial and orientational relationship in which they have equal, direct, and exclusive access to the space between them [84, 85]. Technically, an f-formation is said to have formed when an O-space is created by the overlapping transaction segments of the interacting people [84, 85]. Figure 1 depicts such a social space where a group of people is interacting. An f-formation is the proper organization of three social spaces: O-space, P-space, and R-space [84, 85]. They are situated like three circles surrounding each other. The O-space, the innermost circle, is a convex empty space that is normally surrounded by the people in the group, and the participants generally look inward into it. The P-space, the second circle, is a narrow space where the active participants stand. The R-space, the outermost circle, is the space where a mostly inactive participant (listener) or an outsider who is not a part of the conversation stands. However, this is not entirely the case, as stated in [84]: Kendon explains that some listeners can be a part of an f-formation even though they do not interact directly but just listen to the other interacting people. In such cases, a listener can be in the P-space. The papers [38, 59, 108] study various types of participants and non-participants in an interaction. The participants are the ones either speaking or the ones being spoken to directly (active listeners). People who do not fall into these categories, but can still be considered part of the gathering, are described as listeners, over-hearers, or side participants in a group discussion or interaction. According to the authors, listeners can be either active (also participants) or passive (bystanders or eavesdroppers). A bystander is a listener who is present in the group with the awareness of the participants but not interacting with them, whereas an eavesdropper overhears the conversation without the awareness of the participants. So, active listeners are automatically part of the P-space, and in some cases passive listeners like bystanders can also occupy the P-space, but it is very unlikely that an eavesdropper will be seen in the P-space; they are generally found in the R-space. Another study, by Lu Zhang et al. [187], categorizes participants as in-group or out-group. The in-group participants are the ones actively participating in the conversation and are considered full members of the f-formation; they tend to be in the P-space. The out-group consists of people who are trying to join an f-formation but are not fully accepted yet; they tend to be in the R-space and wait for an opportunity to get a spot in the P-space. These members can leave the group without disturbing the f-formation and the conversation.
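The technical criterion above (overlapping transaction segments creating an O-space) can be made concrete with a small geometric sketch. The following Python snippet is a minimal illustration, not a method from the surveyed literature: it projects a point a fixed "stride" in front of each participant and declares an f-formation when those projected points cluster around a common centre. The stride and tolerance values are assumptions chosen for readability, not calibrated constants.

```python
import numpy as np

STRIDE = 0.75      # metres projected in front of each person (assumed)
TOLERANCE = 0.6    # max distance of a projected point from the centre (assumed)

def o_space_centre(positions, orientations):
    """positions: (N, 2) array in metres; orientations: (N,) array in radians."""
    positions = np.asarray(positions, dtype=float)
    headings = np.stack([np.cos(orientations), np.sin(orientations)], axis=1)
    projected = positions + STRIDE * headings          # points cast into the group
    centre = projected.mean(axis=0)                    # candidate O-space centre
    spread = np.linalg.norm(projected - centre, axis=1)
    is_f_formation = bool(np.all(spread < TOLERANCE))  # segments overlap -> O-space
    return centre, is_f_formation

# Two people facing each other ~1.5 m apart: their projected points coincide,
# so an O-space (and hence a vis-a-vis f-formation) is detected.
centre, ok = o_space_centre([[0.0, 0.0], [1.5, 0.0]], np.array([0.0, np.pi]))
print(centre, ok)   # -> roughly [0.75, 0.0] True
```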

2.2 Different Formations (Including F-formations) Possible during Interaction

Although both the theory of f-formation and appropriate methods to detect f-formations have been well analyzed in the literature, a comprehensive list of the possible formations (f-formations and others) and their major variations during different kinds of interactions is yet to be brought out. The most common ones are side-by-side, vis-a-vis, L-shaped, and triangular, defined for groups of two to three persons. Some others include circular, square, rectangular, and semi-circular, which are more flexible and can contain a varying number of persons. However, some arrangements have no common O-space and do not fit the typical technical definition of f-formation, such as reversed L-shaped, spooning, and z-shaped [112] (see Figure 4). The wide V-shaped formation may also be considered a new category, as it does not exactly satisfy Kendon's definition [84, 85]. Jeni Paay et al. [112] studied 1,592 scenes across a total of 61 videos, where a scene is defined as a movement to a new formation or camera position; 663 scenes have no clear formations to consider, while the remaining 929 have two or more people. Out of these, 581 scenes have spatial-orientational arrangements that fit the f-formations identified by Kendon [84, 85], while the remaining 348 scenes have spatial and orientational arrangements that do not directly follow the f-formation norms. Hence, the authors define special kinds of formations for these scenes.
Another important consideration is redundancy among some of the f-formations. For example, the square/rectangle, triangle, and circular formations may seem similar in form and vary only in the number of participating people. However, they are treated as separate formations in the literature because, for detection tasks in vision or robotics applications, the number of people involved matters in addition to the overall form of the formation. Also, if a robot needs to join a formation, its structure might change from one form to another depending on the current form and the number of participants. Consider a case: if a robot detects a triangle formation and joins it in an optimal way, the result might be a rectangle/square formation. So, in such real-life applications, these formations may be considered unique for better representation in detection algorithms for robotics. As for some of the other formations like line, column, diagonal (echelon), or geese, these are not used much but are possible formations in exercises, military operations (infantry and cavalry warfare), or similar disciplined drills. The use of robotics in defence or therapy is an active area of research, so these formations might be useful for such applications: detection of and joining such formations by robots is one possible application, and visual analysis of such exercises, military drills, and so on can be another.
We list down and categorize a comprehensive collection of the known formations and their major variations (including f-formations) below (see Figure 2 for a pictorial representation).
Fig. 2.
Fig. 2. Comprehensive list of f-formations which are/can be used for group and interaction detection tasks in vision and robotics.
(a)
Side-by-side: The side-by-side formation is formed when two people stand close to each other facing in the same direction; both face either right, left, or center. A minimum of two people is required for such a formation [74].
(b)
Vis-a-vis or face-to-face: This formation comes into existence when two people are facing each other. Only two people are required for such a formation [74].
(c)
L-shape: The L-shape is formed when two people face each other perpendicularly and are situated on the two ends of the letter “L”—one person facing the center and the other facing right or left [74].
(d)
Reversed L-shaped: This is formed when two people are in an L-shape arrangement but facing in different directions [112].
(e)
Wide V-shaped: Two people face in the same direction, as in side-by-side, but tilt their bodies slightly toward each other. A minimum of two people is required for this formation [112].
(f)
Spooning: This formation has two people, with one person facing forward and the other looking over from behind in the same direction [112].
(g)
Z-shaped: This is formed when two people are standing side-by-side but facing in opposite directions [112] (see the sketch after this list for a geometric illustration of the two-person formations).
(h)
Line formation: In this formation, all members stand side-by-side in a straight line; a minimum of two people are required [176].
(i)
Column formation: In this formation, all members stand one behind the other in a straight line; a minimum of two people are required [177].
(j)
Diagonal (Echelon): In this formation, people stand diagonally and face in the same direction. A minimum of two people are required [178].
(k)
Side-by-side with one headliner: In this formation, one person stands in the front and others stand side-by-side at the back. A minimum of three people are required, and they all face in the same direction [46].
(l)
Side-by-side with outsider: In this formation, one participant occupies an outer position of the side-by-side formation, in the R-space, and usually does not play an active role in the conversation. A minimum of three people are required [96].
(m)
V-shaped: In this formation, all people stand in a V-shaped fashion and face the same direction. A minimum of three people are required [112, 179].
(n)
Horseshoe: The group of people stands in the shape of “U,” and a minimum of five people are required [146].
(o)
Semi-circular: The semi-circular formation is where three or more people are focusing on the same task while interacting with each other [117].
(p)
Semi-circular with one leader in the middle: In this formation, people stand in a semi-circular shape with one person in the center facing the group in the semi-circle. A minimum of four people are required [96].
(q)
Square (Infantry square): Four people stand in a square-shaped fashion [180].
(r)
Triangle: As the name suggests, three people stand in a triangular shape in this formation [117].
(s)
Circle: As the name suggests, a group of people stands in a circular shape in this formation [115].
(t)
Circular arrangement with outsiders: In this formation, some people stand in a circular fashion and one/two additional people stand at the back of the circular formation [46].
(u)
Geese formation: In this formation, there are two or more people where one person is leading the path and the others are following that person but may or may not be looking in the same direction. A minimum of two people are required [48].
(v)
Lone wolves: This is not really a formation (yet). There is only one person ready to be joined by others before an interaction [48].
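To make the geometric distinctions among the two-person formations above ((a)–(g)) concrete, the following Python sketch classifies a dyad purely from positions and body orientations. It is an illustrative toy, not a method from the surveyed literature; the 30° tolerance and the way borderline categories are merged are assumptions.

```python
import numpy as np

def classify_dyad(p1, th1, p2, th2, tol=np.radians(30)):
    """Rough relative-orientation classifier for two standing people.

    p1, p2: 2D positions (metres); th1, th2: body orientations (radians).
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    # Angle between the two body orientations, folded into [0, pi].
    rel = np.abs(np.arctan2(np.sin(th1 - th2), np.cos(th1 - th2)))
    # Check whether each person is oriented toward the other.
    d = p2 - p1
    bearing_12 = np.arctan2(d[1], d[0])
    facing_each_other = (
        np.abs(np.arctan2(np.sin(th1 - bearing_12), np.cos(th1 - bearing_12))) < tol
        and np.abs(np.arctan2(np.sin(th2 - (bearing_12 + np.pi)),
                              np.cos(th2 - (bearing_12 + np.pi)))) < tol
    )
    if rel > np.pi - tol:                      # opposite orientations
        return "vis-a-vis" if facing_each_other else "z-shaped"
    if rel < tol:                              # same orientation
        return "side-by-side (or wide V / spooning)"
    if np.abs(rel - np.pi / 2) < tol:          # perpendicular orientations
        return "L-shaped (or reversed L-shaped)"
    return "unclassified"

print(classify_dyad([0.0, 0.0], 0.0, [1.5, 0.0], np.pi))        # vis-a-vis
print(classify_dyad([0.0, 0.0], np.pi / 2, [1.0, 0.0], np.pi))  # L-shaped (or reversed L-shaped)
```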
Fig. 3.
Fig. 3. Comprehensive list of f-formations and their corresponding changed formations after a person/robot joins it. Also, the relevant possible trajectories a robot/person should take in each of the case.
Fig. 4.
Fig. 4. Images from YouTube videos by Jeni Paay et al. [112] depicting some new formations.
Apart from these standing formations and f-formations, some authors also consider seated formations. McNeill et al. [98] considered seated group interactions, showing horseshoe and circular seated f-formations drawn from a US Air Force war-gaming session. They study the behavior of humans interacting with each other in a seated conversation using multi-modal cues such as hand-pointing gestures and gaze or eye contact. Figure 5 shows a horseshoe setup with five people interacting with each other, although the table has a capacity of eight people, which would make it a circular seated formation. The positions C, D, E, F, and G are occupied, and A, B, and H are vacant. Camera 1 is used to take the images for analysis, and Figure 6 gives the camera 1 perspective of the seated people. The research aims at understanding interpersonal interaction patterns under the umbrella of multi-modality. The five individuals from the US Air Force take part in military gaming exercises at the Air Force Institute of Technology, Wright-Patterson Air Force Base, Dayton, OH. The people in the setup come from various military specialities, with the commanding officer in position E. We can clearly identify the O-space, P-space, and R-space in these figures, so the detection of seated f-formations and its application in robotics and vision offers a new research direction from this perspective.
Fig. 5.
Fig. 5. The setup used in McNeill et al. [98] for studying attention and gaze patterns of people in a seated conversation.
Fig. 6.
Fig. 6. The setup used in McNeill et al. [98] for studying attention and gaze patterns of people in a seated conversation. Please visit the citation for more details.

2.3 Best Position for a Robot to Join in a Group

As social robotics is one of the most important application areas of group and interaction detection using computer vision, we put forward a list of positions where a robot can join a formation [169] (here, an f-formation) after successfully detecting it. Joining a group, however, requires a socially and culturally aware [33, 105, 163, 171] or human-aware [89, 132, 163] navigation protocol embedded into the robot. In other words, the robot should imitate natural human-like behavior while approaching a group (considering the correct direction and angle) for interaction and discussion, without causing any discomfort to the existing members of the group. This part of the story is beyond the scope of our survey, and we limit our work to the detection and prediction of groups and interactions. But, as it seems necessary to at least briefly mention this side of the coin, we put forward some of the possible joining locations and natural joining paths or approach directions/angles (robot trajectories) in this section (also discussed briefly in Sections 9 and 11). Researchers may consider presenting a systematic survey on the human-aware or socially aware navigation and group-joining aspects of a robot/autonomous agent after successful detection/prediction of the group interaction and f-formation; some such works can be found in [105, 132, 164, 171].
Table 2 summarizes a list of formations with the number of people and, correspondingly, the new formations after a person/robot joins them. A pictorial summary of the same is presented in Figure 3, which shows some of the possible joining positions in a group (mostly f-formations) along with the trajectories of a robot/human from a possible initial position to the final position. This list is not exhaustive but can be considered comprehensive in the sense that we have tried to include most of the formations (f-formations and others) that can be observed among humans in various social scenarios. The joining positions seen in the figures are considered optimal, as they are the only positions (clearly understandable from the images) where a new joiner may stand without obstructing the existing members or blocking the conversation between them. In some cases, like the line, column, or diagonal formations, the formation can be preserved by joining in such a way that the resulting formation is of the same type but with more members. Similarly, in other cases, the person/robot should join at a location that turns the resultant formation into one of the already existing ones. As for the joining path or trajectory, one should follow a path that does not disturb the existing members of the group and is at the same time the shortest path to the chosen joining location. Following this logic, we put forward the paths shown in the figures. One important caveat is that these paths do not account for any obstacles in the path or in the vicinity; in actual social scenarios, we expect many static and moving obstacles, which need to be addressed by good navigation policies and path planning (see [33] and [89]) and are outside the scope of this survey.
Table 2.
No. | Formation (Before) | No. of People | Formation (After joining)
a | Side-by-side | 2^a | Side-by-side, side-by-side with one outsider, Triangle
b | Vis-a-vis | 2 | Triangle
c | L-shaped | 2 | Triangle, reverse semi-circular
d | Reversed L-shaped | 2 | Semi-circular
e | Wide V shaped | 2^a | Triangle, semi-circular
f | Spooning | 2 | Side-by-side with headliner, side-by-side with outsider
g | Z-shaped | 2 | Triangle
h | Line | 3 | Line
i | Column | 3 | Column
j | Diagonal | 3 | Diagonal, V-shaped
k | S-by-s with one headliner | 3^a | S-by-s with one headliner
l | S-by-s with outsider | 3^a | Semi-circular
m | V-shaped | 7 | Triangle
n | Horseshoe | 5^a | Pentagon
o | Semi-circular | 4 | Circular
p | Semi-circular with one leader in the middle | 5 | Circle
q | Square | 4 | Circular, horseshoe
r | Triangle | 3 | Circle, semi-circular
s | Circle | 6 | Circle
t | Circle with outsider | 8 | Circle with outsider
u | Geese | 2^a | S-by-s with outsider
v | Lone wolf | 1 | Vis-a-vis, S-by-s, L-shaped
Table 2. Comprehensive List of Formations Before and After a Robot/Human Has Joined
^a signifies that the number is the minimum requirement for that particular formation.
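As a concrete (and deliberately simplified) illustration of the joining logic described above, the Python sketch below places a newcomer at the midpoint of the widest angular gap on the P-space ring of a roughly circular group and orients them toward the O-space; approaching along the outward radial direction then avoids crossing between existing members. This is an assumption-laden toy, not one of the surveyed approach planners, and it ignores obstacles and social-force costs entirely.

```python
import numpy as np

def joining_pose(member_positions):
    """Suggest where a newcomer should stand and which way to face on arrival.

    member_positions: (N, 2) array of the current members' positions (metres).
    Returns (target_position, unit_heading_toward_O_space).
    """
    pts = np.asarray(member_positions, dtype=float)
    centre = pts.mean(axis=0)                                  # crude O-space centre
    rel = pts - centre
    radius = np.mean(np.linalg.norm(rel, axis=1))              # approximate P-space radius
    angles = np.sort(np.arctan2(rel[:, 1], rel[:, 0]))
    gaps = np.diff(np.concatenate([angles, [angles[0] + 2 * np.pi]]))
    i = int(np.argmax(gaps))                                   # widest free arc between members
    slot = angles[i] + gaps[i] / 2.0                           # midpoint of that arc
    target = centre + radius * np.array([np.cos(slot), np.sin(slot)])
    heading = (centre - target) / np.linalg.norm(centre - target)
    return target, heading

# A triangle of three people on a ~1 m circle: the suggested slot turns it into
# a square/circle-like arrangement, consistent with row (r) of Table 2.
target, heading = joining_pose([[1.0, 0.0], [-0.5, 0.87], [-0.5, -0.87]])
print(target, heading)
```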

3 Research Chronology

In 1990, Kendon [84] proposed the f-formation theory for group interaction among participating people on the basis of proxemic behavior. Human proxemics is the study of how space is used between humans, or between humans and others, during interactions or while performing an activity; it was first studied by Hall almost six decades ago [63, 64]. This section surveys the literature collected on f-formation, covering both static rule-based and learning-based AI approaches. Figure 7 shows the distance ranges specified for different interaction types on the basis of the intimacy level between the participating people. The distance ranges in the green boxes are the ones relevant to group/interaction and f-formation detection, whereas the blue boxes signify distance ranges that are generally not seen in any f-formation. After the early studies by Hall, it was not until 2009 that computer vision-based approaches using static AI methods or ML/DL techniques emerged. From 2010, research in this field started gaining pace. From 2010 to 2013, rule-based fixed AI methods were prevalent. Learning-based or data-driven methods like DL and RL gained prominence around 2014 and have been gaining pace with each passing year. During the period from 2013 to 2017, research in this domain reached its peak, with multiple methods, algorithms, and techniques being proposed, and these methods attained wide popularity by 2019. The years from 2014 to 2018 saw roughly equal contributions from fixed rule-based methods and learning-based methods. So, as in almost every domain of AI, there has been a transition from traditional rule-based methods to ML and data-driven techniques. Table 3 can be referred to for the complete year-wise list of references; it also contains keywords (methods, focus areas, and technologies) for most of the references to give readers a better picture.
Fig. 7.
Fig. 7. Human proxemics using distance ranges as stated by Hall [63]. The green comment boxes signify distance ranges typically used for f-formation and group interaction.
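A minimal helper reflecting Hall's commonly cited proxemic zones is sketched below; the approximate thresholds of 0.45 m, 1.2 m, and 3.6 m are assumptions based on Hall's general framework, not values read off Figure 7. Distances in the personal and social bands are the ones typically relevant to f-formation and group-interaction detection.

```python
def hall_zone(distance_m: float) -> str:
    """Map an interpersonal distance (metres) to a Hall-style proxemic zone."""
    if distance_m < 0.45:
        return "intimate"
    if distance_m < 1.2:
        return "personal"
    if distance_m < 3.6:
        return "social"
    return "public"

print([hall_zone(d) for d in (0.3, 0.9, 2.0, 5.0)])
# -> ['intimate', 'personal', 'social', 'public']
```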
Table 3.
Year of Publication | Study, Methods and Techniques | Total
2004 | Geometric reasoning on sonar data [12]. | 1
2006 | Wizard-of-Oz (WoZ) study on spatial distances related to f-formations [74]. | 1
2009 | Clustering trajectories tracked by static laser range finders [80], trajectory classification by SVM [141]. | 2
2010 | Probabilistic generative model on IR tracking data [62], WoZ study of robot's body movement [90], SVM classification using kinematic features [185]. | 3
2011 | Analysis of different f-formations for information seeking [96], Hough-transform-based voting [43], graph clustering [72], a study on transitions between f-formations on interaction cues [100], a computational model of interaction space for virtual humans extending f-formation theory [107], a study of physical distancing from a robot [103], utilizing geometric properties of a simulated environment [104], a study to relate f-formations with conversation initiation [148], Gaussian clustering on camera-tracked trajectories [44], risk-based robot navigation [132]. | 10
2012 | Application of f-formations in collaborative cooking [111], Kinect-based tracking with rules [95], WoZ study on social interaction in nursing facilities [87], a study of robot gaze behaviors in group conversations [184], velocity models (while walking) [102], SVM with motion features [135], HMM [56]. | 7
2013 | Spatial geometric analysis on Kinect data [51, 53], analysis of f-formation in blended reality [47], a comparison of [43] and [72] [144], exemplar-based approach [93], multi-scale detection [145], Bag-of-Visual-Words-based classifier [162], Inter-Relation Pattern Matrix [27], HMM classifiers [99], O-space based path planning [60], Multi-hypothesis social grouping and tracking for mobile robots [94]. | 11
2014 | Hough Voting (HVFF), graph-cut f-formation (GCFF) [122], game theory based approach [166], correlation clustering algorithm [11], reasoning on proximity and visual orientation data [48], effects of cultural differences [78], HMM to classify accelerometer data [71], iterative augmentation algorithm [36], adaptive weights learning methods [126], estimating lower-body pose from head pose and facial orientation [183], search-based method [52], study on group-approaching behavior [82], spatial activity analysis in a multiplayer game [79]. | 12
2015 | Robust Tracking Algorithm using Tracking learning detection (TLD) [10], GCFF-based approach [146], Correlation Clustering algorithm [10], multimodal data fusion [9], spatial analysis in collaborative cooking [110], Group Interaction Zone (GIZ) detection method [35], study on influencing formations by a tour guide robot [81], joint inference of pose and f-formations [152], participation state model [149], SALSA dataset for evaluating social behavior [8], multi-level tracking based algorithm [170], Structural SVM using Dynamic Time Warping loss [150], Long-Short Term Memory (LSTM) network [3], influence of approach behavior on comfort [15], Sensor-based control task for joining [105]. | 15
2016 | F-formation applied to mobile collaborative activities [161], subjective annotations of f-formation [186], game-theoretic clustering [167], study of display angles in museum [75], mobile co-location analysis using f-formation [143], proxemics analysis algorithm [129], review of human group detection approaches [158], LSTM based detection in ego-view [4], Tracking people through data fusion for inferring social situation [164], Detecting group formations using iBeacon technology [83], [163]. | 11
2017 | Haar cascade face detector based algorithm [88, 115], weakly-supervised learning [165], temporal segmentation of social activities [41], omnidirectional mobility in f-formations [182], review of multimodal social scene analysis [7], 3D group motion prediction from video [77], survey on social navigation of robots [33], a study on robot's approaching behavior [16], heuristic calculation of robot's stopping distance [136], a study on human perception of robot's gaze [168], computational models of spatial orientation in virtual reality (VR) [118], Socially acceptable robot navigation over groups of people [171]. | 13
2018 | Optical-flow based algorithm in ego-view [131], meta-classifier learning using accelerometer data [57], human-friendly approach planner [138], discussion on improved teleoperation using f-formation [113], effect of spatial arrangement in conversation workload [97], study of f-formation dynamics in a vast area [46]. | 6
2019 | Study on teleoperators following f-formations [117], analysis on conversation floors prediction using f-formation [127], empirical comparison of data-driven approaches [67], LSTM networks applied on multimodal data [134], robot's optimal pose estimation in social groups [116], review of robot and human group interaction [181], Staged Social Behavior Learning (SSBL) [54], Euclidean distance based calculation after 2D pose estimation [114], Robot-Centric Group Estimation Model (RoboGEM) [159], DoT-GNN: Domain-Transferred Graph Neural Network for Group Re-identification [70], Audio-based framework for group activity sensing [156]. | 11
2020 | Difference in spatial group configurations between physically and virtually present agents [68], Conditional Random Field with SVM for jointly detecting group membership, f-formation and approach angle [18, 21], Group Re-Identification With Group Context Graph Neural Networks [188], Improving Social Awareness Through DANTE: Deep Affinity Network for Clustering Conversational Interactants [153]. | 4
2021 | Conversational Group Detection with Graph Neural Networks [160]. | 1
2022 | Graph-based group modelling for backchannel detection [147], Pose generation for social robots in conversational group formations [169], Conversation group detection with spatio-temporal context [154]. | 3
Table 3. Year-Wise Compilation of Methods and Techniques for Group/Interaction/F-formation Detection and Other Relevant Literatures
HMM, Hidden Markov Model; IR, infrared; WoZ, Wizard-of-Oz.

4 Review Methodology

This study systematically summarizes and analyzes state-of-the-art methods and challenges for group and interaction detection using f-formations. The methodology is based on a generic framework for analysis of the relevant literature and a synthesis of research findings in a systematic manner.

4.1 Generic Survey Framework with Possible Concern Areas

This survey aims to provide the concerned researchers with a comprehensive overview of the domain of group and interaction detection. The idea of group/interaction detection is not new and has been around for more than a decade. Researchers continue to design and develop new methods, techniques, algorithms, and architectures for various application areas ranging from computer vision and robotics to social environment analysis.
The problem of group and/or interaction detection is a non-trivial computer vision problem. Existing research approaches include classical AI algorithms like rule-based methods and geometric reasoning as well as neural network-based methods; learning paradigms such as supervised, semi-supervised, and unsupervised learning are also used. Proper categorization of these methods is necessary for future research directions. We have proposed a holistic framework that corresponds to the concern areas of f-formation research and can also be considered a generic architecture for a typical group/interaction detection task using f-formation. Figure 8 puts forward a possible framework with the different modules of such a detection task. The various concern areas of this domain can be characterized by the sensors used, the camera view/position for capturing the group interaction, the datasets used for training/testing in the case of learning-based approaches (indoor or outdoor), feature selection (in brief), detection capabilities (static/dynamic scenes) and scale (single or multi-group scenarios), evaluation methodology (efficiency/accuracy and/or simulation studies and human experience studies), and application areas. We also discuss the ethical concerns to be considered in data collection/sharing and in the application areas. These modules serve as the basis for categorizing the literature in our survey and are attended to one by one in the upcoming sections (as indicated in Figure 8). Finally, we conclude the survey by discussing the limitations, challenges, and future directions/prospects (Section 11) for each of the concerned modules.
Fig. 8.
Fig. 8. A possible separation of concerns into modules regarding social groups/interactions detection. The arrows correspond to the flow of events/data in a typical group/interaction detection (f-formation) framework.

4.2 Search Strategy

We conduct a thorough search using the most popular online libraries, including the Association for Computing Machinery's ACM Digital Library,1 the Institute of Electrical and Electronics Engineers' IEEE Xplore,2 Scopus,3 and ScienceDirect.4 ACM provides a wide array of transactions, journals, and conference proceedings in the areas of HRI, human–machine interaction, and computer vision. Similarly, IEEE provides a vast collection of interdisciplinary research transactions and tier-1 conference proceedings combining AI, robotics, robot vision, and social behavior. Scopus covers a variety of Springer journals dedicated to intelligent robotics and vision, while ScienceDirect hosts a gamut of Elsevier publications concentrating on autonomous systems and robots. We also use Google Scholar search as it provides the latest published or archived papers in the domain. The oldest paper reported is from 1963 and the newest one is from 2024; however, the major bulk of the works compared in this survey lies between 2010 and 2020 (see Section 3, Table 3).
ACM Digital Library. The Search String used: [[Title: f-formation] OR [Title: “social interaction”] OR [Title: “group interaction”] OR [Title: “group formation”] OR [Title: “conversational group”] OR [Title: “group dynamics”] OR [Title: “group behaviour”] OR [Title: “human proxemics”]] AND [[Title: “social robot”] OR [Title: “human-robot interaction”]] AND NOT [[Title: survey] OR [Title: review] OR [Title: tutorial]] AND [E-Publication Date: (01/01/2010 TO 12/31/2020)].
IEEE Xplore. The Search String used: (“Document Title”: f-formation OR “Document Title”: “social interaction” OR “Document Title”: “group interaction” OR “Document Title”: “group formation” OR “Document Title”: “conversational group” OR “Document Title”: “group dynamics” OR “Document Title”: “group behaviour” “Document Title”: “human proxemics”) AND (“Document Title”: “human-robot interaction” OR “Document Title”: “social robot”) NOT (“Document Title”: survey OR “Document Title”: review OR “Document Title”: tutorial) Limit: 2010–2020.
Scopus. The Search String used: (human-robot AND interaction) OR (social AND robot) AND (f-formation) OR (social AND interaction) OR (group AND interaction) OR (group AND formation) OR (conversational AND group) OR (group AND dynamics) OR (group AND behaviour) OR (human AND proxemics). Date range 2010–2020.
ScienceDirect. The Search String used: (“f-formation” OR “social interaction” OR “group interaction” OR “group formation” OR “conversational group” OR “group dynamics” OR “group behaviour” OR “human proxemics”) AND (“human-robot interaction” OR “social robot”) NOT (“survey” OR “review” OR “tutorial”) AND PUBYEAR \(\gt\) 2009 AND PUBYEAR \(\lt\) 2021.
Google Scholar. The Search String used: f-formation OR “social interaction” OR “group interaction” OR “group formation” OR “conversational group” OR “group dynamics” OR “group behaviour” OR “human proxemics” -survey -review -tutorial. Limit: 2010 onwards.

4.3 Inclusion and Exclusion Criteria

The above searches yield a total of 163 works (excluding surveys/tutorials/reviews) after screening for the appropriate and relevant literature for our survey. Out of these works, 54 were published in IEEE journals and conferences, 41 appeared in ACM journals and conferences, 21 were published by Springer, and 18 by Elsevier. The remaining 29 works were either published in other venues or retrieved from archival sites like ArXiv5 and ResearchGate.6 A total of 59 works are from journals and 97 are from conferences and workshops; the remaining seven are from other sites.
Out of these, a total of 92 works are selected for comparison and analysis. We select works focusing primarily on f-formation and its applications in robotics and related domains. We mostly list papers from 2010 to 2020, with certain exceptions as required by the survey and its treatment. Since IEEE and ACM are the major publishers in this area, almost 60% of the selected works come from these venues, while Springer and Elsevier account for 24%. The remaining 16%, approximately, come from lesser-known sources or archival repositories.
Exclusions:
Methods that do not sufficiently describe the input methods/sensors.
Studies that do not explicitly define the details and capabilities of the models/methods used.
Studies that do not explicitly define the feature extraction and selection methods used.
Studies that do not explicitly define the results and evaluation strategies thoroughly.
Works that limit their discussion on datasets used.
Documents such as technical reports, theses, and books.
Works that are not published in English language.

5 Cameras and Sensors for Scene Capture

This section summarizes the input methods in the group/interaction detection framework (Figure 8). The main input methods are cameras and sensors. There are different types of cameras used in the literature such as omnidirectional camera, helmet camera, robot camera, fisheye camera, and webcams. The camera sensor may be equipped with depth perception or provide only red green blue (RGB) images. The other main sensors found in the surveyed literature are audio sensors, blind sensors, radio-frequency identification (RFID) sensors, and so on. These are chosen based on the application areas and the working environment.

5.1 Camera Views

There are two different types of camera positioning used—ego-vision/ego-view (ego-centric) cameras for robotics, and exo-vision/exo-view (exo-centric) or global-view cameras (fixed on walls and ceilings) in indoor or outdoor environments (see Figure 9). Cameras are used for drone surveillance, robotic vision, and scene monitoring; in these cases, we work with the ego/exo views of the scene to detect group interactions.
Fig. 9.
Fig. 9. Different camera views of the same group/interaction in an indoor environment. On the left side, a robot has an ego-view camera, and on the right side, an exo-view or global-view camera is fixed on a wall. These images are produced in the Webots robotic simulator [45].
Ego-Centric View. Ego-centric refers to the first-person perspective, for example, images or videos captured by a wearable camera or a robot's camera. The captured data are focused on the part of the scene where the target objects are located. In [131], the robot's camera is used for capturing scenes, which is also referred to as a robot-centric view. In [52], the authors use first-person-view cameras for estimating the location and orientation of the people in a group. In [3], the authors use a low-temporal-resolution wearable camera for capturing group images.
Exo-Centric View. The exo-centric view is concerned with the third-person perspective or the top view, for example, images or videos captured by surveillance/monitoring cameras. One or many social interaction groups in a scene can be captured simultaneously from the top view. In [126], the authors use four cameras for detecting groups at a large scale; the method also detects changes in the target groups when they move closer to or farther from the cameras. In [72], experiments are done by capturing video with a camera positioned approximately 15 meters overhead. In [44], the images are captured using a fisheye camera mounted 7 meters above the floor.

5.2 Other Sensors

Sensors play a vital role in finding the relative distances of the people in a group, which helps in accurately predicting the type of f-formation. Researchers have used different types of sensors in the literature, such as depth sensors, laser sensors, audio or speech sensors, RFID, and Ultra-Wideband (UWB) sensors. In some cases, both cameras and other types of sensors are used simultaneously for detection. In [168], the authors use UWB localization beacons, a Kinect, and an audio sensor for detecting people and other entities, and RGB cameras for monitoring. The data for scenes are captured in the form of images and/or videos, depending on the method that uses the input for scene detection. Some instances of WiFi-based tracking [58] of humans are also visible in the literature. Other sensors [163] like iBeacon [83], 2-dimensional (2D) range sensors [94], and audio sensors [156] are also in use.
Figure 10 shows a taxonomy of cameras/vision sensors and other sensors used in the literature for scene capture. Table 4 categorizes the surveyed literature on the basis of camera views and sensors; it specifies, where available, the number of cameras used in each of the cited papers as well as the various cameras and sensors used in each paper.
Fig. 10.
Fig. 10. Taxonomy for cameras and sensors for scene capture. The leaf nodes give examples in each category.
Table 4.
Classification | Application Areas/Details | References
Ego-centric (first person view or robot view) [Section 5.1]
Application areas/details: Robotics and HRI; robot vision in telepresence; drone/robot surveillance
References:
[68], 1 [114], 1 [54], 1 [159], 1 [117], 2 Hamlet cameras and 1 robot camera [116], 1 [138], 1 [113], 1 [131], 1 [88], 1 [115], multi [77], 1 [4], 2 [75], 1 [158], 1 [161], 1 [3], 1 [149], 1 [81], 1 [10], multi [110], multi [52], multi [183], 1 [36], [82], multi [48], 1 [11], 1 [58], depth camera, RGB camera [56], an omni directional camera [184], multi [107, 111], 3 [90], 4 [62, 67], robot camera [74], 2 [12], [10], 2 [80], 1 [18, 21], 1 [94]
Exo-centric (global view) [Section 5.1]
Application areas/details: Social scene monitoring; Covid-19 social distancing monitoring; human interaction detection and analysis
References:
1 [117], multi [57], multi [113], 1 [16], [182], multi [7], multi [77], 1 [33], 4 [168], multi [165], 8 [136], 1 [75], multi [158], multi [167], 4 [129], 2 [186], multi [150], [170], multi [8], multi [152], [146], 1 [35], multi [146], multi [9], multi [110], [79], 1 [55], 4 [126], a single monocular camera [166], 3 overhead fish eye camera used for training classifier [71], multi [183], multi [52], 1 [36], 1 [27], multi [51], 1 [99], 4 [145], [93], 1 [162], multi [144], 7 [53], 4 [87], 1 [135], 2 [95], 1 [34], multi [111], 1 [44], 1 [100], multi [43], 1 [103], 1 [72], multi [185], 4+2 [62], 4 webcams [74], an omnidirectional camera [12], [122], [187]
Using other sensors [Section 5.2]
Application areas/details: Audio, sociometric badges, blind sensor, prime sensor, WiFi-based tracking, laser-based tracking, depth sensor, band radios, touch receptors, RFID sensors, smart phones, UWB beacon
References: [8], [48], Kinect depth sensor [51], [58], [56], [100], [161], speakers [127], wearable sensors [134], [33], [136], UWB localization beacons, Kinect [168], [80], [141], [143], RFID tag [75], [89], [51], [102], [107], Asus Xtion Pro sensor [114], ZED sensors [131], single worn accelerometer [57], Kinect sensor [138], Microsoft Speech SDK [118], speaker, Asus Xtion Pro live RGB-D sensor [16], Kinect [77], motion tracker [136], sociometric badges [7], RGB-D sensor [182], tablets [161], tablets [143], mobile sensors [158], microphone, IR beam and detector, bluetooth detector, accelerometer [8], touch sensor [107], range sensor [184], laser sensors [102], Wi-Fi based tracking, laser-based tracking [58], PrimeSensor, Microsoft Kinect, microphone [99], RFID sensors [47], blind sensor, location beacon [48], single worn accelerometer [71], [97], gaze animation controller [118], [148], grid world environment [104], ethnography method [96], ibeacon [83], 2D Range sensing [94], audio [156]
Other relevant literature: [46, 47, 60, 63, 78, 84, 112, 163]
Table 4. Classification Based on Camera View and Other Sensors for Group/Interaction and F-formation Detection

6 Categorization of Methods/Techniques

Many f-formation detection methods have been proposed in the literature. In this article, we broadly categorize these methods into two classes: (a) rule-based methods (fixed rules, assumptions, and geometric reasoning), like conventional image processing and vision techniques, and (b) learning-based (or data-driven) methods, which have come to prominence in the recent past. Multimedia and visual analytics [120] over big data remain lucrative tools for large-scale f-formation detection and group interaction analysis.
In group discussions, people stand in positions where the conversation can happen effectively. Kendon [84] proposed a formal structure of group proxemics among the interacting people in a formation (described in Section 2). The work in [122] provides a dataset of top-view images capturing such formations and subsequently proposes methods for analyzing the dataset and detecting interacting and conversational groups. In [145], the authors use the Hough-voting approach with a two-step algorithm—(1) fixed-cardinality group detection and (2) group merging—and use these two steps to detect the type of f-formation. In [53], the experiment uses a heat-map-based method for recognizing human activity together with a best-view camera selection method. In [122], graph-cut f-formation (GCFF) is used for detecting f-formations in static images by clustering graphs with graph-cut algorithms. Yasuharu Den [46] notes that formations also depend on the social organization and environment, and explains formations with outsiders, where people stand based on their position. In [117], three constraint-based formations are considered, namely triangle, rectangle, and semi-circular. The authors use a game-theoretic model over the position and orientation information of people to detect groups in the scene; for checking the formation, they use an algorithm proposed by Vascon et al. [166, 167] that generates the 2D frustum of the position and orientation of the people in the group. In [115], the authors use the Haar cascade face detector algorithm to detect the faces and eyes of people; based on the face and eye detections, the method determines how many frontal, right, and/or left faces are present and then decides the formation. In [88], the Haar cascade classifier is used with a quadrant methodology, which determines a person's facing direction by looking at where the eyes are located and in which quadrant. In [53], the authors use a new method to find the dominant sets (DS) and then compare it with modularity cut, but this method is applicable only when everyone is standing. Hedayati et al. [67] state that a robot needs a finer characterization of f-formations and introduce tightness and symmetry as two such characterizations. In [127], Raman et al. describe that another such finer characterization might be the occurrence of multiple conversations within an f-formation, making the argument for detecting “conversation floors.” Their method uses speaking turns to indicate the existence of distinct conversation floors and estimates the presence of voice, but it cannot detect silent (inactive) participants. In [134], proximity and acceleration data are used, and pairwise representations are fed to a long short-term memory (LSTM) network to identify the presence of an interaction and the roles of participants in it; however, using a fixed threshold for identifying speakers can cause mislabeling in some instances. In [10], a structural support vector machine (SVM) is used to learn how to treat the distance and pose information, and a correlation clustering algorithm is used to predict group compositions. Furthermore, a tracking-learning-detection (TLD) tracker is used for blur detection in ego-vision images, but such trackers cannot perform detection when the target moves out of the camera's field of view. In [131], the method uses ego-centric pedestrian detection: the pedestrian detector generates bounding boxes (BBs), optical flow is used to estimate motion between consecutive image frames, and groups are detected via joint pedestrian proximity and motion estimation. In [186], the method first detects a group with a group detector and then uses a trained classifier to differentiate the people involved in the group. Some researchers use pedestrian detection, vision-based algorithms, and pose-estimation algorithms to detect groups [165]. In [165], the authors use BP for handling f-formation detection and for jointly estimating the f-formation and the targets' head and body orientations; they also use multiple occlusion-adaptive classifiers. There are many more methods, each with its own strengths and weaknesses. Figure 11 presents a taxonomy of the different methods/techniques/approaches surveyed in this article for group/interaction/formation detection. Moreover, some works also attend to the navigation and joining aspects of an already detected group/interaction for applications in robotics and HRI, such as [105, 132, 164, 171].
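As a hedged illustration of the general idea behind the Haar-cascade-based approaches mentioned above (and explicitly not the implementation of [88] or [115]), the sketch below counts frontal and profile faces in a frame using the stock cascades that ship with the opencv-python distribution; the counts can then serve as a crude cue for the formation type. The example heuristic in the closing comment is an assumption for illustration only.

```python
import cv2

# Stock cascades bundled with opencv-python (available under cv2.data.haarcascades).
frontal = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
profile = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_profileface.xml")

def face_orientation_counts(bgr_frame):
    """Count frontal faces and profiles facing either way in a single BGR frame."""
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    n_frontal = len(frontal.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))
    # The stock profile cascade is one-sided, so a horizontally flipped frame
    # catches profiles turned the other way.
    n_side_a = len(profile.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))
    n_side_b = len(profile.detectMultiScale(cv2.flip(gray, 1), scaleFactor=1.1, minNeighbors=5))
    return {"frontal": n_frontal, "profile_a": n_side_a, "profile_b": n_side_b}

# Example heuristic (assumption): two frontal faces seen from outside a group often
# suggest a side-by-side arrangement, while one profile of each direction suggests a
# vis-a-vis pair. Real systems combine such counts with position and eye cues.
```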
Fig. 11.
Fig. 11. Taxonomy for methods and approaches used for group and interaction detection.

6.1 Rule-Based Method

We categorize as rule-based those methods that employ pre-defined rules, geometric assumptions, and reasoning. Rule-based methods are designed around well-known social behaviors and geometric properties and are often intuitive. In the absence of any learning paradigm, these algorithms rely purely on a static set of rules that are assumed to hold for a particular group situation (see Figure 12).
Fig. 12.
Fig. 12. Generic framework for a rule-based AI method of detecting group interactions and formations.
In the following, we list down the most popular rule-based methods that report a decent accuracy in detecting human groups.

6.1.1 Fixed Rules Based

Voting-Based Approach (2013). This approach detects and localizes groups by finding matches based on exemplars. The authors in [93] suggest that this method works on agents, so it is very flexible for different multi-agent scenarios. The results show that the method is effective for groups of up to four agents, and it is evaluated with people only, without robots. The computational complexity of this method is low; hence, it runs in real time and its accuracy is very good.
Head and Body Pose Estimation (HBPE) (2015) [152]. This method uses a joint learning framework for estimating head and body orientations, which in turn are used for estimating f-formations. It is evaluated with people in a scene without any robots. For evaluation, the authors use the mean angular error for HBPE and the F1-score for f-formation estimation. The method is compared with the Hough Voting for f-formation (HVFF) method; though the results are broadly similar, this method is slightly more accurate and has a higher F1-score.

6.1.2 Geometric Reasoning Based

GCFF (2015) [146]. The GCFF approach first finds the o-space and uses individual positions to identify the orientational formation. This method is tested on a synthetic scenario and compared with other methods such as the Inter-Relation Pattern Matrix (IRPM), DS, Interacting Group Discovery (IGD), Game Theory for Conversational Groups, and HVFF. It improves over these approaches not only in precision but also in recall. It also performs well in detecting people and their orientations without errors. The results are evaluated with people only.
GROUP (2015) [170]. The GROUP algorithm detects f-formations based on the lower-body orientation distributions of people in the scene and returns a set of free-standing conversational groups at each time step. First, it analyzes the maximum description length (MDL) parameter: the higher the MDL parameter, the larger the radius within which people are grouped together. This method can also detect non-interacting people as outliers. It is evaluated with people only, without any robots in the scene. Its computational complexity has been compared with state-of-the-art methods.
Approach Planner (2018) [138]. The Approach Planner enables a robot to navigate/plan based on the natural approaching behavior of humans toward a target person, and can replicate human tendencies when approaching. The evaluation is based on parameters derived from skeletal information.
Game-Theoretic Model (2019) [117, 166]. The approach constructs a 2D frustum for each virtual agent and robot from their positions and orientations, and then computes an affinity matrix. The method is evaluated both quantitatively and qualitatively. It is efficient for teleoperated robots that follow f-formations while joining groups automatically, and it also accounts for the fact that the formation changes when new people/robots join an existing group. The evaluation is carried out in a simulation environment.
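A minimal sketch of the frustum-affinity idea follows (our simplification, not the implementation of Vascon et al. [166, 167]); the frustum depth, aperture, and grid cell size are assumed values, and the overlap is estimated by sampling.

```python
import numpy as np

def frustum_samples(pos, theta, depth=1.5, aperture=np.deg2rad(80), n=400, seed=0):
    """Sample points inside a person's 2D attention frustum,
    modelled as a circular sector in front of the person."""
    rng = np.random.default_rng(seed)
    r = depth * np.sqrt(rng.random(n))                  # uniform over the sector area
    a = theta + aperture * (rng.random(n) - 0.5)
    return np.asarray(pos, float) + np.stack([r * np.cos(a), r * np.sin(a)], axis=1)

def affinity_matrix(positions, orientations, cell=0.25):
    """Affinity of two people = Jaccard overlap of the grid cells covered by
    their frustum samples (a crude estimate of frustum intersection)."""
    samples = [frustum_samples(p, t, seed=i)
               for i, (p, t) in enumerate(zip(positions, orientations))]
    cells = [set(map(tuple, np.floor(s / cell).astype(int))) for s in samples]
    n = len(positions)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            inter = len(cells[i] & cells[j])
            union = len(cells[i] | cells[j])
            A[i, j] = A[j, i] = inter / union if union else 0.0
    return A

# Two people facing each other plus a third person looking away
A = affinity_matrix([(0, 0), (1.2, 0), (3, 0)], [0.0, np.pi, 0.0])
print(np.round(A, 2))   # A[0, 1] is clearly positive; entries involving person 2 are ~0
```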
More methods are mentioned in the supplementary material.

6.2 ML-Based Method

ML-based methods are generally data-driven models for which researchers have explored different algorithms; DL methods are treated here as a special case of ML. The primary learning paradigms used are supervised, unsupervised, semi-supervised, and reinforcement learning. A generic system that uses an ML algorithm to detect f-formations is shown in Figure 13. In the following, we describe the various ML-based approaches.
Fig. 13.
Fig. 13. Generic framework for a learning-based AI method of detecting group interactions and formations.
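To make the generic pipeline of Figure 13 concrete, the sketch below is a toy end-to-end example (entirely illustrative; the features, synthetic training data, and thresholds are our assumptions): pairwise geometric features feed a "same group" classifier, and groups are recovered as connected components of the predicted positive pairs, which is the general shape shared by several of the supervised methods discussed next.

```python
import numpy as np
from itertools import combinations
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.linear_model import LogisticRegression

def pair_features(p_i, t_i, p_j, t_j):
    """Pairwise geometric features: distance and each person's body orientation
    relative to the line joining the pair (illustrative choices)."""
    p_i, p_j = np.asarray(p_i, float), np.asarray(p_j, float)
    d = np.linalg.norm(p_j - p_i)
    bearing = np.arctan2(p_j[1] - p_i[1], p_j[0] - p_i[0])
    wrap = lambda a: np.abs(np.arctan2(np.sin(a), np.cos(a)))
    return [d, wrap(t_i - bearing), wrap(t_j - bearing - np.pi)]

# Tiny synthetic training set: half the pairs are close and mutually facing ("same group")
rng = np.random.default_rng(0)
X, y = [], []
for k in range(600):
    p1, t1 = rng.uniform(-3, 3, 2), rng.uniform(-np.pi, np.pi)
    if k % 2 == 0:                                     # positive (same-group) pair
        direction = t1 + rng.normal(0, 0.3)
        p2 = p1 + rng.uniform(0.6, 1.6) * np.array([np.cos(direction), np.sin(direction)])
        t2 = t1 + np.pi + rng.normal(0, 0.3)           # roughly facing back
        label = 1
    else:                                              # negative pair: an unrelated person
        p2, t2 = rng.uniform(-3, 3, 2), rng.uniform(-np.pi, np.pi)
        label = 0
    X.append(pair_features(p1, t1, p2, t2))
    y.append(label)
clf = LogisticRegression(max_iter=1000).fit(X, y)

def detect_groups(positions, orientations):
    """Classify every pair, then take connected components of the positive pairs."""
    n = len(positions)
    rows, cols = [], []
    for i, j in combinations(range(n), 2):
        x = pair_features(positions[i], orientations[i], positions[j], orientations[j])
        if clf.predict([x])[0] == 1:
            rows += [i, j]
            cols += [j, i]
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    return connected_components(adj, directed=False)[1]   # one group label per person

print(detect_groups([(0, 0), (1.2, 0), (4, 4)], [0.0, np.pi, 0.0]))  # e.g. [0 0 1]
```

In real systems, the synthetic training loop is replaced by pairwise examples extracted from annotated scenes, and the classifier by whichever model a given paper proposes.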

6.2.1 Supervised Approaches

IR Tracking Method with SVM (2010) [62]. With the help of IR tracking, social interactions can be classified as either existing or non-existing using geometric social signals. The authors train and test several classifiers, such as SVM, Gaussian Mixture Models, and Naive Bayes. IR tracking with an SVM classifier is shown to achieve better accuracy than the other classifiers.
Novel Framework (2013) [27]. This approach uses the Subjective View Frustum, which encodes the visual field of a person in a 3D environment, as the main feature and the IRPM as an evaluation tool. For tracking, a Hybrid Joint-Separable filter is used; the tracker gives the position of the head and feet of each person. The evaluation is computational, comparing against counterparts in terms of accuracy/efficiency.
GIZ Detection (2015) [35]. This method detects groups based on proxemics. Group Interaction Energy feature, Attraction and Repulsion Features, Granger Causality Test, and Additional Features are proposed in this method. Tests are also conducted by combining these features. This method allows people to be connected loosely. The evaluation is done on the basis of computational accuracy and efficiency.
3D Skeleton Reconstruction Using Patch Trajectory (2017) [77]. This algorithm works in two stages. First, it takes images from different views as input and produces 3D body skeletal proposals for people using 2D pose detection. Second, it refines those proposals using a 3D patch trajectory stream and provides temporally stable 3D skeletons. The authors evaluate the method quantitatively and qualitatively, reporting an accuracy of 99%. Its limitations lie in the dependency on 2D pose detection and the computational time complexity.
Learning Methods for Head Pose Estimation (HPE) and Body Pose Estimation (BPE) (2017) [165]. This method uses a joint learning framework for estimating targets' head and body orientations and the f-formations of conversational groups. The evaluation metrics used are the HP error, BP error, and f-formation F1 score. The method is compared with IRPM, IGD, Hough-transform-based (HVFF), Graph Cut (GC), and Game-Theoretic methods, and the results do not differ much in percentage accuracy.
Method Using Group Based Meta-Classifier Learning Using Local Neighborhood Training (GAMUT) (2018) [57]. This method aims at estimating the f-formation membership of each pair of people in a group as a pairwise relationship in the scene. It works in two steps: preprocessing and GAMUT. In the preprocessing stage, raw tri-axial acceleration signals are converted to pairwise feature representations, which are used as samples in GAMUT. In the GAMUT stage, the same local neighborhood size is used per window size. The results are computationally evaluated in terms of detection accuracy and efficiency.
Bagged Tree (2019) [67]. The proposed algorithm works in three steps: dataset deconstruction, pairwise classification, and reconstruction. The authors evaluate the algorithm with three ML classification models (weighted KNN, bagged trees, and logistic regression), where the bagged-tree model achieves better pairwise accuracy, precision, recall, and F1-score. However, this method still needs to be trained on larger datasets and richer features. The evaluation includes a human experience study with robots. More methods are mentioned in the supplementary material.
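All three classifiers compared in [67] are available off the shelf; a hedged sketch of such a pairwise comparison could look as follows, where the feature matrix is a synthetic placeholder (the actual pairwise features of [67] are not reproduced here).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier          # bags decision trees by default
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Placeholder pairwise features/labels; in practice these come from annotated scenes
X_pairs, y_pairs = make_classification(n_samples=600, n_features=6, random_state=0)

models = {
    "weighted kNN": KNeighborsClassifier(weights="distance"),
    "bagged trees": BaggingClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
metrics = ["accuracy", "precision", "recall", "f1"]
for name, model in models.items():
    scores = cross_validate(model, X_pairs, y_pairs, cv=5, scoring=metrics)
    print(name, {m: round(float(scores[f"test_{m}"].mean()), 3) for m in metrics})
```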

6.2.2 Unsupervised Approaches

Graph-Based Clustering Method (2011) [72], (2013) [162]. In [72], the authors use the "socially motivated estimate of focus orientation" feature to estimate body orientation, which in turn is used to estimate f-formations. The method has been compared with a modularity cut method, and the evaluation is based on computational complexity. This approach struggles in scenarios where people move within the group and/or join or leave the group.
Robot-Centric Group Estimation Model (RoboGEM) (2019) [159]. RoboGEM is an unsupervised algorithm that detects groups from an ego-centric view. It works using three main modules: a pedestrian detection module "P," a pedestrian motion estimation module "V," and a group detection module "G." In the first module, an off-the-shelf pedestrian detector (YOLO) provides a BB for each person in the image. In the second module, V is estimated using optical flow. In the last module, human groups are detected using joint motion and proximity estimation. The authors compared this method with existing approaches using Intersection-over-Union, false positives per image, and depth threshold metrics. The evaluation includes a human experience study with robots.
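The three-module structure (detection P, motion estimation V, grouping G) can be sketched as follows; this is only an illustration, where the bounding boxes are assumed to come from any off-the-shelf detector and the pixel thresholds are assumed values, not those of the original paper.

```python
import cv2
import numpy as np

def egocentric_groups(prev_gray, gray, boxes, pos_thr=80, vel_thr=3.0):
    """Illustrative sketch of the V and G modules of a RoboGEM-like pipeline.
    prev_gray/gray: consecutive grayscale frames; boxes: (x, y, w, h) pedestrian
    detections from module P (any off-the-shelf detector)."""
    # Module V: one dense optical-flow field per frame pair, averaged inside each box
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    motions, centres = [], []
    for (x, y, w, h) in boxes:
        motions.append(flow[y:y + h, x:x + w].reshape(-1, 2).mean(axis=0))
        centres.append(np.array([x + w / 2.0, y + h / 2.0]))

    # Module G: link detections that are close in the image and move alike,
    # then merge linked labels (a naive connected-components pass)
    labels = list(range(len(boxes)))
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if (np.linalg.norm(centres[i] - centres[j]) < pos_thr and
                    np.linalg.norm(motions[i] - motions[j]) < vel_thr):
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels   # detections sharing a label form one hypothesised group
```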
In [162], the authors build a graph representation from the 3D trajectories of people and their head poses. Using a graph-clustering algorithm, they discover social interaction groups, and an SVM classifier is used for learning and classifying the group activities. The evaluation shows that it outperforms previous methods, and a human experience study is also performed in robotic scenarios. This approach not only recognizes or detects particular group activities but also predicts direct links between the people in each group.
Method Based on Pedestrian Motion Estimation (2018) [131]. This method works in three parts: ego-centric pedestrian detection, pedestrian motion estimation, and group detection using joint motion and proximity estimation. The pedestrian detector produces BBs with two features, the position of each pedestrian and the size of the BB. Optical flow is used for motion estimation. Then, joint pedestrian proximity and motion estimation are used to detect groups while considering the depth data. The evaluation is a real-life human experience study involving robots and humans. More methods are mentioned in the supplementary material.

6.2.3 Mixed Approaches

DANTE (2020) [153]. DANTE detects groups having conversations using a data-driven approach to identify spatially viable social interactions. A Deep Affinity Network is designed to predict the likelihood of two individuals being in the same group interaction in a given scene, taking the social context into account. The predicted affinities between entities (i.e., people) in the graph are then used to perform graph clustering to recognize groups of different sizes.
Based on Spatio-Temporal Context (2022) [154]. The method is based on a dynamic LSTM-based DL model. It also predicts affinity values between two people, indicating the likelihood that they are part of the same group. The affinity is predicted in a continuous manner, enabling the detection of dynamics in any group formation process. Finally, DS are extracted using graph clustering to identify the final conversational groups.
Geometry-Based and Data-Driven Methods (2022) [169]. The first method encodes geometry-based information about groups and interactions implicitly. The second one, which is data-driven in nature, uses Graph Neural Networks (GNNs) and adversarial learning to model the spatial arrangements and their properties explicitly.
Apart from the above-mentioned methods, various works also show the emergence of GNNs for group and interaction detection. The specialty of GNNs lies in their representation of data as graphs, with nodes and edges denoting entities and relationships, respectively. Some of the recent works utilizing GNN-based approaches can be found in [70, 147, 160, 188]. More methods are mentioned in the supplementary material.
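Since these are families of methods rather than a single algorithm, the following is only a schematic PyTorch sketch of the shared pattern: per-person features are refined by a simple message-passing layer over a proximity graph, and pairwise affinities are read off the refined embeddings. The layer sizes, input features, and proximity threshold are assumptions, and the network is shown untrained.

```python
import torch
import torch.nn as nn

class TinyGraphLayer(nn.Module):
    """One message-passing step: H' = ReLU((A_norm @ H) W)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, H, A):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.lin((A / deg) @ H))

class GroupAffinityNet(nn.Module):
    """Refine per-person features on a proximity graph, then score every pair."""
    def __init__(self, d_in=4, d_hid=32):
        super().__init__()
        self.gnn = TinyGraphLayer(d_in, d_hid)
        self.score = nn.Linear(2 * d_hid, 1)

    def forward(self, H, A):
        Z = self.gnn(H, A)                                   # (n, d_hid)
        n = Z.shape[0]
        zi = Z.unsqueeze(1).expand(n, n, -1)
        zj = Z.unsqueeze(0).expand(n, n, -1)
        return torch.sigmoid(self.score(torch.cat([zi, zj], dim=-1))).squeeze(-1)

# Node features: (x, y, cos(theta), sin(theta)); edges: people within 2 m of each other
pos = torch.tensor([[0.0, 0.0], [1.2, 0.0], [4.0, 4.0]])
theta = torch.tensor([0.0, torch.pi, 0.0])
H = torch.cat([pos, torch.cos(theta)[:, None], torch.sin(theta)[:, None]], dim=1)
A = (torch.cdist(pos, pos) < 2.0).float()
affinity = GroupAffinityNet()(H, A)   # (3, 3) pairwise affinities (untrained network)
```

In practice, such a network would be trained on annotated pairwise labels and the resulting affinity matrix clustered (e.g., with dominant sets or graph cuts) to produce the final groups.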
Table 5 lists the surveyed papers on the basis of rule-based and learning-based AI approaches. From the algorithmic trends, it is evident that learning-based approaches have become slightly more predominant in recent years, although both families of methods have been explored consistently over the years. Learning-based methods tend to be more accurate than their rule-based counterparts; examples of such methods include [72], [99], [77], [131], [67], and [159].
Table 5.
Classification | References
Classical rule-based AI methods [fixed-model-based learning and prediction based on certain geometric assumptions and/or reasoning (Section 6.1)] | Approach behavior [141], sociologically principled method [43], proposed model [148], The Compensation (or Equilibrium) Model, The Reciprocity Model, The Attraction-Mediation Model, The Attraction-Transformation Model [103], rapid ethnography method [96], digital ethnography [111], GroupTogether system [95], museum guide robot system [184], extended f-formation system [53], Multi-scale Hough voting approach [145], Hough for f-Formations, Dominant-sets for f-Formations [144], PolySocial Reality-F-formation [47], two-Gaussian mixture model, O-space model [60], Wi-fi based tracking, laser-based tracking, vision-based tracking [58], heat map based f-formation representation [51, 166], group tracking and behavior recognition [55], search-based method [52], estimating positions and orientations of lower bodies [183], Kendon's diagramming practice [110], GROUP [170], GCFF [146], [81], HBPE [152], Link Method, Interpersonal Synchrony Method [129], Frustum of attention modeling [167], f-formation as dominant set model [186], HRI motion planning system [182], footing behavior models (Spatial-Reorientation Model, Eye-Gaze Model) [118], matrix completion for head and body pose estimation (MC-HBPE) method [7], [136], Haar cascade face detector algorithm [88], Haar cascade face detector algorithm [115], [168], Approaching method [138], Measuring Workload Method [97, 116, 117], conversation floors estimation using f-formation [127], f-formation as dominant set model [187], geometry based [169]
ML-based AI methods [data-driven models for learning and prediction using supervised, semi-supervised, unsupervised, and reinforcement learning (ML/DL) or any such techniques (Section 6.2)] | [12], IR tracking techniques [62], SVM classifier [185], Grid World Scenario [104], graph-based clustering method [34, 44, 72], Hidden Markov Model (HMM) [56], proposed method with o-space and without o-space (SVM) [135], Region-based approach with level set method [87], IRPM [27], graph-based clustering algorithm [162], voting based approach [93], HMMs [99], SVM [36], Transfer Learning approaches [126], method with HMM [71], head pose estimation technique [11], [10], MC-HBPE [9], GIZ detection [35], [8], Supervised Correlation Clustering through Structured Learning [150], LSTM, Hough-Voting (HVFF) [3], LSTM [4], 3D skeleton reconstruction using patch trajectory [77], Human aware motion planner [33], [41], Learning Methods for HPE and BPE [165], Group bAsed Meta-classifier learning Using local neighborhood Training (GAMUT) [57], Group detection method [131], [67], LSTM network [134], Robot-Centric Group Estimation Model (RoboGEM) [159], Multi-Person 2D pose estimation [114], SSBL [54], multi-class SVM classifier [18], GNN based [70, 147, 160, 188], DANTE [153], Based on Spatio-Temporal Context [154], Geometry based and data-driven methods [169]
Other studies | WoZ [74], WoZ paradigm [15], WoZ [90]
Table 5. Classification Based on Approach/Method for Group and F-formation Detection
From the survey, it can be established that unsatisfactory accuracy is seen more often in rule-based approaches than in learning-based models. As expected, the main reason for low accuracy in general lies in the inability of the methods to accurately detect dynamic groups as well as multiple groups in the scene. One interesting observation is that accuracy is also largely impacted by the camera view: low accuracy is mostly seen with exo-vision input methods and datasets. Similarly, real-time performance is another issue for methods dealing with dynamic and multiple groups; here, there is no prominent impact of camera views, datasets, or method types. Readers may refer to the online Appendix, which contains a detailed comparison of methods and techniques under rule-based static AI approaches in Table 1 and ML-based approaches in Table 2.

7 Datasets

Table 6 comprehensively lists all the surveyed datasets in the literature. A total of 71 datasets are mentioned, of which only 15 are publicly available, 51 are private to the authors/researchers, and 5 are not known. Fifteen of the datasets have outdoor scenes (mostly captures from public areas), 42 have indoor scenes, and only 5 of them have both types of scenes. Thirty-six of the datasets contain multiple-group scenarios, 20 contain single-group scenes, and 15 are not known. Ego-vision scenes of groups and interactions are seen in 20 datasets, whereas 39 datasets have exo-vision or global-view images of groups, 1 dataset has both, and the camera view is not known for 11 datasets. Figure 14 gives a comprehensive idea of the taxonomy of datasets (training/testing) generally used in group/interaction and formation detection tasks.
Table 6.
Dataset | View (Ego/Exo) | Single/Multiple group(s) | Indoor/Outdoor [area] | Availability (Public/Private)
TUD Stadtmitte [13] | Ego | Multi-gp | outdoor [public] | private
HumanEva II [13] | Ego | Multi-gp | indoor | private
SALSA [174] | Exo | Multi-gp | indoor | public
BEHAVE database [109] | Exo | Multi-gp | outdoor [public] | public
TUD Multiview Pedestrians [13] | Exo | Multi-gp | outdoor [public] | private
CHILL [34] | Exo | Multi-gp | - | -
Benfold [28] | - | - | - | -
MetroStation [34] | Exo | Multi-gp | indoor [public] | private
TownCentre [29] | Exo | Multi-gp | outdoor [public] | private
Indoor [34] | Exo | Multi-gp | indoor | private
SI (Social Interactions) [101] | Exo | Multi-gp | outdoor [public] | public
Coffee-room scenario [27] | Exo | Multi-gp | indoor | private
CoffeeBreak [42] | Exo | Multi-gp | outdoor [private] | public
Collective Activity [14] | Ego | Multi-gp | outdoor/indoor | private
PETS 2007 (S07 dataset) [27] | Exo | Multi-gp | indoor [public] | private
Structured Group Dataset [172] | Exo | Multi-gp | indoor/outdoor [public] | public
EGO-GROUP [5] | Ego | Multi-gp | indoor/outdoor | public
EGO-HPE [6] | Ego | Multi-gp | indoor/outdoor | public
Mingling [71] | Exo | Multi-gp | indoor | private
MatchNMingle [32] | Exo | Multi-gp | indoor | public
CLEAR [151] | Exo | Single-gp | indoor | private
Greece [126] | Exo | Multi-gp | indoor | private
DPOSE [125] | Exo | Multi-gp | indoor | private
BIWI Walking Pedestrians [119] | Exo | Multi-gp | outdoor [public] | private
Crowds-By-Examples [92] | Exo | Multi-gp | outdoor [public] | private
Vittorio Emanuele II Gallery [17] | Exo | Multi-gp | indoor [public] | private
UoL-3D Social Interaction [40] | Ego | Single-gp | indoor | public
Cocktail Party [173] | Exo | Multi-gp | indoor | public
Social Interaction [39] | - | Single-gp | indoor | public
GDet [23] | Two monocular cameras, located on opposite angles of a room | - | indoor | public
IPD [76] | Exo | Multi-gp | outdoor | public
Classroom Interaction Database [93] | Exo | Multi-gp | indoor | private
Caltech Resident-Intruder Mouse [31] | - | - | - | -
UT-Interaction [137] | Exo | Multi-gp | outdoor | private
PosterData [72] | Exo | Multi-gp | outdoor | private
Friends Meet [24, 26] | Exo | Multi-gp | outdoor | public
Discovering Groups of People in Images (DGPI) [37] | Exo | Multi-gp | indoor | private
Prima head pose image [61] | Ego | Single-gp | indoor | private
NUS-HGA [35] | - | Single-gp | indoor | private
[62] | Exo | Single-gp | indoor | private
[185] | Exo | Single-gp | indoor | private
[72] | Exo | Multi-gp | indoor | private
[44] | Exo | Multi-gp | outdoor [public] | private
[104] | Exo | - | - | -
[56] | Ego | Single-gp | indoor | private
Dataset using Narrative camera [106] | Ego | Single-gp | indoor | private
[4] | Ego | Single-gp | indoor/outdoor [public] | private
[57] | Exo | Multi-gp | indoor | private
[131] | Ego | Multi-gp | outdoor | private
Laboratory-based dataset containing distance measures at three key distances, one laboratory-based dataset with distance measures from three predefined distances, dataset with distance measurements collected in a crowded open space [114] | - | - | - | private
RGB-D pedestrian dataset [159] | Ego | Multi-gp | outdoor [public] | private
[74] | - | - | indoor | private
[141] | - | - | indoor | private
[148] | - | - | indoor | private
[96] | - | - | indoor | private
[103] | Ego | Single-gp | indoor | private
Youtube videos [110, 111] | - | - | indoor | private
[95] | Exo | Single-gp | indoor | private
[53] | Exo | - | indoor | private
In shopping mall [58] | Ego | - | - | private
[51] | Exo | Single-gp | indoor | private
[25] | Ego | - | - | private
[15] | Ego | Single-gp | indoor | private
DGPI dataset [167] | - | - | - | -
[182] | Exo | Single-gp | indoor | private
[136] | Ego, Exo | Single-gp | indoor | private
[168] | Ego | Single-gp | indoor | private
[88] | Ego | Single-gp | indoor | private
[138] | Ego | Single-gp | indoor | private
[97] | Ego | - | - | private
Babble [65, 66] | Exo | Single-gp | indoor [public] | public
Table 6. Comprehensive List of Datasets (Training/Testing) and their types, used in Group/Interaction and Formation Detection Methods Surveyed in the Literature
Fig. 14.
Fig. 14. Taxonomy for datasets surveyed for groups/interactions and formation detection.
One of the most important considerations in handling vision-based datasets involving people and their personal information, such as interactions/conversations/discussions with other people in indoor/outdoor scenes, is the set of ethical issues/concerns and implications [73]. Here, it is evident that the privacy and protection of people's visual data are at stake. There are multiple facets to this problem. Data collection itself is a huge challenge: to create a good dataset for learning models to detect human groups and f-formations, we need visuals from diverse situations. So, as researchers, the very basic question we need to answer concerns the permission to collect, utilize, and share/distribute such data for research purposes.
Collecting such vision data of humans requires consent from the people involved in the scene. Some jurisdictions have legal procedures for this; however, even where no such legal requirement exists, it is good practice to obtain consent. Similar consent also needs to be obtained when distributing or sharing such data with other users and researchers for further exploration and learning-model design. Here, the consent of the people visible in the dataset, along with the consent of the original creator of the dataset, is required. This may be the reason behind the small number of publicly available datasets in this domain. Another facet is the annotation of these datasets. Generally, researchers incorporate weak annotations in such human-based vision datasets; no personal information is involved in such annotations, and they are anonymous in nature. This can be a good way to comply with ethical regulations.
Another facet is the in-the-wild versus in-the-lab setting for f-formation and human group/interaction detection. In lab settings, it can usually be assumed that the participating people have consented to the collection, usage, and sharing of the visual data, so ego-vision camera captures are also easy in such cases. In wild settings, however, data are often collected without the participating people knowing that they have been captured visually. Here, serious ethical concerns arise, which need to be validated and tracked; this also leads to fewer in-the-wild datasets with ego-centric cameras. With exo-centric cameras, privacy preservation is comparatively easy, whereas it is very hard to preserve privacy for ego vision in in-the-wild setups.
Next is the question of potential bias and discrimination in the dataset, i.e., the non-inclusion of data from underrepresented communities. A major ethical concern in today's world lies in the diversity and inclusion facet, considering factors like age, ethnicity, and gender. Most of the datasets are created from scenes in western countries; there should be a mix of data from diverse communities across the globe, including Asian and African countries. Another important consideration is the inclusion of people with both darker and lighter skin tones in the vision data, since such underrepresentation can cause the learned models to predict wrong classes. Age also needs to be considered in these datasets. Although human groups/interactions in social setups may not include age groups like infants and children, there is a strong possibility of mixtures of teenagers and adults. Finally, the equal inclusion of both male and female participants in such data is also a concern to ponder: the datasets available today have a much higher share of male subjects, and fairness in such cases is important.
Organizations exist to address such concerns in different geographies, such as the US or Europe. For instance, an Institutional Review Board or Research Ethics Board is a council (specifically in the US) that enforces ethics in research activities by reviewing the entire research pipeline. It approves or rejects any research and data collection involving humans that is not ethical, and it also monitors and reviews the progress of such research. Another example is the General Data Protection Regulation in the European Union (EU) jurisdiction. This is part of EU human rights law on data privacy and protection; according to it, it is unethical and illegal to transfer personal data out of the EU unless specifically permitted. The board monitors these concerns strictly, allowing individuals to have full control and rights over their personal data. This is another reason that vision data involving humans are rarely distributed or shared on public domains and platforms.

8 Categorization of Detection Capabilities and Scale

This section puts forward a categorization of the surveyed literature on the basis of group/interaction detection capabilities and scale for a method. After surveying the literature and the methods, it is evident that detection of groups in scenes is a non-trivial task and many factors are to be considered in the process as well. In real-life scenes, there can be both static groups of people interacting without much movement and there can also be groups with constant movement. There can be cases like group members leaving a group or new members joining a group. Group dynamics also is an important factor to be considered. People may sway or move their bodies occasionally too. Apart from these, methods also need to consider a single group or multiple existing groups in a scene. Outliers to one group can be a part of another group or can be noise at the global level. Figure 15 depicts a taxonomy of group detection in interaction scenarios in real-life cases.
Fig. 15.
Fig. 15. Taxonomy for detection capability and scale for groups/interactions and formations.

8.1 Detection Capability

Methods need to attend to both static and dynamic groups in interactions and formations. Here, we categorize this aspect.
Static Group Scene. In static group scenes, we consider f-formation detection in the absence of any temporal information affecting the group's categorization. We consider statically captured data, such as a single image capturing a group, and other sources of data that do not capture slight changes in head/body pose and orientation, e.g., sonar. It is generally easier to detect groups and classify f-formations in such scenarios. Static group scenes also imply that the people interacting in a group or formation do not change groups and that new people do not join a group while the interaction is in progress. In [36] and [114], a single image from a single egocentric camera is used for detection. Static groups are mostly found in indoor scenes like conferences, group discussions, coffee breaks, and meetings.
Dynamic Group Scene. Dynamic group detection essentially means two things: either the group detection happens on temporal data, or the group itself can change over time, which is the case in most real environments. In a dynamic scene, people tend to move in groups, also referred to as group dynamics. New people can join a group and/or existing people can leave it. Also, some people participating in an interaction may temporarily change their head/body pose and orientation a bit; this does not necessarily mean that the formation has changed. In such cases, it becomes very difficult for an algorithm to detect the group or formation in the interaction scenario. As a result, methods need to consider temporal information, utilizing a sequence of images (an image stream) taken over a particular period of time. Here, we also would like to mention a categorization of detection capability on the basis of datasets, scene dynamism (as discussed in Section 8.1), and number of groups (as discussed in Section 8.2). Normally, detection methods are tried and tested in indoor or lab settings, and the datasets used for training the learning models in data-driven methods are also mostly indoor ones. In such cases, the ecological validity of the methods needs special mention: evaluation on indoor scenes alone may not sufficiently establish the superiority of a method or its working accuracy. So, thorough in-the-wild versus indoor/lab interaction testing is needed [73]. Here, we can consider methods that are capable of detecting dynamic scene interactions with multiple groups to be good candidates for in-the-wild use. Moreover, the camera views (see Table 4), both ego and exo vision, also need to be considered. Methods that use outdoor datasets with multiple groups and either ego- or exo-vision cameras (for training models) may be categorized into the bucket of in-the-wild methods (see Table 6). Please see Tables 1 and 2 in the Online Appendix for all the methods and their features.
In [44], video data at 10 frames per second are used for detecting dynamic groups and interactions. In [34], surveillance videos are used for the experiments. Similarly, [145] utilizes a video feed from a cocktail party [173], sampled at one frame every 5 seconds, for its experiments. SALSA [174] has a temporal sampling rate of 1 frame every 3 seconds (for a 60-minute video), and MatchNMingle [32] has a resolution of 1 frame per second (for a 30-minute video). On the other hand, CoffeeBreak [42] has only 130 frames and Idiap Poster Data (IPD) [76] has a collection of 83 frames, which makes good temporal resolution difficult. The Babble dataset [65, 66] has 3,481 frames of conversational groups with a sampling rate of one frame per second. Figure 16 depicts a dynamic scene and group scenario. The EGO-GROUP dataset [5] has a video of an indoor laboratory setup consisting of 395 image frames; its specialty is that the people in the scene are not static in one position and change position/orientation and location over time. On the right-hand side of the figure, we show four instances of the image sequence where four different types of groups/interactions and formations are visible for the same four people in the scene. This type of dynamism should be handled efficiently by detection methods that consider the temporal aspects of the scenes. In scenes such as waiting rooms, stations, airports, restaurants, theatres, and lobbies, dynamic groups are mostly encountered. The main problem here is the scarcity of datasets with sufficient temporal resolution or sampling rate: creating good models that exploit temporal information requires a large amount of temporal data, which is missing in the currently available public datasets. Toward this, a study of the time resolutions required to capture various social human behaviors is needed to understand the landscape, which can further facilitate the creation of good temporal datasets and learning models. Such a study can be found in Section 3.3 of [128].
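One simple way to exploit such temporal data, sketched below with assumed window length and threshold, is to average per-frame pairwise affinities over a sliding window before extracting groups, so that a brief head or body sway does not flip the group assignment. The sketch assumes the same n tracked people appear in every frame of the window.

```python
import numpy as np
from collections import deque

class TemporalGroupSmoother:
    """Average per-frame pairwise affinity matrices over a sliding window and
    threshold the average, so momentary pose changes do not break a group.
    Window length and threshold are assumed values, not from any surveyed paper."""
    def __init__(self, window=15, threshold=0.5):
        self.buffer = deque(maxlen=window)
        self.threshold = threshold

    def update(self, affinity):
        """affinity: (n, n) matrix from any per-frame detector for the same n people."""
        self.buffer.append(np.asarray(affinity, dtype=float))
        mean_aff = np.mean(list(self.buffer), axis=0)
        links = mean_aff > self.threshold
        # Connected components over the smoothed links give stable group labels
        n = links.shape[0]
        labels = list(range(n))
        for i in range(n):
            for j in range(i + 1, n):
                if links[i, j]:
                    old, new = labels[j], labels[i]
                    labels = [new if lab == old else lab for lab in labels]
        return labels

# Example: person 2 briefly turns away for one frame, but the window keeps the group
smoother = TemporalGroupSmoother(window=5)
steady = np.array([[1.0, 0.9, 0.8], [0.9, 1.0, 0.7], [0.8, 0.7, 1.0]])
glitch = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]])
for frame_aff in [steady, steady, steady, glitch, steady]:
    labels = smoother.update(frame_aff)
print(labels)   # all three people still share one group label
```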
Table 7 summarizes the references into two detection capability types found in the literature.
Fig. 16.
Fig. 16. Dynamic behavior of groups/people in a video/image sequence from EGO-GROUP [5].
Table 7.
Classification | References
Static scene detection | [36, 51, 54, 57, 82, 113, 114, 122, 126, 129, 136, 144, 146, 166, 167, 186]
Dynamic scene detection | [3, 4, 7, 8, 9, 10, 11, 12, 15, 16, 18, 21, 27, 34, 35, 41, 43, 44, 46, 47, 52, 53, 55, 56, 57, 58, 62, 68, 71, 72, 74, 75, 77, 79, 80, 81, 87, 88, 90, 93, 95, 96, 97, 100, 102, 103, 110, 111, 113, 115, 116, 127, 129, 131, 134, 135, 138, 141, 143, 145, 146, 148, 149, 150, 152, 159, 161, 162, 165, 166, 168, 170, 181, 182, 183, 184, 185, 187]
Table 7. Classification Based on Group/Interaction and Formation Detection Capability

8.2 Detection Scale

Since different methods are needed depending on how many groups are present in a captured image or video, the detection scale plays an important role.
Single Group Detection. When a sensor/camera captures only one interacting group in the scene, the task is comparatively easy. An image stream can contain multiple groups as well, but not all methods are capable of detecting multiple groups simultaneously. In some cases, single group detection is sufficient, for example when a robot needs to detect a single group of interest in a scene or environment and join it for interaction/discussion. The datasets used for this kind of detection are mostly captured indoors (e.g., office and panoptic studio [39] or Babble [65, 66]) or outdoors (mostly private datasets). Other publicly available datasets, such as BEHAVE [109] and YouTube videos, can also be used for such purposes. In ego-view camera-based detection methods, single group detection is the primary focus.
Multiple Group Detection. When there is more than one interacting group or formation in a scene, detection methods need special attention. Sometimes there is only one interacting group in the scene along with some additional people who are not actively involved in the interaction; such cases can also be considered under the same umbrella and are quite challenging too. This kind of detection is useful for finding how many groups there are, or for finding a particular group in a diverse scene, in surveillance/monitoring applications. Several datasets comprise such scenarios: the coffee break dataset [42], EGO-GROUP [5], SALSA [174], cocktail party [173], GDet [23], Synthetic [50], IPD [76], and FriendsMeet2 (GM2) [24, 26]. Beyond these, some researchers have used their own (private) datasets. In [44] and [145], the authors experimented with datasets where there is more than one group in the scene (the party data). Similarly, in [34], surveillance videos in which there can be more than one group in the captured video are used as data. Table 8 classifies the literature on the basis of group/interaction detection scale. Multiple group detection normally arises in exo-view-based methods. Figure 17 depicts three scenarios from the well-known EGO-GROUP dataset [5]. Figure 17(a) shows a single triangular formation with one outlier in an indoor environment. Figure 17(b) depicts a vis-a-vis formation in an outdoor situation. Figure 17(c) shows two groups, one triangular and one L-shaped formation, in an indoor situation.
Fig. 17.
Fig. 17. Images from EGO-GROUP [5] dataset depicting indoor and outdoor scenes with single and multiple group interactions.
Table 8.
Classification | References
Single group detection | [3, 4, 12, 15, 16, 18, 21, 41, 46, 47, 48, 51, 52, 53, 54, 56, 67, 68, 74, 75, 77, 78, 79, 80, 81, 82, 88, 89, 90, 95, 96, 97, 100, 102, 103, 104, 107, 110, 111, 113, 115, 116, 118, 122, 135, 136, 138, 141, 143, 146, 148, 149, 161, 168, 170, 181, 184, 185]
Multi-group detection | [7, 8, 9, 10, 11, 27, 34, 35, 36, 43, 44, 55, 57, 62, 71, 72, 87, 93, 99, 113, 114, 117, 126, 127, 129, 131, 134, 144, 145, 146, 150, 152, 159, 162, 165, 166, 167, 182, 183, 186, 187]
Table 8. Classification Based on Group/Interaction and Formation Detection Scale

9 Categorization of Evaluation Methods

The most important part of the formation or interaction detection framework (Figure 8) is the evaluation methodology. The conventional criteria for comparing methods and techniques in such vision tasks are accuracy and efficiency. Accuracy describes how correctly a method detects/predicts or recognizes an f-formation, while the efficiency parameter relates to the real-time aspect of the method. Apart from these, the papers in the surveyed literature also describe simulation-based evaluation and human experience study-based evaluation (for robotic applications specifically). Figure 18 shows a simple taxonomy of evaluation methods for the various group/interaction and formation detection methods or algorithms.
Fig. 18.
Fig. 18. Taxonomy for evaluation methodology for group/interaction and f-formation detection.
Simulation-Based Evaluation. This type of evaluation is conducted using simulation tools. The simulators have different features and can simulate the real world in complex environments. A range of simulators are used in the surveyed literature, such as Gazebo [117], RoboDK [133], and Webots [45]. Nowadays, researchers are also using Virtual Reality or Augmented Reality technologies for evaluation purposes. The evaluations are performed mainly to assess the perception of a virtual robot or an autonomous agent. The first question to be answered is how well a simulated robot (in a simulated environment) can perceive a group of simulated people involved in an interaction. Second, after detection, does the simulated robot join the group naturally without discomforting the simulated people (see Section 2.3 and Figure 3)? Parameters like the robot's stopping distance, its orientation and pose based on the perceived group pose/orientation, and its angle of approach depending on the group's angle and position should be considered. Extensive discussion of these factors after group/interaction detection by a robot or autonomous agent is out of the scope of this survey.
Human Experience Study Based Evaluation. This type of evaluation tests the detection methods using ego-vision robots or in a real scenario with human participants as evaluators. A questionnaire is provided to the human participants to rate the quality of the method being used by the robot in real scenarios. Questions and parameters similar to those of simulation-based evaluation can be considered in this case as well, but with a real robot perceiving human groups (who are also the evaluators) and interactions. In real-life scenarios, the groups are not static and tend to move when a member joins or leaves the group. Accordingly, the robot or autonomous agent must detect the changes in group formation, orientation, and pose to re-adjust itself in a natural and more human-like manner without causing any discomfort to other humans.
Accuracy/Efficiency Evaluation without Using Robot or Simulators. This kind of evaluation is based on the accuracy or efficiency aspects of the methods but not tested in the real environment or by using robots. Here, the focus is mainly to evaluate the computational aspects of the methods/algorithms without evaluating the usability in real-life applications like robotics. However, applications like human behavior/interaction analysis, scene monitoring, and surveillance depend entirely on such evaluation. Table 9 classifies the surveyed papers on the basis of the evaluation strategy adopted. It also shows the descriptions/names of simulators in the simulation-based category.
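For the group-level accuracy numbers reported in this literature, a commonly used criterion is tolerant matching: a predicted group counts as correct if it overlaps a ground-truth group in at least a fraction T of the members (often T = 2/3). The sketch below is a simplified Python illustration of that idea, not the exact protocol of any particular paper.

```python
def group_f1(pred_groups, gt_groups, T=2 / 3):
    """Group-level precision/recall/F1 with a tolerant matching criterion
    (simplified sketch): a predicted group matches a ground-truth group when
    their overlap covers at least a fraction T of the members of each."""
    def matches(p, g):
        inter = len(set(p) & set(g))
        return inter >= T * len(p) and inter >= T * len(g)

    tp_pred = sum(any(matches(p, g) for g in gt_groups) for p in pred_groups)
    tp_gt = sum(any(matches(p, g) for p in pred_groups) for g in gt_groups)
    precision = tp_pred / len(pred_groups) if pred_groups else 0.0
    recall = tp_gt / len(gt_groups) if gt_groups else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Ground truth: {A,B,C} and {D,E}; the prediction misses C and hallucinates {F,G}
print(group_f1([["A", "B"], ["D", "E"], ["F", "G"]],
               [["A", "B", "C"], ["D", "E"]]))   # (0.67, 1.0, 0.8)
```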
Table 9.
Classification | References
Simulation-based evaluation (robotic simulators/virtual environment) | 2D grid environment simulated in Greenfoot [104], simulated the process of deformation of contours using P-spaces represented by Contours of the Level Set Method [87], [52], Robot Operating System implementation of PedSim [129], a simulated avatar embodied confederate [118], Gazebo [117], a simulator using Unity 3D game engine [54]
Human experience study-based evaluation (with real robots) | [12, 15, 18, 21, 33, 56, 58, 67, 74, 88, 90, 97, 99, 100, 103, 114, 115, 116, 131, 136, 138, 148, 159, 162, 168, 182, 184]
Accuracy/efficiency evaluation (without robot, only computation) | [4, 7, 9, 11, 27, 34, 35, 36, 41, 43, 44, 51, 52, 53, 55, 57, 62, 71, 72, 77, 87, 93, 95, 96, 110, 111, 126, 127, 129, 134, 135, 144, 145, 146, 152, 165, 166, 167, 170, 183, 185, 186, 187]
Table 9. Classification Based on Evaluation Methods and Strategy

10 Application Areas

Group or interaction detection has found wide application in many areas of computer vision, and with the emergence of robotics and AI, this domain has realized its true potential. In this article, we categorize the application landscape into two broad areas, robotic applications and other vision applications, which are further broken down into five groups as summarized in Table 10 (also see Figure 19). Robot vision refers to applications where the robot's camera provides an ego-centric view for finding groups only, with no intent to initiate interaction with a human. In HRI, f-formation detection is used to detect a group in order to participate autonomously in the interaction with fellow human beings. In telepresence, a remote person uses the robot to interact with a group of people; in such a scenario, the semi-autonomous robot can detect the group and join it while the remote human operator controls the robot to adjust its positioning.
Fig. 19.
Fig. 19. Taxonomy for application areas for group/interaction and f-formation detection.
Table 10.
Application Area | References
Drone/robotic vision | [3, 7, 9, 11, 18, 21, 27, 34, 35, 36, 43, 47, 53, 55, 68, 70, 72, 77, 90, 93, 95, 122, 126, 129, 143, 144, 145, 146, 150, 152, 158, 161, 165, 166, 167, 170, 186, 187]
HRI (i.e., assistive robots) | [12, 15, 16, 41, 54, 56, 57, 58, 60, 67, 70, 74, 75, 78, 79, 80, 81, 82, 89, 90, 97, 99, 100, 102, 103, 104, 107, 113, 114, 131, 136, 138, 141, 147, 148, 149, 159, 160, 168, 181, 182, 183, 184, 188]
Telepresence/teleoperation technologies | [70, 88, 113, 115, 117, 160, 188]
Indoor/outdoor scene monitoring and surveillance | [58, 68, 90, 100, 118, 120, 185]
Human behavior/activity and interaction analysis | [4, 8, 10, 44, 46, 48, 51, 52, 62, 70, 71, 74, 87, 96, 100, 110, 111, 127, 129, 134, 135, 147, 158, 160, 162, 165, 185, 188]
Covid-19 and social distancing | Scope of future research
Table 10. Classification Based on Targeted Application Areas
Scene monitoring is useful for analyzing indoor or outdoor scenes with people interacting and forming groups and f-formations for various activities. On the other hand, human behavior and interaction analysis refer to the behavior between humans and how they are interacting based on the situation. Furthermore, visual analytics in big data has empowered the domain beyond imagination. People are trying to use these technologies in various aspects of life. In the current scenario of the COVID-19 pandemic, we can utilize this technology in monitoring social distancing in human groups and interactions as well. As already mentioned, telepresence robotics can be utilized by doctors/nurses and other medical staff to attend to patients in remote locations without physically being present.
Some of the discussed application areas have serious ethical implications and potential negative societal impacts. HRI can be seen in many embodied robots and humanoids nowadays, and especially in cases like assistive and therapeutic robots there are many ethical implications [130]. First is the data privacy and protection of the person who interacts with such robots. The robots necessarily capture visual data of the humans with whom they interact; these data are easily available to the owner or user of such robots, and any misuse of such data raises ethical concerns. Second, a human's emotional and psychological attachment to such a robot is at stake when the robot's job is done and it is taken away (especially in the case of assistive and therapeutic robots). This may lead to serious harmful effects on the humans and leave them in much worse condition than before; such situations need proper, planned strategies to be handled without adverse effects. Next, the robots may be involved physically with humans, such as lifting patients, taking them to the bathroom, and helping them bathe. So, the robots must be designed such that the privacy of the humans is preserved, for example by deactivating video monitors during intimate procedures. Similar ethical concerns prevail in the case of HRI in telepresence and teleoperation robots [49]; here, transparency and accountability also come into play.
Another application area stated in our survey has direct ethical implications for human privacy. Nowadays, remote surveillance, monitoring, and tracking of humans and their behaviors and activities have gained popularity. Usually drones, UAVs, UGVs, and even fixed camera systems are used for such applications in many public places as well as private areas. This poses a serious threat to the privacy of human life and activities. Many countries have their own set of laws governing this arena. Some of the locations where a person expects privacy are hotel rooms, their own residence, public restrooms, changing rooms, and so on. Having surveillance or monitoring systems in such locations may violate the prescribed ethical law, and the concerned people may face legal consequences. So, it is recommended to review the set of laws of the country where such vision-based applications are deployed. Finally, most researchers release their methods and code on human group interaction or activity detection for research purposes only, so such limitations on the real-life usage of these methods need serious consideration and workarounds.

11 Limitations, Challenges, and Future Directions: A Discussion

The survey is organized around a generic framework of concern areas for group/interaction detection using the theory of f-formation (see Section 4.1 and Figure 8). It addresses the identified modules and concern areas, namely camera view and the availability of other sensor data, datasets, feature selection, methods/techniques, detection capabilities/scale, evaluation methodologies, and application areas.
The existing methods have an almost equal share of fixed rule-based (Figure 12) and learning-based (Figure 13) approaches (Table 5 and Tables 1 and 2 in the online Appendix). Researchers need to orient their research toward data-driven approaches using DL and reinforcement learning paradigms for handling complex situations. But this is not easy as long as we have a limited amount of data for training purposes. The current dataset landscape in this domain is not mature enough for several reasons, as already discussed in Section 8. So, the research community must first make an honest attempt to create and publish more datasets and to augment such datasets considering all the ethical and ecological issues, as also stated in the coming paragraphs on dataset research recommendations. Thereafter, meta-learning can also be explored on large-scale combined datasets, and complex detection scenarios can be addressed using big data and visual analytics [120]. Apart from that, representing data in the form of a graph can solve many performance issues in terms of accuracy and efficiency. Since GNNs have already achieved considerable success in this domain in recent times, with multiple papers produced by research groups across the globe, we can also try to extend these methods by combining them with other forms of DL. Researchers can try out GNN and CNN hybrid models like graph convolution networks, which are potential candidates for creating appropriate models. A combination of recurrent neural networks, convolutional neural networks, and/or graph recurrent networks can also be explored to identify more accurate and promising detection models. Furthermore, for the current datasets with small amounts of data, we require ML techniques that are less data-hungry and remain efficient even with small data. This leads us back to statistical models, which also work well with small amounts of data [1]. We can also explore few-shot/one-shot/low-shot learning and incremental learning paradigms in such cases. Transfer learning and domain adaptation can also be explored in future research for transferring static-image features to video key frames for dynamic group formation detection. Another novel direction is Diffusion models [69, 155], which have come to prominence in various computer vision applications. Diffusion models are a type of generative model that progressively converts data into Gaussian noise and then learns to reverse the noising steps to recover the data. Finally, we also suggest new research using Neural Radiance Field (NeRF)-based models. NeRFs are highly effective for object detection tasks in dynamic scenes [123] and for grouping semantically similar things in 2D/3D environments [86]; hence, they are potential candidates for solving human group dynamics and detection problems in extremely dynamic and crowded scenes.
Problems like dynamism in groups (people leaving/joining a group dynamically or changing position and orientation within the group) and occlusions of people pose serious challenges and limitations to the current state-of-the-art methods in terms of accuracy and efficiency. Researchers can think about devising rules based on reasoning and geometry to detect application-specific groups and interactions. A combination of rules and geometry-based reasoning together with data-driven models can also be explored to improve detection quality. Apart from detecting the group and formation alone, methods should be designed to detect the orientation and pose of the group itself (see [18, 21]). This can facilitate a good, natural, and human-like approach direction and angle for robots joining the group.
The major challenge with the datasets is their availability. Creating good-quality (large-scale) vision datasets (for training and testing) is a mammoth task in itself but has its own research/academic merit. Only about 20% of the surveyed datasets are publicly available (Table 6). Researchers can publish more of their privately created datasets as benchmarks for others to experiment with. However, the ethical implications [73] need to be considered: one simply cannot collect and publish vision datasets with human participants without proper permission from the relevant board (see the discussion of these concerns at the end of Section 7). Another limitation is the availability of public ego-vision datasets, with only about 30% of the total being ego-vision. Researchers can think of creating more first-person-view datasets by merging/fusing existing datasets in a meaningful manner. Here, researchers also need to consider the ecological validity [73] of interactions learned from such data, so data augmentation needs to be done in a more restricted manner. Preserving privacy in ego-vision datasets is also a big challenge, which is why we see fewer first-person-view datasets captured in the wild than in lab or indoor settings; this issue needs to be handled sensitively by researchers. As for the exo-vision datasets, accuracy is low, so researchers can look in this direction by creating more robust datasets for global-view scenes. Indoor datasets currently dominate, with merely 29% of the datasets being outdoor; researchers need to create more outdoor datasets for applications pertaining to surveillance and outdoor scene monitoring. Here again, ethical concerns arise from application areas such as surveillance and monitoring, and implementing such applications needs special permission from the concerned bodies of the respective country (see the discussion at the end of Section 10). Moreover, large new datasets with temporal information need to be created; a sufficient temporal resolution and sampling rate [128] are needed to train models for complex scenes and dynamic group detection. We can also use Generative Adversarial Networks to generate synthetic data from real data of human groups and interactions for training DL models. Finally, researchers have limited their methods and datasets to only a few major formations like face-to-face, triangular/circular, side-by-side, and L-shaped. No literature or state-of-the-art method deals with a comprehensive list of formations (as explained in Section 2.2); researchers should concentrate on devising methods and creating datasets to address these limitations.
Detection capabilities need attention with respect to dynamic scenes (Figure 16) as well as multiple groups (Figure 17). The literature covers most aspects of detection (Tables 7 and 8), but more research attention is required for cases of occlusion, background clutter, and adverse lighting conditions. Researchers can use reinforcement learning and DL models for these problems, and appropriate datasets need to be prepared at a larger scale. We can also try to improve the datasets using Low Dynamic Range to High Dynamic Range reconstruction methods [19, 20, 22, 91, 124, 175] in cases of adverse lighting, color, or illuminance conditions. Reconstructing images with overexposed/underexposed parts can lead to better detection algorithms and models.
Evaluation of the methods remains a challenge in the current literature (Table 9). Mostly, computational evaluation has been performed in terms of accuracy and efficiency. But in a problem like group/interaction detection, human experience studies and/or simulation-based studies are important to establish the effectiveness of a method in applications like robotics, telepresence, and social surveillance (see Section 9, Figure 18, and Table 10). Researchers need to orient their studies in this respect as well. Apart from that, most of the methods yield good accuracy, but achieving real-time performance while maintaining good accuracy remains a concern. Researchers can think of designing lightweight models for real-time detection of groups and interactions in dynamic scenes.
Feature selection/extraction plays a major role in any computer vision problem, and group detection is no exception. The existing literature lacks discussion about the use of proper features and the selection of proper approaches to extract useful and differentiating features. Apart from the visual features, researchers can also think about non-visual features such as audio or speech as future research trends. There is also a possibility of temporal feature selection and extraction for dynamic groups.
Applications of this domain can be widely seen in robotics, surveillance, human behavior analysis, and telepresence technologies (Table 10 and Figure 19). However, we can also think about using this technology in COVID-19-related applications such as monitoring of social distancing norms and others.
We have also discussed two types of camera views, ego and exo views (Figure 9 and Table 4). Ego vision is used predominantly for robotics-related applications. Methods using ego-view cameras as input are fewer than those using exo-view cameras, mainly because of the scarcity of public ego-view datasets for training the models. Researchers can also direct their work toward detection models built on a hybrid system of camera views and sensors; visual and other forms of input combined can be used for better detection and prediction. Here, we also need to think about the challenges of cross-modal synchronization [128]: accurate sampling of sensor inputs is required over fixed-size windows. As also discussed in Section 8.1, there is a scarcity of datasets with adequate temporal sampling and resolution, and researchers should be encouraged to address this gap in the future. Various combinations of camera views and positions can be experimented with for better scene capture and robust dataset creation for learning models. Figures 20 and 21 summarize the limitations, challenges, and future directions/opportunities for all the concern areas of this survey framework.
Fig. 20.
Fig. 20. Limitations and challenges in the various concern areas of our survey framework (see Section 4.1).
Fig. 21.
Fig. 21. Future research directions/opportunities in the various concern areas of our survey framework (see Section 4.1).

12 Conclusions

With the emergence of computer vision, robotics, and multimedia analytics, the world is changing for good with the progress of AI. Computational systems and autonomous agents are expected to show increasingly human-like behavior and capability. One of the most important problems in this domain remains group/interaction detection and prediction using f-formation. Although some research has been conducted in the last decade, much more progress is still envisioned. This survey aims at generalizing the problem of group/interaction detection via a framework, which also serves as the theme of the survey. The article presents a comprehensive view of all the concern areas of this framework, including definitions of various f-formations, input camera views and sensors, datasets, feature selection, algorithms, detection capability and scale, quality of detection, evaluation methodologies, and application areas. It also discusses the limitations, challenges, and future scope of research in this domain. Researchers can try to solve some of the unattended problems with the help of more recent and efficient approaches in DL, reinforcement learning, adversarial learning, and meta-learning paradigms. Another direction is the utilization of advanced models such as Diffusion and NeRF, which hold the potential to work very effectively in dynamic and crowded scenes. A combination of different neural networks can also be explored to create efficient models in application-specific cases.

Acknowledgments

We are thankful to the Editors and Associate Editors involved in the review process of our manuscript; their guidance across the revision iterations has significantly enhanced the quality of the work. We are also deeply grateful to the reviewers, whose constructive criticism has improved the technical quality and readability of this manuscript.

Supplemental Material

PDF File - F_Formation_THRI_Supple.pdf
The Appendix consists of two long tables provided as electronic supplementary material. Tables 1 and 2 summarize the surveyed methods and techniques for group, interaction, and F-formation detection on the basis of various parameters. Table 1 compares the static rule-based AI methods, while Table 2 covers the machine learning and deep learning based techniques. The tables use the following abbreviations in the column headers and rows: ML (Machine Learning), DL (Deep Learning), RL (Reinforcement Learning), Su (Supervised), UnS (Unsupervised), SS (Semi-supervised), S (Single group), M (Multiple groups), Exo (Exocentric vision or global view), and Ego (Egocentric or first-person view).

References

[1]
Amina Adadi. 2021. A survey on data-efficient algorithms in big data era. Journal of Big Data 8, 1 (2021), 1–54.
[2]
Jake K. Aggarwal and Michael S. Ryoo. 2011. Human activity analysis: A review. ACM Computing Surveys (CSUR) 43, 3 (2011), 1–43.
[3]
Maedeh Aghaei, Mariella Dimiccoli, and Petia Radeva. 2015. Towards social interaction detection in egocentric photo-streams. In 8th International Conference on Machine Vision (ICMV 2015), Vol. 9875. International Society for Optics and Photonics, 987514.
[4]
Maedeh Aghaei, Mariella Dimiccoli, and Petia Radeva. 2016. With whom do I interact? Detecting social interactions in egocentric photo-streams. In 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2959–2964.
[5]
AImageLab. 2021a. AImageLab datasets. Retrieved from http://imagelab.ing.unimore.it/files/EGO-GROUP.zip
[6]
AImageLab. 2021b. AImageLab datasets. Retrieved from http://imagelab.ing.unimore.it/files/EGO-HPE.zip
[7]
Xavier Alameda-Pineda, Elisa Ricci, and Nicu Sebe. 2017. Multimodal analysis of free-standing conversational groups. In Frontiers of Multimedia Research. ACM, 51–74.
[8]
Xavier Alameda-Pineda, Jacopo Staiano, Ramanathan Subramanian, Ligia Batrinca, Elisa Ricci, Bruno Lepri, Oswald Lanz, and Nicu Sebe. 2015a. Salsa: A novel dataset for multimodal group behavior analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 8 (2015), 1707–1720.
[9]
Xavier Alameda-Pineda, Yan Yan, Elisa Ricci, Oswald Lanz, and Nicu Sebe. 2015b. Analyzing free-standing conversational groups: A multimodal approach. In 23rd ACM International Conference on Multimedia, 5–14.
[10]
Stefano Alletto, Giuseppe Serra, Simone Calderara, and Rita Cucchiara. 2015. Understanding social relationships in egocentric vision. Pattern Recognition 48, 12 (Dec. 2015), 4082–4096. DOI:
[11]
Stefano Alletto, Giuseppe Serra, Simone Calderara, Francesco Solera, and Rita Cucchiara. 2014. From ego to nos-vision: Detecting social relationships in first-person views. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 580–585.
[12]
P. Althaus, H. Ishiguro, T. Kanda, T. Miyashita, and H. I. Christensen. 2004. Navigation for human-robot interaction tasks. In IEEE International Conference on Robotics and Automation, 2004. Proceedings (ICRA ’04), Vol. 2. 1894–1900. 1050–4729. DOI:
[13]
M. Andriluka, S. Roth, and B. Schiele. 2010. Monocular 3D pose estimation and tracking by detection. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 623–630.
[14]
University of Michigan (Ann Arbor). 2021. Collective Activity Dataset. Retrieved from http://www.eecs.umich.edu/vision/activity-dataset.html
[15]
Adrian Ball, David Rye, David Silvera-Tawil, and Mari Velonaki. 2015. Group vs. individual comfort when a robot approaches. In International Conference on Social Robotics. Springer, 41–50.
[16]
Adrian Keith Ball, David C. Rye, David Silvera-Tawil, and Mari Velonaki. 2017. How should a robot approach two people? Journal of Human-Robot Interaction 6, 3 (2017), 71–91.
[17]
Stefania Bandini, Andrea Gorrini, and Giuseppe Vizzari. 2014. Towards an integrated approach to crowd analysis and crowd synthesis: A case study and first results. Pattern Recognition Letters 44 (2014), 16–29.
[18]
Hrishav Barua, Pradip Pramanick, Chayan Sarkar, and Theint Mg. 2020. Let me join you! Real-time F-formation recognition by a socially aware robot. In 29th IEEE International Conference on Robot and Human Interactive Communication, RO-MAN.
[19]
Hrishav Bakul Barua, Ganesh Krishnasamy, KokSheik Wong, Abhinav Dhall, and Kalin Stefanov. 2024a. HistoHDR-Net: Histogram equalization for single LDR to HDR image translation. arXiv:2402.06692.
[20]
Hrishav Bakul Barua, Ganesh Krishnasamy, KokSheik Wong, Kalin Stefanov, and Abhinav Dhall. 2023. Arthdr-net: Perceptually realistic and accurate hdr content creation. In 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 806–812.
[21]
Hrishav Bakul Barua, Pradip Pramanick, and Chayan Sarkar. 2021. System and method for enabling robot to perceive and detect socially interacting groups. US Patent App. 17/138,224.
[22]
Hrishav Bakul Barua, Kalin Stefanov, KokSheik Wong, Abhinav Dhall, and Ganesh Krishnasamy. 2024b. GTA-HDR: A large-scale synthetic dataset for HDR image reconstruction. arXiv:2403.17837.
[26]
L. Bazzani, M. Cristani, and V. Murino. 2012. Decentralized particle filter for joint individual-group tracking. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 1886–1893.
[27]
Loris Bazzani, Marco Cristani, Diego Tosato, Michela Farenzena, Giulia Paggetti, Gloria Menegaz, and Vittorio Murino. 2013. Social interactions by visual focus of attention in a three-dimensional environment. Expert Systems 30, 2 (2013), 115–127.
[28]
Ben Benfold and Ian Reid. 2009. Guiding visual surveillance by tracking human attention. British Machine Vision Conference, BMVC 2009—Proceedings. DOI:
[29]
B. Benfold and I. Reid. 2011. Unsupervised learning of a scene-specific coarse gaze estimator. In 2011 International Conference on Computer Vision, 2344–2351.
[30]
Abhijan Bhattacharyya, Ashis Sau, Ruddra Dev Roychoudhury, Hrishav Bakul Barua, Chayan Sarkar, Sayan Paul, Brojeshwar Bhowmick, Arpan Pal, and Balamuralidhar Purushothaman. 2021. Edge centric communication protocol for remotely maneuvering a tele-presence robot in a geographically distributed environment. US Patent App. 16/988,453.
[31]
X. P. Burgos-Artizzu, P. Dollár, D. Lin, D. J. Anderson, and P. Perona. 2012. Social behavior recognition in continuous video. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 1322–1329.
[32]
Laura Cabrera-Quiros. 2021. The MatchNMingle dataset. Retrieved from http://matchmakers.ewi.tudelft.nl/matchnmingle/pmwiki/index.php?n=Main.TheDataset
[33]
Konstantinos Charalampous, Ioannis Kostavelis, and Antonios Gasteratos. 2017. Recent trends in social aware robot navigation: A survey. Robotics and Autonomous Systems 93 (2017), 85–104.
[34]
Cheng Chen and Jean-Marc Odobez. 2012. We are not contortionists: Coupled adaptive learning for head and body orientation estimation in surveillance video. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1544–1551.
[35]
Nam-Gyu Cho, Young-Ji Kim, Unsang Park, Jeong-Seon Park, and Seong-Whan Lee. 2015. Group activity recognition with group interaction zone based on relative distance between human objects. International Journal of Pattern Recognition and Artificial Intelligence 29, 05 (2015), 1555007.
[36]
Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, and Silvio Savarese. 2014a. Discovering groups of people in images. In European Conference on Computer Vision. Springer, 417–433.
[37]
Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, and Silvio Savarese. 2014b. Discovering groups of people in images. In ECCV.
[38]
Herbert H. Clark. 1996. Using Language. Cambridge University Press.
[39]
CMU. 2021. CMU Panoptic Dataset. Retrieved from https://domedb.perception.cs.cmu.edu
[40]
Claudio Coppola. 2021. UoL 3D Social Interaction Dataset. Retrieved from https://lcas.lincoln.ac.uk/wp/research/data-sets-software/uol-3d-social-interaction-dataset
[41]
Claudio Coppola, Serhan Cosar, Diego R. Faria, and Nicola Bellotto. 2017. Automatic detection of human interactions from rgb-d data for social activity classification. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 871–876.
[42]
Marco Cristani. 2021. CoffeeeBreak dataset. Retrieved from http://profs.sci.univr.it/cristanm/datasets.html
[43]
Marco Cristani, Loris Bazzani, Giulia Paggetti, Andrea Fossati, Diego Tosato, Alessio Del Bue, Gloria Menegaz, and Vittorio Murino. 2011a. Social interaction discovery by statistical analysis of F-formations. In BMVC.
[44]
Marco Cristani, Giulia Paggetti, Alessandro Vinciarelli, Loris Bazzani, Gloria Menegaz, and Vittorio Murino. 2011b. Towards computational proxemics: Inferring social relations from interpersonal distances. In 2011 IEEE 3rd International Conference on Privacy, Security, Risk and Trust and 2011 IEEE 3rd International Conference on Social Computing. IEEE, 290–297.
[45]
Cyberbotics. 2021. Webots-Open Source Robot Simulator. Retrieved from https://cyberbotics.com/
[46]
Yasuharu Den. 2018. F-formation and social context: How spatial orientation of participants’ bodies is organized in the vast field. In LREC 2018 Workshop: Language and Body in Real Life (LB-IRL2018) and Multimodal Corpora (MMC2018) Joint Workshop, 35–39.
[47]
Eyal Dim and Tsvi Kuflik. 2013. Social F-formation in blended reality. In International Conference on Intelligent User Interfaces, 25–28.
[48]
Eyal Dim and Tsvi Kuflik. 2014. Automatic detection of social behavior of museum visitor pairs. ACM Transactions on Interactive Intelligent Systems (TiiS) 4, 4 (2014), 1–30.
[49]
Reza Etemad-Sajadi, Antonin Soussan, and Théo Schöpfer. 2022. How ethical issues raised by human–robot interaction can impact the intention to use the robot? International Journal of Social Robotics (2022), 1–13.
[50]
Alessandro Farinelli. 2021. Retrieved from http://profs.sci.univr.it/*cristanm/datasets.html
[51]
Tian Gan. 2013. Social interaction detection using a multi-sensor approach. In 21st ACM International Conference on Multimedia, 1043–1046.
[52]
Tian Gan, Yongkang Wong, Bappaditya Mandal, Vijay Chandrasekhar, Liyuan Li, Joo-Hwee Lim, and Mohan S Kankanhalli. 2014. Recovering social interaction spatial structure from multiple first-person views. In 3rd International Workshop on Socially-Aware Multimedia, 7–12.
[53]
Tian Gan, Yongkang Wong, Daqing Zhang, and Mohan S Kankanhalli. 2013. Temporal encoded F-formation system for social interaction detection. In 21st ACM International Conference on Multimedia, 937–946.
[54]
Yuan Gao, Fangkai Yang, Martin Frisk, Daniel Hemandez, Christopher Peters, and Ginevra Castellano. 2019. Learning socially appropriate robot approaching behavior toward groups using deep reinforcement learning. In 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, 1–8.
[55]
Carolina Gárate, Sofia Zaidenberg, Julien Badie, and François Brémond. 2014. Group tracking and behavior recognition in long video surveillance sequences. In 2014 International Conference on Computer Vision Theory and Applications (VISAPP), Vol. 2. IEEE, 396–402.
[56]
Andre Gaschler, Sören Jentzsch, Manuel Giuliani, Kerstin Huth, Jan de Ruiter, and Alois Knoll. 2012. Social behavior recognition using body posture and head pose for human-robot interaction. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2128–2133.
[57]
Ekin Gedik and Hayley Hung. 2018. Detecting conversing groups using social dynamics from wearable acceleration: Group size awareness. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 4 (2018), 1–24.
[58]
Dylan F. Glas, Satoru Satake, Florent Ferreri, Takayuki Kanda, Norihiro Hagita, and Hiroshi Ishiguro. 2013. The network robot system: Enabling social human-robot interaction in public spaces. Journal of Human-Robot Interaction 1, 2 (2013), 5–32.
[59]
Erving Goffman and Joel Best. 2017. Interaction Ritual: Essays in Face-to-Face Behavior. Routledge.
[60]
Javier V Gómez, Nikolaos Mavridis, and Santiago Garrido. 2013. Social path planning: Generic human-robot interaction framework for robotic navigation tasks. In 2nd International Workshop on Cognitive Robotics Systems: Replicating Human Actions and Activities.
[61]
Nicolas Gourier and James Crowley. 2004. Estimating face orientation from robust detection of salient facial structures. FG Net Workshop on Visual Observation of Deictic Gestures.
[62]
Georg Groh, Alexander Lehmann, Jonas Reimers, Marc René Frieß, and Loren Schwarz. 2010. Detecting social situations from interaction geometry. In 2010 IEEE Second International Conference on Social Computing. IEEE, 1–8.
[63]
Edward T. Hall. 1963. A system for the notation of proxemic behavior. American Anthropologist 65, 5 (1963), 1003–1026.
[64]
Edward T. Hall. 1966. The Hidden Dimension. Anchor Books. Garden City, NY.
[65]
Hooman Hedayati. 2021. Retrieved from https://github.com/cu-ironlab/Babble
[66]
Hooman Hedayati, Annika Muehlbradt, Daniel J. Szafir, and Sean Andrist. 2020a. Reform: Recognizing f-formations for social robots. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 11181–11188.
[67]
H. Hedayati, D. Szafir, and S. Andrist. 2019. Recognizing F-Formations in the open world. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 558–559.
[68]
Hooman Hedayati, Daniel Szafir, and James Kennedy. 2020b. Comparing F-formations between humans and on-screen agents. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, 1–9.
[69]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
[70]
Ziling Huang, Zheng Wang, Wei Hu, Chia-Wen Lin, and Shin’ichi Satoh. 2019. DoT-GNN: Domain-transferred graph neural network for group re-identification. In 27th ACM International Conference on Multimedia, 1888–1896.
[71]
Hayley Hung, Gwenn Englebienne, and Laura Cabrera Quiros. 2014. Detecting conversing groups with a single worn accelerometer. In 16th International Conference on Multimodal Interaction, 84–91.
[72]
Hayley Hung and Ben Kröse. 2011. Detecting f-formations as dominant sets. In 13th International Conference on Multimodal Interfaces, 231–238.
[73]
Hayley Hung, Chirag Raman, Ekin Gedik, Stephanie Tan, and Jose Vargas Quiros. 2019. Multimodal data collection for social interaction analysis in-the-wild. In 27th ACM International Conference on Multimedia, 2714–2715.
[74]
Helge Hüttenrauch, Kerstin Eklundh, Anders Green, and Elin Anna Topp. 2006. Investigating spatial relationships in human-robot interaction. 5052–5059. DOI:
[75]
Junko Ichino, Kazuo Isoda, Tetsuya Ueda, and Reimi Satoh. 2016. Effects of the display angle on social behaviors of the people around the display: A field study at a museum. In 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, 26–37.
[76]
Idiap Research Institute. 2021. Idiap Poster Data. Retrieved from https://www.idiap.ch/en/dataset/idiap-poster-data/idiap-poster-data
[77]
Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Godisart, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. 2017. Panoptic studio: A massively multiview system for social interaction capture. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 1 (2017), 190–204.
[78]
Michiel P. Joosse, Ronald W. Poppe, Manja Lohse, and Vanessa Evers. 2014. Cultural differences in how an engagement-seeking robot should approach a group of people. In 5th ACM International Conference on Collaboration across Boundaries: Culture, Distance & Technology, 121–130.
[79]
Manuela Jungmann, Richard Cox, and Geraldine Fitzpatrick. 2014. Spatial play effects in a tangible game with an f-formation of multiple players. In 15th Australasian User Interface Conference-Volume 150, 57–66.
[80]
Takayuki Kanda, Dylan F Glas, Masahiro Shiomi, and Norihiro Hagita. 2009. Abstracting people's trajectories for social robots to proactively approach customers. IEEE Transactions on Robotics 25, 6 (2009), 1382–1396.
[81]
Daphne Karreman, Geke Ludden, Betsy van Dijk, and Vanessa Evers. 2015. How can a tour guide robot's orientation influence visitors’ orientation and formations? In 4th International Symposium on New Frontiers in Human-Robot Interaction.
[82]
Daphne Karreman, Lex Utama, Michiel Joosse, Manja Lohse, Betsy van Dijk, and Vanessa Evers. 2014. Robot etiquette: How to approach a pair of people? In 2014 ACM/IEEE International Conference on Human-Robot Interaction, 196–197.
[83]
Kleomenis Katevas, Hamed Haddadi, Laurissa Tokarchuk, and Richard G Clegg. 2016. Detecting group formations using iBeacon technology. In 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, 742–752.
[84]
Adam Kendon. 1990. Conducting interaction: Patterns of behavior in focused encounters. Vol. 7. CUP Archive.
[85]
Adam Kendon. 2010. Spacing and orientation in co-present interaction. In Development of Multimodal Interfaces: Active Listening and Synchrony. Springer, 1–15.
[86]
Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Goldberg, Matthew Tancik, and Angjoo Kanazawa. 2024. GARField: Group anything with radiance fields. arXiv:2401.09419.
[87]
Yuki Kizumi, Koh Kakusho, Takeshi Okadome, Takuya Funatomi, and Masaaki Iiyama. 2012. Detection of social interaction from observation of daily living environments. In The 1st International Conference on Future Generation Communication Technologies. IEEE, 162–167.
[88]
Sai Krishna, Andrey Kiselev, and Amy Loutfi. 2017. Towards a method to detect F-formations in real-time to enable social robots to join groups. In HRI 2017.
[89]
Thibault Kruse, Amit Kumar Pandey, Rachid Alami, and Alexandra Kirsch. 2013. Human-aware robot navigation: A survey. Robotics and Autonomous Systems 61, 12 (2013), 1726–1743.
[90]
Hideaki Kuzuoka, Yuya Suzuki, Jun Yamashita, and Keiichi Yamazaki. 2010. Reconfiguring spatial formation arrangement by robot body orientation. In 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 285–292.
[91]
Min Jung Lee, Chi-hyoung Rhee, and Chang Ha Lee. 2022. HSVNet: Reconstructing HDR image from a single exposure LDR image with CNN. Applied Sciences 12, 5 (2022), 2370.
[92]
Alon Lerner, Yiorgos Chrysanthou, and Dani Lischinski. 2007. Crowds by example. Computer Graphics Forum 26 (2007), 655–664.
[93]
Ruonan Li, Parker Porfilio, and Todd Zickler. 2013. Finding group interactions in social clutter. In IEEE Conference on Computer Vision and Pattern Recognition, 2722–2729.
[94]
Matthias Luber and Kai Oliver Arras. 2013. Multi-hypothesis social grouping and tracking for mobile robots. In Robotics: Science and Systems.
[95]
Nicolai Marquardt, Ken Hinckley, and Saul Greenberg. 2012. Cross-device interaction via micro-mobility and f-formations. In 25th Annual ACM Symposium on User Interface Software and Technology, 13–22.
[96]
Paul Marshall, Yvonne Rogers, and Nadia Pantidi. 2011. Using F-formations to analyse spatial patterns of interaction in physical environments. ACM, 445–454.
[97]
Takahiro Matsumoto, Mitsuhiro Goto, Ryo Ishii, Tomoki Watanabe, Tomohiro Yamada, and Michita Imai. 2018. Where should robots talk? Spatial arrangement study from a participant workload perspective. In 2018 ACM/IEEE International Conference on Human-Robot Interaction, 270–278.
[98]
David McNeill, Susan Duncan, Amy Franklin, James Goss, Irene Kimbara, Fey Parrill, Haleema Welji, Lei Chen, Mary Harper, Francis Quek, Travis Rose, and Ronald Tuttle. 2009. Mind merging. Expressing Oneself/Expressing One's Self: Communication, Language, Cognition, and Identity: Essays in Honor of Robert Krauss, 143–164.
[99]
Ross Mead, Amin Atrash, and Maja J. Matarić. 2013. Automated proxemic feature extraction and behavior recognition: Applications in human-robot interaction. International Journal of Social Robotics 5, 3 (2013), 367–378.
[100]
Ross Mead and Maja J. Matarić. 2011. An experimental design for studying proxemic behavior in human-robot interaction. Technical Report. Citeseer.
[101]
MMLAB. 2021. Multimedia Signal Processing and Understanding Lab. Retrieved from http://mmlab.science.unitn.it/USID/
[102]
Luis Yoichi Morales Saiki, Satoru Satake, Rajibul Huq, Dylan Glas, Takayuki Kanda, and Norihiro Hagita. 2012. How do people walk side-by-side? Using a computational model of human behavior for a social robot. In 7th annual ACM/IEEE International Conference on Human-Robot Interaction, 301–308.
[103]
Jonathan Mumm and Bilge Mutlu. 2011. Human-robot proxemics: Physical and psychological distancing in human-robot interaction. In 6th International Conference on Human-Robot Interaction, 331–338.
[104]
Kavin Preethi Narasimhan. 2011. Towards modelling spatial cognition for intelligent agents. In 29th Annual European Conference on Cognitive Ergonomics, 253–254.
[105]
Vishnu K. Narayanan, Anne Spalanzani, François Pasteau, and Marie Babel. 2015. On equitably approaching and joining a group of interacting humans. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4071–4077.
[106]
Narrative. 2021. Narrative Clip 2. Retrieved from http://getnarrative.com/
[107]
Nhung Nguyen and Ipke Wachsmuth. 2011. From body space to interaction space: Modeling spatial cooperation for virtual humans. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 3. International Foundation for Autonomous Agents and Multiagent Systems, 1047–1054.
[108]
Catharine Oertel, Kenneth A Funes Mora, Joakim Gustafson, and Jean-Marc Odobez. 2015. Deciphering the silent participant: On the use of audio-visual cues for the classification of listener categories in group discussions. In 2015 ACM on International Conference on Multimodal Interaction, 107–114.
[109]
Robert Fisher School of Informatics (Univ. of Edinburgh). 2021. BEHAVE: Computer-assisted prescreening of video streams for unusual activities. Retrieved from http://homepages.inf.ed.ac.uk/rbf/BEHAVE/
[110]
Jeni Paay, Jesper Kjeldskov, and Mikael B. Skov. 2015. Connecting in the kitchen: An empirical study of physical interactions while cooking together at home. In 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, 276–287.
[111]
Jeni Paay, Jesper Kjeldskov, Mikael B. Skov, and Kenton O’Hara. 2012. Cooking together: A digital ethnography. In CHI’12 Extended Abstracts on Human Factors in Computing Systems. 1883–1888.
[112]
Jeni Paay, Jesper Kjeldskov, Mikael B. Skov, and Kenton O’hara. 2013. F-formations in cooking together: A digital ethnography using youtube. In IFIP Conference on Human-Computer Interaction. Springer, 37–54.
[113]
Sai Krishna Pathi. 2018. Join the group formations using social cues in social robots. In 17th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 1766–1767.
[114]
Sai Krishna Pathi, Andrey Kiselev, Annica Kristoffersson, Dirk Repsilber, and Amy Loutfi. 2019. A novel method for estimating distances from a robot to humans using egocentric RGB camera. Sensors 19, 14 (2019), 3142.
[115]
Sai Krishna Pathi, Andrey Kiselev, and Amy Loutfi. 2017. Estimating F-formations for mobile robotic telepresence. In 2017 ACM/IEEE International Conference on Human-Robot Interaction (HRI ’17).
[116]
S. K. Pathi, A. Kristofferson, A. Kiselev, and A. Loutfi. 2019. Estimating optimal placement for a robot in social group interaction. In 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). 1–8. DOI:
[117]
Sai Krishna Pathi, Annica Kristoffersson, Andrey Kiselev, and Amy Loutfi. 2019. F-formations for social interaction in simulation using virtual agents and mobile robotic telepresence systems. Multimodal Technologies and Interaction 3, 4 (2019), 69.
[118]
Tomislav Pejsa, Michael Gleicher, and Bilge Mutlu. 2017. Who, me? How virtual agents can shape conversational footing in virtual reality. In International Conference on Intelligent Virtual Agents. Springer, 347–359.
[119]
S. Pellegrini, A. Ess, K. Schindler, and L. van Gool. 2009. You’ll never walk alone: Modeling social behavior for multi-target tracking. In 2009 IEEE 12th International Conference on Computer Vision. 261–268.
[120]
Samira Pouyanfar, Yimin Yang, Shu-Ching Chen, Mei-Ling Shyu, and SS Iyengar. 2018. Multimedia big data analytics: A survey. ACM Computing Surveys (CSUR) 51, 1 (2018), 1–34.
[121]
Pradip Pramanick, Hrishav Bakul Barua, and Chayan Sarkar. 2021. Robotic task planning for complex task instructions in natural language. US Patent App. 17/007,391.
[122]
Profs. 2014 modified. F-formation discovery in static images. Retrieved from http://profs.scienze.univr.it/cristanm/ssp/
[123]
Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. 2021. D-nerf: Neural radiance fields for dynamic scenes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10318–10327.
[124]
Prarabdh Raipurkar, Rohil Pal, and Shanmuganathan Raman. 2021. HDR-cGAN: Single LDR to HDR image translation using conditional GAN. In 12th Indian Conference on Computer Vision, Graphics and Image Processing, 1–9.
[125]
Anoop Rajagopal, Ramanathan Subramanian, Radu Vieriu, Elisa Ricci, Oswald Lanz, Kalpathi Ramakrishnan, and Nicu Sebe. 2012. An adaptation framework for head-pose classification in dynamic multi-view scenarios. In Vol. 7725. 652–666. DOI:
[126]
Anoop Kolar Rajagopal, Ramanathan Subramanian, Elisa Ricci, Radu L. Vieriu, Oswald Lanz, Kalpathi R. Ramakrishnan, and Nicu Sebe. 2014. Exploring transfer learning approaches for head pose classification from multi-view surveillance images. International Journal of Computer Vision 109, 1–2 (2014), 146–167.
[127]
Chirag Raman and Hayley Hung. 2019. Towards automatic estimation of conversation floors within F-formations. In 2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), 175–181.
[128]
Chirag Raman, Stephanie Tan, and Hayley Hung. 2020. A modular approach for synchronized wireless multimodal multisensor data acquisition in highly dynamic social settings. In 28th ACM International Conference on Multimedia, 3586–3594.
[129]
Omar Adair Islas Ramírez, Giovanna Varni, Mihai Andries, Mohamed Chetouani, and Raja Chatila. 2016. Modeling the dynamics of individual behaviors for group detection in crowds using low-level features. In 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 1104–1111.
[130]
Laurel Riek and Don Howard. 2014. A code of ethics for the human-robot interaction profession. Proceedings of We Robot.
[131]
Angelique Taylor and Laurel D. Riek. 2018. Robot-Centric Human Group Detection.
[132]
Jorge Rios-Martinez, Anne Spalanzani, and Christian Laugier. 2011. Understanding human interaction for probabilistic autonomous navigation using risk-RRT approach. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2014–2019.
[133]
RoboDK. 2021. Simulate Robot Applications. Retrieved from https://robodk.com/index
[134]
Alessio Rosatelli, Ekin Gedik, and Hayley Hung. 2019. Detecting F-formations & roles in crowded social scenes with wearables: Combining proxemics & dynamics using LSTMs. In 2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), 147–153.
[135]
Paolo Rota, Nicola Conci, and Nicu Sebe. 2012. Real time detection of social interactions in surveillance video. In European Conference on Computer Vision. Springer, 111–120.
[136]
Peter A. M. Ruijten and Raymond H. Cuijpers. 2017. Stopping distance for a robot approaching two conversating persons. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 224–229.
[137]
M. S. Ryoo and J. K. Aggarwal. 2009. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In 2009 IEEE 12th International Conference on Computer Vision, 1593–1600.
[138]
S. M. Bhagya P. Samarakoon, M. A. Viraj J. Muthugala, and A. G. Buddhika P. Jayasekara. 2018. Replicating natural approaching behavior of humans for improving robot's approach toward two persons during a conversation. In 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 552–558.
[139]
Chayan Sarkar, Snehasis Banerjee, Pradip Pramanick, Hrishav Bakul Barua, Soumyadip Maity, Dipanjan Das, Brojeshwar Bhowmick, Ashis Sau, Abhijan Bhattacharyya, Arpan Pal, Balamuralidhar Purushothaman, and Ruddra Roy Chowdhury. 2021a. Knowledge partitioning for task execution by conversational tele-presence robots in a geographically separated environment. US Patent App. 17/015,238.
[140]
Chayan Sarkar, Hrishav Bakul Barua, Arpan Pal, Balamuralidhar Purushothaman, and Achanna Anil Kumar. 2021b. Attention shifting of a robot in a group conversation using audio-visual perception based speaker localization. US Patent App. 11,127,401.
[141]
Satoru Satake, Takayuki Kanda, Dylan F. Glas, Michita Imai, Hiroshi Ishiguro, and Norihiro Hagita. 2009. How to approach humans? Strategies for social robots to initiate interaction. In 4th ACM/IEEE International Conference on Human Robot Interaction, 109–116.
[142]
Ashis Sau, Ruddra Dev Roychoudhury, Hrishav Bakul Barua, Chayan Sarkar, Sayan Paul, Brojeshwar Bhowmick, Arpan Pal and Balamuralidhar Purushothaman. 2020. Edge-centric telepresence avatar robot for geographically distributed environment. arXiv:2007.12990.
[143]
Audrey Serna, Lili Tong, Aurélien Tabard, Simon Pageaud, and Sébastien George. 2016. F-formations and collaboration dynamics study for designing mobile collocation. In 18th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct, 1138–1141.
[144]
Francesco Setti, Hayley Hung, and Marco Cristani. 2013a. Group detection in still images by F-formation modeling: A comparative study. In 2013 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS). IEEE, 1–4.
[145]
Francesco Setti, Oswald Lanz, Roberta Ferrario, Vittorio Murino, and Marco Cristani. 2013b. Multi-scale F-formation discovery for group detection. In 2013 IEEE International Conference on Image Processing. IEEE, 3547–3551.
[146]
Francesco Setti, Chris Russell, Chiara Bassetti, and Marco Cristani. 2015. F-formation detection: Individuating free-standing conversational groups in images. PloS One 10, 5 (2015), e0123783.
[147]
Garima Sharma, Kalin Stefanov, Abhinav Dhall, and Jianfei Cai. 2022. Graph-based group modelling for backchannel detection. In 30th ACM International Conference on Multimedia, 7190–7194.
[148]
Chao Shi, Michihiro Shimada, Takayuki Kanda, Hiroshi Ishiguro, and Norihiro Hagita. 2011. Spatial formation model for initiating conversation. Proceedings of Robotics: Science and Systems VII (2011), 305–313.
[149]
Chao Shi, Masahiro Shiomi, Takayuki Kanda, Hiroshi Ishiguro, and Norihiro Hagita. 2015. Measuring communication participation to initiate conversation in human–robot interaction. International Journal of Social Robotics 7, 5 (2015), 889–910.
[150]
Francesco Solera, Simone Calderara, and Rita Cucchiara. 2015. Socially constrained structural learning for groups detection in crowd. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 5 (2015), 995–1008.
[151]
Rainer Stiefelhagen, Rachel Bowers, and Jonathan Fiscus. 2008. Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007, Revised Selected Papers. Vol. 4625. DOI:
[152]
Ramanathan Subramanian, Jagannadan Varadarajan, Elisa Ricci, Oswald Lanz, and Stefan Winkler. 2015. Jointly estimating interactions and head, body pose of interactors from distant social scenes. In 23rd ACM International Conference on Multimedia, 835–838.
[153]
Mason Swofford, John Peruzzi, Nathan Tsoi, Sydney Thompson, Roberto Martín-Martín, Silvio Savarese, and Marynel Vázquez. 2020. Improving social awareness through dante: Deep affinity network for clustering conversational interactants. Proceedings of the ACM on Human-Computer Interaction 4, CSCW1 (2020), 1–23.
[154]
Stephanie Tan, David M. J. Tax, and Hayley Hung. 2022. Conversation group detection with spatio-temporal context. In 2022 International Conference on Multimodal Interaction, 170–180.
[155]
Julian Tanke, Linguang Zhang, Amy Zhao, Chengcheng Tang, Yujun Cai, Lezi Wang, Po-Chen Wu, Juergen Gall, and Cem Keskin. 2023. Social diffusion: Long-term multiple human motion anticipation. In IEEE/CVF International Conference on Computer Vision, 9601–9611.
[156]
Yudong Tao, Samantha G. Mitsven, Lynn K. Perry, Daniel S. Messinger, and Mei-Ling Shyu. 2019. Audio-based group detection for classroom dynamics analysis. In 2019 International Conference on Data Mining Workshops (ICDMW). IEEE, 855–862.
[157]
Adriana Tapus, Antonio Bandera, Ricardo Vazquez-Martin, and Luis V. Calderita. 2019. Perceiving the person and their interactions with the others for social robotics–a review. Pattern Recognition Letters 118 (2019), 3–13.
[158]
Angelique Taylor and Laurel D. Riek. 2016. Robot perception of human groups in the real world: State of the art. In 2016 AAAI Fall Symposium Series.
[159]
Angelique Marie Taylor, Darren Chan, and Laurel Riek. 2019. Robot-centric perception of human groups. ACM Transactions on Human-Robot Interaction (THRI) (2019).
[160]
Sydney Thompson, Abhijit Gupta, Anjali W Gupta, Austin Chen, and Marynel Vázquez. 2021. Conversational group detection with graph neural networks. In 2021 International Conference on Multimodal Interaction, 248–252.
[161]
Lili Tong, Audrey Serna, Simon Pageaud, Sébastien George, and Aurélien Tabard. 2016. It's not how you stand, it's how you move: F-formations and collaboration dynamics in a mobile learning game. In 18th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI ’16). ACM, New York, NY, 318–329. DOI:
[162]
Khai N. Tran, Apurva Bedagkar-Gala, Ioannis A. Kakadiaris, and Shishir K. Shah. 2013. Social cues in group formation and local interactions for collective activity analysis. In International Conference on Computer Vision Theory and Applications (VISAPP), 539–548.
[163]
Xuan Tung Truong, Yong Sheng Ou, and Trung-Dung Ngo. 2016. Towards culturally aware robot navigation. In 2016 IEEE International Conference on Real-time Computing and Robotics (RCAR). IEEE, 63–69.
[164]
Shih-Huan Tseng, Yen Chao, Ching Lin, and Li-Chen Fu. 2016. Service robots: System design for tracking people through data fusion and initiating interaction with the human group by inferring social situations. Robotics and Autonomous Systems 83 (2016), 188–202.
[165]
Jagannadan Varadarajan, Ramanathan Subramanian, Samuel Rota Bulò, Narendra Ahuja, Oswald Lanz, and Elisa Ricci. 2017. Joint estimation of human pose and conversational groups from social scenes. International Journal of Computer Vision 126 (July 2017). DOI:
[166]
Sebastiano Vascon, Eyasu Zemene Mequanint, Marco Cristani, Hayley Hung, Marcello Pelillo, and Vittorio Murino. 2014. A game-theoretic probabilistic approach for detecting conversational groups. In Asian Conference on Computer Vision. Springer, 658–675.
[167]
Sebastiano Vascon, Eyasu Z. Mequanint, Marco Cristani, Hayley Hung, Marcello Pelillo, and Vittorio Murino. 2016. Detecting conversational groups in images and sequences: A robust game-theoretic approach. Computer Vision and Image Understanding 143 (2016), 11–24.
[168]
Marynel Vázquez, Elizabeth J. Carter, Braden McDorman, Jodi Forlizzi, Aaron Steinfeld, and Scott E. Hudson. 2017. Towards robot autonomy in group conversations: Understanding the effects of body orientation and gaze. In 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 42–52.
[169]
Marynel Vázquez, Alexander Lew, Eden Gorevoy, and Joe Connolly. 2022. Pose generation for social robots in conversational group formations. Frontiers in Robotics and AI 8 (2022), 703807.
[170]
Marynel Vázquez, Aaron Steinfeld, and Scott E. Hudson. 2015. Parallel detection of conversational groups of free-standing people and tracking of their lower-body orientation. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 3010–3017.
[171]
Araceli Vega-Magro, Luis Manso, Pablo Bustos, Pedro Núñez, and Douglas G. Macharet. 2017. Socially acceptable robot navigation over groups of people. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 1182–1187.
[172]
Computational Vision and Geometry Lab (CVGL) at Stanford. 2021. Discovering groups of people in images. Retrieved from http://cvgl.stanford.edu/projects/groupdiscovery/
[173]
TeV TECHNOLOGIES OF VISION. 2021a. Resources. Retrieved from http://tev.fbk.eu/resources
[174]
TeV TECHNOLOGIES OF VISION. 2021b. SALSA dataset. Retrieved from http://tev.fbk.eu/salsa
[175]
Lin Wang and Kuk-Jin Yoon. 2021. Deep learning for hdr imaging: State-of-the-art and future trends. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[179]
[181]
Alexander Wisowaty. 2019. Group Human-Robot Interaction: A Review.
[182]
Shih-An Yang, Edwinn Gamborino, Chun-Tang Yang, and Li-Chen Fu. 2017. A study on the social acceptance of a robot in a multi-human interaction using an F-formation based motion model. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2766–2771.
[183]
Naoyuki Yasuda, Koh Kakusho, Takeshi Okadome, Takuya Funatomi, and Masaaki Iiyama. 2014. Recognizing conversation groups in an open space by estimating placement of lower bodies. In 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 544–550.
[184]
Mohammad Abu Yousuf, Yoshinori Kobayashi, Yoshinori Kuno, Keiichi Yamazaki, and Akiko Yamazaki. 2012. Establishment of spatial formation by a mobile guide robot. In 7th Annual ACM/IEEE International Conference on Human-Robot Interaction. 281–282.
[185]
Gloria Zen, Bruno Lepri, Elisa Ricci, and Oswald Lanz. 2010. Space speaks: Towards socially and personality aware visual surveillance. In 1st ACM International Workshop on Multimodal Pervasive Video Analysis, 37–42.
[186]
Lu Zhang and Hayley Hung. 2016. Beyond F-formations: Determining social involvement in free standing conversing groups from static images. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1086–1095.
[187]
Lu Zhang and Hayley Hung. 2021. On social involvement in mingling scenarios: Detecting associates of F-formations in still images. IEEE Transactions on Affective Computing 12, 1 (2021), 165–176. DOI:
[188]
Ji Zhu, Hua Yang, Weiyao Lin, Nian Liu, Jia Wang, and Wenjun Zhang. 2020. Group re-identification with group context graph neural networks. IEEE Transactions on Multimedia 23 (2020), 2614–2626.
