
Enabling Social Robots to Perceive and Join Socially Interacting Groups Using F-formation: A Comprehensive Overview

Published: 23 October 2024

Abstract

Social robots in our daily surroundings, like personal guides, waiter robots, home helpers, assistive robots, and telepresence/teleoperation robots, are increasing day by day. Their usability and acceptability largely depend on their explicit and implicit interaction capability with fellow human beings. As a result, social behavior is one of the most sought-after qualities that a robot can possess. However, there is no specific aspect and/or feature that defines socially acceptable behavior, and it largely depends on the situation, application, and society. In this article, we investigate one such social behavior for collocated robots. Imagine a group of people interacting with each other, and we want to join the group. We as human beings do this in a socially acceptable manner, i.e., we position ourselves within the group in such a way that we can participate in the group activity without disturbing/obstructing anybody. To possess such a quality, a robot first needs to determine the formation of the group and then determine a position for itself, which we humans do implicitly. There are many theories that study group formations and proxemics; one such theory is f-formation, which can be utilized for this purpose. As the types of formations can be very diverse, detecting social groups is not a trivial task. In this article, we provide a comprehensive survey of the existing work on social interaction and group detection using f-formation for robotics and other applications. We also put forward a novel holistic survey framework combining some of the arguably more important concerns and modules relevant to this problem. We define taxonomies based on methods, camera views, datasets, detection capabilities and scale, evaluation approaches, and application areas. We discuss certain open challenges and limitations in the current literature along with possible future research directions based on this framework. In particular, we discuss the existing methods/techniques with their relative merits and demerits and their applications, and we provide a set of unsolved but relevant problems in this domain. The official website for this work is available at: https://github.com/HrishavBakulBarua/Social-Robots-F-formation

1 Introduction

Human group [146] and activity detection [2] has been a hot topic in computer/machine vision, Artificial Intelligence (AI), and robotics research. When humans interact with each other in a group (two or more people), they use implicit social intuition to position themselves with respect to each other, which facilitates easy interaction in the situation. The same applies when a new person wants to join a group. That person also assesses (implicitly) which place is best for joining so that the people in the group do not face any inconvenience. The person also considers her role in the group, according to the organization she is currently part of, in order to position herself [46]. Nowadays, robots are widely used in our daily surroundings [164] for many purposes. One such popular application is using a robot to attend meetings/conferences/discussions remotely as a telepresence medium [21, 30, 121, 139, 140, 142]. In such scenarios, the robot has to join a group of people. For robots to fluidly participate in groups, they need to know how the groups are formed, how they are shaped, and how they have evolved [84, 85, 146]. There can be many kinds of groups that differ in dimension, situation, organization, and so on, and their spatial arrangements are generally referred to as “f-formations.”
Facing formation (f-formation) is defined as the set of spatial patterns formed during social interactions of two or more people. A robot can join an existing group or approach a single person and form a new group [15, 16, 54, 63, 114, 116, 136, 138, 182]. Three social spaces are associated with an f-formation: O-space, P-space, and R-space. The O-space is the joint transaction space, i.e., the interaction space between participants. The P-space is the space in which the active participants stand. The R-space is the area that surrounds the participants and lies outside the interaction radius, as shown in Figure 1 (details in Section 2). In social science, Kendon (1990) proposed four standard formations: vis-a-vis/face-to-face, side-by-side, L-shaped, and circular. Apart from these, there are many other kinds of formations such as semi-circular, rectangular, triangular, v-shaped, and spooning [112, 161]. By categorizing a formation into one of these types, a robot can understand how people are standing in the discussion and accordingly decide a position for itself to join the group. While joining, the robot should ensure that the people already standing are neither disturbed nor obstructed. Several methods exist for detecting groups, such as determining the position and orientation of people, graph-cut methods, and Hough-voting systems. One major problem in f-formation detection is occlusion [47]. People in a group may stand in positions such that some of them occlude others, or the camera may be placed at an angle from which the complete group is not visible. In such cases, a robot cannot determine the formation because it may not be able to detect the bodies of some people, so we have to decide what kind of formation the robot should assume in order to continue its work. For the purposes of this article, we use group, interaction, and f-formation synonymously, as our primary focus is on detecting a group of interacting people who follow a particular formation, for various applications.
Fig. 1.
Fig. 1. The three social spaces corresponding to human proxemics.
Motivation and Research Objective. Social group and interaction detection is a non-trivial task in computer vision and is of importance to the social robotics research community. Many research groups across the globe have concentrated their studies in this area. The idea of social groups and interactions was first proposed in 1990 by Kendon [84]; however, the first mention of human proxemics dates back to 1963 [63]. These concepts of f-formation and human proxemic behavior form the foundation for studying and detecting physically situated interactions and groups of people, robots, or both. The research has gained pace since 2010 (see Table 3), delivering many rule-based and learning-based methods and techniques for detecting interaction and f-formation in social setups for various applications, mostly in robotics and vision. However, few surveys, reviews, or tutorials exist in this domain that provide a good overall impression of the research, the state of the art, and future opportunities. This survey article is a 360° view into the problem of social group and interaction detection using f-formation, covering almost all the concerns and aspects comprehensively. The aim is to discuss the domain in a comprehensive manner and to help scientists, researchers, and computer engineers get a fair idea of the area and conduct fruitful research in the future. We also list a few related surveys in Table 1 and compare them with our work based on the various concern areas of our survey framework (discussed in Section 4.1 and Figure 8).
Table 1.
Existing Surveys and Reviews (concern areas (1)–(11))
Adriana et al. [157]
Francesco et al. [146] ✓✓
Sai Krishna et al. [117]
This survey: ✓✓ in each of the eleven concern areas
Table 1. Comparison of Existing and Related Surveys/Reviews With Our Work
A blank entry signifies no treatment in the paper, ✓ signifies some mention exists, and ✓✓ means comprehensive treatment of the concern area. (1) Comprehensive f-formation list and tutorial on social spaces, (2) camera views and sensors, (3) datasets, (4) detection capability/scale, (5) evaluation methodology, (6) rule-based AI methods/techniques, (7) ML-based AI methods/techniques, (8) applications, (9) limitations, challenges and future directions, (10) generic survey framework for group and interaction detection, (11) discussion on ethical concerns and considerations.
Uniqueness of the Survey. To the best of our knowledge, this survey is the first of its kind in this subject area. Our survey presents the idea of social groups from the perspective of f-formation in comprehensive detail. We also discuss the natural joining position for a robot to enable human–robot interaction (HRI) after successful detection of the formation using computer vision techniques. Additionally, we propose a holistic framework that captures the various concern areas in the detection and prediction of social groups. Various taxonomies are discussed regarding the camera view of the environment for collecting scenes for detection, datasets for training machine/deep learning (ML/DL) models, detection capability and scale, and evaluation methods. We discuss and categorize all the detection methods, particularly rule-based and ML-based ones. Furthermore, we discuss the application areas of such detection and recognition, with a primary focus on robotics. We also include some discussion on ethical concerns and considerations [73] in data collection, distribution, and application. Finally, we detail the challenges, limitations, and future research directions in this area.
Organization of the Survey. This survey article is organized into the following sections. Section 2 gives a comprehensive perspective on the social spaces involved in group interaction. Questions such as the meaning of f-formation, the types of f-formation, and the evolution of an f-formation from one type to another when a new member joins a group are answered, along with pictorial depictions for the reader's understanding. Then, we present a year-wise compilation of research performed in this domain with analysis in Section 3. In Section 4.1, we propose a generic and holistic framework for group and interaction detection using formations, which also becomes the basis for categorizing the literature in the survey according to the various concern areas and modules. Section 4.2 shows the article discovery and selection methodology, with the inclusion/exclusion criteria in Section 4.3. Section 5 discusses the various input methods for detection, such as cameras and other sensors, and also focuses on the various camera views and positions. Section 6 summarizes the methods, techniques, and algorithms for detection, covering both rule-based static AI methods and learning-based (data-driven) methods. In Section 7, we briefly discuss the various datasets available for training and testing purposes. Then, we talk about detection capabilities and scale in Section 8. Section 9 presents the various evaluation strategies and methodologies from the perspective of algorithmic computational complexity and application areas like robotics and vision. The various application areas are stated in Section 10. Finally, we discuss the limitations and challenges in the existing state-of-the-art literature and methods, and propose some future research directions and prospects for each of the modules (of the survey framework) in Section 11. We conclude the survey in Section 12.

2 Social Spaces in Group Interaction

In this section, we describe the theory of f-formation, how it can be leveraged to study groups of interacting people, and how a robot can utilize it to perceive and display social awareness during interaction with a group of people. Apart from that, we also identify a set of formations that do not fall under the exact definition of f-formation but are still prevalent in social gatherings and environments [112].

2.1 What Is F-formation?

An f-formation, or any formation, arises when two or more people sustain a spatial and orientational relationship in which they have equal, direct, and exclusive access to the space between them [84, 85]. Technically, an f-formation is said to have formed when an O-space is created by the overlapping transaction segments of the interacting people [84, 85]. Figure 1 depicts such a social space where a group of people is interacting. An f-formation is the proper organization of three social spaces: O-space, P-space, and R-space [84, 85]. They are situated like three circles surrounding each other. The O-space, the innermost circle, is a convex empty space that is normally surrounded by the people in the group, and the participants generally look inward into it. The P-space, the second circle, is a narrow space where the active participants stand. The R-space, the outermost circle, is the space where a mostly inactive participant (listener) or an outsider who is not a part of the conversation stands. However, this is not entirely the case, as stated in [84]: Kendon explains that some listeners can be a part of an f-formation even though they do not interact directly but just listen to the other interacting people. In such cases, a listener can be in the P-space. The papers [38, 59, 108] study various types of participants and non-participants in an interaction. The participants are the ones either speaking or the ones being spoken to directly (active listeners). People who do not fall into these categories, but can still be considered part of the gathering, are described as listeners, over-hearers, or side participants in a group discussion or interaction. According to the authors, listeners can be either active (also participants) or passive (bystanders or eavesdroppers). A bystander is a listener who is present in the group with the awareness of the participants but not interacting with them, whereas an eavesdropper overhears the conversation without the awareness of the participants. So, active listeners are automatically part of the P-space, and in some cases passive listeners like bystanders can also occupy the P-space, but it is very unlikely that an eavesdropper will be seen in the P-space; they are generally found in the R-space. Another study, by Lu Zhang et al. [187], categorizes participants as in-group or out-group. The in-group participants are the ones actively participating in the conversation and are considered full members of the f-formation; they tend to be in the P-space. The out-group consists of people who are trying to join an f-formation but are not fully accepted yet; they tend to be in the R-space and wait for an opportunity to get a spot in the P-space. These members can leave the group without disturbing the f-formation and the conversation.
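The technical criterion above (overlapping transaction segments creating an O-space) can be made concrete with a small geometric sketch. The following Python snippet is a minimal illustration, not a method from the surveyed literature: it projects a point a fixed "stride" in front of each participant and declares an f-formation when those projected points cluster around a common centre. The stride and tolerance values are assumptions chosen for readability, not calibrated constants.

```python
import numpy as np

STRIDE = 0.75      # metres projected in front of each person (assumed)
TOLERANCE = 0.6    # max distance of a projected point from the centre (assumed)

def o_space_centre(positions, orientations):
    """positions: (N, 2) array in metres; orientations: (N,) array in radians."""
    positions = np.asarray(positions, dtype=float)
    headings = np.stack([np.cos(orientations), np.sin(orientations)], axis=1)
    projected = positions + STRIDE * headings          # points cast into the group
    centre = projected.mean(axis=0)                    # candidate O-space centre
    spread = np.linalg.norm(projected - centre, axis=1)
    is_f_formation = bool(np.all(spread < TOLERANCE))  # segments overlap -> O-space
    return centre, is_f_formation

# Two people facing each other ~1.5 m apart: their projected points coincide,
# so an O-space (and hence a vis-a-vis f-formation) is detected.
centre, ok = o_space_centre([[0.0, 0.0], [1.5, 0.0]], np.array([0.0, np.pi]))
print(centre, ok)   # -> roughly [0.75, 0.0] True
```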

2.2 Different Formations (Including F-formations) Possible during Interaction

Although both the theory of f-formation and appropriate methods to detect f-formations have been well analyzed in the literature, a comprehensive list of the possible formations (f-formations and others) and their major variations during different kinds of interactions is yet to be brought out. The most common ones are side-by-side, vis-a-vis, L-shaped, and triangular, defined for groups of two to three persons. Some others include circular, square, rectangular, and semi-circular, which are more flexible and can contain a varying number of persons. However, some arrangements have no common O-space and do not fit the typical technical definition of f-formation, such as reversed L-shaped, spooning, and z-shaped [112] (see Figure 4). The wide V-shaped formation may also be considered a new category, as it does not exactly satisfy Kendon's definition [84, 85]. Jeni Paay et al. [112] studied 1,592 scenes across a total of 61 videos, where a scene is defined as a movement to a new formation or camera position; 663 scenes have no clear formations to consider, while the remaining 929 have two or more people. Out of these, 581 scenes have spatial-orientational arrangements that fit the f-formations identified by Kendon [84, 85], while the remaining 348 scenes have spatial and orientational arrangements that do not directly follow the f-formation norms. Hence, the authors define special kinds of formations for these scenes.
Another important consideration is redundancy among some of the f-formations. For example, the square/rectangle, triangle, and circular formations may seem similar in form and vary only in the number of participating people. However, they are treated as separate formations in the literature because, for detection tasks in vision or robotics applications, the number of people involved matters in addition to the overall form of the formation. Also, if a robot needs to join a formation, its structure might change from one form to another depending on the current form and the number of participants. Consider a case: if a robot detects a triangle formation and joins it in an optimal way, the result might be a rectangle/square formation. So, in such real-life applications, these formations may be considered unique for better representation in detection algorithms for robotics. As for some of the other formations like line, column, diagonal (echelon), or geese, these are not used much but are possible formations in exercises, military operations (infantry and cavalry warfare), or similar disciplined drills. The use of robotics in defence or therapy is an active area of research, so these formations might be useful for such applications: detection of and joining such formations by robots is one possible application, and visual analysis of such exercises, military drills, and so on can be another.
We list down and categorize a comprehensive collection of the known formations and their major variations (including f-formations) below (see Figure 2 for a pictorial representation).
Fig. 2.
Fig. 2. Comprehensive list of f-formations which are/can be used for group and interaction detection tasks in vision and robotics.
(a)
Side-by-side: The side-by-side formation is formed when two people stand close to each other facing in the same direction; both face either right, left, or center. A minimum of two people is required for such a formation [74].
(b)
Vis-a-vis or face-to-face: This formation comes into existence when two people are facing each other. Only two people are required for such a formation [74].
(c)
L-shape: The L-shape is formed when two people face each other perpendicularly and are situated on the two ends of the letter “L”—one person facing the center and the other facing right or left [74].
(d)
Reversed L-shaped: This is formed when two people are in an L-shape arrangement but facing in different directions [112].
(e)
Wide V-shaped: Two people face in the same direction, as in side-by-side, but tilt their bodies slightly toward each other. A minimum of two people is required for this formation [112].
(f)
Spooning: This formation has two people, with one person facing forward and the other looking over from behind in the same direction [112].
(g)
Z-shaped: This is formed when two people are standing side-by-side but facing in opposite directions [112] (see the sketch after this list for a geometric illustration of the two-person formations).
(h)
Line formation: In this formation, all members stand side-by-side in a straight line; a minimum of two people are required [176].
(i)
Column formation: In this formation, all members stand one behind the other in a straight line; a minimum of two people are required [177].
(j)
Diagonal (Echelon): In this formation, people stand diagonally and face in the same direction. A minimum of two people are required [178].
(k)
Side-by-side with one headliner: In this formation, one person stands in the front and others stand side-by-side at the back. A minimum of three people are required, and they all face in the same direction [46].
(l)
Side-by-side with outsider: In this formation, one participant occupies an outer position of the side-by-side formation, in the R-space, and usually does not play an active role in the conversation. A minimum of three people are required [96].
(m)
V-shaped: In this formation, all people stand in a V-shaped fashion and face the same direction. A minimum of three people are required [112, 179].
(n)
Horseshoe: The group of people stands in the shape of “U,” and a minimum of five people are required [146].
(o)
Semi-circular: The semi-circular formation is where three or more people are focusing on the same task while interacting with each other [117].
(p)
Semi-circular with one leader in the middle: In this formation, people stand in a semi-circular shape with one person in the center facing the group in the semi-circle. A minimum of four people are required [96].
(q)
Square (Infantry square): Four people stand in a square-shaped fashion [180].
(r)
Triangle: As the name suggests, three people stand in a triangular shape in this formation [117].
(s)
Circle: As the name suggests, a group of people stands in a circular shape in this formation [115].
(t)
Circular arrangement with outsiders: In this formation, some people stand in a circular fashion and one/two additional people stand at the back of the circular formation [46].
(u)
Geese formation: In this formation, there are two or more people where one person is leading the path and the others are following that person but may or may not be looking in the same direction. A minimum of two people are required [48].
(v)
Lone wolves: This is not really a formation (yet). There is only one person ready to be joined by others before an interaction [48].
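To make the geometric distinctions among the two-person formations above ((a)–(g)) concrete, the following Python sketch classifies a dyad purely from positions and body orientations. It is an illustrative toy, not a method from the surveyed literature; the 30° tolerance and the way borderline categories are merged are assumptions.

```python
import numpy as np

def classify_dyad(p1, th1, p2, th2, tol=np.radians(30)):
    """Rough relative-orientation classifier for two standing people.

    p1, p2: 2D positions (metres); th1, th2: body orientations (radians).
    """
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    # Angle between the two body orientations, folded into [0, pi].
    rel = np.abs(np.arctan2(np.sin(th1 - th2), np.cos(th1 - th2)))
    # Check whether each person is oriented toward the other.
    d = p2 - p1
    bearing_12 = np.arctan2(d[1], d[0])
    facing_each_other = (
        np.abs(np.arctan2(np.sin(th1 - bearing_12), np.cos(th1 - bearing_12))) < tol
        and np.abs(np.arctan2(np.sin(th2 - (bearing_12 + np.pi)),
                              np.cos(th2 - (bearing_12 + np.pi)))) < tol
    )
    if rel > np.pi - tol:                      # opposite orientations
        return "vis-a-vis" if facing_each_other else "z-shaped"
    if rel < tol:                              # same orientation
        return "side-by-side (or wide V / spooning)"
    if np.abs(rel - np.pi / 2) < tol:          # perpendicular orientations
        return "L-shaped (or reversed L-shaped)"
    return "unclassified"

print(classify_dyad([0.0, 0.0], 0.0, [1.5, 0.0], np.pi))        # vis-a-vis
print(classify_dyad([0.0, 0.0], np.pi / 2, [1.0, 0.0], np.pi))  # L-shaped (or reversed L-shaped)
```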
Fig. 3.
Fig. 3. Comprehensive list of f-formations and their corresponding changed formations after a person/robot joins it. Also, the relevant possible trajectories a robot/person should take in each of the case.
Fig. 4.
Fig. 4. Images from YouTube videos by Jeni Paay et al. [112] depicting some new formations.
Apart from these standing formations and f-formations, some authors also consider seated formations. McNeill et al. [98] considered seated group interactions, showing horseshoe and circular seated f-formations drawn from a US Air Force war-gaming session. They study the behavior of humans interacting with each other in a seated conversation using multi-modal cues such as hand-pointing gestures and gaze or eye contact. Figure 5 shows a horseshoe setup with five people interacting with each other, although the table has a capacity of eight people, which would make it a circular seated formation. The positions C, D, E, F, and G are occupied, and A, B, and H are vacant. Camera 1 is used to take the images for analysis, and Figure 6 gives the camera 1 perspective of the seated people. The research aims at understanding interpersonal interaction patterns under the umbrella of multi-modality. The five individuals from the US Air Force take part in military gaming exercises at the Air Force Institute of Technology, Wright-Patterson Air Force Base, Dayton, OH. The people in the setup come from various military specialities, with the commanding officer in position E. We can clearly identify the O-space, P-space, and R-space in these figures, so the detection of seated f-formations and its application in robotics and vision offers a new research direction from this perspective.
Fig. 5.
Fig. 5. The setup used in McNeill et al. [98] for studying attention and gaze patterns of people in a seated conversation.
Fig. 6.
Fig. 6. The setup used in McNeill et al. [98] for studying attention and gaze patterns of people in a seated conversation. Please visit the citation for more details.

2.3 Best Position for a Robot to Join in a Group

As social robotics is one of the most important application areas of group and interaction detection using computer vision, we put forward a list of positions where a robot can join a formation [169] (here, an f-formation) after successfully detecting it. Joining a group, however, requires a socially and culturally aware [33, 105, 163, 171] or human-aware [89, 132, 163] navigation protocol embedded into the robot. In other words, the robot should imitate natural human-like behavior while approaching a group (considering the correct direction and angle) for interaction and discussion, without causing any discomfort to the existing members of the group. This part of the story is beyond the scope of our survey, and we limit our work to the detection and prediction of groups and interactions. But, as it seems necessary to at least briefly mention this side of the coin, we put forward some of the possible joining locations and natural joining paths or approach directions/angles (robot trajectories) in this section (also discussed briefly in Sections 9 and 11). Researchers may consider presenting a systematic survey on the human-aware or socially aware navigation and group-joining aspects of a robot/autonomous agent after successful detection/prediction of the group interaction and f-formation; some such works can be found in [105, 132, 164, 171].
Table 2 summarizes a list of formations with the number of people and, correspondingly, the new formations after a person/robot joins them. A pictorial summary of the same is presented in Figure 3, which shows some of the possible joining positions in a group (mostly f-formations) along with the trajectories of a robot/human from a possible initial position to the final position. This list is not exhaustive but can be considered comprehensive in the sense that we have tried to include most of the formations (f-formations and others) that can be observed among humans in various social scenarios. The joining positions seen in the figures are considered optimal, as they are the only positions (clearly understandable from the images) where a new joiner may stand without obstructing the existing members or blocking the conversation between them. In some cases, like the line, column, or diagonal formations, the formation can be preserved by joining in such a way that the resulting formation is of the same type but with more members. Similarly, in other cases, the person/robot should join at a location that turns the resultant formation into one of the already existing ones. As for the joining path or trajectory, one should follow a path that does not disturb the existing members of the group and is at the same time the shortest path to the chosen joining location. Following this logic, we put forward the paths shown in the figures. One important caveat is that these paths do not account for any obstacles in the path or in the vicinity; in actual social scenarios, we expect many static and moving obstacles, which need to be addressed by good navigation policies and path planning (see [33] and [89]) and are outside the scope of this survey.
Table 2.
No. | Formation (Before) | No. of People | Formation (After joining)
a | Side-by-side | 2^a | Side-by-side, side-by-side with one outsider, Triangle
b | Vis-a-vis | 2 | Triangle
c | L-shaped | 2 | Triangle, reverse semi-circular
d | Reversed L-shaped | 2 | Semi-circular
e | Wide V shaped | 2^a | Triangle, semi-circular
f | Spooning | 2 | Side-by-side with headliner, side-by-side with outsider
g | Z-shaped | 2 | Triangle
h | Line | 3 | Line
i | Column | 3 | Column
j | Diagonal | 3 | Diagonal, V-shaped
k | S-by-s with one headliner | 3^a | S-by-s with one headliner
l | S-by-s with outsider | 3^a | Semi-circular
m | V-shaped | 7 | Triangle
n | Horseshoe | 5^a | Pentagon
o | Semi-circular | 4 | Circular
p | Semi-circular with one leader in the middle | 5 | Circle
q | Square | 4 | Circular, horseshoe
r | Triangle | 3 | Circle, semi-circular
s | Circle | 6 | Circle
t | Circle with outsider | 8 | Circle with outsider
u | Geese | 2^a | S-by-s with outsider
v | Lone wolf | 1 | Vis-a-vis, S-by-s, L-shaped
Table 2. Comprehensive List of Formations Before and After a Robot/Human Has Joined
^a signifies that the number is the minimum requirement for that particular formation.
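As a concrete (and deliberately simplified) illustration of the joining logic described above, the Python sketch below places a newcomer at the midpoint of the widest angular gap on the P-space ring of a roughly circular group and orients them toward the O-space; approaching along the outward radial direction then avoids crossing between existing members. This is an assumption-laden toy, not one of the surveyed approach planners, and it ignores obstacles and social-force costs entirely.

```python
import numpy as np

def joining_pose(member_positions):
    """Suggest where a newcomer should stand and which way to face on arrival.

    member_positions: (N, 2) array of the current members' positions (metres).
    Returns (target_position, unit_heading_toward_O_space).
    """
    pts = np.asarray(member_positions, dtype=float)
    centre = pts.mean(axis=0)                                  # crude O-space centre
    rel = pts - centre
    radius = np.mean(np.linalg.norm(rel, axis=1))              # approximate P-space radius
    angles = np.sort(np.arctan2(rel[:, 1], rel[:, 0]))
    gaps = np.diff(np.concatenate([angles, [angles[0] + 2 * np.pi]]))
    i = int(np.argmax(gaps))                                   # widest free arc between members
    slot = angles[i] + gaps[i] / 2.0                           # midpoint of that arc
    target = centre + radius * np.array([np.cos(slot), np.sin(slot)])
    heading = (centre - target) / np.linalg.norm(centre - target)
    return target, heading

# A triangle of three people on a ~1 m circle: the suggested slot turns it into
# a square/circle-like arrangement, consistent with row (r) of Table 2.
target, heading = joining_pose([[1.0, 0.0], [-0.5, 0.87], [-0.5, -0.87]])
print(target, heading)
```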

3 Research Chronology

In 1990, Kendon [84] proposed the f-formation theory for group interaction among participating people on the basis of proxemic behavior. Human proxemics is the study of how space is used between humans, or between humans and others, during interactions or while performing an activity; it was first studied by Hall almost six decades ago [63, 64]. This section surveys the literature collected on f-formation, covering both static rule-based and learning-based AI approaches. Figure 7 shows the distance ranges specified for different interaction types on the basis of the intimacy level between the participating people. The distance ranges in the green boxes are the ones relevant to group/interaction and f-formation detection, whereas the blue boxes signify distance ranges that are generally not seen in any f-formation. After the early studies by Hall, it was not until 2009 that computer vision-based approaches using static AI methods or ML/DL techniques emerged. From 2010, research in this field started gaining pace. From 2010 to 2013, rule-based fixed AI methods were prevalent. Learning-based or data-driven methods like DL and RL gained prominence around 2014 and have been gaining pace with each passing year. During the period from 2013 to 2017, research in this domain reached its peak, with multiple methods, algorithms, and techniques being proposed, and these methods attained wide popularity by 2019. The years from 2014 to 2018 saw roughly equal contributions from fixed rule-based methods and learning-based methods. So, as in almost every domain of AI, there has been a transition from traditional rule-based methods to ML and data-driven techniques. Table 3 can be referred to for the complete year-wise list of references; it also contains keywords (methods, focus areas, and technologies) for most of the references to give readers a better picture.
Fig. 7.
Fig. 7. Human proxemics using distance ranges as stated by Hall [63]. The green comment boxes signify distance ranges typically used for f-formation and group interaction.
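A minimal helper reflecting Hall's commonly cited proxemic zones is sketched below; the approximate thresholds of 0.45 m, 1.2 m, and 3.6 m are assumptions based on Hall's general framework, not values read off Figure 7. Distances in the personal and social bands are the ones typically relevant to f-formation and group-interaction detection.

```python
def hall_zone(distance_m: float) -> str:
    """Map an interpersonal distance (metres) to a Hall-style proxemic zone."""
    if distance_m < 0.45:
        return "intimate"
    if distance_m < 1.2:
        return "personal"
    if distance_m < 3.6:
        return "social"
    return "public"

print([hall_zone(d) for d in (0.3, 0.9, 2.0, 5.0)])
# -> ['intimate', 'personal', 'social', 'public']
```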
Table 3.
Year of Publication | Study, Methods and Techniques | Total
2004 | Geometric reasoning on sonar data [12]. | 1
2006 | Wizard-of-Oz (WoZ) study on spatial distances related to f-formations [74]. | 1
2009 | Clustering trajectories tracked by static laser range finders [80], trajectory classification by SVM [141]. | 2
2010 | Probabilistic generative model on IR tracking data [62], WoZ study of robot's body movement [90], SVM classification using kinematic features [185]. | 3
2011 | Analysis of different f-formations for information seeking [96], Hough-transform-based voting [43], graph clustering [72], a study on transitions between f-formations on interaction cues [100], a computational model of interaction space for virtual humans extending f-formation theory [107], a study of physical distancing from a robot [103], utilizing geometric properties of a simulated environment [104], a study to relate f-formations with conversation initiation [148], Gaussian clustering on camera-tracked trajectories [44], risk-based robot navigation [132]. | 10
2012 | Application of f-formations in collaborative cooking [111], Kinect-based tracking with rules [95], WoZ study on social interaction in nursing facilities [87], a study of robot gaze behaviors in group conversations [184], velocity models (while walking) [102], SVM with motion features [135], HMM [56]. | 7
2013 | Spatial geometric analysis on Kinect data [51, 53], analysis of f-formation in blended reality [47], a comparison of [43] and [72] [144], exemplar-based approach [93], multi-scale detection [145], Bag-of-Visual-Words-based classifier [162], Inter-Relation Pattern Matrix [27], HMM classifiers [99], O-space based path planning [60], Multi-hypothesis social grouping and tracking for mobile robots [94]. | 11
2014 | Hough Voting (HVFF), graph-cut f-formation (GCFF) [122], game theory based approach [166], correlation clustering algorithm [11], reasoning on proximity and visual orientation data [48], effects of cultural differences [78], HMM to classify accelerometer data [71], iterative augmentation algorithm [36], adaptive weights learning methods [126], estimating lower-body pose from head pose and facial orientation [183], search-based method [52], study on group-approaching behavior [82], spatial activity analysis in a multiplayer game [79]. | 12
2015 | Robust Tracking Algorithm using Tracking learning detection (TLD) [10], GCFF-based approach [146], Correlation Clustering algorithm [10], multimodal data fusion [9], spatial analysis in collaborative cooking [110], Group Interaction Zone (GIZ) detection method [35], study on influencing formations by a tour guide robot [81], joint inference of pose and f-formations [152], participation state model [149], SALSA dataset for evaluating social behavior [8], multi-level tracking based algorithm [170], Structural SVM using Dynamic Time Warping loss [150], Long-Short Term Memory (LSTM) network [3], influence of approach behavior on comfort [15], Sensor-based control task for joining [105]. | 15
2016 | F-formation applied to mobile collaborative activities [161], subjective annotations of f-formation [186], game-theoretic clustering [167], study of display angles in museum [75], mobile co-location analysis using f-formation [143], proxemics analysis algorithm [129], review of human group detection approaches [158], LSTM based detection in ego-view [4], Tracking people through data fusion for inferring social situation [164], Detecting group formations using iBeacon technology [83], [163]. | 11
2017 | Haar cascade face detector based algorithm [88, 115], weakly-supervised learning [165], temporal segmentation of social activities [41], omnidirectional mobility in f-formations [182], review of multimodal social scene analysis [7], 3D group motion prediction from video [77], survey on social navigation of robots [33], a study on robot's approaching behavior [16], heuristic calculation of robot's stopping distance [136], a study on human perception of robot's gaze [168], computational models of spatial orientation in virtual reality (VR) [118], Socially acceptable robot navigation over groups of people [171]. | 13
2018 | Optical-flow based algorithm in ego-view [131], meta-classifier learning using accelerometer data [57], human-friendly approach planner [138], discussion on improved teleoperation using f-formation [113], effect of spatial arrangement in conversation workload [97], study of f-formation dynamics in a vast area [46]. | 6
2019 | Study on teleoperators following f-formations [117], analysis on conversation floors prediction using f-formation [127], empirical comparison of data-driven approaches [67], LSTM networks applied on multimodal data [134], robot's optimal pose estimation in social groups [116], review of robot and human group interaction [181], Staged Social Behavior Learning (SSBL) [54], Euclidean distance based calculation after 2D pose estimation [114], Robot-Centric Group Estimation Model (RoboGEM) [159], DoT-GNN: Domain-Transferred Graph Neural Network for Group Re-identification [70], Audio-based framework for group activity sensing [156]. | 11
2020 | Difference in spatial group configurations between physically and virtually present agents [68], Conditional Random Field with SVM for jointly detecting group membership, f-formation and approach angle [18, 21], Group Re-Identification With Group Context Graph Neural Networks [188], Improving Social Awareness Through DANTE: Deep Affinity Network for Clustering Conversational Interactants [153]. | 4
2021 | Conversational Group Detection with Graph Neural Networks [160]. | 1
2022 | Graph-based group modelling for backchannel detection [147], Pose generation for social robots in conversational group formations [169], Conversation group detection with spatio-temporal context [154]. | 3
Table 3. Year-Wise Compilation of Methods and Techniques for Group/Interaction/F-formation Detection and Other Relevant Literatures
HMM, Hidden Markov Model; IR, infrared; WoZ, Wizard-of-Oz.

4 Review Methodology

This study systematically summarizes and analyzes state-of-the-art methods and challenges for group and interaction detection using f-formations. The methodology is based on a generic framework for analysis of the relevant literature and a synthesis of research findings in a systematic manner.

4.1 Generic Survey Framework with Possible Concern Areas

This survey aims to provide the concerned researchers with a comprehensive overview of the domain of group and interaction detection. The idea of group/interaction detection is not new and has been around for more than a decade. Researchers continue to design and develop new methods, techniques, algorithms, and architectures for various application areas ranging from computer vision and robotics to social environment analysis.
The problem of group and/or interaction detection is a non-trivial computer vision problem. Existing research approaches include classical AI algorithms like rule-based methods and geometric reasoning as well as neural network-based methods; learning paradigms such as supervised, semi-supervised, and unsupervised learning are also used. Proper categorization of these methods is necessary for future research directions. We have proposed a holistic framework that corresponds to the concern areas of f-formation research and can also be considered a generic architecture for a typical group/interaction detection task using f-formation. Figure 8 puts forward a possible framework with the different modules of such a detection task. The various concern areas of this domain can be characterized by the sensors used, the camera view/position for capturing the group interaction, the datasets used for training/testing in the case of learning-based approaches (indoor or outdoor), feature selection (in brief), detection capabilities (static/dynamic scenes) and scale (single or multi-group scenarios), evaluation methodology (efficiency/accuracy and/or simulation studies and human experience studies), and application areas. We also discuss the ethical concerns to be considered in data collection/sharing and in the application areas. These modules serve as the basis for categorizing the literature in our survey and are attended to one by one in the upcoming sections (as indicated in Figure 8). Finally, we conclude the survey by discussing the limitations, challenges, and future directions/prospects (Section 11) for each of the concerned modules.
Fig. 8.
Fig. 8. A possible separation of concerns into modules regarding social groups/interactions detection. The arrows correspond to the flow of events/data in a typical group/interaction detection (f-formation) framework.

4.2 Search Strategy

We conduct a thorough search using the most popular online libraries, including the Association for Computing Machinery's ACM Digital Library,1 the Institute of Electrical and Electronics Engineers' IEEE Xplore,2 Scopus,3 and ScienceDirect.4 ACM provides a wide array of transactions, journals, and conference proceedings in the areas of HRI, human–machine interaction, and computer vision. Similarly, IEEE provides a vast collection of interdisciplinary research transactions and tier-1 conference proceedings combining AI, robotics, robot vision, and social behavior. Scopus covers a variety of Springer journals dedicated to intelligent robotics and vision, while ScienceDirect hosts a gamut of Elsevier publications concentrating on autonomous systems and robots. We also use Google Scholar search as it provides the latest published or archived papers in the domain. The oldest paper reported is from 1963 and the newest one is from 2024; however, the major bulk of the works compared in this survey lies between 2010 and 2020 (see Section 3, Table 3).
ACM Digital Library. The Search String used: [[Title: f-formation] OR [Title: “social interaction”] OR [Title: “group interaction”] OR [Title: “group formation”] OR [Title: “conversational group”] OR [Title: “group dynamics”] OR [Title: “group behaviour”] OR [Title: “human proxemics”]] AND [[Title: “social robot”] OR [Title: “human-robot interaction”]] AND NOT [[Title: survey] OR [Title: review] OR [Title: tutorial]] AND [E-Publication Date: (01/01/2010 TO 12/31/2020)].
IEEE Xplore. The Search String used: (“Document Title”: f-formation OR “Document Title”: “social interaction” OR “Document Title”: “group interaction” OR “Document Title”: “group formation” OR “Document Title”: “conversational group” OR “Document Title”: “group dynamics” OR “Document Title”: “group behaviour” “Document Title”: “human proxemics”) AND (“Document Title”: “human-robot interaction” OR “Document Title”: “social robot”) NOT (“Document Title”: survey OR “Document Title”: review OR “Document Title”: tutorial) Limit: 2010–2020.
Scopus. The Search String used: (human-robot AND interaction) OR (social AND robot) AND (f-formation) OR (social AND interaction) OR (group AND interaction) OR (group AND formation) OR (conversational AND group) OR (group AND dynamics) OR (group AND behaviour) OR (human AND proxemics). Date range 2010–2020.
ScienceDirect. The Search String used: (“f-formation” OR “social interaction” OR “group interaction” OR “group formation” OR “conversational group” OR “group dynamics” OR “group behaviour” OR “human proxemics”) AND (“human-robot interaction” OR “social robot”) NOT (“survey” OR “review” OR “tutorial”) AND PUBYEAR \(\gt\) 2009 AND PUBYEAR \(\lt\) 2021.
Google Scholar. The Search String used: f-formation OR “social interaction” OR “group interaction” OR “group formation” OR “conversational group” OR “group dynamics” OR “group behaviour” OR “human proxemics” -survey -review -tutorial. Limit: 2010 onwards.

4.3 Inclusion and Exclusion Criteria

The above searches yield a total of 163 works (excluding surveys/tutorials/reviews) after screening for the appropriate and relevant literature for our survey. Out of these works, 54 were published in IEEE journals and conferences, 41 appeared in ACM journals and conferences, 21 were published by Springer, and 18 by Elsevier. The remaining 29 works were either published in other venues or retrieved from archival sites like ArXiv5 and ResearchGate.6 A total of 59 works are from journals and 97 are from conferences and workshops; the remaining seven are from other sites.
Out of these, a total of 92 works are selected for comparison and analysis. We select works focusing primarily on f-formation and its applications in robotics and related domains. We mostly list papers from 2010 to 2020, with certain exceptions as required by the survey and its treatment. Since IEEE and ACM are the major publishers in this area, almost 60% of the selected works come from these venues, while Springer and Elsevier account for 24%. The remaining 16%, approximately, come from lesser-known sources or archival repositories.
Exclusions:
Methods that do not sufficiently describe the input methods/sensors.
Studies that do not explicitly define the details and capabilities of the models/methods used.
Studies that do not explicitly define the feature extraction and selection methods used.
Studies that do not explicitly define the results and evaluation strategies thoroughly.
Works that limit their discussion on datasets used.
Documents such as technical reports, theses, and books.
Works that are not published in English language.

5 Cameras and Sensors for Scene Capture

This section summarizes the input methods in the group/interaction detection framework (Figure 8). The main input methods are cameras and sensors. There are different types of cameras used in the literature such as omnidirectional camera, helmet camera, robot camera, fisheye camera, and webcams. The camera sensor may be equipped with depth perception or provide only red green blue (RGB) images. The other main sensors found in the surveyed literature are audio sensors, blind sensors, radio-frequency identification (RFID) sensors, and so on. These are chosen based on the application areas and the working environment.

5.1 Camera Views

There are two different types of camera positioning used—ego-vision/ego-view (ego-centric) cameras for robotics, and exo-vision/exo-view (exo-centric) or global-view cameras (fixed on walls and ceilings) in indoor or outdoor environments (see Figure 9). Cameras are used for drone surveillance, robotic vision, and scene monitoring; in these cases, we work with the ego/exo views of the scene to detect group interactions.
Fig. 9.
Fig. 9. Different camera views of the same group/interaction in an indoor environment. On the left side, a robot has an ego-view camera, and on the right side, an exo-view or global-view camera is fixed on a wall. These images are produced in the Webots robotic simulator [45].
Ego-Centric View. Ego-centric refers to the first-person perspective, for example, images or videos captured by a wearable camera or a robot's camera. The captured data are focused on the part of the scene where the target objects are located. In [131], the robot's camera is used for capturing scenes, which is also referred to as a robot-centric view. In [52], the authors use first-person-view cameras for estimating the location and orientation of the people in a group. In [3], the authors use a low-temporal-resolution wearable camera for capturing group images.
Exo-Centric View. The exo-centric view is concerned with the third-person perspective or the top view, for example, images or videos captured by surveillance/monitoring cameras. One or many social interaction groups in a scene can be captured simultaneously from the top view. In [126], the authors use four cameras for detecting groups at a large scale; the method also detects changes in the target groups when they move closer to or farther from the cameras. In [72], experiments are done by capturing video with a camera positioned approximately 15 meters overhead. In [44], the images are captured using a fisheye camera mounted 7 meters above the floor.

5.2 Other Sensors

Sensors play a vital role in finding the relative distances of the people in a group, which helps in accurately predicting the type of f-formation. Researchers have used different types of sensors in the literature, such as depth sensors, laser sensors, audio or speech sensors, RFID, and Ultra-Wideband (UWB) sensors. In some cases, both cameras and other types of sensors are used simultaneously for detection. In [168], the authors use UWB localization beacons, a Kinect, and an audio sensor for detecting people and other entities, and RGB cameras for monitoring. The data for scenes are captured in the form of images and/or videos, depending on the method that uses the input for scene detection. Some instances of WiFi-based tracking [58] of humans are also visible in the literature. Other sensors [163] like iBeacon [83], 2-dimensional (2D) range sensors [94], and audio sensors [156] are also in use.
Figure 10 shows a taxonomy of cameras/vision sensors and other sensors used in the literature for scene capture. Table 4 categorizes the surveyed literature on the basis of camera views and sensors; it specifies, where available, the number of cameras used in each of the cited papers as well as the various cameras and sensors used in each paper.
Fig. 10.
Fig. 10. Taxonomy for cameras and sensors for scene capture. The leaf nodes give examples in each category.
Table 4.
Classification | Application Areas/Details | References
Ego-centric (first person view or robot view) [Section 5.1]
Application areas/details: Robotics and HRI; robot vision in telepresence; drone/robot surveillance
References:
[68], 1 [114], 1 [54], 1 [159], 1 [117], 2 Hamlet cameras and 1 robot camera [116], 1 [138], 1 [113], 1 [131], 1 [88], 1 [115], multi [77], 1 [4], 2 [75], 1 [158], 1 [161], 1 [3], 1 [149], 1 [81], 1 [10], multi [110], multi [52], multi [183], 1 [36], [82], multi [48], 1 [11], 1 [58], depth camera, RGB camera [56], an omni directional camera [184], multi [107, 111], 3 [90], 4 [62, 67], robot camera [74], 2 [12], [10], 2 [80], 1 [18, 21], 1 [94]
Exo-centric (global view) [Section 5.1]
Application areas/details: Social scene monitoring; Covid-19 social distancing monitoring; human interaction detection and analysis
References:
1 [117], multi [57], multi [113], 1 [16], [182], multi [7], multi [77], 1 [33], 4 [168], multi [165], 8 [136], 1 [75], multi [158], multi [167], 4 [129], 2 [186], multi [150], [170], multi [8], multi [152], [146], 1 [35], multi [146], multi [9], multi [110], [79], 1 [55], 4 [126], a single monocular camera [166], 3 overhead fish eye camera used for training classifier [71], multi [183], multi [52], 1 [36], 1 [27], multi [51], 1 [99], 4 [145], [93], 1 [162], multi [144], 7 [53], 4 [87], 1 [135], 2 [95], 1 [34], multi [111], 1 [44], 1 [100], multi [43], 1 [103], 1 [72], multi [185], 4+2 [62], 4 webcams [74], an omnidirectional camera [12], [122], [187]
Using other sensors [Section 5.2]
Application areas/details: Audio, sociometric badges, blind sensor, prime sensor, WiFi-based tracking, laser-based tracking, depth sensor, band radios, touch receptors, RFID sensors, smart phones, UWB beacon
References: [8], [48], Kinect depth sensor [51], [58], [56], [100], [161], speakers [127], wearable sensors [134], [33], [136], UWB localization beacons, Kinect [168], [80], [141], [143], RFID tag [75], [89], [51], [102], [107], Asus Xtion Pro sensor [114], ZED sensors [131], single worn accelerometer [57], Kinect sensor [138], Microsoft Speech SDK [118], speaker, Asus Xtion Pro live RGB-D sensor [16], Kinect [77], motion tracker [136], sociometric badges [7], RGB-D sensor [182], tablets [161], tablets [143], mobile sensors [158], microphone, IR beam and detector, bluetooth detector, accelerometer [8], touch sensor [107], range sensor [184], laser sensors [102], Wi-Fi based tracking, laser-based tracking [58], PrimeSensor, Microsoft Kinect, microphone [99], RFID sensors [47], blind sensor, location beacon [48], single worn accelerometer [71], [97], gaze animation controller [118], [148], grid world environment [104], ethnography method [96], ibeacon [83], 2D Range sensing [94], audio [156]
Other relevant literature: [46, 47, 60, 63, 78, 84, 112, 163]
Table 4. Classification Based on Camera View and Other Sensors for Group/Interaction and F-formation Detection

6 Categorization of Methods/Techniques

Many f-formation detection methods have been proposed in the literature. In this article, we broadly categorize these methods into two classes: (a) rule-based methods (fixed rules, assumptions, and geometric reasoning), like conventional image processing and vision techniques, and (b) learning-based (or data-driven) methods, which have come to prominence in the recent past. Multimedia and visual analytics [120] over big data remain lucrative tools for large-scale f-formation detection and group interaction analysis.
In group discussions, people stand in positions where the conversation can happen effectively. Kendon [84] proposed a formal structure of group proxemics among the interacting people in a formation (described in Section 2). The work in [122] provides a dataset of top-view images capturing such formations and subsequently proposes methods for analyzing the dataset and detecting interacting and conversational groups. In [145], the authors use the Hough-voting approach with a two-step algorithm—(1) fixed-cardinality group detection and (2) group merging—and use these two steps to detect the type of f-formation. In [53], the experiment uses a heat-map-based method for recognizing human activity together with a best-view camera selection method. In [122], graph-cut f-formation (GCFF) is used for detecting f-formations in static images by clustering graphs with graph-cut algorithms. Yasuharu Den [46] notes that formations also depend on the social organization and environment, and explains formations with outsiders, where people stand based on their position. In [117], three constraint-based formations are considered, namely triangle, rectangle, and semi-circular. The authors use a game-theoretic model over the position and orientation information of people to detect groups in the scene; for checking the formation, they use an algorithm proposed by Vascon et al. [166, 167] that generates the 2D frustum of the position and orientation of the people in the group. In [115], the authors use the Haar cascade face detector algorithm to detect the faces and eyes of people; based on the face and eye detections, the method determines how many frontal, right, and/or left faces are present and then decides the formation. In [88], the Haar cascade classifier is used with a quadrant methodology, which determines a person's facing direction by looking at where the eyes are located and in which quadrant. In [53], the authors use a new method to find the dominant sets (DS) and then compare it with modularity cut, but this method is applicable only when everyone is standing. Hedayati et al. [67] state that a robot needs a finer characterization of f-formations and introduce tightness and symmetry as two such characterizations. In [127], Raman et al. describe that another such finer characterization might be the occurrence of multiple conversations within an f-formation, making the argument for detecting “conversation floors.” Their method uses speaking turns to indicate the existence of distinct conversation floors and estimates the presence of voice, but it cannot detect silent (inactive) participants. In [134], proximity and acceleration data are used, and pairwise representations are fed to a long short-term memory (LSTM) network to identify the presence of an interaction and the roles of participants in it; however, using a fixed threshold for identifying speakers can cause mislabeling in some instances. In [10], a structural support vector machine (SVM) is used to learn how to treat the distance and pose information, and a correlation clustering algorithm is used to predict group compositions. Furthermore, a tracking-learning-detection (TLD) tracker is used for blur detection in ego-vision images, but such trackers cannot perform detection when the target moves out of the camera's field of view. In [131], the method uses ego-centric pedestrian detection: the pedestrian detector generates bounding boxes (BBs), optical flow is used to estimate motion between consecutive image frames, and groups are detected via joint pedestrian proximity and motion estimation. In [186], the method first detects a group with a group detector and then uses a trained classifier to differentiate the people involved in the group. Some researchers use pedestrian detection, vision-based algorithms, and pose-estimation algorithms to detect groups [165]. In [165], the authors use BP for handling f-formation detection and for jointly estimating the f-formation and the targets' head and body orientations; they also use multiple occlusion-adaptive classifiers. There are many more methods, each with its own strengths and weaknesses. Figure 11 presents a taxonomy of the different methods/techniques/approaches surveyed in this article for group/interaction/formation detection. Moreover, some works also attend to the navigation and joining aspects of an already detected group/interaction for applications in robotics and HRI, such as [105, 132, 164, 171].
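As a hedged illustration of the general idea behind the Haar-cascade-based approaches mentioned above (and explicitly not the implementation of [88] or [115]), the sketch below counts frontal and profile faces in a frame using the stock cascades that ship with the opencv-python distribution; the counts can then serve as a crude cue for the formation type. The example heuristic in the closing comment is an assumption for illustration only.

```python
import cv2

# Stock cascades bundled with opencv-python (available under cv2.data.haarcascades).
frontal = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
profile = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_profileface.xml")

def face_orientation_counts(bgr_frame):
    """Count frontal faces and profiles facing either way in a single BGR frame."""
    gray = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2GRAY)
    n_frontal = len(frontal.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))
    # The stock profile cascade is one-sided, so a horizontally flipped frame
    # catches profiles turned the other way.
    n_side_a = len(profile.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5))
    n_side_b = len(profile.detectMultiScale(cv2.flip(gray, 1), scaleFactor=1.1, minNeighbors=5))
    return {"frontal": n_frontal, "profile_a": n_side_a, "profile_b": n_side_b}

# Example heuristic (assumption): two frontal faces seen from outside a group often
# suggest a side-by-side arrangement, while one profile of each direction suggests a
# vis-a-vis pair. Real systems combine such counts with position and eye cues.
```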
Fig. 11.
Fig. 11. Taxonomy for methods and approaches used for group and interaction detection.

6.1 Rule-Based Method

We categorize as rule-based those methods that employ pre-defined rules, geometric assumptions, and reasoning. Rule-based methods are designed around well-known social behaviors and geometric properties and are often intuitive. In the absence of any learning paradigm, these algorithms rely purely on a static set of rules that are assumed to hold for a particular group situation (see Figure 12).
Fig. 12.
Fig. 12. Generic framework for a rule-based AI method of detecting group interactions and formations.
In the following, we list down the most popular rule-based methods that report a decent accuracy in detecting human groups.

6.1.1 Fixed Rules Based

Voting-Based Approach (2013). This approach detects and localizes groups by finding matches based on exemplars. The authors in [93] suggest that this method works on agents, so it is very flexible for different multi-agent scenarios. The results show that the method is effective for groups of up to four agents, and it is evaluated with people only, without robots. The computational complexity of this method is low; hence, it runs in real time and its accuracy is very good.
Head and Body Pose Estimation (HBPE) (2015) [152]. This method uses a joint learning framework for estimating head and body orientations, which in turn are used for estimating f-formations. It is evaluated with people in a scene without any robots. For evaluation, the authors use the mean angular error for HBPE and the F1-score for f-formation estimation. The method is compared with the Hough Voting for f-formation (HVFF) method; though the results are broadly similar, this method is slightly more accurate and has a higher F1-score.

6.1.2 Geometric Reasoning Based

GCFF (2015) [146]. The GCFF approach first finds the o-space and uses individual positions to identify the orientational formation. This method is tested on a synthetic scenario and compared with other methods such as the Inter-Relation Pattern Matrix (IRPM), DS, Interacting Group Discovery (IGD), Game Theory for Conversational Groups, and HVFF. It improves over these approaches not only in precision but also in recall. It also performs well in detecting people and their orientations without errors. The results are evaluated with people only.
GROUP (2015) [170]. The GROUP algorithm detects f-formations based on the lower-body orientation distributions of people in the scene and returns a set of free-standing conversational groups at each time step. First, it analyzes the maximum description length (MDL) parameter: the higher the MDL parameter, the larger the radius within which people are grouped together. This method can also detect non-interacting people as outliers. It is evaluated with people only, without any robots in the scene. Its computational complexity has been compared with state-of-the-art methods.
Approach Planner (2018) [138]. The Approach Planner enables a robot to navigate/plan based on the natural approaching behavior of humans toward a target person, and can replicate human tendencies when approaching. The evaluation is based on parameters derived from skeletal information.
Game-Theoretic Model (2019) [117, 166]. The approach constructs a 2D frustum for each virtual agent and robot from their positions and orientations, and then computes an affinity matrix. The method is evaluated both quantitatively and qualitatively. It is efficient for teleoperated robots that follow f-formations while joining groups automatically, and it also accounts for the fact that the formation changes when new people/robots join an existing group. The evaluation is carried out in a simulation environment.
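A minimal sketch of the frustum-affinity idea follows (our simplification, not the implementation of Vascon et al. [166, 167]); the frustum depth, aperture, and grid cell size are assumed values, and the overlap is estimated by sampling.

```python
import numpy as np

def frustum_samples(pos, theta, depth=1.5, aperture=np.deg2rad(80), n=400, seed=0):
    """Sample points inside a person's 2D attention frustum,
    modelled as a circular sector in front of the person."""
    rng = np.random.default_rng(seed)
    r = depth * np.sqrt(rng.random(n))                  # uniform over the sector area
    a = theta + aperture * (rng.random(n) - 0.5)
    return np.asarray(pos, float) + np.stack([r * np.cos(a), r * np.sin(a)], axis=1)

def affinity_matrix(positions, orientations, cell=0.25):
    """Affinity of two people = Jaccard overlap of the grid cells covered by
    their frustum samples (a crude estimate of frustum intersection)."""
    samples = [frustum_samples(p, t, seed=i)
               for i, (p, t) in enumerate(zip(positions, orientations))]
    cells = [set(map(tuple, np.floor(s / cell).astype(int))) for s in samples]
    n = len(positions)
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            inter = len(cells[i] & cells[j])
            union = len(cells[i] | cells[j])
            A[i, j] = A[j, i] = inter / union if union else 0.0
    return A

# Two people facing each other plus a third person looking away
A = affinity_matrix([(0, 0), (1.2, 0), (3, 0)], [0.0, np.pi, 0.0])
print(np.round(A, 2))   # A[0, 1] is clearly positive; entries involving person 2 are ~0
```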
More methods are mentioned in the supplementary material.

6.2 ML-Based Method

ML-based methods are generally data-driven models for which researchers have explored different algorithms; DL methods are treated here as a special case of ML. The primary learning paradigms used are supervised, unsupervised, semi-supervised, and reinforcement learning. A generic system that uses an ML algorithm to detect f-formations is shown in Figure 13. In the following, we describe the various ML-based approaches.
Fig. 13.
Fig. 13. Generic framework for a learning-based AI method of detecting group interactions and formations.
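To make the generic pipeline of Figure 13 concrete, the sketch below is a toy end-to-end example (entirely illustrative; the features, synthetic training data, and thresholds are our assumptions): pairwise geometric features feed a "same group" classifier, and groups are recovered as connected components of the predicted positive pairs, which is the general shape shared by several of the supervised methods discussed next.

```python
import numpy as np
from itertools import combinations
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.linear_model import LogisticRegression

def pair_features(p_i, t_i, p_j, t_j):
    """Pairwise geometric features: distance and each person's body orientation
    relative to the line joining the pair (illustrative choices)."""
    p_i, p_j = np.asarray(p_i, float), np.asarray(p_j, float)
    d = np.linalg.norm(p_j - p_i)
    bearing = np.arctan2(p_j[1] - p_i[1], p_j[0] - p_i[0])
    wrap = lambda a: np.abs(np.arctan2(np.sin(a), np.cos(a)))
    return [d, wrap(t_i - bearing), wrap(t_j - bearing - np.pi)]

# Tiny synthetic training set: half the pairs are close and mutually facing ("same group")
rng = np.random.default_rng(0)
X, y = [], []
for k in range(600):
    p1, t1 = rng.uniform(-3, 3, 2), rng.uniform(-np.pi, np.pi)
    if k % 2 == 0:                                     # positive (same-group) pair
        direction = t1 + rng.normal(0, 0.3)
        p2 = p1 + rng.uniform(0.6, 1.6) * np.array([np.cos(direction), np.sin(direction)])
        t2 = t1 + np.pi + rng.normal(0, 0.3)           # roughly facing back
        label = 1
    else:                                              # negative pair: an unrelated person
        p2, t2 = rng.uniform(-3, 3, 2), rng.uniform(-np.pi, np.pi)
        label = 0
    X.append(pair_features(p1, t1, p2, t2))
    y.append(label)
clf = LogisticRegression(max_iter=1000).fit(X, y)

def detect_groups(positions, orientations):
    """Classify every pair, then take connected components of the positive pairs."""
    n = len(positions)
    rows, cols = [], []
    for i, j in combinations(range(n), 2):
        x = pair_features(positions[i], orientations[i], positions[j], orientations[j])
        if clf.predict([x])[0] == 1:
            rows += [i, j]
            cols += [j, i]
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    return connected_components(adj, directed=False)[1]   # one group label per person

print(detect_groups([(0, 0), (1.2, 0), (4, 4)], [0.0, np.pi, 0.0]))  # e.g. [0 0 1]
```

In real systems, the synthetic training loop is replaced by pairwise examples extracted from annotated scenes, and the classifier by whichever model a given paper proposes.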

6.2.1 Supervised Approaches

IR Tracking Method with SVM (2010) [62]. With the help of IR tracking, social interactions can be classified as either existing or non-existing using geometric social signals. The authors train and test several classifiers, such as SVM, Gaussian Mixture Models, and Naive Bayes. IR tracking with an SVM classifier is shown to achieve better accuracy than the other classifiers.
Novel Framework (2013) [27]. This approach uses the Subjective View Frustum, which encodes the visual field of a person in a 3D environment, as the main feature and the IRPM as an evaluation tool. For tracking, a Hybrid Joint-Separable filter is used; the tracker gives the position of the head and feet of each person. The evaluation is computational, comparing against counterparts in terms of accuracy/efficiency.
GIZ Detection (2015) [35]. This method detects groups based on proxemics. Group Interaction Energy feature, Attraction and Repulsion Features, Granger Causality Test, and Additional Features are proposed in this method. Tests are also conducted by combining these features. This method allows people to be connected loosely. The evaluation is done on the basis of computational accuracy and efficiency.
3D Skeleton Reconstruction Using Patch Trajectory (2017) [77]. This algorithm works in two stages. First, it takes images from different views as input and produces 3D body skeletal proposals for people using 2D pose detection. Second, it refines those proposals using a 3D patch trajectory stream and provides temporally stable 3D skeletons. The authors evaluate the method quantitatively and qualitatively, reporting an accuracy of 99%. Its limitations lie in the dependency on 2D pose detection and the computational time complexity.
Learning Methods for Head Pose Estimation (HPE) and Body Pose Estimation (BPE) (2017) [165]. This method uses a joint learning framework for estimating targets' head and body orientations and the f-formations of conversational groups. The evaluation metrics used are the HP error, BP error, and f-formation F1 score. The method is compared with IRPM, IGD, Hough-transform-based (HVFF), Graph Cut (GC), and Game-Theoretic methods, and the results do not differ much in percentage accuracy.
Method Using Group Based Meta-Classifier Learning Using Local Neighborhood Training (GAMUT) (2018) [57]. This method aims at estimating the f-formation membership of each pair of people in a group as a pairwise relationship in the scene. It works in two steps: preprocessing and GAMUT. In the preprocessing stage, raw tri-axial acceleration signals are converted to pairwise feature representations, which are used as samples in GAMUT. In the GAMUT stage, the same local neighborhood size is used per window size. The results are computationally evaluated in terms of detection accuracy and efficiency.
Bagged Tree (2019) [67]. The proposed algorithm works in three steps: dataset deconstruction, pairwise classification, and reconstruction. The authors evaluate the algorithm with three ML classification models (weighted KNN, bagged trees, and logistic regression), where the bagged-tree model achieves better pairwise accuracy, precision, recall, and F1-score. However, this method still needs to be trained on larger datasets and richer features. The evaluation includes a human experience study with robots. More methods are mentioned in the supplementary material.
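All three classifiers compared in [67] are available off the shelf; a hedged sketch of such a pairwise comparison could look as follows, where the feature matrix is a synthetic placeholder (the actual pairwise features of [67] are not reproduced here).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier          # bags decision trees by default
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Placeholder pairwise features/labels; in practice these come from annotated scenes
X_pairs, y_pairs = make_classification(n_samples=600, n_features=6, random_state=0)

models = {
    "weighted kNN": KNeighborsClassifier(weights="distance"),
    "bagged trees": BaggingClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
metrics = ["accuracy", "precision", "recall", "f1"]
for name, model in models.items():
    scores = cross_validate(model, X_pairs, y_pairs, cv=5, scoring=metrics)
    print(name, {m: round(float(scores[f"test_{m}"].mean()), 3) for m in metrics})
```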

6.2.2 Unsupervised Approaches

Graph-Based Clustering Method (2011) [72], (2013) [162]. In [72], the authors use the "socially motivated estimate of focus orientation" feature to estimate body orientation, which in turn is used to estimate f-formations. The method has been compared with a modularity cut method, and the evaluation is based on computational complexity. This approach struggles in scenarios where people move within the group and/or join or leave the group.
Robot-Centric Group Estimation Model (RoboGEM) (2019) [159]. RoboGEM is an unsupervised algorithm that detects groups from an ego-centric view. It works using three main modules: a pedestrian detection module "P," a pedestrian motion estimation module "V," and a group detection module "G." In the first module, an off-the-shelf pedestrian detector (YOLO) provides a BB for each person in the image. In the second module, V is estimated using optical flow. In the last module, human groups are detected using joint motion and proximity estimation. The authors compared this method with existing approaches using Intersection-over-Union, false positives per image, and depth threshold metrics. The evaluation includes a human experience study with robots.
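The three-module structure (detection P, motion estimation V, grouping G) can be sketched as follows; this is only an illustration, where the bounding boxes are assumed to come from any off-the-shelf detector and the pixel thresholds are assumed values, not those of the original paper.

```python
import cv2
import numpy as np

def egocentric_groups(prev_gray, gray, boxes, pos_thr=80, vel_thr=3.0):
    """Illustrative sketch of the V and G modules of a RoboGEM-like pipeline.
    prev_gray/gray: consecutive grayscale frames; boxes: (x, y, w, h) pedestrian
    detections from module P (any off-the-shelf detector)."""
    # Module V: one dense optical-flow field per frame pair, averaged inside each box
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    motions, centres = [], []
    for (x, y, w, h) in boxes:
        motions.append(flow[y:y + h, x:x + w].reshape(-1, 2).mean(axis=0))
        centres.append(np.array([x + w / 2.0, y + h / 2.0]))

    # Module G: link detections that are close in the image and move alike,
    # then merge linked labels (a naive connected-components pass)
    labels = list(range(len(boxes)))
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if (np.linalg.norm(centres[i] - centres[j]) < pos_thr and
                    np.linalg.norm(motions[i] - motions[j]) < vel_thr):
                old, new = labels[j], labels[i]
                labels = [new if lab == old else lab for lab in labels]
    return labels   # detections sharing a label form one hypothesised group
```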
In [162], the authors build a graph representation from the 3D trajectories of people and their head poses. Using a graph-clustering algorithm, they discover social interaction groups, and an SVM classifier is used for learning and classifying the group activities. The evaluation shows that it outperforms previous methods, and a human experience study is also performed in robotic scenarios. This approach not only recognizes or detects particular group activities but also predicts direct links between the people in each group.
Method Based on Pedestrian Motion Estimation (2018) [131]. This method works in three parts: ego-centric pedestrian detection, pedestrian motion estimation, and group detection using joint motion and proximity estimation. The pedestrian detector produces BBs with two features, the position of each pedestrian and the size of the BB. Optical flow is used for motion estimation. Then, joint pedestrian proximity and motion estimation are used to detect groups while considering the depth data. The evaluation is a real-life human experience study involving robots and humans. More methods are mentioned in the supplementary material.

6.2.3 Mixed Approaches

DANTE (2020) [153]. DANTE detects groups having conversations using a data-driven approach to identify spatially viable social interactions. A Deep Affinity Network is designed to predict the likelihood of two individuals being in the same group interaction in a given scene, taking the social context into account. The predicted affinities between entities (i.e., people) in the graph are then used to perform graph clustering to recognize groups of different sizes.
Based on Spatio-Temporal Context (2022) [154]. The method is based on a dynamic LSTM-based DL model. It also predicts affinity values between two people, indicating the likelihood that they are part of the same group. The affinity is predicted in a continuous manner, enabling the detection of dynamics in any group formation process. Finally, DS are extracted using graph clustering to identify the final conversational groups.
Geometry-Based and Data-Driven Methods (2022) [169]. The first method encodes geometry-based information about groups and interactions implicitly. The second one, which is data-driven in nature, uses Graph Neural Networks (GNNs) and adversarial learning to model the spatial arrangements and their properties explicitly.
Apart from the above-mentioned methods, various works also show the emergence of GNNs for group and interaction detection. The specialty of GNNs lies in their representation of data as graphs, with nodes and edges denoting entities and relationships, respectively. Some of the recent works utilizing GNN-based approaches can be found in [70, 147, 160, 188]. More methods are mentioned in the supplementary material.
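Since these are families of methods rather than a single algorithm, the following is only a schematic PyTorch sketch of the shared pattern: per-person features are refined by a simple message-passing layer over a proximity graph, and pairwise affinities are read off the refined embeddings. The layer sizes, input features, and proximity threshold are assumptions, and the network is shown untrained.

```python
import torch
import torch.nn as nn

class TinyGraphLayer(nn.Module):
    """One message-passing step: H' = ReLU((A_norm @ H) W)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)

    def forward(self, H, A):
        deg = A.sum(dim=1, keepdim=True).clamp(min=1.0)
        return torch.relu(self.lin((A / deg) @ H))

class GroupAffinityNet(nn.Module):
    """Refine per-person features on a proximity graph, then score every pair."""
    def __init__(self, d_in=4, d_hid=32):
        super().__init__()
        self.gnn = TinyGraphLayer(d_in, d_hid)
        self.score = nn.Linear(2 * d_hid, 1)

    def forward(self, H, A):
        Z = self.gnn(H, A)                                   # (n, d_hid)
        n = Z.shape[0]
        zi = Z.unsqueeze(1).expand(n, n, -1)
        zj = Z.unsqueeze(0).expand(n, n, -1)
        return torch.sigmoid(self.score(torch.cat([zi, zj], dim=-1))).squeeze(-1)

# Node features: (x, y, cos(theta), sin(theta)); edges: people within 2 m of each other
pos = torch.tensor([[0.0, 0.0], [1.2, 0.0], [4.0, 4.0]])
theta = torch.tensor([0.0, torch.pi, 0.0])
H = torch.cat([pos, torch.cos(theta)[:, None], torch.sin(theta)[:, None]], dim=1)
A = (torch.cdist(pos, pos) < 2.0).float()
affinity = GroupAffinityNet()(H, A)   # (3, 3) pairwise affinities (untrained network)
```

In practice, such a network would be trained on annotated pairwise labels and the resulting affinity matrix clustered (e.g., with dominant sets or graph cuts) to produce the final groups.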
Table 5 lists the surveyed papers on the basis of rule-based and learning-based AI approaches. From the algorithmic trends, it is evident that learning-based approaches have become slightly more predominant in recent years, although both families of methods have been explored consistently over the years. Learning-based methods tend to be more accurate than their rule-based counterparts; examples of such methods include [72], [99], [77], [131], [67], and [159].
Table 5.
Classification | References
Classical rule-based AI methods [fixed-model-based learning and prediction based on certain geometric assumptions and/or reasoning (Section 6.1)] | Approach behavior [141], sociologically principled method [43], proposed model [148], The Compensation (or Equilibrium) Model, The Reciprocity Model, The Attraction-Mediation Model, The Attraction-Transformation Model [103], rapid ethnography method [96], digital ethnography [111], GroupTogether system [95], museum guide robot system [184], extended f-formation system [53], Multi-scale Hough voting approach [145], Hough for f-Formations, Dominant-sets for f-Formations [144], PolySocial Reality-F-formation [47], two-Gaussian mixture model, O-space model [60], Wi-fi based tracking, laser-based tracking, vision-based tracking [58], heat map based f-formation representation [51, 166], group tracking and behavior recognition [55], search-based method [52], estimating positions and orientations of lower bodies [183], Kendon's diagramming practice [110], GROUP [170], GCFF [146], [81], HBPE [152], Link Method, Interpersonal Synchrony Method [129], Frustum of attention modeling [167], f-formation as dominant set model [186], HRI motion planning system [182], footing behavior models (Spatial-Reorientation Model, Eye-Gaze Model) [118], matrix completion for head and body pose estimation (MC-HBPE) method [7], [136], Haar cascade face detector algorithm [88], Haar cascade face detector algorithm [115], [168], Approaching method [138], Measuring Workload Method [97, 116, 117], conversation floors estimation using f-formation [127], f-formation as dominant set model [187], geometry based [169]
ML-based AI methods [data-driven models for learning and prediction using supervised, semi-supervised, unsupervised, and reinforcement learning (ML/DL) or any such techniques (Section 6.2)] | [12], IR tracking techniques [62], SVM classifier [185], Grid World Scenario [104], graph-based clustering method [34, 44, 72], Hidden Markov Model (HMM) [56], proposed method with o-space and without o-space (SVM) [135], Region-based approach with level set method [87], IRPM [27], graph-based clustering algorithm [162], voting based approach [93], HMMs [99], SVM [36], Transfer Learning approaches [126], method with HMM [71], head pose estimation technique [11], [10], MC-HBPE [9], GIZ detection [35], [8], Supervised Correlation Clustering through Structured Learning [150], LSTM, Hough-Voting (HVFF) [3], LSTM [4], 3D skeleton reconstruction using patch trajectory [77], Human aware motion planner [33], [41], Learning Methods for HPE and BPE [165], Group bAsed Meta-classifier learning Using local neighborhood Training (GAMUT) [57], Group detection method [131], [67], LSTM network [134], Robot-Centric Group Estimation Model (RoboGEM) [159], Multi-Person 2D pose estimation [114], SSBL [54], multi-class SVM classifier [18], GNN based [70, 147, 160, 188], DANTE [153], Based on Spatio-Temporal Context [154], Geometry based and data-driven methods [169]
Other studies | WoZ [74], WoZ paradigm [15], WoZ [90]
Table 5. Classification Based on Approach/Method for Group and F-formation Detection
From the survey, it can be established that unsatisfactory accuracy is seen more often in rule-based approaches than in learning-based models. As expected, the main reason for low accuracy in general lies in the inability of the methods to accurately detect dynamic groups as well as multiple groups in the scene. One interesting observation is that accuracy is also largely impacted by the camera view: low accuracy is mostly seen with exo-vision input methods and datasets. Similarly, real-time performance is another issue for methods dealing with dynamic and multiple groups; here, there is no prominent impact of camera views, datasets, or method types. Readers may refer to the online Appendix, which contains a detailed comparison of methods and techniques under rule-based static AI approaches in Table 1 and ML-based approaches in Table 2.

7 Datasets

Table 6 comprehensively lists all the surveyed datasets in the literature. A total of 71 datasets are mentioned, of which only 15 are publicly available, 51 are private to the authors/researchers, and 5 are not known. Fifteen of the datasets have outdoor scenes (mostly captures from public areas), 42 have indoor scenes, and only 5 of them have both types of scenes. Thirty-six of the datasets contain multiple-group scenarios, 20 contain single-group scenes, and 15 are not known. Ego-vision scenes of groups and interactions are seen in 20 datasets, whereas 39 datasets have exo-vision or global-view images of groups, 1 dataset has both, and the camera view is not known for 11 datasets. Figure 14 gives a comprehensive idea of the taxonomy of datasets (training/testing) generally used in group/interaction and formation detection tasks.
Table 6.
Dataset | View (Ego/Exo) | Single/Multiple group(s) | Indoor/Outdoor [area] | Availability (Public/Private)
TUD Stadtmitte [13] | Ego | Multi-gp | outdoor [public] | private
HumanEva II [13] | Ego | Multi-gp | indoor | private
SALSA [174] | Exo | Multi-gp | indoor | public
BEHAVE database [109] | Exo | Multi-gp | outdoor [public] | public
TUD Multiview Pedestrians [13] | Exo | Multi-gp | outdoor [public] | private
CHILL [34] | Exo | Multi-gp | - | -
Benfold [28] | - | - | - | -
MetroStation [34] | Exo | Multi-gp | indoor [public] | private
TownCentre [29] | Exo | Multi-gp | outdoor [public] | private
Indoor [34] | Exo | Multi-gp | indoor | private
SI (Social Interactions) [101] | Exo | Multi-gp | outdoor [public] | public
Coffee-room scenario [27] | Exo | Multi-gp | indoor | private
CoffeeBreak [42] | Exo | Multi-gp | outdoor [private] | public
Collective Activity [14] | Ego | Multi-gp | outdoor/indoor | private
PETS 2007 (S07 dataset) [27] | Exo | Multi-gp | indoor [public] | private
Structured Group Dataset [172] | Exo | Multi-gp | indoor/outdoor [public] | public
EGO-GROUP [5] | Ego | Multi-gp | indoor/outdoor | public
EGO-HPE [6] | Ego | Multi-gp | indoor/outdoor | public
Mingling [71] | Exo | Multi-gp | indoor | private
MatchNMingle [32] | Exo | Multi-gp | indoor | public
CLEAR [151] | Exo | Single-gp | indoor | private
Greece [126] | Exo | Multi-gp | indoor | private
DPOSE [125] | Exo | Multi-gp | indoor | private
BIWI Walking Pedestrians [119] | Exo | Multi-gp | outdoor [public] | private
Crowds-By-Examples [92] | Exo | Multi-gp | outdoor [public] | private
Vittorio Emanuele II Gallery [17] | Exo | Multi-gp | indoor [public] | private
UoL-3D Social Interaction [40] | Ego | Single-gp | indoor | public
Cocktail Party [173] | Exo | Multi-gp | indoor | public
Social Interaction [39] | - | Single-gp | indoor | public
GDet [23] | Two monocular cameras, located on opposite angles of a room | - | indoor | public
IPD [76] | Exo | Multi-gp | outdoor | public
Classroom Interaction Database [93] | Exo | Multi-gp | indoor | private
Caltech Resident-Intruder Mouse [31] | - | - | - | -
UT-Interaction [137] | Exo | Multi-gp | outdoor | private
PosterData [72] | Exo | Multi-gp | outdoor | private
Friends Meet [24, 26] | Exo | Multi-gp | outdoor | public
Discovering Groups of People in Images (DGPI) [37] | Exo | Multi-gp | indoor | private
Prima head pose image [61] | Ego | Single-gp | indoor | private
NUS-HGA [35] | - | Single-gp | indoor | private
[62] | Exo | Single-gp | indoor | private
[185] | Exo | Single-gp | indoor | private
[72] | Exo | Multi-gp | indoor | private
[44] | Exo | Multi-gp | outdoor [public] | private
[104] | Exo | - | - | -
[56] | Ego | Single-gp | indoor | private
Dataset using Narrative camera [106] | Ego | Single-gp | indoor | private
[4] | Ego | Single-gp | indoor/outdoor [public] | private
[57] | Exo | Multi-gp | indoor | private
[131] | Ego | Multi-gp | outdoor | private
Laboratory-based dataset containing distance measures at three key distances, one laboratory-based dataset with distance measures from three predefined distances, dataset with distance measurements collected in a crowded open space [114] | - | - | - | private
RGB-D pedestrian dataset [159] | Ego | Multi-gp | outdoor [public] | private
[74] | - | - | indoor | private
[141] | - | - | indoor | private
[148] | - | - | indoor | private
[96] | - | - | indoor | private
[103] | Ego | Single-gp | indoor | private
Youtube videos [110, 111] | - | - | indoor | private
[95] | Exo | Single-gp | indoor | private
[53] | Exo | - | indoor | private
In shopping mall [58] | Ego | - | - | private
[51] | Exo | Single-gp | indoor | private
[25] | Ego | - | - | private
[15] | Ego | Single-gp | indoor | private
DGPI dataset [167] | - | - | - | -
[182] | Exo | Single-gp | indoor | private
[136] | Ego, Exo | Single-gp | indoor | private
[168] | Ego | Single-gp | indoor | private
[88] | Ego | Single-gp | indoor | private
[138] | Ego | Single-gp | indoor | private
[97] | Ego | - | - | private
Babble [65, 66] | Exo | Single-gp | indoor [public] | public
Table 6. Comprehensive List of Datasets (Training/Testing) and their types, used in Group/Interaction and Formation Detection Methods Surveyed in the Literature
Fig. 14.
Fig. 14. Taxonomy for datasets surveyed for groups/interactions and formation detection.
One of the most important considerations in handling vision-based datasets involving people and their personal information, such as interactions/conversations/discussions with other people in indoor/outdoor scenes, is the set of ethical issues/concerns and implications [73]. Here, it is evident that the privacy and protection of people's visual data are at stake. There are multiple facets to this problem. Data collection itself is a huge challenge: to create a good dataset for learning models to detect human groups and f-formations, we need visuals from diverse situations. So, as researchers, the very basic question we need to answer concerns the permission to collect, utilize, and share/distribute such data for research purposes.
Collecting such vision data of humans requires consent from the people involved in the scene. Some jurisdictions have legal procedures for this; however, even where no such legal requirement exists, it is good practice to obtain consent. Similar consent also needs to be obtained when distributing or sharing such data with other users and researchers for further exploration and learning-model design. Here, the consent of the people visible in the dataset, along with the consent of the original creator of the dataset, is required. This may be the reason behind the small number of publicly available datasets in this domain. Another facet is the annotation of these datasets. Generally, researchers incorporate weak annotations in such human-based vision datasets; no personal information is involved in such annotations, and they are anonymous in nature. This can be a good way to comply with ethical regulations.
Another facet is the in-the-wild versus in-the-lab setting for f-formation and human group/interaction detection. In lab settings, it can usually be assumed that the participating people have consented to the collection, usage, and sharing of the visual data, so ego-vision camera captures are also easy in such cases. In wild settings, however, data are often collected without the participating people knowing that they have been captured visually. Here, serious ethical concerns arise, which need to be validated and tracked; this also leads to fewer in-the-wild datasets with ego-centric cameras. With exo-centric cameras, privacy preservation is comparatively easy, whereas it is very hard to preserve privacy for ego vision in in-the-wild setups.
Next is the question of potential bias and discrimination in the dataset, i.e., the non-inclusion of data from underrepresented communities. A major ethical concern in today's world lies in the diversity and inclusion facet, considering factors like age, ethnicity, and gender. Most of the datasets are created from scenes in western countries; there should be a mix of data from diverse communities across the globe, including Asian and African countries. Another important consideration is the inclusion of people with both darker and lighter skin tones in the vision data, since such underrepresentation can cause the learned models to predict wrong classes. Age also needs to be considered in these datasets. Although human groups/interactions in social setups may not include age groups like infants and children, there is a strong possibility of mixtures of teenagers and adults. Finally, the equal inclusion of both male and female participants in such data is also a concern to ponder: the datasets available today have a much higher share of male subjects, and fairness in such cases is important.
Organizations exist to address such concerns in different geographies, such as the US or Europe. For instance, an Institutional Review Board or Research Ethics Board is a council (specifically in the US) that enforces ethics in research activities by reviewing the entire research pipeline. It approves or rejects any research and data collection involving humans that is not ethical, and it also monitors and reviews the progress of such research. Another example is the General Data Protection Regulation in the European Union (EU) jurisdiction. This is part of EU human rights law on data privacy and protection; according to it, it is unethical and illegal to transfer personal data out of the EU unless specifically permitted. The board monitors these concerns strictly, allowing individuals to have full control and rights over their personal data. This is another reason that vision data involving humans are rarely distributed or shared on public domains and platforms.

8 Categorization of Detection Capabilities and Scale

This section puts forward a categorization of the surveyed literature on the basis of group/interaction detection capabilities and scale for a method. After surveying the literature and the methods, it is evident that detection of groups in scenes is a non-trivial task and many factors are to be considered in the process as well. In real-life scenes, there can be both static groups of people interacting without much movement and there can also be groups with constant movement. There can be cases like group members leaving a group or new members joining a group. Group dynamics also is an important factor to be considered. People may sway or move their bodies occasionally too. Apart from these, methods also need to consider a single group or multiple existing groups in a scene. Outliers to one group can be a part of another group or can be noise at the global level. Figure 15 depicts a taxonomy of group detection in interaction scenarios in real-life cases.
Fig. 15.
Fig. 15. Taxonomy for detection capability and scale for groups/interactions and formations.

8.1 Detection Capability

Methods need to attend to both static and dynamic groups in interactions and formations. Here, we categorize this aspect.
Static Group Scene. In static group scenes, we consider f-formation detection in the absence of any temporal information affecting the group's categorization. We consider statically captured data, such as a single image capturing a group, and other sources of data that do not capture slight changes in head/body pose and orientation, e.g., sonar. It is generally easier to detect groups and classify f-formations in such scenarios. Static group scenes also imply that the people interacting in a group or formation do not change groups and that new people do not join a group while the interaction is in progress. In [36] and [114], a single image from a single egocentric camera is used for detection. Static groups are mostly found in indoor scenes like conferences, group discussions, coffee breaks, and meetings.
Dynamic Group Scene. Dynamic group detection essentially means two things: either the group detection happens on temporal data, or the group itself can change over time, which is the case in most real environments. In a dynamic scene, people tend to move in groups, also referred to as group dynamics. New people can join a group and/or existing people can leave it. Also, some people participating in an interaction may temporarily change their head/body pose and orientation a bit; this does not necessarily mean that the formation has changed. In such cases, it becomes very difficult for an algorithm to detect the group or formation in the interaction scenario. As a result, methods need to consider temporal information, utilizing a sequence of images (an image stream) taken over a particular period of time. Here, we also would like to mention a categorization of detection capability on the basis of datasets, scene dynamism (as discussed in Section 8.1), and number of groups (as discussed in Section 8.2). Normally, detection methods are tried and tested in indoor or lab settings, and the datasets used for training the learning models in data-driven methods are also mostly indoor ones. In such cases, the ecological validity of the methods needs special mention: evaluation on indoor scenes alone may not sufficiently establish the superiority of a method or its working accuracy. So, thorough in-the-wild versus indoor/lab interaction testing is needed [73]. Here, we can consider methods that are capable of detecting dynamic scene interactions with multiple groups to be good candidates for in-the-wild use. Moreover, the camera views (see Table 4), both ego and exo vision, also need to be considered. Methods that use outdoor datasets with multiple groups and either ego- or exo-vision cameras (for training models) may be categorized into the bucket of in-the-wild methods (see Table 6). Please see Tables 1 and 2 in the Online Appendix for all the methods and their features.
In [44], video data at 10 frames per second are used for detecting dynamic groups and interactions. In [34], surveillance videos are used for the experiments. Similarly, [145] utilizes a video feed from a cocktail party [173], sampled at one frame every 5 seconds, for its experiments. SALSA [174] has a temporal sampling rate of 1 frame every 3 seconds (for a 60-minute video), and MatchNMingle [32] has a resolution of 1 frame per second (for a 30-minute video). On the other hand, CoffeeBreak [42] has only 130 frames and Idiap Poster Data (IPD) [76] has a collection of 83 frames, which makes good temporal resolution difficult. The Babble dataset [65, 66] has 3,481 frames of conversational groups with a sampling rate of one frame per second. Figure 16 depicts a dynamic scene and group scenario. The EGO-GROUP dataset [5] has a video of an indoor laboratory setup consisting of 395 image frames; its specialty is that the people in the scene are not static in one position and change position/orientation and location over time. On the right-hand side of the figure, we show four instances of the image sequence where four different types of groups/interactions and formations are visible for the same four people in the scene. This type of dynamism should be handled efficiently by detection methods that consider the temporal aspects of the scenes. In scenes such as waiting rooms, stations, airports, restaurants, theatres, and lobbies, dynamic groups are mostly encountered. The main problem here is the scarcity of datasets with sufficient temporal resolution or sampling rate: creating good models that exploit temporal information requires a large amount of temporal data, which is missing in the currently available public datasets. Toward this, a study of the time resolutions required to capture various social human behaviors is needed to understand the landscape, which can further facilitate the creation of good temporal datasets and learning models. Such a study can be found in Section 3.3 of [128].
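One simple way to exploit such temporal data, sketched below with assumed window length and threshold, is to average per-frame pairwise affinities over a sliding window before extracting groups, so that a brief head or body sway does not flip the group assignment. The sketch assumes the same n tracked people appear in every frame of the window.

```python
import numpy as np
from collections import deque

class TemporalGroupSmoother:
    """Average per-frame pairwise affinity matrices over a sliding window and
    threshold the average, so momentary pose changes do not break a group.
    Window length and threshold are assumed values, not from any surveyed paper."""
    def __init__(self, window=15, threshold=0.5):
        self.buffer = deque(maxlen=window)
        self.threshold = threshold

    def update(self, affinity):
        """affinity: (n, n) matrix from any per-frame detector for the same n people."""
        self.buffer.append(np.asarray(affinity, dtype=float))
        mean_aff = np.mean(list(self.buffer), axis=0)
        links = mean_aff > self.threshold
        # Connected components over the smoothed links give stable group labels
        n = links.shape[0]
        labels = list(range(n))
        for i in range(n):
            for j in range(i + 1, n):
                if links[i, j]:
                    old, new = labels[j], labels[i]
                    labels = [new if lab == old else lab for lab in labels]
        return labels

# Example: person 2 briefly turns away for one frame, but the window keeps the group
smoother = TemporalGroupSmoother(window=5)
steady = np.array([[1.0, 0.9, 0.8], [0.9, 1.0, 0.7], [0.8, 0.7, 1.0]])
glitch = np.array([[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]])
for frame_aff in [steady, steady, steady, glitch, steady]:
    labels = smoother.update(frame_aff)
print(labels)   # all three people still share one group label
```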
Table 7 summarizes the references into two detection capability types found in the literature.
Fig. 16.
Fig. 16. Dynamic behavior of groups/people in a video/image sequence from EGO-GROUP [5].
Table 7.
Classification | References
Static scene detection | [36, 51, 54, 57, 82, 113, 114, 122, 126, 129, 136, 144, 146, 166, 167, 186]
Dynamic scene detection | [3, 4, 7, 8, 9, 10, 11, 12, 15, 16, 18, 21, 27, 34, 35, 41, 43, 44, 46, 47, 52, 53, 55, 56, 57, 58, 62, 68, 71, 72, 74, 75, 77, 79, 80, 81, 87, 88, 90, 93, 95, 96, 97, 100, 102, 103, 110, 111, 113, 115, 116, 127, 129, 131, 134, 135, 138, 141, 143, 145, 146, 148, 149, 150, 152, 159, 161, 162, 165, 166, 168, 170, 181, 182, 183, 184, 185, 187]
Table 7. Classification Based on Group/Interaction and Formation Detection Capability

8.2 Detection Scale

Since different methods are needed depending on how many groups are present in a captured image or video, the detection scale plays an important role.
Single Group Detection. When a sensor/camera captures only one interacting group in the scene, the task is comparatively easy. An image stream can contain multiple groups as well, but not all methods are capable of detecting multiple groups simultaneously. In some cases, single group detection is sufficient, for example when a robot needs to detect a single group of interest in a scene or environment and join it for interaction/discussion. The datasets used for this kind of detection are mostly captured indoors (e.g., office and panoptic studio [39] or Babble [65, 66]) or outdoors (mostly private datasets). Other publicly available datasets, such as BEHAVE [109] and YouTube videos, can also be used for such purposes. In ego-view camera-based detection methods, single group detection is the primary focus.
Multiple Group Detection. When there is more than one interacting group or formation in a scene, detection methods need special attention. Sometimes there is only one interacting group in the scene along with some additional people who are not actively involved in the interaction; such cases can also be considered under the same umbrella and are quite challenging too. This kind of detection is useful for finding how many groups there are, or for finding a particular group in a diverse scene, in surveillance/monitoring applications. Several datasets comprise such scenarios: the coffee break dataset [42], EGO-GROUP [5], SALSA [174], cocktail party [173], GDet [23], Synthetic [50], IPD [76], and FriendsMeet2 (GM2) [24, 26]. Beyond these, some researchers have used their own (private) datasets. In [44] and [145], the authors experimented with datasets where there is more than one group in the scene (the party data). Similarly, in [34], surveillance videos in which there can be more than one group in the captured video are used as data. Table 8 classifies the literature on the basis of group/interaction detection scale. Multiple group detection normally arises in exo-view-based methods. Figure 17 depicts three scenarios from the well-known EGO-GROUP dataset [5]. Figure 17(a) shows a single triangular formation with one outlier in an indoor environment. Figure 17(b) depicts a vis-a-vis formation in an outdoor situation. Figure 17(c) shows two groups, one triangular and one L-shaped formation, in an indoor situation.
Fig. 17.
Fig. 17. Images from EGO-GROUP [5] dataset depicting indoor and outdoor scenes with single and multiple group interactions.
Table 8.
Classification | References
Single group detection | [3, 4, 12, 15, 16, 18, 21, 41, 46, 47, 48, 51, 52, 53, 54, 56, 67, 68, 74, 75, 77, 78, 79, 80, 81, 82, 88, 89, 90, 95, 96, 97, 100, 102, 103, 104, 107, 110, 111, 113, 115, 116, 118, 122, 135, 136, 138, 141, 143, 146, 148, 149, 161, 168, 170, 181, 184, 185]
Multi-group detection | [7, 8, 9, 10, 11, 27, 34, 35, 36, 43, 44, 55, 57, 62, 71, 72, 87, 93, 99, 113, 114, 117, 126, 127, 129, 131, 134, 144, 145, 146, 150, 152, 159, 162, 165, 166, 167, 182, 183, 186, 187]
Table 8. Classification Based on Group/Interaction and Formation Detection Scale

9 Categorization of Evaluation Methods

The most important part of the formation or interaction detection framework (Figure 8) is the evaluation methodology. The conventional criteria for comparing methods and techniques in such vision tasks are accuracy and efficiency. Accuracy describes how correctly a method detects/predicts or recognizes an f-formation, while the efficiency parameter relates to the real-time aspect of the method. Apart from these, the papers in the surveyed literature also describe simulation-based evaluation and human experience study-based evaluation (for robotic applications specifically). Figure 18 shows a simple taxonomy of evaluation methods for the various group/interaction and formation detection methods or algorithms.
Fig. 18.
Fig. 18. Taxonomy for evaluation methodology for group/interaction and f-formation detection.
Simulation-Based Evaluation. This type of evaluation is conducted using simulation tools. The simulators have different features and can simulate the real world in complex environments. A range of simulators are used in the surveyed literature, such as Gazebo [117], RoboDK [133], and Webots [45]. Nowadays, researchers are also using Virtual Reality or Augmented Reality technologies for evaluation purposes. The evaluations are performed mainly to assess the perception of a virtual robot or an autonomous agent. The first question to be answered is how well a simulated robot (in a simulated environment) can perceive a group of simulated people involved in an interaction. Second, after detection, does the simulated robot join the group naturally without discomforting the simulated people (see Section 2.3 and Figure 3)? Parameters like the robot's stopping distance, its orientation and pose based on the perceived group pose/orientation, and its angle of approach depending on the group's angle and position should be considered. Extensive discussion of these factors after group/interaction detection by a robot or autonomous agent is out of the scope of this survey.
Human Experience Study Based Evaluation. This type of evaluation tests the detection methods using ego-vision robots or in a real scenario with human participants as evaluators. A questionnaire is provided to the human participants to rate the quality of the method being used by the robot in real scenarios. Questions and parameters similar to those of simulation-based evaluation can be considered in this case as well, but with a real robot perceiving human groups (who are also the evaluators) and interactions. In real-life scenarios, the groups are not static and tend to move when a member joins or leaves the group. Accordingly, the robot or autonomous agent must detect the changes in group formation, orientation, and pose to re-adjust itself in a natural and more human-like manner without causing any discomfort to other humans.
Accuracy/Efficiency Evaluation without Using Robot or Simulators. This kind of evaluation is based on the accuracy or efficiency aspects of the methods but not tested in the real environment or by using robots. Here, the focus is mainly to evaluate the computational aspects of the methods/algorithms without evaluating the usability in real-life applications like robotics. However, applications like human behavior/interaction analysis, scene monitoring, and surveillance depend entirely on such evaluation. Table 9 classifies the surveyed papers on the basis of the evaluation strategy adopted. It also shows the descriptions/names of simulators in the simulation-based category.
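For the group-level accuracy numbers reported in this literature, a commonly used criterion is tolerant matching: a predicted group counts as correct if it overlaps a ground-truth group in at least a fraction T of the members (often T = 2/3). The sketch below is a simplified Python illustration of that idea, not the exact protocol of any particular paper.

```python
def group_f1(pred_groups, gt_groups, T=2 / 3):
    """Group-level precision/recall/F1 with a tolerant matching criterion
    (simplified sketch): a predicted group matches a ground-truth group when
    their overlap covers at least a fraction T of the members of each."""
    def matches(p, g):
        inter = len(set(p) & set(g))
        return inter >= T * len(p) and inter >= T * len(g)

    tp_pred = sum(any(matches(p, g) for g in gt_groups) for p in pred_groups)
    tp_gt = sum(any(matches(p, g) for p in pred_groups) for g in gt_groups)
    precision = tp_pred / len(pred_groups) if pred_groups else 0.0
    recall = tp_gt / len(gt_groups) if gt_groups else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Ground truth: {A,B,C} and {D,E}; the prediction misses C and hallucinates {F,G}
print(group_f1([["A", "B"], ["D", "E"], ["F", "G"]],
               [["A", "B", "C"], ["D", "E"]]))   # (0.67, 1.0, 0.8)
```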
Table 9.
Classification | References
Simulation-based evaluation (robotic simulators/virtual environment) | 2D grid environment simulated in Greenfoot [104], simulated the process of deformation of contours using P-spaces represented by Contours of the Level Set Method [87], [52], Robot Operating System implementation of PedSim [129], a simulated avatar embodied confederate [118], Gazebo [117], a simulator using Unity 3D game engine [54]
Human experience study-based evaluation (with real robots) | [12, 15, 18, 21, 33, 56, 58, 67, 74, 88, 90, 97, 99, 100, 103, 114, 115, 116, 131, 136, 138, 148, 159, 162, 168, 182, 184]
Accuracy/efficiency evaluation (without robot, only computation) | [4, 7, 9, 11, 27, 34, 35, 36, 41, 43, 44, 51, 52, 53, 55, 57, 62, 71, 72, 77, 87, 93, 95, 96, 110, 111, 126, 127, 129, 134, 135, 144, 145, 146, 152, 165, 166, 167, 170, 183, 185, 186, 187]
Table 9. Classification Based on Evaluation Methods and Strategy

10 Application Areas

Group or interaction detection has found wide application in many areas of computer vision, and with the emergence of robotics and AI, this domain has realized its true potential. In this article, we categorize the application landscape into two broad areas, robotic applications and other vision applications, which are further broken down into five groups as summarized in Table 10 (also see Figure 19). Robot vision refers to applications where the robot's camera provides an ego-centric view for finding groups only, with no intent to initiate interaction with a human. In HRI, f-formation detection is used to detect a group in order to participate autonomously in the interaction with fellow human beings. In telepresence, a remote person uses the robot to interact with a group of people; in such a scenario, the semi-autonomous robot can detect the group and join it while the remote human operator controls the robot to adjust its positioning.
Fig. 19.
Fig. 19. Taxonomy for application areas for group/interaction and f-formation detection.
Table 10.
Application Area | References
Drone/robotic vision | [3, 7, 9, 11, 18, 21, 27, 34, 35, 36, 43, 47, 53, 55, 68, 70, 72, 77, 90, 93, 95, 122, 126, 129, 143, 144, 145, 146, 150, 152, 158, 161, 165, 166, 167, 170, 186, 187]
HRI (i.e., assistive robots) | [12, 15, 16, 41, 54, 56, 57, 58, 60, 67, 70, 74, 75, 78, 79, 80, 81, 82, 89, 90, 97, 99, 100, 102, 103, 104, 107, 113, 114, 131, 136, 138, 141, 147, 148, 149, 159, 160, 168, 181, 182, 183, 184, 188]
Telepresence/teleoperation technologies | [70, 88, 113, 115, 117, 160, 188]
Indoor/outdoor scene monitoring and surveillance | [58, 68, 90, 100, 118, 120, 185]
Human behavior/activity and interaction analysis | [4, 8, 10, 44, 46, 48, 51, 52, 62, 70, 71, 74, 87, 96, 100, 110, 111, 127, 129, 134, 135, 147, 158, 160, 162, 165, 185, 188]
Covid-19 and social distancing | Scope of future research
Table 10. Classification Based on Targeted Application Areas
Scene monitoring is useful for analyzing indoor or outdoor scenes with people interacting and forming groups and f-formations for various activities. On the other hand, human behavior and interaction analysis refer to the behavior between humans and how they are interacting based on the situation. Furthermore, visual analytics in big data has empowered the domain beyond imagination. People are trying to use these technologies in various aspects of life. In the current scenario of the COVID-19 pandemic, we can utilize this technology in monitoring social distancing in human groups and interactions as well. As already mentioned, telepresence robotics can be utilized by doctors/nurses and other medical staff to attend to patients in remote locations without physically being present.
Some of the discussed application areas have serious ethical implications and potential negative societal impacts. HRI can be seen in many embodied robots and humanoids nowadays, and especially in cases like assistive and therapeutic robots there are many ethical implications [130]. First is the data privacy and protection of the person who interacts with such robots. The robots necessarily capture visual data of the humans with whom they interact; these data are easily available to the owner or user of such robots, and any misuse of such data raises ethical concerns. Second, a human's emotional and psychological attachment to such a robot is at stake when the robot's job is done and it is taken away (especially in the case of assistive and therapeutic robots). This may lead to serious harmful effects on the humans and leave them in much worse condition than before; such situations need proper, planned strategies to be handled without adverse effects. Next, the robots may be involved physically with humans, such as lifting patients, taking them to the bathroom, and helping them bathe. So, the robots must be designed such that the privacy of the humans is preserved, for example by deactivating video monitors during intimate procedures. Similar ethical concerns prevail in the case of HRI in telepresence and teleoperation robots [49]; here, transparency and accountability also come into play.
Another application area stated in our survey has direct ethical implications for human privacy. Nowadays, remote surveillance, monitoring, and tracking of humans and their behaviors and activities have gained popularity. Usually drones, UAVs, UGVs, and even fixed camera systems are used for such applications in many public places as well as private areas. This poses a serious threat to the privacy of human life and activities. Many countries have their own set of laws governing this arena. Some of the locations where a person expects privacy are hotel rooms, their own residence, public restrooms, changing rooms, and so on. Having surveillance or monitoring systems in such locations may violate the prescribed ethical law, and the concerned people may face legal consequences. So, it is recommended to review the set of laws of the country where such vision-based applications are deployed. Finally, most researchers release their methods and code on human group interaction or activity detection for research purposes only, so such limitations on the real-life usage of these methods need serious consideration and workarounds.

11 Limitations, Challenges, and Future Directions: A Discussion

The survey is organized around a generic framework of concern areas for group/interaction detection using the theory of f-formation (see Section 4.1 and Figure 8). It addresses the identified modules and concern areas, namely camera view and the availability of other sensor data, datasets, feature selection, methods/techniques, detection capabilities/scale, evaluation methodologies, and application areas.
The existing methods have an almost equal share of fixed rule-based (Figure 12) and learning-based (Figure 13) approaches (Table 5 and Tables 1 and 2 in the online Appendix). Researchers need to orient their research toward data-driven approaches using DL and reinforcement learning paradigms for handling complex situations. But this is not easy as long as we have a limited amount of data for training purposes. The current dataset landscape in this domain is not mature enough for several reasons, as already discussed in Section 8. So, the research community must first make an honest attempt to create and publish more datasets and to augment such datasets considering all the ethical and ecological issues, as also stated in the coming paragraphs on dataset research recommendations. Thereafter, meta-learning can also be explored on large-scale combined datasets, and complex detection scenarios can be addressed using big data and visual analytics [120]. Apart from that, representing data in the form of a graph can solve many performance issues in terms of accuracy and efficiency. Since GNNs have already achieved considerable success in this domain in recent times, with multiple papers produced by research groups across the globe, we can also try to extend these methods by combining them with other forms of DL. Researchers can try out GNN and CNN hybrid models like graph convolution networks, which are potential candidates for creating appropriate models. A combination of recurrent neural networks, convolutional neural networks, and/or graph recurrent networks can also be explored to identify more accurate and promising detection models. Furthermore, for the current datasets with small amounts of data, we require ML techniques that are less data-hungry and remain efficient even with small data. This leads us back to statistical models, which also work well with small amounts of data [1]. We can also explore few-shot/one-shot/low-shot learning and incremental learning paradigms in such cases. Transfer learning and domain adaptation can also be explored in future research for transferring static-image features to video key frames for dynamic group formation detection. Another novel direction is Diffusion models [69, 155], which have come to prominence in various computer vision applications. Diffusion models are a type of generative model that progressively converts data into Gaussian noise and then learns to reverse the noising steps to recover the data. Finally, we also suggest new research using Neural Radiance Field (NeRF)-based models. NeRFs are highly effective for object detection tasks in dynamic scenes [123] and for grouping semantically similar things in 2D/3D environments [86]; hence, they are potential candidates for solving human group dynamics and detection problems in extremely dynamic and crowded scenes.
Problems like dynamism in groups (people leaving/joining a group dynamically or changing position and orientation within the group) and occlusions of people pose serious challenges and limitations to the current state-of-the-art methods in terms of accuracy and efficiency. Researchers can think about devising rules based on reasoning and geometry to detect application-specific groups and interactions. A combination of rules and geometry-based reasoning together with data-driven models can also be explored to improve detection quality. Apart from detecting the group and formation alone, methods should be designed to detect the orientation and pose of the group itself (see [18, 21]). This can facilitate a good, natural, and human-like approach direction and angle for robots joining the group.
The major challenge with the datasets is their availability. Creating good-quality (large-scale) vision datasets (for training and testing) is a mammoth task in itself but has its own research/academic merit. Only about 20% of the surveyed datasets are publicly available (Table 6). Researchers can publish more of their privately created datasets as benchmarks for others to experiment with. However, the ethical implications [73] need to be considered: one simply cannot collect and publish vision datasets with human participants without proper permission from the relevant board (see the discussion of these concerns at the end of Section 7). Another limitation is the availability of public ego-vision datasets, with only about 30% of the total being ego-vision. Researchers can think of creating more first-person-view datasets by merging/fusing existing datasets in a meaningful manner. Here, researchers also need to consider the ecological validity [73] of interactions learned from such data, so data augmentation needs to be done in a more restricted manner. Preserving privacy in ego-vision datasets is also a big challenge, which is why we see fewer first-person-view datasets captured in the wild than in lab or indoor settings; this issue needs to be handled sensitively by researchers. As for the exo-vision datasets, accuracy is low, so researchers can look in this direction by creating more robust datasets for global-view scenes. Indoor datasets currently dominate, with merely 29% of the datasets being outdoor; researchers need to create more outdoor datasets for applications pertaining to surveillance and outdoor scene monitoring. Here again, ethical concerns arise from application areas such as surveillance and monitoring, and implementing such applications needs special permission from the concerned bodies of the respective country (see the discussion at the end of Section 10). Moreover, large new datasets with temporal information need to be created; a sufficient temporal resolution and sampling rate [128] are needed to train models for complex scenes and dynamic group detection. We can also use Generative Adversarial Networks to generate synthetic data from real data of human groups and interactions for training DL models. Finally, researchers have limited their methods and datasets to only a few major formations like face-to-face, triangular/circular, side-by-side, and L-shaped. No literature or state-of-the-art method deals with a comprehensive list of formations (as explained in Section 2.2); researchers should concentrate on devising methods and creating datasets to address these limitations.
Detection capabilities need attention with respect to dynamic scenes (Figure 16) as well as multiple groups (Figure 17). The literature covers most aspects of detection (Tables 7 and 8), but more research attention is required for cases of occlusion, background clutter, and adverse lighting conditions. Researchers can use reinforcement learning and DL models for these problems, and appropriate datasets need to be prepared at a larger scale. We can also try to improve the datasets using Low Dynamic Range to High Dynamic Range reconstruction methods [19, 20, 22, 91, 124, 175] in cases of adverse lighting, color, or illuminance conditions. Reconstructing images with overexposed/underexposed parts can lead to better detection algorithms and models.
Evaluation of the methods remains a challenge in the current literature (Table 9). Mostly, computational evaluation has been performed in terms of accuracy and efficiency. But in a problem like group/interaction detection, human experience studies and/or simulation-based studies are important to establish the effectiveness of a method in applications like robotics, telepresence, and social surveillance (see Section 9, Figure 18, and Table 10). Researchers need to orient their studies in this respect as well. Apart from that, most of the methods yield good accuracy, but achieving real-time performance while maintaining good accuracy remains a concern. Researchers can think of designing lightweight models for real-time detection of groups and interactions in dynamic scenes.
Feature selection/extraction plays a major role in any computer vision problem, and group detection is no exception. The existing literature lacks discussion about the use of proper features and the selection of proper approaches to extract useful and differentiating features. Apart from the visual features, researchers can also think about non-visual features such as audio or speech as future research trends. There is also a possibility of temporal feature selection and extraction for dynamic groups.
Applications of this domain can be widely seen in robotics, surveillance, human behavior analysis, and telepresence technologies (Table 10 and Figure 19). However, we can also think about using this technology in COVID-19-related applications such as monitoring of social distancing norms and others.
We have also discussed two types of camera views, ego and exo views (Figure 9 and Table 4). Ego vision is used predominantly for robotics-related applications. Methods using ego-view cameras as input are fewer than those using exo-view cameras, mainly because of the scarcity of public ego-view datasets for training the models. Researchers can also direct their work toward detection models built on a hybrid system of camera views and sensors; visual and other forms of input combined can be used for better detection and prediction. Here, we also need to think about the challenges of cross-modal synchronization [128]: accurate sampling of sensor inputs is required over fixed-size windows. As also discussed in Section 8.1, there is a scarcity of datasets with adequate temporal sampling and resolution, and researchers should be encouraged to address this gap in the future. Various combinations of camera views and positions can be experimented with for better scene capture and robust dataset creation for learning models. Figures 20 and 21 summarize the limitations, challenges, and future directions/opportunities for all the concern areas of this survey framework.
Fig. 20.
Fig. 20. Limitations and challenges in the various concern areas of our survey framework (see Section 4.1).
Fig. 21.
Fig. 21. Future research directions/opportunities in the various concern areas of our survey framework (see Section 4.1).

12 Conclusions

With the emergence of computer vision, robotics, and multimedia analytics, the world is changing for good with the progress of AI. Computational systems and autonomous agents are expected to show increasingly human-like behavior and capability. One of the most important problems in this domain remains group/interaction detection and prediction using f-formation. Although some research has been conducted in the last decade, much more progress is still envisioned. This survey aims at generalizing the problem of group/interaction detection via a framework, which also serves as the theme of the survey. The article presents a comprehensive view of all the concern areas of this framework, including definitions of various f-formations, input camera views and sensors, datasets, feature selection, algorithms, detection capability and scale, quality of detection, evaluation methodologies, and application areas. It also discusses the limitations, challenges, and future scope of research in this domain. Researchers can try to solve some of the unattended problems with the help of more recent and efficient approaches in DL, reinforcement learning, adversarial learning, and meta-learning paradigms. Another direction is the utilization of advanced models such as Diffusion and NeRF, which hold the potential to work very effectively in dynamic and crowded scenes. A combination of different neural networks can also be explored to create efficient models in application-specific cases.

Acknowledgments

We are thankful to the Editors and Associate Editors involved in the review process of our manuscript; their guidance across the revision iterations has significantly enhanced the quality of the work. We are also deeply grateful to the reviewers, whose constructive criticism has improved the technical quality and readability of this manuscript.

Supplemental Material

PDF File - F_Formation_THRI_Supple.pdf
The Appendix consists of two long tables provided as electronic supplementary material. Tables 1 and 2 summarize the surveyed methods and techniques for group, interaction, and F-formation detection on the basis of various parameters. Table 1 compares the static rule-based AI methods, while Table 2 covers the machine learning and deep learning based techniques. The tables use the following abbreviations in the column headers and rows: ML (Machine Learning), DL (Deep Learning), RL (Reinforcement Learning), Su (Supervised), UnS (Unsupervised), SS (Semi-supervised), S (Single group), M (Multiple groups), Exo (Exocentric vision or global view), and Ego (Egocentric or first-person view).

References

[1]
Amina Adadi. 2021. A survey on data-efficient algorithms in big data era. Journal of Big Data 8, 1 (2021), 1–54.
[2]
Jake K. Aggarwal and Michael S. Ryoo. 2011. Human activity analysis: A review. ACM Computing Surveys (CSUR) 43, 3 (2011), 1–43.
[3]
Maedeh Aghaei, Mariella Dimiccoli, and Petia Radeva. 2015. Towards social interaction detection in egocentric photo-streams. In 8th International Conference on Machine Vision (ICMV 2015), Vol. 9875. International Society for Optics and Photonics, 987514.
[4]
Maedeh Aghaei, Mariella Dimiccoli, and Petia Radeva. 2016. With whom do I interact? Detecting social interactions in egocentric photo-streams. In 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2959–2964.
[5]
AImageLab. 2021a. AImageLab datasets. Retrieved from http://imagelab.ing.unimore.it/files/EGO-GROUP.zip
[6]
AImageLab. 2021b. AImageLab datasets. Retrieved from http://imagelab.ing.unimore.it/files/EGO-HPE.zip
[7]
Xavier Alameda-Pineda, Elisa Ricci, and Nicu Sebe. 2017. Multimodal analysis of free-standing conversational groups. In Frontiers of Multimedia Research. ACM, 51–74.
[8]
Xavier Alameda-Pineda, Jacopo Staiano, Ramanathan Subramanian, Ligia Batrinca, Elisa Ricci, Bruno Lepri, Oswald Lanz, and Nicu Sebe. 2015a. Salsa: A novel dataset for multimodal group behavior analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 8 (2015), 1707–1720.
[9]
Xavier Alameda-Pineda, Yan Yan, Elisa Ricci, Oswald Lanz, and Nicu Sebe. 2015b. Analyzing free-standing conversational groups: A multimodal approach. In 23rd ACM International Conference on Multimedia, 5–14.
[10]
Stefano Alletto, Giuseppe Serra, Simone Calderara, and Rita Cucchiara. 2015. Understanding social relationships in egocentric vision. Pattern Recognition 48, 12 (Dec. 2015), 4082–4096. DOI:
[11]
Stefano Alletto, Giuseppe Serra, Simone Calderara, Francesco Solera, and Rita Cucchiara. 2014. From ego to nos-vision: Detecting social relationships in first-person views. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 580–585.
[12]
P. Althaus, H. Ishiguro, T. Kanda, T. Miyashita, and H. I. Christensen. 2004. Navigation for human-robot interaction tasks. In IEEE International Conference on Robotics and Automation, 2004. Proceedings (ICRA ’04), Vol. 2. 1894–1900. 1050–4729. DOI:
[13]
M. Andriluka, S. Roth, and B. Schiele. 2010. Monocular 3D pose estimation and tracking by detection. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 623–630.
[14]
University of Michigan (Ann Arbor). 2021. Collective Activity Dataset. Retrieved from http://www.eecs.umich.edu/vision/activity-dataset.html
[15]
Adrian Ball, David Rye, David Silvera-Tawil, and Mari Velonaki. 2015. Group vs. individual comfort when a robot approaches. In International Conference on Social Robotics. Springer, 41–50.
[16]
Adrian Keith Ball, David C. Rye, David Silvera-Tawil, and Mari Velonaki. 2017. How should a robot approach two people? Journal of Human-Robot Interaction 6, 3 (2017), 71–91.
[17]
Stefania Bandini, Andrea Gorrini, and Giuseppe Vizzari. 2014. Towards an integrated approach to crowd analysis and crowd synthesis: A case study and first results. Pattern Recognition Letters 44 (2014), 16–29.
[18]
Hrishav Barua, Pradip Pramanick, Chayan Sarkar, and Theint Mg. 2020. Let me join you! Real-time F-formation recognition by a socially aware robot. In 29th IEEE International Conference on Robot and Human Interactive Communication, RO-MAN.
[19]
Hrishav Bakul Barua, Ganesh Krishnasamy, KokSheik Wong, Abhinav Dhall, and Kalin Stefanov. 2024a. HistoHDR-Net: Histogram equalization for single LDR to HDR image translation. arXiv:2402.06692.
[20]
Hrishav Bakul Barua, Ganesh Krishnasamy, KokSheik Wong, Kalin Stefanov, and Abhinav Dhall. 2023. Arthdr-net: Perceptually realistic and accurate hdr content creation. In 2023 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 806–812.
[21]
Hrishav Bakul Barua, Pradip Pramanick, and Chayan Sarkar. 2021. System and method for enabling robot to perceive and detect socially interacting groups. US Patent App. 17/138,224.
[22]
Hrishav Bakul Barua, Kalin Stefanov, KokSheik Wong, Abhinav Dhall, and Ganesh Krishnasamy. 2024b. GTA-HDR: A large-scale synthetic dataset for HDR image reconstruction. arXiv:2403.17837.
[26]
L. Bazzani, M. Cristani, and V. Murino. 2012. Decentralized particle filter for joint individual-group tracking. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 1886–1893.
[27]
Loris Bazzani, Marco Cristani, Diego Tosato, Michela Farenzena, Giulia Paggetti, Gloria Menegaz, and Vittorio Murino. 2013. Social interactions by visual focus of attention in a three-dimensional environment. Expert Systems 30, 2 (2013), 115–127.
[28]
Ben Benfold and Ian Reid. 2009. Guiding visual surveillance by tracking human attention. British Machine Vision Conference, BMVC 2009—Proceedings. DOI:
[29]
B. Benfold and I. Reid. 2011. Unsupervised learning of a scene-specific coarse gaze estimator. In 2011 International Conference on Computer Vision, 2344–2351.
[30]
Abhijan Bhattacharyya, Ashis Sau, Ruddra Dev Roychoudhury, Hrishav Bakul Barua, Chayan Sarkar, Sayan Paul, Brojeshwar Bhowmick, Arpan Pal, and Balamuralidhar Purushothaman. 2021. Edge centric communication protocol for remotely maneuvering a tele-presence robot in a geographically distributed environment. US Patent App. 16/988,453.
[31]
X. P. Burgos-Artizzu, P. Dollár, D. Lin, D. J. Anderson, and P. Perona. 2012. Social behavior recognition in continuous video. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, 1322–1329.
[32]
Laura Cabrera-Quiros. 2021. The MatchNMingle dataset. Retrieved from http://matchmakers.ewi.tudelft.nl/matchnmingle/pmwiki/index.php?n=Main.TheDataset
[33]
Konstantinos Charalampous, Ioannis Kostavelis, and Antonios Gasteratos. 2017. Recent trends in social aware robot navigation: A survey. Robotics and Autonomous Systems 93 (2017), 85–104.
[34]
Cheng Chen and Jean-Marc Odobez. 2012. We are not contortionists: Coupled adaptive learning for head and body orientation estimation in surveillance video. In 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1544–1551.
[35]
Nam-Gyu Cho, Young-Ji Kim, Unsang Park, Jeong-Seon Park, and Seong-Whan Lee. 2015. Group activity recognition with group interaction zone based on relative distance between human objects. International Journal of Pattern Recognition and Artificial Intelligence 29, 05 (2015), 1555007.
[36]
Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, and Silvio Savarese. 2014a. Discovering groups of people in images. In European Conference on Computer Vision. Springer, 417–433.
[37]
Wongun Choi, Yu-Wei Chao, Caroline Pantofaru, and Silvio Savarese. 2014b. Discovering groups of people in images. In ECCV.
[38]
Herbert H. Clark. 1996. Using Language. Cambridge University Press.
[39]
CMU. 2021. CMU Panoptic Dataset. Retrieved from https://domedb.perception.cs.cmu.edu
[40]
Claudio Coppola. 2021. UoL 3D Social Interaction Dataset. Retrieved from https://lcas.lincoln.ac.uk/wp/research/data-sets-software/uol-3d-social-interaction-dataset
[41]
Claudio Coppola, Serhan Cosar, Diego R. Faria, and Nicola Bellotto. 2017. Automatic detection of human interactions from rgb-d data for social activity classification. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 871–876.
[42]
Marco Cristani. 2021. CoffeeeBreak dataset. Retrieved from http://profs.sci.univr.it/cristanm/datasets.html
[43]
Marco Cristani, Loris Bazzani, Giulia Paggetti, Andrea Fossati, Diego Tosato, Alessio Del Bue, Gloria Menegaz, and Vittorio Murino. 2011a. Social interaction discovery by statistical analysis of F-formations. In BMVC.
[44]
Marco Cristani, Giulia Paggetti, Alessandro Vinciarelli, Loris Bazzani, Gloria Menegaz, and Vittorio Murino. 2011b. Towards computational proxemics: Inferring social relations from interpersonal distances. In 2011 IEEE 3rd International Conference on Privacy, Security, Risk and Trust and 2011 IEEE 3rd International Conference on Social Computing. IEEE, 290–297.
[45]
Cyberbotics. 2021. Webots-Open Source Robot Simulator. Retrieved from https://cyberbotics.com/
[46]
Yasuharu Den. 2018. F-formation and social context: How spatial orientation of participants’ bodies is organized in the vast field. In LREC 2018 Workshop: Language and Body in Real Life (LB-IRL2018) and Multimodal Corpora (MMC2018) Joint Workshop, 35–39.
[47]
Eyal Dim and Tsvi Kuflik. 2013. Social F-formation in blended reality. In International Conference on Intelligent User Interfaces, 25–28.
[48]
Eyal Dim and Tsvi Kuflik. 2014. Automatic detection of social behavior of museum visitor pairs. ACM Transactions on Interactive Intelligent Systems (TiiS) 4, 4 (2014), 1–30.
[49]
Reza Etemad-Sajadi, Antonin Soussan, and Théo Schöpfer. 2022. How ethical issues raised by human–robot interaction can impact the intention to use the robot? International Journal of Social Robotics (2022), 1–13.
[50]
Alessandro Farinelli. 2021. Retrieved from http://profs.sci.univr.it/*cristanm/datasets.html
[51]
Tian Gan. 2013. Social interaction detection using a multi-sensor approach. In 21st ACM International Conference on Multimedia, 1043–1046.
[52]
Tian Gan, Yongkang Wong, Bappaditya Mandal, Vijay Chandrasekhar, Liyuan Li, Joo-Hwee Lim, and Mohan S Kankanhalli. 2014. Recovering social interaction spatial structure from multiple first-person views. In 3rd International Workshop on Socially-Aware Multimedia, 7–12.
[53]
Tian Gan, Yongkang Wong, Daqing Zhang, and Mohan S Kankanhalli. 2013. Temporal encoded F-formation system for social interaction detection. In 21st ACM International Conference on Multimedia, 937–946.
[54]
Yuan Gao, Fangkai Yang, Martin Frisk, Daniel Hemandez, Christopher Peters, and Ginevra Castellano. 2019. Learning socially appropriate robot approaching behavior toward groups using deep reinforcement learning. In 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, 1–8.
[55]
Carolina Gárate, Sofia Zaidenberg, Julien Badie, and François Brémond. 2014. Group tracking and behavior recognition in long video surveillance sequences. In 2014 International Conference on Computer Vision Theory and Applications (VISAPP), Vol. 2. IEEE, 396–402.
[56]
Andre Gaschler, Sören Jentzsch, Manuel Giuliani, Kerstin Huth, Jan de Ruiter, and Alois Knoll. 2012. Social behavior recognition using body posture and head pose for human-robot interaction. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2128–2133.
[57]
Ekin Gedik and Hayley Hung. 2018. Detecting conversing groups using social dynamics from wearable acceleration: Group size awareness. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 2, 4 (2018), 1–24.
[58]
Dylan F. Glas, Satoru Satake, Florent Ferreri, Takayuki Kanda, Norihiro Hagita, and Hiroshi Ishiguro. 2013. The network robot system: Enabling social human-robot interaction in public spaces. Journal of Human-Robot Interaction 1, 2 (2013), 5–32.
[59]
Erving Goffman and Joel Best. 2017. Interaction Ritual: Essays in Face-to-Face Behavior. Routledge.
[60]
Javier V Gómez, Nikolaos Mavridis, and Santiago Garrido. 2013. Social path planning: Generic human-robot interaction framework for robotic navigation tasks. In 2nd International Workshop on Cognitive Robotics Systems: Replicating Human Actions and Activities.
[61]
Nicolas Gourier and James Crowley. 2004. Estimating face orientation from robust detection of salient facial structures. FG Net Workshop on Visual Observation of Deictic Gestures.
[62]
Georg Groh, Alexander Lehmann, Jonas Reimers, Marc René Frieß, and Loren Schwarz. 2010. Detecting social situations from interaction geometry. In 2010 IEEE Second International Conference on Social Computing. IEEE, 1–8.
[63]
Edward T. Hall. 1963. A system for the notation of proxemic behavior. American Anthropologist 65, 5 (1963), 1003–1026.
[64]
Edward T. Hall. 1966. The Hidden Dimension. Anchor Books. Garden City, NY.
[65]
Hooman Hedayati. 2021. Retrieved from https://github.com/cu-ironlab/Babble
[66]
Hooman Hedayati, Annika Muehlbradt, Daniel J. Szafir, and Sean Andrist. 2020a. Reform: Recognizing f-formations for social robots. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 11181–11188.
[67]
H. Hedayati, D. Szafir, and S. Andrist. 2019. Recognizing F-Formations in the open world. In 2019 14th ACM/IEEE International Conference on Human-Robot Interaction (HRI), 558–559.
[68]
Hooman Hedayati, Daniel Szafir, and James Kennedy. 2020b. Comparing F-formations between humans and on-screen agents. In Extended Abstracts of the 2020 CHI Conference on Human Factors in Computing Systems, 1–9.
[69]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33 (2020), 6840–6851.
[70]
Ziling Huang, Zheng Wang, Wei Hu, Chia-Wen Lin, and Shin’ichi Satoh. 2019. DoT-GNN: Domain-transferred graph neural network for group re-identification. In 27th ACM International Conference on Multimedia, 1888–1896.
[71]
Hayley Hung, Gwenn Englebienne, and Laura Cabrera Quiros. 2014. Detecting conversing groups with a single worn accelerometer. In 16th International Conference on Multimodal Interaction, 84–91.
[72]
Hayley Hung and Ben Kröse. 2011. Detecting f-formations as dominant sets. In 13th International Conference on Multimodal Interfaces, 231–238.
[73]
Hayley Hung, Chirag Raman, Ekin Gedik, Stephanie Tan, and Jose Vargas Quiros. 2019. Multimodal data collection for social interaction analysis in-the-wild. In 27th ACM International Conference on Multimedia, 2714–2715.
[74]
Helge Hüttenrauch, Kerstin Eklundh, Anders Green, and Elin Anna Topp. 2006. Investigating spatial relationships in human-robot interaction. 5052–5059. DOI:
[75]
Junko Ichino, Kazuo Isoda, Tetsuya Ueda, and Reimi Satoh. 2016. Effects of the display angle on social behaviors of the people around the display: A field study at a museum. In 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, 26–37.
[76]
Idiap Research Institute. 2021. Idiap Poster Data. Retrieved from https://www.idiap.ch/en/dataset/idiap-poster-data/idiap-poster-data
[77]
Hanbyul Joo, Tomas Simon, Xulong Li, Hao Liu, Lei Tan, Lin Gui, Sean Banerjee, Timothy Godisart, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. 2017. Panoptic studio: A massively multiview system for social interaction capture. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 1 (2017), 190–204.
[78]
Michiel P. Joosse, Ronald W. Poppe, Manja Lohse, and Vanessa Evers. 2014. Cultural differences in how an engagement-seeking robot should approach a group of people. In 5th ACM International Conference on Collaboration across Boundaries: Culture, Distance & Technology, 121–130.
[79]
Manuela Jungmann, Richard Cox, and Geraldine Fitzpatrick. 2014. Spatial play effects in a tangible game with an f-formation of multiple players. In 15th Australasian User Interface Conference-Volume 150, 57–66.
[80]
Takayuki Kanda, Dylan F Glas, Masahiro Shiomi, and Norihiro Hagita. 2009. Abstracting people's trajectories for social robots to proactively approach customers. IEEE Transactions on Robotics 25, 6 (2009), 1382–1396.
[81]
Daphne Karreman, Geke Ludden, Betsy van Dijk, and Vanessa Evers. 2015. How can a tour guide robot's orientation influence visitors’ orientation and formations? In 4th International Symposium on New Frontiers in Human-Robot Interaction.
[82]
Daphne Karreman, Lex Utama, Michiel Joosse, Manja Lohse, Betsy van Dijk, and Vanessa Evers. 2014. Robot etiquette: How to approach a pair of people? In 2014 ACM/IEEE International Conference on Human-Robot Interaction, 196–197.
[83]
Kleomenis Katevas, Hamed Haddadi, Laurissa Tokarchuk, and Richard G Clegg. 2016. Detecting group formations using iBeacon technology. In 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing: Adjunct, 742–752.
[84]
Adam Kendon. 1990. Conducting interaction: Patterns of behavior in focused encounters. Vol. 7. CUP Archive.
[85]
Adam Kendon. 2010. Spacing and orientation in co-present interaction. In Development of Multimodal Interfaces: Active Listening and Synchrony. Springer, 1–15.
[86]
Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Goldberg, Matthew Tancik, and Angjoo Kanazawa. 2024. GARField: Group anything with radiance fields. arXiv:2401.09419.
[87]
Yuki Kizumi, Koh Kakusho, Takeshi Okadome, Takuya Funatomi, and Masaaki Iiyama. 2012. Detection of social interaction from observation of daily living environments. In The 1st International Conference on Future Generation Communication Technologies. IEEE, 162–167.
[88]
Sai Krishna, Andrey Kiselev, and Amy Loutfi. 2017. Towards a method to detect F-formations in real-time to enable social robots to join groups. In HRI 2017.
[89]
Thibault Kruse, Amit Kumar Pandey, Rachid Alami, and Alexandra Kirsch. 2013. Human-aware robot navigation: A survey. Robotics and Autonomous Systems 61, 12 (2013), 1726–1743.
[90]
Hideaki Kuzuoka, Yuya Suzuki, Jun Yamashita, and Keiichi Yamazaki. 2010. Reconfiguring spatial formation arrangement by robot body orientation. In 2010 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 285–292.
[91]
Min Jung Lee, Chi-hyoung Rhee, and Chang Ha Lee. 2022. HSVNet: Reconstructing HDR image from a single exposure LDR image with CNN. Applied Sciences 12, 5 (2022), 2370.
[92]
Alon Lerner, Yiorgos Chrysanthou, and Dani Lischinski. 2007. Crowds by example. Computer Graphics Forum 26 (2007), 655–664.
[93]
Ruonan Li, Parker Porfilio, and Todd Zickler. 2013. Finding group interactions in social clutter. In IEEE Conference on Computer Vision and Pattern Recognition, 2722–2729.
[94]
Matthias Luber and Kai Oliver Arras. 2013. Multi-hypothesis social grouping and tracking for mobile robots. In Robotics: Science and Systems.
[95]
Nicolai Marquardt, Ken Hinckley, and Saul Greenberg. 2012. Cross-device interaction via micro-mobility and f-formations. In 25th Annual ACM Symposium on User Interface Software and Technology, 13–22.
[96]
Paul Marshall, Yvonne Rogers, and Nadia Pantidi. 2011. Using F-formations to analyse spatial patterns of interaction in physical environments. ACM, 445–454.
[97]
Takahiro Matsumoto, Mitsuhiro Goto, Ryo Ishii, Tomoki Watanabe, Tomohiro Yamada, and Michita Imai. 2018. Where should robots talk? Spatial arrangement study from a participant workload perspective. In 2018 ACM/IEEE International Conference on Human-Robot Interaction, 270–278.
[98]
David McNeill, Susan Duncan, Amy Franklin, James Goss, Irene Kimbara, Fey Parrill, Haleema Welji, Lei Chen, Mary Harper, Francis Quek, Travis Rose, and Ronald Tuttle. 2009. Mind merging. Expressing Oneself/Expressing One's Self: Communication, Language, Cognition, and Identity: Essays in Honor of Robert Krauss, 143–164.
[99]
Ross Mead, Amin Atrash, and Maja J. Matarić. 2013. Automated proxemic feature extraction and behavior recognition: Applications in human-robot interaction. International Journal of Social Robotics 5, 3 (2013), 367–378.
[100]
Ross Mead and Maja J. Matarić. 2011. An experimental design for studying proxemic behavior in human-robot interaction. Technical Report. Citeseer.
[101]
MMLAB. 2021. Multimedia Signal Processing and Understanding Lab. Retrieved from http://mmlab.science.unitn.it/USID/
[102]
Luis Yoichi Morales Saiki, Satoru Satake, Rajibul Huq, Dylan Glas, Takayuki Kanda, and Norihiro Hagita. 2012. How do people walk side-by-side? Using a computational model of human behavior for a social robot. In 7th annual ACM/IEEE International Conference on Human-Robot Interaction, 301–308.
[103]
Jonathan Mumm and Bilge Mutlu. 2011. Human-robot proxemics: Physical and psychological distancing in human-robot interaction. In 6th International Conference on Human-Robot Interaction, 331–338.
[104]
Kavin Preethi Narasimhan. 2011. Towards modelling spatial cognition for intelligent agents. In 29th Annual European Conference on Cognitive Ergonomics, 253–254.
[105]
Vishnu K. Narayanan, Anne Spalanzani, François Pasteau, and Marie Babel. 2015. On equitably approaching and joining a group of interacting humans. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 4071–4077.
[106]
Narrative. 2021. Narrative Clip 2. Retrieved from http://getnarrative.com/
[107]
Nhung Nguyen and Ipke Wachsmuth. 2011. From body space to interaction space: Modeling spatial cooperation for virtual humans. In The 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 3. International Foundation for Autonomous Agents and Multiagent Systems, 1047–1054.
[108]
Catharine Oertel, Kenneth A Funes Mora, Joakim Gustafson, and Jean-Marc Odobez. 2015. Deciphering the silent participant: On the use of audio-visual cues for the classification of listener categories in group discussions. In 2015 ACM on International Conference on Multimodal Interaction, 107–114.
[109]
Robert Fisher School of Informatics (Univ. of Edinburgh). 2021. BEHAVE: Computer-assisted prescreening of video streams for unusual activities. Retrieved from http://homepages.inf.ed.ac.uk/rbf/BEHAVE/
[110]
Jeni Paay, Jesper Kjeldskov, and Mikael B. Skov. 2015. Connecting in the kitchen: An empirical study of physical interactions while cooking together at home. In 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, 276–287.
[111]
Jeni Paay, Jesper Kjeldskov, Mikael B. Skov, and Kenton O’Hara. 2012. Cooking together: A digital ethnography. In CHI’12 Extended Abstracts on Human Factors in Computing Systems. 1883–1888.
[112]
Jeni Paay, Jesper Kjeldskov, Mikael B. Skov, and Kenton O’hara. 2013. F-formations in cooking together: A digital ethnography using youtube. In IFIP Conference on Human-Computer Interaction. Springer, 37–54.
[113]
Sai Krishna Pathi. 2018. Join the group formations using social cues in social robots. In 17th International Conference on Autonomous Agents and MultiAgent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 1766–1767.
[114]
Sai Krishna Pathi, Andrey Kiselev, Annica Kristoffersson, Dirk Repsilber, and Amy Loutfi. 2019. A novel method for estimating distances from a robot to humans using egocentric RGB camera. Sensors 19, 14 (2019), 3142.
[115]
Sai Krishna Pathi, Andrey Kiselev, and Amy Loutfi. 2017. Estimating F-formations for mobile robotic telepresence. In 2017 ACM/IEEE International Conference on Human-Robot Interaction (HRI ’17).
[116]
S. K. Pathi, A. Kristofferson, A. Kiselev, and A. Loutfi. 2019. Estimating optimal placement for a robot in social group interaction. In 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). 1–8. DOI:
[117]
Sai Krishna Pathi, Annica Kristoffersson, Andrey Kiselev, and Amy Loutfi. 2019. F-formations for social interaction in simulation using virtual agents and mobile robotic telepresence systems. Multimodal Technologies and Interaction 3, 4 (2019), 69.
[118]
Tomislav Pejsa, Michael Gleicher, and Bilge Mutlu. 2017. Who, me? How virtual agents can shape conversational footing in virtual reality. In International Conference on Intelligent Virtual Agents. Springer, 347–359.
[119]
S. Pellegrini, A. Ess, K. Schindler, and L. van Gool. 2009. You’ll never walk alone: Modeling social behavior for multi-target tracking. In 2009 IEEE 12th International Conference on Computer Vision. 261–268.
[120]
Samira Pouyanfar, Yimin Yang, Shu-Ching Chen, Mei-Ling Shyu, and SS Iyengar. 2018. Multimedia big data analytics: A survey. ACM Computing Surveys (CSUR) 51, 1 (2018), 1–34.
[121]
Pradip Pramanick, Hrishav Bakul Barua, and Chayan Sarkar. 2021. Robotic task planning for complex task instructions in natural language. US Patent App. 17/007,391.
[122]
Profs. 2014 modified. F-formation discovery in static images. Retrieved from http://profs.scienze.univr.it/cristanm/ssp/
[123]
Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. 2021. D-nerf: Neural radiance fields for dynamic scenes. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10318–10327.
[124]
Prarabdh Raipurkar, Rohil Pal, and Shanmuganathan Raman. 2021. HDR-cGAN: Single LDR to HDR image translation using conditional GAN. In 12th Indian Conference on Computer Vision, Graphics and Image Processing, 1–9.
[125]
Anoop Rajagopal, Ramanathan Subramanian, Radu Vieriu, Elisa Ricci, Oswald Lanz, Kalpathi Ramakrishnan, and Nicu Sebe. 2012. An adaptation framework for head-pose classification in dynamic multi-view scenarios. In Vol. 7725. 652–666. DOI:
[126]
Anoop Kolar Rajagopal, Ramanathan Subramanian, Elisa Ricci, Radu L. Vieriu, Oswald Lanz, Kalpathi R. Ramakrishnan, and Nicu Sebe. 2014. Exploring transfer learning approaches for head pose classification from multi-view surveillance images. International Journal of Computer Vision 109, 1–2 (2014), 146–167.
[127]
Chirag Raman and Hayley Hung. 2019. Towards automatic estimation of conversation floors within F-formations. In 2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), 175–181.
[128]
Chirag Raman, Stephanie Tan, and Hayley Hung. 2020. A modular approach for synchronized wireless multimodal multisensor data acquisition in highly dynamic social settings. In 28th ACM International Conference on Multimedia, 3586–3594.
[129]
Omar Adair Islas Ramírez, Giovanna Varni, Mihai Andries, Mohamed Chetouani, and Raja Chatila. 2016. Modeling the dynamics of individual behaviors for group detection in crowds using low-level features. In 2016 25th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 1104–1111.
[130]
Laurel Riek and Don Howard. 2014. A code of ethics for the human-robot interaction profession. Proceedings of We Robot.
[131]
Angelique Taylor and Laurel D. Riek. 2018. Robot-Centric Human Group Detection.
[132]
Jorge Rios-Martinez, Anne Spalanzani, and Christian Laugier. 2011. Understanding human interaction for probabilistic autonomous navigation using risk-RRT approach. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2014–2019.
[133]
RoboDK. 2021. Simulate Robot Applications. Retrieved from https://robodk.com/index
[134]
Alessio Rosatelli, Ekin Gedik, and Hayley Hung. 2019. Detecting F-formations & roles in crowded social scenes with wearables: Combining proxemics & dynamics using LSTMs. In 2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), 147–153.
[135]
Paolo Rota, Nicola Conci, and Nicu Sebe. 2012. Real time detection of social interactions in surveillance video. In European Conference on Computer Vision. Springer, 111–120.
[136]
Peter A. M. Ruijten and Raymond H. Cuijpers. 2017. Stopping distance for a robot approaching two conversating persons. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 224–229.
[137]
M. S. Ryoo and J. K. Aggarwal. 2009. Spatio-temporal relationship match: Video structure comparison for recognition of complex human activities. In 2009 IEEE 12th International Conference on Computer Vision, 1593–1600.
[138]
S. M. Bhagya P. Samarakoon, M. A. Viraj J. Muthugala, and A. G. Buddhika P. Jayasekara. 2018. Replicating natural approaching behavior of humans for improving robot's approach toward two persons during a conversation. In 2018 27th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 552–558.
[139]
Chayan Sarkar, Snehasis Banerjee, Pradip Pramanick, Hrishav Bakul Barua, Soumyadip Maity, Dipanjan Das, Brojeshwar Bhowmick, Ashis Sau, Abhijan Bhattacharyya, Arpan Pal, Balamuralidhar Purushothaman, and Ruddra Roy Chowdhury. 2021a. Knowledge partitioning for task execution by conversational tele-presence robots in a geographically separated environment. US Patent App. 17/015,238.
[140]
Chayan Sarkar, Hrishav Bakul Barua, Arpan Pal, Balamuralidhar Purushothaman, and Achanna Anil Kumar. 2021b. Attention shifting of a robot in a group conversation using audio-visual perception based speaker localization. US Patent App. 11,127,401.
[141]
Satoru Satake, Takayuki Kanda, Dylan F. Glas, Michita Imai, Hiroshi Ishiguro, and Norihiro Hagita. 2009. How to approach humans? Strategies for social robots to initiate interaction. In 4th ACM/IEEE International Conference on Human Robot Interaction, 109–116.
[142]
Ashis Sau, Ruddra Dev Roychoudhury, Hrishav Bakul Barua, Chayan Sarkar, Sayan Paul, Brojeshwar Bhowmick, Arpan Pal and Balamuralidhar Purushothaman. 2020. Edge-centric telepresence avatar robot for geographically distributed environment. arXiv:2007.12990.
[143]
Audrey Serna, Lili Tong, Aurélien Tabard, Simon Pageaud, and Sébastien George. 2016. F-formations and collaboration dynamics study for designing mobile collocation. In 18th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct, 1138–1141.
[144]
Francesco Setti, Hayley Hung, and Marco Cristani. 2013a. Group detection in still images by F-formation modeling: A comparative study. In 2013 14th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS). IEEE, 1–4.
[145]
Francesco Setti, Oswald Lanz, Roberta Ferrario, Vittorio Murino, and Marco Cristani. 2013b. Multi-scale F-formation discovery for group detection. In 2013 IEEE International Conference on Image Processing. IEEE, 3547–3551.
[146]
Francesco Setti, Chris Russell, Chiara Bassetti, and Marco Cristani. 2015. F-formation detection: Individuating free-standing conversational groups in images. PloS One 10, 5 (2015), e0123783.
[147]
Garima Sharma, Kalin Stefanov, Abhinav Dhall, and Jianfei Cai. 2022. Graph-based group modelling for backchannel detection. In 30th ACM International Conference on Multimedia, 7190–7194.
[148]
Chao Shi, Michihiro Shimada, Takayuki Kanda, Hiroshi Ishiguro, and Norihiro Hagita. 2011. Spatial formation model for initiating conversation. Proceedings of Robotics: Science and Systems VII (2011), 305–313.
[149]
Chao Shi, Masahiro Shiomi, Takayuki Kanda, Hiroshi Ishiguro, and Norihiro Hagita. 2015. Measuring communication participation to initiate conversation in human–robot interaction. International Journal of Social Robotics 7, 5 (2015), 889–910.
[150]
Francesco Solera, Simone Calderara, and Rita Cucchiara. 2015. Socially constrained structural learning for groups detection in crowd. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 5 (2015), 995–1008.
[151]
Rainer Stiefelhagen, Rachel Bowers, and Jonathan Fiscus. 2008. Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007, Revised Selected Papers. Vol. 4625. DOI:
[152]
Ramanathan Subramanian, Jagannadan Varadarajan, Elisa Ricci, Oswald Lanz, and Stefan Winkler. 2015. Jointly estimating interactions and head, body pose of interactors from distant social scenes. In 23rd ACM International Conference on Multimedia, 835–838.
[153]
Mason Swofford, John Peruzzi, Nathan Tsoi, Sydney Thompson, Roberto Martín-Martín, Silvio Savarese, and Marynel Vázquez. 2020. Improving social awareness through dante: Deep affinity network for clustering conversational interactants. Proceedings of the ACM on Human-Computer Interaction 4, CSCW1 (2020), 1–23.
[154]
Stephanie Tan, David M. J. Tax, and Hayley Hung. 2022. Conversation group detection with spatio-temporal context. In 2022 International Conference on Multimodal Interaction, 170–180.
[155]
Julian Tanke, Linguang Zhang, Amy Zhao, Chengcheng Tang, Yujun Cai, Lezi Wang, Po-Chen Wu, Juergen Gall, and Cem Keskin. 2023. Social diffusion: Long-term multiple human motion anticipation. In IEEE/CVF International Conference on Computer Vision, 9601–9611.
[156]
Yudong Tao, Samantha G. Mitsven, Lynn K. Perry, Daniel S. Messinger, and Mei-Ling Shyu. 2019. Audio-based group detection for classroom dynamics analysis. In 2019 International Conference on Data Mining Workshops (ICDMW). IEEE, 855–862.
[157]
Adriana Tapus, Antonio Bandera, Ricardo Vazquez-Martin, and Luis V. Calderita. 2019. Perceiving the person and their interactions with the others for social robotics–a review. Pattern Recognition Letters 118 (2019), 3–13.
[158]
Angelique Taylor and Laurel D. Riek. 2016. Robot perception of human groups in the real world: State of the art. In 2016 AAAI Fall Symposium Series.
[159]
Angelique Marie Taylor, Darren Chan, and Laurel Riek. 2019. Robot-centric perception of human groups. ACM Transactions on Human-Robot Interaction (THRI) (2019).
[160]
Sydney Thompson, Abhijit Gupta, Anjali W Gupta, Austin Chen, and Marynel Vázquez. 2021. Conversational group detection with graph neural networks. In 2021 International Conference on Multimodal Interaction, 248–252.
[161]
Lili Tong, Audrey Serna, Simon Pageaud, Sébastien George, and Aurélien Tabard. 2016. It's not how you stand, it's how you move: F-formations and collaboration dynamics in a mobile learning game. In 18th International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI ’16). ACM, New York, NY, 318–329. DOI:
[162]
Khai N. Tran, Apurva Bedagkar-Gala, Ioannis A. Kakadiaris, and Shishir K. Shah. 2013. Social cues in group formation and local interactions for collective activity analysis. In International Conference on Computer Vision Theory and Applications (VISAPP), 539–548.
[163]
Xuan Tung Truong, Yong Sheng Ou, and Trung-Dung Ngo. 2016. Towards culturally aware robot navigation. In 2016 IEEE International Conference on Real-time Computing and Robotics (RCAR). IEEE, 63–69.
[164]
Shih-Huan Tseng, Yen Chao, Ching Lin, and Li-Chen Fu. 2016. Service robots: System design for tracking people through data fusion and initiating interaction with the human group by inferring social situations. Robotics and Autonomous Systems 83 (2016), 188–202.
[165]
Jagannadan Varadarajan, Ramanathan Subramanian, Samuel Rota Bulò, Narendra Ahuja, Oswald Lanz, and Elisa Ricci. 2017. Joint estimation of human pose and conversational groups from social scenes. International Journal of Computer Vision 126 (July 2017). DOI:
[166]
Sebastiano Vascon, Eyasu Zemene Mequanint, Marco Cristani, Hayley Hung, Marcello Pelillo, and Vittorio Murino. 2014. A game-theoretic probabilistic approach for detecting conversational groups. In Asian Conference on Computer Vision. Springer, 658–675.
[167]
Sebastiano Vascon, Eyasu Z. Mequanint, Marco Cristani, Hayley Hung, Marcello Pelillo, and Vittorio Murino. 2016. Detecting conversational groups in images and sequences: A robust game-theoretic approach. Computer Vision and Image Understanding 143 (2016), 11–24.
[168]
Marynel Vázquez, Elizabeth J. Carter, Braden McDorman, Jodi Forlizzi, Aaron Steinfeld, and Scott E. Hudson. 2017. Towards robot autonomy in group conversations: Understanding the effects of body orientation and gaze. In 2017 12th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 42–52.
[169]
Marynel Vázquez, Alexander Lew, Eden Gorevoy, and Joe Connolly. 2022. Pose generation for social robots in conversational group formations. Frontiers in Robotics and AI 8 (2022), 703807.
[170]
Marynel Vázquez, Aaron Steinfeld, and Scott E. Hudson. 2015. Parallel detection of conversational groups of free-standing people and tracking of their lower-body orientation. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 3010–3017.
[171]
Araceli Vega-Magro, Luis Manso, Pablo Bustos, Pedro Núñez, and Douglas G. Macharet. 2017. Socially acceptable robot navigation over groups of people. In 2017 26th IEEE International Symposium on Robot and Human Interactive Communication (RO-MAN). IEEE, 1182–1187.
[172]
Computational Vision and Geometry Lab (CVGL) at Stanford. 2021. Discovering groups of people in images. Retrieved from http://cvgl.stanford.edu/projects/groupdiscovery/
[173]
TeV TECHNOLOGIES OF VISION. 2021a. Resources. Retrieved from http://tev.fbk.eu/resources
[174]
TeV TECHNOLOGIES OF VISION. 2021b. SALSA dataset. Retrieved from http://tev.fbk.eu/salsa
[175]
Lin Wang and Kuk-Jin Yoon. 2021. Deep learning for hdr imaging: State-of-the-art and future trends. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[179]
[181]
Alexander Wisowaty. 2019. Group Human-Robot Interaction: A Review.
[182]
Shih-An Yang, Edwinn Gamborino, Chun-Tang Yang, and Li-Chen Fu. 2017. A study on the social acceptance of a robot in a multi-human interaction using an F-formation based motion model. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2766–2771.
[183]
Naoyuki Yasuda, Koh Kakusho, Takeshi Okadome, Takuya Funatomi, and Masaaki Iiyama. 2014. Recognizing conversation groups in an open space by estimating placement of lower bodies. In 2014 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 544–550.
[184]
Mohammad Abu Yousuf, Yoshinori Kobayashi, Yoshinori Kuno, Keiichi Yamazaki, and Akiko Yamazaki. 2012. Establishment of spatial formation by a mobile guide robot. In 7th Annual ACM/IEEE International Conference on Human-Robot Interaction. 281–282.
[185]
Gloria Zen, Bruno Lepri, Elisa Ricci, and Oswald Lanz. 2010. Space speaks: Towards socially and personality aware visual surveillance. In 1st ACM International Workshop on Multimodal Pervasive Video Analysis, 37–42.
[186]
Lu Zhang and Hayley Hung. 2016. Beyond F-formations: Determining social involvement in free standing conversing groups from static images. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1086–1095.
[187]
Lu Zhang and Hayley Hung. 2021. On social involvement in mingling scenarios: Detecting associates of F-formations in still images. IEEE Transactions on Affective Computing 12, 1 (2021), 165–176. DOI:
[188]
Ji Zhu, Hua Yang, Weiyao Lin, Nian Liu, Jia Wang, and Wenjun Zhang. 2020. Group re-identification with group context graph neural networks. IEEE Transactions on Multimedia 23 (2020), 2614–2626.
