THÖR-MAGNI: A Large-scale Indoor Motion Capture Recording of Human Movement and Robot Interaction

Tim Schreiter*¹, Tiago Rodrigues de Almeida*¹, Yufei Zhu¹, Eduardo Gutierrez Maestro¹, Lucas Morillo-Mendez⁶, Andrey Rudenko², Luigi Palmieri², Tomasz P. Kucner³,⁴, Martin Magnusson¹, Achim J. Lilienthal¹,⁵

¹ Örebro University, Sweden. ² Robert Bosch GmbH, Corporate Research, Stuttgart, Germany. ³ Finnish Center for Artificial Intelligence (FCAI), Finland. ⁴ School of Electrical Engineering, Aalto University. ⁵ Technical University of Munich, Germany. ⁶ Independent Researcher, Spain. * These authors contributed equally to the work.

Corresponding author: Tim Schreiter, tim.schreiter@oru.se

Abstract

We present a new large dataset of indoor human and robot navigation and interaction, called THÖR-MAGNI, that is designed to facilitate research on social navigation: e.g., modelling and predicting human motion, analyzing goal-oriented interactions between humans and robots, and investigating visual attention in a social interaction context. THÖR-MAGNI was created to fill a gap in available datasets for human motion analysis and HRI. This gap is characterized by a lack of comprehensive inclusion of exogenous factors and essential target agent cues, which hinders the development of robust models capable of capturing the relationship between contextual cues and human behavior in different scenarios. Unlike existing datasets, THÖR-MAGNI includes a broader set of contextual features and offers multiple scenario variations to facilitate factor isolation. The dataset includes many social human-human and human-robot interaction scenarios, rich context annotations, and multi-modal data, such as walking trajectories, gaze tracking data, and lidar and camera streams recorded from a mobile robot. We also provide a set of tools for visualization and processing of the recorded data. THÖR-MAGNI is, to the best of our knowledge, unique in the amount and diversity of sensor data collected in a contextualized and socially dynamic environment, capturing natural human-robot interactions.

keywords:

Dataset for Human Motion, Human Trajectory Prediction, Human-Robot Collaboration, Social HRI, Human-Aware Motion Planning

1 Introduction

In recent years, the topics of human motion modeling, prediction and interaction with social and service robots have been rapidly growing, driven by industrial interests and a quest for safer algorithms in human-robot interaction settings. Various types of advanced automated systems, such as mobile robots (including autonomous vehicles), manipulators, and sensor networks, benefit from human motion models for safe and efficient operation in the presence of humans. Human motion data is central to human-aware path planning, collision avoidance, tracking, interaction, understanding human activities, and collaborating on shared tasks.

Modern approaches for modeling human motion require plentiful data recorded in diverse environments and settings to train on, as well as for the evaluation (Rudenko et al., 2020b). Among the growing numbers of human trajectory datasets, most focus on capturing interactions between the moving agents in indoor (Brščić et al., 2013), outdoor (Robicquet et al., 2016) and automated driving (Bock et al., 2020) settings. These datasets are designed to study how people interact and avoid collisions in social settings, by describing their motion through position and velocity information. Further datasets attempt to capture full-body motion in various activities and human-object interactions in household settings (Liu et al., 2019; Kratzer et al., 2020; Ehsanpour et al., 2022).

Human motion is influenced by many exogenous factors, which cumulatively amount to the context in which people move and interact. Among those are numerous environment factors: motion and activities of other people and robots, locations of obstacles, semantic attributes such as points of common interest, direction signs and special zones. Motion datasets should not only capture these factors to enable computational analysis of how people navigate, but also vary them systematically to support factor isolation in various conditions. Datasets with access to rich context can help to better explain, model and predict human motion.

Furthermore, beyond the environment context, there are various aspects of the specific person — target agent cues (Rudenko et al., 2020b) — which are helpful in better understanding their intention, ongoing activity, attention and distraction, preferences and abilities. These cues include head orientations and full body positions, gaze directions, social grouping and past activity patterns. Multi-modal approaches for human motion modeling and prediction can provide more accurate results by combining these cues (de Almeida et al., 2023), and their development is subject to the availability of high-quality multi-modal data.

Existing datasets in the field of human motion analysis often lack the comprehensive inclusion of the exogenous factors and the target agent cues necessary for holistic studies of human motion dynamics. This research gap hinders the development of robust models capable of capturing the relationship between contextual cues and human behavior in different scenarios. To address this gap, we present a novel dataset that not only incorporates a broader set of contextual features, but also contains multiple variations to support factor isolation. By integrating diverse modalities such as walking trajectories, eye tracking data, and environmental sensory inputs captured by a mobile robot (see Figure 1), our dataset fosters the exploration and analysis of human motion in various scenarios with increased fidelity and granularity.

Figure 1: THÖR-MAGNI data modalities. (1) walking trajectories of participants, in a workplace setting shared with other humans and robots; (2) lidar sweep recorded with a mobile robot; (3) snapshot from an eye tracker’s gaze overlay video; (4) fish-eye camera image from the mobile robot, showing object stashes and two goal points from our scenarios.

In this paper, we propose a novel dataset of accurate human and robot navigation and interaction in diverse indoor contexts, building on the previous THÖR dataset (Rudenko et al., 2020a). The THÖR dataset pioneered weakly-scripted scenario-based data collection with motion capture in a controlled environment, recording continuous activities, which involve meaningful social navigation towards randomized targets in the environment. Our new THÖR-MAGNI dataset extends this effort with rich context annotations, time-synchronized multi-modal data, human-robot interaction scenarios and diverse navigation modes of a mobile robot.

The THÖR-MAGNI data collection is designed around systematic variation of factors in the environment to allow building cue-conditioned models of human motion and verifying hypotheses on factor impact. To that end, we propose several scenarios in which the participants, in addition to basic navigation, need to move objects, interact with each other and the robot, and respond to remote instructions. The dataset includes differential and omnidirectional robot navigation, semantic zones and direction signs in the environment, and many further aspects. We provide position and head orientation for each moving agent, 3D lidar scans and gaze tracking. Finally, we provide tools to visualize the multiple modalities of the dataset and to preprocess the trajectory data. In total, THÖR-MAGNI captures 3.5 hours of motion of 40 participants over 5 days of recording, which is available for download (https://doi.org/10.5281/zenodo.10407223). Furthermore, we note the continuity between the THÖR and THÖR-MAGNI recordings due to their shared environment (in diverse configurations), motion capture system and complementary scenario composition.

In this paper, we motivate and detail the THÖR-MAGNI data collection and sensor setup, describe the interfaces to the dataset and compare it to the prior datasets. The paper is structured as follows: in Section 2 we review the prior state-of-the-art datasets and in Section 3 we outline the target application domains. Section 4 provides all necessary information about the data collection, and Section 5 describes the data formats and tools used to visualize and preprocess the data. Finally, Section 6 presents a quantitative evaluation of the collected data followed by a conclusion in Section 7.

2 Related Work

Multi-modal human motion datasets, including gait patterns, gaze vectors, human-robot interactions, and robot sensor data drive a wide range of research applications. These include human motion prediction (Rudenko et al., 2020b; Kothari et al., 2022), human motion representation for mobile robots (Kucner et al., 2023), human-robot interaction (Dahiya et al., 2023), human awareness in robot motion planning (Faroni et al., 2022; Heuer et al., 2023), and gaze-based prediction of human pose and locomotion mode (Zheng et al., 2022; Li et al., 2022).

Early datasets such as UCY (Lerner et al., 2007) and ETH (Pellegrini et al., 2009) have contributed significantly to our understanding of human movement in outdoor environments. Although these datasets encompass a range of human motion attributes such as trajectories, group identification, and goal points, social interactions play a minor role in shaping the human trajectories (Makansi et al., 2022). The indoor ATC dataset introduced by Brščić et al. (2013) represents a data collection with high coverage and tracking accuracy due to the use of 49 range sensors for raw data acquisition. The tracking method involved the independent estimation of positions and body orientations from each sensor, which were subsequently fused together. This fusion process increased the robustness of the primary estimates and ensured a high degree of accuracy in the resulting dataset. In contrast to the UCY and ETH datasets, our dataset contains many social interactions, as we always had multiple participants moving in the same space between goal points, which deliberately allowed for frequent interactions between participants (see Section 3.1). Furthermore, unlike the ATC dataset, we have included a mobile robot in the scene, which allows the study of human-robot interaction scenarios (see Section 3.4).

Munaro and Menegatti (2014), Dondrup et al. (2015), and Ehsanpour et al. (2022) have presented human motion datasets acquired through mobile robotic systems. While the datasets presented by Munaro and Menegatti (2014) and Dondrup et al. (2015) consist of short acquisitions and have limited contextual information such as maps or environmental goals, Ehsanpour et al. (2022) have contributed a more comprehensive dataset. Their dataset includes detailed annotations of micro-actions and social group dynamics, thereby offering a richer and more contextualized understanding of human motion patterns in diverse environments. However, in these datasets, human locations are based on detections in the sensor’s field of view onboard the mobile robot, which limits the scope of tracking due to occlusions. In contrast to these works, we used a motion capture system to track the moving agents (described in Section 4.5), which provides longer continuous tracking of each observed agent.

Kratzer et al. (2020) presented the MoGaze dataset, a notable advancement that incorporates a motion capture system for full-body pose tracking and eye-tracking data for humans engaged in various activities. Similarly, Chen et al. (2022) proposed a human tracking dataset for recording human-robot cooperation tasks in retail environments. However, neither dataset captures social interactions, as each tracks only one person. In addition, MoGaze does not include a mobile robot in the scene. The absence of these elements hinders the study of downstream applications, for instance robot motion planning methods in the “invisible robot” settings (Heuer et al., 2023), in which the humans do not react to the robot’s motion and location, but rather the full extent of collision avoidance falls on the robot. Similar to THÖR-MAGNI, the THÖR dataset introduced by Rudenko et al. (2020a) presents accurate human motion trajectories in the presence of a robot. While the THÖR dataset provides tracking accuracy in a socially dynamic environment, its limited recording duration (1 hour) poses challenges for in-depth studies, particularly concerning data-intensive deep learning-based methods for trajectory prediction.

In summary, the THÖR-MAGNI dataset, based on the protocol proposed by Rudenko et al. (2020a), overcomes the limitations of its predecessors. Table 1 shows a thorough comparison with well-established and recent datasets. THÖR-MAGNI contains 3.5 times more trajectory data than THÖR, therefore providing a broader range of situations for the analysis of human motion trajectories. In addition, THÖR-MAGNI includes sensor data recorded by a mobile robot. Furthermore, our dataset provides gaze vectors aligned with the corresponding trajectories, giving the opportunity to simultaneously analyze both modalities. This alignment not only enables studies of human-robot interaction but also facilitates in-depth analyses of the complex interplay between human visual attention and motion patterns. Finally, THÖR-MAGNI is, to the best of our knowledge, unique in the amount and diversity of sensor data collected in a contextualized and socially dynamic environment, capturing natural human-robot interactions.

Table 1: Comparison of human trajectory datasets.

Columns: Dataset; Environment; Sensors for Pose Estimation; Duration; Pose Frequency (Hz); Pose Annotation; Social Interactions; Robot in the Scene; Intended for HRI; Goals; Map; Robot Data; Other Data
UCY (Lerner et al., 2007); Street (outdoor); RGB camera; 20 min.; Continuous; Manual; ✓
ETH (Pellegrini et al., 2009); University and Hotel; RGB camera; 25 min.; 2.5; Manual; ✓
Edinburgh (Majecka, 2009); Forum (outdoor); RGB camera; 4 months; 6-10; Automated; ✓
Town Center (Benfold and Reid, 2011); Street (outdoor); RGB camera; 5 min.; Manual; ✓; Raw
VIRAT (Oh et al., 2011); Various outdoors; RGB camera; 29 h; 2, 5, 10; Manual; ✓; Raw; Human activities, agents types
Central station (Zhou et al., 2012); Train station; RGB camera; 34 min.; Automated; ✓
ATC (Brščić et al., 2013); Shopping Centre; Several 3D range sensors; 41 days; 10-30; Automatic; ✓
NBA SportVU 2013 (https://github.com/linouk23/NBA-Player-Movements); Basketball court; RGB camera; 20 days; Automatic; Multi-agent human activities
KTP (Munaro and Menegatti, 2014); Empty Room; RGB-D camera; 4.7 min.; Manual; ✓; RGB-D camera; Motion capture
KTH (Dondrup et al., 2015); Lab; RGB-D camera and 2D laser scanner; 2.7 h; Automatic; ✓; RGB-D camera and 2D laser scanner
UCLA Aerial Event Dataset (Shu et al., 2015); Outdoor spaces; RGB camera; 1.5 h; Automatic; ✓; Raw; Human roles, small and large objects location
SDD (Robicquet et al., 2016); University campus (outdoor); RGB camera; 5 h; Manual; ✓; Raw; Human activities
L-CAS (Yan et al., 2017); Office; 3D lidar; 49 min.; Manual; ✓; 3D lidar; Single-person, group labels
MoGaze (Kratzer et al., 2020); Lab; Motion capture; 3 h; 120; Ground truth; Human activities
Flobot (Yan et al., 2020); Public spaces (i.e., airport, warehouse, supermarket); 3D lidar and RGB-D camera; 27.5 min.; Automatic; ✓; 2D and 3D lidars, RGB-D and stereo cameras
THÖR (Rudenko et al., 2020a); Lab with various spatial layouts; Motion capture; 1 h; 100; Ground truth; ✓; 3D lidar; Aligned ET, Human activities
JRDB-Act (Ehsanpour et al., 2022); University campus (indoor and outdoor); Lidar and RGB camera; 1 h; 7.5; Automatic; ✓; Velodyne, several cameras (RGB and RGB-D); Human activities
Oxford-IHM (Finean et al., 2023); Lab/Office; Motion capture; 1 h; 100; Ground truth; ✓; RGB-D camera; Static RGB-D camera
THÖR-MAGNI (2024); Lab with various spatial layouts; Motion capture; 3.5 h; 100; Ground truth; ✓; 3D lidar, RGB and RGB-D cameras; Aligned ET, Several human activities

3 Context of the THÖR-MAGNI Dataset

The THÖR-MAGNI dataset provides diverse navigation styles of a mobile robot and humans engaged in various activities in a shared environment with robotic agents, and incorporates multi-modal data for a more complete representation. Following a comparative analysis of our dataset with state-of-the-art datasets contributing to the evolving landscape of human motion research (see Section 2), this section supports users of our dataset by providing a detailed exploration of its features in the context of human motion and robot navigation and interactions. We explain their significance in addressing identified gaps, before describing the dataset itself in Section 4.

3.1 Goal-directed Human Motion Trajectories

The presence of goal-directed human agents is crucial in the field of human motion prediction (Dendorfer et al., 2021; Zhao and Wildes, 2021; Chiara et al., 2022). Traditional approaches often depict human agents as rational entities, acting logically and moving towards specific goals or destinations (Ziebart et al., 2009). Real-world recordings commonly show this directional traffic flow, characterized by distinct goal points, often resulting in a consistent and linear motion with limited diversity. In our dataset, we include scenes with seven distinct goal points distributed over a larger spatial volume, and scenes where they are arranged in a more compact space (see Section 4). Goal points and static obstacles are positioned strategically to ensure that recorded trajectories are not only sufficiently long but also topologically diverse, i.e. covering a range of spatial arrangements and configurations. This approach allows for the inclusion of frequent interactions between the moving agents, contributing to a more comprehensive understanding of human motion dynamics.

3.2 Navigation of Heterogeneous Agents

Heterogeneous agents are dynamic entities that navigate with distinct motion patterns. This heterogeneity stems from various factors that affect the motion such as tasks and ongoing activities performed by the agent (de Almeida et al., 2023). For instance, several works have studied how humans move individually or as part of a social group (Moussaïd et al., 2010; Rudenko et al., 2018; Wang et al., 2022). It has been shown that humans can coordinate their movements as a group by following simple rules based on the visual perception of local motion (Boos et al., 2014). Previous research on the anatomy of leadership in collective behavior (Garland et al., 2018) describes human collective behavior as optimal coordination and leadership dynamics in various group scenarios. In particular, crowd dynamics are not only determined by physical constraints, but also significantly influenced by communicative and social interactions among individuals (Moussaïd et al., 2010). Autonomous driving datasets often highlight the motion of heterogeneous agents in mixed traffic (Chandra et al., 2019; Salzmann et al., 2020). In our dataset, we introduce roles for participants tailored for industrial tasks, such as navigating alone or in groups of different sizes, transporting various objects and interacting with a robot. This heterogeneous social setting provides a novel way to study how specific industrial roles influence human motion, aligning with the work conducted by de Almeida et al. (2023).

3.3 Navigation of a Robotic Agent

Human-aware robot motion planning is crucial for safe navigation in shared spaces, especially in narrow and crowded indoor environments (Cancelli et al., 2023). Understanding human interaction with robots of different driving styles promotes the design of socially acceptable motion planners (Möller et al., 2021). Analyzing participant behavior with robots of varied movement patterns reveals insights into how robot motion style affects human expectations (Karnan et al., 2022), guiding the development of robots that interact safely and are well-received by people (Shah et al., 2023). Our dataset features scenarios with a mobile robot in teleoperated and semi-autonomous modes and two driving styles: differential drive (forward, backward and turning) and omnidirectional mode (allowing the robot to drive in any direction while keeping its heading). This variety of motion modes (detailed in Section 4.2.2) extends the state-of-the-art datasets of teleoperated navigation, which feature a single driving style (Karnan et al., 2022).
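To make the difference between the two driving styles concrete, the following is a minimal kinematic sketch. It is an illustration only, not the controller used on the robot: the function names, the simple Euler integration step and the example velocities are all assumptions introduced here.

```python
import math

def differential_step(x, y, heading, v, omega, dt):
    """Differential drive: translate only along the heading, and rotate."""
    x += v * math.cos(heading) * dt
    y += v * math.sin(heading) * dt
    heading += omega * dt
    return x, y, heading

def omnidirectional_step(x, y, heading, vx, vy, dt):
    """Omnidirectional mode: move in any world-frame direction, heading unchanged."""
    return x + vx * dt, y + vy * dt, heading

# Both robots start at the origin facing +x and need to move sideways (+y).
omni_pose = omnidirectional_step(0.0, 0.0, 0.0, vx=0.0, vy=0.5, dt=1.0)

# The differential-drive robot has to turn first, then drive forward.
pose = differential_step(0.0, 0.0, 0.0, v=0.0, omega=math.pi / 2, dt=1.0)
pose = differential_step(*pose, v=0.5, omega=0.0, dt=1.0)

print(omni_pose)  # heading is still 0.0
print(pose)       # heading is now pi / 2
```

Both robots end up at roughly the same position, but only the differential-drive robot has changed its heading, which is the kind of visible difference in motion style that participants may react to.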

3.4 Spatial Human-Robot Interaction in Shared Workplace Settings

The concept of Industry 5.0 aims to prioritize human well-being in manufacturing systems (Leng et al., 2022). This requires enhancing the quality of human-machine and human-robot interactions in these environments. Designing robots that can clearly express their intentions to human collaborators is a crucial step towards fostering mutual understanding and enhancing the well-being of human workers who regularly interact with robots (Pascher et al., 2023). Furthermore, intuitive human-robot interaction (HRI) not only improves well-being but also enhances safety and efficiency in collaborative settings (Haddadin et al., 2011).

Spatial HRI (sHRI) and navigation in shared environments are research areas that have an inherent need for accurate datasets of human motion tracking and prediction (Rudenko et al., 2020a; Chen et al., 2022) and for robots that understand the underlying physical interactions between nearby agents and objects (Castri et al., 2022). Our dataset contains recordings of explicit interactions between a mobile robot and individuals in shared workplace settings. THÖR-MAGNI is a valuable resource for studying human responses to robotic approach and assistance initiatives, enabling researchers to analyze goal-oriented interactions between humans and robots.

3.5 Eye tracking and Head Orientation in Navigation Tasks

Eye tracking is a powerful method to study various aspects of human behavior, including attention, emotion, cognition, and decision-making, with applications spanning education, marketing, gaming, and healthcare (Duchowski, 2017). Eye tracking provides objective data about eye movements and positions and enables researchers to quantify visual information processing through various metrics (Duchowski, 2017; Mahanama et al., 2022). In HRI applications, human eye-gaze is an important nonverbal signal (Admoni and Scassellati, 2017). In our dataset, we align human gaze data with human motion trajectories, providing an opportunity to study human gaze during visual exploration across dynamic tasks, activities, and scenarios.
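As an illustration of what aligning gaze with trajectories can involve, the sketch below attaches the nearest gaze sample to each trajectory timestamp. It is one possible approach under stated assumptions, not the dataset's actual synchronization procedure: the sampling rates, the `max_gap` threshold, the function names and the sample labels are all hypothetical.

```python
from bisect import bisect_left

def nearest_index(sorted_times, t):
    """Return the index of the entry in sorted_times closest to t."""
    i = bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    # Choose the closer of the two neighbours (ties go to the earlier one).
    return i if sorted_times[i] - t < t - sorted_times[i - 1] else i - 1

def align_gaze_to_trajectory(traj_times, gaze_times, gaze_samples, max_gap):
    """For each trajectory timestamp, attach the nearest gaze sample, or None."""
    aligned = []
    for t in traj_times:
        j = nearest_index(gaze_times, t)
        close_enough = abs(gaze_times[j] - t) <= max_gap
        aligned.append(gaze_samples[j] if close_enough else None)
    return aligned

# Hypothetical timestamps in milliseconds: 100 Hz trajectory, 50 Hz gaze.
traj_times = [0, 10, 20, 30, 40]
gaze_times = [0, 20, 40]
gaze_samples = ["g0", "g1", "g2"]

aligned = align_gaze_to_trajectory(traj_times, gaze_times, gaze_samples, max_gap=6)
print(aligned)  # ['g0', None, 'g1', None, 'g2']
```

Leaving unmatched frames as `None` rather than interpolating keeps the gaps between the two streams explicit; whether to interpolate instead depends on the downstream analysis.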

Complementary to gaze, head orientation provides another essential modality of human behavior, closely related to gaze direction and attentional focus. Head orientation plays a vital role in joint attention, i.e. the coordination of attention between individuals focusing on the same point of interest (Tomasello, 2014). Furthermore, it is valuable for detecting interpersonal dynamics in multi-party interactions (Stiefelhagen and Zhu, 2002). Beyond its social implications, head orientation becomes a predictive indicator of walking motion goals (Holman et al., 2021) and can enhance human motion prediction through vision-based features (Salzmann et al., 2023). Using a state-of-the-art motion capture system and eye-tracking devices, our dataset provides highly accurate head poses and orientations aligned with the eye tracking data.

3.6 Semantic Environment Cues

Crucial environmental information, conveyed by semantic cues such as doors, stairs, floor markings, and signs, plays an important role in guiding humans and robots within a given space. These cues, combined with obstacle configurations, influence human interactions with the environment, leading to actions like detouring, bypassing, overtaking, and avoiding specific areas. In our dataset, we include different semantic cues, such as floor markings that indicate areas requiring caution and one-way passages that restrict the flow of motion to a single direction. In this way, we enable the exploration of navigation and interactions in semantically-rich environments. For instance, leveraging Maps of Dynamics (Kucner et al., 2023) allows the quantification of changes in motion patterns around these cues. This information, in turn, can be utilized to predict long-term human motion dynamics, as demonstrated by Zhu et al. (2023).
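A very crude way to look at motion patterns around such cues is to average the direction of motion per grid cell, for example to check whether flow through a one-way passage is predominantly in one direction. The sketch below is only a simplified stand-in and should not be read as the Maps of Dynamics method of Kucner et al. (2023); the function name, the cell size and the sample values are assumptions.

```python
import math
from collections import defaultdict

def mean_direction_per_cell(samples, cell_size):
    """Map each grid cell to the circular mean of motion directions (radians).

    samples is an iterable of (x, y, vx, vy) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    for x, y, vx, vy in samples:
        speed = math.hypot(vx, vy)
        if speed == 0.0:
            continue  # a stationary sample carries no direction
        cell = (math.floor(x / cell_size), math.floor(y / cell_size))
        # Accumulate unit direction vectors so that speed does not bias the mean.
        sums[cell][0] += vx / speed
        sums[cell][1] += vy / speed
    return {cell: math.atan2(sy, sx) for cell, (sx, sy) in sums.items()}

# Hypothetical samples: two moving along +x in one cell, one along +y in another.
samples = [
    (0.2, 0.3, 1.0, 0.0),
    (0.7, 0.1, 0.5, 0.0),
    (1.4, 0.5, 0.0, 1.0),
]
directions = mean_direction_per_cell(samples, cell_size=1.0)
print(directions)
```

Comparing such per-cell directions between recordings with and without a cue gives a first, coarse indication of how the cue changes the flow.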

4 Description of the THÖR-MAGNI Dataset

The THÖR-MAGNI dataset is a large-scale indoor motion capture recording of human movement and robot interaction. It consists of 52 four-minute recordings (runs) of participants performing various activities related to navigating alone and in groups, finding and transporting small and large objects, and interacting with robots. THÖR-MAGNI contains over 3.5 hours of motion data for 40 participants, including position, velocity, and head orientation. Eye tracking data is available for 16 of them, totaling 8.3 hours for eight activities (see Table 2). In 24 runs, THÖR-MAGNI also includes the robot sensor data of 3D point clouds from an Ouster lidar. Additionally, videos recorded by an Azure Kinect camera and a Basler fish-eye camera onboard a mobile robot are available on request.
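For readers who want to relate the position and velocity modalities, the sketch below derives velocities from consecutive positions by finite differences. It assumes the 100 Hz pose frequency listed for THÖR-MAGNI in Table 1 and positions expressed in metres; the function name, the data layout and the sample values are hypothetical and do not describe the dataset's file format.

```python
def finite_difference_velocity(positions, rate_hz=100.0):
    """Return per-interval (vx, vy) in m/s from a list of (x, y) in metres."""
    dt = 1.0 / rate_hz
    return [
        ((x1 - x0) / dt, (y1 - y0) / dt)
        for (x0, y0), (x1, y1) in zip(positions, positions[1:])
    ]

# Three hypothetical consecutive positions of one agent.
positions = [(0.00, 0.00), (0.01, 0.00), (0.02, 0.01)]
velocities = finite_difference_velocity(positions)
print(velocities)
```

A sequence of N positions yields N - 1 velocity estimates; in practice, raw motion capture positions are often smoothed before differencing, because differentiation amplifies noise.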

In this section, we detail the environment in which we recorded the data (Section 4.1), the navigation and task design for the participants and the robot (Section 4.2), interactive scenarios to emphasize the various contextual aspects of human motion (Section 4.3), participants’ background and priming (Section 4.4) and the technical implementation of the recording pipeline and collection of motion capture and eye tracking data (Section 4.5).

4.1 Environment Design

We conducted the data acquisition in a laboratory at Örebro University, the same as in the THÖR dataset (Rudenko et al., 2020a). There are two different configurations for the laboratory. One features a small but free-space environment (see Figure 2 left). The other resembles an industrial logistics setting and promotes frequent interactions between human and robotic co-workers (see Section 4.3). Both room configurations have seven goal positions, to drive purposeful human navigation through the available space, generating frequent interactions in the center. Additionally, we include several environmental layouts (i.e., obstacle maps) in the THÖR-MAGNI dataset, which vary the placement of static obstacles (robotic manipulators and tables) in the room to prevent walking between goals in a straight path. Apart from static obstacles, two robots are in the room: a static robotic arm near the podium and an omnidirectional mobile robot with a robotic arm on top (see Section 4.2.2).

4.2 Navigation and Interaction Design

The interaction and navigation design in THÖR-MAGNI extends the weakly-scripted motion recording procedure introduced in the THÖR dataset (Rudenko et al., 2020a). This procedure facilitates realistic motion in controlled settings, in which accurate ground truth motion capture and eye tracking data are collected using specialized equipment (see Figure 2 on the right). Our key idea is to assign meaningful activities and tasks to the recording’s participants, allowing them to concentrate on their continuous activity during which they freely move in the room shared with other people and robots. To generate a diverse range of interactions, we developed several scenes that vary in the composition of tasks, robot operation, and other contextual cues, as discussed in Section 3.

4.2.1 Tasks, Activities and Roles Requiring Search and Navigation

We aimed to simulate authentic scenes that reflect the different activities individuals perform in a workplace environment. To that end, we designed several tasks that require search, navigation, interaction with objects, other participants and a mobile robot. Participants engaged in those tasks according to their assigned role.

There are two types of roles in our dataset: Visitors and Carriers. Visitors navigate either individually (Visitors–Alone) or in groups of two (Visitors–Group 2) or three (Visitors–Group 3) between target points in the environment. The Visitors role includes a human-robot interaction component denoted by Visitors–Alone HRI, where participants interact with a robot in a joint navigation task (see Section 4.2.2). In addition, Carriers are involved in transporting various objects, including Carrier–Bucket, Carrier–Box, Carrier–Storage Bin HRI and Carrier–Large Object (see Figure 3). Carriers transport objects between pre-defined target points, and the objects themselves represent different levels of navigation difficulty, categorized as small (lowest difficulty), medium (medium difficulty), and large (highest difficulty).

Visitors used a card-based system to navigate, receiving new destinations each time they reached a designated goal point. At each goal point, a deck of cards was available, featuring instructions such as “Go to Goal 1”. The instructions could specify a new destination or contain instructions to go to the robot. In the case of Visitors–Alone, they drew a card and placed it at the bottom of the deck. Afterward, the participant moved to the destination. In the case of groups, the members could choose who would draw the card.
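The card procedure can be pictured as a rotating queue: the top card is read and then returned to the bottom of the deck. The short sketch below models this; the deck contents and the names are hypothetical and only mirror the form of the instructions described above.

```python
from collections import deque

def draw_card(deck):
    """Draw the top card and place it back at the bottom of the deck."""
    card = deck.popleft()
    deck.append(card)
    return card

# Hypothetical deck lying at one goal point.
deck_at_goal = deque(["Go to Goal 3", "Go to the robot", "Go to Goal 5"])

first = draw_card(deck_at_goal)
second = draw_card(deck_at_goal)
print(first, "|", second)
print(list(deck_at_goal))
```

Because drawn cards return to the bottom, a deck never runs out, and successive visitors to the same goal point receive different instructions until the deck has cycled once.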

Carriers were asked to transport objects of different shapes and sizes. These include small objects such as a blue plastic storage bin for the Carrier–Storage Bin HRI and plastic buckets of canned vegetables for the Carrier–Bucket, designed for easy, one-handed transportation. For the Carrier–Box, the participants had to move cardboard boxes as medium-sized objects. These boxes were filled with a few books, allowing for comfortable two-handed transportation. In addition, a collaborative effort involving two participants working as a group featured moving a large object, specifically a poster stand (Carrier–Large Object). This stand, equipped with four wheels, is thin and long and can be moved by two people working in tandem. The overall goal of this setup is to assess how different ongoing activities affect participants’ behavioral patterns, including factors such as gaze direction and movement.

Table 2: Amount of eye tracking and trajectory data recorded for various activities with all three devices: Tobii 2, Tobii 3 and Pupil Invisible glasses.

Columns: Activity; Eye tracking (min.); Trajectory data (min.)
Visitors–Alone; 108; 392
Visitors–Group 2; 124; 344
Visitors–Group 3; 168
Visitors–Alone HRI; 112
Carrier–Bucket
Carrier–Box
Carrier–Large Object; 192
Carrier–Storage Bin HRI
Total; 548; 1416

4.2.2 Modes of Robot Navigation and HRI

Our dataset includes a mobile robot, “DARKO” (https://darko-project.eu/; see Figure 4), which acts as a static obstacle in some scenes and moves in others. This range of behaviors enables the study of participants’ movements and gaze behaviors concerning the stationary and mobile status of the robot. In certain scenes, the robot was teleoperated and moved omnidirectionally, enabling it to reach any 2D position from a stationary position. In some, it moved directionally with a predetermined orientation (front). In others, the DARKO robot navigated semi-autonomously with manually set goal points. An experimenter supervised the navigation of DARKO for safety reasons. When acting semi-autonomously, the robot interacted with participants through a communication intermediary called the “Anthropomorphic Robot Mock Driver” (ARMoD).

The ARMoD is a small humanoid NAO robot, as shown in Figure 4. It was sitting on the DARKO robot. The ARMoD displayed two behaviors during interactions: one using only the voice (Verbal-Only HRI), and one using multi-modal features such as eye contact, robotic gaze, and pointing gestures to support the voice (Multi-modal HRI). This multi-modal style of interaction reduces fixations on the DARKO robot, increases focus on the ARMoD’s face, and leads to faster participant responses to instructions, effectively directing attention and improving the quality of communication with the robot (Schreiter et al., 2023).

4.3 Scenario Design

We address the context of agent movement by including both humans and robots, as previously discussed, in five specifically designed scenes we call “scenarios”. Scenario 1 captures the dynamics of motion because of semantic attributes of the environment and sets up a baseline for goal-directed social navigation. Scenario 2 adds role-specific motion for some participants navigating the environment. Subsequently, Scenario 3 explores the impact of different robot motion styles on these role-specific patterns. Figure 5 depicts a detailed overview of the room configuration and varying environmental layouts for Scenarios 1–3. Scenario 1’s conditions A and B capture regular social behavior in a static environment with and without additional floor markings and a one-way passage. Scenario 2 maintains the same layout as Scenario 1A but introduces individuals performing tasks, emulating industrial activities. Scenario 3 explores human-robot interactions by varying the driving modes of the mobile robot teleoperated by experimenters on a podium.

Transitioning to a smaller room configuration, we present two scenarios to explore human motion and intended interactions between humans and robots: Scenarios 4 and 5. Scenario 4’s participants engaged in intermittent interaction with a mobile robot. This robot communicated in two interaction styles through another entity to mediate joint navigation with participants toward goal points. In Scenario 5, the robots and a human co-worker collaborated actively in transporting small storage bins. For a comprehensive overview of roles and scenarios, see Figure 6.

We recorded multiple runs for each condition in Scenarios 1–5. Specifically, we recorded two runs per condition for Scenarios 1 and 3, two for Scenario 2, four per condition for Scenario 4, and four runs for Scenario 5. To counterbalance learning-based effects, we randomized the recording order of conditions for Scenarios 3 and 5. We implemented this methodical approach to ensure a broad and impartial exploration of the scenarios, capturing subtle interactions and behaviors in each setting.

4.3.1 Scenario 1: Capturing Motion Dynamics in the Environment

Scenario 1 comprises two conditions. Condition A involves static obstacles such as tables, stationary robots, and goal points. Condition B introduces floor markings and stop signs in a one-way corridor in addition to the elements present in condition A. We recorded condition B before condition A to avoid biasing the participants towards the floor markings and to capture their natural reaction. Baseline condition A provides a clean environment without any floor markings or stop signs, allowing for the study of participants’ motion patterns independently of these factors. This condition provides a foundation for understanding the effects of additional variables introduced in our other scenarios. Conditions A and B together enable the exploration of the impact of environmental cues on human motion (see Figure 7).

4.3.2 Scenario 2: Role-Specific Motion Patterns in Industrial Environments

Scenario 2 features the same environment layout as Scenario 1A (Figure 7 left). In addition to the goal-driven navigation (Visitors role), this scenario introduces people performing different tasks as Carriers. For each run, we assign new roles to the participants. One participant carries small objects (i.e., buckets), and another carries medium objects (i.e., boxes) between two goal points. Finally, two participants move a large object (i.e., a poster stand). We use Discord (a free and easy-to-use communication and collaboration platform, https://discord.com) to instruct one member of the two-person team responsible for moving the large object. The usage of Discord enabled the dynamic allocation of new goal points and facilitated the coordination of participants’ movements in this industrial context.

In summary, this scenario presents role-specific tasks for participants and goal-driven navigation, creating a platform to study the impact of human occupation on their motion profiles and those of the other agents in a shared environment.

4.3.3 Scenario 3: Impact of Mobile Robot Motion on Human Behavior

With Scenario 3, we introduce an opportunity to study the interplay between human activities and a mobile robot. In this scenario, the stationary DARKO robot of Scenarios 1 and 2 becomes mobile, which allows us to explore changes in the humans’ motion patterns based on the mobile robot’s driving style. This scenario comprises two conditions, in which we modulated the way the mobile robot navigates: condition A, where the robot’s motion always has a designated direction using directional differential-drive kinematics (see Figure 8 bottom left), and condition B, where it can drive in any direction, i.e., omnidirectionally (see Figure 8 bottom right), using its mecanum wheels (see Figure 8 top). In both conditions, the roles of the participants remain the same as in Scenario 2, and a human operator controls the mobile robot using a remote controller to ensure the safety of the participants. Besides allowing for the study of human activities in the presence of a mobile robot, this setup also provides insights into how varying robot motion styles impact human behavior.

4.3.4 Scenario 4: Spatial HRI, Navigation in a Shared Environment

This scenario includes participants with the roles of Visitors–Alone HRI and Visitors–Group 2, who freely move around a shared environment alongside the DARKO robot that navigates semi-autonomously. The robot moved autonomously, with the restriction of being supervised by an experimenter who could intervene and halt its movements via a controller.

Participants assigned the role of Visitors–Alone HRI received instructions regarding a joint navigation task and engaged in interactions with the ARMoD. These participants took part in two conditions based on the interaction styles outlined in Section 4.2.2: a Verbal-Only interaction style in condition A and a multi-modal one in condition B. Depending on the interaction style and the distance between goals, one interaction lasted around 30–40 $\mathrm{s}$. If too many participants were at a goal point, the experimenter interrupted the mobile robot’s autonomous navigation shortly before reaching the goal. If interrupted prematurely, the mobile robot told the participants to abort the interaction and continue drawing cards. The mobile robot finished navigating autonomously to the goal point once it was less crowded.

Participants move either individually or in pairs between designated goal points. A specific card directs the individual participants (Visitors–Alone HRI) to approach the ARMoD and await further guidance. Visitors–Group 2 are instructed to disregard this card. The experimenter controls ARMoD’s behavior and sets the mobile robot’s next goal point. Upon participants’ arrival, ARMoD greets them and leads them jointly to the next goal point, where participants draw another card.

To ensure safe and seamless interactions, ARMoD’s behaviors are triggered by an experimenter using a controller (see Figure 9 left). The experimenter initiates actions like “Greet closest participant” and “Talking to participant”, guiding ARMoD’s communication with participants. Concurrently, the mobile robot continues its autonomous navigation, albeit under the oversight of the experimenter, who can pause its movements if necessary.

Accurate tracking of individuals was essential for facilitating seamless interactions between ARMoD and participants. To determine the ARMoD’s position relative to individuals at any given moment, we leveraged the motion capture system’s data, broadcast into the local network using the “Robot Operating System (ROS)” (Quigley et al., 2009). This integration ensured precise transformations and provided position and orientation information, enabling ARMoD to accurately point, look, and establish eye contact with its interaction partners. Figure 9 right illustrates an interaction between a participant and ARMoD in this scenario. The position and orientation data of participants, robots, and the world frame are broadcast within the local network, providing essential information to the path planner for DARKO and the interaction scheduler for ARMoD. In this figure, examples of established coordinate frames include (1) that of the participants’ helmets, defined based on the orientation of the markers, (2) a static coordinate frame for the ARMoD derived from the DARKO robot’s frame through an offset, (3) DARKO’s coordinate frame, and (4) the motion capture reference frame called the “QTM-World Frame”.
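The frame chain described above can be illustrated with a minimal sketch that expresses a helmet position in the ARMoD frame. It is planar (x, y, yaw) for brevity, all numeric values and function names are hypothetical, and it is not the ROS-based transformation pipeline used during the recordings.

```python
import math

def make_pose(x, y, yaw):
    """3x3 homogeneous transform of a child frame expressed in its parent frame."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]]

def compose(a, b):
    """Matrix product a @ b of two 3x3 transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def invert(t):
    """Analytic inverse of a planar rigid transform: [R^T, -R^T p]."""
    c, s, x, y = t[0][0], t[1][0], t[0][2], t[1][2]
    return [[c, s, -(c * x + s * y)], [-s, c, s * x - c * y], [0.0, 0.0, 1.0]]

def apply(t, point):
    """Transform a 2D point with a 3x3 homogeneous transform."""
    x, y = point
    return (t[0][0] * x + t[0][1] * y + t[0][2],
            t[1][0] * x + t[1][1] * y + t[1][2])

# Hypothetical values: DARKO pose in the QTM world frame and an assumed,
# fixed ARMoD seat offset (not the real mounting geometry).
world_T_darko = make_pose(2.0, 1.0, math.pi / 2)
darko_T_armod = make_pose(0.3, 0.0, 0.0)
world_T_armod = compose(world_T_darko, darko_T_armod)

# Helmet position in the world frame, re-expressed in the ARMoD frame.
helmet_world = (2.0, 3.3)
helmet_in_armod = apply(invert(world_T_armod), helmet_world)

# Bearing the ARMoD would need to look or point along (radians).
bearing = math.atan2(helmet_in_armod[1], helmet_in_armod[0])
```

In this example the participant stands two metres straight ahead of the ARMoD, so the bearing is approximately zero. A full implementation would use 3D poses and the helmet orientation as well.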

This scenario investigates free movement in a shared environment alongside the DARKO robot, exploring semi-autonomous navigation. Participants engaged in interactions with ARMoD under varied conditions. These allow for a study of human-robot interactions, navigation tasks, and the impact of different interaction styles on participants’ activities and movements.

4.3.5 Scenario 5: Spatial Human Robot Interaction, Proactive Robotic Assistance

This scenario involves the roles Visitors–Alone, Visitors–Group 2, and Carrier–Storage Bin HRI. The first two navigate between goal points by drawing cards. The Carrier–Storage Bin HRI takes on the role of a factory worker responsible for transporting storage bins and interacting with DARKO through ARMoD. The experimenter controlled ARMoD’s behavior and supervised DARKO’s motion for safety. During the interaction, ARMoD proactively offered assistance to the Carrier–Storage Bin HRI, informing them of the option to place a small storage bin on the mobile robot. If participants accepted, they could place the small storage bin on the DARKO robot for transportation between two designated points. The procedure described in Section 4.3.4 enabled reliable perception of both human and robot positions for this scenario. This scenario features proactive assistance from a mobile robot to a human worker in a simulated factory environment.

4.4 Participants Background and Priming

The average age of the participants was 30.18 years, with a standard deviation of 6.73, indicating a relatively homogeneous age group. The dataset includes 40 participants with a balanced gender distribution: 21 male and 19 female. Geographically, 23 participants are from Sweden. Another 10 come from other European countries, including the Czech Republic, Spain, Germany, and Italy, reflecting a diverse European representation. The remaining 7 participants come from countries on other continents like Asia, Africa, and South America, providing a broader international scope. We recruited the participants from different areas of the campus. Their backgrounds varied considerably, including differences in their highest academic degree and primary subjects. At the beginning of each recording day, participants completed a demographic questionnaire. We used this information to create diverse group compositions, aiming for optimal allocation of eye tracking devices across different roles (see Figure 10). For example, we ensured that groups of two or three participants contained only one participant equipped with an eye tracker and that at least one of the carriers was equipped with an eye tracker.

During the data collection procedure, we guided the participants through a series of runs with specific instructions tailored to each scenario. Between successive runs, participants completed questionnaires while we made logistical preparations, such as removing floor markings, configuring a phone for voice chat using Discord (before Scenarios 2 and 3), monitoring and, if necessary, changing the batteries of eye trackers, and preparing the robots for Scenarios 3–5. After completing the questionnaire, participants were assigned new roles in Scenarios 2 and 3. We gave each group a new starting point for the next run, from which they drew their first card. Participants unfamiliar with their roles got a brief recap of their task-related responsibilities. In Scenario 3, we informed participants that an experimenter monitored the robot’s motion for safety and teleoperated the robot. In Scenarios 4 and 5, participants were first briefed about their roles in the scenario (see Section 4.3) and then introduced to the ARMoD and the DARKO robot as co-workers in the room, with the ARMoD acting as a communicator on behalf of the DARKO robot.

After each run, participants completed the raw version of the NASA Task Load Index (RTLX) (Hart and Staveland, 1988; Hart, 2006). The scale consisted of a 21-point set of subscales [1 = low; 21 = high], each of which assessed the mental demand, physical demand, temporal demand, and effort produced by the task as reported by the participant, as well as their self-perceived performance and frustration. After each session of the last run of Scenarios 3 or 5, participants completed two additional mobile robot questionnaires. First, they completed the Godspeed Questionnaire Series (Bartneck et al., 2009), a semantic differential set of subscales [5-point] that measures participants’ perceptions of the robot in terms of anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety. Second, they completed a 5-point Likert scale [1 = strongly disagree; 5 = strongly agree] to assess trust in the robot in industrial human-robot collaborations (Charalambous et al., 2016). Participants completed all questionnaires on paper.

4.5 System Setup

4.5.1 Hardware and Software Configuration

We used a motion capture system from Qualisys with ten infrared cameras (Oqus 7+) positioned around the room to track moving agents. The system provided comprehensive coverage of the room volume. Reflective markers were arranged in distinct patterns on bicycle helmets to enable tracking with six degrees of freedom (6DoF). These were tracked at $100\text{\,}\mathrm{Hz}$ with a spatial resolution of $1\text{\,}\mathrm{mm}$ . The coordinate frame of the system had its origin at ground level in the center of the room. Each participant and the robot are represented as unique rigid bodies (identifiable through the group of passive reflective markers arranged in specific patterns) in the system. This configuration enabled the precise capture of the 6DoF head position and orientation for each participant. We provided the participants with individualized helmets for the recording sessions. The specific helmet IDs used during each recording session are listed in Tables 3, 4, and 5 in the Appendix.

We captured eye tracking data using three distinct models of eye tracking devices: Tobii Pro Glasses 2 and 3, and Pupil Invisible. The Tobii Glasses models record raw gaze data at a frequency of $50\text{\,}\mathrm{Hz}$ and camera footage at $25\text{\,}\mathrm{Hz}$ , while the Pupil Glasses record gaze data at $100\text{\,}\mathrm{Hz}$ and camera footage at $30\text{\,}\mathrm{Hz}$ . To export Tobii Glasses data, we used the I-VT Attention filter, which is optimized for dynamic situations, to classify gaze points into fixations and saccades based on a velocity threshold of $100\degree/s$ . All eye trackers are equipped with an IMU that comprises an accelerometer and a gyroscope operating at $100\text{\,}\mathrm{Hz}$ . In addition, the Tobii Glasses 3 are equipped with a magnetometer that operates at $10\text{\,}\mathrm{Hz}$ . The infrared cameras in these devices capture the human gaze, which is then superimposed onto the 2D video of the scene cameras. The Pupil Invisible Glasses’ scene camera has a resolution of $1088\times 1080$ pixels, with both horizontal and vertical field of view (FOV) angles measuring $80\degree$ . In contrast, the Tobii Glasses offer a resolution of $1920\times 1080$ pixels. The Tobii 3 Glasses feature FOV angles of $95\degree$ horizontally and $63\degree$ vertically, while the FOV of the Tobii 2 Glasses is $82\degree$ horizontally and $52\degree$ vertically.
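To make the velocity-threshold idea concrete, the following is a simplified sketch of velocity-based gaze classification. It labels each sample from the angular velocity between consecutive 3D gaze directions. It is not Tobii's I-VT Attention filter, which applies additional processing steps omitted here; the function names and the synthetic samples are assumptions for illustration.

```python
import math

def angle_deg(u, v):
    """Angle between two 3D direction vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding errors
    return math.degrees(math.acos(cos))

def classify_ivt(gaze_dirs, rate_hz=50.0, threshold_deg_s=100.0):
    """Label each sample 'fixation' or 'saccade' from sample-to-sample velocity."""
    labels = []
    for i, direction in enumerate(gaze_dirs):
        prev = gaze_dirs[i - 1] if i > 0 else direction
        velocity = angle_deg(prev, direction) * rate_hz  # degrees per second
        labels.append("saccade" if velocity > threshold_deg_s else "fixation")
    return labels

def dir_from_yaw(deg):
    """Hypothetical helper: unit gaze direction rotated about the vertical axis."""
    r = math.radians(deg)
    return (math.sin(r), 0.0, math.cos(r))

# Synthetic 50 Hz samples: slow drift, a fast 5-degree-per-sample jump, then rest.
samples = [dir_from_yaw(a) for a in (0.0, 0.5, 1.0, 6.0, 11.0, 11.2)]
labels = classify_ivt(samples)
```

With the $50\text{\,}\mathrm{Hz}$ rate and the $100\degree/s$ threshold stated above, a change of more than two degrees between consecutive samples is labeled as a saccade in this sketch.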

The DARKO robot integrates several sensors, including an Ouster OS0-128 lidar, two Azure Kinect RGB-D cameras (one of which was used in these recordings), two Basler fish-eye RGB cameras, and two Sick MicroScan 2D safety lidars. The Azure Kinect cameras have a resolution of $2048\times 1536$ at $6\text{\,}\mathrm{Hz}$ , a horizontal field of view of $75\degree$ , and a tracking range of up to $5\text{\,}\mathrm{m}$ . The Basler fish-eye RGB cameras have a resolution of $1700\times 1536$ at $20\text{\,}\mathrm{Hz}$ . The DARKO robot is augmented with a NAO robot acting as ARMoD for participant interaction. The NAO is attached to a seat on the DARKO robot, facilitating the communication of spatial motion intent. This arrangement aligns the ARMoD’s body orientation with the direction of movement in scenarios where DARKO employs a directional driving style.

Recordings from the DARKO robot and the motion capture system were synchronized using ROS timestamps. Taking advantage of the integration of the motion capture system with ROS 1 Melodic, we recorded all of the robot’s on-board sensor data and the 6DoF positions of the people using ROS bag files and in text form.

4.5.2 Sensor Calibration

The precision of the data acquisition relied on sensor calibration procedures to ensure accurate measurements and reliable data interpretation throughout the experiments. In this section, we describe our calibration methods for the motion capture system and the eye tracking devices, each of which followed a separate routine. These routines support the robustness and reliability of our dataset, allowing for accurate analysis and interpretation of participants’ behaviors and interactions within the recorded scenarios.

For the eye tracking devices, we followed the calibration procedures for both Tobii Glasses models (see Figure 11) as outlined in their respective user manuals to optimize eye tracking accuracy (see Tobii AB (Accessed: 2024-02-02a) and Tobii AB (Accessed: 2024-02-02b)). This process involved positioning a calibration target, ensuring its visibility, and having participants focus on its center. To ensure accurate recordings with the Pupil Invisible Glasses, we followed the best calibration practices outlined by Pupil Labs and validated the calibrations with their dedicated software (Pupil Labs AB, Accessed: 2024-02-02b).

To ensure data accuracy of the motion capture system, rigorous daily calibration routines were performed prior to the start of each recording session. We used the standard calibration kit with a $502.2\text{\,}\mathrm{mm}$ carbon fibre wand to fine-tune the system. These calibrations allowed us to define precise rigid bodies that enabled 6DoF tracking. This approach ensured the accurate capture of spatial dimensions (X, Y, Z) and rotational elements (roll, pitch, yaw) of objects within the 3D environment, resulting in an average residual tracking error of $2\text{\,}\mathrm{mm}$ . Rigid bodies of helmets and objects, such as the large object for the carriers or the DARKO robot, were strategically designed to enable simultaneous and highly accurate capture of all object poses and locations.

4.6 Post Processing

Multi-modal data synchronization was necessary in our data collection. We used ROS and custom Python scripts to align the data streams while maintaining temporal integrity. To achieve synchronicity between the motion capture and eye tracking data, we strategically placed custom events associated with precise timestamps in the two data streams using the respective software of the eye tracking devices such as Tobii Pro Lab (Tobii AB, Accessed: 2024-02-02c) and Pupil Player (Pupil Labs AB, Accessed: 2024-02-02a) as well as the Qualisys Track Manager (QTM) (Qualisys AB, Accessed: 2024-02-02) for the motion capture system. This procedure resulted in CSV files where all modalities’ timestamps are synchronized on the motion capture system’s timestamp. Within these files, eye tracking data is available for frames where all markers of the rigid body are tracked by the motion capture system, as it is a prerequisite to determine the 3D gaze vector using a correct head orientation. The frame numbers for each respective eye tracker’s scene recording are indexed in the column named “SceneFNr” in the corresponding CSV file.
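As an illustration of event-based alignment, the sketch below estimates a clock offset from shared events and maps each eye tracking sample to the nearest motion capture frame. It assumes a single constant offset without clock drift, uses hypothetical timestamps in milliseconds, and is not the procedure implemented in Tobii Pro Lab, Pupil Player, or QTM.

```python
from bisect import bisect_left
from statistics import mean

def estimate_offset(events_mocap, events_eye):
    """Average clock offset (mocap minus eye tracker) over matched event pairs."""
    return mean(m - e for m, e in zip(events_mocap, events_eye))

def nearest_index(sorted_times, t):
    """Index of the entry in sorted_times that is closest to t."""
    i = bisect_left(sorted_times, t)
    if i == 0:
        return 0
    if i == len(sorted_times):
        return len(sorted_times) - 1
    return i if sorted_times[i] - t < t - sorted_times[i - 1] else i - 1

def align(eye_times, mocap_times, offset):
    """Map each eye tracker timestamp to the index of the nearest mocap frame."""
    return [nearest_index(mocap_times, t + offset) for t in eye_times]

# Hypothetical data in milliseconds: mocap frames at 100 Hz, gaze samples at 50 Hz.
mocap_times = list(range(0, 1000, 10))
events_mocap = [200, 800]   # the same two events as seen by the mocap clock
events_eye = [50, 650]      # ... and by the eye tracker clock
offset = estimate_offset(events_mocap, events_eye)

eye_times = [0, 20, 40]
frames = align(eye_times, mocap_times, offset)
```

Here the eye tracker clock lags the motion capture clock by 150 ms, so the first three gaze samples map to motion capture frame indices 15, 17, and 19.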

To facilitate a thorough analysis of the eye tracking data in our study, we offer access to the raw data from the Tobii glasses, along with essential synchronization details. To ensure data protection, the scene recordings are provided in a blurred format with the audio data removed. Access to the raw data from the Pupil Invisible glasses can be granted upon individual request, ensuring careful and ethical distribution of sensitive data.

An extensive post-processing stage, including the above-mentioned synchronization and alignment, followed the data acquisition. Its aim was to refine and validate the collected data as well as to ensure the protection of sensitive data. This stage involved several key procedures, such as eliminating artifacts and noise caused by marker occlusion, lighting variations, and camera disruptions. We also rectified misidentified trajectories through a combination of spatial and temporal consistency evaluations, with manual adjustments applied when needed.

5 Working with the THÖR-MAGNI Dataset

5.1 Data Formats

For the purpose of dissemination, the dataset has been categorized into five recording scenarios (see Section 4 for a detailed description), aligning with the respective days of data collection. Each scenario’s data is organized into separate folders. Within each folder, multiple acquisitions conducted over the five days of recording are stored. The folders corresponding to the first three scenarios (1–3) contain acquisitions from four days (in May 2022), while the folders representing the last two scenarios (4 and 5) encompass recordings from one day (in September 2022). To enhance the diversity of motion data in the recordings and mitigate random artifacts, we recorded multiple runs for each scenario and condition. It is essential to note that all files are intended to be extracted into a common directory. In this way, the arrangement preserves the temporal structure of the recorded data.

Each run’s data includes a CSV file and up to two .mp4 videos representing the recordings from the scene cameras of the Tobii eye trackers. If the robot was in motion (Scenarios 3–5), the data also includes continuous 3D point clouds from the Ouster lidar as well as the RGB videos from one of the fish-eye cameras. The structure of the recorded data is shown in Tables 3, 4, and 5 in the Appendix. In the following subsections, we provide more specific details on the usage and processing of the individual files.

5.1.1 Comma-Separated Value Files

Each CSV file contains a header with important metadata, including the number of frames of the recording, rigid body and marker details, units of measurement, role labels, and eye tracking specifics (see Table 6 in the Appendix). The rest of each CSV file contains the merged data from the motion capture system and the eye tracking devices, organized based on the rigid bodies of the participants’ helmets. Thus, the data of each rigid body is organized into columns containing the XYZ coordinates of all markers (e.g., “Helmet_1 – 2 X” indicating the data for helmet one, marker two, and axis X), the XYZ coordinates of the centroid of all markers, the 6DoF orientation of the rigid body’s local coordinate frame, and, if available, eye tracking data including 2D gaze coordinates, 3D gaze vectors, frame number of scene recording, eye movement types (such as saccades or fixations), and IMU data (accelerometer, gyroscope, and magnetometer).

Missing data is indicated by either “N/A” (not available) or an empty cell. The temporal indexing in these files is provided by the “Time” or “Frame” column, which indicates the timestamp or frame number of the motion capture system, respectively.
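The following minimal sketch shows one way to handle the two missing-data markers when reading the data rows. It assumes that the metadata header has already been skipped and that the first remaining row holds the column names; the sample content and the column name written with an ASCII hyphen are illustrative assumptions, not an excerpt from the dataset.

```python
import csv
import io

MISSING = {"", "N/A"}

def parse_cell(text):
    """Convert a CSV cell to float, mapping both missing-data markers to None."""
    text = (text or "").strip()
    return None if text in MISSING else float(text)

def read_rows(handle):
    """Yield one dict per motion capture frame with numeric values or None."""
    for row in csv.DictReader(handle):
        yield {name: parse_cell(value) for name, value in row.items()}

# Hypothetical sample with one marker coordinate column.
sample = io.StringIO(
    "Frame,Time,Helmet_1 - 2 X\n"
    "1,0.00,1234.5\n"
    "2,0.01,N/A\n"
    "3,0.02,\n"
)
rows = list(read_rows(sample))

# Frames in which the marker coordinate is actually available.
tracked = [r["Frame"] for r in rows if r["Helmet_1 - 2 X"] is not None]
```

Mapping both markers to a single sentinel keeps later filtering and interpolation steps independent of how a gap was encoded in the file.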

5.1.2 Robot sensor data

The sensor data from the robot includes lidar data and videos captured by the Azure Kinect camera and the Basler camera. Lidar 3D point clouds are provided in the Point Cloud Data (PCD) file format, with one file per timestamp. The lidar data for each run is supplied in a zip file, which is labeled with the same File ID as referenced in Tables 3, 4, and 5 in the Appendix. In terms of video data, both the RGB-D and fish-eye camera video streams are unrectified, providing raw visual data, and are only available upon request to ensure suitable data protection.
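For readers unfamiliar with the PCD format, the sketch below parses a small ASCII-encoded PCD file. Whether the released files are ASCII or binary is not stated here, so the ASCII encoding is an assumption; for binary files, an established point cloud library is the more appropriate choice. The sample content is hypothetical.

```python
def read_ascii_pcd(lines):
    """Parse an ASCII PCD file given as an iterable of lines.

    Returns the list of field names and a list of point tuples.
    """
    header = {}
    points = []
    in_data = False
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if in_data:
            points.append(tuple(float(v) for v in line.split()))
            continue
        key, _, value = line.partition(" ")
        header[key] = value
        if key == "DATA":
            if value != "ascii":
                raise ValueError("only ASCII PCD is handled in this sketch")
            in_data = True
    return header["FIELDS"].split(), points

# Hypothetical two-point cloud.
sample = """# .PCD v0.7 - Point Cloud Data file format
VERSION 0.7
FIELDS x y z
SIZE 4 4 4
TYPE F F F
COUNT 1 1 1
WIDTH 2
HEIGHT 1
VIEWPOINT 0 0 0 1 0 0 0
POINTS 2
DATA ascii
1.0 2.0 0.5
-1.5 0.25 0.75
"""
fields, points = read_ascii_pcd(sample.splitlines())
```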

5.1.3 Additional data

In addition to the CSV files containing information about the recorded data from the eye trackers and the motion capture system, we provide the scene recordings from most of the Tobii eye tracking devices as .mp4 videos. The videos of the scene recordings were carefully post-processed: we blurred all faces of the participants using dedicated video-redaction software (“Caseguard”) to ensure data protection. The raw video from the scene camera of the Pupil Invisible Glasses has distortions that need to be corrected. For this purpose, we provide JSON files with the necessary intrinsic camera parameters. All data from the Pupil Invisible eye tracking devices and the remaining data from the Tobii devices are available upon request.

5.2 Development Tools

The majority of existing datasets in the field lack a dedicated toolbox for streamlined visualization and preprocessing. Addressing this gap, we contribute a set of data visualization tools, including a dashboard, and introduce a specialized Python package named thor-magni-tools. This package is designed to facilitate the filtering and preprocessing of raw trajectory data, enhancing the accessibility and usability of the THÖR-MAGNI dataset. By making these resources available, we aim to provide researchers with versatile and fast means to navigate, analyze, and extract valuable insights from the dataset.

5.2.1 Data Visualization

In order to provide researchers and users with an intuitive interface for the exploration of human movement, gaze patterns, and environmental perception of the THÖR-MAGNI dataset, we made a set of visualization tools publicly available (https://github.com/tmralmeida/magni-dash/tree/dash-public). Our visualization dashboard provides a user-friendly interface with multiple interactive components. The dashboard includes the following key features:

1. Trajectory visualization: users can visualize agents’ trajectories in 2D or 3D space. To represent different agents, the trajectories are color-coded, which allows the user to identify patterns and variations.

2. Velocity profiles: the dashboard also displays velocity profiles corresponding to each trajectory, allowing users to analyze speed variations during different phases of movement. This feature helps to understand the dynamics of human movement under different conditions.

3. Eye tracking data alignment: gaze data is overlaid on the 3D trajectories. This provides insight into visual attention during different phases of motion. Researchers can explore how gaze patterns align with specific trajectory segments, promoting the study of the cognitive processes underlying human actions.

4. Lidar data visualization: lidar sensor data is presented in 3D format to show the environmental context of human motion. This information is critical for studying lidar-based human detectors onboard mobile robots, especially in complex environments like in THÖR-MAGNI.

In addition to data visualization, our dashboard contains concise scenario descriptions. Each scenario represents a unique context in which human motion data was captured (described in Section 4.3). These descriptions include information such as the physical environment, task objectives, social interactions, and any specific conditions imposed on the participants (e.g., transporting objects between two goal points). Understanding these scenarios is vital for the accurate interpretation of the data and ensures that researchers can contextualize their analyses effectively.

5.2.2 Data Filtering and Preprocessing with thor-magni-tools

To facilitate the use of the agents’ trajectories in our dataset, we employed the thor-magni-tools Python package (https://github.com/tmralmeida/thor-magni-tools), a tool designed specifically for filtering, preprocessing, and visualizing trajectory data. This tool focuses on mitigating tracking issues arising from the motion capture system, enhancing the quality of the data for downstream tasks such as studying novel trajectory prediction methods. To filter 3D trajectory data, we provide two methods: (1) using the most reliable marker, i.e., the marker of each helmet with the highest number of tracking locations, and (2) restoring the helmet tracking based on the average of the tracking locations of each marker. The two approaches trade off tracking quantity against quality. The method utilizing the best marker exclusively produces smoother trajectories due to its reliance on a single marker. Conversely, the method averaging the positions of all visible markers generates longer trajectories but with increased jerkiness, as the set of markers contributing to the average can vary from frame to frame. However, this jerkiness can be alleviated through the application of a moving average filter in subsequent processing stages. Figure 12 shows an example of the two methods applied on THÖR-MAGNI trajectory data.
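The difference between the two filtering methods can be sketched as follows. This is an illustrative re-implementation under simple assumptions (per-frame marker positions as tuples, `None` for untracked frames); the function names are hypothetical and do not reflect the thor-magni-tools API.

```python
def best_marker_track(markers):
    """Return the track of the marker with the most tracked frames.

    markers: dict mapping marker name -> list of (x, y, z) tuples or None.
    """
    best = max(markers, key=lambda name: sum(p is not None for p in markers[name]))
    return markers[best]

def averaged_track(markers):
    """Per frame, average all visible markers; None if no marker is visible."""
    track = []
    for frame in zip(*markers.values()):
        visible = [p for p in frame if p is not None]
        if visible:
            n = len(visible)
            track.append(tuple(sum(axis) / n for axis in zip(*visible)))
        else:
            track.append(None)
    return track

# Hypothetical helmet with three markers over four frames.
markers = {
    "m1": [(0.0, 0.0, 1.0), None, None, (3.0, 0.0, 1.0)],
    "m2": [(0.0, 2.0, 1.0), (1.0, 2.0, 1.0), None, None],
    "m3": [(0.0, 1.0, 1.0), (1.0, 1.0, 1.0), (2.0, 1.0, 1.0), None],
}
best = best_marker_track(markers)
avg = averaged_track(markers)

n_best = sum(p is not None for p in best)
n_avg = sum(p is not None for p in avg)
```

In this toy example the best marker covers three frames, while the averaged track covers all four. The averaged track also shifts sideways whenever the set of visible markers changes, which is the jerkiness that a subsequent moving average filter can reduce.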

For both 3D and 6D tracks (X, Y, Z, and 3D orientation), we provide an interpolation method based on a predefined maximum number of positions in the absence of tracking. This method is used to fill in the missing data points while maintaining the integrity of the motion patterns and ensuring continuity in the trajectories. An example of the interpolation of a trajectory based on thor-magni-tools is depicted in Figure 13. Finally, this tool also offers optional preprocessing steps, including downsampling and signal smoothing through a moving average filter, further refining the processed trajectories.
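A minimal sketch of these preprocessing steps is given below for a single coordinate axis. It assumes linear interpolation, a centred moving average, and plain index-based downsampling; these choices and all function names are illustrative assumptions rather than the thor-magni-tools implementation.

```python
def interpolate_gaps(values, max_gap):
    """Linearly fill runs of None that are at most max_gap long and bounded on both sides."""
    out = list(values)
    i = 0
    while i < len(out):
        if out[i] is not None:
            i += 1
            continue
        start = i
        while i < len(out) and out[i] is None:
            i += 1
        gap = i - start
        if start > 0 and i < len(out) and gap <= max_gap:
            a, b = out[start - 1], out[i]
            for k in range(gap):
                out[start + k] = a + (b - a) * (k + 1) / (gap + 1)
    return out

def moving_average(values, window):
    """Centred moving average over tracked samples; None entries stay None."""
    half = window // 2
    out = []
    for i, v in enumerate(values):
        if v is None:
            out.append(None)
            continue
        near = [x for x in values[max(0, i - half): i + half + 1] if x is not None]
        out.append(sum(near) / len(near))
    return out

def downsample(values, step):
    """Keep every step-th sample."""
    return values[::step]

# Hypothetical x coordinates with a short gap (filled) and a long gap (kept).
xs = [0.0, None, None, 3.0, 4.0, None, None, None, None, 9.0]
filled = interpolate_gaps(xs, max_gap=2)
smoothed = moving_average(filled, window=3)
reduced = downsample(smoothed, step=2)
```

Gaps longer than the chosen maximum remain untracked, so that long occlusions are not replaced by straight-line motion.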

6 Analysis and Comparison to Existing Human Motion Datasets

This section compares our THÖR-MAGNI dataset with popular human trajectory datasets, specifically the ETH/UCY benchmark and THÖR. Our analysis encompasses a multidimensional evaluation, covering various facets of the data recordings. These include trajectory continuity, social proxemics delineating interpersonal interactions, as well as motion characteristics such as velocity profiles and trajectory linearity. Through this comparison, we aim to situate THÖR-MAGNI among its predecessors, showing its potential for advancing human motion analysis and human-robot interaction research.

6.1 Metrics for Trajectory Data Comparison

To evaluate the trajectory data of our dataset in comparison to previous data collections, we employ metrics proposed by Rudenko et al. (2020a) and Amirian et al. (2021):

• Tracking Duration (s): this metric represents the average duration of continuous tracking for all human agents. A higher value indicates longer tracking, which is favorable for long-term human motion prediction methods.
• Minimal Distance Between People (m): this metric measures the minimum distance observed between individuals in the dataset. It provides insights into the proximity of human agents during their interactions, offering valuable data for studies related to personal space (proxemics) and social dynamics.
• Number of 8-second Tracklets: this metric counts the non-overlapping tracklets of 8-second duration after downsampling to $0.4\,\mathrm{s}$ and applying a moving average filter. These choices align with current trajectory prediction benchmarks such as those outlined in Kothari et al. (2022). These tracklets offer discrete temporal segments for analysis, ensuring compatibility with existing evaluation standards in the field of trajectory prediction.
• Motion Speed (m/s): motion speed represents the velocity of all human agents. A higher standard deviation in motion speed indicates a diverse range of behaviors within the dataset. This diversity is essential for capturing various movement patterns and is crucial for robustness in trajectory prediction models. This metric is computed in the 8-second tracklets.
• Path Efficiency: path efficiency quantifies the linearity of trajectories in the dataset, ranging between 0 and 1 (Amirian et al., 2021). It is calculated by dividing the distance between the first and last points by the cumulative distance traveled. A lower coefficient suggests more complex and non-linear trajectories, providing valuable insights into intricate human movement patterns. This metric is computed in the 8-second tracklets.
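The tracklet-based metrics above can be sketched in a few lines. This is an illustrative sketch, not the code used to produce the reported figures. All function names are hypothetical, tracks are assumed to be (T, 2) arrays of planar positions in meters already downsampled to 0.4 s, and an 8-second tracklet is assumed to contain 20 samples, as is common in trajectory prediction benchmarks.

```python
import numpy as np

STEP_S = 0.4        # sampling period after downsampling (from the text)
TRACKLET_LEN = 20   # assumption: one "8-second" tracklet = 20 samples at 0.4 s


def split_tracklets(xy, length=TRACKLET_LEN):
    """Cut one (T, 2) track into non-overlapping tracklets; drop the remainder."""
    xy = np.asarray(xy, dtype=float)
    n_tracklets = len(xy) // length
    return [xy[k * length:(k + 1) * length] for k in range(n_tracklets)]


def motion_speeds(tracklet, step_s=STEP_S):
    """Per-step speeds (m/s) from consecutive position differences."""
    step_lengths = np.linalg.norm(np.diff(tracklet, axis=0), axis=1)
    return step_lengths / step_s


def path_efficiency(tracklet):
    """Start-to-end distance divided by the cumulative distance traveled."""
    tracklet = np.asarray(tracklet, dtype=float)
    traveled = np.linalg.norm(np.diff(tracklet, axis=0), axis=1).sum()
    if traveled == 0.0:
        return float("nan")  # undefined for an agent that never moves
    return float(np.linalg.norm(tracklet[-1] - tracklet[0]) / traveled)


def minimal_distance(positions):
    """Smallest pairwise distance among N >= 2 agents in a single frame."""
    positions = np.asarray(positions, dtype=float)
    diffs = positions[:, None, :] - positions[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    upper = np.triu_indices(len(positions), k=1)  # each pair counted once
    return float(dists[upper].min())


# A straight walk at 1 m/s, 45 samples long, yields two full tracklets.
straight = np.column_stack([np.arange(45) * STEP_S, np.zeros(45)])
tracklets = split_tracklets(straight)
# An L-shaped path: 3 m along x, then 4 m along y.
l_shape = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 4.0]])
```

A straight tracklet has a path efficiency of 1, while the L-shaped path covers 7 m to end up 5 m from its start, giving 5/7. Dataset-level statistics would then aggregate these per-tracklet values.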

6.2 Trajectory Data Comparison

We compare our dataset with the THÖR dataset and the ETH/UCY trajectory prediction benchmark. The THÖR dataset encompasses three distinct scenarios in which participants perform different tasks, such as individual and group movement and box transportation, with varying numbers of obstacles and a mobile robot in the environment. In THÖR Scenario 1 (THÖR-S1), participants navigate the environment with one static obstacle. THÖR Scenario 2 (THÖR-S2) introduces a mobile robot navigating around the static obstacle, while participants continue their tasks. Finally, in THÖR Scenario 3 (THÖR-S3), the mobile robot becomes a static obstacle and an additional obstacle is added to the scene. The ETH/UCY trajectory prediction benchmark consists of five scenes: ETH, HOTEL, UNIV, ZARA1, and ZARA2. These scenes represent five outdoor public spaces that capture natural human motion patterns, resulting in a benchmark widely used by the human trajectory prediction community (Salzmann et al., 2020; Dendorfer et al., 2021; Yue et al., 2022; de Almeida and Mozos, 2023).

Firstly, we show the tracking durations in Figure 14. THÖR presents consistent average tracking durations of around 15.5 to 17.6 seconds across its three scenarios. In contrast, THÖR-MAGNI shows wider variation: for instance, Scenario 4 features longer tracking durations (averaging 41.3 seconds), whereas Scenario 2 has the shortest durations (averaging 17.1 seconds). This variability can be attributed to the density of participants: Scenarios 4 and 5 involved fewer human agents in a smaller space, which may contribute to higher-quality tracking. Nevertheless, THÖR-MAGNI has comparable or longer tracking durations than THÖR. Furthermore, compared to the ETH/UCY benchmark (i.e., the ETH, HOTEL, UNIV, ZARA1, and ZARA2 scenes), THÖR-MAGNI offers comparable or significantly longer tracking durations. This makes our dataset more useful than its predecessors for tasks such as long-term human motion prediction and the study of human-robot interactions.

Secondly, we compare the minimal distance between people in Figure 15. Again, human density plays an important role: THÖR-MAGNI Scenarios 1–3 show low values comparable to those in ZARA1/ZARA2, while Scenarios 4 and 5 reach values similar to THÖR, ETH, and HOTEL. The higher participant density in THÖR-MAGNI Scenarios 1–3 reduces the space available for navigation, leading to more interactions and smaller social distances between individuals.

Thirdly, the motion speed statistics are shown in Figure 16. Despite the higher participant density in Scenarios 1–3 of THÖR-MAGNI, human agents in these scenarios move faster than those in THÖR and at speeds akin to those in the ETH, HOTEL, and ZARA1 scenes, possibly because the object transportation tasks affect their velocity profiles. Participants in Scenarios 4 and 5 of THÖR-MAGNI have an average velocity similar to those in THÖR, UNIV, and ZARA2. In general, THÖR-MAGNI shows comparable standard deviations in motion speed, indicating diverse and varied movement patterns among human agents. The similarity of the velocity profiles to previous datasets suggests that our dataset is also natural and diverse.

Finally, we compare path efficiency and the number of tracklets in Figure 17. In terms of trajectory linearity, Scenarios 1–3 are aligned with the THÖR and HOTEL datasets, while the other datasets from the ETH/UCY benchmark contain more linear and less complex trajectories. It is also worth noting that THÖR-MAGNI Scenarios 4 and 5 display the lowest average path efficiency ($0.78$ and $0.75$, respectively). The presence of a moving robot might influence these scenarios, prompting human agents to navigate cautiously and align their motion with the robot’s motion profile. Furthermore, THÖR-MAGNI presents a much higher number of non-overlapping tracklets than the other datasets.

In summary, these distinctive features make our dataset uniquely challenging, diverse, and valuable as a benchmark for evaluating human trajectory prediction methods. The heightened complexity and diverse range of trajectories in THÖR-MAGNI can provide a robust platform for evaluating the effectiveness of trajectory prediction methods, thereby increasing the breadth and depth of research in this area.

7 Conclusions

In this paper, we present THÖR-MAGNI, a comprehensive dataset of human and robot navigation and interaction, extending THÖR (Rudenko et al., 2020a) with 3.5 times more motion data, novel interactive scenarios, and rich contextual annotations. Both datasets are accessible online at http://thor.oru.se/. To further support researchers, THÖR-MAGNI comes with a dedicated set of user-friendly tools — a dashboard and a specialized Python package called thor-magni-tools — specifically designed to streamline the visualization, filtering, and preprocessing of raw data. These resources aim to improve the accessibility and usability of the THÖR-MAGNI dataset.

THÖR-MAGNI was created to fill a gap in datasets for human motion analysis that was limiting HRI research: a lack of comprehensive inclusion of exogenous factors and essential target agent cues, which hinders holistic studies of human motion dynamics. Unlike existing datasets, THÖR-MAGNI includes a broader set of contextual features and offers multiple scenario variations to facilitate factor isolation. Our dataset integrates different modalities, such as walking trajectories, eye tracking data, and environmental sensory inputs captured by a mobile robot.

THÖR-MAGNI provides a comprehensive representation of diverse navigation styles of mobile robots and humans in shared environments, using multi-modal data. Our dataset contributes to the evolving landscape of human motion research through a comparative analysis with state-of-the-art datasets. Furthermore, we discuss the features of our dataset in the context of human motion and robot interaction, highlighting their importance in addressing gaps in the existing literature. The THÖR-MAGNI dataset has already been used in research papers, demonstrating its usefulness for training role-conditioned motion prediction models (de Almeida et al., 2023) and investigating visual attention during human-robot interaction (Schreiter et al., 2023).

In the future, we intend to propose a benchmark for multi-modal indoor trajectory prediction methods that builds on the rich contextual cues in THÖR-MAGNI. This effort is geared towards advancing the field by enabling the creation of more accurate models of human motion.

Acknowledgements

The authors would like to thank Johannes A. Stork for valuable feedback and suggestions. The authors also thank all the AASS colleagues who helped to prepare and test the experimental infrastructure.

Funding

This work was supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation and by the European Union’s Horizon 2020 research and innovation program under grant agreement No. 101017274 (DARKO).

References

Admoni and Scassellati (2017) Admoni H and Scassellati B (2017) Social eye gaze in human-robot interaction: a review. Journal of Human-Robot Interaction 6(1): 25–63.
Amirian et al. (2021) Amirian J, Zhang B, Castro FV, Baldelomar JJ, Hayet JB and Pettré J (2021) Opentraj: Assessing prediction complexity in human trajectories datasets. In: Ishikawa H, Liu CL, Pajdla T and Shi J (eds.) Computer Vision – ACCV 2020. Cham: Springer International Publishing. ISBN 978-3-030-69544-6, pp. 566–582.
Bartneck et al. (2009) Bartneck C, Kulić D, Croft E and Zoghbi S (2009) Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. International Journal of Social Robotics 1(1): 71–81.
Benfold and Reid (2011) Benfold B and Reid I (2011) Stable multi-target tracking in real-time surveillance video. In: CVPR 2011. IEEE, pp. 3457–3464.
Bock et al. (2020) Bock J, Krajewski R, Moers T, Runde S, Vater L and Eckstein L (2020) The ind dataset: A drone dataset of naturalistic road user trajectories at german intersections. In: 2020 IEEE Intelligent Vehicles Symposium (IV). IEEE, pp. 1929–1934.
Boos et al. (2014) Boos M, Pritz J, Lange S and Belz M (2014) Leadership in moving human groups. PLoS computational biology 10(4): e1003541.
Brščić et al. (2013) Brščić D, Kanda T, Ikeda T and Miyashita T (2013) Person tracking in large public spaces using 3-D range sensors. IEEE Transactions on Human-Machine Systems 43(6): 522–534.
Cancelli et al. (2023) Cancelli E, Campari T, Serafini L, Chang AX and Ballan L (2023) Exploiting Proximity-Aware Tasks for Embodied Social Navigation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10957–10967.
Castri et al. (2022) Castri L, Mghames S, Hanheide M and Bellotto N (2022) Causal discovery of dynamic models for predicting human spatial interactions. In: International Conference on Social Robotics. Cham: Springer Nature, pp. 154–164.
Chandra et al. (2019) Chandra R, Bhattacharya U, Bera A and Manocha D (2019) Traphic: Trajectory Prediction in Dense and Heterogeneous Traffic Using Weighted Interactions. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8475–8484.
Charalambous et al. (2016) Charalambous G, Fletcher S and Webb P (2016) The development of a scale to evaluate trust in industrial human-robot collaboration. International Journal of Social Robotics 8: 193–209.
Chen et al. (2022) Chen Y, Luo Y, Yang C, Yerebakan MO, Hao S, Grimaldi N, Li S, Hayes R and Hu B (2022) Human mobile robot interaction in the retail environment. Scientific Data 9(1): 673.
Chiara et al. (2022) Chiara LF, Coscia P, Das S, Calderara S, Cucchiara R and Ballan L (2022) Goal-driven self-attentive recurrent networks for trajectory prediction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2518–2527.
Dahiya et al. (2023) Dahiya A, Aroyo AM, Dautenhahn K and Smith SL (2023) A survey of multi-agent Human–Robot Interaction systems. Robotics and Autonomous Systems 161: 104335.
de Almeida and Mozos (2023) de Almeida TR and Mozos OM (2023) Likely, light, and accurate context-free clusters-based trajectory prediction. In: 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC). pp. 1269–1276. 10.1109/ITSC57777.2023.10422479.
de Almeida et al. (2023) de Almeida TR, Rudenko A, Schreiter T, Zhu Y, Maestro EG, Morillo-Mendez L, Kucner TP, Mozos OM, Magnusson M, Palmieri L, Arras KO and Lilienthal AJ (2023) THÖR-Magni: Comparative Analysis of Deep Learning Models for Role-Conditioned Human Motion Prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops. pp. 2200–2209.
Dendorfer et al. (2021) Dendorfer P, Ošep A and Leal-Taixé L (2021) Goal-GAN: Multimodal Trajectory Prediction Based on Goal Position Estimation. In: Ishikawa H, Liu CL, Pajdla T and Shi J (eds.) Computer Vision – ACCV 2020. Cham: Springer International Publishing, pp. 405–420.
Dondrup et al. (2015) Dondrup C, Bellotto N, Jovan F, Hanheide M et al. (2015) Real-time multisensor people tracking for human-robot spatial interaction. ICRA’15 Workshop on Machine Learning for Social Robotics .
Duchowski (2017) Duchowski TA (2017) Eye tracking: methodology theory and practice. London: Springer.
Ehsanpour et al. (2022) Ehsanpour M, Saleh FS, Savarese S, Reid ID and Rezatofighi H (2022) JRDB-act: A large-scale dataset for spatio-temporal action, social group and activity detection. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) : 20951–20960.
Faroni et al. (2022) Faroni M, Beschi M and Pedrocchi N (2022) Safety-aware time-optimal motion planning with uncertain human state estimation. IEEE Robotics and Automation Letters 7(4): 12219–12226.
Finean et al. (2023) Finean MN, Petrović L, Merkt W, Marković I and Havoutis I (2023) Motion planning in dynamic environments using context-aware human trajectory prediction. Robotics and Autonomous Systems 166: 104450.
Garland et al. (2018) Garland J, Berdahl AM, Sun J and Bollt EM (2018) Anatomy of leadership in collective behaviour. Chaos: An Interdisciplinary Journal of Nonlinear Science 28(7).
Haddadin et al. (2011) Haddadin S, Suppa M, Fuchs S, Bodenmüller T, Albu-Schäffer A and Hirzinger G (2011) Towards the robotic co-worker. In: Robotics Research: The 14th International Symposium ISRR. Berlin Heidelberg: Springer, pp. 261–282.
Hart (2006) Hart SG (2006) NASA-task load index (NASA-TLX); 20 years later. In: Proceedings of the human factors and ergonomics society annual meeting, volume 50.9. Sage publications Sage CA: Los Angeles, CA, pp. 904–908.
Hart and Staveland (1988) Hart SG and Staveland LE (1988) Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In: Advances in psychology, volume 52. Elsevier, pp. 139–183.
Heuer et al. (2023) Heuer L, Palmieri L, Rudenko A, Mannucci A, Magnusson M and Arras KO (2023) Proactive model predictive control with multi-modal human motion prediction in cluttered dynamic environments. In: International Conference on Intelligent Robots and Systems (IROS). IEEE, pp. 229–236.
Holman et al. (2021) Holman B, Anwar A, Singh A, Tec M, Hart J and Stone P (2021) Watch where you’re going! gaze and head orientation as predictors for social robot navigation. In: 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 3553–3559.
Karnan et al. (2022) Karnan H, Nair A, Xiao X, Warnell G, Pirk S, Toshev A, Hart J, Biswas J and Stone P (2022) Socially compliant navigation dataset (scand): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters 7(4): 11807–11814.
Kothari et al. (2022) Kothari P, Kreiss S and Alahi A (2022) Human Trajectory Forecasting in Crowds: A Deep Learning Perspective. IEEE Transactions on Intelligent Transportation Systems 23(7).
Kratzer et al. (2020) Kratzer P, Bihlmaier S, Balachandra Midlagajni N, Prakash R, Toussaint M and Mainprice J (2020) Mogaze: A dataset of full-body motions that includes workspace geometry and eye-gaze. IEEE Robotics and Automation Letters (RAL) .
Kucner et al. (2020) Kucner TP, Lilienthal AJ, Magnusson M, Palmieri L and Swaminathan CS (2020) Probabilistic mapping of spatial motion patterns for mobile robots. Cham: Springer Nature.
Kucner et al. (2023) Kucner TP, Magnusson M, Mghames S, Palmieri L, Verdoja F, Swaminathan CS, Krajník T, Schaffernicht E, Bellotto N, Hanheide M et al. (2023) Survey of maps of dynamics for mobile robots. The International Journal of Robotics Research 42(11): 977–1006.
Leng et al. (2022) Leng J, Sha W, Wang B, Zheng P, Zhuang C, Liu Q, Wuest T, Mourtzis D and Wang L (2022) Industry 5.0: Prospect and retrospect. Journal of Manufacturing Systems 65: 279–295.
Lerner et al. (2007) Lerner A, Chrysanthou Y and Lischinski D (2007) Crowds by example. Computer Graphics Forum 26(3): 655–664.
Li et al. (2022) Li M, Zhong B, Lobaton E and Huang H (2022) Fusion of Human Gaze and Machine Vision for Predicting Intended Locomotion Mode. IEEE Transactions on Neural Systems and Rehabilitation Engineering 30: 1103–1112.
Liu et al. (2019) Liu J, Shahroudy A, Perez M, Wang G, Duan LY and Kot AC (2019) NTU RGB+D 120: A large-scale benchmark for 3d human activity understanding. IEEE transactions on pattern analysis and machine intelligence 42(10): 2684–2701.
Mahanama et al. (2022) Mahanama B, Jayawardana Y, Rengarajan S, Jayawardena G, Chukoskie L, Snider J and Jayarathna S (2022) Eye movement and pupil measures: A review. Frontiers in Computer Science 3: 733531.
Majecka (2009) Majecka B (2009) Statistical models of pedestrian behaviour in the forum. Master’s thesis, School of Informatics, University of Edinburgh .
Makansi et al. (2022) Makansi O, von Kügelgen J, Locatello F, Gehler P, Janzing D, Brox T and Schölkopf B (2022) You mostly walk alone: Analyzing feature attribution in trajectory prediction. In: 10th International Conference on Learning Representations (ICLR).
Möller et al. (2021) Möller R, Furnari A, Battiato S, Härmä A and Farinella GM (2021) A survey on human-aware robot navigation. Robotics and Autonomous Systems 145: 103837.
Moussaïd et al. (2010) Moussaïd M, Perozo N, Garnier S, Helbing D and Theraulaz G (2010) The walking behaviour of pedestrian social groups and its impact on crowd dynamics. PloS one 5(4): e10047.
Munaro and Menegatti (2014) Munaro M and Menegatti E (2014) Fast RGB-D people tracking for service robots. Autonomous Robots 37: 227–242.
Oh et al. (2011) Oh S, Hoogs A, Perera A, Cuntoor N, Chen CC, Lee JT, Mukherjee S, Aggarwal J, Lee H, Davis L et al. (2011) A large-scale benchmark dataset for event recognition in surveillance video. In: CVPR 2011. IEEE, pp. 3153–3160.
Pascher et al. (2023) Pascher M, Gruenefeld U, Schneegass S and Gerken J (2023) How to Communicate Robot Motion Intent: A Scoping Review. In: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. pp. 1–17.
Pellegrini et al. (2009) Pellegrini S, Ess A, Schindler K and van Gool L (2009) You’ll never walk alone: Modeling social behavior for multi-target tracking. In: 2009 IEEE 12th International Conference on Computer Vision. pp. 261–268.
Pupil Labs AB (Accessed: 2024-02-02a) Pupil Labs AB (Accessed: 2024-02-02a) Pupil Player Documentation. https://docs.pupil-labs.com/core/software/pupil-player/.
Pupil Labs AB (Accessed: 2024-02-02b) Pupil Labs AB (Accessed: 2024-02-02b) Working with pupil Core, Best Practices. https://bit.ly/42JeBVm.
Qualisys AB (Accessed: 2024-02-02) Qualisys AB (Accessed: 2024-02-02) Qualisys Track Manager User Manual v2022.1. https://cdn-content.qualisys.com/2022/07/QTM-user-manual.pdf.
Quigley et al. (2009) Quigley M, Conley K, Gerkey B, Faust J, Foote T, Leibs J, Wheeler R, Ng AY et al. (2009) ROS: an open-source Robot Operating System. In: ICRA workshop on open source software, volume 3.2. p. 5.
Robicquet et al. (2016) Robicquet A, Sadeghian A, Alahi A and Savarese S (2016) Learning social etiquette: Human trajectory understanding in crowded scenes. In: European Conference on Computer Vision. pp. 549–565.
Rudenko et al. (2020a) Rudenko A, Kucner TP, Swaminathan CS, Chadalavada RT, Arras KO and Lilienthal AJ (2020a) THÖR: Human-robot navigation data collection and accurate motion trajectories dataset. IEEE Robotics and Automation Letters 5(2): 676–682.
Rudenko et al. (2020b) Rudenko A, Palmieri L, Herman M, Kitani KM, Gavrila DM and Arras KO (2020b) Human motion trajectory prediction: A survey. The International Journal of Robotics Research 39(8): 895–935.
Rudenko et al. (2018) Rudenko A, Palmieri L, Lilienthal AJ and Arras KO (2018) Human motion prediction under social grouping constraints. In: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, pp. 3358–3364.
Salzmann et al. (2023) Salzmann T, Chiang HTL, Ryll M, Sadigh D, Parada C and Bewley A (2023) Robots That Can See: Leveraging Human Pose for Trajectory Prediction. IEEE Robotics and Automation Letters .
Salzmann et al. (2020) Salzmann T, Ivanovic B, Chakravarty P and Pavone M (2020) Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In: Computer Vision – ECCV 2020. Springer International Publishing. ISBN 978-3-030-58523-5, pp. 683–700.
Schreiter et al. (2023) Schreiter T, Morillo-Mendez L, Chadalavada RT, Rudenko A, Billing E, Magnusson M, Arras KO and Lilienthal AJ (2023) Advantages of Multimodal versus Verbal-Only Robot-to-Human Communication with an Anthropomorphic Robotic Mock Driver. In: 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN).
Shah et al. (2023) Shah D, Sridhar A, Bhorkar A, Hirose N and Levine S (2023) Gnm: A general navigation model to drive any robot. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, pp. 7226–7233.
Shu et al. (2015) Shu T, Xie D, Rothrock B, Todorovic S and Zhu SC (2015) Joint inference of groups, events and human roles in aerial videos. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4576–4584.
Stiefelhagen and Zhu (2002) Stiefelhagen R and Zhu J (2002) Head orientation and gaze direction in meetings. In: CHI ’02 Extended Abstracts on Human Factors in Computing Systems. Association for Computing Machinery, pp. 858–859.
Tian et al. (2017) Tian Y, Zhang S, Liu J, Chen F, Li L and Xia B (2017) Research on a new omnidirectional mobile platform with heavy loading and flexible motion. Advances in Mechanical Engineering 9(9): 1687814017726683.
Tobii AB (Accessed: 2024-02-02a) Tobii AB (Accessed: 2024-02-02a) Tobii 2 Glasses User Manual v2.0.2. https://go.tobii.com/Glasses2UM.
Tobii AB (Accessed: 2024-02-02b) Tobii AB (Accessed: 2024-02-02b) Tobii 3 Glasses User Manual v1.18. https://go.tobii.com/tobii-pro-glasses-3-user-manual.
Tobii AB (Accessed: 2024-02-02c) Tobii AB (Accessed: 2024-02-02c) Tobii Pro Lab User Manual v1.217. https://go.tobii.com/tobii_pro_lab_user_manual.
Tomasello (2014) Tomasello M (2014) Joint attention as social cognition. In: Joint attention. Psychology Press, pp. 103–130.
Wang et al. (2022) Wang A, Mavrogiannis C and Steinfeld A (2022) Group-based motion prediction for navigation in crowded environments. In: Conference on Robot Learning. PMLR, pp. 871–882.
Yan et al. (2017) Yan Z, Duckett T and Bellotto N (2017) Online learning for human classification in 3D LiDAR-based tracking. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, pp. 864–871.
Yan et al. (2020) Yan Z, Schreiberhuber S, Halmetschlager G, Duckett T, Vincze M and Bellotto N (2020) Robot perception of static and dynamic objects with an autonomous floor scrubber. Intelligent Service Robotics 13(3): 403–417.
Yue et al. (2022) Yue J, Manocha D and Wang H (2022) Human trajectory prediction via neural social physics. In: Avidan S, Brostow G, Cissé M, Farinella GM and Hassner T (eds.) Computer Vision – ECCV 2022. Cham: Springer Nature Switzerland, pp. 376–394.
Zhao and Wildes (2021) Zhao H and Wildes RP (2021) Where Are You Heading? Dynamic Trajectory Prediction With Expert Goal Examples. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 7629–7638.
Zheng et al. (2022) Zheng Y, Yang Y, Mo K, Li J, Yu T, Liu Y, Liu CK and Guibas LJ (2022) GIMO: Gaze-Informed Human Motion Prediction in Context. In: Avidan S, Brostow G, Cissé M, Farinella GM and Hassner T (eds.) Computer Vision – ECCV 2022. Cham: Springer Nature Switzerland, pp. 676–694.
Zhou et al. (2012) Zhou B, Wang X and Tang X (2012) Understanding collective crowd behaviors: Learning a Mixture model of Dynamic pedestrian-Agents. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 2871–2878.
Zhu et al. (2023) Zhu Y, Rudenko A, Kucner TP, Palmieri L, Arras KO, Lilienthal AJ and Magnusson M (2023) CLiFF-LHMP: Using Spatial Dynamics Patterns for Long-Term Human Motion Prediction. In: 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 3795–3802.
Ziebart et al. (2009) Ziebart BD, Ratliff N, Gallagher G, Mertz C, Peterson K, Bagnell JA, Hebert M, Dey AK and Srinivasa S (2009) Planning-based prediction for pedestrians. In: 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, pp. 3931–3936.

File ID | Visitors (Helmet ID 1-10) | Eyetrackers (Helmet ID 1-10)
120522_SC1A_1 | 1,7,10 + 2,6 + 5 + 4 + 3 + 8 | 4 + 6 + 10
120522_SC1A_2 | 8,4,3 + 10,5 + 1 + 2 + 6 + 7 | 4 + 6 + 10
130522_SC1A_1 | 1,5,10 + 4,8 + 6 + 3 | 4 + 6 + 10
130522_SC1A_2 | 5,6,8 + 3,10 + 1 + 4 | 4 + 6 + 10
170522_SC1A_1 | 5,8 + 2,6 + 1 + 4 + 10 | 5 + 6
170522_SC1A_2 | 1,4 + 5,10 + 2 + 6 + 8 | 5 + 6
180522_SC1A_1 | 5,6 + 2,10 + 1 + 4 + 7 | 4 + 5 + 10
180522_SC1A_2 | 1,4 + 6,10 + 2 + 5 + 7 | 4 + 5 + 10
120522_SC1B_1 | 2,5,10 + 4,7 + 1 + 3 + 6 + 8 | 4 + 6 + 10
120522_SC1B_2 | 3,6,7 + 1,8 + 2 + 4 + 5 + 10 | 4 + 6 + 10
130522_SC1B_1 | 1,5,10 + 4,8 + 3 + 6 | 4 + 6 + 10
130522_SC1B_2 | 5,6,8 + 3,10 + 1 + 4 | 4 + 6 + 10
170522_SC1B_1 | 1,6 + 2,5 + 4 + 8 + 10 | 5 + 6
170522_SC1B_2 | 1,5 + 6,8 + 2 + 4 + 10 | 5 + 6
180522_SC1B_1 | 2,6 + 7,10 + 1 + 4 + 5 | 4 + 5 + 10
180522_SC1B_2 | 1,6 + 4,5 + 2 + 7 + 10 | 4 + 5 + 10
300922_SC4A_1 | 3 + 8 + 9 + 10 | 9 + 10
300922_SC4A_2 | 3 + 8 + 9 + 10 | 9 + 10
300922_SC4A_3 | 3,10 + 1 + 6 + 8 | 8 + 10
300922_SC4A_4 | 3,10 + 1 + 6 + 8 | 8 + 10
300922_SC4B_3 | 1,6 + 3 + 8 + 10 | 8 + 10
300922_SC4B_4 | 1,6 + 3 + 8 + 10 | 8 + 10
300922_SC4B_1 | 3 + 8 + 9 + 10 | 9 + 10
300922_SC4B_2 | 3 + 8 + 9 + 10 | 9 + 10

Table 3: Assignment of Visitors and eyetrackers to helmet IDs in Scenarios 1 and 4. Each row relates the file ID to the assignment of Visitors roles (alone and groups), eyetrackers, and corresponding helmet IDs (1-10). The file ID contains “date of the recording” + “_” + “SC” + “Scenario number” + “condition” + “_” + “run number”.
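The file ID convention and the cell notation used in these tables lend themselves to simple parsing. The sketch below is illustrative and not part of thor-magni-tools. The names `RecordingId`, `parse_file_id`, and `parse_assignment` are hypothetical, the condition is assumed to be a single optional capital letter, and the reading of the cells (groups separated by "+", group members separated by commas) is inferred from the caption.

```python
import re
from typing import List, NamedTuple


class RecordingId(NamedTuple):
    date: str        # kept as written in the file ID (appears to be DDMMYY)
    scenario: int
    condition: str   # "" when the scenario has no condition letter
    run: int


# date + "_" + "SC" + scenario number + condition + "_" + run number
FILE_ID_RE = re.compile(r"^(\d{6})_SC(\d)([A-Z]?)_(\d+)$")


def parse_file_id(file_id: str) -> RecordingId:
    """Split a file ID such as '120522_SC1A_1' into its components."""
    match = FILE_ID_RE.match(file_id)
    if match is None:
        raise ValueError(f"unexpected file ID format: {file_id!r}")
    date, scenario, condition, run = match.groups()
    return RecordingId(date, int(scenario), condition, int(run))


def parse_assignment(cell: str) -> List[List[int]]:
    """Turn '1,7,10 + 2,6 + 5' into [[1, 7, 10], [2, 6], [5]].

    Each inner list is one group of helmet IDs; a single-element list is
    read as a participant walking alone (an assumption based on the caption).
    """
    return [[int(h) for h in group.split(",")] for group in cell.split("+")]


rec = parse_file_id("120522_SC1A_1")
groups = parse_assignment("1,7,10 + 2,6 + 5 + 4 + 3 + 8")
```

Keeping the date as a string avoids committing to a calendar interpretation that the file ID itself does not state.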

File ID | Visitors (ID 1-10) | Carrier (Box + Bucket) | Large Object (Leader + Follower) | Eyetracker (ID 1-10)
120522_SC2_1 | 2,4,5+3+8 | 7+6 | 10+1 | 4+6+10
120522_SC2_2 | 4,8+2+3+5 | 6+7 | 1+10 | 4+6+10
130522_SC2_1 | 1,6+3 | 8+10 | 5+4 | 4+6+10
130522_SC2_2 | 1+3+4 | 10+8 | 6+5 | 4+6+10
170522_SC2_1 | 2+5+8 | 4+10 | 6+1 | 4+5+6
170522_SC2_2 | 2,8+5 | 4+10 | 1+6 | 4+5+6
180522_SC2_1 | 1+5+6 | 10+2 | 7+4 | 4+5+10
180522_SC2_2 | 1,6+5 | 2+10 | 4+7 | 4+5+10
120522_SC3A_1 | 3,6,7+4+5 | 10+1 | 2+8 | 4+6+10
120522_SC3A_2 | 2,4,5+3+7 | 1+10 | 8+6 | 4+6+10
130522_SC3A_1 | 3,8+4 | 5+6 | 10+1 | 4+6+10
130522_SC3A_2 | 3,4+8 | 6+5 | 1+10 | 4+6+10
170522_SC3A_1 | 1,4+10 | 6+2 | 8+5 | 4+5+6
170522_SC3A_2 | 4,10+1 | 2+6 | 5+8 | 4+5+6
180522_SC3A_1 | 2+4+7 | 5+6 | 10+1 | 4+6+10
180522_SC3A_2 | 4,6+2 | 7+5 | 1+10 | 4+6+10
120522_SC3B_1 | 3,4,8+2+5 | 10+1 | 6+7 | 4+6+10
120522_SC3B_2 | 3,6,8+1+2 | 4+5 | 10+7 | 4+6+10
130522_SC3B_1 | 3,6+1 | 10+8 | 4+5 | 4+6+10
130522_SC3B_2 | 1,8+10 | 6+5 | 3+4 | 4+6+10
170522_SC3B_1 | 6,10+8 | 1+5 | 2+4 | 4+5+6
170522_SC3B_2 | 8,10+6 | 5+1 | 4+2 | 4+5+6
180522_SC3B_1 | 2,10+4 | 6+1 | 5+7 | 4+5+10
180522_SC3B_2 | 2,4,6 | 10+1 | 7+5 | 4+5+10

Table 4: Assignment of Visitors, Carriers, and eyetrackers to helmet IDs in Scenarios 2 and 3. Each row relates the file ID to the assignment of the roles in Scenarios 2 and 3: Visitors–Alone, Visitors–Groups 2, Visitors–Groups 3, Carrier–Box, Carrier–Bucket, and Carrier–Large Object. The file ID contains “date of the recording” + “_” + “SC” + “Scenario number” + “condition” + “_” + “run number”.

File ID

Visitors

ID 1-10

Carrier

Storage Bin

Eyetracker

ID 1-10

300922_SC5_1

3 + 8 + 9

9 + 10

300922_SC5_2

3 + 8 + 10

9 + 10

300922_SC5_3

1,8 + 3 + 10

8 + 10

300922_SC5_4

1 + 3 + 6 + 8

8 + 10

Table 5: Assignment of human roles and eyetrackers to helmet IDs in Scenario 5. Each row relates the file ID to the assignment of Visitors roles (alone and groups of 2), Carrier–Storage Bin HRI role, eyetrackers, and corresponding helmet IDs (1-10). The file ID contains “date of the recording” + “_” + “SC” + “Scenario number” + “condition” + “_” + “run number”.

Line of the header | Description
FILE_ID | Name of the file
N_FRAMES_QTM | Amount of frames recorded at $100\,\mathrm{Hz}$
N_BODIES | Amount of rigid bodies present
N_MARKERS | Total amount of markers present
CONTIGUOUS_ROTATION_MATRIX | Order of a 3x3 rotation matrix
MODALITIES_WITH_UNITS | List of measured variables
MODALITIES_UNITS_SPECIFIED | Measurement units specified
EYETRACKING_DEVICES | List of eye trackers in this recording (Not all devices are used in all files)
EYETRACKING_FREQUENCY_IR | Frequency of eyetrackers infra red sensor
EYETRACKING_FREQUECNY_SCENE_CAMER | Frequency of scene cameras video
EYETRACKING_DATA_INCLUDED | Data included from the eye tracker
EYETRACKING_DATA_N_FRAMES | Total amount of frames with this data
BODY_NAMES | Name of each rigid body
BODY_ROLES | Role label of each rigid body
BODY_NR_MARKERS | Amount of markers for each rigid body
MARKER_NAMES | The names of all markers used in this file (Unaligned with the previous rows)

Table 6: Overview of the metadata in the headers of the CSV files. It includes recording information such as the file ID, date, scenario, condition, and run. It also provides details on data quantities, including the number of frames recorded, rigid bodies present, and markers used. Moreover, it includes information about the order of rotation matrices, measured variables with units, specified measurement units, and a list of eye trackers utilized in each recording. Furthermore, it outlines the frequencies of eyetracker sensors and scene cameras, as well as the presence of eyetracking data. Finally, it covers details about rigid bodies including their names, roles, and the number of markers associated with them.