
ULTra-AV: A Unified Longitudinal Trajectory Dataset for Automated Vehicle

Hang Zhou (1), Ke Ma (1, *), Shixiao Liang (1), Xiaopeng Li (1, *), Xiaobo Qu (2, *)

(1) University of Wisconsin-Madison, Department of Civil and Environmental Engineering, Madison, WI, 53706, USA
(2) School of Vehicle and Mobility, Tsinghua University, Beijing, 100084, China
(*) Corresponding author(s): Xiaopeng Li (xli2485@wisc.edu), Xiaobo Qu (drxiaoboqu@gmail.com), Ke Ma (kma62@wisc.edu)

Abstract

Automated Vehicles (AVs) promise significant advances in transportation. Realizing these advances requires understanding AVs’ longitudinal behavior, which relies heavily on real-world trajectory data. Existing open-source AV trajectory datasets, however, often fall short in refinement, reliability, and completeness, hindering effective performance metrics analysis and model development. This study addresses these challenges by creating a Unified Longitudinal TRAjectory dataset for AVs (Ultra-AV) to analyze their microscopic longitudinal driving behaviors. The dataset compiles data from 13 distinct sources, encompassing various AV types, test sites, and experiment scenarios. We established a three-step data processing procedure: (1) extraction of longitudinal trajectory data, (2) general data cleaning, and (3) data-specific cleaning to obtain the longitudinal trajectory data and car-following trajectory data. The validity of the processed data is affirmed through performance evaluations across safety, mobility, stability, and sustainability, along with an analysis of the relationships between variables in car-following models. Our work not only furnishes researchers with standardized data and metrics for longitudinal AV behavior studies but also sets guidelines for data collection and model development.

Background & Summary

The advent of Automated Vehicles (AVs) marks a revolutionary change in transportation [1]. Various stakeholders, including transportation agencies, policymakers, urban planners, the automotive industry, and customers, are paying attention to the potential impact of AVs on traffic flow. Numerous studies have established a quantifiable link between macro-level traffic flow and micro-level longitudinal driving behavior [2, 3]. This connection underscores the importance of understanding microscopic longitudinal AV behaviors to fully grasp their broader impacts on traffic. Particular emphasis is placed on car-following behavior, the critical component of longitudinal driving behavior and arguably the most fundamental element of traffic flow [4, 5]. The key to studying the car-following behavior of AVs lies in the availability of real-world trajectory data, which contain a sequence of spatial positions, velocities, and ground truth accelerations over time and thus provide invaluable insights into AV behavior [6]. Access to such AV trajectory data is imperative for stakeholders to generate reliable insights informing policy, infrastructure development, management strategies, traffic solutions, and AV design.

Numerous studies indicate that car-following behavior substantially affects road traffic performance, including safety, mobility, stability, and sustainability [7, 8, 9, 10, 11]. AV trajectory data can offer direct insights into the impact on traffic in terms of these performance metrics. Safety is a priority for many stakeholders. In car-following behavior, the principal focus is assessing the likelihood and timing of a rear-end collision, given the vehicle’s position and speed relative to its preceding vehicle. Regarding mobility, AVs have the potential to alter the driving strategy and following distance, thereby affecting throughput and traffic flow efficiency. While adopting an aggressive driving strategy or maintaining shorter following distances might enhance efficiency, such approaches would compromise stability, leading to diminished comfort and increased safety hazards. AVs are also expected to reduce the overall fuel consumption of road traffic, thereby contributing to the environmental sustainability of future transportation. By directly analyzing AV trajectory data, stakeholders can assess these performance metrics and develop strategies that enhance the positive impacts of AVs while mitigating potential negative impacts.

Although AV trajectory data can provide intuitive insights into AV performance metrics, a significant limitation arises from the limited scenarios in which the data are collected. For example, trajectory data may be collected within a specific speed range or while the AV follows a vehicle adhering to a predetermined path [12]. Thus, performance metrics derived from such constrained scenarios might present a biased view, failing to capture corner cases. This drawback underscores another crucial role of AV trajectory data: the accurate calibration of robust models that mirror real-world conditions. These calibrated models enable the exploration of broader impacts on traffic through simulation, including examining AV driving behavior in corner cases [13]. Furthermore, the models facilitate the prediction of interactions between AVs and human-driven vehicles. With accurate simulation results, stakeholders can lay the groundwork for decision-making and strategic planning in anticipation of the forthcoming mixed traffic.

Recently, there has been a surge in AV perception datasets gathered through onboard cameras and Light Detection and Ranging (LiDAR), such as the BDD100K [14], Argoverse [15, 16], Waymo perception [17], KITTI [18], nuScenes [19], ONCE [20], and ZOD [21] datasets. These perception datasets are primarily used to predict the motion states of vehicles surrounding the AV and to address basic safety conditions in AVs. However, they fall short of capturing the complex driving behaviors of AVs. In stark contrast, AV trajectory data, whose collection depends on the Global Positioning System (GPS) and Inertial Measurement Units (IMU), remain exceedingly scarce despite their critical importance as previously discussed. This scarcity is largely due to automakers’ reluctance to voluntarily share trajectory data generated with their automated driving technology. The high costs of renting test sites and vehicles with automated driving technology also pose barriers for researchers seeking to collect these essential data.

Despite these challenges, researchers worldwide have published several trajectory datasets under varying conditions and of differing sizes, as outlined in Table 1. However, these datasets often fall short in terms of refinement, reliability, and completeness, which limits their utility for comprehensive and precise studies of car-following behavior [22]. Firstly, not all datasets are specifically designed to collect trajectory data; they may include car-following trajectory data alongside extraneous information, such as data on lateral vehicles that do not influence the AV’s car-following behavior [17, 23]. This necessitates a selective refinement process to isolate the relevant trajectory data. Secondly, some datasets contain only raw data, which may include outliers or anomalies resulting from measurement errors, thus compromising data reliability. Thirdly, these datasets occasionally lack crucial details (e.g., vehicle length), requiring researchers to make educated guesses to fill these gaps. In light of these limitations, there is currently no unified and well-processed trajectory dataset that encompasses multiple AVs across diverse experimental conditions and scenarios. This absence hinders comprehensive studies on the impact of AVs on transportation.

To address this gap, this study proposes a unified approach to processing AV trajectory datasets that enhances their refinement, reliability, and completeness. While it would be impractical to conduct empirical research on all available vehicles with automated driving technology, a dataset built through systematic and structured processes that compile the results of experimental campaigns conducted by various global research teams, including the authors’ group, the Connected & Autonomous Transportation Systems Laboratory (CATS Lab), can provide substantial insights into the longitudinal behavior of AVs. Similar to how standard datasets such as ImageNet [24], KITTI [18], and NGSIM [25] have revolutionized their respective fields, this work aims to establish an open-source Unified Longitudinal TRAjectory dataset for AVs (Ultra-AV) for future research on the longitudinal behavior of AVs. The Ultra-AV dataset will facilitate the analysis of data, the development of models, and the identification of characteristics that influence AVs’ impact on transportation.

This study has the following contributions:

  • This study systematically reviewed open-source AV trajectory datasets and detailed their collection scenarios and conditions.

  • This study developed a unified trajectory data format that includes essential elements for car-following behavior analysis of AVs, such as the position, speed, and acceleration of both the following AV (FAV) and the lead vehicle (LV).

  • This study introduced a standardized trajectory data processing methodology that involves multiple steps to enhance the refinement, reliability, and completeness of the data.

  • This study validated the processed unified trajectory dataset through three key approaches: data collection methods, analysis of performance metrics, and development of AV models.

To summarize, we leverage available open-source AV datasets to facilitate research. A comprehensive workflow for processing multiple open-source datasets to compile this dataset is illustrated in Figure 1.

Figure 1: A road map of this paper.

Methods

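Efforts have been made globally to gather trajectory data related to AVs. We have examined 13 open-source datasets, each providing distinct insights into AV behavior across various driving conditions and scenarios. These open-source datasets are from six providers: Vanderbilt (one dataset), CATS (three datasets), OpenACC (four datasets), Central Ohio (two datasets), Waymo (two datasets), and Argoverse 2 (one dataset). The individual datasets are listed in the first note to Table 1.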

The majority of the datasets reviewed involve long-duration AV trajectories, which have been widely used in the literature to analyze AV behavior. In contrast, the Waymo Open Dataset’s Waymo Motion Dataset and the Argoverse 2 Motion Forecasting Dataset contain comparatively shorter trajectories, with durations of 9.1 seconds and 11 seconds at 10 Hz, respectively. These datasets are primarily employed in motion forecasting research. Nevertheless, they are typically collected in urban areas with complex traffic environments, which provides the opportunity to analyze AV behavior in challenging conditions. Consequently, this paper includes analyses of these two datasets. Other motion forecasting datasets typically consist of even shorter trajectories, such as the Argoverse 1 Motion Forecasting Dataset [15], which consists of 5-second trajectories. Given that such a short duration may not adequately reflect AV behavioral patterns, our analysis does not consider these datasets.

Figure 2: Test sites distribution of the trajectory datasets.

The locations of the test sites for the datasets are depicted in Figure 2. The reviewed datasets were collected in several cities in the United States and Europe, which ensures diversity and representativeness among the selected cities [31]. Comprehensive details, including data collection, AV information, test sites, and experiment settings, are summarized in Table 1.

Table 1: Overview of the AV longitudinal trajectory open datasets.

| Category | Dataset¹ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Data Collection | Sensor | Ublox² | Ublox C066-F9P | Ublox C066-F9P | LiDAR | Ublox 9 | Ublox 8 | RT-Range S³ | Fusion of three types of sensors⁴ | Fusion of three types of sensors⁵ | Fusion of three types of sensors⁵ | Fusion of two types of sensors⁶ | Fusion of two types of sensors⁶ | Fusion of three types of sensors⁷ |
| Data Collection | Accuracy | N/A | 0.26 m & 0.089 m/s | 0.26 m & 0.089 m/s | N/A | 0.3 m & 0.14 m/s | N/A | N/A | | | | | | |
| Data Collection | Frequency | 10 Hz | 10 Hz | 1 Hz | 10 Hz | 10 Hz | 10 Hz | 10 Hz | 10 Hz | 10 Hz | 10 Hz | 10 Hz | 10 Hz | 10 Hz |
| AV Information | Level | ADAS | ADAS | ADAS | ADAS | ADAS | ADAS | ADAS | ADAS | ADAS | ADAS | ADS | ADS | ADAS |
| AV Information | Brand | N/A | Lincoln | Lincoln | Lincoln | Hyundai | 7 vehicles, 6 brands, 7 models⁸ | 5 vehicles, 4 brands, 5 models⁹ | 12 vehicles, 9 brands, 12 models¹⁰ | Tesla | 2 vehicles, 2 brands¹¹ | Waymo | Waymo | Ford |
| AV Information | Model | N/A | MKZ | MKZ | MKZ | Ioniq hybrid | | | | N/A | | N/A | N/A | Fusion Hybrids¹² |
| AV Information | Year | 2019 | 2016 & 2017 | 2016 & 2017 | 2016 | 2018 | | | | N/A | | N/A | N/A | N/A |
| AV Information | Powertrain¹³ | N/A | HV | HV | HV | HV | | | | N/A | | N/A | N/A | HV |
| AV Information | Dimension | SUV | Sedan | Sedan | Sedan | Sedan | | | | Sedan | | SUV | SUV | Sedan |
| Test Sites | Accessibility | public | public | public | public | public | public | closed | closed | public | public | public | public | public |
| Test Sites | Road Type | freeway | highway | highway | urban | highway | highway | rural | rural | freeway & highway | freeway & highway | urban | urban | urban |
| Test Sites | Weather | Ideal | Ideal | Ideal | Ideal | Ideal | Ideal | Ideal | Ideal | Ideal | Ideal | whole | whole | whole |
| Experiment Setting | Duration | N/A | 3 days | N/A | 2024/03/01-2024/03/03 | 2020/10/27 | 2019/02/26-2019/02/28 | 2019/04/07, 2019/05/07 | 2019/10/06-2019/10/09 | 2021/09/01-2022/02/19 | 2022/03/15-2022/03/21 | N/A | N/A | N/A |
| Experiment Setting | Speed Setting | no | [11.2,15.6], [20.1,24.6] | [22.4,24.6] | no | no | no | straight: [25,27.8] & curve: [13.9,16.7] | [8.3,16.7] | no | no | no | no | no |
| Experiment Setting | Headway Setting | no | 4 levels of headway settings | 4 levels of headway settings | no | no | shortest time headway setting | no | 3 levels and mixed | no | no | no | no | no |
| Experiment Setting | Drive Type | naturalistic drive | artificial drive | artificial drive | artificial drive | naturalistic drive | naturalistic drive | artificial drive | artificial drive | naturalistic drive | naturalistic drive | naturalistic drive | naturalistic drive | naturalistic drive |

  ¹ Datasets: 1 = Vanderbilt Two-vehicle ACC Dataset; 2 = CATS ACC Dataset; 3 = CATS Platoon Dataset; 4 = CATS UW Dataset; 5 = OpenACC Casale Dataset; 6 = OpenACC Vicolungo Dataset; 7 = OpenACC Asta Dataset; 8 = OpenACC ZalaZone Dataset; 9 = Ohio Single-vehicle Dataset; 10 = Ohio Two-vehicle Dataset; 11 = Waymo Perception Dataset; 12 = Waymo Motion Dataset; 13 = Argoverse 2 Motion Forecasting Dataset. Tables 4-7 also follow these IDs.

  ² Ublox GNSS from the Ublox company (https://www.u-blox.com/).

  ³ The RT-Range S multiple-target ADAS measurement solution by Oxford Technical Solutions (https://www.oxts.com/).

  ⁴ The three types of sensors are: Race Logic VBOX with 0.02 m position accuracy and 0.03 m/s speed accuracy, Ublox 9 with 0.3 m position accuracy and 0.14 m/s speed accuracy, and a tracker app available from ZalaZone with 10 m position accuracy and 0.28 m/s speed accuracy.

  ⁵ For the Tesla vehicle, the three types of sensors are: a 32-line Velodyne LiDAR with 0.03 m position accuracy, two pluggable USB monocameras, and an RT3000 with 0.01 m position accuracy from OXTS. For the Ford vehicle, the RT3000 is replaced by the Novatel SPAN from Novatel (https://novatel.com/).

  ⁶ The two sensors are: LiDAR and a high-resolution pinhole camera.

  ⁷ The three types of sensors are: a 32-line Velodyne LiDAR with 0.03 m position accuracy, high-resolution ring cameras, and front-facing stereo cameras.

  ⁸ Ford S-Max 2018 ICEV SUV, KIA Niro 2019 HV SUV, Mini Cooper 2018 ICEV Hatchback, Mitsubishi Outlander PHEV 2018 HV SUV, Mitsubishi SpaceStar 2018 ICEV SUV, Peugeot 3008GTLine 2018 ICEV SUV, VW GolfE 2018 EV SUV.

  ⁹ Audi A6 2018 ICEV Sedan, Audi A8 2018 ICEV Sedan, BMW X5 2018 ICEV SUV, Mercedes AClass 2019 ICEV Sedan, Tesla Model3 2019 EV Sedan.

  ¹⁰ Audi A4 Avant 2019 HV SUV, Audi E-tron 2019 EV SUV, BMW I3S 2018 HV Hatchback, Jaguar I-Pace 2019 EV Hatchback, Mazda 3 2019 ICEV Sedan, Mercedes-Benz GLE 450 4Matic 2019 HV SUV, Smart BME Addv (developed by Budapest University of Technology and Economics), Skoda Octavia RS 2019 ICEV SUV, Tesla Model3 2019 EV Sedan, Tesla ModelS 2019 EV Sedan, Tesla ModelX 2016 EV Hatchback, Toyota RAV4 2019 HV SUV.

  ¹¹ A retrofitted Tesla Sedan and a retrofitted Ford Fusion Sedan from AutonomouStuff.

  ¹² The Ford Fusion Hybrid is integrated with Argo AI self-driving technology.

  ¹³ ICEV/HV/EV: internal combustion engine vehicle/hybrid vehicle/electric vehicle.

To enhance the refinement, reliability, and completeness of these datasets, this study proposes a three-step process to develop the Ultra-AV dataset: (1) extraction of longitudinal trajectory data; (2) general data cleaning to remove anomalies and errors; and (3) data-specific cleaning tailored for car-following behavior.

Step 1: Extraction of longitudinal trajectory data

The first step of the data processing aims to obtain unified longitudinal trajectory data, which we identify and store in a unified data format. Before explaining the extraction process, we define the longitudinal trajectory used in this study. Let $\mathcal{I}=\{1,\dots,I\}$ be the index set of longitudinal trajectories, each comprising a series of consecutive data points, where $I$ is the total number of trajectories. Each trajectory $i\in\mathcal{I}$ contains a series of consecutive time stamps $\mathcal{T}_i=\{t_{i0},t_{i1},\dots,t_{iT_i}\}$ with the same time gap $\Delta t$, where $T_i$ is the number of time stamps for trajectory $i$. Although the datasets we reviewed organize data in a similar "trajectory" format, a single trajectory may contain different FAVs or LVs in different lanes. For consistency, we define a longitudinal trajectory as one in which a single FAV $c_i^{\mathrm{f}}$ follows the same LV $c_i^{\mathrm{l}}$ in the same lane, without changing lanes, throughout all time stamps in $\mathcal{T}_i$.

In a longitudinal trajectory $i$, the data point at time stamp $t$ corresponds to a state vector $\mathbf{s}_{it}=[a^{\mathrm{f}}_{it},d_{it},v^{\mathrm{f}}_{it},\Delta v_{it}]$, where $a^{\mathrm{f}}_{it}$ denotes the longitudinal acceleration of $c^{\mathrm{f}}_i$, $d_{it}$ is the spatial gap (i.e., bumper-to-bumper distance) between $c^{\mathrm{l}}_i$ and $c^{\mathrm{f}}_i$, $v^{\mathrm{f}}_{it}$ is the velocity of $c^{\mathrm{f}}_i$, and $\Delta v_{it}$ is the velocity difference between $c^{\mathrm{l}}_i$ and $c^{\mathrm{f}}_i$.
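
The state vector defined above maps naturally onto a small record type. The following is a minimal Python sketch, not the authors' implementation: the names `State` and `build_states` are hypothetical, SI units are assumed, the sign convention $\Delta v = v^{\mathrm{l}} - v^{\mathrm{f}}$ is an assumption (chosen so that a faster LV widens the gap), and the finite-difference acceleration is only a fallback for cases where no measured acceleration is available.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class State:
    """One data point s_it of a longitudinal trajectory (SI units assumed)."""
    a_f: float  # longitudinal acceleration of the FAV
    d: float    # bumper-to-bumper gap between LV and FAV
    v_f: float  # speed of the FAV
    dv: float   # speed difference, assumed here to be v_l - v_f


def build_states(gap: Sequence[float], v_f: Sequence[float],
                 v_l: Sequence[float], dt: float,
                 a_f: Optional[Sequence[float]] = None) -> List[State]:
    """Assemble state vectors from equally spaced samples (hypothetical helper)."""
    n = len(gap)
    if n < 2 or len(v_f) != n or len(v_l) != n:
        raise ValueError("need at least two samples and equal-length inputs")
    if a_f is None:
        # Assumption: estimate acceleration by forward differences and
        # repeat the last estimate for the final sample.
        a_f = [(v_f[k + 1] - v_f[k]) / dt for k in range(n - 1)]
        a_f.append(a_f[-1])
    elif len(a_f) != n:
        raise ValueError("a_f must have the same length as the other inputs")
    return [State(a_f[k], gap[k], v_f[k], v_l[k] - v_f[k]) for k in range(n)]


states = build_states(gap=[20.0, 20.5, 21.0], v_f=[10.0, 10.5, 11.0],
                      v_l=[15.0, 15.5, 16.0], dt=0.1)
print(states[0])
```

Keeping each data point immutable makes it harder to modify a trajectory by accident during later cleaning steps; a tabular layout with one column per state component would serve equally well.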

Extracting the longitudinal trajectory set $\mathcal{I}$ necessitates identifying $c^{\mathrm{l}}_i$ and $c^{\mathrm{f}}_i$. The datasets can be categorized into two types by the identification procedure. In the first category, exemplified by the Vanderbilt ACC Dataset and the CATS Open Datasets, the relationship between $c^{\mathrm{l}}_i$ and $c^{\mathrm{f}}_i$ is labeled in the dataset. The second category, including the Central Ohio Datasets, the Waymo Open Dataset, and the Argoverse 2 Motion Forecasting Dataset, provides information on all surrounding vehicles but does not specifically label the relationship. Thus, this paper proposes a unified algorithm to identify LVs among the many recorded trajectories. The identification algorithm is as follows:

  1. Segment trajectories at the AV’s lane changes. For the Central Ohio Dataset, which includes the ID of the lane in which each vehicle is located, processing is straightforward: we segment each trajectory into multiple consecutive trajectories so that each segment maintains a consistent lane ID throughout its duration. For datasets that only offer vehicle positions without lane IDs, such as the Waymo Motion Dataset and the Argoverse 2 Motion Forecasting Dataset, identifying consistent-lane trajectories poses a challenge. To address this, we employ linear regression to identify trajectories that exhibit straight-driving behavior, which indicates a consistent lane ID. We denote the set of trajectories from the original dataset as $\mathcal{J}^{0}$ to differentiate it from the set of longitudinal trajectories $\mathcal{I}$ obtained after processing. We denote the Euler coordinate position of the center of mass of vehicle $k\in\mathcal{K}_j$ in trajectory $j\in\mathcal{J}^{0}$ at time stamp $t$ as $(p^{\mathrm{x}}_{jtk}, p^{\mathrm{y}}_{jtk})$, where $k=0$ represents the FAV and $k\in\mathcal{K}_j\setminus\{0\}$ represents a surrounding vehicle.

     For each trajectory $j\in\mathcal{J}^{0}$, we apply the least squares method to fit a linear model with $\{p^{\mathrm{x}}_{jt0}\}_{t\in\mathcal{T}_j}$ as inputs and $\{p^{\mathrm{y}}_{jt0}\}_{t\in\mathcal{T}_j}$ as outputs, and then compute the R-squared ($R^2$) of the fitted model. Trajectories whose $R^2$ is less than a threshold are considered not to follow a straight line. The threshold is set to 0.9 based on our preliminary experimental results. These trajectories are excluded from set $\mathcal{J}^{0}$, and the resulting trajectory set is denoted as $\mathcal{J}^{1}$.

  2. Identify preceding vehicles of the FAV. For each trajectory $j\in\mathcal{J}^{1}$ and time stamp $t\in\mathcal{T}_j$, we define the set of vehicles $\bar{\mathcal{K}}_{jt}$, where each vehicle $k\in\bar{\mathcal{K}}_{jt}$ must meet the following criteria: (1) $k\in\mathcal{K}_j\setminus\{0\}$, (2) $k$ is located in the same lane as the FAV, and (3) $k$ is located in front of the FAV. For the Central Ohio Dataset, which provides both the lane ID and the Frenet coordinates [29] of surrounding vehicles, the identification of $\bar{\mathcal{K}}_{jt}$ is straightforward. However, datasets such as the Waymo Motion Dataset and the Argoverse 2 Motion Forecasting Dataset primarily consist of trajectories represented in Euler coordinates.

     To process these data, we first exclude vehicles not moving in the same direction as the FAV by removing any vehicle $k$ for which the dot product $\boldsymbol{\sigma}_{0t}\cdot\boldsymbol{\sigma}_{kt}$ is negative, where $\boldsymbol{\sigma}_{0t}=[p^{\mathrm{x}}_{j(t-1)0}-p^{\mathrm{x}}_{jt0},\ p^{\mathrm{y}}_{j(t-1)0}-p^{\mathrm{y}}_{jt0}]$ and $\boldsymbol{\sigma}_{kt}=[p^{\mathrm{x}}_{j(t-1)k}-p^{\mathrm{x}}_{jtk},\ p^{\mathrm{y}}_{j(t-1)k}-p^{\mathrm{y}}_{jtk}]$ are the direction vectors of the FAV and vehicle $k$, respectively. Next, we define the direction vector from the FAV to vehicle $k$, $\boldsymbol{\sigma}_{0kt}=[p^{\mathrm{x}}_{jt0}-p^{\mathrm{x}}_{jtk},\ p^{\mathrm{y}}_{jt0}-p^{\mathrm{y}}_{jtk}]$. We use the dot product of $\boldsymbol{\sigma}_{0t}$ and $\boldsymbol{\sigma}_{0kt}$, taken between unit-length vectors so that it equals the cosine of the angle between them, to verify alignment in direction, removing vehicles for which this value is less than 0.984. The threshold is derived as follows: taking the average width of a mid-size car as 6 feet [32] and the lane width as 10 feet [33] gives a maximum lateral deviation of 4 feet. We assume the AV follows the 3-second rule [34] at a minimum speed of 5 mph, which results in a minimum spatial gap of 22 feet. Thus the maximum angle is $\theta=\arctan(4/22)\approx\arctan(0.182)$, and $\cos(\theta)\approx 0.984$. This threshold ensures that only vehicles moving in a closely similar direction to the FAV are retained for further analysis.

  3. Identify the LV. For each trajectory $j\in\mathcal{J}^{1}$ and time stamp $t\in\mathcal{T}_j$, calculate the spatial headway $h_{jtk}$ by Equation (1) for each vehicle $k\in\bar{\mathcal{K}}_{jt}$ relative to the FAV. $h_{jtk}$ is defined as the distance between the centers of the FAV and the surrounding vehicle.

     $$h_{jtk}=\sqrt{(p^{\mathrm{x}}_{jtk}-p^{\mathrm{x}}_{jt0})^{2}+(p^{\mathrm{y}}_{jtk}-p^{\mathrm{y}}_{jt0})^{2}} \tag{1}$$

     The vehicle with the smallest $h_{jtk}$ is considered the LV for that time stamp, $c^{\mathrm{l}}_{jt}=\arg\min_{k\in\bar{\mathcal{K}}_{jt}} h_{jtk}$. If a trajectory $j$ involves multiple LVs over time, divide it into several longitudinal trajectories, each consistent with a single LV. Collect these into set $\mathcal{I}$.

  4.

    Enhance identification by the relationship between spatial headway and speed. Our preliminary experimental results revealed several inaccurate LV identifications when only the previous steps were used. To improve the algorithm, we add a consistency check to the trajectory processing workflow. For each trajectory $i\in\mathcal{I}$, we compare the change in spatial gap, $\Delta d_{i}=d_{iT_{i}}-d_{i0}$, to the change estimated from speed differences, $\Delta\hat{d}_{i}=\sum_{t\in\mathcal{T}_{i}}\Delta t\cdot\Delta v_{it}$. In a consistent car-following scenario, these two changes should align closely. Therefore, if the relative difference $\frac{|\Delta d_{i}-\Delta\hat{d}_{i}|}{\Delta d_{i}}$ exceeds a threshold of 0.2, chosen based on the preliminary experimental results, trajectory $i$ is deemed inaccurate and removed from set $\mathcal{I}$.
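The direction filter, LV selection, and consistency check described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`identify_lv`, `split_by_lv`, `is_consistent`) are hypothetical, the offset is taken from the FAV toward each candidate and normalised so that the dot product is a cosine, and the relative difference is divided by the absolute gap change so that the check also behaves sensibly for a shrinking gap.

```python
import math

import numpy as np

# Alignment threshold from Footnote 1: 4 ft lateral offset over a 22 ft gap.
COS_THRESHOLD = math.cos(math.atan(4.0 / 22.0))  # approximately 0.984


def identify_lv(fav_xy, fav_dir, others_xy, cos_threshold=COS_THRESHOLD):
    """Return the index of the nearest aligned vehicle ahead of the FAV, or None."""
    fav_xy = np.asarray(fav_xy, dtype=float)
    fav_dir = np.asarray(fav_dir, dtype=float)
    others_xy = np.asarray(others_xy, dtype=float).reshape(-1, 2)
    if len(others_xy) == 0:
        return None
    unit_dir = fav_dir / np.linalg.norm(fav_dir)
    offsets = others_xy - fav_xy                # assumption: FAV -> candidate
    headway = np.linalg.norm(offsets, axis=1)   # Equation (1)
    with np.errstate(invalid="ignore", divide="ignore"):
        cosine = (offsets @ unit_dir) / headway
    aligned = np.flatnonzero((headway > 0) & (cosine >= cos_threshold))
    if aligned.size == 0:
        return None
    return int(aligned[np.argmin(headway[aligned])])


def split_by_lv(lv_ids):
    """Split per-time-stamp LV ids into (start, end, lv_id) runs, end exclusive."""
    segments = []
    start = 0
    for t in range(1, len(lv_ids) + 1):
        if t == len(lv_ids) or lv_ids[t] != lv_ids[start]:
            segments.append((start, t, lv_ids[start]))
            start = t
    return segments


def is_consistent(gap, speed_diff, dt, tol=0.2):
    """Compare the observed gap change with the summed speed differences."""
    gap = np.asarray(gap, dtype=float)
    speed_diff = np.asarray(speed_diff, dtype=float)
    observed = gap[-1] - gap[0]
    estimated = dt * speed_diff.sum()
    if observed == 0.0:
        # Assumption: with no gap change, accept only a zero estimate.
        return estimated == 0.0
    # Assumption: absolute value in the denominator (see lead-in).
    return abs(observed - estimated) / abs(observed) <= tol
```

A trajectory would then be processed by calling `identify_lv` at each time stamp, splitting the resulting id sequence with `split_by_lv`, and keeping only the segments for which `is_consistent` returns `True`.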

Following the refinement of longitudinal trajectories, key labels relevant to analyzing FAV behaviors are extracted from the processed data and formatted consistently. The labels retained are listed in Table 2, where each label is described with its definition and calculation methods. Notably, a default value of 4.5 meters, corresponding to the average length of a mid-size vehicle [32], is assigned to vehicle length if it is not explicitly provided in the dataset. This standardization ensures uniformity in the data.

Table 2: Labels for the uniform data format.
| Label | Description | Notation and formulation | Unit |
| --- | --- | --- | --- |
| Trajectory_ID | ID of the longitudinal trajectory. | $i\in\mathcal{I}$. | N/A |
| Time_Index | Common time stamp in one trajectory. | $t\in\mathcal{T}_{i},\ i\in\mathcal{I}$. | $\mathrm{s}$ |
| ID_LV | LV ID. | $c^{\mathrm{l}}_{i},\ i\in\mathcal{I}$. Label each FAV with a different ID and all HVs with -1. | N/A |
| Pos_LV | LV position in the Frenet coordinate. | $p^{\mathrm{l}}_{it}=p^{\mathrm{f}}_{it}+h_{it},\ i\in\mathcal{I},\ t\in\mathcal{T}_{i}$. | $\mathrm{m}$ |
| Speed_LV | LV speed. | $v^{\mathrm{l}}_{it}=\frac{p^{\mathrm{l}}_{i(t+1)}-p^{\mathrm{l}}_{it}}{\Delta t},\ i\in\mathcal{I},\ t\in\mathcal{T}_{i}$. | $\mathrm{m}/\mathrm{s}$ |
| Acc_LV | LV acceleration. | $a^{\mathrm{l}}_{it}=\frac{v^{\mathrm{l}}_{i(t+1)}-v^{\mathrm{l}}_{it}}{\Delta t},\ i\in\mathcal{I},\ t\in\mathcal{T}_{i}$. | $\mathrm{m}/\mathrm{s}^{2}$ |
| ID_FAV | FAV ID. | $c^{\mathrm{f}}_{i},\ i\in\mathcal{I}$. Label each FAV with a different ID. | N/A |
| Pos_FAV | FAV position in the Frenet coordinate. | $p^{\mathrm{f}}_{it}=p^{\mathrm{f}}_{i(t-1)}+\Delta t\cdot v^{\mathrm{f}}_{it},\ i\in\mathcal{I},\ t\in\mathcal{T}_{i}$. | $\mathrm{m}$ |
| Speed_FAV | FAV speed. | $v^{\mathrm{f}}_{it}=\frac{p^{\mathrm{f}}_{i(t+1)}-p^{\mathrm{f}}_{it}}{\Delta t},\ i\in\mathcal{I},\ t\in\mathcal{T}_{i}$. | $\mathrm{m}/\mathrm{s}$ |
| Acc_FAV | FAV acceleration. | $a^{\mathrm{f}}_{it}=\frac{v^{\mathrm{f}}_{i(t+1)}-v^{\mathrm{f}}_{it}}{\Delta t},\ i\in\mathcal{I},\ t\in\mathcal{T}_{i}$. | $\mathrm{m}/\mathrm{s}^{2}$ |
| Space_Gap | Bumper-to-bumper distance between two vehicles. | $g_{it}=p^{\mathrm{l}}_{it}-p^{\mathrm{f}}_{it}-l^{\mathrm{f}}/2-l^{\mathrm{l}}/2,\ i\in\mathcal{I},\ t\in\mathcal{T}_{i}$, where $l^{\mathrm{l}}$ and $l^{\mathrm{f}}$ are the lengths of the LV and the FAV. | $\mathrm{m}$ |
| Space_Headway | Distance between the centers of two vehicles. | $h_{it}=p^{\mathrm{l}}_{it}-p^{\mathrm{f}}_{it},\ i\in\mathcal{I},\ t\in\mathcal{T}_{i}$. | $\mathrm{m}$ |
| Speed_Diff | Speed difference of the two vehicles. | $\Delta v_{it}=v^{\mathrm{l}}_{it}-v^{\mathrm{f}}_{it},\ i\in\mathcal{I},\ t\in\mathcal{T}_{i}$. | $\mathrm{m}/\mathrm{s}$ |
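
As a rough illustration of how the derived labels in Table 2 relate to one another, the sketch below computes them from the two position series. The function names and the NaN padding of the last sample (which has no successor for the forward difference) are assumptions of this example; the 4.5 m default length follows the text above.

```python
import numpy as np

DEFAULT_LENGTH_M = 4.5  # default vehicle length stated in the text


def forward_diff(x, dt):
    """Forward difference (x[t+1] - x[t]) / dt; the last sample is set to NaN."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    out[:-1] = np.diff(x) / dt
    return out


def derive_labels(pos_lv, pos_fav, dt,
                  len_lv=DEFAULT_LENGTH_M, len_fav=DEFAULT_LENGTH_M):
    """Compute the speed, acceleration, gap, and headway labels of Table 2."""
    pos_lv = np.asarray(pos_lv, dtype=float)
    pos_fav = np.asarray(pos_fav, dtype=float)
    speed_lv = forward_diff(pos_lv, dt)
    speed_fav = forward_diff(pos_fav, dt)
    headway = pos_lv - pos_fav                      # center-to-center distance
    return {
        "Speed_LV": speed_lv,
        "Acc_LV": forward_diff(speed_lv, dt),
        "Speed_FAV": speed_fav,
        "Acc_FAV": forward_diff(speed_fav, dt),
        "Space_Headway": headway,
        "Space_Gap": headway - len_fav / 2 - len_lv / 2,  # bumper to bumper
        "Speed_Diff": speed_lv - speed_fav,
    }
```

Because acceleration is a second difference of position, noise in the positions is amplified twice, which is one reason the outlier handling in Step 2 matters.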

Step 2: General data cleaning

The second step focuses on enhancing the reliability of the trajectory dataset by cleaning it, which includes removing outliers and imputing missing values. Raw datasets may include abnormal values due to sensor errors, such as accelerations exceeding 100 $\mathrm{m}/\mathrm{s}^{2}$. Such errors can become more pronounced when calculating additional metrics through differentiation. We define these abnormal values as outliers and remove them by excluding data that lie more than $\eta$ standard deviations ($std$) from the mean ($mean$). The cleaning process includes the following procedures:

  1.

    Mark missing values and outliers. The mean and standard deviation for each label are calculated, excluding the data previously identified as missing. We then mark outliers that fall outside the range $[\text{mean}-\eta\cdot\text{std},\ \text{mean}+\eta\cdot\text{std}]$. The calculation of the mean and standard deviation and the marking are repeated iteratively, each time excluding the already marked outliers, until no new outliers are marked. The labels to be checked and their respective $\eta$ values are summarized in Table 3. We use a conservative $\eta$ to ensure this step removes only genuine outliers without affecting general data.

  2.

    Remove or impute the marked data points. Based on experience, if a label includes ten or more consecutive marked data points, we remove all these points to maintain accuracy in trajectory analysis. However, if there are fewer than ten consecutive marked data points within a label, we use linear interpolation to replace the marked data and minimize data loss. Note that this interpolation is only done within the same trajectory.

  3.

    Re-organize the trajectory ID. After removing some data points, certain trajectories may become discontinuous. To address this, we follow these steps to re-organize the "Trajectory_ID" and "Time_Index" labels: 1) Split any trajectories where "Time_Index" is discontinuous into multiple new trajectories, assigning each a new "Trajectory_ID". 2) Remove short trajectories that contain fewer than 70 data points. This threshold is based on preliminary experiments. 3) Renumber the "Trajectory_ID" and "Time_Index" columns to start from 0, ensuring a continuous sequence. 4) Update the labels "Pos_LV" and "Pos_FAV" to reflect changes in trajectory segmentation.
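
The three procedures above can be sketched for a single label of a single trajectory as follows. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, missing values are assumed to be encoded as NaN, the population standard deviation is used, and short marked runs at the very start or end of a trajectory are held at the nearest valid value because `np.interp` does not extrapolate.

```python
import numpy as np


def mark_outliers(values, eta=10.0):
    """Iteratively flag points outside mean +/- eta * std of the unflagged data."""
    values = np.asarray(values, dtype=float)
    marked = np.isnan(values)                 # missing values start as marked
    while True:
        kept = values[~marked]
        if kept.size == 0:
            return marked
        mean, std = kept.mean(), kept.std()
        new = ~marked & (np.abs(values - mean) > eta * std)
        if not new.any():
            return marked
        marked |= new


def marked_runs(marked):
    """Return (start, end) pairs, end exclusive, for each run of True values."""
    padded = np.concatenate(([False], marked, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(edges[::2], edges[1::2])]


def clean_series(values, eta=10.0, max_fill=10):
    """Interpolate short marked runs; flag runs of max_fill or more for removal."""
    values = np.asarray(values, dtype=float).copy()
    marked = mark_outliers(values, eta)
    drop = np.zeros(values.shape, dtype=bool)
    good = np.flatnonzero(~marked)
    for start, end in marked_runs(marked):
        if end - start >= max_fill or good.size == 0:
            drop[start:end] = True
        else:
            idx = np.arange(start, end)
            values[idx] = np.interp(idx, good, values[good])
    return values, drop


def split_trajectory(drop, min_len=70):
    """Return (start, end) slices of kept samples with at least min_len points."""
    return [(s, e) for s, e in marked_runs(~drop) if e - s >= min_len]
```

Each slice returned by `split_trajectory` would become a new trajectory with its own "Trajectory_ID" and a "Time_Index" restarting from 0.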

After these cleaning procedures, we obtained the longitudinal trajectory dataset, which includes both free-flow and car-following trajectories. Given the critical importance of car-following behavior in the study of AV longitudinal behaviors, we specifically extracted the car-following trajectories in the next step for further analysis and validation.

Table 3: Identified criteria for Step 2 and Step 3.
| Steps | $v^{\mathrm{f}}$ | $a^{\mathrm{f}}$ | $v^{\mathrm{l}}$ | $a^{\mathrm{l}}$ | $d$ |
| --- | --- | --- | --- | --- | --- |
| Step 2 | N/A | $\eta=10$ | N/A | $\eta=10$ | $\eta=10$ |
| Step 3 | $[0.1,+\infty)$ | $[-5,5]$ | $[0.1,+\infty)$ | $[-5,5]$ | $(0,120]$ |

Step 3: Data-specific cleaning

In this step, we define hard margins for certain labels to identify car-following trajectories. Although car following is a widely used concept among researchers, there is no universally accepted definition [35]. Thus, we propose several thresholds, derived from both a review of the relevant literature and empirical analysis of the data, to identify car-following behavior. They are also summarized in Table 3:

  • A minimum speed threshold of 0.1 m/s, below which FAVs are considered to be stationary based on empirical observations.

  • A spatial distance threshold whereby an LV is situated within 120 meters in the same lane as the FAV, which distinguishes car following from free-flow traffic conditions [36].

  • An acceleration range for the FAV between -5 $\mathrm{m}/\mathrm{s}^{2}$ and 5 $\mathrm{m}/\mathrm{s}^{2}$ [37].

The remaining process of this step is similar to Step 2. Data points that fall outside these established thresholds are either removed or rectified through linear interpolation. After these three steps, we finally obtained the car-following trajectory dataset.
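
The Step 3 ranges in Table 3 amount to a simple per-sample mask. The sketch below is one possible reading, with a hypothetical function name and parameter names; it applies the speed and acceleration ranges to both vehicles, as Table 3 lists them for both. Samples where the mask is `False` would then be removed or interpolated as in Step 2.

```python
import numpy as np


def car_following_mask(speed_fav, acc_fav, speed_lv, acc_lv, gap,
                       v_min=0.1, a_abs_max=5.0, gap_max=120.0):
    """Return True where a sample satisfies every Step 3 range in Table 3."""
    speed_fav, acc_fav, speed_lv, acc_lv, gap = (
        np.asarray(x, dtype=float)
        for x in (speed_fav, acc_fav, speed_lv, acc_lv, gap)
    )
    return (
        (speed_fav >= v_min) & (speed_lv >= v_min)          # not stationary
        & (np.abs(acc_fav) <= a_abs_max)                    # [-5, 5] m/s^2
        & (np.abs(acc_lv) <= a_abs_max)
        & (gap > 0.0) & (gap <= gap_max)                    # (0, 120] m
    )
```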

Data Records

The statistics of the dataset after each processing step described in the previous section are recorded in Tables 4-6.

Table 4: Statistical results of the data following Step 1 processing.
| Label | Statistics | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $v^{\mathrm{l}}$ (m/s) | mean | 29.3 | 15.0 | 23.2 | 3.5 | 32.2 | 26.3 | 18.3 | 9.3 | 15.9 | 14.0 | 6.8 | 5.7 | 4.2 |
| | std | 1.5 | 8.9 | 0.7 | 0.5 | 6.6 | 9.5 | 4.7 | 4.0 | 11.0 | 9.9 | 7.3 | 6.2 | 4.7 |
| | min | 25.7 | 0.0 | 17.4 | 1.7 | 0.0 | 0.0 | 0.0 | -0.1 | 0.0 | 0.0 | -0.3 | 0.0 | 0.0 |
| | max | 34.8 | 26.7 | 24.7 | 4.0 | 40.1 | 40.8 | 33.8 | 27.5 | 42.6 | 39.8 | 28.5 | 32.9 | 22.9 |
| $a^{\mathrm{l}}$ (m/s$^{2}$) | mean | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.3 | 0.1 | 0.1 | 0.0 | 0.0 |
| | std | 0.3 | 3.0 | 0.2 | 0.4 | 0.6 | 1.9 | 0.8 | 0.7 | 3.5 | 2.7 | 0.7 | 9.3 | 7.4 |
| | min | -2.7 | -239.1 | -1.8 | 0.0 | -10.7 | -330.6 | -203.5 | -162.3 | -29.7 | -18.6 | -2.9 | -304.6 | -160.8 |
| | max | 3.8 | 112.9 | 5.8 | 10.0 | 13.2 | 167.3 | 165.6 | 160.6 | 92.9 | 90.5 | 3.5 | 298.6 | 161.4 |
| $v^{\mathrm{f}}$ (m/s) | mean | 29.3 | 14.9 | 23.2 | 3.2 | 32.2 | 26.4 | 18.3 | 9.3 | 16.6 | 14.4 | 7.0 | 6.1 | 4.5 |
| | std | 1.7 | 9.0 | 0.9 | 0.7 | 6.8 | 9.6 | 4.9 | 4.1 | 10.6 | 9.3 | 7.1 | 6.3 | 4.9 |
| | min | 25.7 | 0.0 | 17.6 | 1.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.3 | -1.0 | 0.0 |
| | max | 34.9 | 28.6 | 25.4 | 4.2 | 41.2 | 42.8 | 34.5 | 27.5 | 35.6 | 31.9 | 29.6 | 58.2 | 24.0 |
| $a^{\mathrm{f}}$ (m/s$^{2}$) | mean | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| | std | 0.2 | 3.2 | 0.3 | 0.1 | 0.6 | 2.0 | 0.8 | 0.7 | 0.6 | 0.6 | 0.7 | 10.1 | 6.8 |
| | min | -1.1 | -239.7 | -1.0 | -1.2 | -9.8 | -364.3 | -203.5 | -162.3 | -5.6 | -5.0 | -2.9 | -447.5 | -228.3 |
| | max | 0.6 | 135.8 | 6.2 | 1.0 | 10.1 | 167.3 | 157.0 | 160.6 | 6.1 | 3.3 | 2.9 | 423.3 | 219.2 |
| $d$ (m) | mean | 36.2 | 29.3 | 37.3 | 13.8 | 42.6 | 37.1 | 25.3 | 20.5 | 25.5 | 22.7 | 20.5 | 13.3 | 14.0 |
| | std | 6.7 | 16.2 | 10.5 | 4.8 | 15.3 | 26.7 | 12.3 | 12.7 | 46.7 | 63.3 | 16.1 | 37.7 | 15.6 |
| | min | 17.0 | -1.1 | 15.7 | 4.9 | 1.6 | -2.7 | 1.2 | -1.4 | 0.0 | -15.3 | 2.7 | 0.0 | -2.1 |
| | max | 72.7 | 92.4 | 57.0 | 31.8 | 137.7 | 445.4 | 117.1 | 230.0 | 4119.5 | 1334.2 | 74.5 | 11696.6 | 204.3 |

Table 4 shows the statistical results after the Step 1 extraction of longitudinal trajectory data. These results indicate the range of data collected in each dataset. For example, the CATS UW Dataset, OpenACC ZalaZone Dataset, Waymo Motion Dataset, and Argoverse 2 Motion Forecasting Dataset (datasets 4, 8, 12, and 13) have a low average speed, suggesting that the scenarios in these four datasets are primarily low-speed environments. The CATS UW Dataset and OpenACC ZalaZone Dataset mainly test low-speed environments, while the Waymo Motion Dataset and Argoverse 2 Motion Forecasting Dataset are primarily collected in urban environments where the traffic conditions are complex and AVs usually travel at low speeds. Additionally, there are some outliers, such as the maximum and minimum $a^{\mathrm{l}}$ and $a^{\mathrm{f}}$ in dataset 2 and the maximum $d$ in datasets 9 and 10, which would not occur in a normal driving process. We attribute these values to sensor errors, and they should be removed from the dataset.

Table 5: Statistical results of the data following Step 2 processing.
| Label | Statistics | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $v^{\mathrm{l}}$ (m/s) | mean | 29.3 | 16.9 | 23.2 | 3.5 | 32.2 | 26.3 | 18.3 | 9.3 | 17.7 | 13.9 | 6.8 | 5.8 | 4.3 |
| | std | 1.5 | 7.2 | 0.7 | 0.5 | 6.6 | 9.5 | 4.7 | 3.9 | 10.4 | 10.2 | 7.4 | 6.2 | 4.8 |
| | min | 25.7 | 0.0 | 17.4 | 1.7 | 0.0 | 0.0 | 0.0 | -0.1 | 0.0 | 0.0 | -0.3 | 0.0 | 0.0 |
| | max | 34.8 | 26.7 | 24.7 | 4.0 | 40.1 | 40.8 | 33.8 | 27.5 | 37.4 | 33.4 | 28.5 | 30.7 | 22.9 |
| $a^{\mathrm{l}}$ (m/s$^{2}$) | mean | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.2 | 0.1 |
| | std | 0.3 | 0.6 | 0.2 | 0.0 | 0.6 | 0.6 | 0.5 | 0.4 | 1.0 | 1.0 | 0.7 | 0.9 | 1.1 |
| | min | -2.7 | -3.9 | -1.8 | 0.0 | -5.3 | -5.0 | -3.4 | -4.0 | -9.6 | -10.2 | -2.9 | -9.0 | -11.5 |
| | max | 3.8 | 5.4 | 1.6 | 0.2 | 5.9 | 5.8 | 3.2 | 4.0 | 9.6 | 10.2 | 3.5 | 9.3 | 11.8 |
| $v^{\mathrm{f}}$ (m/s) | mean | 29.3 | 16.8 | 23.2 | 3.2 | 32.2 | 26.3 | 18.3 | 9.3 | 17.8 | 13.9 | 7.0 | 6.3 | 4.6 |
| | std | 1.7 | 7.4 | 0.9 | 0.7 | 6.9 | 9.6 | 4.9 | 4.0 | 10.2 | 10.0 | 7.1 | 6.3 | 5.0 |
| | min | 25.7 | 0.0 | 17.6 | 1.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.3 | 0.0 | 0.0 |
| | max | 34.9 | 28.6 | 25.4 | 4.2 | 41.2 | 42.8 | 34.5 | 27.5 | 34.3 | 31.9 | 29.6 | 36.8 | 24.0 |
| $a^{\mathrm{f}}$ (m/s$^{2}$) | mean | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.0 |
| | std | 0.2 | 0.7 | 0.2 | 0.1 | 0.6 | 0.6 | 0.5 | 0.4 | 0.6 | 0.6 | 0.7 | 2.6 | 1.0 |
| | min | -1.1 | -4.8 | -1.0 | -1.2 | -6.2 | -6.5 | -4.4 | -4.3 | -5.6 | -4.4 | -2.9 | -27.1 | -11.0 |
| | max | 0.6 | 4.2 | 2.0 | 1.0 | 6.1 | 6.5 | 3.4 | 4.3 | 6.1 | 3.3 | 2.9 | 27.3 | 11.2 |
| $d$ (m) | mean | 36.2 | 31.9 | 37.9 | 13.8 | 42.6 | 36.3 | 25.3 | 20.6 | 26.7 | 20.1 | 20.5 | 12.6 | 14.1 |
| | std | 6.7 | 13.0 | 10.3 | 4.8 | 15.3 | 20.9 | 12.3 | 11.8 | 15.1 | 28.7 | 16.1 | 10.2 | 15.7 |
| | min | 17.0 | -1.1 | 15.7 | 5.0 | 1.6 | -2.7 | 1.2 | -1.2 | 0.0 | -13.8 | 2.7 | 0.0 | -2.1 |
| | max | 72.7 | 92.4 | 57.0 | 31.8 | 137.7 | 245.6 | 117.1 | 139.4 | 59.3 | 256.9 | 74.5 | 72.9 | 169.2 |

Due to the presence of outliers in the data obtained from Step 1, we proceeded with the Step 2 general data cleaning; the processed data are recorded in Table 5. After this step, the outliers initially found in Table 4 have been removed. The data removal was carefully limited to a small quantity to ensure that the processing did not significantly influence the overall data distribution. Consequently, the mean and standard deviation remain largely consistent with those in Table 4. Nevertheless, some data that do not typically occur in the car-following scenario still exist in Table 5. For example, the minimum speeds of the OpenACC ZalaZone Dataset and Waymo Perception Dataset (datasets 8 and 11) are less than 0 m/s. The maximum acceleration and deceleration of the Ohio Two-vehicle Dataset and Waymo Motion Dataset (datasets 10 and 12) exceed 10 $\mathrm{m}/\mathrm{s}^{2}$. In the OpenACC Vicolungo Dataset and Ohio Two-vehicle Dataset (datasets 6 and 10), the maximum space gaps are around 250 m.

Table 6: Statistical results of the data following Step 3 processing.
| Label | Statistics | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $v^{\mathrm{l}}$ (m/s) | mean | 29.3 | 17.8 | 23.2 | 3.5 | 32.6 | 27.3 | 18.6 | 9.9 | 20.3 | 20.7 | 10.8 | 6.6 | 7.9 |
| | std | 1.5 | 6.3 | 0.7 | 0.5 | 5.7 | 8.2 | 4.3 | 3.3 | 8.3 | 6.6 | 6.8 | 5.3 | 4.0 |
| | min | 25.7 | 0.1 | 17.4 | 1.7 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| | max | 34.8 | 26.7 | 24.7 | 4.0 | 40.1 | 40.8 | 33.8 | 27.5 | 37.4 | 33.4 | 28.5 | 29.3 | 22.9 |
| $a^{\mathrm{l}}$ (m/s$^{2}$) | mean | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.1 | 0.1 |
| | std | 0.3 | 0.6 | 0.2 | 0.0 | 0.6 | 0.6 | 0.5 | 0.4 | 1.0 | 0.9 | 0.8 | 0.8 | 1.3 |
| | min | -2.3 | -3.9 | -1.8 | 0.0 | -5.0 | -4.8 | -3.4 | -4.0 | -5.0 | -4.9 | -2.9 | -5.0 | -5.0 |
| | max | 2.4 | 4.4 | 1.6 | 0.2 | 4.9 | 5.0 | 3.2 | 4.0 | 5.0 | 5.0 | 3.5 | 5.0 | 5.0 |
| $v^{\mathrm{f}}$ (m/s) | mean | 29.3 | 17.6 | 23.2 | 3.2 | 32.6 | 27.3 | 18.6 | 9.9 | 20.3 | 20.4 | 10.7 | 7.0 | 8.3 |
| | std | 1.7 | 6.6 | 0.9 | 0.7 | 5.9 | 8.4 | 4.5 | 3.3 | 8.3 | 6.3 | 6.6 | 5.3 | 4.1 |
| | min | 25.7 | 0.1 | 17.6 | 1.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| | max | 34.9 | 28.6 | 25.4 | 4.2 | 41.2 | 40.8 | 34.5 | 27.4 | 34.3 | 31.9 | 29.6 | 29.3 | 23.6 |
| $a^{\mathrm{f}}$ (m/s$^{2}$) | mean | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.1 | 0.1 | 0.1 |
| | std | 0.2 | 0.7 | 0.2 | 0.1 | 0.6 | 0.7 | 0.5 | 0.4 | 0.6 | 0.4 | 0.7 | 1.7 | 0.9 |
| | min | -1.1 | -4.8 | -1.0 | -1.2 | -4.7 | -5.0 | -4.4 | -4.3 | -5.0 | -4.0 | -2.9 | -5.0 | -5.0 |
| | max | 0.6 | 4.2 | 2.0 | 1.0 | 4.9 | 4.9 | 3.2 | 4.3 | 4.6 | 2.9 | 2.8 | 5.0 | 5.0 |
| $d$ (m) | mean | 36.2 | 33.4 | 37.9 | 13.8 | 42.7 | 36.1 | 25.6 | 21.5 | 31.6 | 36.7 | 28.0 | 12.1 | 16.9 |
| | std | 6.8 | 11.6 | 10.3 | 4.8 | 13.7 | 16.4 | 12.1 | 11.0 | 11.8 | 21.6 | 15.0 | 6.8 | 13.4 |
| | min | 17.0 | 1.4 | 15.7 | 5.0 | 1.6 | 0.0 | 1.5 | 0.0 | 0.0 | 0.0 | 3.6 | 0.5 | 0.0 |
| | max | 72.7 | 92.4 | 57.0 | 31.8 | 119.6 | 119.9 | 117.1 | 120.0 | 59.3 | 120.0 | 74.5 | 68.6 | 120.0 |

Table 6 shows the statistical results after the Step 3 data-specific cleaning. Following the removal of non-car-following scenario data, the values that fell outside normal ranges in Table 5 have been brought within them. The average speeds have increased in most datasets and remained unchanged in the rest, with the largest increases in the Ohio Single-vehicle Dataset, Ohio Two-vehicle Dataset, Waymo Perception Dataset, and Argoverse 2 Motion Forecasting Dataset (datasets 9, 10, 11, and 13). This suggests the presence of numerous stationary scenarios within these datasets. The average spatial gaps in the Ohio Single-vehicle Dataset, Ohio Two-vehicle Dataset, and Waymo Perception Dataset (datasets 9, 10, and 11) have also significantly increased, indicating that these datasets initially contained many instances of small gaps.

Figure 3 displays the statistical distributions of key variables in car-following behavior analysis, including spatial gap, relative speed, FAV speed, and FAV acceleration, for the final data after completing the three-step data processing. From Figure 3, we can infer the testing scenarios of the data, some of which are indicated in Table 1. For example, the CATS UW Dataset was collected in a low-speed environment, while the OpenACC Casale Dataset was collected in a high-speed environment. Additionally, the form of the car-following model can be analyzed by observing the shape of the distribution. For example, the distributions of $a^{\mathrm{f}}$ from the four datasets in the OpenACC Database clearly show two distinct peaks, one on the left and one on the right of zero, possibly indicating that the vehicles’ acceleration and deceleration behaviors are piecewise.

Figure 3: Probability distributions of $g$, $\Delta v$, $v^{\mathrm{f}}$, and $a^{\mathrm{f}}$.

Technical Validation

In this section, we validate the processed unified trajectory dataset through three aspects. First, we introduce the data collection methods we used in the CATS Open Datasets. Then, we analyze the performance of the car-following trajectory dataset through four metrics. Finally, we analyze the relationships between the variables in the car-following model.

Data Collection

First, we introduce the AV platform developed by the CATS Lab as a reference solution for future AV trajectory data collection. The CATS Lab has developed a complete AV platform, which has been set up in two lab-owned Lincoln MKZ vehicles. The platform is built upon the Robot Operating System (ROS), which provides a robust framework for parallel computing and is particularly well-suited for robotics and autonomous applications. In addition, the platform allows for direct electronic control over the vehicle’s functions by integrating the Drive-By-Wire (DBW) system.

The developed system includes the perception system, operation system, and dynamical system, as shown in Figure 4. The perception system comprises advanced sensing technologies such as LiDAR and cameras that provide real-time data on the vehicle’s surroundings. LiDAR and GPS navigation units offer high-precision location tracking capabilities. The operation system is a hierarchical structure with an upper-level computer and a lower-level controller. This system uses the Transmission Control Protocol/Internet Protocol (TCP/IP) for networking and communication, interlinking with the Controller Area Network (CAN) for vehicle control. The dynamical system features an electrically powered acceleration/braking/steering system to control the vehicle’s longitudinal and lateral motions. The final output uses the CAN to transmit the signals for brake, throttle, and steering angle.

Figure 4: AV platform developed by the CATS Lab.

Data collection is accomplished by the perception system, which obtains precise vehicle trajectory data from various sensors, including LiDAR, GPS, and cameras. The early data collected by the CATS Lab, including the CATS ACC Dataset and CATS Platoon Dataset, primarily used the Ublox GPS system to gather real-time GPS positions and speeds. The real-time car-following spacing between the two vehicles was obtained from the distance between their GPS positions. Preliminary testing indicated that the GPS receivers had a position accuracy of 0.26 m and a speed accuracy of 0.089 m/s. Because of the low precision of GPS and the data packet loss during transmission, we used LiDAR for data collection in the CATS UW Dataset. To achieve higher data precision, a feasible approach is to design algorithms that integrate data from multiple sensors. The Waymo Open Dataset and Argoverse Dataset have already employed similar technologies. Therefore, we recommend that researchers integrate multiple sensors in future data collection to obtain high-precision data.

Performance Measurement

Next, we analyze how each dataset impacts road traffic performance in four metrics, i.e., safety, mobility, stability, and sustainability. We first define the measurements of these four metrics.

The safety of the FAV in car-following behavior is measured by the Time-To-Collision (TTC) [38], which represents the risk or proximity of a vehicle to a potential collision. The TTC at time $t$ is defined as the time that remains until a collision between two vehicles would occur if the collision course and speed difference were maintained. The higher the TTC, the safer the situation, and vice versa [38]. The TTC in trajectory $i$ at time $t$ can be calculated as follows:

$$TTC_{it}=\frac{g_{it}}{v^{\mathrm{f}}_{it}-v^{\mathrm{l}}_{it}},\quad v^{\mathrm{f}}_{it}>v^{\mathrm{l}}_{it}. \tag{2}$$
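
A minimal sketch of Equation (2) is given below. The function name is hypothetical, and returning NaN where the FAV is not closing in on the LV (so that the TTC is undefined) is a choice made for this example.

```python
import numpy as np


def time_to_collision(space_gap, speed_fav, speed_lv):
    """TTC per Equation (2); NaN where the FAV is not faster than the LV."""
    gap = np.asarray(space_gap, dtype=float)
    closing = np.asarray(speed_fav, dtype=float) - np.asarray(speed_lv, dtype=float)
    ttc = np.full(gap.shape, np.nan)
    valid = closing > 0                      # condition v_f > v_l
    ttc[valid] = gap[valid] / closing[valid]
    return ttc
```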

The mobility is measured by the time headway, defined by the time difference between consecutive arrival instants of two vehicles passing a certain detector site on the same lane [39]. The time headway is considered a direct measure of road capacity. A short time headway will increase road capacity and thus increase mobility, and vice versa [40]. The time headway in trajectory i𝑖iitalic_i at time t𝑡titalic_t can be calculated as follows:

$$\tau_{it} = \frac{h_{it}}{v^{\mathrm{f}}_{it}}. \tag{3}$$
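Eq. (3) can be sketched in the same way. The sketch assumes that $h_{it}$ is the spacing defined earlier in the paper; the function name `time_headway` and the `min_speed` guard against division by near-zero speeds are hypothetical additions.

```python
import numpy as np

def time_headway(spacing, v_follower, min_speed=0.1):
    """Eq. (3): tau = h / v_f.

    Samples with v_f below `min_speed` (m/s) return NaN; this threshold is
    an assumption of the sketch, not a value from the paper.
    """
    spacing = np.asarray(spacing, dtype=float)
    v_follower = np.asarray(v_follower, dtype=float)
    tau = np.full(spacing.shape, np.nan)
    moving = v_follower >= min_speed
    tau[moving] = spacing[moving] / v_follower[moving]
    return tau

# Illustrative values only: spacing in m, speed in m/s.
tau = time_headway(spacing=[30.0, 24.0, 5.0], v_follower=[15.0, 12.0, 0.0])
```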

The stability is measured by the squared error of acceleration at time $t$, i.e., the squared deviation of the acceleration from its mean over the trajectory. Larger variations in acceleration indicate a lack of smoothness, which implies discomfort and potential safety risks. The larger the squared error of acceleration, the lower the traffic stability, and vice versa. It is calculated as follows:

$$\alpha_{it} = \left(a_{it} - \frac{\sum_{t' \in \mathcal{T}_i} a_{it'}}{T_i}\right)^{2}. \tag{4}$$
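A short sketch of Eq. (4) for a single trajectory follows. It assumes that $T_i$ is the number of samples in the trajectory, so the subtracted term is the mean acceleration; the function name is hypothetical.

```python
import numpy as np

def acceleration_squared_deviation(acc):
    """Eq. (4): squared deviation of each acceleration sample from the
    trajectory mean (T_i is taken to be the number of samples)."""
    acc = np.asarray(acc, dtype=float)
    return (acc - acc.mean()) ** 2

# Illustrative accelerations in m/s^2 with zero mean.
alpha = acceleration_squared_deviation([0.0, 1.0, -1.0, 0.0])
```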

The sustainability is measured by the fuel consumption rate of the FAV. We use the average of four classical vehicle fuel consumption models: the Virginia Tech Microscopic (VT-Micro) model [41], the Microscopic Emission and Fuel consumption (MEF) model [42], the Vehicle Specific Power (VSP) model [43], and the Australian Road Research Board (ARRB) model [44, 45]. We use $F^{\mathrm{VTM}}_{it}$, $F^{\mathrm{MEF}}_{it}$, $F^{\mathrm{VSP}}_{it}$, and $F^{\mathrm{ARRB}}_{it}$ to denote the fuel consumption calculated by the four models in trajectory $i$ at time $t \in \mathcal{T}_i$, respectively. To simplify the notation, in equations (5)-(12), we use $v_{it}$ and $a_{it}$ to represent the velocity and acceleration of the vehicle in trajectory $i$ at time $t \in \mathcal{T}_i$. The four models are expressed as follows:

  • VT-Micro model:

    $$F^{\mathrm{VTM}}_{it} = \exp\left(\sum_{n_1=0}^{3}\sum_{n_2=0}^{3} K_{n_1 n_2}\left(v_{it}\right)^{n_1}\left(a_{it}\right)^{n_2}\right) \tag{5}$$

    where $n_1, n_2$ are the power indexes and $K_{n_1 n_2}$ are constant coefficients, which are available in Appendix A.

  • MEF model:

    $$F^{\mathrm{MEF}}_{it} = \exp\left(\sum_{n_1=0}^{3}\sum_{n_2=0}^{3} K_{n_1 n_2}\left(v_{it}\right)^{n_1}\left(\beta \cdot a_{it} + (1-\beta)\sum_{t'=1}^{T} a_{i(t-t')}/T\right)^{n_2}\right) \tag{6}$$

    where $\beta = 0.5$ is the acceleration impact factor, and $T = 9$ is the number of historical data points considered in the model.

  • VSP model:

    $$VSP_{it} = v_{it}\left(1.1 a_{it} + 9.81\delta + 0.132\right) + 3.02\mathrm{E}{-4} \cdot v_{it}^{3} \tag{7}$$
    $$F^{\mathrm{VSP}}_{it} = \begin{cases} f_1, & \text{if } VSP_{it} < -10 \\ f_2 VSP_{it}^{2} + f_3 VSP_{it} + f_4, & \text{if } -10 \leq VSP_{it} \leq 10 \\ f_5 VSP_{it} + f_6, & \text{if } VSP_{it} \geq 10 \end{cases} \tag{11}$$

    where $\delta$ denotes the road grade, which is set to 0 in this paper since we assume the road grade is negligible at most experiment sites, and $VSP_{it}$ is the vehicle specific power in trajectory $i$ at time $t \in \mathcal{T}_i$. The parameters are $f_1 = 2.48\mathrm{E}{-03}$, $f_2 = 1.98\mathrm{E}{-03}$, $f_3 = 3.97\mathrm{E}{-02}$, $f_4 = 2.01\mathrm{E}{-01}$, $f_5 = 7.93\mathrm{E}{-02}$, and $f_6 = 2.48\mathrm{E}{-03}$, where the notation 'E' represents exponentiation in scientific notation.

  • ARRB model:

    $$F^{\mathrm{ARRB}}_{it} = \gamma_1 + \gamma_2 v_{it} + \gamma_3 v_{it}^{2} + \gamma_4 v_{it}^{3} + \gamma_5 v_{it} \cdot a_{it} + \gamma_6 v_{it}\left(\max\left(0, a_{it}\right)^{2}\right) \tag{12}$$

    where the parameters are $\gamma_1 = 0.666$, $\gamma_2 = 0.019$, $\gamma_3 = 0.001$, $\gamma_4 = 0.0005$, $\gamma_5 = 0.122$, and $\gamma_6 = 0.793$.
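The models whose coefficients are stated in full above can be sketched directly. The code below evaluates Eqs. (7), (11), and (12) and the blended acceleration term inside Eq. (6). The VT-Micro and MEF coefficients $K_{n_1 n_2}$ are left out because they are given only in Appendix A. All function names are hypothetical, and treating $t$ as a sample index in the MEF term is an assumption of this sketch.

```python
import numpy as np

# Coefficients as listed in the text for Eqs. (11) and (12).
F_COEF = (2.48e-03, 1.98e-03, 3.97e-02, 2.01e-01, 7.93e-02, 2.48e-03)
GAMMA = (0.666, 0.019, 0.001, 0.0005, 0.122, 0.793)

def vehicle_specific_power(v, a, grade=0.0):
    """Eq. (7); the road grade delta is 0 in the paper."""
    v = np.asarray(v, dtype=float)
    a = np.asarray(a, dtype=float)
    return v * (1.1 * a + 9.81 * grade + 0.132) + 3.02e-4 * v ** 3

def fuel_vsp(v, a, grade=0.0):
    """Eq. (11): piecewise fuel rate in g/s.

    The text lists VSP = 10 under both the quadratic and the linear branch;
    this sketch assigns it to the linear branch (an arbitrary choice).
    """
    f1, f2, f3, f4, f5, f6 = F_COEF
    vsp = vehicle_specific_power(v, a, grade)
    quadratic = f2 * vsp ** 2 + f3 * vsp + f4
    linear = f5 * vsp + f6
    return np.where(vsp < -10, f1, np.where(vsp < 10, quadratic, linear))

def fuel_arrb(v, a):
    """Eq. (12): fuel rate in ml/s."""
    g1, g2, g3, g4, g5, g6 = GAMMA
    v = np.asarray(v, dtype=float)
    a = np.asarray(a, dtype=float)
    return (g1 + g2 * v + g3 * v ** 2 + g4 * v ** 3
            + g5 * v * a + g6 * v * np.maximum(0.0, a) ** 2)

def mef_composite_acceleration(acc, t, beta=0.5, T=9):
    """Inner acceleration term of Eq. (6): beta * a_t plus (1 - beta) times
    the mean of the previous T samples a_{t-1}, ..., a_{t-T}."""
    acc = np.asarray(acc, dtype=float)
    if t < T:
        raise ValueError("need at least T historical samples before index t")
    history = acc[t - T:t]
    return beta * acc[t] + (1.0 - beta) * history.sum() / T
```

For example, `fuel_arrb(10.0, 1.0)` evaluates Eq. (12) at 10 m/s and 1 m/s$^2$, and `fuel_vsp(0.0, 0.0)` returns the constant term $f_4$ because the VSP is zero at standstill.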

In equations (5)-(12), the units of $v_{it}$ and $a_{it}$ are m/s and m/s$^2$, respectively. The units of $F^{\mathrm{VTM}}_{it}$, $F^{\mathrm{MEF}}_{it}$, $F^{\mathrm{VSP}}_{it}$, and $F^{\mathrm{ARRB}}_{it}$ are L/s, L/s, g/s, and ml/s, respectively. To perform unit conversions, we assume a fuel density of 800 g/L (the density of gasoline at room temperature is approximately 720 to 775 g/L, and that of diesel is about 830 to 850 g/L). Therefore, the average fuel consumption rate, in L/s, is formulated as:

$$F^{\mathrm{all}}_{it} = \left(F^{\mathrm{VTM}}_{it} + F^{\mathrm{MEF}}_{it} + F^{\mathrm{VSP}}_{it}/800 + F^{\mathrm{ARRB}}_{it}/1000\right)/4 \tag{13}$$
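The unit handling in Eq. (13) can be made explicit in a few lines. The function and argument names are hypothetical, and the input values are chosen only so that each model contributes 0.001 L/s after conversion.

```python
FUEL_DENSITY_G_PER_L = 800.0  # fuel density assumed in the text

def average_fuel_rate(f_vtm_l_s, f_mef_l_s, f_vsp_g_s, f_arrb_ml_s):
    """Eq. (13): mean of the four model outputs after conversion to L/s."""
    return (f_vtm_l_s                                # already in L/s
            + f_mef_l_s                              # already in L/s
            + f_vsp_g_s / FUEL_DENSITY_G_PER_L       # g/s  -> L/s
            + f_arrb_ml_s / 1000.0) / 4.0            # ml/s -> L/s

# Illustrative inputs: 0.001 L/s, 0.001 L/s, 0.8 g/s, 1.0 ml/s.
f_all = average_fuel_rate(0.001, 0.001, 0.8, 1.0)
```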

Figure 5 displays the distributions of the four indicators across all datasets. The distribution of $TTC$ is predominantly concentrated toward the lower end of its range, with peak values below 50 s. Notably, the Waymo Motion Dataset and Argoverse 2 Motion Forecasting Dataset peak at the lowest values, around 10 s, and their probability density is concentrated in the lower TTC range. Given that a smaller TTC indicates greater risk, this distribution suggests a higher risk associated with AVs in complex traffic environments.

The distribution of $\tau$ is primarily concentrated between 1 and 5 s. The CATS Platoon Dataset shows multiple peaks because it was tested with four levels of time headway settings. In contrast, the CATS UW Dataset exhibits a larger $\tau$, indicating poorer mobility of its AV's car-following model compared with the smaller $\tau$ observed in the OpenACC Database and Vanderbilt ACC Dataset. This difference may also be caused by the different experimental settings: the CATS UW Dataset was tested in a low-speed environment with a minimum safety distance, leading to a larger $\tau$, while the other two datasets were tested at higher speeds.

Regarding the distribution of $F$, some datasets, including the CATS UW Dataset, OpenACC ZalaZone Dataset, Waymo Motion Dataset, and Argoverse 2 Motion Forecasting Dataset, predominantly have $F$ below 0.001 L/s. Conversely, the Vanderbilt ACC Dataset, OpenACC Casale Dataset, and OpenACC Vicolungo Dataset exhibit higher $F$, averaging over 0.005 L/s. The variance in $F$ across these datasets can be explained by the positive correlation between energy consumption and speed. According to Table 6, the datasets with lower $F$ are those with the lowest average speeds, while the datasets with higher $F$ have the highest average speeds. Differences in the energy efficiency of the vehicles' car-following models across the datasets also influence $F$. This result reflects the distinct energy consumption characteristics of the vehicles in each dataset.

Lastly, the distribution of $\alpha$ is centered near zero for most datasets, with the probability density decreasing as $\alpha$ increases. Among them, the CATS ACC Dataset shows fluctuations in its distribution. The reason is that the acceleration in the original data is recorded to only one decimal place, so $\alpha$ can take only a limited set of values and its distribution is concentrated in a few areas. Additionally, the CATS Platoon Dataset, OpenACC Vicolungo Dataset, and Waymo Perception Dataset exhibit slightly higher densities at larger $\alpha$, indicating relatively poorer vehicle stability in these datasets.
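The effect of limited acceleration precision can be reproduced on synthetic data. In the sketch below, the normally distributed acceleration signal is an assumption made purely for illustration; it is not drawn from any of the datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
acc = rng.normal(0.0, 0.5, size=1000)   # synthetic accelerations, m/s^2
acc_rounded = np.round(acc, 1)          # keep one decimal place only

# Eq. (4) applied to the full-precision and the rounded signal.
alpha_full = (acc - acc.mean()) ** 2
alpha_rounded = (acc_rounded - acc_rounded.mean()) ** 2

# Number of distinct alpha values in each case.
n_full = np.unique(alpha_full).size
n_rounded = np.unique(alpha_rounded).size
```

After rounding, $\alpha$ can take at most as many distinct values as the rounded acceleration does, so a histogram of `alpha_rounded` shows isolated spikes rather than a smooth curve.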

Figure 5: Distribution of performance metrics of safety, mobility, stability, and sustainability.

Overall, the results presented in Figure 5 are consistent with the literature. For example, $TTC$ in most datasets is concentrated below 50 seconds, and the time headway $\tau$ is mainly between 1 and 2 seconds. This demonstrates that the Ultra-AV dataset can be used for research on AV behavior analysis. The four metrics used in this paper can be used to evaluate trajectory datasets collected in the future and can also serve as standards for the development of car-following models. Additionally, we observe the following relationships between metrics in Figure 5: 1. $\tau$ and $TTC$ show a negative correlation. 2. $\tau$ and $F$ show a negative correlation. 3. $\tau$ and $\alpha$ show a negative correlation. This indicates that improving some metrics might lead to worse outcomes in others. For example, increasing mobility by reducing $\tau$ might reduce the spatial gap between vehicles, thereby reducing $TTC$ and compromising vehicle safety. A future research direction is to develop models that balance these four metrics to achieve an overall optimum.

Car-following Model Development

Although the trajectory data from FAVs offer valuable insights into their performance, these data may stem from a limited set of conditions, so the derived performance metrics may not reflect the full range of driving scenarios. Developing accurate and robust car-following models for simulation across a broader range of scenarios is therefore advantageous. Researchers have developed various car-following models, including the linear ACC model [46], the nonlinear intelligent driver model [47], and data-driven models [48]. Regardless of the model structure, they all adopt $a^{\mathrm{f}}_{it}$ as the output and $d_{it}$, $v^{\mathrm{f}}_{it}$, and $\Delta v_{it}$ as the inputs. Thus, we analyze the relationship between the output acceleration and the three input variables with scatter plots and correlation analysis.

Figure 6 displays the relationship between the output acceleration and the three input variables. To depict the relationships among the variables more clearly, a moving average with a window length of three was applied to smooth the $a^{\mathrm{f}}_{it}$ data before plotting. In Figure 6, a notably nonlinear positive correlation between $a^{\mathrm{f}}_{it}$ and $d_{it}$ is evident, particularly in the datasets from the OpenACC Database, where it can be characterized by a logarithmic curve. In addition, most datasets show a linear positive correlation between $a^{\mathrm{f}}_{it}$ and $\Delta v_{it}$. In contrast, there is no clear relationship between $a^{\mathrm{f}}_{it}$ and $v^{\mathrm{f}}_{it}$. This is because $a^{\mathrm{f}}_{it}$ is influenced by the three input variables collectively and cannot be directly reflected by a single variable. Therefore, a deeper analysis of the relationships among variables in car-following behavior is necessary, as well as the development of specific car-following models. Moreover, car-following models may vary across vehicle types, so more detailed studies should analyze data from the same vehicle. This paper does not delve into specific models, but researchers can refer to recent reviews in this field [49, 50] to conduct studies using the provided data.
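The smoothing step mentioned above, a moving average with a window length of three, can be sketched as follows. The function name and the edge handling are assumptions: `mode="valid"` drops the first and last sample, and the paper does not state how the edges were treated.

```python
import numpy as np

def moving_average(x, window=3):
    """Unweighted moving average; the output has len(x) - window + 1 samples."""
    x = np.asarray(x, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")

# A noisy, alternating acceleration signal (illustrative values, m/s^2).
smoothed = moving_average([0.0, 3.0, 0.0, 3.0, 0.0])
```

Each output sample is the mean of three consecutive inputs, which damps sample-to-sample noise before the scatter plots are drawn.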

Figure 6: The scatter plots between $a^{\mathrm{f}}$ and $d$, $v^{\mathrm{f}}$, and $\Delta v$. The y-axis represents $a^{\mathrm{f}}$, and the x-axes in green, yellow, and red represent $d$, $v^{\mathrm{f}}$, and $\Delta v$, respectively.

Table 7: Correlation coefficients between $a^{\mathrm{f}}$ and $d$, $v^{\mathrm{f}}$, and $\Delta v$.
Dataset    Pearson                       Spearman
           d         v^f       Δv        d         v^f       Δv
1          0.3954    0.0040    0.6606    0.4987    0.0323    0.6518
2          0.1557   -0.1246    0.5803    0.1828   -0.1005    0.5644
3          0.1215   -0.1255    0.7076    0.1609   -0.1136    0.7317
4          0.0027   -0.1708    0.0163    0.0160   -0.1732    0.0346
5          0.3373   -0.0007    0.3512    0.3450   -0.0264    0.3577
6          0.2193   -0.0352    0.4052    0.2474   -0.0673    0.4838
7          0.2265   -0.0062    0.6369    0.3583   -0.0550    0.5575
8          0.1377   -0.0124    0.5767    0.1559    0.0049    0.5798
9          0.1289   -0.0412    0.4792    0.1859   -0.1359    0.5228
10         0.1009    0.0092    0.2333    0.0932   -0.0728    0.2263
11        -0.0763   -0.0925    0.5466   -0.0835   -0.0941    0.5660
12         0.0102   -0.0219    0.1080    0.0210   -0.0150    0.1061
13         0.0043   -0.0647    0.2764    0.0446   -0.0746    0.3087

The relationship between $a^{\mathrm{f}}$ and the three variables can also be revealed with correlation coefficients. Table 7 shows the Pearson and Spearman correlation coefficients between $a^{\mathrm{f}}$ and $d$, $v^{\mathrm{f}}$, and $\Delta v$ for all datasets. According to Table 7, $a^{\mathrm{f}}$ generally has a positive correlation with $d$ and $\Delta v$, and a negative correlation with $v^{\mathrm{f}}$. This result is consistent with practical experience: when $d$ is large, the FAV accelerates to close the gap; when $v^{\mathrm{f}}$ is large, the FAV slows down to stabilize at a following speed; and when $\Delta v$ is positive, the FAV accelerates to match the LV speed. The correlation coefficient for $d$ generally ranges from 0 to 0.4, indicating a weak correlation. For $v^{\mathrm{f}}$, the coefficient generally ranges from -0.2 to 0, indicating almost no correlation. For $\Delta v$, the coefficient exceeds 0.5 in roughly half of the datasets, indicating a comparatively strong correlation. Since the Pearson and Spearman correlation coefficients only reflect linear and monotonic relationships between variables, respectively, we suppose that the influence of $v^{\mathrm{f}}$ on $a^{\mathrm{f}}$ in the car-following model is nonlinear. This insight inspires us to develop piecewise linear or nonlinear car-following models.
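The remark that near-zero coefficients do not rule out a nonlinear dependence can be checked on synthetic data. The sketch below uses NumPy only and computes the Spearman coefficient as the Pearson coefficient of the ranks, which is valid here because the data contain no ties. The two test functions and the small 0.001 offset, which only serves to avoid tied ranks, are assumptions for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Spearman coefficient as the Pearson correlation of the ranks
    (no tie correction, so the inputs must not contain ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    return pearson(rank_x, rank_y)

x = np.linspace(-1.0, 1.0, 201)
monotonic = np.exp(3.0 * x)        # nonlinear but monotonic in x
u_shaped = (x - 0.001) ** 2        # fully determined by x, but not monotonic

r_mono, rho_mono = pearson(x, monotonic), spearman(x, monotonic)
r_u, rho_u = pearson(x, u_shaped), spearman(x, u_shaped)
```

For the monotonic curve the Spearman coefficient is essentially 1 while the Pearson coefficient is noticeably lower. For the U-shaped curve both coefficients are close to zero even though the second variable is an exact function of the first.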

From the analysis of the scatter plots and correlation analysis, it is clear that the Ultra-AV dataset reflects certain relationships between acceleration and other variables. The results show that the relationships depicted in the dataset are similar to those discussed in the literature and experience, validating that our dataset can be utilized for the development of car-following models. The analysis also indicates that there is a certain nonlinearity in the relationships between variables and acceleration, particularly with $v^{\mathrm{f}}$. This inspires researchers to consider the nonlinear relationship between $v^{\mathrm{f}}$ and $a^{\mathrm{f}}$ in future model development.

Usage Notes

In this study, we reviewed a series of AV trajectory datasets and extracted the Ultra-AV dataset. This dataset includes two sub-datasets: the longitudinal trajectory dataset and the car-following trajectory dataset, corresponding to data after Step 2 general data cleaning, and Step 3 data-specific cleaning, respectively.

In contrast to the existing AV perception datasets discussed in the literature, this dataset offers AV trajectory data that support the analysis of microscopic longitudinal AV behaviors. The few existing trajectory datasets are stored in their own formats and often fall short in refinement, reliability, and completeness, which limits their widespread use and comparison. Therefore, this paper reviewed most of the available trajectory datasets, transformed them into a unified format through data processing, and performed cleaning and refinement.

Moreover, we analyzed the data using multiple methods to validate its reliability. We first analyzed the impact of the data collection methods. We recommend researchers integrate data from multiple sensors such as GPS, LiDAR, and cameras, to enhance the precision of trajectory data and preserve essential information for AV behavior analysis. Secondly, we summarized the performance of each dataset in safety, mobility, sustainability, and stability under four metrics widely used in the literature. These four metrics can be utilized to evaluate the trajectory datasets collected in the future. The results show that there are some relationships between different metrics, and the trade-offs between metrics need to be considered during model design. Finally, we analyzed the relationships between variables in the car-following model through correlation coefficients and scatter plots, suggesting that researchers focus on the nonlinear impact of FAV speed on acceleration in model development.

In this study, we focus on the longitudinal behaviors of AVs and do not consider lateral behaviors such as lane-changing. This is because trajectory datasets that include AV lateral behaviors are scarce, and the corresponding analysis and processing are more complex. However, we note that some trajectory datasets, such as the Central Ohio ACC Dataset, Waymo Open Dataset, and Argoverse 2 Motion Forecasting Dataset, contain lane-changing behaviors. To fully understand AV behaviors, it is essential to consider both longitudinal and lateral behaviors. Consequently, developing an AV behavior model that balances the four metrics is important. Moreover, using the results of such behavior models to improve AVs would benefit the transportation system.

Code availability

The Ultra-AV dataset and the code for data extraction and analysis have been documented and made accessible on https://github.com/CATS-Lab/Filed-Experiment-Data-Unified-AV-Trajectory-Dataset. The detailed contents include:

  • Readme.md: A general description of the raw data for each dataset.

  • Main.py: The main function calls data processing and analysis functions for each dataset.

  • trajectory_extraction.py: Code used in Step 1 to extract AV longitudinal trajectories.

  • data_transformation.py: Code used in Step 1 to convert all datasets to a unified format.

  • data_cleaning.py: Code used in Steps 2 and 3 for data cleaning.

  • data_analysis.py: Code used to analyze data statistics, plot traffic performance of datasets, and plot scatter plots.

  • model_calibration.py: An example tool to use the processed data to calibrate a linear car-following model.
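To illustrate what calibrating a linear car-following model can look like, the following sketch fits a generic model $a^{\mathrm{f}} = \beta_0 + \beta_1 d + \beta_2 v^{\mathrm{f}} + \beta_3 \Delta v$ by ordinary least squares on synthetic data. It does not reproduce the contents of model_calibration.py. The functional form, the function name `calibrate_linear_model`, and the synthetic coefficients are all assumptions made for illustration.

```python
import numpy as np

def calibrate_linear_model(d, v_f, dv, a_f):
    """Least-squares fit of a_f = b0 + b1*d + b2*v_f + b3*dv.

    Returns the coefficient vector [b0, b1, b2, b3].
    """
    X = np.column_stack([np.ones_like(d), d, v_f, dv])
    coef, *_ = np.linalg.lstsq(X, a_f, rcond=None)
    return coef

# Synthetic car-following samples (hypothetical ranges and coefficients).
rng = np.random.default_rng(42)
n = 500
d = rng.uniform(5.0, 60.0, n)       # spacing-type input, m
v_f = rng.uniform(0.0, 30.0, n)     # follower speed, m/s
dv = rng.normal(0.0, 2.0, n)        # speed difference, m/s
true_coef = np.array([-0.2, 0.05, -0.08, 0.6])
a_f = (true_coef[0] + true_coef[1] * d + true_coef[2] * v_f
       + true_coef[3] * dv + rng.normal(0.0, 0.05, n))

coef = calibrate_linear_model(d, v_f, dv, a_f)
```

On real trajectories, the synthetic arrays would be replaced by the corresponding columns of the car-following trajectory dataset, and the fit would typically be done per dataset or per vehicle.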

We also recommend using other software packages such as R to effectively analyze the trajectory data. These tools are well-suited for handling the dataset’s format.

Data usage is restricted to research purposes only. Any commercial exploitation of the data requires separate approval and possibly additional agreements.

References

  • [1] Calvert, S. et al. Traffic flow of connected and automated vehicles: Challenges and opportunities, 10.1007/978-3-319-60934-8_19 (2018).
  • [2] Jin, W. L. On the equivalence between continuum and car-following models of traffic flow. Transportation Research Part B: Methodological 93, 543–559, 10.1016/j.trb.2016.08.007 (2016).
  • [3] Kerner, B. S. Physics of automated driving in framework of three-phase traffic theory, 10.1103/PhysRevE.97.042303 (2018).
  • [4] Jiang, R. et al. On some experimental features of car-following behavior and how to model them. Transportation Research Part B: Methodological 80, 338–354, 10.1016/j.trb.2015.08.003 (2015).
  • [5] Qu, X., Zhang, J. & Wang, S. On the stochastic fundamental diagram for freeway traffic: Model development, analytical properties, validation, and extensive applications. Transportation Research Part B: Methodological 104, 256–271, 10.1016/j.trb.2017.07.003 (2017).
  • [6] Chen, X. et al. Follownet: A comprehensive benchmark for car-following behavior modeling. Scientific Data 10, 828 (2023).
  • [7] Axelsson, J. Safety in vehicle platooning: A systematic literature review, 10.1109/TITS.2016.2598873 (2017).
  • [8] Wang, M. et al. Delay-compensating strategy to enhance string stability of adaptive cruise controlled vehicles. \JournalTitlehttps://doi.org/10.1080/21680566.2016.1266973 6, 211–229, 10.1080/21680566.2016.1266973 (2016).
  • [9] Ubiergo, G. A. & Jin, W.-L. Mobility and environment improvement of signalized networks through vehicle-to-infrastructure (v2i) communications. \JournalTitleTransportation Research Part C: Emerging Technologies 68, 70–82, https://doi.org/10.1016/j.trc.2016.03.010 (2016).
  • [10] Thiemann, C., Treiber, M. & Kesting, A. Longitudinal hopping in intervehicle communication: Theory and simulations on modeled and empirical trajectory data, 10.1103/PhysRevE.78.036102 (2008).
  • [11] Feng, S., Yan, X., Sun, H., Feng, Y. & Liu, H. X. Intelligent driving intelligence test for autonomous vehicles with naturalistic and adversarial environment. \JournalTitleNature Communications 12, 748 (2021).
  • [12] Shi, X. & Li, X. Empirical study on car-following characteristics of commercial automated vehicles with different headway settings. \JournalTitleTransportation Research Part C: Emerging Technologies 128, 10.1016/j.trc.2021.103134 (2021).
  • [13] Feng, S. et al. Dense reinforcement learning for safety validation of autonomous vehicles. \JournalTitleNature 615, 620–627 (2023).
  • [14] Yu, F. et al. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognitionand pattern recognition, 2636–2645 (2020).
  • [15] Chang, M.-F. et al. Argoverse: 3d tracking and forecasting with rich maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8748–8757 (2019).
  • [16] Wilson, B. et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. \JournalTitlearXiv preprint arXiv:2301.00493 (2023).
  • [17] Sun, P. et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2446–2454 (2020).
  • [18] Geiger, A., Lenz, P., Stiller, C. & Urtasun, R. Vision meets robotics: The kitti dataset. \JournalTitleThe International Journal of Robotics Research 32, 1231–1237 (2013).
  • [19] Caesar, H. et al. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11621–11631 (2020).
  • [20] Mao, J. et al. One million scenes for autonomous driving: Once dataset. \JournalTitlearXiv preprint arXiv:2106.11037 (2021).
  • [21] Alibeigi, M. et al. Zenseact open dataset: A large-scale and diverse multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 20178–20188 (2023).
  • [22] Zhou, H., Ma, K. & Li, X. A review on trajectory datasets on advanced driver assistance system. \JournalTitlearXiv preprint arXiv:2402.05009 (2024).
  • [23] Kesting, A., Treiber, M. & Helbing, D. General lane-changing model mobil for car-following models. \JournalTitleTransportation Research Record 1999, 86–94 (2007).
  • [24] Deng, J. et al. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255 (Ieee, 2009).
  • [25] Punzo, V., Borzacchiello, M. T. & Ciuffo, B. On the assessment of vehicle trajectory data accuracy and application to the next generation simulation (ngsim) program data. \JournalTitleTransportation Research Part C: Emerging Technologies 19, 1243–1262 (2011).
  • [26] Wang, Y., Gunter, G., Nice, M. & Work, D. B. Estimating adaptive cruise control model parameters from on-board radar units. \JournalTitlearXiv preprint arXiv:1911.06454 (2019).
  • [27] Shi, X. & Li, X. Empirical study on car-following characteristics of commercial automated vehicles with different headway settings. \JournalTitleTransportation Research Part C: Emerging Technologies 128, 103134 (2021).
  • [28] Makridis, M., Mattas, K., Anesiadou, A. & Ciuffo, B. Openacc. an open database of car-following experiments to study the properties of commercial acc systems. \JournalTitleTransportation Research Part C: Emerging Technologies 125, 103047 (2021).
  • [29] Xia, X. et al. An automated driving systems data acquisition and analytics platform. \JournalTitleTransportation Research Part C: Emerging Technologies 151, 104120 (2023).
  • [30] Hu, X., Zheng, Z., Chen, D., Zhang, X. & Sun, J. Processing, assessing, and enhancing the waymo autonomous vehicle open dataset for driving behavior research. \JournalTitleTransportation Research Part C: Emerging Technologies 134, 103490 (2022).
  • [31] Xu, X., Zheng, Z., Hu, Z., Feng, K. & Ma, W. A unified dataset for the city-scale traffic assignment model in 20 us cities. \JournalTitleScientific Data 11, 325 (2024).
  • [32] Ibiknle, D. Average car sizes & dimensions (2023). Accessed: 2023-05-06.
  • [33] National Association of City Transportation Officials (NACTO). Lane width - urban street design guide (2023). Accessed: 2023-05-06.
  • [34] DriveSafe Online. Safe following distance: Follow the 3 second rule (2020). Accessed: 2024-05-06.
  • [35] Liu, T., Fu, R. et al. The relationship between different safety indicators in car-following situations. In 2018 IEEE Intelligent Vehicles Symposium (IV), 1515–1520 (IEEE, 2018).
  • [36] Mai, M., Wang, L. & Prokop, G. Advancement of the car following model of wiedemann on lower velocity ranges for urban traffic simulation. \JournalTitleTransportation Research Part F: Traffic Psychology and Behaviour 61, 30–37 (2019).
  • [37] Alotibi, F. & Abdelhakim, M. Anomaly detection for cooperative adaptive cruise control in autonomous vehicles using statistical learning and kinematic model. \JournalTitleIEEE Transactions on Intelligent Transportation Systems 22, 3468–3478 (2020).
  • [38] Minderhoud, M. M. & Bovy, P. H. Extended time-to-collision measures for road traffic safety assessment. \JournalTitleAccident Analysis & Prevention 33, 89–97 (2001).
  • [39] Ha, D.-H., Aron, M. & Cohen, S. Time headway variable and probabilistic modeling. \JournalTitleTransportation Research Part C: Emerging Technologies 25, 181–201 (2012).
  • [40] Li, X. Trade-off between safety, mobility and stability in automated vehicle following control: An analytical method. \JournalTitleTransportation Research Part B: Methodological 166, 1–18 (2022).
  • [41] Zegeye, S., De Schutter, B., Hellendoorn, J., Breunesse, E. & Hegyi, A. Integrated macroscopic traffic flow, emission, and fuel consumption model for control purposes. \JournalTitleTransportation Research Part C: Emerging Technologies 31, 158–171 (2013).
  • [42] Lei, W., Chen, H. & Lu, L. Microscopic emission and fuel consumption modeling for light-duty vehicles using portable emission measurement system data. \JournalTitleWorld Academy of Science, Engineering and Technology 66, 918–925 (2010).
  • [43] Duarte, G. O., Gonçalves, G. A., Baptista, P. C. & Farias, T. L. Establishing bonds between vehicle certification data and real-world vehicle fuel consumption–a vehicle specific power approach. \JournalTitleEnergy Conversion and Management 92, 251–265 (2015).
  • [44] Akcelik, R. Efficiency and drag in the power-based model of fuel consumption. \JournalTitleTransportation Research Part B: Methodological 23, 376–385 (1989).
  • [45] Knoop, V. L. et al. Platoon of sae level-2 automated vehicles on public roads: Setup, traffic interactions, and stability. \JournalTitleTransportation Research Record 2673, 311–322 (2019).
  • [46] Ma, K. et al. String stability of automated vehicles based on experimental analysis of feedback delay and parasitic lag. \JournalTitleTransportation Research Part C: Emerging Technologies 145, 103927 (2022).
  • [47] Treiber, M., Hennecke, A. & Helbing, D. Congested traffic states in empirical observations and microscopic simulations. \JournalTitlePhysical Review E 62, 1805 (2000).
  • [48] Zhu, M., Wang, X. & Wang, Y. Human-like autonomous car-following model with deep reinforcement learning. \JournalTitleTransportation Research Part C: Emerging Technologies 97, 348–368 (2018).
  • [49] Brackstone, M. & McDonald, M. Car-following: a historical review. \JournalTitleTransportation Research Part F: Traffic Psychology and Behaviour 2, 181–196 (1999).
  • [50] Wang, Z., Shi, Y., Tong, W., Gu, Z. & Cheng, Q. Car-following models for human-driven vehicles and autonomous vehicles: A systematic review. \JournalTitleJournal of Transportation Engineering, Part A: Systems 149, 04023075 (2023).

Acknowledgements

This research is sponsored by the National Science Foundation, USA, through Grants CMMI #1932452 and CMMI #2343167.

Author contributions statement

H.Z. conducted the experiments and analyzed the results; K.M. conceived the experiment; S.L. conducted the experiments; X.L. and X.Q. revised the paper. All authors reviewed the manuscript.

Competing interests

The authors declare no competing interests.

Appendix A Coefficients of the VT-Micro model and MEF model

Table 8: Coefficients of the VT-Micro model and MEF model.
$K_{n_1 n_2}$   $n_2=0$   $n_2=1$   $n_2=2$   $n_2=3$
$n_1=0$   -7.537   0.4438   0.1716   -0.0420
$n_1=1$   0.0973   0.0518   0.0029   0.0071
$n_1=2$   -0.003   -7.42E-04   1.09E-04   1.16E-04
$n_1=3$   5.3E-05   6E-06   -1E-05   -6E-06
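
For readers who want to evaluate these coefficients numerically, the sketch below assumes the standard VT-Micro form, in which the logarithm of the instantaneous rate is a bivariate cubic polynomial in speed and acceleration. It further assumes that $n_1$ is the exponent of speed and $n_2$ the exponent of acceleration; this index order, the function name `vt_micro_rate`, and the input units are assumptions and should be checked against the model definition in the main text.

```python
import math

# Coefficients K[n1][n2] copied from Table 8
# (rows: n1 = 0..3, columns: n2 = 0..3).
K = [
    [-7.537, 0.4438, 0.1716, -0.0420],
    [0.0973, 0.0518, 0.0029, 0.0071],
    [-0.003, -7.42e-04, 1.09e-04, 1.16e-04],
    [5.3e-05, 6e-06, -1e-05, -6e-06],
]

def vt_micro_rate(speed, accel, coeffs=K):
    """Evaluate exp(sum over n1, n2 of K[n1][n2] * speed**n1 * accel**n2).
    Assumes n1 indexes the speed power and n2 the acceleration power."""
    log_rate = sum(
        coeffs[n1][n2] * speed ** n1 * accel ** n2
        for n1 in range(4)
        for n2 in range(4)
    )
    return math.exp(log_rate)

# Rate at standstill: only the K[0][0] term contributes.
idle_rate = vt_micro_rate(0.0, 0.0)
```

Applying `vt_micro_rate` to each time step of a trajectory and integrating over time would give a per-trip total, under the same assumptions.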