New York University, USA. Email: cfeng@nyu.edu
Shanghai AI laboratory, China.
Self-Localized Collaborative Perception
Abstract
Collaborative perception has garnered considerable attention due to its capacity to address several inherent challenges in single-agent perception, including occlusion and out-of-range issues. However, existing collaborative perception systems rely heavily on precise localization systems to establish a consistent spatial coordinate system between agents. This reliance makes them susceptible to large pose errors or malicious attacks, resulting in substantial reductions in perception performance. To address this, we propose CoBEVGlue, a novel self-localized collaborative perception system, which achieves more holistic and robust collaboration without using an external localization system. The core of CoBEVGlue is a novel spatial alignment module, BEVGlue, which provides the relative poses between agents by effectively matching co-visible objects across agents. We validate our method on both real-world and simulated datasets. The results show that i) CoBEVGlue achieves state-of-the-art detection performance under arbitrary localization noises and attacks; and ii) BEVGlue can seamlessly integrate with a majority of previous methods, enhancing their performance. Code is available at https://github.com/VincentNi0107/CoBEVGlue.
Keywords: Collaborative perception, Bird’s eye view, Autonomous driving
1 Introduction
Accurate perception is essential for the navigation and safety of autonomous vehicles [32, 40]. Despite advancements facilitated by large-scale datasets [8, 51], and powerful models [27, 66], single-agent perception is inherently limited by occlusions and long-range issues [56], which could lead to catastrophic consequences [72]. Leveraging modern communication technologies, current research in collaborative perception [30, 21, 61, 56] enables the sharing of perceptual information among multiple agents, fundamentally improving the perception performance. Fueled by the advent of high-quality datasets [62, 68, 60, 29] and innovative collaborative techniques [34, 35, 75, 2, 59, 57], collaborative perception systems have the potential to improve the safety of transportation networks significantly.
In this emerging field of collaborative perception, most prevailing works [56, 30, 21] make an oversimplified assumption: the global localization system, typically GPS or SLAM, employed by each agent is precise enough to establish a consistent spatial coordinate system for collaboration. However, snapshots from the real-world collaborative perception datasets V2V4Real [60] and DAIR-V2X [68] show that the ground-truth localization is still noisy even after meticulous and resource-intensive offline calibration, as shown in Fig. 1(a)(b). These inaccuracies could be considerably worse in real-world applications under computing limits and real-time constraints. Moreover, localization systems are susceptible to long-standing yet still unsolved attacks [42, 41, 71, 50, 24, 67]. These attacks allow adversaries to manipulate positions at will, further undermining the reliability of localization systems. Such prevalent challenges of significant noise and malicious attacks starkly contrast with the ideal scenarios considered by earlier works [53, 69, 61, 36], which primarily focus on minor pose inaccuracies and fail to surpass the no-collaboration baseline under large noise; see Fig. 1(c).
To eliminate the dependence on potentially unreliable external localization systems, a direct solution is to deduce the relative poses of collaborative agents by point cloud registration, a technique extensively utilized in multi-agent collaborative systems [26, 73, 10, 9, 74]. Point cloud registration methods [4, 63, 1, 22] apply nearest-neighbor algorithms to identify correspondences across extensive 3D point sets, followed by robust techniques [16, 63] to calculate the transformation from these putative correspondences. Although these methods prove effective for latency-tolerant applications such as collaborative mapping [26], the real-time transmission of large volumes of 3D data is impractical for bandwidth-limited collaborative perception systems [23, 21, 11, 60]. Therefore, there remains a pronounced gap: a system that is free from localization errors while also maintaining communication efficiency for practical applications.
To fill this gap, we propose CoBEVGlue, a self-localized collaborative perception system designed for multiple agents to achieve more holistic perception without relying on external localization systems, while keeping communication costs low. CoBEVGlue follows the pipeline of previous collaborative perception systems [36, 21] and uses its key spatial alignment module, BEVGlue, to estimate the relative pose between agents from the objects detected and tracked by each agent. The core idea behind BEVGlue is to search for the co-visible objects from the bird’s eye view perceptual data across agents and calculate the relative transformation with these co-visible objects, ensuring a consistent spatial coordinate system for collaboration. BEVGlue includes three key components: i) object graph modeling, which converts each agent’s observations to an object graph with rich information, including object shape, heading, tracking ID and the invariant spatial relationship between objects; ii) temporally consistent maximum common subgraph detection, which efficiently harnesses spatial and temporal data within object graphs to detect the largest common subgraph, following a strict spatial isomorphism constraint and temporal consistency; and iii) relative pose calculation, which computes the pose relationships between agents using the detected common subgraph, without using time-consuming outlier rejection algorithms.
The proposed system offers three significant advantages: i) it operates independently of external localization devices, making it resilient to noise and malicious attacks; ii) it brings minor communication overhead, since BEVGlue only uses object bounding boxes with tracking IDs to estimate the relative pose between agents; iii) its core module, BEVGlue, ensures high-quality matching results by enforcing a strict spatial isomorphism constraint on the detected common subgraph and temporal consistency between matching results across time.
To evaluate the effectiveness of the proposed method, we consider the collaborative 3D object detection task on three datasets: OPV2V [62], DAIR-V2X [68] and V2V4Real [60], covering both simulation and real-world scenarios. The results show that the BEVGlue-empowered robust collaborative perception system performs comparably to systems relying on precise localization information, and achieves state-of-the-art detection performance when localization noise and attacks exist.
In summary, the main contributions of this work are:
• We propose CoBEVGlue, the first self-localized collaborative perception system without relying on external localization devices;
• We propose BEVGlue, a novel spatial alignment method that estimates the relative poses between agents through matching co-visible objects;
• We conduct extensive experiments for collaborative LiDAR object detection on simulated and real-world datasets. The results show that i) CoBEVGlue attains state-of-the-art detection performance in the presence of localization noise; and ii) BEVGlue can seamlessly integrate with a majority of previous methods, enhancing their performance.
2 Related work
2.1 Collaborative Perception
As a recent application of multi-agent systems to perception tasks, collaborative perception is emerging [62, 56, 30, 21]. To support this area of research, there has been a surge of high-quality datasets, including V2X-Sim [29], OPV2V [62], and DAIR-V2X [68]. Based on those datasets, numerous methods have been proposed to handle various practical issues, such as communication latency [28] and communication bandwidth [21]. In this work, we specifically consider robustness to localization error and attack.
To gain resistance to localization noise, previous works consider two main approaches: learning-based and matching-based. Learning-based methods aim to construct robust network architectures to reduce the impact of pose errors. For example, V2VNet (robust) [53] designs pose regression, global consistency and attention aggregation modules to correct relative poses and concentrate on neighbors with less pose error; V2X-ViT [61] uses multi-scale window attention to capture features in various ranges. Matching-based approaches, on the other hand, explicitly correct the relative poses by matching observations across agents. Examples include FPV-RCNN [69] and CoAlign [36], which estimate relative poses between agents using an IoU-based matching strategy. However, they can only rectify minor inaccuracies in external localization, since they rely on a roughly correct initial relative pose. Their performance drops significantly when the noise is large or an attack exists. In contrast, our work considers collaborative perception independent of external localization systems.
2.2 Point Cloud Registration
Although the ultimate aim of this paper is to enhance detection capabilities, advancements in point cloud registration methods have inspired us to propose our novel self-localized collaborative perception system. Traditional point cloud registration methods focusing on refining the Iterative Closest Point (ICP) algorithm [4] and its variants [7, 17, 19, 49] have led to improvements in convergence and noise resilience. Recent typical point cloud registration workflows consist of extracting local 3D feature descriptors and conducting registration. For extracting 3D local descriptors, conventional approaches like Fast Point Feature Histograms [46, 25, 47, 52] utilize hand-crafted features. More recent techniques [44, 43, 1, 12, 70, 31] adopt learning-based methods for this purpose. In terms of registration, traditional approaches often employ nearest-neighbor algorithms for matching and robust optimization for outlier rejection [16, 63], whereas contemporary deep registration methods [22, 65, 64] leverage self-attention mechanisms [54] for correspondence determination. SGAligner [48] pioneers the use of a pre-constructed 3D scene graph for registration. Nevertheless, similar to preceding strategies, it requires the transmission of dense point clouds and high-dimensional features. These methods are widely applied in latency-tolerant multi-agent systems such as collaborative mapping [26] and 3D scene graph generation [10]. However, the collaborative object detection task requires precise relative pose estimation in real time, and V2X networks struggle to transmit the dense point clouds and features required by point cloud registration methods in real time. To overcome this limitation, our approach prioritizes object-level registration, representing each object with just eight float numbers. This markedly reduces the bandwidth and computation cost of calculating relative poses among collaborative autonomous vehicles, thus efficiently resolving the transmission dilemma.
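To make the bandwidth argument concrete, the following back-of-the-envelope sketch compares an object-level message with a raw point cloud. Only the figure of eight floats per object comes from the text; the 32-bit float size, the four floats per LiDAR point, and the counts of 20 objects and 100,000 points are illustrative assumptions rather than values reported in this paper.

```python
FLOAT32_BYTES = 4  # assumption: values are sent as 32-bit floats


def object_message_bytes(num_objects, floats_per_object=8):
    """Size of an object-level message (eight floats per object, as in the text)."""
    return num_objects * floats_per_object * FLOAT32_BYTES


def point_cloud_bytes(num_points, floats_per_point=4):
    """Size of a raw point cloud; 4 floats per point (x, y, z, intensity) is assumed."""
    return num_points * floats_per_point * FLOAT32_BYTES


# Hypothetical scene: 20 detected objects versus a 100,000-point LiDAR sweep.
boxes = object_message_bytes(20)      # 640 bytes
cloud = point_cloud_bytes(100_000)    # 1,600,000 bytes
ratio = cloud / boxes                 # 2500.0
```

Under these assumed sizes, the object-level message is three to four orders of magnitude smaller than the point cloud it summarizes.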
2.3 Maximum Common Subgraph Detection
The Maximum Common Subgraph (MCS) detection problem, classified as NP-hard, is pivotal in various scientific fields [45, 13, 15, 18], necessitating algorithms that balance precision and computational efficiency. Traditional approaches primarily employ branch-and-bound algorithms [39, 55, 38] and techniques that transform MCS detection into maximum clique problems [45, 37]. Recent advancements [33, 3] in machine learning have seen the application of graph neural networks and reinforcement learning to MCS detection, which attempt to learn suitable heuristics for graph matching. Despite these innovations, such methods are still constrained by the heuristic nature of the search space exploration and are subject to exponential time complexity in the worst case. In this work, we model the bounding boxes detected by each agent with a geometrically invariant object pose graph and leverage spatial constraints and temporal consistency to solve the problem efficiently.
3 CoBEVGlue: Self-Localized Collaborative Perception System
In this section, we present CoBEVGlue, the first self-localized collaborative perception system that replaces potentially unreliable localization systems with our novel spatial alignment module, BEVGlue, to estimate the relative pose between agents. CoBEVGlue includes a single-agent object detector and tracker, the key spatial alignment module BEVGlue, a multi-agent feature fusion module, and a decoder; see the overview in Fig. 2.
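Mathematically, consider $N$ agents in the scene. For the $i$th agent, let $\mathbf{O}_i^{(t)}$ be the perceptual observation at time $t$. The proposed CoBEVGlue works as follows:
$$
\begin{aligned}
\mathbf{F}_i^{(t)},\ \mathcal{B}_i^{(t)} &= \Phi_{\mathrm{det\&track}}\big(\mathbf{O}_i^{(t)}\big), && \text{(1a)}\\
\hat{\xi}_{j\to i}^{(t)} &= \Phi_{\mathrm{BEVGlue}}\big(\mathcal{B}_i^{(t)}, \mathcal{B}_j^{(t)}\big), && \text{(1b)}\\
\mathbf{M}_{j\to i}^{(t)} &= \Phi_{\mathrm{warp}}\big(\mathbf{F}_j^{(t)}, \hat{\xi}_{j\to i}^{(t)}\big), && \text{(1c)}\\
\mathbf{H}_i^{(t)} &= \Phi_{\mathrm{fusion}}\big(\{\mathbf{M}_{j\to i}^{(t)}\}_{j=1}^{N}\big), && \text{(1d)}\\
\hat{\mathcal{B}}_i^{(t)} &= \Phi_{\mathrm{dec}}\big(\mathbf{H}_i^{(t)}\big), && \text{(1e)}
\end{aligned}
$$
where $\mathbf{F}_i^{(t)}$ is the BEV feature extracted from the $i$th agent’s observation, $\mathcal{B}_i^{(t)}$ is the detection and tracking output without collaboration, $\hat{\xi}_{j\to i}^{(t)}$ is the estimated relative pose from the $j$th agent’s perspective to the $i$th agent ($\hat{\xi}_{i\to i}^{(t)}$ is the identity), $\mathbf{M}_{j\to i}^{(t)}$ is the warped BEV feature transformed from the $j$th agent’s coordinate space to the $i$th agent’s coordinate space through an affine transformation, $\mathbf{H}_i^{(t)}$ is the aggregated feature of the $i$th agent after fusing other agents’ messages, and $\hat{\mathcal{B}}_i^{(t)}$ is the detection output after collaboration.
Step (1a) employs the PointPillars framework [27], a lightweight 3D object detection system, in conjunction with a SORT [5]-inspired tracker, to extract the BEV feature $\mathbf{F}_i^{(t)}$ from the observation of the $i$th agent. This step also generates the detected bounding boxes accompanied by their tracking IDs, $\mathcal{B}_i^{(t)}$. The cornerstone of the process, Step (1b), leverages our BEVGlue module to identify co-visible objects and compute the relative pose $\hat{\xi}_{j\to i}^{(t)}$, drawing upon the detection and tracking results from multiple agents; see details in Section 4. Subsequently, Step (1c) aligns the features from other agents with the ego agent’s pose using the estimated relative poses. Step (1d) applies a multi-scale max fusion to refine the feature map, with the ego feature $\mathbf{F}_i^{(t)}$ treated as $\mathbf{M}_{i\to i}^{(t)}$. The final phase, Step (1e), uses the fused features to obtain the final detection results. CoBEVGlue can be applied to multi-agent collaboration: Steps (1b) and (1c) are executed between the ego agent and each collaborator independently. Subsequently, in Step (1d), the ego agent integrates the features transformed from all agents.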
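As a minimal illustration of the fusion in Step (1d), the sketch below takes an element-wise maximum over feature maps that have already been warped into the ego frame. It is a simplification under stated assumptions: a single scale and a single channel stored as nested lists, with the hypothetical helper name `max_fuse`; the actual system fuses multi-scale, multi-channel tensors.

```python
def max_fuse(feature_maps):
    """Element-wise max over a list of equally sized 2D feature maps."""
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    return [[max(fm[r][c] for fm in feature_maps) for c in range(cols)]
            for r in range(rows)]


# Toy 2x2 maps: the ego feature and one collaborator feature already warped
# into the ego frame (values are made up for illustration).
ego = [[0.1, 0.9], [0.0, 0.2]]
warped = [[0.7, 0.3], [0.0, 0.8]]
fused = max_fuse([ego, warped])  # [[0.7, 0.9], [0.0, 0.8]]
```

Because the maximum is taken cell by cell, a misaligned collaborator feature places strong responses in the wrong cells, which is why the accuracy of the relative pose matters for this step.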
Note that accurate relative pose estimation is essential for the success of collaborative perception systems. Inaccuracies in pose information can critically undermine subsequent processes such as feature transformation, fusion, and collaborative detection. Conventionally, Step (1b) requires each agent to use an external localization system to acquire its global position, from which the relative transformations among collaborators are calculated. This dependency on external localization is fraught with challenges, including susceptibility to noise and potential security breaches through malicious attacks. Our spatial alignment module, BEVGlue, is designed to solve these issues by leveraging perceptual data to ensure accurate relative pose estimation, thereby enhancing the resilience and effectiveness of collaborative perception. We elaborate on this key module in Sec. 4.
4 BEVGlue: Spatial Alignment Module
To estimate the relative pose between agents, the main idea of BEVGlue is to identify the co-visible objects and subsequently calculate the transformation based on these co-visible objects. To uncover this internal correspondence among agents, BEVGlue comprises three modules: (i) object graph modeling, (ii) temporally consistent maximum common subgraph detection, and (iii) relative pose calculation.
4.1 Object Graph Modeling
Object graph modeling is designed to represent the detection and tracking outcomes of each agent from Step (1a) as an object graph, with each node corresponding to an object and each edge describing the spatial relationship between objects. This method of modeling node and edge attributes discovers the temporal information of each object and the invariant geometric pattern between objects, which are valuable information for the subsequent common subgraph searching procedure.
Consider that the $i$th agent is tracking $n_i$ objects in a scenario. Let $\mathcal{B}_i$ be the detection and tracking result. The $k$th bounding box with tracking ID is $b_k = (x_k, y_k, l_k, w_k, \theta_k, \mathrm{id}_k)$, encompassing the 2D center position, length, width, yaw angle, and tracking ID. We formulate $\mathcal{B}_i$ into a fully connected object graph $\mathcal{G}_i = (\mathcal{V}_i, \mathcal{E}_i)$ for agent $i$, where $\mathcal{V}_i$ and $\mathcal{E}_i$ are the sets of nodes and edges, respectively. Note that each agent has its own object graph. The $k$th node carries a feature $\mathbf{v}_k$ derived from its bounding box and tracking ID, and the edge feature from node $k$ to node $q$ is $\mathbf{e}_{kq} = (d_{kq}, \phi_{kq}, \Delta\theta_{kq})$, defined within a polar coordinate system that sets the heading of the $k$th node as the reference direction and its position as the origin (pole). Here $d_{kq}$ and $\phi_{kq}$ denote the radial distance and polar angle of node $q$, respectively, and $\Delta\theta_{kq}$ is the intersection angle between the headings of node $k$ and node $q$. Given that the pole and the reference direction have clear and unique definitions in the physical world, edge features can be computed consistently across different object graphs. Specifically, if objects $k$ and $q$ are accurately detected by both the $i$th and $j$th agents, the edge feature $\mathbf{e}_{kq}$ computed on agent $i$’s graph is identical to the one computed on agent $j$’s graph; see also Fig. 3.
The object graph presents an innovative approach to model the observation of each agent: i) the node attribute encompasses temporal tracking data, which helps keep the matching consistent across time; ii) the edge feature is consistent across object graphs derived from different agents’ perspectives, meaning that rotations and translations applied to the observing agent’s coordinate frame do not alter the value of the edge feature. It implies that when two objects are simultaneously observed by different agents, the edge attribute remains the same regardless of the varying perspectives.
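The invariance of the edge feature can be checked numerically. The sketch below is a minimal pure-Python illustration, assuming each box is reduced to `(x, y, yaw)` in radians; the helper names `wrap_angle`, `edge_feature`, and `transform` are hypothetical and not taken from the released code.

```python
import math


def wrap_angle(a):
    """Wrap an angle to the interval [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi


def edge_feature(box_k, box_q):
    """Edge feature from node k to node q.

    Each box is (x, y, yaw). The polar frame has its pole at box_k's center
    and its reference direction along box_k's heading.
    Returns (radial distance, polar angle, heading difference).
    """
    xk, yk, yaw_k = box_k
    xq, yq, yaw_q = box_q
    dx, dy = xq - xk, yq - yk
    dist = math.hypot(dx, dy)
    polar = wrap_angle(math.atan2(dy, dx) - yaw_k)
    dyaw = wrap_angle(yaw_q - yaw_k)
    return dist, polar, dyaw


def transform(box, rot, tx, ty):
    """Express a box in another frame: rotate by rot, then translate."""
    x, y, yaw = box
    c, s = math.cos(rot), math.sin(rot)
    return (c * x - s * y + tx, s * x + c * y + ty, wrap_angle(yaw + rot))


# Two objects as seen by agent i, and the same objects in agent j's frame
# (an arbitrary rotation of 0.8 rad and translation of (-5, 3) is assumed).
a_i, b_i = (2.0, 1.0, 0.3), (6.0, 4.0, -1.1)
a_j, b_j = transform(a_i, 0.8, -5.0, 3.0), transform(b_i, 0.8, -5.0, 3.0)

e_i = edge_feature(a_i, b_i)
e_j = edge_feature(a_j, b_j)
# e_i and e_j agree up to floating-point error, although the raw box
# coordinates differ between the two frames.
```

The agreement holds for any rigid transform between the two frames, which is what lets edge features be compared across agents without knowing their relative pose.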
4.2 Temporally Consistent Maximum Common Subgraph Detection
Upon completing the object graph modeling phase, the task shifts to efficiently detecting the largest common subgraph that strictly satisfies graph isomorphism between two graphs. This common subgraph is indicative of the co-visible objects across agents and is subsequently used in calculating the relative transformation. We leverage two pieces of information in the search for a common subgraph: i) the spatial relationship: the geometric patterns of co-visible objects observed by different agents are isomorphic; ii) the temporal relationship: the co-visible objects are consistent across time. By leveraging the spatial relationship, we can ensure there are no outliers in the matched nodes. By leveraging the temporal relationship, we can ensure the temporal consistency of matching results across time.
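Consider a pair of modeled object pose graphs, $\mathcal{G}_i^{(t)}$ with $n_i$ nodes and $\mathcal{G}_j^{(t)}$ with $n_j$ nodes, at time $t$. Finding the maximum common subgraph $\mathcal{M}^{(t)}$ can be formulated as
$$
\mathcal{M}^{(t)} = \Phi_{\mathrm{MCS}}\big(\mathcal{G}_i^{(t)}, \mathcal{G}_j^{(t)}, \mathcal{M}^{(t-1)}\big). \qquad \text{(2)}
$$
To realize $\Phi_{\mathrm{MCS}}(\cdot)$, the procedure is divided into three primary steps: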
i). Candidate initialization. This step generates a set of candidate common subgraphs, represented as a list $\mathcal{C} = [c_1, \dots, c_K]$, where $K$ denotes the number of candidates. At the initial timestep, we explore all potential node pairs. For each pair, we assess whether its node affinity surpasses a predefined threshold $\tau_1$. The node affinity for a node pair $(k, q)$ at timestep $t$ is $a^{\mathrm{node}}_{kq} = f_{\mathrm{node}}(\mathbf{v}_k^{i}, \mathbf{v}_q^{j})$, where the affinity function $f_{\mathrm{node}}$ measures the similarity between node $k$ in agent $i$’s graph and node $q$ in agent $j$’s graph. Pairs that meet this criterion are incorporated into $\mathcal{C}$ as candidates for common subgraphs. At subsequent timesteps, the node pairs from the previous common subgraph are considered as candidates for the current common subgraph. The temporal correspondence between the graphs at consecutive timesteps is established by the tracking IDs.
ii). Subgraph expanding. For each candidate $c \in \mathcal{C}$, we take the node pair within it as the anchor $(k, q)$. We then identify all potential matching node pairs $(k', q')$ using the node affinity $a^{\mathrm{node}}_{k'q'}$ and the edge affinity $a^{\mathrm{edge}}_{kk',qq'} = f_{\mathrm{edge}}(\mathbf{e}^{i}_{kk'}, \mathbf{e}^{j}_{qq'})$. The affinity function for nodes is the same as in i), and the affinity function for edges measures the similarity between edge pairs, considering the differences in relative position and heading; see the details in the Appendix. A pair $(k', q')$ is added to $c$ if $a^{\mathrm{node}}_{k'q'} > \tau_1$ and $a^{\mathrm{edge}}_{kk',qq'} > \tau_2$, where $\tau_1$ and $\tau_2$ are predefined thresholds.
iii). Maximum common subgraph selection. After ii), our algorithm needs to determine the most suitable common subgraph from all candidates in . The initial criterion is to select the common subgraphs with the highest node count. If multiple candidates share this characteristic, we then compute a confidence score for those candidates. The confidence score of the candidate is calculated using Eq. 3.
(3) |
where is the node count of . Among the candidates that have the largest number of nodes, the one with the highest confidence score is selected as the final common subgraph.
The proposed MCS detection algorithm brings two benefits: i) the subgraph expanding step maintains strict spatial constraints between matched objects, saving the time otherwise spent on outlier detection in the following transformation calculation step; and ii) the temporal tracking information promotes the consistency of the detected MCS across time, improving the robustness of the proposed algorithm; see the algorithm diagram in the Appendix.
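The three steps above can be sketched in a few dozen lines. This is an illustrative simplification, not the authors' implementation: boxes are `(x, y, yaw, length, width)` tuples, the exponential affinity functions and the thresholds of 0.8 are assumptions, the confidence score is a stand-in (mean edge affinity) rather than Eq. 3, and each new pair is checked against every pair already matched rather than only the seed pair.

```python
import math
from itertools import product


def wrap(a):
    """Wrap an angle to [-pi, pi)."""
    return (a + math.pi) % (2 * math.pi) - math.pi


def edge(b1, b2):
    """(distance, polar angle, heading difference) of b2 in b1's polar frame."""
    dx, dy = b2[0] - b1[0], b2[1] - b1[1]
    return (math.hypot(dx, dy), wrap(math.atan2(dy, dx) - b1[2]), wrap(b2[2] - b1[2]))


def node_affinity(b1, b2):
    """Assumed affinity: 1 for identical (length, width), decaying with the gap."""
    return math.exp(-(abs(b1[3] - b2[3]) + abs(b1[4] - b2[4])))


def edge_affinity(e1, e2):
    """Assumed affinity: decays with distance and angle discrepancies."""
    return math.exp(-(abs(e1[0] - e2[0]) + abs(wrap(e1[1] - e2[1])) + abs(wrap(e1[2] - e2[2]))))


def expand(seed, boxes_i, boxes_j, tau_v, tau_e):
    """Grow one candidate from a seed pair, keeping edges consistent with all matched pairs."""
    matched = [seed]
    for k, q in product(range(len(boxes_i)), range(len(boxes_j))):
        if any(k == mk or q == mq for mk, mq in matched):
            continue  # each node may be matched at most once
        if node_affinity(boxes_i[k], boxes_j[q]) <= tau_v:
            continue
        if all(edge_affinity(edge(boxes_i[mk], boxes_i[k]), edge(boxes_j[mq], boxes_j[q])) > tau_e
               for mk, mq in matched):
            matched.append((k, q))
    return matched


def score(matched, boxes_i, boxes_j):
    """Stand-in confidence: mean edge affinity over matched pairs (not the paper's Eq. 3)."""
    vals = [edge_affinity(edge(boxes_i[a], boxes_i[c]), edge(boxes_j[b], boxes_j[d]))
            for (a, b), (c, d) in product(matched, repeat=2) if a != c]
    return sum(vals) / len(vals) if vals else 0.0


def detect_mcs(boxes_i, boxes_j, prev=None, tau_v=0.8, tau_e=0.8):
    """Return matched index pairs; prev holds seed pairs carried over via tracking IDs."""
    seeds = prev or [(k, q) for k, q in product(range(len(boxes_i)), range(len(boxes_j)))
                     if node_affinity(boxes_i[k], boxes_j[q]) > tau_v]
    cands = [expand(s, boxes_i, boxes_j, tau_v, tau_e) for s in seeds]
    if not cands:
        return []
    # Largest candidate first; ties are broken by the confidence score.
    return max(cands, key=lambda c: (len(c), score(c, boxes_i, boxes_j)))


def to_frame(box, rot, tx, ty):
    """Express a box in another agent's frame."""
    x, y, yaw, l, w = box
    c, s = math.cos(rot), math.sin(rot)
    return (c * x - s * y + tx, s * x + c * y + ty, wrap(yaw + rot), l, w)


# Agent i sees three cars; agent j sees the same three plus one extra car.
boxes_i = [(0.0, 0.0, 0.0, 4.5, 2.0), (10.0, 0.0, 0.5, 4.5, 2.0), (3.0, 8.0, -1.0, 4.5, 2.0)]
boxes_j = [to_frame(b, 1.0, 20.0, -5.0) for b in boxes_i] + [(50.0, 50.0, 0.2, 4.5, 2.0)]
matches = sorted(detect_mcs(boxes_i, boxes_j))  # [(0, 0), (1, 1), (2, 2)]
```

All cars in the toy scene share the same size, so node affinity alone cannot disambiguate them; the edge constraints do. Passing `prev=[(1, 1)]` mimics a later timestep in which one match is already known from tracking, so only that seed is expanded instead of all node pairs.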
4.3 Relative Pose Calculation
Given the detected common subgraph, we calculate the relative pose by treating each matched node as a point and estimating the rigid transformation between the two point sets. Let $\mathcal{V}_i^{*}$ and $\mathcal{V}_j^{*}$ be the two matched node sets, and let $\mathbf{p}_m^{i}$ and $\mathbf{p}_m^{j}$ ($m = 1, \dots, M$) be the 2D positions of the $m$th matched pair in the BEV coordinate systems of agents $i$ and $j$, respectively. The transformation between the two coordinate systems can be calculated by solving the following Procrustes problem [20]:
$$
(\mathbf{R}^{*}, \mathbf{t}^{*}) = \arg\min_{\mathbf{R} \in SO(2),\ \mathbf{t} \in \mathbb{R}^{2}} \sum_{m=1}^{M} \big\| \mathbf{R}\,\mathbf{p}_m^{j} + \mathbf{t} - \mathbf{p}_m^{i} \big\|_2^2. \qquad \text{(4)}
$$
The optimal solution for the transformation can be efficiently derived using Singular Value Decomposition (SVD) when there are no erroneous matching pairs. However, when erroneous correspondences are present, the least-squares nature of Eq. 4 makes the solution sensitive to them and necessitates more time-intensive methods capable of outlier elimination, such as RANSAC [16]. Thanks to the accuracy of our temporally consistent maximum common subgraph detection, we can employ the faster SVD method for computing the transformation. The final relative pose can be obtained from $\mathbf{R}^{*}$ and $\mathbf{t}^{*}$.
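In two dimensions, the least-squares rigid alignment in Eq. 4 has a closed form that gives the same rotation as the SVD-based solution when the matched points are not degenerate: the optimal angle is the `atan2` of the summed cross and dot products of the centered points. The sketch below is a pure-Python illustration under that observation; the function name `estimate_rigid_2d` and the toy points are assumptions, and a practical implementation would more likely call a linear algebra library.

```python
import math


def estimate_rigid_2d(src, dst):
    """Least-squares rotation angle and translation mapping src points onto dst.

    src and dst are equally long lists of matched (x, y) points with no outliers.
    """
    n = len(src)
    cx_s = sum(p[0] for p in src) / n
    cy_s = sum(p[1] for p in src) / n
    cx_d = sum(p[0] for p in dst) / n
    cy_d = sum(p[1] for p in dst) / n
    s_cos = s_sin = 0.0
    for (xs, ys), (xd, yd) in zip(src, dst):
        ax, ay = xs - cx_s, ys - cy_s   # centered source point
        bx, by = xd - cx_d, yd - cy_d   # centered target point
        s_cos += ax * bx + ay * by      # dot product term
        s_sin += ax * by - ay * bx      # cross product term
    theta = math.atan2(s_sin, s_cos)
    c, s = math.cos(theta), math.sin(theta)
    # The translation maps the rotated source centroid onto the target centroid.
    tx = cx_d - (c * cx_s - s * cy_s)
    ty = cy_d - (s * cx_s + c * cy_s)
    return theta, tx, ty


def apply(point, theta, tx, ty):
    """Rotate a point by theta, then translate it."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * point[0] - s * point[1] + tx, s * point[0] + c * point[1] + ty)


# Matched object centers in agent j's frame and, via an assumed ground-truth
# transform, in agent i's frame.
pts_j = [(0.0, 0.0), (10.0, 0.0), (3.0, 8.0)]
pts_i = [apply(p, 0.8, -5.0, 3.0) for p in pts_j]
theta, tx, ty = estimate_rigid_2d(pts_j, pts_i)  # recovers (0.8, -5.0, 3.0) up to rounding
```

Every matched pair contributes equally to the sums, so a single wrong correspondence would bias the estimate; this is the reason outlier-free matching matters for the fast closed-form path.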
4.4 Discussions
Advantages. BEVGlue has the following distinct advantages. Compared to point cloud registration methods [49, 63, 70]:
• It only requires the transmission of bounding boxes and tracking IDs, while point cloud registration methods require the transmission of massive point sets with their feature vectors, costing too much communication bandwidth;
• It ensures high-quality matching with no outliers by checking the spatial relationship between matched nodes, while point cloud registration methods only match by comparing local point features.
Compared to the alignment modules in previous collaborative perception systems [53, 69, 61, 36]:
• It is capable of building a consistent coordinate system without using any localization results, while previous methods can only rectify minor inaccuracies in external localization systems since they rely on a roughly correct initial relative pose;
• It incorporates tracking results to enhance the temporal consistency of matching, while previous methods ignore the temporal information.
Prerequisites. There are two assumptions for BEVGlue to function well: i) collaboration is initiated only when agents are in close proximity, ensuring a common field of view; ii) several objects exist in the common field of view and are perceived by both agents. These assumptions are realistically met in a collaborative perception context for two primary reasons. Firstly, the wireless connections that facilitate collaboration are inherently range-bound, thus naturally restricting collaborative interactions to agents in close proximity. Secondly, typical traffic scenarios inherently offer an environment rich with various objects, and BEV alignment can be achieved with as few as two co-visible objects. Real-world data from collaborative perception datasets corroborate this: 91.3% of the samples in DAIR-V2X [68] and 94.2% of the samples in V2V4Real [60] satisfy these criteria.
5 Experimental Results
Metric: AP@0.5. Columns give the dataset and noise level; values shared across noise levels in the original table are repeated in each column.

| Category | Method | OPV2V 0.0/0.0 | OPV2V 0.5/0.5 | OPV2V 1.5/1.5 | OPV2V 2.5/2.5 | DAIR-V2X 0.0/0.0 | DAIR-V2X 0.5/0.5 | DAIR-V2X 1.5/1.5 | DAIR-V2X 2.5/2.5 | V2V4Real 0.0/0.0 | V2V4Real 0.5/0.5 | V2V4Real 1.5/1.5 | V2V4Real 2.5/2.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| w/o collaboration | No Collaboration | 0.786 | 0.786 | 0.786 | 0.786 | 0.645 | 0.645 | 0.645 | 0.645 | 0.447 | 0.447 | 0.447 | 0.447 |
| w/o robust design | F-Cooper [11] SEC’19 | 0.834 | 0.638 | 0.458 | 0.399 | 0.737 | 0.697 | 0.660 | 0.636 | 0.693 | 0.481 | 0.330 | 0.309 |
| w/o robust design | V2VNet [56] ECCV’20 | 0.936 | 0.861 | 0.724 | 0.691 | 0.665 | 0.610 | 0.551 | 0.526 | 0.580 | 0.441 | 0.338 | 0.312 |
| w/o robust design | DiscoNet [30] NeurIPS’21 | 0.916 | 0.874 | 0.788 | 0.753 | 0.737 | 0.704 | 0.674 | 0.666 | 0.736 | 0.527 | 0.411 | 0.378 |
| w/o robust design | Where2comm [21] NeurIPS’22 | 0.944 | 0.721 | 0.500 | 0.505 | 0.752 | 0.637 | 0.580 | 0.570 | 0.704 | 0.505 | 0.384 | 0.364 |
| w/ robust design | FPV-RCNN [69] RAL’22 | 0.858 | 0.476 | 0.236 | 0.225 | 0.626 | 0.512 | 0.427 | 0.422 | 0.701 | 0.387 | 0.244 | 0.237 |
| w/ robust design | V2VNet [53] CoRL’20 | 0.942 | 0.919 | 0.865 | 0.831 | 0.661 | 0.639 | 0.614 | 0.594 | 0.550 | 0.525 | 0.467 | 0.435 |
| w/ robust design | V2X-ViT [61] ECCV’22 | 0.946 | 0.925 | 0.796 | 0.632 | 0.705 | 0.682 | 0.647 | 0.632 | 0.680 | 0.673 | 0.450 | 0.422 |
| w/ robust design | CoAlign [36] ICRA’23 | 0.966 | 0.950 | 0.863 | 0.824 | 0.746 | 0.712 | 0.665 | 0.647 | 0.709 | 0.613 | 0.435 | 0.387 |
| Self-Localized | CoBEVGlue | 0.958 | 0.958 | 0.958 | 0.958 | 0.740 | 0.740 | 0.740 | 0.740 | 0.702 | 0.702 | 0.702 | 0.702 |

Metric: AP@0.7.

| Category | Method | OPV2V 0.0/0.0 | OPV2V 0.5/0.5 | OPV2V 1.5/1.5 | OPV2V 2.5/2.5 | DAIR-V2X 0.0/0.0 | DAIR-V2X 0.5/0.5 | DAIR-V2X 1.5/1.5 | DAIR-V2X 2.5/2.5 | V2V4Real 0.0/0.0 | V2V4Real 0.5/0.5 | V2V4Real 1.5/1.5 | V2V4Real 2.5/2.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| w/o collaboration | No Collaboration | 0.690 | 0.690 | 0.690 | 0.690 | 0.526 | 0.526 | 0.526 | 0.526 | 0.261 | 0.261 | 0.261 | 0.261 |
| w/o robust design | F-Cooper [11] SEC’19 | 0.603 | 0.388 | 0.328 | 0.298 | 0.560 | 0.542 | 0.516 | 0.487 | 0.432 | 0.212 | 0.179 | 0.174 |
| w/o robust design | V2VNet [56] ECCV’20 | 0.740 | 0.534 | 0.384 | 0.315 | 0.402 | 0.362 | 0.320 | 0.316 | 0.250 | 0.163 | 0.135 | 0.130 |
| w/o robust design | DiscoNet [30] NeurIPS’21 | 0.791 | 0.741 | 0.684 | 0.655 | 0.584 | 0.568 | 0.561 | 0.557 | 0.466 | 0.296 | 0.271 | 0.357 |
| w/o robust design | Where2comm [21] NeurIPS’22 | 0.855 | 0.469 | 0.355 | 0.286 | 0.588 | 0.473 | 0.454 | 0.451 | 0.469 | 0.263 | 0.226 | 0.220 |
| w/ robust design | FPV-RCNN [69] RAL’22 | 0.840 | 0.214 | 0.173 | 0.189 | 0.409 | 0.319 | 0.325 | 0.340 | 0.479 | 0.153 | 0.156 | 0.165 |
| w/ robust design | V2VNet [53] CoRL’20 | 0.854 | 0.826 | 0.773 | 0.742 | 0.486 | 0.472 | 0.447 | 0.449 | 0.309 | 0.296 | 0.279 | 0.272 |
| w/ robust design | V2X-ViT [61] ECCV’22 | 0.856 | 0.834 | 0.721 | 0.502 | 0.531 | 0.523 | 0.510 | 0.502 | 0.391 | 0.305 | 0.272 | 0.262 |
| w/ robust design | CoAlign [36] ICRA’23 | 0.912 | 0.878 | 0.771 | 0.732 | 0.604 | 0.575 | 0.558 | 0.548 | 0.417 | 0.336 | 0.261 | 0.239 |
| Self-Localized | CoBEVGlue | 0.909 | 0.909 | 0.909 | 0.909 | 0.582 | 0.582 | 0.582 | 0.582 | 0.431 | 0.431 | 0.431 | 0.431 |
AP entries are given as without / with BEVGlue, followed by the relative improvement.

| Method | OPV2V AP@0.5 | OPV2V AP@0.7 | OPV2V comm. overhead | DAIR-V2X AP@0.5 | DAIR-V2X AP@0.7 | DAIR-V2X comm. overhead |
|---|---|---|---|---|---|---|
| F-Cooper [11] SEC’19 | 0.307 / 0.841 (+174%) | 0.224 / 0.605 (+170%) | 0.000360% | 0.563 / 0.699 (+24%) | 0.410 / 0.540 (+32%) | 0.000300% |
| V2VNet [56] ECCV’20 | 0.636 / 0.929 (+46%) | 0.375 / 0.731 (+95%) | 0.00144% | 0.393 / 0.616 (+57%) | 0.247 / 0.374 (+51%) | 0.00168% |
| DiscoNet [30] NeurIPS’21 | 0.671 / 0.917 (+37%) | 0.654 / 0.789 (+21%) | 0.000360% | 0.635 / 0.706 (+11%) | 0.540 / 0.569 (+5.4%) | 0.000420% |
| Where2comm [21] NeurIPS’22 | 0.272 / 0.937 (+244%) | 0.203 / 0.826 (+307%) | 0.000363% | 0.493 / 0.669 (+36%) | 0.398 / 0.508 (+28%) | 0.000418% |
| V2VNet [53] CoRL’20 | 0.790 / 0.927 (+17%) | 0.708 / 0.837 (+18%) | 0.00144% | 0.513 / 0.684 (+33%) | 0.401 / 0.525 (+31%) | 0.00168% |
| V2X-ViT [61] ECCV’22 | 0.696 / 0.942 (+35%) | 0.638 / 0.852 (+34%) | 0.00150% | 0.581 / 0.684 (+18%) | 0.474 / 0.525 (+11%) | 0.00172% |
| CoAlign [36] ICRA’23 | 0.791 / 0.962 (+22%) | 0.699 / 0.907 (+30%) | 0.000829% | 0.601 / 0.713 (+19%) | 0.522 / 0.578 (+11%) | 0.000959% |
5.1 Datasets and Experimental Settings
We conduct collaborative LiDAR-based 3D object detection on both a simulation dataset, OPV2V [62], co-simulated by OpenCDA [58] and Carla [14], and two real-world datasets, DAIR-V2X [68] and V2V4Real [60]. We follow [62, 36, 60] to set the detection range as in OPV2V and V2V4Real and in DAIR-V2X respectively. We use PointPillars [27] with the grid size as the encoder. For multi-scale feature fusion, the residual layer number is 3 and the channel numbers are . The communication results count the message size by byte in log scale with base .
5.2 Quantitative Evaluation
Detection performance in the presence of localization noise. Table 1 compares the proposed CoBEVGlue with previous methods under localization noise on OPV2V, DAIR-V2X and V2V4Real. For the setting of localization noise, we apply Gaussian noise $\mathcal{N}(0, \sigma_p^2)$ on $x, y$ and $\mathcal{N}(0, \sigma_r^2)$ on $\theta$, where $x, y, \theta$ are the 2D center coordinates and yaw angle of each agent’s accurate 3DoF pose. This noise setting follows previous work [56, 61, 21, 36], while we increase the range of the standard deviation to cover more challenging settings. The baseline methods include no collaboration, collaborative methods without a specific design for localization noise, and collaborative methods with a robust design for localization error. We see that the performance of CoBEVGlue is not affected by the level of pose noise, since it operates independently of localization systems, and that it significantly outperforms previous methods at various noise levels across all three datasets.
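The noise protocol described above can be sketched as follows. This is a minimal illustration, assuming a pose stored as `(x, y, yaw)` with positions in meters and yaw in degrees; the function name `add_pose_noise` and the example pose are hypothetical, and a seeded generator is used only to make the example reproducible.

```python
import random


def add_pose_noise(pose, sigma_pos, sigma_rot_deg, rng):
    """Perturb a 3DoF pose (x, y, yaw in degrees) with zero-mean Gaussian noise.

    sigma_pos is the standard deviation for x and y (meters);
    sigma_rot_deg is the standard deviation for the yaw angle (degrees).
    """
    x, y, yaw_deg = pose
    return (x + rng.gauss(0.0, sigma_pos),
            y + rng.gauss(0.0, sigma_pos),
            yaw_deg + rng.gauss(0.0, sigma_rot_deg))


rng = random.Random(0)             # fixed seed for a reproducible example
clean_pose = (12.0, -3.5, 90.0)    # hypothetical agent pose
noisy_pose = add_pose_noise(clean_pose, 1.5, 1.5, rng)  # the "1.5/1.5" noise level
```

A method that never reads the external pose is unaffected by this perturbation, which is why a self-localized system reports a single value per dataset across all noise levels.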
Detection performance in the presence of localization attack. We also explore detection performance under the common and unsolved GPS spoofing attack [71, 6], in which malicious attackers set arbitrary positions by sending fake satellite signals. Specifically, we consider an attacker that deceives all collaborators into thinking they are at the same location, aiming to generate false-positive bounding boxes. Table 2 presents the detection performance and communication bandwidth of various methods under this attack, along with their performance when integrated with our spatial alignment module BEVGlue. The results reveal that BEVGlue significantly improves performance under attack while bringing negligible communication overhead. Notably, with BEVGlue’s assistance, a majority of collaborative perception methods outperform single-agent perception even under malicious attack.
Comparisons with point cloud registration. We present a comparison between BEVGlue and two representative point cloud registration methods in Fig. 4. This comparison includes both 2D and 3D versions of the Iterative Closest Point (ICP) algorithm and a widely used pipeline in recent multi-agent systems [26], which incorporates Fast Point Feature Histograms (FPFH) [46] for keypoint description and TEASER++ [63] for robust registration. Initial poses provided to ICP are varied under different levels of Gaussian noise. We see that i) BEVGlue consistently delivers superior performance with minimal communication volume, and its effectiveness remains stable regardless of the extent of localization noise; ii) while ICP can mitigate the impact of minor localization noise, it consumes large communication bandwidth and its performance declines significantly under bandwidth constraints; iii) as localization noise increases, ICP fails to achieve successful alignment, irrespective of the available bandwidth.
Computation time. Tested on a system equipped with a 2.90GHz Intel Xeon CPU and an RTX 4090 GPU, BEVGlue achieves 89.98 frames per second (FPS) on OPV2V, 72.18 FPS on DAIR-V2X and 158.7 FPS on V2V4Real.
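The percentages reported next to each AP pair in Table 2 are consistent with a relative gain computed against the score without BEVGlue. The short check below reproduces three OPV2V AP@0.5 entries from the table; the helper name `relative_gain` is an illustrative assumption.

```python
def relative_gain(without, with_module):
    """Relative improvement in percent, measured against the score without the module."""
    return (with_module - without) / without * 100.0


# AP@0.5 on OPV2V under attack, taken from Table 2 (without / with BEVGlue).
f_cooper = relative_gain(0.307, 0.841)    # about 174%
where2comm = relative_gain(0.272, 0.937)  # about 244%
coalign = relative_gain(0.791, 0.962)     # about 22%
```

Because the gain is relative, methods that collapse under attack (low score without BEVGlue) show the largest percentages even when their final AP is similar to that of other methods.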
5.3 Qualitative Evaluation
Visualization of detection results. Fig. 5 shows a comparative visualization of detection results from V2X-ViT, CoAlign, and in the OPV2V dataset under noisy setting. The noise stems from a Gaussian distribution with a standard deviation of 3.0m for position and 3.0° for heading. V2X-ViT, despite employing the MSWin module to mitigate pose error, struggles under large noise. Similarly, the pose graph optimization algorithm in CoAlign fails in the presence of large noise, leading to a severe drop in detection performance. In contrast, ’s exhibits superior performance under large noise. This can be attributed to its independence from prior pose information, which makes it less susceptible to the impacts of pose noise.
Visual comparison with point cloud registration. Fig. 6 compares a collaborative perception system that relies on point cloud registration with one that relies on our method. For point cloud registration, we use FPFH [46] as the feature descriptor and TEASER++ [63] for registration. Fig. 6(a) shows that the point cloud registration pipeline identifies only a few matched points (in white) from a large number of transmitted points, whereas Fig. 6(b) shows that our method uses the spatial geometry of objects to find correspondences. Compared with the point cloud registration pipeline, our method uses much less communication volume.
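Once objects have been matched across two agents, the relative pose follows from a closed-form 2D rigid fit to the matched object centers. The sketch below illustrates this step using complex arithmetic; it is a simplified stand-in rather than the exact solver used in our system, and the function name `relative_pose` and the toy coordinates are hypothetical. At least two matched objects at distinct locations are needed for the rotation to be determined.

```python
import cmath

def relative_pose(centers_b, centers_a):
    """Rigid 2D transform (theta, (tx, ty)) taking matched object centers
    from agent B's frame to agent A's frame, in the least-squares sense."""
    zb = [complex(x, y) for x, y in centers_b]
    za = [complex(x, y) for x, y in centers_a]
    mb = sum(zb) / len(zb)
    ma = sum(za) / len(za)
    # The phase of sum(conj(b) * a) over centered points is the optimal rotation.
    theta = cmath.phase(sum((b - mb).conjugate() * (a - ma) for b, a in zip(zb, za)))
    t = ma - cmath.exp(1j * theta) * mb
    return theta, (t.real, t.imag)

# Toy example: B's frame is rotated by 0.4 rad and offset by (12, -3) in A's frame.
true_theta, true_t = 0.4, complex(12.0, -3.0)
centers_a = [(5.0, 2.0), (9.0, -4.0), (15.0, 1.0)]
centers_b = []
for x, y in centers_a:
    z = (complex(x, y) - true_t) * cmath.exp(-1j * true_theta)
    centers_b.append((z.real, z.imag))

theta, (tx, ty) = relative_pose(centers_b, centers_a)
print(theta, tx, ty)
```

Only the matched object centers enter this computation, which is consistent with the small communication volume compared with transmitting raw points.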
5.4 Ablation Studies
| Geometric pattern | Tracking | OPV2V AP@0.3 | OPV2V AP@0.5 | OPV2V AP@0.7 | V2V4Real AP@0.3 | V2V4Real AP@0.5 | V2V4Real AP@0.7 |
|---|---|---|---|---|---|---|---|
| | | 0.7904 | 0.7781 | 0.6746 | 0.4919 | 0.4469 | 0.2611 |
| ✓ | | 0.9614 | 0.9545 | 0.9049 | 0.6839 | 0.6356 | 0.3787 |
| ✓ | ✓ | 0.9651 | 0.9581 | 0.9088 | 0.7394 | 0.7016 | 0.4308 |
Table 3 assesses the effectiveness of the geometric pattern and tracking information used in maximum common subgraph detection on the OPV2V and V2V4Real datasets. Without the geometric pattern, the matching process relies only on node information and ignores the spatial relationships between nodes. Without the tracking information, the search ignores temporal consistency. We see that: i) geometric patterns play a crucial role in enhancing our method for common subgraph detection. Specifically, on the V2V4Real dataset, integrating geometric patterns results in a performance boost of 45.7% over the baseline; and ii) the inclusion of tracking information significantly contributes to the temporal coherence of the matching outcomes, with an improvement of 57.0% on the V2V4Real dataset compared to the baseline approach.
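The role of the geometric pattern can be illustrated with a simple consistency check: distances between pairs of objects are invariant to the unknown relative pose, so a correct set of matches must preserve them. The sketch below is a simplified illustration of this idea, not the maximum common subgraph search itself; the function name `edges_consistent`, the tolerance `tol`, and the toy coordinates are hypothetical.

```python
import math
from itertools import combinations

def edges_consistent(matches, objs_a, objs_b, tol=0.5):
    """Return True if every pair of matches preserves the inter-object distance.
    matches: list of (index in objs_a, index in objs_b).
    tol: hypothetical tolerance in metres for detection noise."""
    for (i, j), (k, l) in combinations(matches, 2):
        d_a = math.hypot(objs_a[i][0] - objs_a[k][0], objs_a[i][1] - objs_a[k][1])
        d_b = math.hypot(objs_b[j][0] - objs_b[l][0], objs_b[j][1] - objs_b[l][1])
        if abs(d_a - d_b) > tol:
            return False
    return True

# Toy example: the same three objects seen from two frames
# (agent B's frame is rotated by 90 degrees and shifted).
objs_a = [(0.0, 0.0), (5.0, 0.0), (5.0, 4.0)]
objs_b = [(10.0, 2.0), (10.0, 7.0), (6.0, 7.0)]

correct = [(0, 0), (1, 1), (2, 2)]
swapped = [(0, 1), (1, 0), (2, 2)]

print(edges_consistent(correct, objs_a, objs_b))
print(edges_consistent(swapped, objs_a, objs_b))
```

If all objects look alike, node information alone cannot distinguish the correct assignment from the swapped one; the pairwise distances can.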
6 Conclusions
This paper proposes a novel self-localized collaborative perception framework and a novel spatial alignment method. The core idea is to search for co-visible objects in the bird's eye view perceptual data across agents and to calculate the relative pose between agents, ensuring a consistent spatial coordinate system for collaboration. Comprehensive experiments show that the proposed framework performs comparably to systems relying on precise localization information and achieves state-of-the-art detection performance in the presence of localization noise and attacks.
Limitation and future work. This work focuses on exploiting spatial alignment for collaborative perception. We plan to mitigate the temporal misalignment issue in future work.
References
- [1] Ao, S., Hu, Q., Yang, B., Markham, A., Guo, Y.: Spinnet: Learning a general surface descriptor for 3d point cloud registration. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11753–11762 (2021)
- [2] Arnold, E., Dianati, M., de Temple, R., Fallah, S.: Cooperative perception for 3d object detection in driving scenarios using infrastructure sensors. IEEE Transactions on Intelligent Transportation Systems 23(3), 1852–1864 (2020)
- [3] Bai, Y., Xu, D., Sun, Y., Wang, W.: Glsearch: Maximum common subgraph detection via learning to search. In: International Conference on Machine Learning. pp. 588–598. PMLR (2021)
- [4] Besl, P.J., McKay, N.D.: Method for registration of 3-d shapes. In: Sensor fusion IV: control paradigms and data structures. vol. 1611, pp. 586–606. SPIE (1992)
- [5] Bewley, A., Ge, Z., Ott, L., Ramos, F., Upcroft, B.: Simple online and realtime tracking. In: 2016 IEEE international conference on image processing (ICIP). pp. 3464–3468. IEEE (2016)
- [6] Bhatti, J., Humphreys, T.E.: Hostile control of ships via false gps signals: Demonstration and detection. NAVIGATION: Journal of the Institute of Navigation 64(1), 51–66 (2017)
- [7] Bouaziz, S., Tagliasacchi, A., Pauly, M.: Sparse iterative closest point. In: Computer graphics forum. vol. 32, pp. 113–123. Wiley Online Library (2013)
- [8] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11621–11631 (2020)
- [9] Chang, Y., Ebadi, K., Denniston, C.E., Ginting, M.F., Rosinol, A., Reinke, A., Palieri, M., Shi, J., Chatterjee, A., Morrell, B., et al.: Lamp 2.0: A robust multi-robot slam system for operation in challenging large-scale underground environments. IEEE Robotics and Automation Letters 7(4), 9175–9182 (2022)
- [10] Chang, Y., Hughes, N., Ray, A., Carlone, L.: Hydra-multi: Collaborative online construction of 3d scene graphs with multi-robot teams. arXiv preprint arXiv:2304.13487 (2023)
- [11] Chen, Q.: F-cooper: Feature based cooperative perception for autonomous vehicle edge computing system using 3d point clouds (2019)
- [12] Choy, C., Park, J., Koltun, V.: Fully convolutional geometric features. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 8958–8966 (2019)
- [13] Combier, C., Damiand, G., Solnon, C.: Map edit distance vs. graph edit distance for matching images. In: Graph-Based Representations in Pattern Recognition: 9th IAPR-TC-15 International Workshop, GbRPR 2013, Vienna, Austria, May 15-17, 2013. Proceedings 9. pp. 152–161. Springer (2013)
- [14] Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: Carla: An open urban driving simulator. In: Conference on robot learning. pp. 1–16. PMLR (2017)
- [15] Ehrlich, H.C., Rarey, M.: Maximum common subgraph isomorphism algorithms and their applications in molecular science: a review. Wiley Interdisciplinary Reviews: Computational Molecular Science 1(1), 68–79 (2011)
- [16] Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981). https://doi.org/10.1145/358669.358692
- [17] Fitzgibbon, A.W.: Robust registration of 2d and 3d point sets. Image and vision computing 21(13-14), 1145–1153 (2003)
- [18] Gay, S., Fages, F., Martinez, T., Soliman, S., Solnon, C.: On the subgraph epimorphism problem. Discrete Applied Mathematics 162, 214–228 (2014)
- [19] Gelfand, N., Mitra, N.J., Guibas, L.J., Pottmann, H.: Robust global registration. In: Symposium on geometry processing. vol. 2, p. 5. Vienna, Austria (2005)
- [20] Gower, J.C., Dijksterhuis, G.B.: Procrustes problems, vol. 30. OUP Oxford (2004)
- [21] Hu, Y., Fang, S., Lei, Z., Zhong, Y., Chen, S.: Where2comm: Communication-efficient collaborative perception via spatial confidence maps. Advances in neural information processing systems 35, 4874–4886 (2022)
- [22] Huang, S., Gojcic, Z., Usvyatsov, M., Wieser, A., Schindler, K.: Predator: Registration of 3d point clouds with low overlap. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. pp. 4267–4276 (2021)
- [23] Huang, Y., Shan, T., Chen, F., Englot, B.: Disco-slam: Distributed scan context-enabled multi-robot lidar slam with two-stage global-local graph optimization. IEEE Robotics and Automation Letters 7(2), 1150–1157 (2021)
- [24] Ikram, M.H., Khaliq, S., Anjum, M.L., Hussain, W.: Perceptual aliasing++: Adversarial attack for visual slam front-end and back-end. IEEE Robotics and Automation Letters 7(2), 4670–4677 (2022)
- [25] Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Transactions on pattern analysis and machine intelligence 21(5), 433–449 (1999)
- [26] Lajoie, P.Y., Beltrame, G.: Swarm-slam: Sparse decentralized collaborative simultaneous localization and mapping framework for multi-robot systems. arXiv preprint arXiv:2301.06230 (2023)
- [27] Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12697–12705 (2019)
- [28] Lei, Z., Ren, S., Hu, Y., Zhang, W., Chen, S.: Latency-aware collaborative perception. In: European Conference on Computer Vision. pp. 316–332. Springer (2022)
- [29] Li, Y., Ma, D., An, Z., Wang, Z., Zhong, Y., Chen, S., Feng, C.: V2x-sim: Multi-agent collaborative perception dataset and benchmark for autonomous driving. IEEE Robotics and Automation Letters 7(4), 10914–10921 (2022)
- [30] Li, Y., Ren, S., Wu, P., Chen, S., Feng, C., Zhang, W.: Learning distilled collaboration graph for multi-agent perception. Advances in Neural Information Processing Systems 34, 29541–29552 (2021)
- [31] Liu, Q., Zhu, H., Zhou, Y., Li, H., Chang, S., Guo, M.: Density-invariant features for distant point cloud registration. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18215–18225 (2023)
- [32] Liu, S., Gao, C., Chen, Y., Peng, X., Kong, X., Wang, K., Xu, R., Jiang, W., Xiang, H., Ma, J., et al.: Towards vehicle-to-everything autonomous driving: A survey on collaborative perception. arXiv preprint arXiv:2308.16714 (2023)
- [33] Liu, Y.l., Li, C.m., Jiang, H., He, K.: A learning based branch and bound for maximum common subgraph problems. arXiv preprint arXiv:1905.05840 (2019)
- [34] Liu, Y.C., Tian, J., Glaser, N., Kira, Z.: When2com: Multi-agent perception via communication graph grouping. In: Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition. pp. 4106–4115 (2020)
- [35] Liu, Y.C., Tian, J., Ma, C.Y., Glaser, N., Kuo, C.W., Kira, Z.: Who2com: Collaborative perception via learnable handshake communication. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). pp. 6876–6883. IEEE (2020)
- [36] Lu, Y., Li, Q., Liu, B., Dianati, M., Feng, C., Chen, S., Wang, Y.: Robust collaborative 3d object detection in presence of pose errors. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). pp. 4812–4818. IEEE (2023)
- [37] McCreesh, C., Ndiaye, S.N., Prosser, P., Solnon, C.: Clique and constraint models for maximum common (connected) subgraph problems. In: International Conference on Principles and Practice of Constraint Programming. pp. 350–368. Springer (2016)
- [38] McCreesh, C., Prosser, P., Trimble, J.: A partitioning algorithm for maximum common subgraph problems (2017)
- [39] McGregor, J.J.: Backtrack search algorithms and the maximal common subgraph problem. Software: Practice and Experience 12(1), 23–34 (1982)
- [40] Meng, Z., Xia, X., Xu, R., Liu, W., Ma, J.: Hydro-3d: Hybrid object detection and tracking for cooperative perception using 3d lidar. IEEE Transactions on Intelligent Vehicles (2023)
- [41] Narain, S., Ranganathan, A., Noubir, G.: Security of gps/ins based on-road location tracking systems. In: 2019 IEEE Symposium on Security and Privacy (SP). pp. 587–601. IEEE (2019)
- [42] Noh, J., Kwon, Y., Son, Y., Shin, H., Kim, D., Choi, J., Kim, Y.: Tractor beam: Safe-hijacking of consumer drones with adaptive gps spoofing. ACM Transactions on Privacy and Security (TOPS) 22(2), 1–26 (2019)
- [43] Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 652–660 (2017)
- [44] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems 30 (2017)
- [45] Raymond, J.W., Willett, P.: Maximum common subgraph isomorphism algorithms for the matching of chemical structures. Journal of computer-aided molecular design 16, 521–533 (2002)
- [46] Rusu, R.B., Blodow, N., Beetz, M.: Fast point feature histograms (fpfh) for 3d registration. In: 2009 IEEE international conference on robotics and automation. pp. 3212–3217. IEEE (2009)
- [47] Rusu, R.B., Blodow, N., Marton, Z.C., Beetz, M.: Aligning point cloud views using persistent feature histograms. In: 2008 IEEE/RSJ international conference on intelligent robots and systems. pp. 3384–3391. IEEE (2008)
- [48] Sarkar, S.D., Miksik, O., Pollefeys, M., Barath, D., Armeni, I.: Sgaligner: 3d scene alignment with scene graphs. arXiv preprint arXiv:2304.14880 (2023)
- [49] Sharp, G.C., Lee, S.W., Wehe, D.K.: Icp registration using invariant features. IEEE Transactions on Pattern Analysis and Machine Intelligence 24(1), 90–102 (2002)
- [50] Shen, J., Won, J.Y., Chen, Z., Chen, Q.A.: Drift with devil: Security of Multi-Sensor fusion based localization in High-Level autonomous driving under GPS spoofing. In: 29th USENIX Security Symposium (USENIX Security 20). pp. 931–948 (2020)
- [51] Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in perception for autonomous driving: Waymo open dataset. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2446–2454 (2020)
- [52] Tombari, F., Salti, S., Di Stefano, L.: Unique shape context for 3d data description. In: Proceedings of the ACM workshop on 3D object retrieval. pp. 57–62 (2010)
- [53] Vadivelu, N., Ren, M., Tu, J., Wang, J., Urtasun, R.: Learning to communicate and correct pose errors. In: Conference on Robot Learning. pp. 1195–1210. PMLR (2021)
- [54] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
- [55] Vismara, P., Valery, B.: Finding maximum common connected subgraphs using clique detection or constraint satisfaction algorithms. In: International Conference on Modelling, Computation and Optimization in Information Systems and Management Sciences. pp. 358–368. Springer (2008)
- [56] Wang, T.H., Manivasagam, S., Liang, M., Yang, B., Zeng, W., Urtasun, R.: V2vnet: Vehicle-to-vehicle communication for joint perception and prediction. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16. pp. 605–621. Springer (2020)
- [57] Xiang, H., Xu, R., Ma, J.: Hm-vit: Hetero-modal vehicle-to-vehicle cooperative perception with vision transformer. arXiv preprint arXiv:2304.10628 (2023)
- [58] Xu, R., Guo, Y., Han, X., Xia, X., Xiang, H., Ma, J.: Opencda: an open cooperative driving automation framework integrated with co-simulation. In: 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). pp. 1155–1162. IEEE (2021)
- [59] Xu, R., Tu, Z., Xiang, H., Shao, W., Zhou, B., Ma, J.: Cobevt: Cooperative bird’s eye view semantic segmentation with sparse transformers. arXiv preprint arXiv:2207.02202 (2022)
- [60] Xu, R., Xia, X., Li, J., Li, H., Zhang, S., Tu, Z., Meng, Z., Xiang, H., Dong, X., Song, R., et al.: V2v4real: A real-world large-scale dataset for vehicle-to-vehicle cooperative perception. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13712–13722 (2023)
- [61] Xu, R., Xiang, H., Tu, Z., Xia, X., Yang, M.H., Ma, J.: V2x-vit: Vehicle-to-everything cooperative perception with vision transformer. In: European conference on computer vision. pp. 107–124. Springer (2022)
- [62] Xu, R., Xiang, H., Xia, X., Han, X., Li, J., Ma, J.: Opv2v: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication. In: 2022 International Conference on Robotics and Automation (ICRA). pp. 2583–2589. IEEE (2022)
- [63] Yang, H., Shi, J., Carlone, L.: TEASER: Fast and Certifiable Point Cloud Registration. IEEE Trans. Robotics (2020)
- [64] Yew, Z.J., Lee, G.H.: Rpm-net: Robust point matching using learned features. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11824–11833 (2020)
- [65] Yew, Z.J., Lee, G.H.: Regtr: End-to-end point cloud correspondences with transformers. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6677–6686 (2022)
- [66] Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3d object detection and tracking. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11784–11793 (2021)
- [67] Yoshida, K., Hojo, M., Fujino, T.: Adversarial scan attack against scan matching algorithm for pose estimation in lidar-based slam. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences 105(3), 326–335 (2022)
- [68] Yu, H., Luo, Y., Shu, M., Huo, Y., Yang, Z., Shi, Y., Guo, Z., Li, H., Hu, X., Yuan, J., et al.: Dair-v2x: A large-scale dataset for vehicle-infrastructure cooperative 3d object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21361–21370 (2022)
- [69] Yuan, Y., Cheng, H., Sester, M.: Keypoints-based deep feature fusion for cooperative vehicle detection of autonomous driving. IEEE Robotics and Automation Letters 7(2), 3054–3061 (2022)
- [70] Zeng, A., Song, S., Nießner, M., Fisher, M., Xiao, J., Funkhouser, T.: 3dmatch: Learning local geometric descriptors from rgb-d reconstructions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1802–1811 (2017)
- [71] Zeng, K.C., Liu, S., Shu, Y., Wang, D., Li, H., Dou, Y., Wang, G., Yang, Y.: All your GPS are belong to us: Towards stealthy manipulation of road navigation systems. In: 27th USENIX security symposium (USENIX security 18). pp. 1527–1544 (2018)
- [72] Zhang, Z., Fisac, J.F.: Safe occlusion-aware autonomous driving via game-theoretic active perception. arXiv preprint arXiv:2105.08169 (2021)
- [73] Zhong, S., Chen, H., Qi, Y., Feng, D., Chen, Z., Wu, J., Wen, W., Liu, M.: Colrio: Lidar-ranging-inertial centralized state estimation for robotic swarms. arXiv preprint arXiv:2402.11790 (2024)
- [74] Zhong, S., Qi, Y., Chen, Z., Wu, J., Chen, H., Liu, M.: Dcl-slam: A distributed collaborative lidar slam framework for a robotic swarm. IEEE Sensors Journal (2023)
- [75] Zhou, Y., Xiao, J., Zhou, Y., Loianno, G.: Multi-robot collaborative perception with graph neural networks. IEEE Robotics and Automation Letters 7(2), 2289–2296 (2022)