In this section, we will show in detail the results of detecting anomalous GPS/BeiDou observations on the dataset D1 using HDBSCAN algorithms. This contains a critical post-processing process, which is also the focus of this article, laying the foundation for subsequent supervised learning and real-time applications. At the same time, changes in positioning accuracy and availability after excluding anomalous observations are also analyzed.
3.2.1. Results of D1a
The data subset D1a is first processed using the HDBSCAN algorithm. By parameter tuning, the dataset is grouped into one category as far as possible. The criterion for parameter determination is that as outliers evolve from less to more, the maximum positioning errors in the east and north directions are both less than 10 m for the first time after excluding anomalous observations. This is a relatively conservative approach to reduce false positive rate. Another effective method is to extract the upper quantiles according to the probability distribution of
outlier scores in HDBSCAN that describes the possibility of the point becoming an outlier to determine the outlier boundary. In this paper, the maximum positioning error was used to determine the parameters of the clustering model because of the available ground truth values. To be specific, we first fixed
MinSamples to the default value and adjusted
MinClusterSize. As described in the last paragraph of
Section 2.2, the number of clusters can be reduced by increasing
MinClusterSize. Therefore, we set the parameter in ascending order. When
MinClusterSize is 60, the number of clusters starts to converge to 2 (the rest are outliers), and there are very few sample points in one of the clusters. At this point, we started to adjust
MinSamples. The larger
MinSamples is, the more points are considered anomalies. To reduce false positive rate, after repeated SPP experiments with anomaly exclusion, we finally set
MinSamples to 8, and the results exactly meet the above-mentioned maximum plane error conditions.
Figure 6 shows the 3D clustering results when the
MinClusterSize is 60 and the
MinSamples is 8. For better visualization, the first three principal components (PC) were set as the coordinate axes. It can be seen that the dataset is clustered into two categories, Cluster 1 (blue points) and Cluster 2 (green points), labelled 1 and 0, respectively. Here, the outlier samples labelled -1 are considered “anomalous observations” (red points). However, this is just an intuitive preliminary labelling result, because outliers or anomalies do not necessarily mean anomalous GNSS observations. In some extreme circumstances, anomalous observations in the dataset may behave more like non-outliers than normal observations. Without prior knowledge, there may be a risk that the anomalous observations are wrongly regarded as “normal”, while the normal observations become “anomalous”. Therefore, for the sake of insurance, it is necessary to further confirm the clustering results according to the probability distributions of the feature values.
Figure 7 depicts a pair plot that indicates pairwise relationships between variables in the training set D1a. The subplots on the diagonal axes show the univariate distributions of the corresponding variables. The four variables chosen for illustration are satellite elevation angle, pseudorange residual, C/N
0 measurement, and pseudorange rate consistency. According to the probability distributions of the last three variables, it can be determined that the recognition results of anomalous GNSS observations above are generally consistent with prior statistical knowledge. Specifically, the anomalous observations are characterized in probability by the largest absolute pseudorange residual, the lowest C/N
0 measurement and the largest magnitude of pseudorange rate difference, which is highly similar to the properties of NLOS signals in the urban environments. Meanwhile, the satellite elevation angle has a weak ability to discern the types of GNSS observations. However, it can still be seen that points with medium-low elevation angles (less than 50°) account for the majority of the anomalies. As for Cluster 1 and Cluster 2, the samples of Cluster 2 are mainly concentrated in the region with higher elevation angle, smaller residual and larger C/N
0 measurement. These samples can be defined as high-quality observations, which are most likely derived from completely contamination-free LOS signals under ideal observation conditions. Cluster 1 is a collection of those between high-quality observations and anomalous observations. Due to the interference of low multipath effect or other potential factors, its quality as a whole is not as good as that of Cluster 2, but it can still be used for position solution. In D1, the number of anomalies and non-anomalies is 677 and 18,892, respectively.
Being short of labels, there is no unified evaluation index for clustering results. However, the goal of clustering in this paper is clear, namely, to improve the accuracy and reliability of GNSS positioning in urban environments, so the root mean square error (RMSE) of positioning results after excluding anomalous observations was used as an index to evaluate the performance of clustering. The high-precision RTK/INS tightly combined solution was considered as ground truth values. The single point positioning solutions in this paper were obtained using the weighted least squares (WLS) method based on satellite elevation angle.
Figure 8 and
Table 3 show the improvement in positioning results after HDBSCAN was used to eliminate anomalous observations. The original positioning RMSE without anomaly detection is 3.05, 2.41, and 9.88 m in the east, north, and up directions, respectively. In contrast, the positioning RMSE in the three directions after excluding anomalous observations is 1.09, 2.10, and 6.17 m, with accuracy improvements of 64.3%, 12.9%, and 37.6%. As the vehicle track generally goes from south to north, the positioning error in the east direction is relatively large compared with the north direction [
44]. The experimental results show that the removal of anomalies has a better effect on the improvement of positioning accuracy in the east direction, which indicates that HDBSCAN is not only effective to identify GNSS anomalous observations but also consistent with the real situation. The continuity of dynamic positioning results is also of crucial importance while ensuring the positioning accuracy. When there are too few GNSS constellations involved in position solution, it is not advisable to lose a large number of original valid epochs by blindly pursuing the positioning accuracy. In the dataset D1a, the number of valid epochs is 1705, and after excluding anomalous observations the number becomes 1655, making the observed data available up to 97.1%. From
Figure 9, besides, as the number of satellites participating in position calculation is reduced due to the exclusion of anomalous observations, the geometric dilution of precision (GDOP) has unsurprisingly risen overall. Even so, the positioning accuracy has been improved. Therefore, this method can effectively identify anomalies in GNSS observations. Moreover, the epochs after using HDBSCAN to remove anomalous observations all meet the Chi-square test.
The proposed method can improve the positioning accuracy in two ways. As can be seen from
Figure 10, in harsh environments, fewer satellites can be observed and a large number of anomalous observations are mixed into the observation epochs, resulting in huge positioning errors. After anomaly exclusion, there are not enough GPS or BeiDou satellites available for the dual-system positioning process, so these epochs no longer output their positioning results. On the other hand, as shown in
Figure 11, when there are sufficient visible satellites and considerable anomalous observations received, the remaining satellites can still participate in the position solution even if the removal of anomalies weakened the satellite geometry, and the positioning accuracy is substantially improved, especially in the east direction. In the former case, some valid epochs are lost due to insufficient normal observations caused by the poor observation environment. However, incorrect coordinate solutions are avoided. Since normal observations are preserved, these epochs can be complemented using their normal observations through advanced filtering algorithms and enhanced information from other sensors or measurements.
In the traditional GNSS single point positioning process, the satellite elevation mask and C/N
0 mask are usually set to exclude the observations with poor quality. Therefore, this simple and direct method is also used for comparative experiments. The experimental results are listed in
Table 4. When the C/N
0 mask is set to 50 dB-Hz, the position solutions of only eight epochs can be obtained, and the availability of high-quality observations is greatly reduced, so this result is ignored. By comparison, it can be found that when the elevation angle mask and C/N
0 mask are set to 25° and 0 dB-Hz, respectively, the data subset D1a has the minimum two-dimensional plane positioning error. Nevertheless, the positioning accuracy has not been significantly improved compared with the original result. Some conclusions can be drawn here. Firstly, in urban vehicle-mounted scenarios, the C/N
0 measurement has a higher resolution in the positioning results compared with the satellite elevation angle. Specifically, the change of the elevation mask has little influence on the positioning result under the condition of a fixed C/N
0 mask until the elevation mask reaches 40°. This also indicates that in the urban environment, the observations are mainly concentrated below the C/N
0 measurement of 50dB-Hz and above the elevation of 35°. Secondly, anomalous GNSS observations cannot be effectively eliminated by setting elevation mask and C/N
0 mask in a complex urban environment, because the factors affecting the quality of GNSS observations are various. In addition, appropriate cut-off values are not easy to find. If the mask is set too large, the satellite geometric distribution will deteriorate and the number of visible satellites will decrease, thus reducing the positioning accuracy and availability.
3.2.2. Results of D1b
For this part of data, the traditional RAIM FDE algorithm recalculates the position by excluding, one by one, the visible satellites at each epoch. Once meeting the Chi-square test, the position solutions of these epochs are output. However, in epochs with a large number of anomalous observations, the algorithm usually fails. Since HDBSCAN itself can also be used to build predictive models, we first try to predict anomalies of D1b using HDBSCAN which are trained on D1a in this section. The prediction results are shown in
Figure 12. To facilitate visualization, the dataset was projected onto the PC1-PC2 plane. There are 599 anomalous observations, obviously more than the normal observations. The remaining 134 normal observations are used for single point positioning to verify the effect of anomaly detection.
Table 5 shows how the above two methods change the positioning results. It can be seen that although RAIM FDE retains most of the epochs, the positioning accuracy does not increase but decreases because the anomalous observations are not effectively eliminated. HDBSCAN improves the plane positioning accuracy to within 1 m, and the two new epochs after excluding anomalous observation meet the Chi-square test. Since the old epochs of D1b do not conform to the Chi-square test, it contains considerable anomalous observations. After identification and elimination of them, the observations that can be used for position calculation will be greatly reduced, so only a small number of epochs are retained. In D1b, although 79 valid epochs contain 134 normal observations, only two epochs with more than five satellites are involved in the solution. This problem will be alleviated once more constellations are introduced.
In addition to the HDBSCAN prediction, we also used some typical supervised classifiers, such as radial basis function (RBF) kernel SVM, decision tree, random forest, adaptive boosting (AdaBoost) and multi-layer perceptron (MLP), to detect anomalous observations of D1b for verification of clustering results. Before the training, we classified Cluster 1 and Cluster 2 in D1a into one category, and anomalous observations into another, forming a binary classifier. The positioning results after classification are listed in
Table 6. It can be seen that the positioning results of RBF SVM, decision tree, and MLP are close to that of HDBSCAN. As the availability increases, the positioning accuracy decreases, which indicates that more anomalous observations may be retained in the observations. Therefore, for D1b, both the direct prediction based on HDBSCAN and the classifier based on supervised learning can effectively detect anomalous observations. In addition, to deal with the imbalance of positive and negative samples, an over-sampling method called synthetic minority over-sampling technique (SMOTE) [
45] was used to increase the number of anomaly samples. However, the desired results were not achieved.
Similarly, we also set the elevation mask and C/N
0 mask for this part of data to compare with the proposed method. The positioning results are shown in
Table 7. Obviously, due to the poor quality of the data, the positioning results improve only when the C/N
0 mask is set to a large value. One epoch with higher accuracy is retained, which is similar to the results of machine learning methods. However, it is still difficult to determine the optimal elevation and C/N
0 mask, which must be consistent with D1a. In practical applications, the cut-off values of the two parts of data are set uniformly. Therefore, it is generally advisable to abandon the recovery of these epochs, so as not to encumber the positioning result of D1a. Combined with the two parts of observations, it is difficult to identify the anomalies by setting the elevation and C/N
0 mask in the complex urban vehicle-mounted environment, which has little effect on the improvement of the positioning results.