
GPSClean: A Framework for Cleaning and Repairing GPS Data

Published: 13 April 2022

Abstract

The rise of GPS-equipped mobile devices has led to the emergence of big trajectory data. The collected raw data usually contain errors and anomalies caused by device failure, sensor error, and environmental influence. Low-quality data fail to support application requirements, so raw data must be comprehensively cleaned before use. Existing methods are suboptimal at detecting and repairing GPS data errors. To solve this problem, we propose a framework called GPSClean that analyzes anomalous data and provides effective methods to repair it. GPSClean has four primary modules: (i) data preprocessing, (ii) data filling, (iii) data repairing, and (iv) data conversion. For (i), we propose an approach named MDSort (Maximum Disorder Sorting) to efficiently resolve data disorder. For (ii), we propose a method named NNF (Nearest Neighbor Filling) to fill in missing data. For (iii), we design an approach named RCSWS (Range Constraints and Sliding Window Statistics) to repair anomalies, and we further improve repair accuracy by making use of driving direction. We evaluate our proposal on 45 million real trajectory records in the prototype database system SECONDO. Experimental results show that the accuracy of RCSWS is three times higher than that of the alternative method SCREEN and nearly an order of magnitude higher than that of the alternative method EWMA.
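The abstract names the stages but does not give their details, so the following is only a minimal sketch of the general ideas (reorder, fill, repair) and not the authors' MDSort, NNF, or RCSWS. It assumes planar coordinates, a plain timestamp sort, filling from the temporally nearest observed point, and a fixed speed bound with a sliding-window median as the repair statistic. All function names and parameters are hypothetical.

```python
from statistics import median

def sort_by_time(points):
    """Restore temporal order. Each point is (t, x, y); x and y may be None."""
    return sorted(points, key=lambda p: p[0])

def fill_nearest(points):
    """Fill missing coordinates from the temporally nearest observed point."""
    known = [p for p in points if p[1] is not None and p[2] is not None]
    if not known:
        return list(points)
    filled = []
    for t, x, y in points:
        if x is None or y is None:
            # Ties are resolved in favor of the earlier observation.
            _, x, y = min(known, key=lambda k: abs(k[0] - t))
        filled.append((t, x, y))
    return filled

def repair_outliers(points, max_speed, window=2):
    """Repair points that break a speed bound (an assumed range constraint).

    A point whose speed from the previously accepted point exceeds
    max_speed is replaced by the coordinate-wise median of its
    neighbors within `window` positions on each side.
    """
    repaired = list(points)
    for i in range(1, len(repaired)):
        t0, x0, y0 = repaired[i - 1]
        t1, x1, y1 = repaired[i]
        dt = t1 - t0
        dist = ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
        if dt > 0 and dist / dt > max_speed:
            lo, hi = max(0, i - window), min(len(points), i + window + 1)
            neigh = [points[j] for j in range(lo, hi) if j != i]
            repaired[i] = (t1,
                           median(p[1] for p in neigh),
                           median(p[2] for p in neigh))
    return repaired

def clean(points, max_speed, window=2):
    """Run the three illustrative stages in sequence."""
    return repair_outliers(fill_nearest(sort_by_time(points)), max_speed, window)

# Toy trajectory: out of order, one missing fix, one spike at t=3.
raw = [(2, 2.0, 0.0), (0, 0.0, 0.0), (1, None, None),
       (3, 300.0, 0.0), (4, 4.0, 0.0), (5, 5.0, 0.0)]
cleaned = clean(raw, max_speed=2.0)
```

The median is used here only because it is robust to the spike itself; the paper's sliding-window statistics and its use of driving direction may differ.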


Cited By

  • (2024) Multivariate Time Series Cleaning under Speed Constraints. Proceedings of the ACM on Management of Data 2:6 (1-26). DOI: 10.1145/3698821. Online publication date: 20-Dec-2024
  • (2024) Detecting Outlier Segments in Uncertain Personal Trajectory Data. Computational Science and Its Applications – ICCSA 2024 (418-426). DOI: 10.1007/978-3-031-64608-9_28. Online publication date: 2-Jul-2024
  • (2023) Mobile Phone Data Feature Denoising for Expressway Traffic State Estimation. Sustainability 15:7 (5811). DOI: 10.3390/su15075811. Online publication date: 27-Mar-2023


Published In

ACM Transactions on Intelligent Systems and Technology, Volume 13, Issue 3
June 2022
415 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/3508465
Editor: Huan Liu

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 April 2022
Accepted: 01 May 2021
Revised: 01 April 2021
Received: 01 January 2021
Published in TIST Volume 13, Issue 3


Author Tags

  1. Trajectory data
  2. data detection
  3. data repairing
  4. data cleaning

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • National Natural Science Foundation of China
  • Natural Science Foundation of Jiangsu Province of China


