Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article

Akane: Perplexity-Guided Time Series Data Cleaning

Published: 30 May 2024 Publication History

Abstract

Dirty data are prevalent in time series, such as energy consumption or stock data. Existing data cleaning algorithms present shortcomings in dirty data identification and unsatisfactory cleaning decisions. To handle these drawbacks, we leverage inherent recurrent patterns in time series, analogize them as fixed combinations in textual data, and incorporate the concept of perplexity. The cleaning problem is thus transformed to minimize the perplexity of the time series under a given cleaning cost, and we design a four-phase algorithmic framework to tackle this problem. To ensure the framework's feasibility, we also conduct a brief analysis of the impact of dirty data and devise an automatic budget selection strategy. Moreover, to make it more generic, we additionally introduce advanced solutions, including an ameliorative probability calculation method grounded in the homomorphic pattern aggregation and a greedy-based heuristic algorithm for resource savings. Experiments on 12 real-world datasets demonstrate the superiority of our methods.

Supplemental Material

MP4 File
Slides and Presentation video
PDF File
Slides and Presentation video

References

[1]
Adam Aboode. 2018. Anomaly detection in time series data based on holt-winters method.
[2]
Subutai Ahmad, Alexander Lavin, Scott Purdy, and Zuha Agha. 2017. Unsupervised real-time anomaly detection for streaming data. Neurocomputing, Vol. 262 (2017), 134--147.
[3]
Arthur Asuncion and David Newman. 2007. UCI machine learning repository.
[4]
Asif Iqbal Baba, Manfred Jaeger, Hua Lu, Torben Bach Pedersen, Wei-Shinn Ku, and Xike Xie. 2016. Learning-Based Cleansing for Indoor RFID Data. In SIGMOD Conference. ACM, 925--936.
[5]
Philip Bohannon, Michael Flaster, Wenfei Fan, and Rajeev Rastogi. 2005. A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. In SIGMOD Conference. ACM, 143--154.
[6]
Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jö rg Sander. 2000. LOF: Identifying Density-Based Local Outliers. In SIGMOD Conference. ACM, 93--104.
[7]
David R. Brillinger. 2001. Time series - data analysis and theory. Classics in applied mathematics, Vol. 36. SIAM.
[8]
Má rcia R. Cerioli, Lué rbio Faria, Talita O. Ferreira, and Fá bio Protti. 2011. A note on maximum independent sets and minimum clique partitions in unit disk graphs and penny graphs: complexity and approximation. RAIRO Theor. Informatics Appl., Vol. 45, 3 (2011), 331--346.
[9]
Stanley F Chen, Douglas Beeferman, and Roni Rosenfeld. 1998. Evaluation metrics for language models. (1998).
[10]
Wai-Ki Ching and Michael K Ng. 2006. Markov chains. Models, algorithms and applications (2006).
[11]
Xu Chu, Ihab F. Ilyas, and Paolo Papotti. 2013. Holistic data cleaning: Putting violations into context. In ICDE. IEEE Computer Society, 458--469.
[12]
Xu Chu, John Morcos, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti, Nan Tang, and Yin Ye. 2015. KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing. In SIGMOD Conference. ACM, 1247--1261.
[13]
Jerome T. Connor, R. Douglas Martin, and Les E. Atlas. 1994. Recurrent neural networks and robust time series prediction. IEEE Trans. Neural Networks, Vol. 5, 2 (1994), 240--254.
[14]
Adrian Dumitrescu and Já nos Pach. 2011. Minimum Clique Partition in Unit Disk Graphs. Graphs Comb., Vol. 27, 3 (2011), 399--411.
[15]
Len Feremans, Vincent Vercruyssen, Boris Cule, Wannes Meert, and Bart Goethals. 2019. Pattern-Based Anomaly Detection in Mixed-Type Time Series. In ECML/PKDD (1) (Lecture Notes in Computer Science, Vol. 11906). Springer, 240--256.
[16]
Fei Gao, Shaoxu Song, and Jianmin Wang. 2021. Time Series Data Cleaning under Multi-Speed Constraints. Int. J. Softw. Informatics, Vol. 11, 1 (2021), 29--54.
[17]
Ge Gao, Qitong Gao, Xi Yang, Miroslav Pajic, and Min Chi. 2022. A Reinforcement Learning-Informed Pattern Mining Framework for Multivariate Time Series Classification. In IJCAI. ijcai.org, 2994--3000.
[18]
Everette S Gardner Jr. 2006. Exponential smoothing: The state of the art-Part II. International journal of forecasting, Vol. 22, 4 (2006), 637--666.
[19]
Lukasz Golab, Howard J. Karloff, Flip Korn, Avishek Saha, and Divesh Srivastava. 2009. Sequential Dependencies. Proc. VLDB Endow., Vol. 2, 1 (2009), 574--585.
[20]
Niklas Heim and James E. Avery. 2019. Adaptive Anomaly Detection in Chaotic Time Series with a Spatially Aware Echo State Network. CoRR, Vol. abs/1909.01709 (2019).
[21]
David J. Hill and Barbara S. Minsker. 2010. Anomaly detection in streaming environmental sensor data: A data-driven modeling approach. Environ. Model. Softw., Vol. 25, 9 (2010), 1014--1022.
[22]
Van Long Ho, Nguyen Ho, and Torben Bach Pedersen. 2021. Efficient Temporal Pattern Mining in Big Time Series Using Mutual Information. Proc. VLDB Endow., Vol. 15, 3 (2021), 673--685.
[23]
Van Long Ho, Nguyen Ho, and Torben Bach Pedersen. 2023. Mining Seasonal Temporal Patterns in Time Series. In ICDE. IEEE, 2249--2261.
[24]
Bryan Hooi, Shenghua Liu, Asim Smailagic, and Christos Faloutsos. 2017. BeatLex: Summarizing and Forecasting Time Series with Patterns. In ECML/PKDD (2) (Lecture Notes in Computer Science, Vol. 10535). Springer, 3--19.
[25]
Kyle Hundman, Valentino Constantinou, Christopher Laporte, Ian Colwell, and Tom Sö derströ m. 2018. Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding. In KDD. ACM, 387--395.
[26]
Alexander Ihler, Jon Hutchins, and Padhraic Smyth. 2006. Adaptive event detection with time-varying poisson processes. In KDD. ACM, 207--216.
[27]
Shawn R. Jeffery, Minos N. Garofalakis, and Michael J. Franklin. 2006. Adaptive Cleaning for RFID Data Streams. In VLDB. ACM, 163--174.
[28]
Fred Jelinek, Robert L Mercer, Lalit R Bahl, and James K Baker. 1977. Perplexity-a measure of the difficulty of speech recognition tasks. The Journal of the Acoustical Society of America, Vol. 62, S1 (1977), S63--S63.
[29]
Dasheng Lee, Chih-Wei Lai, Kuo-Kai Liao, and Jia-Wei Chang. 2021. Artificial intelligence assisted false alarm detection and diagnosis system development for reducing maintenance cost of chillers at the data centre. Journal of Building Engineering, Vol. 36 (2021), 102110.
[30]
Peng Li, Xi Rao, Jennifer Blase, Yue Zhang, Xu Chu, and Ce Zhang. 2021. CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks. In ICDE. IEEE, 13--24.
[31]
Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, and Divesh Srivastava. 2012. Truth Finding on the Deep Web: Is the Problem Solved? Proc. VLDB Endow., Vol. 6, 2 (2012), 97--108.
[32]
Jessica Lin, Eamonn J. Keogh, Li Wei, and Stefano Lonardi. 2007. Experiencing SAX: a novel symbolic representation of time series. Data Min. Knowl. Discov., Vol. 15, 2 (2007), 107--144.
[33]
Kumar G Ranjan, Debesh S Tripathy, B Rajanarayan Prusty, and Debashisha Jena. 2021. An improved sliding window prediction-based outlier detection and correction for volatile time-series. International Journal of Numerical Modelling: Electronic Networks, Devices and Fields, Vol. 34, 1 (2021), e2816.
[34]
Faraz Rasheed, Peter Peng, Reda Alhajj, and Jon G. Rokne. 2009. Fourier Transform Based Spatial Outlier Mining. In IDEAL (Lecture Notes in Computer Science, Vol. 5788). Springer, 317--324.
[35]
Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic Data Repairs with Probabilistic Inference. Proc. VLDB Endow., Vol. 10, 11 (2017), 1190--1201.
[36]
Hansheng Ren, Bixiong Xu, Yujing Wang, Chao Yi, Congrui Huang, Xiaoyu Kou, Tony Xing, Mao Yang, Jie Tong, and Qi Zhang. 2019. Time-Series Anomaly Detection Service at Microsoft. In KDD. ACM, 3009--3017.
[37]
Sebastian Schmidl, Phillip Wenig, and Thorsten Papenbrock. 2022. Anomaly Detection in Time Series: A Comprehensive Evaluation. Proc. VLDB Endow., Vol. 15, 9 (2022), 1779--1797.
[38]
Pavel Senin, Jessica Lin, Xing Wang, Tim Oates, Sunil Gandhi, Arnold P. Boedihardjo, Crystal Chen, and Susan Frankenstein. 2015. Time series anomaly discovery with grammar-based compression. In EDBT. OpenProceedings.org, 481--492.
[39]
Shaoxu Song, Chunping Li, and Xiaoquan Zhang. 2015a. Turn Waste into Wealth: On Simultaneous Clustering and Cleaning over Dirty Data. In KDD. ACM, 1115--1124.
[40]
Shaoxu Song and Aoqian Zhang. 2020. IoT Data Quality. In CIKM. ACM, 3517--3518.
[41]
Shaoxu Song, Aoqian Zhang, Jianmin Wang, and Philip S. Yu. 2015b. SCREEN: Stream Data Cleaning under Speed Constraints. In SIGMOD Conference. ACM, 827--841.
[42]
Yunxiang Su, Yikun Gong, and Shaoxu Song. 2023. Time Series Data Validity. Proc. ACM Manag. Data, Vol. 1, 1 (2023), 85:1--85:26.
[43]
Yu Sun, Shaoxu Song, Chen Wang, and Jianmin Wang. 2020. Swapping Repair for Misplaced Attribute Values. In ICDE. IEEE, 721--732.
[44]
Kenneth J. Supowit. 1981. Topics in Computational Geometry. Ph.,D. Dissertation. University of Illinois Urbana-Champaign, USA.
[45]
Xi Wang and Chen Wang. 2020. Time Series Data Cleaning: A Survey. IEEE Access, Vol. 8 (2020), 1866--1881.
[46]
Mohamed Yakout, Laure Berti-É quille, and Ahmed K. Elmagarmid. 2013. Don't be SCAREd: use SCalable Automatic REpairing with maximal likelihood and bounded changes. In SIGMOD Conference. ACM, 553--564.
[47]
Kenji Yamanishi and Jun'ichi Takeuchi. 2002. A unifying framework for detecting outliers and change points from non-stationary time series data. In KDD. ACM, 676--681.
[48]
Jie Yang, Alisa Smirnova, Dingqi Yang, Gianluca Demartini, Yuan Lu, and Philippe Cudré -Mauroux. 2019. Scalpel-CD: Leveraging Crowdsourcing and Deep Probabilistic Modeling for Debugging Noisy Training Data. In WWW. ACM, 2158--2168.
[49]
Zhuoran Yu and Xu Chu. 2019. PIClean: A Probabilistic and Interactive Data Cleaning System. In SIGMOD Conference. ACM, 2021--2024.
[50]
Aoqian Zhang, Shaoxu Song, and Jianmin Wang. 2016. Sequential Data Cleaning: A Statistical Approach. In SIGMOD Conference. ACM, 909--924.
[51]
Aoqian Zhang, Shaoxu Song, Jianmin Wang, and Philip S. Yu. 2017. Time Series Data Cleaning: From Anomaly Detection to Anomaly Repairing. Proc. VLDB Endow., Vol. 10, 10 (2017), 1046--1057. io

Index Terms

  1. Akane: Perplexity-Guided Time Series Data Cleaning

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image Proceedings of the ACM on Management of Data
    Proceedings of the ACM on Management of Data  Volume 2, Issue 3
    SIGMOD
    June 2024
    1953 pages
    EISSN:2836-6573
    DOI:10.1145/3670010
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 May 2024
    Published in PACMMOD Volume 2, Issue 3

    Permissions

    Request permissions for this article.

    Author Tags

    1. data cleaning
    2. perplexity
    3. recurrent patterns
    4. time series

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program of China

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • 0
      Total Citations
    • 175
      Total Downloads
    • Downloads (Last 12 months)175
    • Downloads (Last 6 weeks)54
    Reflects downloads up to 03 Oct 2024

    Other Metrics

    Citations

    View Options

    Get Access

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media