research-article

Stream Data Cleaning under Speed and Acceleration Constraints

Published: 28 September 2021

Abstract

Stream data are often dirty, for example, owing to unreliable sensor readings or erroneous extraction of stock prices. Most stream data cleaning approaches employ a smoothing filter, which may seriously alter the data without preserving the original information. We argue that cleaning should avoid changing data that are originally correct (clean), a.k.a. the minimum modification rule in data cleaning. To capture the knowledge about what is clean, we consider the widely existing constraints on the speed and acceleration of data changes, such as fuel consumption per hour, the daily limit of stock prices, or the top speed and acceleration of a car. Guided by these semantic constraints, in this article we propose a constraint-based approach for cleaning stream data. Notably, existing data repair techniques clean a sequence of data as a whole and fail to support stream computation. To support streams, we have to relax the global optimum over the entire sequence to the local optimum in a window. In contrast to the commonly observed NP-hardness of general data repairing problems, our major contributions include (1) a polynomial-time algorithm for the global optimum, (2) a linear-time algorithm towards the local optimum under an efficient median-based solution, and (3) experiments on real datasets demonstrating that our method achieves significantly lower L1 error than existing approaches such as smoothing filters.
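
To make the notion of speed and acceleration constraints concrete, the following is a minimal sketch that only detects violations in a timestamped sequence; it is not the repair algorithm of the article. The function name `find_violations`, the bound parameters `s_min`, `s_max`, `a_min`, `a_max`, and the choice to measure acceleration as the change in speed between two consecutive steps divided by the later step's duration are all assumptions made for illustration.

```python
from typing import List, Tuple


def find_violations(
    points: List[Tuple[float, float]],
    s_min: float,
    s_max: float,
    a_min: float,
    a_max: float,
) -> List[int]:
    """Return the indices i whose step (i-1 -> i) breaks a speed or
    acceleration bound. `points` holds (timestamp, value) pairs with
    strictly increasing timestamps. Illustrative only: this detects
    violations and does not repair them."""
    violations = []
    prev_speed = None
    for i in range(1, len(points)):
        (t0, x0), (t1, x1) = points[i - 1], points[i]
        dt = t1 - t0
        # Speed: rate of value change between consecutive points.
        speed = (x1 - x0) / dt
        bad = not (s_min <= speed <= s_max)
        if prev_speed is not None:
            # Acceleration (assumed definition): change in speed over
            # the duration of the current step.
            accel = (speed - prev_speed) / dt
            bad = bad or not (a_min <= accel <= a_max)
        if bad:
            violations.append(i)
        prev_speed = speed
    return violations


# A spike at index 3 breaks the speed bound on the way up and on the way down.
series = [(0, 0.0), (1, 1.0), (2, 2.0), (3, 10.0), (4, 4.0)]
print(find_violations(series, -2.0, 2.0, -1.0, 1.0))  # [3, 4]
```

A detector like this flags both the jump into the spike and the jump out of it, which illustrates why a repair step guided by the minimum modification rule would aim to change only the single spiked value rather than smooth its neighbors.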

    Published In

    ACM Transactions on Database Systems  Volume 46, Issue 3
    September 2021
    172 pages
    ISSN:0362-5915
    EISSN:1557-4644
    DOI:10.1145/3481695
    Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 28 September 2021
    Accepted: 01 May 2021
    Revised: 01 April 2021
    Received: 01 February 2020
    Published in TODS Volume 46, Issue 3

    Author Tags

    1. Data repairing
    2. speed constraints
    3. acceleration constraints

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • National Key Research and Development Plan
    • National Natural Science Foundation of China
    • MIIT High Quality Development Program 2020, NSF

    Article Metrics

    • Downloads (last 12 months): 52
    • Downloads (last 6 weeks): 5

    Reflects downloads up to 03 Oct 2024.

    Cited By

    • (2024) Optimizing Time Series Queries with Versions. Proceedings of the ACM on Management of Data 2:3 (1-27). DOI: 10.1145/3654962. Online publication date: 30-May-2024
    • (2024) Time Series Data Cleaning Under Expressive Constraints on Both Rows and Columns. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3682-3695). DOI: 10.1109/ICDE60146.2024.00283. Online publication date: 13-May-2024
    • (2024) TSDDISCOVER: Discovering Data Dependency for Time Series Data. 2024 IEEE 40th International Conference on Data Engineering (ICDE) (3668-3681). DOI: 10.1109/ICDE60146.2024.00282. Online publication date: 13-May-2024
    • (2023) TsQuality: Measuring Time Series Data Quality in Apache IoTDB. Proceedings of the VLDB Endowment 16:12 (3982-3985). DOI: 10.14778/3611540.3611601. Online publication date: 1-Aug-2023
    • (2023) DuaFace. Pattern Recognition Letters 167:C (25-29). DOI: 10.1016/j.patrec.2023.01.013. Online publication date: 1-Mar-2023
    • (2022) IoT data cleaning techniques: A survey. Intelligent and Converged Networks 3:4 (325-339). DOI: 10.23919/ICN.2022.0026. Online publication date: Dec-2022
    • (2021) An Analysis of Data Processing for Big Data Analytics. Journal of Computing and Natural Science (130-138). DOI: 10.53759/181X/JCNS202101019. Online publication date: 5-Oct-2021
