Ontology-Based Data Quality Management for Data Streams

Published: 06 October 2016

Abstract

Data Stream Management Systems (DSMS) provide real-time data processing in an effective way, but there is always a tradeoff between data quality (DQ) and performance. We propose an ontology-based data quality framework for relational DSMS that includes DQ measurement and monitoring in a transparent, modular, and flexible way. We follow a threefold approach that takes the characteristics of relational data stream management for DQ metrics into account. While (1) Query Metrics respect changes in data quality due to query operations, (2) Content Metrics allow the semantic evaluation of data in the streams. Finally, (3) Application Metrics allow easy user-defined computation of data quality values to account for application specifics. Additionally, a quality monitor allows us to observe data quality values and take counteractions to balance data quality and performance. The framework has been designed along a DQ management methodology suited for data streams. It has been evaluated in the domains of transportation systems and health monitoring.
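
The three metric types and the quality monitor named in the abstract can be pictured with a small sketch. The abstract gives no formulas or interfaces, so everything here is an assumption made for illustration: the `StreamTuple` type, the function names, the sampling rule for completeness, the speed range, and the threshold check are hypothetical and are not the framework's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names and rules below are hypothetical illustrations, not the paper's API.


@dataclass
class StreamTuple:
    values: Dict[str, float]
    # Maps a DQ dimension name to a score in [0, 1].
    quality: Dict[str, float] = field(default_factory=dict)


def query_metric_after_sampling(completeness: float, sampling_rate: float) -> float:
    """Query metric (assumed rule): a sampling operator keeps a fraction of the
    tuples, so completeness is scaled by that fraction."""
    return completeness * sampling_rate


def content_metric_plausible_speed(t: StreamTuple, max_speed: float = 250.0) -> float:
    """Content metric (assumed rule): 1.0 if the speed value lies in a plausible
    range, otherwise 0.0."""
    speed = t.values.get("speed")
    return 1.0 if speed is not None and 0.0 <= speed <= max_speed else 0.0


def application_metric(t: StreamTuple, fn: Callable[[StreamTuple], float]) -> float:
    """Application metric: delegate to a user-defined function."""
    return fn(t)


def monitor_needs_action(window: List[StreamTuple], dimension: str, threshold: float) -> bool:
    """Quality monitor (assumed policy): signal a counteraction when the mean
    score of a dimension over the window falls below the threshold."""
    scores = [t.quality[dimension] for t in window if dimension in t.quality]
    if not scores:
        return False
    return sum(scores) / len(scores) < threshold


# Usage: annotate two tuples with a content metric, then check the monitor.
window = [StreamTuple({"speed": 120.0}), StreamTuple({"speed": 400.0})]
for t in window:
    t.quality["plausibility"] = content_metric_plausible_speed(t)
alert = monitor_needs_action(window, "plausibility", threshold=0.8)
```

In this sketch the monitor only reports that a threshold was crossed. The counteraction itself, for example changing a sampling rate to trade performance for quality, is left out because the abstract does not specify it.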


Published In

Journal of Data and Information Quality, Volume 7, Issue 4
Challenge Papers and Regular Papers
October 2016
57 pages
ISSN:1936-1955
EISSN:1936-1963
DOI:10.1145/3006343
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 October 2016
Accepted: 01 July 2016
Revised: 01 June 2016
Received: 01 May 2015
Published in JDIQ Volume 7, Issue 4

Author Tags

  1. Data streams
  2. Data quality assessment
  3. Data quality control
  4. Ontologies

Qualifiers

  • Research-article
  • Research
  • Refereed
