research-article

BIGQA: Declarative Big Data Quality Assessment

Published: 22 August 2023

Abstract

In the big data domain, data quality assessment operations are often complex and must be implementable in a distributed and timely manner. This article aims to generalize quality assessment operations by introducing BIGQA, a new ISO-based declarative data quality assessment framework. BIGQA is a flexible solution that supports data quality assessment in different domains and contexts. It helps data domain experts and data management specialists plan and execute big data quality assessment operations at any phase of the data life cycle. This work implements BIGQA to demonstrate that it can produce customized data quality reports while running efficiently on parallel or distributed computing frameworks. BIGQA generates data quality assessment plans from straightforward operators that are designed to handle big data and to guarantee a high degree of parallelism when executed. Moreover, it supports incremental data quality assessment, which avoids reading the whole dataset each time an assessment is required. The results were validated using radiation wireless sensor data and Stack Overflow users’ data to show that the framework can be applied in different contexts. The experiments show a 71% performance improvement on a 1 GB flat file on a single processing machine compared with a non-parallel application, and a 75% performance improvement on a 25 GB flat file in a distributed environment compared with a non-distributed application.
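
To illustrate why incremental assessment can avoid rereading the whole dataset, the following is a minimal sketch and not BIGQA's actual operators or API. It assumes a simple completeness measure (the share of non-missing values) and hypothetical names (`CompletenessState`, `assess_batch`). The running counts are mergeable, so a newly arrived batch can be folded into the stored state, and partitions could be assessed in parallel and merged in the same way.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class CompletenessState:
    """Mergeable running counts for one column (hypothetical structure)."""
    total: int = 0
    non_missing: int = 0

    def merge(self, other: "CompletenessState") -> "CompletenessState":
        # Counts are additive, so states from separate batches or
        # partitions can be combined without revisiting the raw data.
        return CompletenessState(
            self.total + other.total,
            self.non_missing + other.non_missing,
        )

    @property
    def ratio(self) -> Optional[float]:
        # Completeness is undefined when no values have been seen.
        return self.non_missing / self.total if self.total else None


def assess_batch(values: Iterable[Optional[str]]) -> CompletenessState:
    """Count total and non-missing values in a single batch."""
    total = 0
    non_missing = 0
    for value in values:
        total += 1
        # Assumption: None and blank strings both count as missing.
        if value is not None and value.strip() != "":
            non_missing += 1
    return CompletenessState(total, non_missing)


# Initial assessment over the data available so far.
initial = assess_batch(["12.5", None, "13.1", ""])

# Later, only the new batch is read; the stored state is reused.
update = assess_batch(["12.9", "13.4"])
combined = initial.merge(update)
```

In this sketch only the two integer counts need to be persisted between runs. Measures that are not additive in this way would need a different state representation.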

    Published In

    Journal of Data and Information Quality, Volume 15, Issue 3
    September 2023
    326 pages
    ISSN:1936-1955
    EISSN:1936-1963
    DOI:10.1145/3611329

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 August 2023
    Online AM: 13 June 2023
    Accepted: 12 May 2023
    Revised: 16 April 2023
    Received: 16 April 2022
    Published in JDIQ Volume 15, Issue 3

    Author Tags

    1. Declarative framework
    2. quality assessment
    3. big data
    4. data quality

    Qualifiers

    • Research-article
