survey

Context-aware Big Data Quality Assessment: A Scoping Review

Published: 22 August 2023

Abstract

Data quality is a measure of how fit data are for their intended use. Poor data quality leads to inadequate, inconsistent, and erroneous decisions that can increase computational cost, reduce profits, and drive customer churn. Thus, data quality is crucial for researchers and industry practitioners.
Different factors drive the assessment of data quality. Data context is deemed one of the key factors because real-world use cases vary widely across entities such as people and organizations. Data that suit one context (e.g., an organization's policy) may not be adequate for another. Hence, implementing a data quality assessment solution that works across different contexts is challenging.
Traditional technologies for data quality assessment have reached maturity, and existing solutions can resolve most quality issues. In these solutions, the data context is defined as validation rules applied within the ETL (extract, transform, load) process, i.e., the data warehousing process. For big data, in contrast to traditional data quality management, it is impossible to specify all the data semantics beforehand. Context-aware data quality rules are needed to detect semantic errors in massive amounts of heterogeneous data generated at high speed. Many researchers tackle the quality issues of big data, but each defines the data context from a specific standpoint. Although data quality is a longstanding research topic in academia and industry, it remains an open issue, and the advent of big data has made data quality assessment more challenging than ever.
This article provides a scoping review of existing context-aware data quality assessment solutions, starting with big data quality solutions in general and then covering context-aware solutions. The strengths and weaknesses of these solutions are outlined and discussed. The survey showed that none of the existing data quality assessment solutions could guarantee context awareness while also handling big data. Notably, each solution dealt with only a partial view of the context. We compared the existing quality models and solutions to reach a comprehensive view of the aspects of context awareness in data quality assessment. This led us to a set of recommendations, framed as a methodological framework, to shape the design and implementation of any context-aware data quality service for big data. Open challenges are then identified and discussed.
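
To make the idea of context-dependent validation rules concrete, the following is a minimal illustrative sketch and not a method taken from the article. It assumes that a "context" can be approximated by a named set of rules; the context names (`billing`, `analytics`), the rule names, and the `assess` helper are all hypothetical. It only shows how the same records can receive different quality scores depending on the context in which they are assessed.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass(frozen=True)
class Rule:
    """A named validation rule; `check` returns True when a record passes."""
    name: str
    check: Callable[[Record], bool]


def _amount_non_negative(record: Record) -> bool:
    amount = record.get("amount")
    return isinstance(amount, (int, float)) and amount >= 0


# Hypothetical contexts: each one selects the rules that matter for its use case.
RULES_BY_CONTEXT: Dict[str, List[Rule]] = {
    "billing": [
        Rule("email_present", lambda r: bool(r.get("email"))),
        Rule("amount_non_negative", _amount_non_negative),
    ],
    "analytics": [
        # In this assumed context a missing e-mail address is not a defect.
        Rule("amount_non_negative", _amount_non_negative),
    ],
}


def assess(records: List[Record], context: str) -> Dict[str, float]:
    """Return, per rule of the given context, the fraction of passing records."""
    scores: Dict[str, float] = {}
    for rule in RULES_BY_CONTEXT[context]:
        passed = sum(1 for record in records if rule.check(record))
        # An empty dataset is treated as trivially passing (an assumption).
        scores[rule.name] = passed / len(records) if records else 1.0
    return scores


if __name__ == "__main__":
    sample = [
        {"email": "a@example.org", "amount": 10},
        {"email": "", "amount": 5},
        {"email": None, "amount": -1},
    ]
    print(assess(sample, "billing"))
    print(assess(sample, "analytics"))
```

In this sketch the same three records are scored against two rules in the assumed billing context but against only one in the assumed analytics context, which is the kind of divergence the abstract attributes to data context. A rule set fixed in advance like this one is closer to the ETL-style validation described above than to a solution for big data, where the semantics cannot all be specified beforehand.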

    Published In

    Journal of Data and Information Quality, Volume 15, Issue 3
    September 2023
    326 pages
    ISSN:1936-1955
    EISSN:1936-1963
    DOI:10.1145/3611329

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 22 August 2023
    Online AM: 13 June 2023
    Accepted: 08 May 2023
    Revised: 23 March 2023
    Received: 16 April 2022
    Published in JDIQ Volume 15, Issue 3

    Author Tags

    1. Data quality
    2. big data
    3. context awareness
    4. data quality assessment

    Qualifiers

    • Survey

    Article Metrics

    • Downloads (last 12 months): 629
    • Downloads (last 6 weeks): 79
    Reflects downloads up to 15 Oct 2024.

    Citations

    Cited By

    • (2024) cuallee: A Python package for data quality checks across multiple DataFrame APIs. Journal of Open Source Software 9:98 (6684). DOI: 10.21105/joss.06684. Online publication date: Jun-2024
    • (2024) Current Challenges of Big Data Quality Management in Big Data Governance: A Literature Review. Advances in Intelligent Computing Techniques and Applications (160-172). DOI: 10.1007/978-3-031-59711-4_15. Online publication date: 30-Jun-2024
    • (2023) EFwork: An Efficient Framework for Constructing a Malware Knowledge Graph. 2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom) (1258-1265). DOI: 10.1109/TrustCom60117.2023.00171. Online publication date: 1-Nov-2023
    • (2023) Addressing the Velocity Challenge of Big Data in Radiation Pollution Monitoring: Implementation and Demonstration. 2023 IEEE 4th International Multidisciplinary Conference on Engineering Technology (IMCET) (104-109). DOI: 10.1109/IMCET59736.2023.10368261. Online publication date: 12-Dec-2023
    • (2023) CTXDQ: An Automated Context-Driven Data Quality Assessment. 2023 IEEE 4th International Multidisciplinary Conference on Engineering Technology (IMCET) (32-37). DOI: 10.1109/IMCET59736.2023.10368231. Online publication date: 12-Dec-2023
    • (2023) A novel approach to assess and improve syntactic interoperability in data integration. Information Processing and Management: an International Journal 60:6. DOI: 10.1016/j.ipm.2023.103522. Online publication date: 1-Nov-2023
