survey
Open access

Big Data Systems: A Software Engineering Perspective

Published: 28 September 2020
  • Abstract

    Big Data Systems (BDSs) are an emerging class of scalable software technologies in which massive amounts of heterogeneous data are gathered from multiple sources, managed, analyzed (in batch, stream, or hybrid fashion), and served to end-users and external applications. Such systems pose specific challenges in all phases of the software development lifecycle and can become very complex as data, technologies, and target value evolve over time. Consequently, many organizations and enterprises have found it difficult to adopt BDSs. In this article, we provide insight into three major software engineering activities in the context of BDSs, as well as the choices made to tackle them in state-of-the-art research and industry efforts. These activities are requirements engineering, designing and constructing software to meet the specified requirements, and software/data quality assurance. We also outline some open challenges in developing effective BDSs that need attention from both researchers and practitioners.

    References

    [1]
    Daniel Abadi, Anastasia Ailamaki, David Andersen, Peter Bailis, Magdalena Balazinska, Philip Bernstein, Peter Boncz, Surajit Chaudhuri, Alvin Cheung, AnHai Doan, et al. 2020. The seattle report on database research. ACM SIGMOD Rec. 48, 4 (2020), 44--53.
    [2]
    Noufa Al-Najran. 2015. A Requirements Specification Framework for Big Data Collection and Capture. Master’s thesis. Prince Sultan University, Riyadh.
    [3]
    Noufa Al-Najran and Ajantha Dahanayake. 2015. A requirements specification framework for Big Data collection and capture. In East European Conference on Advances in Databases and Information Systems. Springer, 12--19.
    [4]
    Ibrahim Alhassan, David Sammon, and Mary Daly. 2016. Data governance activities: An analysis of the literature. J. Dec. Syst. 25, sup1 (2016), 64--75.
    [5]
    Saleema Amershi, Andrew Begel, Christian Bird, Robert DeLine, Harald Gall, Ece Kamar, Nachiappan Nagappan, Besmira Nushi, and Thomas Zimmermann. 2019. Software engineering for machine learning: A case study. In Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice. IEEE Press, 291--300.
    [6]
    Arcitura. 2017. Big Data Patterns and Mechanisms. Retrieved July 23, 2019 from http://www.bigdatapatterns.org/.
    [7]
    Darlan Arruda. 2018. Requirements engineering in the context of Big Data applications. ACM SIGSOFT Softw. Eng. Not. 43, 1 (2018), 1--6.
    [8]
    Darlan Arruda and Nazim H. Madhavji. 2017. Towards a requirements engineering artefact model in the context of Big Data software development projects. In Proceedings of the International Conference on Big Data. IEEE, 2314--2319.
    [9]
    Darlan Arruda and Nazim H. Madhavji. 2019. QualiBD: A tool for modelling quality requirements for Big Data applications. In Proceedings of the International Conference on Big Data (Big Data’19). IEEE, 5977--5979.
    [10]
    Florian Auer and Michael Felderer. 2018. Shifting quality assurance of machine learning algorithms to live systems.
    [11]
    Florian Auer and Michael Felderer. 2019. Addressing data quality problems with metamorphic data relations. In Proceedings of the 4th International Workshop on Metamorphic Testing. IEEE Press, 76--83.
    [12]
    Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. 2002. Models and issues in data stream systems. In Proceedings of the 21th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, 1--18.
    [13]
    Wolf-Tilo Balke. 2012. Introduction to information extraction: Basic notions and current trends. Datenb.-Spektr. 12, 2 (2012), 81--88.
    [14]
    Earl T. Barr, Mark Harman, Phil McMinn, Muzammil Shahbaz, and Shin Yoo. 2014. The oracle problem in software testing: A survey. IEEE Trans. Softw. Eng. 41, 5 (2014), 507--525.
    [15]
    Andreas Bauer and Holger Günzel. 2013. Data-Warehouse-Systeme: Architektur, Entwicklung, Anwendung. dpunkt.verlag.
    [16]
    Marcello M. Bersani, Francesco Marconi, Matteo Rossi, and Madalina Erascu. 2016. A tool for verification of Big-Data applications. In Proceedings of the 2nd International Workshop on Quality-Aware DevOps. ACM, 44--45.
    [17]
    Jeff Bertolucci. 2013. Big Data analytics: Descriptive vs. predictive vs. prescriptive. Retrieved from https://www.informationweek.com/big-data/big-data-analytics/big-data-analytics-descriptive-vs-predictive-vs-prescriptive/d/d-id/1113279.
    [18]
    Tobias Bleifuß, Leon Bornemann, Theodore Johnson, Dmitri V. Kalashnikov, Felix Naumann, and Divesh Srivastava. 2018. Exploring change: A new dimension of data analytics. Proc. VLDB Endow. 12, 2 (2018), 85--98.
    [19]
    Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin. 2014. Summingbird: A framework for integrating batch and online mapreduce computations. Proc. VLDB Endow. 7, 13 (2014), 1441--1451.
    [20]
    Paul Buitelaar, Philipp Cimiano, and Bernardo Magnini. 2005. Ontology Learning from Text: Methods, Evaluation and Applications. Vol. 123. IOS Press.
    [21]
    Matteo Camilli. 2014. Formal verification problems in a Big Data world: Towards a mighty synergy. In Proceedings of the 36th International Conference on Software Engineering. ACM, 638--641.
    [22]
    Cinzia Cappiello, Marco Comuzzi, Florian Daniel, and Giovanni Meroni. 2019. Data quality control in blockchain applications. In Proceedings of the International Conference on Business Process Management. Springer, 166--181.
    [23]
    Cinzia Cappiello, Walter Samá, and Monica Vitali. 2018. Quality awareness for a successful Big Data exploitation. In Proceedings of the 22nd International Database Engineering 8 Applications Symposium. 37--44.
    [24]
    Otávio Carvalho, Eduardo Roloff, and Philippe O. A. Navaux. 2017. A distributed stream processing based architecture for IoT smart grids monitoring. In Proceedings of the 10th International Conference on Utility and Cloud Computing. ACM, 9--14.
    [25]
    Rick Cattell. 2011. Scalable SQL and NoSQL data stores. ACM SIGMOD Rec. 39, 4 (2011), 12--27.
    [26]
    Paolo Ceravolo, Antonia Azzini, Marco Angelini, Tiziana Catarci, Philippe Cudré-Mauroux, Ernesto Damiani, Alexandra Mazak, Maurice Van Keulen, Mustafa Jarrar, Giuseppe Santucci, et al. 2018. Big Data semantics. J. Data Semant. 7, 2 (2018), 65--85.
    [27]
    Hong-Mei Chen, Rick Kazman, and Serge Haziyev. 2016. Agile Big Data analytics development: An architecture-centric approach. In Proceedings of the 49th Hawaii International Conference on System Sciences. IEEE, 5378--5387.
    [28]
    Hong-Mei Chen, Rick Kazman, Serge Haziyev, and Olha Hrytsay. 2015. Big Data system development: An embedded case study with a global outsourcing firm. In Proceedings of the 1st International Workshop on Big Data Software Engineering. IEEE, 44--50.
    [29]
    Tsong Y. Chen, Shing C. Cheung, and Shiu Ming Yiu. 1998. Metamorphic Testing: A New Approach for Generating Next Test Cases. Technical Report HKUST-CS98-01. Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong.
    [30]
    Tsong Yueh Chen, D. H. Huang, T. H. Tse, and Zhi Quan Zhou. 2004. Case studies on the selection of useful relations in metamorphic testing. In Proceedings of the 4th Ibero-American Symposium on Software Engineering and Knowledge Engineering (JIISIC 2004). Polytechnic University of Madrid, 569--583.
    [31]
    Tsong Yueh Chen, Fei-Ching Kuo, Huai Liu, Pak-Lok Poon, Dave Towey, T. H. Tse, and Zhi Quan Zhou. 2018. Metamorphic testing: A review of challenges and opportunities. ACM Comput. Surv. 51, 1 (2018), 4.
    [32]
    Bin Cheng, Salvatore Longo, Flavio Cirillo, Martin Bauer, and Ernoe Kovacs. 2015. Building a Big Data platform for smart cities: Experience and lessons from santander. In Proceedings of the International Congress on Big Data. IEEE, 592--599.
    [33]
    Paul Clements and Len Bass. 2010. Relating Business Goals to Architecturally Significant Requirements for Software Systems. Technical Report. Carnegie-Mellon University, Software Eengineering Institute.
    [34]
    E. F. Codd, S. B. Codd, and C. T. Salley. 1993. Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate. E.F.Codd 8 Associates, Tech. Rep.
    [35]
    Carlos Costa and Maribel Yasmina Santos. 2016. Reinventing the energy bill in smart cities with NoSQL technologies. In Transactions on Engineering Technologies. Springer, 383--396.
    [36]
    Carlos E. Cuesta, Miguel A. Martínez-Prieto, and Javier D. Fernández. 2013. Towards an architecture for managing Big Semantic Data in Real-Time. In Software Architecture. Vol. 7957. Springer, Berlin, 45--53.
    [37]
    Carlo Curino, Hyun Jin Moon, Alin Deutsch, and Carlo Zaniolo. 2013. Automating the database schema evolution process. Int. J. VLDB 22, 1 (2013), 73--98.
    [38]
    Alfredo Cuzzocrea, Rim Moussa, and Gianni Vercelli. 2018. An innovative lambda-architecture-based data warehouse maintenance framework for effective and efficient near-real-time OLAP over Big Data. In Proceedings of the International Conference on Big Data. Springer, 149--165.
    [39]
    R. Davenport. 2019. Big Companies Are Embracing Analytics, But Most Still Don’t Have a Data-Driven Culture. Retrieved March 20, 2020 from https://hbr.org/2018/02/big-companies-are-embracing-analytics-but-most-still-dont-have-a-data-driven-culture.
    [40]
    Ali Davoudian. 2019. Helios: An adaptive and query workload-driven partitioning framework for distributed graph stores. In Proceedings of the SIGMOD International Conference on Management of Data. ACM, 1820--1822.
    [41]
    Ali Davoudian, Liu Chen, and Mengchi Liu. 2018. A survey on NoSQL stores. ACM Comput. Surv. 51, 2 (2018), 40.
    [42]
    Junhua Ding, Xiaojun Kang, and Xin-Hua Hu. 2017. Validating a deep learning framework by metamorphic testing. In Proceedings of the 2nd International Workshop on Metamorphic Testing (MET’17). IEEE, 28--34.
    [43]
    AnHai Doan, Alon Halevy, and Zachary Ives. 2012. Principles of Data Integration. Morgan Kaufmann.
    [44]
    Xin Luna Dong and Divesh Srivastava. 2015. Big Data Integration. Synthesis Lectures on Data Management. Morgan 8 Claypool.
    [45]
    Xiaoning Du, Xiaofei Xie, Yi Li, Lei Ma, Jianjun Zhao, and Yang Liu. 2018. DeepCruiser: Automated guided testing for stateful deep learning systems. arXiv preprint arXiv:1812.05339 (2018).
    [46]
    Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magda Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stan Zdonik. 2015. The BigDAWG polystore system. ACM SIGMOD Rec. 44, 2 (2015), 11--16.
    [47]
    A. D. Duncan. 2014. Focus on the ‘Three Vs’ of Big Data Analytics: Ariability, Veracity and Value. no. 25-11-2015, pp. To drive better analytic outcomes, business leader, 2016. [Online]. Available at https://www.gartner.com/doc/2921417/focus-vs-big-data-analytics.
    [48]
    Lisa Ehrlinger, Elisa Rusz, and Wolfram Wöß. 2019. A survey of data quality measurement and monitoring tools. arXiv preprint arXiv:1907.08138 (2019).
    [49]
    Ahmed K. Elmagarmid, Marek Rusinkiewicz, Amit Sheth, and Amit Sheth. 1999. Management of Heterogeneous and Autonomous Database Systems. Morgan Kaufmann.
    [50]
    Aaron Elmore, Jennie Duggan, Mike Stonebraker, Magdalena Balazinska, Ugur Cetintemel, Vijay Gadepally, Jeffrey Heer, Bill Howe, Jeremy Kepner, Tim Kraska, et al. 2015. A demonstration of the BigDAWG polystore system. Proc. VLDB Endow. 8, 12 (2015), 1908--1911.
    [51]
    Hanif Eridaputra, Bayu Hendradjaya, and Wikan Danar Sunindyo. 2014. Modeling the requirements for Big Data application using goal oriented approach. In Proceedings of the International Conference on Data and Software Engineering. IEEE, 1--6.
    [52]
    Wenfei Fan, Xibei Jia, Jianzhong Li, and Shuai Ma. 2009. Reasoning about record matching rules. Proc. VLDB Endow. 2, 1 (2009), 407--418.
    [53]
    Raul Castro Fernandez, Peter R. Pietzuch, Jay Kreps, Neha Narkhede, Jun Rao, Joel Koshy, Dong Lin, Chris Riccomini, and Guozhang Wang. 2015. Liquid: Unifying nearline and offline Big Data integration. In Proceedings of the Conference on Innovative Data Systems Research (CIDR’15).
    [54]
    Felix Gessert, Michael Schaarschmidt, Wolfram Wingerath, Erik Witt, Eiko Yoneki, and Norbert Ritter. 2017. Quaestor: Query web caching for database-as-a-service providers. Proc. VLDB Endow. 10, 12 (2017), 1670--1681.
    [55]
    Corinna Giebler, Christoph Stach, Holger Schwarz, and Bernhard Mitschang. 2018. BRAID-A hybrid processing architecture for Big Data. In Proceedings of the International Conference on Data Science, Technology, and Applications (DATA’18). 294--301.
    [56]
    Behzad Golshan, Alon Halevy, George Mihaila, and Wang-Chiew Tan. 2017. Data integration: After the teenage years. In Proceedings of the 36th SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems. ACM, 101--106.
    [57]
    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672--2680.
    [58]
    Ian Gorton and John Klein. 2015. Distribution, data, deployment: Software architecture convergence in Big Data systems. IEEE Softw. 32, 3 (2015), 78--85.
    [59]
    Christoph Gröger, Holger Schwarz, and Bernhard Mitschang. 2014. Prescriptive analytics for recommendation-based business process optimization. In Proceedings of the International Conference on Business Information Systems. Springer, 25--37.
    [60]
    Rihan Hai, Sandra Geisler, and Christoph Quix. 2016. Constance: An intelligent data lake system. In Proceedings of the SIGMOD International Conference on Management of Data. 2097--2100.
    [61]
    Alon Halevy, Anand Rajaraman, and Joann Ordille. 2006. Data integration: The teenage years. In Proceedings of the 32nd International Conference on Very Large Data Bases. VLDB Endowment, 9--16.
    [62]
    Sang Hun Han, Kyoung Ok Kim, Eun Jong Cha, Kyung Ah Kim, and Ho Sun Shon. 2017. System framework for cardiovascular disease prediction based on Big Data technology. Symmetry 9, 12 (2017), 293.
    [63]
    Leonard Heilig and Stefan Voß. 2017. Managing cloud-based Big Data platforms: A reference architecture and cost perspective. In Big Data Management. Springer, 29--45.
    [64]
    Heli Hiisilä, Marjo Kauppinen, and Sari Kujala. 2016. An iterative process to connect business and IT development: Lessons learned. In Proceedings of the 18th Conference on Business Informatics (CBI’16), Vol. 1. IEEE, 94--103.
    [65]
    H. Hu, Y. G. Wen, T.-S. Chua, and X. L. Li. 2014. Towards scalable systems for Big Data analytics: A technology tutorial. IEEE Access 2 (2014), 652--687.
    [66]
    IBM. 2017. How to leverage the power of prescriptive analytics to maximize the ROI. Retrieved May 6, 2019 from https://www.ibmbigdatahub.com/blog/how-leverage-power-prescriptive-analytics-maximize-roi.
    [67]
    Anne Immonen, Pekka Pääkkönen, and Eila Ovaska. 2015. Evaluating the quality of social media data in Big Data architecture. IEEE Access 3 (2015), 2028--2043.
    [68]
    Infochimps. 2013. CIOs 8 Big Data: What your IT team wants you to know. Retrieved Dec 21, 2018 from http://www.infochimps.com/resources/report-cios-big-data-what-your-it-team-wants-you-to-know-6/.
    [69]
    Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd Millstein, and Tyson Condie. 2015. Titian: Data provenance support in Spark. Proc. VLDB Endow. 9, 3 (2015), 216--227.
    [70]
    ISO. 2017. Systems and Software Engineering-vocabulary. Technical Report. ISO/IEC/IEEE 24765.
    [71]
    ISTQB. 2018. Standard Glossary of Terms used in Software Testing Version 3.2. Retrieved Jun 14, 2019 from https://www.istqb.org/downloads/category/20-istqb-glossary.html.
    [72]
    Petar Jovanovic, Óscar Romero Moral, Alkis Simitsis, Alberto Abelló Gamazo, Héctor Candón Arenas, and Sergi Nadal Francesch. 2015. Quarry: Digging up the gems of your data treasury. In Proceedings of the 18th International Conference on Extending Database Technology. 549--552.
    [73]
    Dawn N. Jutla, Peter Bodorik, and Sohail Ali. 2013. Engineering privacy for Big Data apps with the unified modeling language. In Proceedings of the International Congress on Big Data. IEEE, 38--45.
    [74]
    Jeyhun Karimov, Tilmann Rabl, Asterios Katsifodimos, Roman Samarev, Henri Heiskanen, and Volker Markl. 2018. Benchmarking distributed stream data processing systems. In Proceedings of the 34th International Conference on Data Engineering (ICDE’18). IEEE, 1507--1518.
    [75]
    Vijay Khatri and Carol V. Brown. 2010. Designing data governance. Commun. ACM 53, 1 (2010), 148--152.
    [76]
    Jinhan Kim, Robert Feldt, and Shin Yoo. 2019. Guiding deep learning system testing using surprise adequacy. In Proceedings of the 41st International Conference on Software Engineering. IEEE Press, 1039--1049.
    [77]
    Michael Kläs, Wolfgang Putz, and Tobias Lutz. 2016. Quality evaluation for Big Data: A scalable assessment approach and first evaluation results. In Proceedings of the Joint Conference of the International Workshop on Software Measurement and the International Conference on Software Process and Product Measurement (IWSM-MENSURA’16). IEEE, 115--124.
    [78]
    Jay Kreps. 2014. Questioning the Lambda architecture. Retrieved April 1, 2019 from https://www.oreilly.com/ideas/questioning-the-lambda-architecture.
    [79]
    Johannes Kroß, Andreas Brunnert, Christian Prehofer, Thomas A. Runkler, and Helmut Krcmar. 2015. Stream processing on demand for lambda architectures. In Proceedings of the European Workshop on Performance Engineering. Springer, 243--257.
    [80]
    Vijay Dipti Kumar and Paulo Alencar. 2016. Software engineering for Big Data projects: Domains, methodologies and gaps. In Proceedings of the International Conference on Big Data (Big Data’16). IEEE, 2886--2895.
    [81]
    Rodrigo Laigner, Marcos Kalinowski, Sérgio Lifschitz, Rodrigo Salvador Monteiro, and Daniel de Oliveira. 2018. A systematic mapping of software engineering approaches to develop Big Data systems. In Proceedings of the 44th Euromicro Conference on Software Engineering and Advanced Applications (SEAA’18). IEEE, 446--453.
    [82]
    Andreas Langegger, Wolfram Wöß, and Martin Blöchl. 2008. A semantic web middleware for virtual data integration on the web. In Proceedings of the European Semantic Web Conference. Springer, 493--507.
    [83]
    Lydia Lau, Fan Yang-Turner, and Nikos Karacapilidis. 2014. Requirements for Big Data analytics supporting decision making: A sensemaking perspective. In Mastering Data-intensive Collaboration and Decision Making. Springer, 49--70.
    [84]
    George Lee, Jimmy Lin, Chuang Liu, Andrew Lorek, and Dmitriy Ryaboy. 2012. The unified logging infrastructure for data analytics at Twitter. Proc. VLDB Endow. 5, 12 (2012), 1771--1780.
    [85]
    Maurizio Lenzerini. 2002. Data integration: A theoretical perspective. In Proceedings of the 21st SIGMOD Symposium on Principles of Database Systems. ACM, 233--246.
    [86]
    Katerina Lepenioti, Alexandros Bousdekis, Dimitris Apostolou, and Gregoris Mentzas. 2020. Prescriptive analytics: Literature review and research challenges. Int. J. Inf. Manage. 50 (2020), 57--70.
    [87]
    Jimmy Lin and Dmitriy Ryaboy. 2013. Scaling Big Data mining infrastructure: The Twitter experience. ACM SIGKDD Explor. Newslett. 14, 2 (2013), 6--19.
    [88]
    Jiaheng Lu and Irena Holubová. 2019. Multi-model databases: A new journey to handle the variety of data. ACM Comput. Surv. 52, 3 (2019), 55.
    [89]
    Ashwin Machanavajjhala and Jerome P. Reiter. 2012. Big privacy: Protecting confidentiality in Big Data. XRDS 19, 1 (2012), 20--23.
    [90]
    Nazim H. Madhavji, Andriy Miranskyy, and Kostas Kontogiannis. 2015. Big picture of Big Data software engineering: With example research challenges. In Proceedings of the 1st International Workshop on Big Data Software Engineering. IEEE Press, 11--14.
    [91]
    Gunasekaran Manogaran and Daphne Lopez. 2018. Disease surveillance system for big climate data processing and dengue transmission. In Climate Change and Environmental Concerns: Breakthroughs in Research and Practice. IGI Global, 427--446.
    [92]
    Miguel A. Martínez-Prieto, Carlos E. Cuesta, Mario Arias, and Javier D. Fernández. 2015. The solid architecture for real-time management of big semantic data. Fut. Gener. Comput. Syst. 47 (2015), 62--79.
    [93]
    Nathan Marz. 2011. How to beat the CAP theorem. Retrieved April 2, 2019 from http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html.
    [94]
    Nathan Marz and James Warren. 2015. Big Data: Principles and Best Practices of Scalable Real-time Data Systems. Manning, New York, NY.
    [95]
    Richard McClatchey, Andrew Branson, Jetendr Shamdasani, Zsolt Kovacs, et al. 2015. Designing traceability into Big Data systems. arXiv preprint arXiv:1502.01545 (2015).
    [96]
    John Meehan, Nesime Tatbul, Stan Zdonik, Cansu Aslantas, Ugur Cetintemel, Jiang Du, Tim Kraska, Samuel Madden, David Maier, Andrew Pavlo, Michael Stonebraker, Kristin Tufte, and Hao Wang. 2015. S-Store: Streaming meets transaction processing. arXiv:1503.01143.
    [97]
    John Meehan, Stan Zdonik, Shaobo Tian, Yulong Tian, Nesime Tatbul, Adam Dziedzic, and Aaron Elmore. 2016. Integrating real-time and batch processing in a polystore. In Proceedings of the High Performance Extreme Computing Conference (HPEC’16). IEEE, 1--7.
    [98]
    Jorge Merino, Ismael Caballero, Bibiano Rivas, Manuel Serrano, and Mario Piattini. 2016. A data quality in use model for Big Data. Fut. Gener. Comput. Syst. 63 (2016), 123--130.
    [99]
    Loup Meurice and Anthony Cleve. 2017. Supporting schema evolution in schema-less NoSQL data stores. In Proceedings of the 24th International Conference on Software Analysis, Evolution and Reengineering (SANER’17). IEEE, 457--461.
    [100]
    Katina Michael and Keith W. Miller. 2013. Big Data: New opportunities and new challenges. Computer 46, 6 (2013), 22--24.
    [101]
    H. Gilbert Miller and Peter Mork. 2013. From data to decisions: A value chain for Big Data. It Profess. 15, 1 (2013), 57--59.
    [102]
    Seyed Esmaeil Mirvakili, MohammadAmin Fazli, and Jafar Habibi. 2019. Reactive liquid: Optimized liquid architecture for elastic and resilient distributed data processing. arXiv preprint arXiv:1902.05968 (2019).
    [103]
    Mostafa Mirzaie, Behshid Behkamal, and Samad Paydar. 2019. Big Data quality: A systematic literature review and future research directions. arXiv preprint arXiv:1904.05353 (2019).
    [104]
    Gilad Mishne, Jeff Dalton, Zhenghua Li, Aneesh Sharma, and Jimmy Lin. 2013. Fast data in the era of Big Data: Twitter’s real-time related query suggestion architecture. In Proceedings of the SIGMOD International Conference on Management of Data. ACM, 1147--1158.
    [105]
    Luc Moreau, Ben Clifford, Juliana Freire, Joe Futrelle, Yolanda Gil, Paul Groth, Natalia Kwasnikowska, Simon Miles, Paolo Missier, Jim Myers, et al. 2011. The open provenance model core specification (v1.1). Fut. Gener. Comput. Syst. 27, 6 (2011), 743--756.
    [106]
    Sergi Nadal. 2019. Metadata-driven Data Integration. Ph.D dissertation. The Polytechnic University of Catalonia.
    [107]
    Sergi Nadal, Victor Herrero, Oscar Romero, Alberto Abelló, Xavier Franch, Stijn Vansummeren, and Danilo Valerio. 2017. A software reference architecture for semantic-aware Big Data systems. Inf. Softw. Technol. 90 (2017), 75--92.
    [108]
    V. D. Nadkarni. 2020. Worldwide Big Data Technology and Services Forecast, 2016--2020. Retrieved March 20, 2020 from https://www.marketresearch.com/IDC-v2477/Worldwide-Big-Data-Technology-Services-10510864/.
    [109]
    Ravishankar Narayanan. 2016. Evolving and Improving the Requirements approach to Big Data Projects. https://re-magazine.ireb.org/articles/a-roadmap-to-implementing-big-data-projects/.
    [110]
    NIST. 2018. NIST Big Data Interoperability Framework: Volume 3, Use Cases and General Requirements. U.S. Department of Commerce, National Institute of Standards and Technology.
    [111]
    Ibtehal Noorwali, Darlan Arruda, and Nazim H. Madhavji. 2016. Understanding quality requirements in the context of Big Data systems. In Proceedings of the 2nd International Workshop on Big Data Software Engineering. ACM, 76--79.
    [112]
    Jukka K. Nurminen and Harrison Mfula. 2018. A unified framework for 5G network management tools. In Proceedings of the 11th Conference on Service-Oriented Computing and Applications. IEEE, 41--48.
    [113]
    Carlos Ordonez. 2010. Statistical model computation with UDFs. IEEE Trans. Knowl. Data Eng. 22, 12 (2010), 1752--1765.
    [114]
    Carlos E. Otero and Adrian Peter. 2014. Research directions for engineering Big Data analytics software. IEEE Intell. Syst. 30, 1 (2014), 13--19.
    [115]
    M. Tamer Özsu and Patrick Valduriez. 2011. Principles of Distributed Database Systems. Springer Science 8 Business Media.
    [116]
    Pasquale Pagano, Leonardo Candela, and Donatella Castelli. 2013. Data interoperability. Data Sci. J. 12 (2013), 119--125.
    [117]
    Leysia Palen, Kenneth M. Anderson, Gloria Mark, James Martin, Douglas Sicker, Martha Palmer, and Dirk Grunwald. 2010. A vision for technology-mediated support for public participation 8 assistance in mass emergencies 8 disasters. In ACM-BCS Visions of Computer Science Conference. British Computer Society, 8.
    [118]
    George Papadakis, Dimitrios Skoutas, Emmanouil Thanos, and Themis Palpanas. 2020. Blocking and filtering techniques for entity resolution: A survey. ACM Comput. Surv. 53, 2 (2020), 1--42.
    [119]
    A. Patrizio. 2019. 4 reasons Big Data projects failand 4 ways to succeed: Nearly all Big Data projects end up in failure, despite all the mature technology available. Here’s how to make Big Data efforts actually succeed. Retrieved March 21, 2020 from https://www.infoworld.com/article/3393467/4-reasons-big-data-projects-failand-4-ways-to-succeed.html.
    [120]
    Kexin Pei, Yinzhi Cao, Junfeng Yang, and Suman Jana. 2017. DeepXplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles. ACM, 1--18.
    [121]
    Michele Pettinato, Juan Pablo Gil, Patricio Galeas, and Barbara Russo. 2019. Log mining to re-construct system behavior: An exploratory study on a large telescope system. Inf. Softw. Technol. 114 (2019), 121--136.
    [122]
    Antonella Poggi, Domenico Lembo, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and Riccardo Rosati. 2008. Linking data to ontologies. J. Data Semantics 10 (2008), 133--173.
    [123]
    John Poole, Dan Chang, Douglas Tolbert, and David Mellor. 2002. Common Warehouse Metamodel. Vol. 20. John Wiley 8 Sons.
    [124]
    Christoph Quix, Rihan Hai, and Ivan Vatov. 2016. GEMMS: A generic and extensible metadata management system for data lakes. In Proceedings of the CAiSE Forum. 129--136.
    [125]
    Erhard Rahm and Hong Hai Do. 2000. Data cleaning: Problems and current approaches. IEEE Data Eng. 23, 4 (2000), 3--13.
    [126]
    Vaibhav Sachdeva and Lawrence Chung. 2017. Handling non-functional requirements for Big Data and IOT projects in scrum. In Proceedings of the 7th International Conference on Cloud Computing, Data Science 8 Engineering-Confluence. IEEE, 216--221.
    [127]
    Maribel Yasmina Santos, Jorge Oliveira e Sá, Carina Andrade, Francisca Vale Lima, Eduarda Costa, Carlos Costa, Bruno Martinho, and João Galvão. 2017. A Big Data system supporting Bosch Braga industry 4.0 strategy. Int. J. Inf. Manage. 37, 6 (2017), 750--760.
    [128]
    Carlton Sapp, Daren Brabham, Joseph Antelmi, Henry Cook, Thornton Craig, Soyeb Barot, Doreen Galli, Sumit Pal, Sanjeev Mohan, George Gilbert. 2018. Planning guide for data and analytics. [Online]. Available at https://www.gartner.com/en/doc/361501-2019-planning-guide-for-data-and-analytics [Accessed 29 May 2020].
    [129]
    Sergio Segura, Gordon Fraser, Ana B. Sanchez, and Antonio Ruiz-Cortés. 2016. A survey on metamorphic testing. IEEE Trans. Softw. Eng. 42, 9 (2016), 805--824.
    [130]
    A. Sharala. 2019. Why 85% of Big Data projects fail. Retrieved March 21, 2020 from https://www.digitalnewsasia.com/insights/why-85-big-data-projects-fail.
    [131]
    Jiaxin Shi, Youyang Yao, Rong Chen, Haibo Chen, and Feifei Li. 2016. Fast and concurrent {RDF} queries with RDMA-based distributed graph exploration. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 317--332.
    [132]
    Hassan A. Sleiman and Rafael Corchuelo. 2012. Information extraction framework. In Trends in Practical Applications of Agents and Multiagent Systems. Springer, 149--156.
    [133]
    Sunil Soares. 2012. Big Data Governance: An Emerging Imperative. Mc Press.
    [134]
    Sunil Soares. 2013. IBM InfoSphere: A Platform for Big Data Governance and Process Data Governance. Mc Press.
    [135]
    Ian Sommerville and Pete Sawyer. 1997. Requirements Engineering: A Good Practice Guide. John Wiley 8 Sons, Inc.
    [136]
    Andrea Stocco, Michael Weiss, Marco Calzana, and Paolo Tonella. 2019. PRECRIME: Self-assessment Oracles for Anticipatory Testing. Technical Report TR-Precrime-2019-02. USI Universita della Svizzera Italiana.
    [137]
    Christopher Stoermer, Felix Bachmann, and Chris Verhoef. 2003. SACAM: The Software Architecture Comparison Analysis Method. Technical Report. Carnegie-Mellon University, Pittsburgh, Pennsylvania.
    [138]
    Isuru Suriarachchi and Beth Plale. 2016. Provenance as essential infrastructure for data lakes. In Proceedings of the International Provenance and Annotation Workshop. Springer, 178--182.
    [139]
    Ikbal Taleb, Hadeel T. El Kassabi, Mohamed Adel Serhani, Rachida Dssouli, and Chafik Bouhaddioui. 2016. Big data quality: A quality dimensions evaluation. In Proceedings of the Conferences on Ubiquitous Intelligence 8 Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld’16). IEEE, 759--765.
    [140]
    Ignacio G. Terrizzano, Peter M. Schwarz, Mary Roth, and John E. Colino. 2015. Data wrangling: The challenging journey from the wild to the lake. In Proceedings of the Conference on Innovative Data Systems Research (CIDR’15).
    [141]
    Yuchi Tian, Kexin Pei, Suman Jana, and Baishakhi Ray. 2018. Deeptest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering. ACM, 303--314.
    [142]
    Trends. 2020. Interest in “Big Data Analytics’’ over time. Retrieved May 2, 2020 from https://trends.google.pt/trends/explore?cat=128date=2011-01-01%202020-02-048q=big%20data%20analytics.
    [143]
    Chun-Wei Tsai, Chin-Feng Lai, Han-Chieh Chao, and Athanasios V. Vasilakos. 2015. Big Data analytics: A survey. J. Big Data 2, 1 (2015), 21.
    [144]
    R. Van Der Meulen. 2016. Gartner survey reveals investment in Big Data is up but fewer organizations plan to invest. [Online]. Available at http://www.gartner.com/newsroom/id/3466117 [Accessed 29 May 2020].
    [145]
    Maria-Esther Vidal, Kemele M. Endris, Samaneh Jozashoori, Farah Karim, and Guillermo Palma. 2019. Semantic data integration of big biomedical data for supporting personalised medicine. In Current Trends in Semantic Web Technologies: Theory and Practice. Springer, 25--56.
    [146]
    M. Villari, A. Celesti, M. Fazio, and A. Puliafito. 2014. AllJoyn Lambda: An architecture for the management of smart environments in IoT. In Proceedings of the International Conference on Smart Computing Workshops. 9--14.
    [147]
    Matthias Volk, Sascha Bosse, Dennis Bischoff, and Klaus Turowski. 2019. Decision-support for selecting Big Data reference architectures. In Proceedings of the International Conference on Business Information Systems. Springer, 3--17.
    [148]
    Coral Walker and Hassan Alrehamy. 2015. Personal data lake with data gravity pull. In Proceedings of the 5th International Conference on Big Data and Cloud Computing. IEEE, 160--167.
    [149]
    Siyuan Wang, Chang Lou, Rong Chen, and Haibo Chen. 2018. Fast and concurrent RDF queries using RDMA-assisted GPU graph exploration. In Proceedings of the 2018 USENIX Annual Technical Conference (USENIX ATC’18). 651--664.
    [150]
    Wei Wang, Lei Fan, Pu Huang, and Hai Li. 2019. A new data processing architecture for multi-scenario applications in aviation manufacturing. IEEE Access 7 (2019), 83637--83650.
    [151]
    Elaine J. Weyuker. 1982. On testing non-testable programs. Comput. J. 25, 4 (1982), 465--470.
    [152]
    Gio Wiederhold. 1992. Mediators in the architecture of future information systems. Computer 25, 3 (1992), 38--49.
    [153]
    Fangjin Yang, Gian Merlino, Nelson Ray, Xavier Léauté, Himanshu Gupta, and Eric Tschetter. 2017. The RADStack: Open source Lambda architecture for interactive analytics. In Proceedings of the 50th Hawaii International Conference on System Sciences. 1703--1712.
    [154]
    Michal Young. 2008. Software Testing and Analysis: Process, Principles, and Techniques. John Wiley & Sons.
    [155]
    Victor Zakhary, Divyakant Agrawal, and Amr El Abbadi. 2017. Caching at the web scale. In Proceedings of the 26th International Conference on World Wide Web Companion. 909--912.
    [156]
    Jie M. Zhang, Mark Harman, Lei Ma, and Yang Liu. 2020. Machine learning testing: Survey, landscapes and horizons. IEEE Transactions on Software Engineering (2020).
    [157]
    Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. 2018. DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd International Conference on Automated Software Engineering. ACM, 132--142.
    [158]
    Zhi Quan Zhou, Shaowen Xiang, and Tsong Yueh Chen. 2015. Metamorphic testing for software quality assessment: A study of search engines. Trans. Softw. Eng. 42, 3 (2015), 264--284.
    [159]
    Theo Zschörnig, Jonah Windolph, Robert Wehlitz, and Bogdan Franczyk. 2020. A cloud-based analytics-platform for user-centric Internet of Things domains—Prototype and performance evaluation. In Proceedings of the 53rd Hawaii International Conference on System Sciences.


      Published In

      ACM Computing Surveys  Volume 53, Issue 5
      September 2021
      782 pages
      ISSN:0360-0300
      EISSN:1557-7341
      DOI:10.1145/3426973
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 28 September 2020
      Accepted: 01 June 2020
      Revised: 01 May 2020
      Received: 01 July 2019
      Published in CSUR Volume 53, Issue 5

      Author Tags

      1. Big Data
      2. Big Data systems
      3. quality assurance
      4. requirements engineering
      5. software engineering
      6. software reference architecture

      Qualifiers

      • Survey
      • Research
      • Refereed

      Funding Sources

      • Guangzhou Key Laboratory of Big Data and Intelligent Education
      • National Natural Science Foundation of China

