Abstract
Today’s state-of-the-art Big data analytics engines handle massive amounts of data, but they will reach their limits, because the Big data flood is predicted to keep growing at an increasing rate. Hence, we need to think about the next development phase and the future features of Big data analytics engines. In this paper, we discuss possible future enhancements in the area of Big data analytics with a focus on emergent models, frameworks, and hardware technologies. We point out a selection of new challenges and open research questions.
Notes
As given in the November list at https://www.top500.org.
Floating-point operations per second (Flops) is a measure of computer performance representing the number of floating-point calculations computed per second.
Inspired by [77], the numbers for the considered hardware are collected from https://www.intel.com/content/www/us/en/support/articles/000006779/processors.html for Intel Xeon processors, http://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/high-performance-xeon-phi-coprocessor-brief.pdf and https://www.intel.com/content/www/us/en/processors/xeon/scalable/xeon-scalable-platform-brief.html for Xeon Phi, http://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units for NVIDIA GPUs and http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units for AMD GPUs.
We define the number of computational units of CPUs as the number of cores, and compare this number with the number of streaming multi-processors (NVIDIA GPU) and compute units (AMD GPU). We do not present here the number of compute units of GPUs, groups of which are meant for handling single-instruction multiple-data (SIMD) in GPUs and hence are not independent of each other in the way the cores of CPUs are.
Note that there is considerable debate about how to determine the GFlops of an FPGA (see, e.g., https://www.altera.com/en_US/pdfs/literature/wp/wp-01222-understanding-peak-floating-point-performance-claims.pdf). We used the numbers presented in the previously mentioned web document. Further note that, due to the different architectures, it is much more difficult to define a number of computational units for an FPGA that is comparable to those of CPUs and GPUs. Hence, we do not provide the numbers of computational units for FPGAs here.
References
Abdelfattah MS, Hagiescu A, Singh D (2014) Gzip on a Chip: High performance lossless data compression on FPGAs using OpenCL. In: Proceedings of the International Workshop on OpenCL 2014, IWOCL ’14. ACM, New York, NY, USA, pp 4:1–4:9
Ahn J, Im D, Kim H (2015) Sigmr: Mapreduce-based SPARQL query processing by signature encoding and multi-way join. J Supercomput 71(10):3695–3725
Ajtai M, Komlós J, Szemerédi E (1983) An O(n log n) sorting network. In: Proceedings of the Fifteenth Annual ACM Symposium on Theory of Computing, STOC ’83. ACM, New York, NY, USA, pp 1–9
Alam M, Yoginath SB, Perumalla KS (2016) Performance of point and range queries for in-memory databases using radix trees on GPUs. In: 2016 IEEE 18th International Conference on High Performance Computing and Communications; IEEE 14th International Conference on Smart City; IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), pp 1493–1500
Alcantara DA, Sharf A, Abbasinejad F, Sengupta S, Mitzenmacher M, Owens JD, Amenta N (2009) Real-time parallel hashing on the gpu. ACM Trans Graph 28(5):154
Alvarez V, Richter S, Chen X, Dittrich J (2015) A comparison of adaptive radix trees and hash tables. In: ICDE
AMD (2014) Compute Cores, White Paper. http://www.amd.com/Documents/Compute_Cores_Whitepaper.pdf. Accessed 19 Feb 2018
Ashkiani S, Li S, Farach-Colton M, Amenta N, Owens JD (2017) GPU LSM: a dynamic dictionary data structure for the GPU. CoRR. arXiv:1707.05354. Accessed 19 Feb 2018
Baddar SWA-H, Batcher KE (2011) Designing sorting networks: a new paradigm. Springer, Berlin
Barbieri DF, Braga D, Ceri S, Della Valle E, Grossniklaus M (2010) Incremental reasoning on streams and rich background knowledge. Springer, Berlin, pp 1–15
Barbieri DF, Braga D, Ceri S, Valle ED, Huang Y, Tresp V, Rettinger A, Wermser H (2010) Deductive and inductive stream reasoning for semantic social media analytics. IEEE Intell Syst 25(6):32–41
Batcher KE (1968) Sorting networks and their applications. In: AFIPS
Battré D, Heine F, Höing A, Kao O (2007) On triple dissemination, forward-chaining, and load balancing in DHT based RDF stores. In: Proceedings of the 2006 International Conference on Databases, Information Systems, and Peer-to-Peer Computing. Springer, pp 343–354
Bender MA et al (2007) Cache-oblivious streaming B-trees. In: SPAA
Berners-Lee T, Hendler J, Lassila O (2001) The Semantic Web. Scientific American Magazine 284:34–43
Blochwitz C, Joseph JM, Pionteck T, Backasch R, Werner S, Heinrich D, Groppe S (2015) An optimized radix-tree for hardware-accelerated index generation for Semantic Web Databases. In: International Conference on ReConFigurable Computing and FPGAs (ReConFig), Cancun, Mexico, December 7–9
Blochwitz C, Wolff J, Joseph JM, Werner S, Heinrich D, Groppe S, Pionteck T (2017) Hardware-accelerated radix-tree based string sorting for Big data applications. In: Architecture of Computing Systems (ARCS 2017) - 30th International Conference (LNCS), vol 10172, Vienna, Austria, pp 47–58, 3–6 April 2017
Bonomi F, Milito R, Zhu J, Addepalli S (2012) Fog computing and its role in the internet of things. In: Proceedings of the First Edition of the MCC Workshop on Mobile Cloud Computing, MCC ’12. ACM, New York, pp 13–16
Borne K (2014) Top 10 Big Data Challenges A Serious Look at 10 Big Data Vs. Gartner. https://www.mapr.com/blog/top-10-big-data-challenges-serious-look-10-big-data-vs. Accessed 19 Feb 2018
Carbone P, Ewen S, Fóra G, Haridi S, Richter S, Tzoumas K (2017) State management in apache flink: consistent stateful distributed stream processing. Proc VLDB Endow 10(12):1718–1729
Chang F et al (2008) Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst 26(2):4:1–4:26
Chazelle B, Guibas LJ (1986) Fractional cascading: I. A data structuring technique. Algorithmica 1(1):133–162
Chellappa R (1997) Intermediaries in cloud-computing: a new computing paradigm. In: INFORMS
Chen X, Chen H, Zhang N, Zhang S (2014) Sparkrdf: elastic discreted rdf graph processing engine with distributed memory. In: Proceedings of the 2014 International Conference on Posters & Demonstrations Track, ISWC-PD’14, vol 1272, pp 261–264, Aachen, Germany. CEUR-WS.org
Chen Y-T, Cong J, Fang Z, Lei J, Wei P (2016) When spark meets FPGAs: a case study for next-generation DNA sequencing acceleration. In: 8th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 16), Denver, CO, 2016. USENIX Association
Comer D (1979) Ubiquitous B-tree. ACM Comput Surv 11(2):121–137
Daga M, Nutter M (2012) Exploiting coarse-grained parallelism in B+ tree searches on an APU. In: High performance computing, networking, storage and analysis (SCC), 2012 SC companion. IEEE, pp 240–247
DataStax, Inc (2016) How is data written?. http://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlHowDataWritten.html. Accessed 19 Feb 2018
Dowd M et al (1989) The periodic balanced sorting network. J ACM 36(4):738–757
Facebook (2015) Indexing SST files for better lookup performance. https://github.com/facebook/rocksdb/wiki/Indexing-SST-Files-for-Better-Lookup-Performance. Accessed 19 Feb 2018
Fisher DE, Yang S (2016) Doing more with the dew: a new approach to cloud-dew architecture. Open J Cloud Comput 3(1):8–19
Gaetani E, Aniello L, Baldoni R, Lombardi F, Margheri A, Sassone V (2017) Blockchain-based database to ensure data integrity in cloud computing environments. In: ITASEC, pp 146–155
Google (2015) Leveldb file layout and compactions. https://rawgit.com/google/leveldb/master/doc/impl.html. Accessed 19 Feb 2018
Graux D, Jachiet L, Genevès P, Layaïda N (2016) SPARQLGX: efficient distributed evaluation of SPARQL with apache spark. In: The Semantic Web - ISWC 2016—15th International Semantic Web Conference, Kobe, Japan, October 17–21, 2016, Proceedings, Part II, pp 80–87
Groppe J, Groppe S, Schleifer A, Linnemann V (2009) LuposDate: a semantic web database system. In: Proceedings of the 18th ACM Conference on Information and Knowledge Management (ACM CIKM 2009). ACM, Hong Kong, China, pp 2083–2084
Groppe S (2011) Data management and query processing in semantic web databases. Springer, Berlin
Groppe S (2017) LUPOSDATE Semantic Web Database Management System. https://github.com/luposdate/luposdate. Accessed 3 Feb 2017
Groppe S, Kiencke T, Werner S, Heinrich D, Stelzner M, Gruenwald L (2014) P-luposdate: using precomputed bloom filters to speed up sparql processing in the cloud. Open J Semant Web 1(2):25–55
Heimel M, Saecker M, Pirk H, Manegold S, Markl V (2013) Hardware-oblivious parallelism for in-memory column-stores. Proc VLDB Endow 6(9):709–720
Heinrich D, Werner S, Blochwitz C, Pionteck T, Groppe S (2017) Search & update optimization of a B+ tree in a hardware aided semantic web database system. In: Proceedings of the 7th International Conference on Emerging Databases (EDB) (Lecture Notes in Electrical Engineering (LNEE)). Springer, vol 461, pp 172–182
Heinrich D, Werner S, Stelzner M, Blochwitz C, Pionteck T, Groppe S (2015) Hybrid FPGA approach for a B+ tree in a semantic web database system. In: Proceedings of the 10th International Symposium on Reconfigurable Communication-Centric Systems-on-Chip (ReCoSoC 2015), Bremen, Germany, June 29–July 1 2015. IEEE
Idreos S, Koubarakis M (2004) Methods and applications of artificial intelligence. In: Third Hellenic Conference on AI, SETN 2004, Samos, Greece, May 5–8, 2004. Proceedings, Chapter P2P-DIET: Ad-hoc and Continuous Queries in Peer-to-Peer Networks Using Mobile Agents. Springer, Berlin, pp 23–32
Idreos S, Koubarakis M, Tryfonopoulos C (2004) Advances in database technology—EDBT 2004. In: 9th International Conference on Extending Database Technology, Heraklion, Crete, Greece, March 14–18, 2004, chapter P2P-DIET: One-Time and Continuous Queries in Super-Peer Networks. Springer, Berlin, pp 851–853
Jung HS, Yoon CS, Lee YW, Park JW, Yun CH (2017) Processing IoT data with cloud computing for smart cities. Int J Web Appl (IJWA) 9(3):88–95
Kaoudi Z, Koubarakis M, Kyzirakos K, Miliaraki I, Magiridou M, Papadakis-Pesaresi A (2010) Atlas: storing, updating and querying RDF(S) data on top of DHTs. Web Semant Sci Serv Agents World Wide Web 8(4):271–277
Kaoudi Z, Manolescu I (2015) Rdf in the clouds: a survey. VLDB J 24(1):67–91
Laney D (2001) 3D Data Management: controlling data volume, velocity and variety. Gartner, http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf. Accessed 19 Feb 2018
Leis V, Kemper A, Neumann T (2013) The adaptive radix tree: artful indexing for main-memory databases. In: ICDE
Li J, Tseng H-W, Lin C, Papakonstantinou Y, Swanson S (2016) Hippogriffdb: balancing i/o and gpu bandwidth in big data analytics. Proc VLDB Endow 9(14):1647–1658
Liang W, Yin W, Kang P, Wang L (2016) Memory efficient and high performance key-value store on FPGA using cuckoo hashing. In: FPL
Liarou E, Idreos S, Koubarakis M (2007) The Semantic Web. In: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11–15, 2007. Proceedings, chapter Continuous RDF Query Processing over DHTs. Springer, Berlin, pp 324–339
Linked Data (2016) Linked data—connect distributed data across the Web. Accessed 4 Nov 2016
Liu Y, McBrien P (2017) Spowl: spark-based owl 2 reasoning materialisation. In: Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR’17)
LOD2 (2016) LODStats. Accessed 4 Nov 2016
LOD2 (2016) Welcome—LOD2—Creating knowledge out of interlinked data. Accessed 4 Nov 2016
Luo L, Wong MDF, Leong L (2012) Parallel implementation of r-trees on the GPU. In: 17th Asia and South Pacific Design Automation Conference, pp 353–358
Maarala AI, Su X, Riekki J (2017) Semantic reasoning for context-aware internet of things applications. IEEE Internet Things J 4(2):461–473
Mammo M, Bansal SK (2015) Distributed SPARQL over big RDF data: a comparative analysis using presto and mapreduce. In: 2015 IEEE International Congress on Big Data, New York City, NY, USA, June 27–July 2, pp 33–40
Mattern F, Floerkemeier C (2010) From the internet of computers to the internet of things. In: From Active Data Management to Event-based Systems and More. Springer, pp 242–259
McConaghy T, Marques R, Müller A, De Jonghe D, McConaghy T, McMullen G, Henderson R, Bellemare S, Granzotto A (2016) Bigchaindb: a scalable blockchain database. White paper
Mietz R, Groppe S, Oliver Kleine DB, Fischer S, Römer K, Pfisterer D (2013) A P2P semantic query framework for the internet of things. PIK - Praxis der Informationsverarbeitung und Kommunikation 36(2):73–79
Mietz R, Groppe S, Römer K, Pfisterer D (2013) Semantic models for scalable search in the internet of things. J Sens Actuator Netw 2(2):172–195
Moore GE (1965) Cramming more components onto integrated circuits. Electronics 38(8):114–117
Moore GE (1975) Progress in digital integrated electronics. In: Electron Devices Meeting, 1975 International. IEEE, vol 21, pp 11–13
Moore GE (2015) The man whose name means progress, the visionary engineer reflects on 50 years of Moore’s Law. IEEE Spectrum: special report: 50 years of Moore’s Law (Interview). Interview with Rachel Courtland. http://spectrum.ieee.org/computing/hardware/gordon-moore-the-man-whose-name-means-progress. Accessed 19 Feb 2018
Moscovici N, Cohen N, Petrank E (2017) A GPU-friendly skiplist algorithm. In: 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, pp 246–259
Mueller R, Teubner J, Alonso G (2012) Sorting networks on fpgas. VLDB J 21(1):1–23
Nakamoto S (2008) Bitcoin: a peer-to-peer electronic cash system. https://bitcoin.org/bitcoin.pdf. Accessed 19 Feb 2018
Noghabi SA, Paramasivam K, Pan Y, Ramesh N, Bringhurst J, Gupta I, Campbell RH (2017) Samza: stateful scalable stream processing at linkedin. Proc VLDB Endow 10(12):1634–1645
Nurvitadhi E, Sim J, Sheffield D, Mishra A, Krishnan S, Marr D (2016) Accelerating recurrent neural networks in analytics servers: comparison of FPGA, CPU, GPU, and ASIC. In: 26th International Conference on Field Programmable Logic and Applications (FPL)
O’Neil P et al (1996) The log-structured merge-tree (LSM-tree). Acta Inform 33(4):351–385
Pagh R, Rodler F (2004) Cuckoo hashing. J Algorithms 51(2):122–144
Pirk H, Moll O, Zaharia M, Madden S (2016) Voodoo—a vector algebra for portable database performance on modern hardware. Proc VLDB Endow 9(14):1707–1718
Plessl C (2012) Accelerating scientific computing with massively parallel computer architectures. IMPRS Winter School, Wroclaw. http://www.imprs-dynamics.mpg.de/pdfs/Plessl_talk.pdf. Accessed 19 Feb 2018
Prasad SK, McDermott M, He X, Puri S (2015) Gpu-based parallel r-tree construction and querying. In: Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2015 IEEE International. IEEE, pp 618–627
Ramaswamy L, Chen J (2011) The coquos approach to continuous queries in unstructured overlays. IEEE Trans Knowl Data Eng 23(3):463–478
Rupp K (2016) CPU, GPU and MIC hardware characteristics over time. Posted in blog GPGPU/MIC computing. https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/, 2013, last update
Ruta M, Scioscia F, Ieva S, Capurso G, Sciascio ED (2017) Semantic blockchain to improve scalability in the internet of things. Open J Internet Things 3(1):46–61. Special Issue: Proceedings of the International Workshop on Very Large Internet of Things (VLIoT 2017) in conjunction with the VLDB 2017 Conference in Munich, Germany
Schätzle A, Przyjaciel-Zablocki M, Skilevic S, Lausen G (2016) S2RDF: RDF querying with SPARQL on spark. PVLDB 9(10):804–815
Segal O, Colangelo P, Nasiri N, Qian Z, Margala M (2015) SparkCL: A unified programming framework for accelerators on heterogeneous clusters. CoRR. arXiv:1505.01120. Accessed 19 Feb 2018
Shahvarani A, Jacobsen H-A (2016) A hybrid b+-tree as solution for in-memory indexing on CPU-GPU heterogeneous computing platforms. In: Proceedings of the 2016 International Conference on Management of Data (SIGMOD), pp 1523–1538
Skala K, Davidovic D, Afgan E, Sovic I, Sojat Z (2015) Scalable distributed computing hierarchy: cloud, fog and dew computing. Open J Cloud Comput 2(1):16–24
Stone JE, Gohara D, Shi G (2010) Opencl: a parallel programming standard for heterogeneous computing systems. IEEE Des Test 12(3):66–73
ter Horst HJ (2005) Completeness, decidability and complexity of entailment for RDF schema and a semantic extension involving the OWL vocabulary. Web Semant 3(2–3):79–115
The Apache Software Foundation (2014) Welcome to Apache Hadoop!. http://hadoop.apache.org/. Accessed 19 Feb 2018
The Apache Software Foundation (2016) Apache Flink: scalable stream and batch data processing. https://flink.apache.org/. Accessed 19 Feb 2018
The Apache Software Foundation (2016) Apache Tez–Welcome to Apache Tez. https://tez.apache.org/. Accessed 19 Feb 2018
The Apache Software Foundation (2017) Apache Spark—Lightning-fast cluster computing. http://spark.apache.org/. Accessed 19 Feb 2018
Turck M (2016) Is Big data still a thing? (The 2016 Big data landscape). Blog of Matt Turck. http://mattturck.com/2016/02/01/big-data-landscape. Accessed 19 Feb 2018
Khan MA, Uddin MF, Gupta N (2014) Seven v’s of Big data understanding Big data to extract value. In: Proceedings of the 2014 Zone 1 Conference of the American Society for Engineering Education, pp 1–5
Waldrop MM (2016) The chips are down for Moore’s law. Nature 530(7589):144–147
Wang J, Park D, Papakonstantinou Y, Swanson S (2017) SSD in-storage computing for search engines. IEEE Trans Comput (to appear)
Wang Y (2016) Definition and categorization of dew computing. Open J Cloud Comput 3(1):1–7
Weisz G, Melber J, Wang Y, Fleming K, Nurvitadhi E, Hoe JC (2016) A study of pointer-chasing performance on shared-memory processor-fpga systems. In: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’16, New York, NY, USA, 2016. ACM, pp 264–273
Werner S (2017) Hybrid Architecture for Hardware-accelerated query processing in semantic web databases based on Runtime Reconfigurable FPGAs. PhD thesis, University of Lübeck
Werner S, Groppe S, Linnemann V, Pionteck T (2013) Hardware-accelerated join processing in large semantic web databases with FPGAs. In: Proceedings of the 2013 International Conference on High Performance Computing & Simulation (HPCS 2013), Helsinki, Finland, July 1–5 2013. IEEE, pp 131–138
Werner S, Heinrich D, Piper J, Groppe S, Backasch R, Blochwitz C, Pionteck T (2015) Automated composition and execution of hardware-accelerated operator graphs. In: Proceedings of the 10th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC 2015), Bremen, Germany, June 29–July 1 2015. IEEE
Werner S, Heinrich D, Stelzner M, Groppe S, Backasch R, Pionteck T (2014) Parallel and pipelined filter operator for hardware-accelerated operator graphs in semantic web databases. In: Proceedings of the 14th IEEE International Conference on Computer and Information Technology (CIT 2014), Xian, China, September 11–13 2014. IEEE
Werner S, Heinrich D, Stelzner M, Linnemann V, Pionteck T, Groppe S (2016) Accelerated join evaluation in semantic web databases by using FPGAs. Concurr Comput Pract Exp 28(7):2031–2051
You S, Zhang J, Gruenwald L (2013) Parallel spatial query processing on gpus using r-trees. In: Proceedings of the 2nd ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data (BigSpatial), pp 23–31
Zhang H, Andersen DG, Pavlo A, Kaminsky M, Ma L, Shen R (2016) Reducing the storage overhead of main-memory oltp databases with hybrid indexes. In: Proceedings of the 2016 International Conference on Management of Data (SIGMOD), pp 1567–1581
Zohouri HR, Maruyama N, Smith A, Matsuda M, Matsuoka S (2016) Evaluating and optimizing opencl kernels for high performance computing with FPGAs. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16, Piscataway, NJ, USA. IEEE Press, pp 35:1–35:12
Cite this article
Groppe, S. Emergent models, frameworks, and hardware technologies for Big data analytics. J Supercomput 76, 1800–1827 (2020). https://doi.org/10.1007/s11227-018-2277-x
DOI: https://doi.org/10.1007/s11227-018-2277-x