High-Performance Big Data Computing

About this ebook

An in-depth overview of an emerging field that brings together high-performance computing, big data processing, and deep learning.
 


Over the last decade, the exponential explosion of data known as big data has changed the way we understand and harness the power of data. The emerging field of high-performance big data computing, which brings together high-performance computing (HPC), big data processing, and deep learning, aims to meet the challenges posed by large-scale data processing. This book offers an in-depth overview of high-performance big data computing and the associated technical issues, approaches, and solutions. 
 
The book covers basic concepts and necessary background knowledge, including data processing frameworks, storage systems, and hardware capabilities; offers a detailed discussion of technical issues in accelerating big data computing in terms of computation, communication, memory and storage, codesign, workload characterization and benchmarking, and system deployment and management; and surveys benchmarks and workloads for evaluating big data middleware systems. It presents a detailed discussion of big data computing systems and applications with high-performance networking, computing, and storage technologies, including state-of-the-art designs for data processing and storage systems. Finally, the book considers some advanced research topics in high-performance big data computing, including designing high-performance deep learning over big data (DLoBD) stacks and HPC cloud technologies.
 
Language: English
Publisher: The MIT Press
Release date: Aug 2, 2022
ISBN: 9780262369428

    Cover: High-Performance Big Data Computing by Dhabaleswar K. Panda, Xiaoyi Lu, and Dipti Shankar

    High-Performance Big Data Computing

    Dhabaleswar K. Panda, Xiaoyi Lu, and Dipti Shankar

    The MIT Press

    Cambridge, Massachusetts

    London, England

    © 2022 Massachusetts Institute of Technology

    All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

    The MIT Press would like to thank the anonymous peer reviewers who provided comments on drafts of this book. The generous work of academic experts is essential for establishing the authority and quality of our publications. We acknowledge with gratitude the contributions of these otherwise uncredited readers.

    Library of Congress Cataloging-in-Publication Data

    Names: Panda, Dhabaleswar K., author. | Lu, Xiaoyi (Professor of computer science), author. | Shankar, Dipti, author.

    Title: High-performance big data computing / Dhabaleswar K. Panda, Xiaoyi Lu, and Dipti Shankar.

    Description: Cambridge, Massachusetts: The MIT Press, [2022] | Series: Scientific and engineering computation | Includes bibliographical references and index.

    Identifiers: LCCN 2021038754 | ISBN 9780262046855 (hardcover)

    Subjects: LCSH: High performance computing. | Big data.

    Classification: LCC QA76.88.P36 2022 | DDC 005.7—dc23/eng/20211020

    LC record available at https://lccn.loc.gov/2021038754


    Contents

    Acknowledgments

    1 Introduction

    1.1 Overview

    1.2 Big Data Characteristics and Trends

    1.3 Current Systems for Data Management and Processing

    1.4 Technological Trends

    1.5 Convergence in HPC, Big Data, and Deep Learning

    1.6 Outline of the Book

    1.7 Summary

    2 Parallel Programming Models and Systems

    2.1 Overview

    2.2 Batch Processing Frameworks

    2.3 Stream Processing Frameworks

    2.4 Query Processing Frameworks

    2.5 Graph Processing Frameworks

    2.6 Machine Learning and Deep Learning Frameworks

    2.7 Interactive Big Data Tools

    2.8 Monitoring and Diagnostics Tools

    2.9 Summary

    3 Parallel and Distributed Storage Systems

    3.1 Overview

    3.2 File Storage

    3.3 Object Storage

    3.4 Block Storage

    3.5 Memory-Centric Storage

    3.6 Monitoring and Diagnostics Tools

    3.7 Summary

    4 HPC Architectures and Trends

    4.1 Overview

    4.2 Computing Capabilities

    4.3 Storage

    4.4 Network Interconnects

    4.5 Summary

    5 Opportunities and Challenges in Accelerating Big Data Computing

    5.1 Overview

    5.2 C1: Computational Challenges

    5.3 C2: Communication and Data Movement Challenges

    5.4 C3: Memory and Storage Management Challenges

    5.5 C4: Challenges of Codesigning Big Data Systems and Applications

    5.6 C5: Challenges of Big Data Workload Characterization and Benchmarking

    5.7 C6: Deployment and Management Challenges

    5.8 Summary

    6 Benchmarking Big Data Systems

    6.1 Overview

    6.2 Offline Analytical Data Processing

    6.3 Streaming Data Processing

    6.4 Online Data Processing

    6.5 Graph Data Processing

    6.6 Machine Learning and Deep Learning Workloads

    6.7 Comprehensive Benchmark Suites

    6.8 Summary

    7 Accelerations with RDMA

    7.1 Overview

    7.2 Batch and Stream Processing Systems

    7.3 Graph Processing Systems

    7.4 RPC Libraries

    7.5 Query Processing in Databases

    7.6 In-Memory KV Stores

    7.7 HiBD Project

    7.8 Case Studies and Performance Benefits

    7.9 Summary

    8 Accelerations with Multicore/Accelerator Technologies

    8.1 Introduction

    8.2 Multicore CPUs

    8.3 GPU Acceleration for Big Data Computing

    8.4 FPGAs and ASICs

    8.5 Case Studies and Performance Benefits

    8.6 Summary

    9 Accelerations with High-Performance Storage Technologies

    9.1 Overview

    9.2 Exploring NVM-Centric Designs

    9.3 Hybrid and Hierarchical Storage Middleware

    9.4 Burst Buffer Systems

    9.5 Case Studies and Performance Benefits

    9.6 Summary

    10 Deep Learning over Big Data

    10.1 Overview

    10.2 Convergence of Deep Learning, Big Data, and HPC

    10.3 Challenges of Designing DLoBD Stacks

    10.4 Distributed Deep Learning Training Basics

    10.5 Overview of DLoBD Stacks

    10.6 Characterization of DLoBD Stacks

    10.7 Case Studies and Performance Benefits

    10.8 Discussions on Optimizations for Deep Learning Workloads

    10.9 Summary

    11 Designs with Cloud Technologies

    11.1 Overview

    11.2 Overview of High-Performance Cloud Technologies

    11.3 State-of-the-Art Designs

    11.4 Case Studies and Performance Benefits

    11.5 Summary

    12 Frontier Research on High-Performance Big Data Computing

    12.1 Heterogeneity-Aware Big Data Processing and Management Systems

    12.2 Big Data Processing and Management for Hybrid Storage Systems

    12.3 Efficient and Coherent Communication and Computation in Network for Big Data Systems

    12.4 Summary

    References

    Index

    List of Figures

    Figure 1.1

    Four V characteristics of big data.

    Figure 1.2

    Data Never Sleeps 8.0. Source: Courtesy of Domo.

    Figure 1.3

    Convergence of HPC, big data, and deep learning.

    Figure 1.4

    Challenges in bringing HPC, big data processing, and deep learning into a convergent trajectory.

    Figure 1.5

    Can we efficiently run big data and deep learning jobs on existing HPC infrastructure?

    Figure 1.6

    Challenges of high-performance big data computing. HDD, hard disk drive; NICs, network interface cards; QoS, quality of service; SR-IOV, single root input/output virtualization.

    Figure 1.7

    Outline of the book.

    Figure 2.1

    Programming models for distributed data processing.

    Figure 2.2

    Overview of data processing with MapReduce.

    Figure 2.3

    Overview of data processing with Hadoop MapReduce.

    Figure 2.4

    Overview of Spark architecture.

    Figure 2.5

    Spark dependencies.

    Figure 2.6

    Overview of streaming processing with Apache Storm.

    Figure 2.7

    Overview of streaming processing with Apache Flink.

    Figure 2.8

    Overview of TensorFlow stack.

    Figure 2.9

    Overview of distributed TensorFlow environment.

    Figure 3.1

    Various types of parallel and distributed storage systems. PCIe, Peripheral Component Interconnect Express; RADOS, Reliable, Autonomic Distributed Object Store; SATA, Serial AT Attachment.

    Figure 3.2

    File system architecture. MR, MapReduce; OSS, Object Storage Server; OST, Object Storage Target.

    Figure 3.3

    OpenStack Swift: architecture overview.

    Figure 3.4

    Apache Cassandra: architecture overview.

    Figure 3.5

    Memcached (distributed caching over DRAM).

    Figure 3.6

    Redis (distributed in-memory data store). HA, High Availability.

    Figure 4.1

    Overview of a typical HPC system architecture.

    Figure 4.2

    Storage device hierarchy.

    Figure 4.3

    NVMe command processing.

    Figure 4.4

    Overview of high-performance network interconnects and protocols. OFI, OpenFabrics Interfaces.

    Figure 4.5

    One-way latency: MPI over RDMA networks with MVAPICH2. (a) Small message latency. (b) Large message latency.

    Figure 4.6

    Bandwidth: MPI over RDMA networks with MVAPICH2. (a) Unidirectional bandwidth. (b) Bidirectional bandwidth.

    Figure 4.7

    RDMA over NVM: contrasting NVMeoF and remote PMoF.

    Figure 5.1

    Envisioned architecture for next-generation HEC systems. Courtesy of Panda et al. (2018).

    Figure 5.2

    Challenges of achieving high-performance big data computing.

    Figure 6.1

    Big data benchmarks.

    Figure 7.1

    Design overview of RDMA-Memcached (Shankar, Lu, Islam, et al., 2016; Jose et al., 2011). KNL, kernel; LRU, least recently used.

    Figure 7.2

    RDMA-based Hadoop architecture and its different modes. PBS, Portable Batch System.

    Figure 7.3

    Performance improvement of RDMA-based designs for Apache Spark and Hadoop on SDSC Comet cluster. (a) PageRank with RDMA-Spark. (b) Sort with RDMA–Hadoop 2.x.

    Figure 7.4

    Performance benefits with RDMA-Memcached based workloads. (a) Memcached Set/Get over simulated MySQL. (b) Hadoop TestDFSIO throughput with Boldio.

    Figure 8.1

    Architecture overview of GPU-aware hash table in Memcached.

    Figure 8.2

    Stand-alone throughput with CPU and GPU-centric hash table (based on Mega-KV (K. Zhang, Wang, et al., 2015)). (a) Insert. (b) Search. MOPS, millions of operations per second; thrs, threads.

    Figure 8.3

    Stand-alone hash table probing performance on the twenty-eight–core Intel Skylake CPU, over a three-way cuckoo hash table versus non-SIMD CPU-optimized MemC3 hash table with 32-bit key/payload (Shankar et al., 2019a).

    Figure 9.1

    Performance benefits of heterogeneous storage-aware designs for Hadoop on SDSC Comet. (a) NVM-assisted MapReduce design. (b) Spark TeraSort over heterogeneity-aware HDFS. MR-IPoIB, Default MapReduce running with the IPoIB protocol; RMR, RDMA-based MapReduce; RMR-NVM, RDMA-based MapReduce running with NVM in a naive manner; NVMD, Non-Volatile Memory-assisted design for MapReduce and DAG execution frameworks (Rahman et al., 2017).

    Figure 9.2

    Performance benefits with RDMA-Memcached–based workloads.

    Figure 10.1

    Deep learning and big data analytics pipeline. Source: Courtesy of Flickr (Garrigues, 2015).

    Figure 10.2

    Overview of a unified DLoBD stack. IB, InfiniBand.

    Figure 10.3

    Convergence of deep learning, big data, and HPC.

    Figure 10.4

    Overview of CaffeOnSpark. DB, database.

    Figure 10.5

    Overview of TensorFlowOnSpark.

    Figure 10.6

    Overview of MMLSpark (CNTKOnSpark).

    Figure 10.7

    Overview of BigDL.

    Figure 10.8

    Comparison of DNNs. Source: Courtesy of Canziani et al. (2016). BN, batch normalization; ENet, efficient neural network; G-Ops, billions (10⁹) of operations; ResNet, residual neural network; M, million; NIN, Network in Network; GoogLeNet, a 22-layer deep convolutional neural network that is a variant of the Inception architecture developed by researchers at Google.

    Figure 10.9

    Performance and accuracy comparison of training AlexNet on ImageNet with CaffeOnSpark running over IPoIB and RDMA. Source: Courtesy of X. Lu et al. (2018).

    Figure 10.10

    Performance analysis of TensorFlowOnSpark and stand-alone TensorFlow (lower is better). The numbers were taken by training the SoftMax Regression model over the MNIST dataset on a four-node cluster, which includes one PS and three workers. Source: Courtesy of X. Lu et al. (2018).

    Figure 11.1

    Overview of virtualization techniques. (a) VM architecture. (b) Container architecture. libs, libraries; OS, operating system.

    Figure 11.2

    SR-IOV architecture. IOMMU, input-output memory management unit.

    Figure 11.3

    Topology-aware resource allocation in Hadoop-Virt.

    Figure 11.4

    NVMe hardware arbitration overview.

    Figure 11.5

    Performance benefits of Hadoop-Virt on HPC clouds. Execution times for (a) WordCount, (b) PageRank, (c) Sort, and (d) Self-Join (30 GB).

    Figure 11.6

    Evaluation with synthetic application scenarios. (a) Bandwidth over time with scenario 1. (b) Job bandwidth ratio for scenarios 2–5.

    Acknowledgments

    We are grateful to our students and collaborators, Adithya Bhat, Rajarshi Biswas, Shashank Gugnani, Yujie Hui, Nusrat Islam, Haseeb Javed, Arjun Kashyap, Kunal Kulkarni, Tianxi Li, Yuke Li, Hao Qi, Md. Wasi-ur-Rahman, Haiyang Shi, and Jie Zhang, for their joint scientific work over the past ten years. We sincerely thank Shashank Gugnani, Haseeb Javed, Arjun Kashyap, Yuke Li, Hao Qi, and Haiyang Shi for their contributions to this collection or for proofreading several versions of this manuscript. Special thanks to Marie Lee, Kate Elwell, and Elizabeth Swayze from The MIT Press for their significant help in publishing this book. In addition, we are indebted to the National Science Foundation (NSF) for multiple grants (e.g., IIS-1447804, OAC-1636846, CCF-1822987, OAC-2007991, OAC-2112606, and CCF-2132049). This book would not have been possible without this support.

    Finally, we dedicate this book to our loving families (P. S. Panda, S. M. Panda, Debashree Pati, Abha Panda, Zonghe Lu, Haiying Yu, Sherry Peng, Ada Lu, Alivia Lu, Alan Lu, Dr. R. Shivashankar, G. S. Usharani, and Manju G. Siddappa) for their love and understanding during the long process of writing this book over the past five years.

    Dhabaleswar K. (DK) Panda, Xiaoyi Lu, and Dipti Shankar

    March 19, 2022

    1

    Introduction

    Human society is in an era of data explosion, in which data are growing exponentially. This era has been called the big data era, characterized by the 5Vs: volume, velocity, variety, veracity, and value. To tackle the challenges associated with these five Vs, a new field, high-performance big data computing, is emerging that aims to bring high-performance computing (HPC), big data processing, and deep learning onto a convergent trajectory. This book provides an in-depth overview of this field and the associated technical challenges, approaches, and solutions. This chapter gives a high-level overview of the research topics and challenges in the field and outlines the rest of the book.

    1.1 Overview

    During the last decade, big data has changed the way people understand and harness the power of data in both business and research domains. Big data has become one of the most important elements of business analytics, and big data, HPC, and deep learning/machine learning (DL/ML) are converging to meet large-scale data processing challenges. Running high-performance data analytics workloads in HPC and cloud computing environments is gaining popularity. According to a recent Hyperion research report (Norton et al., 2020), high-performance data analytics workloads have seen robust growth in the last few years, both in budget allocations and in organizational focus, and this trend is poised to continue over the next decade. The field of big data is even expanding into huge data (Wang et al.).

    In this context, challenging issues are emerging along the following four major directions: (1) understanding big data characteristics and trends; (2) understanding the interplay among big data, HPC, and deep learning/machine learning; (3) understanding the trends of HPC technologies (processing, networking, and storage) to accelerate big data processing; and (4) understanding the benefits of accelerating big data processing.

    1.2 Big Data Characteristics and Trends

    Traditionally, big data problems and solutions have been characterized by the 3Vs (volume, velocity, and variety). In recent years, a fourth V (veracity) has been added. These characteristics are illustrated in figure 1.1. Volume reflects the amount of data at rest to be processed. Velocity refers to data in motion. Variety refers to the wide range of data types that must be processed. Veracity refers to data in doubt.

    Figure 1.1

    Four V characteristics of big data.

    Data are being generated in many different forms and shapes by many different businesses, organizations, and entities. Figure 1.2 illustrates the amount of data generated every minute of the day by various entities in 2020. For example, five hundred hours of video are uploaded to YouTube every minute, WhatsApp users post around 41.67 million messages every minute, and Amazon ships 6,659 packages per minute. This scale creates a major challenge in designing appropriate systems for big data analytics.

    Figure 1.2

    Data Never Sleeps 8.0. Source: Courtesy of Domo.

    Efficient processing of big data with the 4Vs poses significant challenges for current-generation technologies, especially as data volumes keep growing. Large volumes of data typically result in out-of-core data processing and movement, as well as significant input/output (I/O) bottlenecks. Data in motion, popularly known as big velocity, requires real-time data processing capability, which places high-performance expectations on the underlying computing resources, networks, and storage systems for computation, communication, and I/O. The third V (variety) has led the big data community to develop several data processing frameworks, for example, Hadoop, Spark, Flink, Storm, and Kafka.

    However, it is unlikely that a single standardized specification or implementation will be converged upon in the near future, and because each framework is designed differently, it is hard to move toward highly optimized frameworks; optimizations proposed in the literature and in the community have mostly been made case by case. Therefore, to address the challenges presented by the 4Vs, there is a critical need to design next-generation big data software stacks that process data in a high-performance, scalable manner and optimally leverage the underlying network, computation, and storage capabilities. Against this backdrop, a fifth V (value) has been added. The value of certain data, and the business intelligence to be derived from it, can differ from organization to organization. A certain type of data might have significant value for one organization, which will therefore be willing to invest significant effort (and cost) in processing it, while another organization might see no such criticality. This trend is leading to various value propositions in the big data community for processing different types of data.

    1.3 Current Systems for Data Management and Processing

    Broadly, current-generation data management and processing systems in modern data centers work in two major tiers: a front-end tier and a back-end tier. The front-end tier software components are typically deployed to serve data-access queries and online data processing. The corresponding data management and processing software components in this tier usually include (1) web servers, such as Apache HTTP Server, NGINX, and Tomcat; (2) databases, such as MySQL, PostgreSQL, Oracle Database, and Microsoft SQL Server; (3) distributed caching layers, such as Memcached and Redis; and (4) NoSQL (Not Only SQL) databases, such as HBase and MongoDB. From the performance perspective, online applications require these front-end tier systems to process data with low latency and high throughput so that the data center can deliver a positive user experience. This is why system administrators usually deploy a distributed caching layer on top of the traditional database system: many data queries can then be served directly from cached copies in memory, which is much faster than loading the data from the database.
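    To make the caching pattern concrete, the following minimal sketch shows a cache-aside lookup using the redis-py client. It is an illustration only, not a design from this book: the connection parameters and the query_database helper are hypothetical placeholders. The logic is what matters: check the in-memory cache first and fall back to the database on a miss.

        import json
        import redis  # redis-py client for a Redis caching layer

        cache = redis.Redis(host="localhost", port=6379)  # hypothetical deployment

        def query_database(user_id):
            # Placeholder for a (slow) SQL query against the back-end database.
            raise NotImplementedError

        def get_user_profile(user_id, ttl=300):
            key = f"user:{user_id}"
            cached = cache.get(key)            # fast path: served from DRAM
            if cached is not None:
                return json.loads(cached)
            profile = query_database(user_id)  # slow path: query the database
            cache.setex(key, ttl, json.dumps(profile))  # cache the result with a TTL
            return profile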

    With increasing amounts of data being processed and stored by the front-end tier, data are gradually moved to the back-end tier for further processing, such as data mining, data cleaning, machine learning, and data warehousing. The major goal of the back-end tier software components is to mine value, in an offline fashion, from huge amounts of data through data analytics and machine learning or deep learning jobs. The corresponding data management and processing software components in this tier usually include (1) distributed storage systems, such as the Hadoop Distributed File System (HDFS), Ceph, and Swift; (2) data analytics middleware, such as Hadoop, Spark, and Flink; (3) machine learning or deep learning frameworks, such as TensorFlow and PyTorch; and (4) various data analytics and machine learning tools and libraries, such as MLlib and Keras. From the performance perspective, high throughput and horizontal scalability are the most important properties pursued by these back-end tier systems.
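    As a small illustration of the back-end analytics middleware mentioned above, the sketch below counts words with PySpark's public RDD API; the HDFS input path is a hypothetical placeholder for any text file visible to the cluster.

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("WordCount").getOrCreate()
        sc = spark.sparkContext

        # Hypothetical HDFS path; replace with any input the cluster can read.
        lines = sc.textFile("hdfs:///data/corpus.txt")

        counts = (lines.flatMap(lambda line: line.split())  # tokenize each line
                       .map(lambda word: (word, 1))         # emit (word, 1) pairs
                       .reduceByKey(lambda a, b: a + b))    # aggregate counts per word

        for word, count in counts.take(10):
            print(word, count)

        spark.stop()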

    In this book, we discuss the programming models and software architectures of example systems in both the front-end and back-end tiers. More details can be found in chapters 2 and 3.

    1.4 Technological Trends

    In the last few years, big data analytics and management software stacks have been significantly enhanced for performance and scalability. Among various factors, hardware evolution is one of the key drivers of this enhancement. The last few years have witnessed a rapid increase in the number of processor cores and an equally impressive increase in memory capacity and network bandwidth on modern cluster-based systems in both HPC centers and data centers. This growth has been fueled by current trends in multi-/many-core architectures; emerging heterogeneous memory technologies, such as DRAM, nonvolatile memory (NVM) or persistent memory (PMEM) (Qureshi et al., 2009; Kültürsay et al., 2013), high-bandwidth memory, and NVM Express solid-state drives (NVMe-SSDs); and high-speed interconnects, such as InfiniBand, Omni-Path, and RDMA (remote direct memory access) over Converged Enhanced Ethernet (RoCE).

    These multi-/many-core architectures, heterogeneous memory, and high-speed interconnects are currently gaining momentum for designing next-generation HPC and cloud computing environments. These novel hardware architectures with higher performance and advanced features open up many opportunities to redesign the big data analytics and management software stacks to achieve unprecedented performance and scalability.

    Thus, hardware-conscious, or architecture-aware, designs for big data analytics and management software stacks have been a fruitful research area. We have seen many exciting research results and promising performance improvements from architecture-aware optimizations, ranging from emerging memory technologies (such as NVM/PMEM) and high-speed interconnects (such as RDMA-enabled networks) that ease the I/O and communication bottlenecks to multicore/many-core parallel processing for big data analytics and management software stacks.

    In the HPC community, advanced technologies have been widely adopted to meet the challenge of processing and storing huge amounts of scientific data. Modern HPC systems and their associated middleware (such as the message passing interface, or MPI, burst buffers, and parallel file systems) have been exploiting advances in HPC technologies (multi-/many-core architectures, RDMA-enabled networking, NVRAMs, and SSDs) over the last few decades. However, current-generation out-of-the-box big data analytics and management software stacks (e.g., Hadoop, Spark, Flink, Memcached) have not fully embraced such technologies. For instance, recent studies (Rahman et al., 2014; Lu et al., 2014; Islam et al., 2016b; Shankar, Lu, Islam, et al., 2016; Y. Wang et al., 2015; Lim et al., 2014; Huang et al., 2014; Arulraj et al., 2015) have shed light on possible performance improvements for different big data middleware through RDMA over InfiniBand networks and the byte addressability and persistence of NVM. We discuss technological trends in modern HPC and data center clusters in more detail in chapter 4.

    1.5 Convergence in HPC, Big Data, and Deep Learning

    In recent years, the community has been witnessing an important convergence among three big fields, HPC, big data, and deep learning, as shown in figure 1.3. As the HPC environment keeps providing more advanced capabilities, the big data community has recently been able to take advantage of them. In the meantime, the deep learning community has been leveraging technological advances from both the HPC and big data fields to form its two critical pillars: unprecedented computing capability and huge amounts of data for model training. This convergence cycle continues year after year. We believe this trend will benefit all three fields, and we will see increasingly better solutions proposed and developed by these communities to achieve higher performance and scalability for end applications.

    Figure 1.3

    Convergence of HPC, big data, and deep learning.

    The convergence of HPC, big data, and deep learning is becoming the next game-changing business opportunity. This trend has led to many important research and development activities aimed at bringing HPC, big data processing, and deep learning onto a convergent trajectory. As demonstrated in figure 1.4, from a user’s perspective, many critical questions and challenges must be answered to make this convergence happen. Some example questions include the following:

    Figure 1.4

    Challenges in bringing HPC, big data processing, and deep learning into a convergent trajectory.

    • What are the major bottlenecks in current big data processing and deep learning middleware (e.g., Hadoop, Spark, TensorFlow, PyTorch)?

    • Can these bottlenecks be alleviated with new designs by taking advantage of HPC technologies?

    • Can RDMA-enabled high-performance interconnects, which are commonly deployed on HPC systems, benefit big data processing and deep learning systems and applications?

    • Can HPC clusters with high-performance storage systems (e.g., PMEM, NVMe-SSDs, parallel file systems) benefit big data and deep learning applications?

    • How much performance benefit can be achieved through enhanced designs or codesigns?

    • How do we design benchmarks for evaluating the performance of big data and deep learning middleware on HPC clusters?

    There are definitely more questions that can be added to this list. To help answer these questions, this book aims to provide an in-depth and systematic overview of the latest research findings in major and emerging topics for HPC + big data + deep learning over HPC clusters and clouds.

    As a starting point for exploring these research opportunities, we can deploy and run current-generation big data and deep learning jobs (e.g., Hadoop, Spark, and TensorFlow jobs) on existing HPC infrastructures, as shown in figure 1.5. Through workload characterization and performance analysis, we can then examine the potential efficiency and scalability bottlenecks in this execution model and stack.

    Figure 1.5

    Can we efficiently run big data and deep learning jobs on existing HPC infrastructure?

    Figure 1.6 provides a high-level overview of the challenges of achieving high-performance big data computing on HPC and cloud computing systems. The bottom layer of the figure shows the advanced technologies provided by HPC and cloud computing infrastructures, such as high-speed networking technologies, high-performance and commodity computing system architectures, and advanced storage technologies. The top layer shows the technology consumers, such as data-intensive applications, benchmarks, and workloads. In the middle are three important layers that help deliver near-peak hardware performance to the application layer: the communication and I/O library layer, the programming model layer, and the big data processing and management middleware layer. Each of these layers must be designed efficiently to expose maximum performance and flexibility to the components above it.

    Figure 1.6

    Challenges of high-performance big data computing. HDD, hard disk drive; NICs, network interface cards; QoS, quality of service; SR-IOV, single root input/output virtualization.

    The major challenges in designing a high-performance, scalable communication and I/O layer include efficient point-to-point communication protocols; thread models and synchronization mechanisms; virtualization support with near-native performance; low-latency, high-throughput I/O operations on file or storage systems; quality-of-service and fault tolerance support; and performance tuning. All of these are critical features of the desired communication and I/O library in a next-generation high-performance big data computing stack. A successful example of such a communication and I/O layer in the HPC community is MPI. Unfortunately, the big data community has not yet converged on a standardized communication and I/O layer, a situation that can be seen as a pre-MPI stage (R. Lu et al., 2014; Lu et al., 2011). Historical lessons tell us that high-performance big data computing needs a standard and efficient communication and I/O infrastructure.
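    For readers unfamiliar with MPI, the following minimal sketch, using the mpi4py bindings (one common Python interface to MPI), shows the kind of standardized point-to-point communication the HPC community converged on; launching with an MPI runner such as mpirun -np 2 python demo.py is assumed.

        from mpi4py import MPI  # Python bindings over a standard MPI library

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()  # each process gets a unique rank

        if rank == 0:
            # Rank 0 sends a Python object; MPI handles serialization and transport.
            comm.send({"payload": list(range(5))}, dest=1, tag=11)
        elif rank == 1:
            data = comm.recv(source=0, tag=11)
            print(f"rank 1 received: {data}")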

    Traditionally, big data processing and management middleware, such as Hadoop, Spark, HBase, and Memcached, has been designed on top of conventional communication and I/O protocols, such as TCP/IP, remote procedure calls (RPCs), and file system calls. These protocols are built on operating system-centric concepts and interfaces, such as sockets and the Portable Operating System Interface (POSIX), and they typically incur high overhead from context switches and buffer copies between user space and kernel space (Rahman et al., 2014; Lu et al., 2014; Islam et al., 2016b; Shankar, Lu, Islam, et al., 2016). As the underlying hardware provides more and more advanced capabilities, new programming models and interfaces are becoming available that offer purely user-space, zero-copy communication and I/O protocols to applications. For instance, RDMA is one such promising communication model, and it has been widely used in the HPC community for more than twenty years. In addition, PMEM- and NVMe-SSD-based I/O programming models are emerging in the storage community and have demonstrated substantial performance benefits for data-intensive applications compared with traditional POSIX-based I/O approaches (Klimovic et al., 2017; Cao et al., 2018; Xia et al., 2017; Islam et al., 2016b). These new programming models not only significantly improve the performance and scalability of big data processing and management middleware but also open up many new codesign opportunities for upper-layer systems and applications.
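    The copy and context-switch overhead of the conventional path can be illustrated even without RDMA hardware. The sketch below is our own simplified example, not a design from the literature cited above: it contrasts a read/send loop, in which every chunk of file data is copied into a user-space buffer and back into the kernel, with os.sendfile, which keeps the transfer inside the kernel on Linux. RDMA goes further still by bypassing the kernel on the remote side entirely.

        import os

        # Conventional path: each chunk crosses the user/kernel boundary twice,
        # once for read() and once for sendall(), with a buffer copy each time.
        def send_with_copies(sock, path, chunk_size=64 * 1024):
            with open(path, "rb") as f:
                while chunk := f.read(chunk_size):
                    sock.sendall(chunk)

        # Kernel-side zero-copy path (Linux): data flows from the page cache to
        # the socket inside the kernel and never enters user space.
        def send_zero_copy(sock, path):
            with open(path, "rb") as f:
                size = os.fstat(f.fileno()).st_size
                offset = 0
                while offset < size:
                    sent = os.sendfile(sock.fileno(), f.fileno(), offset, size - offset)
                    offset += sent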

    While the research community exploits these commodity hardware platforms and technologies (e.g., RDMA, NVMe, PMEM), many high-tech companies (such as Google and Amazon) build their own proprietary chips, motherboards, and networks. Their networking, I/O, and software stacks are presumably optimized to exploit the unique capabilities of their hardware devices. Because the technical details of those proprietary designs are unavailable, we do not discuss them in this book. The major goal of all these designs is similar, however: to significantly improve the performance and scalability of current-generation big data analytics and management systems to meet the growing challenges of huge data or big data.

    In the meantime, we should note that several big cloud providers (such as Microsoft Azure, AWS, Oracle Cloud, and Alibaba Cloud) have been adopting HPC networking technologies (such as InfiniBand and RoCE) in their latest HPC instances, so the designs discussed in this book can also run on these cloud HPC instances. Many social-site data centers, such as those of Facebook, Microsoft, and Alibaba, have likewise adopted InfiniBand and RoCE. In addition to traditional big data analytics workloads, these data centers now run deep learning and artificial intelligence workloads. More details about these cloud-based designs are discussed in chapter 11.

    More technical challenges of designing high-performance big data computing systems and applications will be discussed in detail in chapter 5.

    1.6 Outline of the Book

    Based on the preceding discussion of the research challenges in achieving high-performance big data computing, this book is organized into five parts with twelve chapters in total, as shown in figure 1.7.

    Figure 1.7

    Outline of the book.

    • Chapters 1–4 describe the basic introductory concepts and background knowledge necessary for a good understanding of HPC, big data, deep learning, and so on. Chapter 1 has presented a global view of the field of high-performance big data computing. Chapter 2 describes popular
