Throughput Optimization with a NUMA-Aware Runtime System for Efficient Scientific Data Streaming

Authors: Hasibul Jamil, Joaquin Chung, Tekin Bicer, Tevfik Kosar, Rajkumar Kettimuthu

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis

Pages 795 - 805

https://doi.org/10.1145/3624062.3624593

Published: 12 November 2023

Abstract

With the surge in data generation rates from advanced scientific instruments, there is an urgent need for effective network management and resource utilization strategies for data streaming. Present strategies often lag behind hardware advancements, leading to resource underutilization. Modern servers typically employ non-uniform memory access (NUMA) multiprocessors, which, despite their benefits, can pose performance challenges. This paper presents a novel runtime system tailored for efficient multi-stream data management, optimizing both its compression and decompression phases, and enhancing network I/O based on the server’s unique hardware design. Our system coordinates parallel tasks for data compression, decompression, and transfer, aiming to reduce network data influx. Empirical tests show that aligning streaming tasks with the right NUMA domain results in a 1.48X throughput boost compared to cutting-edge methods and a 2.6X improvement over standard techniques.
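
The NUMA alignment the abstract describes can be illustrated with a minimal sketch. It is not the authors' runtime system: it only shows, on Linux, how a worker process might be restricted to the CPUs of the NUMA node that hosts the network interface. The helper names (`parse_cpulist`, `cpus_of_node`, `nic_node`, `pin_to_node`) are hypothetical, and the sysfs paths assume a Linux host with a PCI-attached NIC.

```python
import os
from pathlib import Path


def parse_cpulist(text: str) -> set[int]:
    """Parse a Linux cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_of_node(node: int) -> set[int]:
    """Read the CPUs that belong to one NUMA node from sysfs (Linux only)."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    return parse_cpulist(path.read_text())


def nic_node(iface: str) -> int:
    """Return the NUMA node a NIC is attached to; -1 means the kernel does not know."""
    path = Path(f"/sys/class/net/{iface}/device/numa_node")
    return int(path.read_text().strip())


def pin_to_node(node: int) -> set[int]:
    """Restrict the caller to the CPUs of one NUMA node (Linux only)."""
    cpus = cpus_of_node(node)
    os.sched_setaffinity(0, cpus)  # pid 0 refers to the caller
    return cpus
```

A sender worker could call `pin_to_node(nic_node("eth0"))` before opening its sockets, where `"eth0"` is a placeholder interface name. This sets CPU affinity only. Under Linux's default local allocation policy, memory the worker touches afterwards tends to be placed on the same node, but explicit memory binding (for example with `numactl --cpunodebind` and `--membind`) is a separate step.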

Supplemental Material

MP4 File

Recording of "Throughput Optimization with a NUMA-Aware Runtime System for Efficient Scientific Data Streaming" presentation at INDIS 2023.


Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures
2. Hardware
  1. Communication hardware, interfaces and storage
    1. Networking hardware

Published In

SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis

November 2023

2180 pages

ISBN: 9798400707858

DOI: 10.1145/3624062

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

SC-W 2023

SC-W 2023: Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis

November 12 - 17, 2023

Denver, CO, USA
