DOI: 10.1145/3624062.3624593
research-article

Throughput Optimization with a NUMA-Aware Runtime System for Efficient Scientific Data Streaming

Published: 12 November 2023

Abstract

With the surge in data generation rates from advanced scientific instruments, there is an urgent need for effective network management and resource utilization strategies for data streaming. Present strategies often lag behind hardware advancements, leading to resource underutilization. Modern servers typically employ non-uniform memory access (NUMA) multiprocessors, which, despite their benefits, can pose performance challenges. This paper presents a novel runtime system tailored for efficient multi-stream data management, optimizing both its compression and decompression phases, and enhancing network I/O based on the server’s unique hardware design. Our system coordinates parallel tasks for data compression, decompression, and transfer, aiming to reduce network data influx. Empirical tests show that aligning streaming tasks with the right NUMA domain results in a 1.48X throughput boost compared to cutting-edge methods and a 2.6X improvement over standard techniques.
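
To make the idea of aligning streaming tasks with a NUMA domain concrete, the following is a minimal sketch and not the system described in the paper. It assumes a hypothetical placement policy in which sender threads take cores on the NUMA domain local to the network card, and compression threads take whatever cores remain, nearest domain first. The names `parse_cpulist`, `plan_placement`, `nic_node`, and the task labels are illustrative assumptions; the topology is passed in as plain data in the same format Linux uses for CPU lists (for example `0-3,8-11`).

```python
from typing import Dict, List


def parse_cpulist(cpulist: str) -> List[int]:
    """Expand a Linux-style CPU list such as "0-3,8-11" into CPU ids."""
    cpus: List[int] = []
    for part in cpulist.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus


def plan_placement(
    topology: Dict[int, List[int]],
    nic_node: int,
    n_senders: int,
    n_compressors: int,
) -> Dict[str, int]:
    """Map each task to one core (hypothetical policy, for illustration only).

    Cores on the NIC-local domain are handed out first, so sender threads
    land next to the network card; compression threads fill the cores
    that are left, spilling over to the other domains when needed.
    """
    local = list(topology[nic_node])
    remote = [
        cpu
        for node in sorted(topology)
        if node != nic_node
        for cpu in topology[node]
    ]
    free = local + remote
    if n_senders + n_compressors > len(free):
        raise ValueError("more tasks than available cores")

    plan: Dict[str, int] = {}
    for i in range(n_senders):
        plan[f"send-{i}"] = free.pop(0)
    for i in range(n_compressors):
        plan[f"compress-{i}"] = free.pop(0)
    return plan


if __name__ == "__main__":
    # Assumed two-domain server with the NIC attached to domain 1.
    topology = {0: parse_cpulist("0-3"), 1: parse_cpulist("4-7")}
    for task, cpu in plan_placement(topology, 1, 2, 4).items():
        print(task, "-> cpu", cpu)
```

On Linux, a thread could apply its assigned core with `os.sched_setaffinity(0, {cpu})`; that step is left out here so the sketch runs on any platform. Whether senders or compressors should get priority on the NIC-local domain is a design choice that depends on the workload, and the ordering above is only one plausible option.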

Supplemental Material

MP4 File
Recording of "Throughput Optimization with a NUMA-Aware Runtime System for Efficient Scientific Data Streaming" presentation at INDIS 2023.

      Published In

      SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
      November 2023
      2180 pages
      ISBN:9798400707858
      DOI:10.1145/3624062
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Heterogeneous architectures
      2. data compression/decompression
      3. data streaming
      4. non-uniform memory access (NUMA)
      5. performance optimization
      6. runtime systems

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      SC-W 2023
