DOI: 10.1145/3545008.3545044
research-article
Open access

SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems

Published: 13 January 2023

Abstract

With increasing core counts and multiple levels of cache memory, scaling multi-threaded and task-level parallel workloads is becoming steadily more difficult. A key challenge in scaling the number of communicating tasks (or threads) is the rate at which the latency and bandwidth of existing communication mechanisms scale. Architectures with hardware-accelerated queuing operations can reduce the latency and improve the scalability of moving data between processing elements, which reduces synchronization penalties and thereby improves the performance of task-level parallel workloads. While hardware queues reduce synchronization penalties, they cannot fully hide load-to-use latency; that is, perfect pipelines are often not realized. There is, however, potential for better overlap. If the inter-processor communication latency is equal to or less than the time the consumer spends processing a message, all of that latency can be overlapped with the consumer's processing. We exploit this property to speed up parallel applications beyond what existing hardware queues achieve.
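The overlap property can be illustrated with a toy timing model. The sketch below is not taken from the paper: the function names, the fixed per-message latency, and the fixed processing time are all assumptions made for illustration. It compares a consumer that requests each message only after finishing the previous one with a consumer whose next message is already in flight while it processes the current one.

```python
def demand_fetch_time(n_messages, latency, processing):
    """Total time when every message is requested only after the previous
    one has been processed, so each message pays the full latency."""
    return n_messages * (latency + processing)


def overlapped_time(n_messages, latency, processing):
    """Total time when the transfer of message i+1 starts at the moment the
    consumer starts processing message i (hypothetical ideal push)."""
    arrival = latency  # the first message cannot be hidden
    finish = 0.0
    for _ in range(n_messages):
        # The consumer needs both the data and a free core to start.
        start = max(arrival, finish)
        finish = start + processing
        # The next message travels while the current one is processed.
        arrival = start + latency
    return finish


if __name__ == "__main__":
    # Latency (10) <= processing time (25): only the first latency is exposed.
    print(demand_fetch_time(4, 10, 25), overlapped_time(4, 10, 25))
    # Latency (30) > processing time (25): the latency becomes the bottleneck.
    print(demand_fetch_time(4, 30, 25), overlapped_time(4, 30, 25))
```

In this model, when the latency does not exceed the processing time, the overlapped total is one latency plus the processing time of all messages; otherwise the consumer stalls on every message and the latency dominates.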
In this paper, we present SPAMeR, a speculation mechanism built on top of a state-of-the-art hardware-driven message queue architecture. SPAMeR can speculatively push messages in anticipation of consumer message requests. Unlike prefetch approaches, which predict what addresses to fetch next, a queue tells us exactly what data is needed next but not when it is needed; SPAMeR adds algorithms that attempt to predict this timing. We evaluate the effectiveness of SPAMeR with a diverse set of task-parallel benchmarks using the gem5 full-system simulator and observe a 1.33× average speedup.
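To make the "when, not what" prediction problem concrete, here is a minimal sketch of one possible timing predictor. It is an assumption for illustration only and is not SPAMeR's algorithm, which the abstract does not describe: the class name, the exponential moving average, and the `alpha` parameter are all hypothetical. It estimates the next consumer request time from past request intervals and derives the time at which a push would have to start so that the message arrives just as it is requested.

```python
class RequestTimePredictor:
    """Hypothetical predictor: smooths the intervals between consumer
    requests with an exponential moving average (not the paper's method)."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha          # weight of the most recent interval
        self.last_request = None    # timestamp of the latest request
        self.avg_interval = None    # smoothed interval between requests

    def observe(self, request_time):
        """Record the time at which the consumer issued a request."""
        if self.last_request is not None:
            interval = request_time - self.last_request
            if self.avg_interval is None:
                self.avg_interval = interval
            else:
                self.avg_interval = (self.alpha * interval
                                     + (1 - self.alpha) * self.avg_interval)
        self.last_request = request_time

    def predict_next_request(self):
        """Return the predicted time of the next request, or None if
        fewer than two requests have been observed."""
        if self.avg_interval is None:
            return None
        return self.last_request + self.avg_interval

    def push_time(self, latency):
        """Return when a push must start to arrive at the predicted time."""
        predicted = self.predict_next_request()
        return None if predicted is None else predicted - latency


if __name__ == "__main__":
    predictor = RequestTimePredictor()
    for t in (0, 100, 200):
        predictor.observe(t)
    print(predictor.predict_next_request(), predictor.push_time(latency=30))
```

A real mechanism would also need to handle mispredictions, for example a push that arrives before the consumer has buffer space for it; this sketch leaves that out.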

Cited By

  • (2024) HASIIL: Hardware-Assisted Scheduling to Improve IPC Latency in Linux. Proceedings of the 21st ACM International Conference on Computing Frontiers, 80-87. DOI: 10.1145/3649153.3649197. Online publication date: 7 May 2024.

    Published In

    ICPP '22: Proceedings of the 51st International Conference on Parallel Processing
    August 2022
    976 pages
    ISBN:9781450397339
    DOI:10.1145/3545008
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. message queue
    2. multi-core system
    3. parallel computing
    4. speculation

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICPP '22
    ICPP '22: 51st International Conference on Parallel Processing
    August 29 - September 1, 2022
    Bordeaux, France

    Acceptance Rates

    Overall Acceptance Rate 91 of 313 submissions, 29%

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 136
    • Downloads (Last 6 weeks): 37
    Reflects downloads up to 18 Aug 2024
