research-article

Open access

SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems

Authors:

Ashen Ekanayake,

Jonathan Beard,

Lizy John

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing

Article No.: 58, Pages 1 - 12

https://doi.org/10.1145/3545008.3545044

Published: 13 January 2023

Abstract

As core counts increase and cache hierarchies deepen, scaling multi-threaded and task-level parallel workloads becomes increasingly challenging. A key challenge to scaling the number of communicating tasks (or threads) is the rate at which existing communication mechanisms scale in terms of latency and bandwidth. Architectures with hardware-accelerated queuing operations have the potential to reduce the latency and improve the scalability of moving data between processing elements, reducing synchronization penalties and thereby improving the performance of task-level parallel workloads. While hardware queues reduce synchronization penalties, they cannot fully hide load-to-use latency; that is, perfect pipelines are often not realized. There is, however, potential for better overlap. If the inter-processor communication latency is less than or equal to the time the consumer spends processing a message, all of that latency can be overlapped with the consumer's processing. We exploit this property to speed up parallel applications beyond what existing hardware queues achieve.
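The overlap condition above can be illustrated with a toy timing model. This is a minimal sketch under simplifying assumptions (a single consumer, a fixed per-message latency and processing time, and a perfectly timed push); the function names and parameters are hypothetical and are not taken from the paper.

```python
def serial_time(n_messages, latency, processing):
    """Consumer requests each message only after finishing the previous one,
    so every transfer latency is fully exposed."""
    return n_messages * (latency + processing)


def overlapped_time(n_messages, latency, processing):
    """Message k+1 is in flight while message k is being processed.

    Only the first transfer is fully exposed. Each later transfer stalls the
    consumer only for the part of the latency that exceeds the processing time.
    """
    if n_messages == 0:
        return 0
    stall = max(0, latency - processing)
    return latency + n_messages * processing + (n_messages - 1) * stall


def speedup(n_messages, latency, processing):
    """Ratio of serial to overlapped time in this toy model."""
    return serial_time(n_messages, latency, processing) / overlapped_time(
        n_messages, latency, processing
    )
```

In this model, when `latency <= processing` the stall term is zero and only the first transfer contributes to the total; when `latency > processing`, each message after the first still exposes the difference.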

In this paper, we present SPAMeR, a speculation mechanism built on top of a state-of-the-art hardware-driven message queue architecture. SPAMeR can speculatively push messages in anticipation of consumer message requests. Unlike prefetching approaches, which predict which addresses to fetch next, a queue tells us exactly what data is needed next but not when it is needed; SPAMeR adds algorithms that attempt to predict this timing. We evaluate the effectiveness of SPAMeR on a diverse set of task-parallel benchmarks using the gem5 full-system simulator and observe a 1.33× average speedup.
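To make the "what versus when" distinction concrete, the sketch below shows one generic way a timing predictor could work: it tracks an exponential moving average of the gap between consumer requests and derives a time at which to start a push. This is purely an illustrative assumption and not SPAMeR's algorithm, which the abstract does not describe; the class name, the `alpha` parameter, and the method names are all hypothetical.

```python
class RequestIntervalPredictor:
    """Hypothetical predictor: exponential moving average (EMA) of the gap
    between successive consumer requests. Not the paper's mechanism."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha          # weight given to the most recent gap
        self.last_request = None    # timestamp of the latest observed request
        self.avg_gap = None         # EMA of inter-request gaps

    def observe(self, now):
        """Record a consumer request seen at time `now`."""
        if self.last_request is not None:
            gap = now - self.last_request
            if self.avg_gap is None:
                self.avg_gap = gap
            else:
                self.avg_gap = self.alpha * gap + (1 - self.alpha) * self.avg_gap
        self.last_request = now

    def next_push_time(self, latency):
        """Time to start pushing so the message lands near the predicted request.

        Returns None until at least two requests have been observed.
        """
        if self.avg_gap is None:
            return None
        predicted_request = self.last_request + self.avg_gap
        # A push cannot start before the latest request was observed.
        return max(self.last_request, predicted_request - latency)
```

A caller would invoke `observe` on each consumer request and treat a `None` result from `next_push_time` as "fall back to on-demand delivery".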

Cited By

T. Twardzik, L. Nolte, C. Jalier, J. Shi, T. Wild, and A. Herkersdorf. 2024. HASIIL: Hardware-Assisted Scheduling to Improve IPC Latency in Linux. In Proceedings of the 21st ACM International Conference on Computing Frontiers. 80–87. https://doi.org/10.1145/3649153.3649197 (online publication date: 7 May 2024)

Index Terms

SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems
1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multicore architectures

Published In

ICPP '22: Proceedings of the 51st International Conference on Parallel Processing

August 2022

976 pages

ISBN:9781450397339

DOI:10.1145/3545008

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

NSF (National Science Foundation)

Conference

ICPP '22

ICPP '22: 51st International Conference on Parallel Processing

August 29 - September 1, 2022

Bordeaux, France

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%
