BLQ: Light-Weight Locality-Aware Runtime for Blocking-Less Queuing

Authors:

Jonathan Beard,

Lizy John

CC 2024: Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction

Pages 100 - 112

https://doi.org/10.1145/3640537.3641568

Published: 20 February 2024

Abstract

Message queues are widely used in parallel processing systems to synchronize worker threads. When the throughput of upstream and downstream tasks is mismatched, the queue buffer is often either empty or full. Polling on an empty or full queue hurts the performance of the upstream or downstream thread, because those polling cycles could have been spent on other computation. A non-blocking queue is an alternative that allows polling cycles to be spent on other tasks of the application's choosing. However, application programmers should not have to bear this burden, because a good decision about what to do upon blocking must take a great deal of runtime-environment information into account.

This paper proposes the Blocking-Less Queuing Runtime (BLQ), a systematic solution that finds a proper strategy at (or before) the point of blocking while lightening the programmer's burden. BLQ combines a set of techniques, including yielding, advanced dynamic resizing of queue buffers, and resource-aware task scheduling. An evaluation on high-end servers shows that BLQ reduces blocking and lowers cache misses across a diverse set of parallel queuing workloads. BLQ outperforms the baseline runtime considerably (with up to 3.8× peak speedup).
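As a rough illustration of the yielding idea mentioned above, the following sketch shows a producer and a consumer that attempt non-blocking queue operations and give up the CPU when the queue is full or empty, rather than spinning. It is not BLQ's implementation: the function names (`push_or_yield`, `pop_or_yield`, `run_pipeline`) and the `_DONE` marker are hypothetical, Python's `queue.Queue` is lock-based internally, and `time.sleep(0)` is used here only as a stand-in for a runtime-level yield. Buffer resizing and resource-aware scheduling are not modeled.

```python
import queue
import threading
import time

_DONE = object()  # hypothetical end-of-stream marker, not part of BLQ


def push_or_yield(q, item):
    """Non-blocking push; on a full queue, yield the CPU and retry.

    Returns how many times the caller yielded before the push succeeded.
    """
    yields = 0
    while True:
        try:
            q.put_nowait(item)
            return yields
        except queue.Full:
            yields += 1
            time.sleep(0)  # stand-in for a yield: lets other threads run


def pop_or_yield(q):
    """Non-blocking pop; on an empty queue, yield the CPU and retry.

    Returns (item, number_of_yields).
    """
    yields = 0
    while True:
        try:
            return q.get_nowait(), yields
        except queue.Empty:
            yields += 1
            time.sleep(0)


def run_pipeline(n_items, capacity):
    """Run one producer and one consumer over a bounded queue."""
    q = queue.Queue(maxsize=capacity)
    received = []
    # Each counter is written by exactly one thread.
    stats = {"producer_yields": 0, "consumer_yields": 0}

    def producer():
        for i in range(n_items):
            stats["producer_yields"] += push_or_yield(q, i)
        stats["producer_yields"] += push_or_yield(q, _DONE)

    def consumer():
        while True:
            item, yields = pop_or_yield(q)
            stats["consumer_yields"] += yields
            if item is _DONE:
                return
            received.append(item)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received, stats
```

Calling `run_pipeline(200, 4)` returns the items in order together with the yield counts for each side. A small capacity relative to the throughput mismatch tends to raise those counts, which is the situation the abstract describes.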

Index Terms

1. Software and its engineering
  1. Software notations and tools
    1. Development frameworks and environments
      1. Application specific development environments
    2. Software libraries and repositories

Published In

CC 2024: Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction

February 2024

261 pages

ISBN:9798400705076

DOI:10.1145/3640537

General Chair: Gabriel Rodríguez (Universidade da Coruña, Spain)

Program Chairs: P. Sadayappan (University of Utah, USA), Aravind Sukumaran-Rajam (Meta, USA)

Copyright © 2024 Owner/Author.

This work is licensed under a Creative Commons Attribution International 4.0 License.

Sponsors

SIGPLAN: ACM Special Interest Group on Programming Languages

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

CC '24

Sponsor:

SIGPLAN

CC '24: 33rd ACM SIGPLAN International Conference on Compiler Construction

March 2 - 3, 2024

Edinburgh, United Kingdom
