DOI: 10.1145/3640537.3641568
research-article
Open access

BLQ: Light-Weight Locality-Aware Runtime for Blocking-Less Queuing

Published: 20 February 2024

Abstract

Message queues are widely used in parallel processing systems for worker-thread synchronization. When the throughput of upstream and downstream tasks is mismatched, the queue buffer is often either empty or full. Polling on an empty or full queue hurts the performance of the upstream or downstream threads, because those polling cycles could have been spent on other computation. A non-blocking queue is an alternative that lets the application choose to spend those cycles on other tasks. However, application programmers should not have to bear this burden, because a good decision about what to do upon blocking must take a great deal of runtime-environment information into account.
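To make the non-blocking alternative concrete, the following minimal sketch shows a consumer that runs other work whenever the queue is empty instead of polling it. It uses Python's standard `queue` module; the function name `drain_or_do_other_work` and the `other_work` callback are hypothetical and are not taken from the paper.

```python
import queue


def drain_or_do_other_work(q, other_work, max_items):
    """Consume up to max_items from q without ever blocking.

    Whenever q is empty, call other_work() instead of spinning.
    other_work() returns False when nothing else is left to do.
    Returns the consumed items and the number of spared polling rounds.
    """
    consumed = []
    spared = 0
    while len(consumed) < max_items:
        try:
            # Non-blocking pop: raises queue.Empty instead of waiting.
            consumed.append(q.get_nowait())
        except queue.Empty:
            # The cycle that would have been a poll is spent elsewhere.
            if not other_work():
                break
            spared += 1
    return consumed, spared
```

Note that the choice of what `other_work` does is left entirely to the caller, which is exactly the burden the abstract argues a runtime should take over.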
This paper proposes the Blocking-Less Queuing Runtime (BLQ), a systematic solution that finds suitable strategies at (or before) blocking while lightening the programmer's burden. BLQ combines a set of techniques, including yielding, advanced dynamic queue-buffer resizing, and resource-aware task scheduling. An evaluation on high-end servers shows that, with BLQ, a diverse set of parallel queuing workloads can reduce blocking and lower cache misses. BLQ outperforms the baseline runtime considerably, with a peak speedup of up to 3.8×.
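As an illustration of two of the strategies named above (yielding and queue-buffer resizing), here is a small single-threaded sketch of a producer-side policy. It is an assumption made for illustration, not BLQ's actual algorithm: the class `ResizableQueue`, the function `push_with_policy`, the `max_yields` threshold, and the doubling rule are all hypothetical, and the code is not thread-safe.

```python
from collections import deque


class ResizableQueue:
    """Bounded FIFO whose capacity can grow (hypothetical, not thread-safe)."""

    def __init__(self, capacity, max_capacity):
        self.items = deque()
        self.capacity = capacity
        self.max_capacity = max_capacity

    def try_push(self, item):
        # Non-blocking push: report failure instead of waiting.
        if len(self.items) >= self.capacity:
            return False
        self.items.append(item)
        return True

    def try_pop(self):
        # Non-blocking pop: returns None when the queue is empty.
        return self.items.popleft() if self.items else None


def push_with_policy(q, item, yield_fn, max_yields=2):
    """On a full queue, yield a few times, then double the buffer."""
    for _ in range(max_yields):
        if q.try_push(item):
            return "pushed"
        yield_fn()  # give the consumer a chance to drain the queue
    if q.try_push(item):
        return "pushed"
    if q.capacity < q.max_capacity:
        # Yielding did not help: grow the buffer, capped at max_capacity.
        q.capacity = min(q.capacity * 2, q.max_capacity)
        q.try_push(item)
        return "resized"
    return "full"
```

In a real runtime `yield_fn` might hand the core to another task; here it is any callable, so a test can pass one that pops an item to simulate the consumer making progress.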

Published In

CC 2024: Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction
February 2024
261 pages
ISBN:9798400705076
DOI:10.1145/3640537
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Message Queue
  2. Parallel Processing
  3. Runtime

Qualifiers

  • Research-article

Conference

CC '24