No barrier in the road: a comprehensive study and optimization of ARM barriers

Authors:

Haibo Chen

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Pages 348 - 361

https://doi.org/10.1145/3332466.3374535

Published: 19 February 2020

Abstract

In this paper, we present the first comprehensive performance characterization and optimization of ARM barriers on both mobile and server platforms. We draw a set of observations through several abstracted models and validate them in scenarios where barriers are intensively used. We find that (1) order-preserving approaches without involving the bus significantly outperform other approaches, and (2) the tremendous overhead mostly comes from barriers strictly following remote memory references. Usually, such barriers are inserted when threads are exchanging data, and they are used to ensure the relative order between storing the data to a shared buffer and setting a flag to inform the receiver. Based on the observations, we propose a new mechanism, Pilot, to remove such barriers by leveraging the single-copy atomicity to piggyback the flag with the data. Applying Pilot only requires minor changes to applications and provides 10%-360% performance improvements in multiple benchmarks, which are close to the ideal performance without barriers.
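
The piggybacking idea in the abstract can be illustrated with a small sketch. It is not the authors' Pilot implementation: the 32-bit payload plus 32-bit sequence-number layout, and the names `pack`, `unpack`, `Channel`, `send`, and `try_recv`, are assumptions made for illustration. Python cannot exhibit hardware reordering, so the sketch only models the word layout. On real ARM hardware the single shared word would be an aligned 64-bit location, and the scheme would rely on that access being single-copy atomic.

```python
MASK32 = 0xFFFFFFFF


def pack(payload: int, seq: int) -> int:
    """Combine a 32-bit payload and a 32-bit sequence number into one 64-bit word.

    The layout (sequence number in the high half) is a hypothetical choice.
    """
    return ((seq & MASK32) << 32) | (payload & MASK32)


def unpack(word: int) -> tuple[int, int]:
    """Split a 64-bit word back into (payload, seq)."""
    return word & MASK32, (word >> 32) & MASK32


class Channel:
    """One shared slot standing in for an aligned 64-bit memory location."""

    def __init__(self) -> None:
        self.word = pack(0, 0)  # seq 0 means "nothing sent yet"


def send(ch: Channel, payload: int, seq: int) -> None:
    # A single store publishes the payload and the flag together, so there
    # is no separate flag store that must be ordered after the data store.
    ch.word = pack(payload, seq)


def try_recv(ch: Channel, last_seq: int):
    # A single load observes both fields; a changed sequence number plays
    # the role of the "data is ready" flag.
    payload, seq = unpack(ch.word)
    if seq == last_seq:
        return None
    return payload, seq


ch = Channel()
received = []
last = 0
for i, value in enumerate([7, 11, 13], start=1):
    send(ch, value, i)
    msg = try_recv(ch, last)
    if msg is not None:
        payload, last = msg
        received.append(payload)

print(received)  # prints [7, 11, 13]
```

In the conventional pattern the sender performs two stores (data, then flag) with a barrier between them. Here the receiver can never see the flag without the data because both travel in the same word, which is the property the abstract attributes to single-copy atomicity.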

Index Terms

No barrier in the road: a comprehensive study and optimization of ARM barriers
1. Software and its engineering
  1. Software organization and properties

Published In

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 2020

454 pages

ISBN:9781450368186

DOI:10.1145/3332466

General Chair: Rajiv Gupta (UC Riverside)

Program Chair: Xipeng Shen (NCSU)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication Notes

Badge change: Article originally badged under Version 1.0 guidelines https://www.acm.org/publications/policies/artifact-review-badging

Publication History

Published: 19 February 2020

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China

Conference

PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

February 22 - 26, 2020

San Diego, California

Acceptance Rates

PPoPP '20 Paper Acceptance Rate 28 of 121 submissions, 23%;

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Cited By

Wang J, Liu Y, Fu M, Härtig H, Chen H, Agrawal K, Petrank E (2024). Brief Announcement: Work Stealing through Partial Asynchronous Delegation. Proceedings of the 36th ACM Symposium on Parallelism in Algorithms and Architectures, pp. 281-283. DOI: 10.1145/3626183.3660261. Online publication date: 17-Jun-2024
https://dl.acm.org/doi/10.1145/3626183.3660261
Beck M, Bhat K, Stričević L, Chen G, Behrens D, Fu M, Vafeiadis V, Chen H, Härtig H, Aamodt T, Jerger N, Swift M (2023). AtoMig: Automatically Migrating Millions Lines of Code from TSO to WMM. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 61-73. DOI: 10.1145/3575693.3579849. Online publication date: 27-Jan-2023
https://dl.acm.org/doi/10.1145/3575693.3579849
Gao M, Chakraborty S, Ozkan B, Aamodt T, Jerger N, Swift M (2023). Probabilistic Concurrency Testing for Weak Memory Programs. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pp. 603-616. DOI: 10.1145/3575693.3575729. Online publication date: 27-Jan-2023
https://dl.acm.org/doi/10.1145/3575693.3575729
Gouicem R, Sprokholt D, Ruehl J, Rocha R, Spink T, Chakraborty S, Bhatotia P, Aamodt T, Jerger N, Swift M (2023). Risotto: A Dynamic Binary Translator for Weak Memory Model Architectures. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1, pp. 107-122. DOI: 10.1145/3567955.3567962. Online publication date: 25-Mar-2023
https://dl.acm.org/doi/10.1145/3567955.3567962
Liu N, Gu J, Tang D, Li K, Zang B, Chen H, Lee J, Agrawal K, Spear M (2022). Asymmetry-aware scalable locking. Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 294-308. DOI: 10.1145/3503221.3508420. Online publication date: 2-Apr-2022
https://dl.acm.org/doi/10.1145/3503221.3508420
Park S, Lee J, Kim H (2022). Software-Level Memory Regulation to Reduce Execution Time Variation on Multicore Real-Time Systems. IEEE Access, vol. 10, pp. 93799-93811. DOI: 10.1109/ACCESS.2022.3203702. Online publication date: 2022
https://doi.org/10.1109/ACCESS.2022.3203702
de Lima Chehab R, Paolillo A, Behrens D, Fu M, Härtig H, Chen H (2021). CLoF. Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles, pp. 851-865. DOI: 10.1145/3477132.3483557. Online publication date: 26-Oct-2021
https://dl.acm.org/doi/10.1145/3477132.3483557
Oberhauser J, Chehab R, Behrens D, Fu M, Paolillo A, Oberhauser L, Bhat K, Wen Y, Chen H, Kim J, Vafeiadis V, Sherwood T, Berger E, Kozyrakis C (2021). VSync: push-button verification and optimization for synchronization primitives on weak memory models. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 530-545. DOI: 10.1145/3445814.3446748. Online publication date: 19-Apr-2021
https://dl.acm.org/doi/10.1145/3445814.3446748
Zhang Y, Yang W, Li K, Tang D, Li K (2021). Performance Analysis and Optimization for SpMV Based on Aligned Storage Formats on an ARM Processor. Journal of Parallel and Distributed Computing. DOI: 10.1016/j.jpdc.2021.08.002. Online publication date: Aug-2021
https://doi.org/10.1016/j.jpdc.2021.08.002
Oberhauser J, Oberhauser L, Paolillo A, Behrens D, Fu M, Vafeiadis V (2021). Verifying and Optimizing the HMCS Lock for Arm Servers. Networked Systems, pp. 240-260. DOI: 10.1007/978-3-030-91014-3_17. Online publication date: 2-Dec-2021
https://doi.org/10.1007/978-3-030-91014-3_17
