Analyzing the memory ordering models of the Apple M1

Published: 01 April 2024

Abstract

The Apple M1 ARM processor family implements two memory consistency models: the conventional ARM weak memory ordering and the total store ordering (TSO) model of the x86 architecture, which Apple’s x86 emulator, Rosetta 2, relies on. Because both memory ordering models are available on the same hardware, we can benchmark them thoroughly and compare their performance characteristics and worst-case workloads.
In this paper, we assess the performance implications of TSO on the Apple M1 processor architecture. Based on the multi-threaded workloads of the SPEC2017 CPU FP benchmark suite, our findings indicate that TSO is, on average, 8.94 percent slower than ARM’s weaker memory ordering. Using synthetic benchmarks, we further explore the workloads that suffer the most significant performance degradation under TSO. We also take a closer look at the specific atomic instructions provided by the ARMv8.3 specification and their synchronization overheads.
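
To make the difference between the two models concrete, the following is a minimal message-passing sketch in C++. It is not taken from the paper; the function name `count_stale_reads` and the payload value are hypothetical and chosen only for illustration. A producer writes a plain variable and then publishes a flag, and a consumer waits for the flag and reads the variable. Under TSO the hardware already keeps the two stores in order, so the release/acquire annotations typically cost nothing extra on x86. Under ARM's weak ordering the compiler has to emit ordering instructions for them, typically `stlr` for the release store and `ldar` (or `ldapr` where the ARMv8.3 RCpc extension is targeted) for the acquire load.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper (not from the paper): runs the message-passing
// pattern `rounds` times and returns how often the consumer observed the
// flag but read a stale payload. With release/acquire this must be 0.
int count_stale_reads(int rounds) {
    int stale = 0;
    for (int i = 0; i < rounds; ++i) {
        int payload = 0;                 // plain, non-atomic data
        std::atomic<bool> ready{false};  // publication flag

        std::thread producer([&] {
            payload = 42;                                  // (1) write the data
            ready.store(true, std::memory_order_release);  // (2) publish it
        });
        std::thread consumer([&] {
            // Spin until the flag is visible; the acquire load pairs with
            // the release store above.
            while (!ready.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            // The payload write happens-before this read.
            if (payload != 42) {
                ++stale;
            }
        });

        producer.join();
        consumer.join();
    }
    return stale;
}
```

Calling `count_stale_reads(200)` from a small `main` should return 0 on any conforming implementation. Replacing both orderings with `std::memory_order_relaxed` (and making `payload` atomic to avoid a data race) would leave the reordering permitted on weakly ordered ARM, whereas TSO hardware would still preserve the store order; that gap is the cost the abstract sets out to quantify.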

Published In

Journal of Systems Architecture: the EUROMICRO Journal, Volume 149, Issue C
Apr 2024
121 pages

Publisher

Elsevier North-Holland, Inc.

United States

Author Tags

  1. TSO
  2. Memory ordering
  3. Apple M1
  4. ARM

Qualifiers

  • Research-article
