DOI: 10.1145/3409963.3410488
research-article

AVX overhead profiling: how much does your fast code slow you down?

Published: 24 August 2020 Publication History

Abstract

The AVX2 and AVX-512 instructions found in recent Intel CPUs can increase the performance of vectorized code. Their complexity and increased power consumption, however, cause the CPU to reduce its frequency. This frequency reduction can also affect parts of the workload that do not use AVX2 or AVX-512, with previous work reporting an overall slowdown of more than 10% for various workloads with AVX-512-enabled parts. Although countermeasures against this frequency reduction overhead exist, they themselves cause additional overhead and are therefore only viable if the gains are larger than the additional overhead.
It is, however, often not clear how much AVX2/AVX-512 frequency reduction overhead is present. In this paper, we describe a sampling profiler to determine the magnitude of the overhead as an aid during software development or during the selection of countermeasures. Our profiler temporarily stops individual CPU cores to let the cores recover their maximum (non-AVX) frequency. The profiler then observes whether the frequency is immediately reduced again once the workload is resumed to determine whether the previous frequency reduction was actually necessary. The resulting information is used to calculate the approximate AVX2/AVX-512 frequency reduction overhead. In the case of AVX-512, our prototype is able to estimate the overhead with an average error of 1.2 percentage points for various benchmarks. We describe potential improvements to our design, and we describe a novel hardware-software interface which would allow more accurate measurement of the overhead.
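
The sampling idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `Sample` record, the `estimate_overhead` function, and the assumption that performance scales linearly with core frequency are all hypothetical simplifications introduced here. Each sample records the frequency seen before the core was paused and whether the frequency dropped again right after the workload resumed. A reduced frequency that does not drop again is counted as unnecessary, and therefore as overhead.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    """One profiler sample (hypothetical record, not from the paper)."""
    freq_before_mhz: float  # core frequency observed before the core was paused
    dropped_again: bool     # True if the frequency fell again right after resuming


def estimate_overhead(samples: Sequence[Sample], max_freq_mhz: float) -> float:
    """Estimate the fraction of work lost to unnecessary frequency reduction.

    Assumption: performance is proportional to core frequency, which
    overestimates the loss for memory-bound code.
    """
    if not samples:
        return 0.0
    lost = 0.0
    for s in samples:
        was_reduced = s.freq_before_mhz < max_freq_mhz
        # If the core stayed at full speed after resuming, the code that was
        # running did not need the reduced frequency: count the slowdown.
        if was_reduced and not s.dropped_again:
            lost += 1.0 - s.freq_before_mhz / max_freq_mhz
    return lost / len(samples)


if __name__ == "__main__":
    samples = [
        Sample(2800.0, dropped_again=False),  # reduced, but not needed
        Sample(2800.0, dropped_again=True),   # reduced, and needed (AVX code)
        Sample(3500.0, dropped_again=False),  # already at full frequency
        Sample(3500.0, dropped_again=False),
    ]
    print(f"estimated overhead: {estimate_overhead(samples, 3500.0):.1%}")
```

In this toy input, one of four samples runs unnecessarily at 80% of the maximum frequency, so the estimate is roughly 5% of the total work. A real profiler would also have to account for the cost of pausing the core, which the sketch ignores.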

Cited By

  • (2023) Performance Portability Assessment: Non-negative Matrix Factorization as a Case Study. Euro-Par 2022: Parallel Processing Workshops, 239-250. DOI: 10.1007/978-3-031-31209-0_18. Online publication date: 2-May-2023
  • (2022) CRISP: critical slice prefetching. Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 300-313. DOI: 10.1145/3503222.3507745. Online publication date: 28-Feb-2022
  • (2022) FPGA-Extended General Purpose Computer Architecture. Applied Reconfigurable Computing. Architectures, Tools, and Applications, 87-102. DOI: 10.1007/978-3-031-19983-7_7. Online publication date: 19-Sep-2022
  • (2021) Asterope. Proceedings of the 11th Workshop on Programming Languages and Operating Systems, 9-16. DOI: 10.1145/3477113.3487264. Online publication date: 25-Oct-2021

Published In

APSys '20: Proceedings of the 11th ACM SIGOPS Asia-Pacific Workshop on Systems
August 2020
135 pages
ISBN:9781450380690
DOI:10.1145/3409963
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. AVX-512
  2. AVX2
  3. frequency scaling
  4. profiling

Qualifiers

  • Research-article

Conference

APSys '20

Acceptance Rates

Overall Acceptance Rate 169 of 430 submissions, 39%
