research-article

AVX overhead profiling: how much does your fast code slow you down?

Authors:

Mathias Gottschlag,

Frank Bellosa

APSys '20: Proceedings of the 11th ACM SIGOPS Asia-Pacific Workshop on Systems

Pages 59 - 66

https://doi.org/10.1145/3409963.3410488

Published: 24 August 2020

Abstract

The AVX2 and AVX-512 instructions found in recent Intel CPUs can increase the performance of vectorized code. Their complexity and increased power consumption, however, cause the CPU to reduce its frequency. This frequency reduction can affect parts of the workload which do not use AVX2 or AVX-512, with previous work reporting an overall slowdown of more than 10% for various workloads with AVX-512-enabled parts. Although countermeasures against this frequency reduction overhead exist, they themselves cause additional overhead and are therefore only viable if the gains are larger than the additional overhead.

It is, however, often not clear how much AVX2/AVX-512 frequency reduction overhead is present. In this paper, we describe a sampling profiler to determine the magnitude of the overhead as an aid during software development or during the selection of countermeasures. Our profiler temporarily stops individual CPU cores to let the cores recover their maximum (non-AVX) frequency. The profiler then observes whether the frequency is immediately reduced again once the workload is resumed to determine whether the previous frequency reduction was actually necessary. The resulting information is used to calculate the approximate AVX2/AVX-512 frequency reduction overhead. In the case of AVX-512, our prototype is able to estimate the overhead with an average error of 1.2 percentage points for various benchmarks. We describe potential improvements to our design, and we describe a novel hardware-software interface which would allow more accurate measurement of the overhead.
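The sampling idea above can be illustrated with a small sketch. This is not the authors' implementation: the `Sample` record, the `estimate_overhead` function, and the assumption that lost performance scales linearly with the frequency deficit are all hypothetical simplifications introduced here. Each sample records the frequency seen when the core was paused and whether the frequency dropped again right after the workload resumed. A sample taken at reduced frequency that does not drop again is counted as an unnecessary reduction.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Sample:
    """One profiler sample (hypothetical record layout)."""
    freq_before_mhz: float  # core frequency observed when the sample was taken
    dropped_again: bool     # did the frequency fall again right after resume?


def estimate_overhead(samples: Iterable[Sample], max_freq_mhz: float) -> float:
    """Estimate the fraction of cycles lost to unnecessary frequency reduction.

    Assumption (not from the paper): performance scales linearly with
    frequency, so a sample at frequency f loses (1 - f / max_freq_mhz).
    """
    samples = list(samples)
    if not samples:
        return 0.0
    lost = 0.0
    for s in samples:
        reduced = s.freq_before_mhz < max_freq_mhz
        # If the frequency stays high after resuming, the code running at
        # sample time did not itself require the reduced frequency.
        if reduced and not s.dropped_again:
            lost += 1.0 - s.freq_before_mhz / max_freq_mhz
    return lost / len(samples)


if __name__ == "__main__":
    demo = [
        Sample(2800.0, True),   # AVX code was running: reduction was needed
        Sample(2800.0, False),  # scalar code slowed down by an earlier AVX phase
        Sample(3500.0, False),  # core was already at full frequency
        Sample(2800.0, False),  # scalar code slowed down again
    ]
    print(f"estimated overhead: {estimate_overhead(demo, 3500.0):.1%}")
```

In this made-up example, two of four samples show an unnecessary reduction from 3500 MHz to 2800 MHz, so the sketch reports an overhead of roughly 10%. A real profiler would also have to account for the time the frequency takes to recover and for memory-bound code, which this sketch ignores.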

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Single instruction, multiple data
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software performance

Published In

APSys '20: Proceedings of the 11th ACM SIGOPS Asia-Pacific Workshop on Systems

August 2020

135 pages

ISBN:9781450380690

DOI:10.1145/3409963

Program Chairs:
Taesoo Kim,
Patrick P. C. Lee

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

APSys '20: 11th ACM SIGOPS Asia-Pacific Workshop on Systems

August 24 - 25, 2020

Tsukuba, Japan

Acceptance Rates

Overall Acceptance Rate 169 of 430 submissions, 39%
