VClinic: A Portable and Efficient Framework for Fine-Grained Value Profilers

Authors:

Depei Qian

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2

Pages 892 - 904

https://doi.org/10.1145/3575693.3576934

Published: 30 January 2023

Abstract

Fine-grained value profilers offer a promising way to accurately detect value-related software inefficiencies using binary instrumentation. Because binary instrumentation depends on architecture-specific implementation details, existing value profilers suffer from poor portability and require substantial engineering effort to remain efficient across platforms. In this paper, we propose VClinic, a portable and efficient fine-grained value profiling framework for analyzing highly optimized binaries on both X86 and ARM platforms. VClinic exploits operand-centric two-level designs in its implementation to provide the common building blocks required by value profilers. By constructing four representative value profilers with VClinic, we demonstrate that it eases the development of value profilers that are portable and efficient across platforms. Guided by the value profilers built upon VClinic, we achieve speedups of up to 89.94% and 74.66% for real-world programs on X86 and ARM platforms, respectively.

Index Terms

VClinic: A Portable and Efficient Framework for Fine-Grained Value Profilers
1. General and reference
  1. Cross-computing tools and techniques
    1. Design
    2. Performance
2. Software and its engineering
  1. Software creation and management
    1. Software development techniques
  2. Software notations and tools
    1. Development frameworks and environments

Published In

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2

January 2023

947 pages

ISBN:9781450399166

DOI:10.1145/3575693

General Chair: Tor M. Aamodt, University of British Columbia, Canada

Program Chairs: Natalie Enright Jerger, University of Toronto, Canada; Michael Swift, University of Wisconsin-Madison, USA

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ASPLOS '23: 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2

March 25 - 29, 2023

Vancouver, BC, Canada

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%

Bibliometrics

Article Metrics

Total Citations: 4
Total Downloads: 464
Downloads (last 12 months): 117
Downloads (last 6 weeks): 7

Reflects downloads up to 23 Feb 2025

Citations

Cited By

Xuan Z, You X, Yang H, Li M, Luan Z, Liu Y, Qian D. (2024). Retrospection on the Performance Analysis Tools for Large-Scale HPC Programs. 2024 IEEE 31st International Conference on High Performance Computing, Data, and Analytics (HiPC), 34-44. Online publication date: 18-Dec-2024.
https://doi.org/10.1109/HiPC62374.2024.00013

Cui J, Zhao Q, Hao Y, Liu X, Grosser T, Dubach C, Steuwer M, Xue J, Ottoni G, Quintão Pereira F. (2024). DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization, 245-257. Online publication date: 2-Mar-2024.
https://dl.acm.org/doi/10.1109/CGO57630.2024.10444862

You X, Yang H, Lei K, Luan Z, Qian D, Mohror K, Arnold D, Badia R. (2023). TrivialSpy: Identifying Software Triviality via Fine-grained and Dataflow-based Value Profiling. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. Online publication date: 12-Nov-2023.
https://dl.acm.org/doi/10.1145/3581784.3607052

Lei K, Du S, You X, Xuan Z, Kong H, Yang H, Shang J, Xiao Z, Wu Z, Luan Z, Qian D. (2023). Accelerating Big Data Application by Eliminating Redundancy on Hadoop Cluster. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 751-756. Online publication date: 17-Dec-2023.
https://doi.org/10.1109/ICPADS60453.2023.00114
