DOI: 10.1145/3575693.3576934
research-article

VClinic: A Portable and Efficient Framework for Fine-Grained Value Profilers

Published: 30 January 2023

Abstract

Fine-grained value profilers offer a promising way to accurately detect value-related software inefficiencies through binary instrumentation. Because binary instrumentation depends on architecture-specific implementation details, existing value profilers suffer from poor portability and require substantial engineering effort to remain efficient across platforms. In this paper, we propose VClinic, a portable and efficient fine-grained value profiling framework for analyzing highly optimized binaries on both X86 and ARM platforms. VClinic exploits operand-centric two-level designs in its implementation to provide the common building blocks that value profilers require. By constructing four representative value profilers with VClinic, we demonstrate that VClinic eases the development of value profilers that are portable and efficient across platforms. Guided by the value profilers built upon VClinic, we achieve speedups of up to 89.94% and 74.66% for real-world programs on X86 and ARM platforms, respectively.
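
To make the notion of a value-related inefficiency concrete, the following is a toy sketch only. It is not VClinic's implementation or API, and it does not instrument a binary. It assumes a pre-recorded trace of stores and flags "silent stores", that is, stores that write a value the target address already holds, which is one classic example of the kind of pattern a value profiler reports. The function name `count_silent_stores`, the trace layout `(site, address, value)`, and the sample site names are all hypothetical.

```python
from collections import Counter


def count_silent_stores(trace):
    """Count stores that write a value already held at the target address.

    `trace` is an iterable of (site, address, value) tuples. This layout is
    an assumption made for illustration; a real profiler would observe
    operands through binary instrumentation instead of a recorded list.
    """
    memory = {}         # last value observed at each address
    silent = Counter()  # silent-store count per store site
    total = Counter()   # total store count per store site
    for site, address, value in trace:
        total[site] += 1
        if address in memory and memory[address] == value:
            silent[site] += 1
        memory[address] = value
    # Report (silent stores, total stores) for every store site seen.
    return {site: (silent[site], total[site]) for site in total}


# Hypothetical trace: "reset" rewrites zeros that "init_loop" already wrote.
trace = [
    ("init_loop", 0x10, 0),
    ("init_loop", 0x18, 0),
    ("reset", 0x10, 0),
    ("reset", 0x18, 0),
    ("update", 0x10, 7),
]
report = count_silent_stores(trace)
```

In this sketch every store issued by `reset` is silent, so a developer reading the report could consider removing that redundant reset. Attributing the counts to a store site rather than to an address mirrors how such findings are usually acted on: the fix is made in code, not in data.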


Cited By

  • DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications. Proceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization (2024), 245–257. DOI: 10.1109/CGO57630.2024.10444862. Online publication date: 2-Mar-2024
  • TrivialSpy: Identifying Software Triviality via Fine-grained and Dataflow-based Value Profiling. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (2023), 1–13. DOI: 10.1145/3581784.3607052. Online publication date: 12-Nov-2023
  • Accelerating Big Data Application by Eliminating Redundancy on Hadoop Cluster. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS) (2023), 751–756. DOI: 10.1109/ICPADS60453.2023.00114. Online publication date: 17-Dec-2023

Published In

ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
January 2023
947 pages
ISBN: 9781450399166
DOI: 10.1145/3575693
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Dynamic Binary Instrumentation
  2. Performance Analysis
  3. Value Profiler

Qualifiers

  • Research-article

Conference

ASPLOS '23

Acceptance Rates

Overall Acceptance Rate 535 of 2,713 submissions, 20%
