DOI: 10.1145/3575693.3576934
research-article

VClinic: A Portable and Efficient Framework for Fine-Grained Value Profilers

Published: 30 January 2023

    Abstract

    Fine-grained value profilers offer a promising way to accurately detect value-related software inefficiencies using binary instrumentation. Because binary instrumentation depends on architecture-specific implementation details, existing value profilers suffer from poor portability and require substantial engineering effort to remain efficient across platforms. In this paper, we propose VClinic, a portable and efficient fine-grained value profiling framework for analyzing highly optimized binaries on both x86 and ARM platforms. VClinic exploits operand-centric two-level designs in its implementation to provide the common building blocks that value profilers require. By constructing four representative value profilers with VClinic, we demonstrate that it eases the development of value profilers that are portable and efficient across platforms. Guided by the value profilers built on VClinic, we achieve speedups of up to 89.94% and 74.66% for real-world programs on x86 and ARM platforms, respectively.

    Cited By

    • (2024) DrPy: Pinpointing Inefficient Memory Usage in Multi-Layer Python Applications. 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 245–257. DOI: 10.1109/CGO57630.2024.10444862. Online publication date: 2-Mar-2024.
    • (2023) Accelerating Big Data Application by Eliminating Redundancy on Hadoop Cluster. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 751–756. DOI: 10.1109/ICPADS60453.2023.00114. Online publication date: 17-Dec-2023.

    Published In

    ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
    January 2023
    947 pages
    ISBN: 9781450399166
    DOI: 10.1145/3575693

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Dynamic Binary Instrumentation
    2. Performance Analysis
    3. Value Profiler

    Qualifiers

    • Research-article

    Conference

    ASPLOS '23

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%
