research-article

TrivialSpy: Identifying Software Triviality via Fine-grained and Dataflow-based Value Profiling

Authors:

Depei Qian

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Article No.: 90, Pages 1 - 13

https://doi.org/10.1145/3581784.3607052

Published: 11 November 2023

Abstract

Trivial operations cause software inefficiencies: they waste functional units and memory bandwidth on executing useless instructions. Although previous works have identified a significant number of trivial operations in widely used programs, the proposed solutions only provide useful observations rather than actionable guidance for eliminating trivial operations to improve performance. In this paper, we propose TrivialSpy, a fine-grained and dataflow-based value profiler that identifies software triviality and estimates its optimization potential. With the help of dataflow analysis, TrivialSpy can detect three kinds of software triviality: heavy operations, trivial chains, and redundant backward slices. In addition, TrivialSpy can identify trivial breakpoints that combine multiple trivial conditions to expose more optimization opportunities. The evaluation results demonstrate that TrivialSpy is capable of identifying software triviality in highly optimized programs. Based on the optimization guidance provided by TrivialSpy, we achieve a speedup of up to 52.09% after eliminating trivial operations.
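
To make the notion of a trivial operation concrete, the following is a minimal sketch of value-based triviality counting. It is not TrivialSpy's implementation: the trace format, the names `is_trivial` and `profile`, and the rule set (operands that are identity or absorbing elements, such as `x * 0` or `x + 0`) are assumptions chosen for illustration only, and the sketch does not model the dataflow analysis described in the abstract.

```python
from collections import Counter

# Hypothetical trace record: (opcode, left operand value, right operand value).
# A real value profiler would observe these values through binary instrumentation.


def is_trivial(op, a, b):
    """Return True when the result follows directly from one operand value."""
    if op == "add":
        return a == 0 or b == 0                # x + 0 == x
    if op == "sub":
        return b == 0                          # x - 0 == x
    if op == "mul":
        return a in (0, 1) or b in (0, 1)      # x * 0 == 0, x * 1 == x
    if op == "div":
        return (a == 0 and b != 0) or b == 1   # 0 / x == 0, x / 1 == x
    return False


def profile(trace):
    """Return the fraction of trivial executions per opcode."""
    total = Counter()
    trivial = Counter()
    for op, a, b in trace:
        total[op] += 1
        if is_trivial(op, a, b):
            trivial[op] += 1
    return {op: trivial[op] / total[op] for op in total}


if __name__ == "__main__":
    trace = [
        ("mul", 0.0, 3.5),   # trivial: multiply by zero
        ("mul", 2.0, 3.0),   # not trivial
        ("add", 1.0, 0.0),   # trivial: add zero
        ("div", 4.0, 2.0),   # not trivial
    ]
    print(profile(trace))
```

A high trivial fraction for an opcode at a given program location would suggest guarding the operation with a cheap check, which is the kind of optimization the abstract refers to.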

Published In

SC '23: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

November 2023, 1428 pages

ISBN: 9798400701092

DOI: 10.1145/3581784

Chair: Dorian Arnold
Program Chair: Rosa M Badia
Program Co-chair: Kathryn Mohror

Copyright © 2023 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SC '23: International Conference for High Performance Computing, Networking, Storage and Analysis

Sponsor: SIGHPC

November 12 - 17, 2023

Denver, CO, USA

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
