Analysis and Optimization of Task Granularity on the Java Virtual Machine

Published: 16 July 2019

Abstract

Task granularity, i.e., the amount of work performed by parallel tasks, is a key performance attribute of parallel applications. On the one hand, fine-grained tasks (i.e., small tasks carrying out few computations) may introduce considerable parallelization overheads. On the other hand, coarse-grained tasks (i.e., large tasks performing substantial computations) may not fully utilize the available CPU cores, leading to missed parallelization opportunities. In this article, we provide a better understanding of task granularity for task-parallel applications running on a single Java Virtual Machine in a shared-memory multicore. We present a new methodology to accurately and efficiently collect the granularity of each executed task, implemented in a novel profiler (available open-source) that collects carefully selected metrics from the whole system stack with low overhead, and helps developers locate performance and scalability problems. We analyze task granularity in the DaCapo, ScalaBench, and Spark Perf benchmark suites, revealing inefficiencies related to fine-grained and coarse-grained tasks in several applications. We demonstrate that the collected task-granularity profiles are actionable by optimizing task granularity in several applications, achieving speedups of up to 5.90×. Our results highlight the importance of analyzing and optimizing task granularity on the Java Virtual Machine.
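To make the trade-off described in the abstract concrete, the following is a minimal sketch (not the authors' profiler or methodology) of how a cut-off parameter controls task granularity in a fork/join computation. The class name `GranularitySketch`, the `cutoff` parameter, and the task counter are hypothetical names introduced only for illustration; the counter is a crude stand-in for parallelization overhead, not one of the metrics collected in the article.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;
import java.util.concurrent.atomic.AtomicLong;

public class GranularitySketch {
    // Hypothetical counter: number of task objects created by the last run.
    private static final AtomicLong TASKS = new AtomicLong();

    // Sums data[lo, hi) and splits the range in half while it exceeds the cut-off.
    static final class SumTask extends RecursiveTask<Long> {
        private final long[] data;
        private final int lo, hi, cutoff;

        SumTask(long[] data, int lo, int hi, int cutoff) {
            this.data = data;
            this.lo = lo;
            this.hi = hi;
            this.cutoff = cutoff;
            TASKS.incrementAndGet();
        }

        @Override
        protected Long compute() {
            if (hi - lo <= cutoff) {
                // Leaf task: the cut-off bounds how much work one task performs.
                long sum = 0;
                for (int i = lo; i < hi; i++) {
                    sum += data[i];
                }
                return sum;
            }
            int mid = lo + (hi - lo) / 2;
            SumTask left = new SumTask(data, lo, mid, cutoff);
            SumTask right = new SumTask(data, mid, hi, cutoff);
            left.fork();                          // run the left half asynchronously
            return right.compute() + left.join(); // compute the right half in this thread
        }
    }

    // A small cut-off yields many fine-grained tasks; a large one yields few coarse-grained tasks.
    static long parallelSum(long[] data, int cutoff) {
        TASKS.set(0);
        ForkJoinPool pool = new ForkJoinPool();
        try {
            // Clamp the cut-off to 1 so that splitting always terminates.
            return pool.invoke(new SumTask(data, 0, data.length, Math.max(1, cutoff)));
        } finally {
            pool.shutdown();
        }
    }

    static long tasksCreated() {
        return TASKS.get();
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 20];
        Arrays.fill(data, 1L);
        for (int cutoff : new int[] {1, 1 << 10, 1 << 20}) {
            long sum = parallelSum(data, cutoff);
            System.out.println("cutoff=" + cutoff + " sum=" + sum + " tasks=" + tasksCreated());
        }
    }
}
```

In this sketch, a cut-off of 1 creates one task per array element, so task creation and scheduling dominate the useful work, while a cut-off equal to the array length creates a single task that cannot use more than one core. Intermediate values sit between the two extremes named in the abstract.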

Published In

ACM Transactions on Programming Languages and Systems, Volume 41, Issue 3
September 2019
278 pages
ISSN:0164-0925
EISSN:1558-4593
DOI:10.1145/3343145
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 16 July 2019
Accepted: 01 May 2019
Revised: 01 March 2019
Received: 01 August 2018
Published in TOPLAS Volume 41, Issue 3

Author Tags

  1. Java virtual machine
  2. Task granularity
  3. actionable profiles
  4. performance analysis and optimization
  5. task parallelism
  6. vertical profiling

Qualifiers

  • Research-article
  • Research
  • Refereed
