research-article
Open access

Boosting the Priority of Garbage: Scheduling Collection on Heterogeneous Multicore Processors

Published: 07 March 2016

Abstract

While hardware is evolving toward heterogeneous multicore architectures, modern software applications are increasingly written in managed languages. Heterogeneity was born of a need to improve energy efficiency; however, we want the performance of our applications not to suffer from limited resources. How best to schedule managed language applications on a mix of big, out-of-order cores and small, in-order cores is an open question, complicated by the host of service threads that perform key tasks such as memory management. These service threads compete with the application for core and memory resources, and garbage collection (GC) must sometimes suspend the application if there is not enough memory available for allocation.
In this article, we explore concurrent garbage collection’s behavior, particularly when it becomes critical, and how to schedule it on a heterogeneous system to optimize application performance. While some applications see no difference in performance when GC threads are run on big versus small cores, others—those with GC criticality—see up to an 18% performance improvement. We develop a new, adaptive scheduling algorithm that responds to GC criticality signals from the managed runtime, giving more big-core cycles to the concurrent collector when it is under pressure and in danger of suspending the application. Our experimental results show that our GC-criticality-aware scheduler is robust across a range of heterogeneous architectures with different core counts and frequency scaling and across heap sizes. Our algorithm is performance and energy neutral for GC-uncritical Java applications and significantly speeds up GC-critical applications by 16%, on average, while being 20% more energy efficient for a heterogeneous multicore with three big cores and one small core.
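
The abstract describes the scheduler only at a high level, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes a hypothetical criticality signal (`GcSignal`) derived from the free-heap fraction, and hypothetical thresholds `low_water` and `high_water`; none of these names or values come from the article. The policy gives the concurrent collector no big-core time while memory is plentiful and ramps up to a full quantum as the heap approaches exhaustion.

```python
from dataclasses import dataclass


@dataclass
class GcSignal:
    """Hypothetical criticality signal reported by the managed runtime."""
    free_heap_fraction: float  # fraction of the heap still available for allocation
    collector_running: bool    # True while a concurrent collection cycle is active


def gc_big_core_share(signal, low_water=0.10, high_water=0.30):
    """Return the fraction of a scheduling quantum the GC thread spends on a big core.

    Illustrative policy (an assumption, not taken from the article):
    - collector idle, or free heap at or above high_water: 0.0 (small core only)
    - free heap at or below low_water: 1.0 (big core for the whole quantum)
    - in between: a linear ramp between the two thresholds
    """
    if not signal.collector_running:
        return 0.0
    if signal.free_heap_fraction >= high_water:
        return 0.0
    if signal.free_heap_fraction <= low_water:
        return 1.0
    return (high_water - signal.free_heap_fraction) / (high_water - low_water)


def place_gc_thread(signal, quantum_ms=10.0):
    """Split one scheduling quantum for the GC thread between core types."""
    big_ms = quantum_ms * gc_big_core_share(signal)
    return {"big_core_ms": big_ms, "small_core_ms": quantum_ms - big_ms}


if __name__ == "__main__":
    for free in (0.50, 0.20, 0.05):
        print(free, place_gc_thread(GcSignal(free, collector_running=True)))
```

A graded ramp is used here instead of a single on/off threshold so that the collector's big-core share grows with pressure; a real scheduler would also have to decide which application thread gives up the big core, which this sketch leaves out.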

Cited By

  • (2023) Analyzing and Improving the Scalability of In-Memory Indices for Managed Search Engines. Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management, 15-29. DOI: 10.1145/3591195.3595272. Online publication date: 6-Jun-2023
  • (2020) CASH: correlation-aware scheduling to mitigate soft error impact on heterogeneous multicores. Connection Science, 1-23. DOI: 10.1080/09540091.2020.1758924. Online publication date: 18-May-2020
  • (2019) To expose, or not to expose, hardware heterogeneity to runtimes. Companion Proceedings of the 3rd International Conference on the Art, Science, and Engineering of Programming, 1-2. DOI: 10.1145/3328433.3328442. Online publication date: 1-Apr-2019

Reviews

David E. Goldfarb

This well-written paper presents a new algorithm for garbage collection of a Java program running across a heterogeneous set of cores (where some cores execute code significantly faster than others). While it presents a new algorithm, it is written in an accessible manner with several pages of clear introduction building up to the key points. Although I have not been actively involved in garbage collection (GC) development for many years, I found it very easy to follow the text.

The key idea, obvious once stated, is that it is more efficient to run the GC on a fast core when it is at risk of being outpaced by the mutator and on a slow core otherwise. Doing this optimally requires dynamic allocation of the GC thread to the different cores at different times. The authors present a GC algorithm based on these ideas and claim three to 20 percent improvements in time and energy consumption over existing algorithms. Their analysis appears to be sound and detailed. The paper is 24 pages long, including about 40 references mostly dating from 2000 to 2013.

Online Computing Reviews Service

Published In

ACM Transactions on Architecture and Code Optimization, Volume 13, Issue 1
April 2016
347 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2899032
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 07 March 2016
Accepted: 01 January 2016
Revised: 01 October 2015
Received: 01 August 2015
Published in TACO Volume 13, Issue 1

Author Tags

  1. GC criticality
  2. managed runtimes
  3. automatic memory management
  4. concurrent collection
  5. energy efficiency
  6. heterogeneous multicore architectures

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013)/ERC
  • EU FP7 Adept Project

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 59
  • Downloads (last 6 weeks): 14

Reflects downloads up to 04 Oct 2024

Citations

Cited By

  • (2019) Learning when to garbage collect with random forests. Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management, 53-63. DOI: 10.1145/3315573.3329983. Online publication date: 23-Jun-2019
  • (2019) Composite-ISA Cores: Enabling Multi-ISA Heterogeneity Using a Single ISA. 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), 42-55. DOI: 10.1109/HPCA.2019.00026. Online publication date: Feb-2019
  • (2017) Managed Language Runtimes on Heterogeneous Hardware. Companion Proceedings of the 1st International Conference on the Art, Science, and Engineering of Programming, 1-2. DOI: 10.1145/3079368.3079397. Online publication date: 3-Apr-2017
  • (2017) DEP+BURST. IEEE Transactions on Computers 66:4, 601-615. DOI: 10.1109/TC.2016.2609903. Online publication date: 1-Apr-2017
  • (2017) Analyzing the scalability of managed language applications with speedup stacks. 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 23-32. DOI: 10.1109/ISPASS.2017.7975267. Online publication date: Apr-2017
  • (2016) DVFS performance prediction for managed multithreaded applications. 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 12-23. DOI: 10.1109/ISPASS.2016.7482070. Online publication date: Apr-2016
