DOI:10.1145/1815961.1815996

Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications

Published: 19 June 2010

Abstract

Extracting performance from modern parallel architectures requires that applications be divided into many different threads of execution. Unfortunately, selecting the appropriate number of threads for an application is a daunting task. Having too many threads can quickly saturate shared resources, such as cache capacity or memory bandwidth, thus degrading performance. On the other hand, having too few threads makes inefficient use of the available resources. Beyond static resource assignment, the program inputs and the dynamic system state (e.g., what other applications are executing in the system) can have a significant impact on the right number of threads to use for a particular application.
To address this problem, we present the Thread Tailor, a dynamic system that automatically adjusts the number of threads in an application to optimize system efficiency. The Thread Tailor leverages offline analysis to estimate what types of threads will exist at runtime and the communication patterns between them. Using this information, the Thread Tailor dynamically combines threads to better suit the needs of the target system. The Thread Tailor adjusts not only to the architecture but also to other applications in the system, and this paper demonstrates that this type of adjustment can lead to significantly better use of thread-level parallelism in real-world architectures.
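The idea of combining threads according to their communication patterns can be illustrated with a small sketch. The abstract does not specify the combining algorithm, so the code below is only an assumption for illustration: a simple greedy heuristic that repeatedly merges the two thread groups with the most estimated communication between them until a target number of groups remains. The function name `combine_threads`, the `comm` dictionary layout, and the traffic units are all hypothetical and are not taken from the paper.

```python
from itertools import combinations


def combine_threads(comm, num_groups):
    """Greedily merge thread groups until only num_groups remain.

    comm maps frozenset({a, b}) of thread ids to an estimated
    communication volume between threads a and b (hypothetical units).
    This is an illustrative heuristic, not the paper's algorithm.
    """
    # Collect every thread id that appears in the communication estimates.
    threads = set()
    for pair in comm:
        threads |= pair
    # Start with one group per thread.
    groups = [frozenset([t]) for t in sorted(threads)]

    def traffic(g1, g2):
        # Total estimated communication crossing the two groups.
        return sum(comm.get(frozenset((a, b)), 0) for a in g1 for b in g2)

    # Never try to merge below a single group.
    while len(groups) > max(num_groups, 1):
        # Pick the pair of groups that communicate the most and merge them.
        g1, g2 = max(combinations(groups, 2), key=lambda p: traffic(*p))
        groups = [g for g in groups if g not in (g1, g2)] + [g1 | g2]
    return groups


# Four threads: 0 and 1 communicate heavily, as do 2 and 3.
example_comm = {
    frozenset({0, 1}): 10,
    frozenset({2, 3}): 8,
    frozenset({1, 2}): 1,
}
example_groups = combine_threads(example_comm, 2)
```

In this example the heavily communicating pairs end up in the same group, so the combined threads would share data locally instead of across cores. A real system would also have to weigh load balance and the current system state, which this sketch ignores.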


Published In

ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture
June 2010
520 pages
ISBN:9781450300537
DOI:10.1145/1815961

ACM SIGARCH Computer Architecture News, Volume 38, Issue 3
ISCA '10
June 2010
508 pages
ISSN:0163-5964
DOI:10.1145/1816038

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dynamic compilation
  2. managed parallelism
  3. threading

Qualifiers

  • Research-article

Conference

ISCA '10
Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%
