
Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications

Authors:

Madhumitha Ravichandran,

Nathan Clark

ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture

Pages 270 - 279

https://doi.org/10.1145/1815961.1815996

Published: 19 June 2010

Abstract

Extracting performance from modern parallel architectures requires that applications be divided into many different threads of execution. Unfortunately, selecting the appropriate number of threads for an application is a daunting task. Having too many threads can quickly saturate shared resources, such as cache capacity or memory bandwidth, thus degrading performance. On the other hand, having too few threads makes inefficient use of the resources available. Beyond static resource assignment, the program inputs and dynamic system state (e.g., what other applications are executing in the system) can have a significant impact on the right number of threads to use for a particular application.

To address this problem, we present the Thread Tailor, a dynamic system that automatically adjusts the number of threads in an application to optimize system efficiency. The Thread Tailor leverages offline analysis to estimate what types of threads will exist at runtime and the communication patterns between them. Using this information, Thread Tailor dynamically combines threads to better suit the needs of the target system. Thread Tailor adjusts not only to the architecture but also to other applications in the system, and this paper demonstrates that this type of adjustment can lead to significantly better use of thread-level parallelism in real-world architectures.
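
As a rough illustration of what "combining threads" based on communication patterns could look like, the sketch below greedily merges the pair of thread groups that exchange the most data until a target group count is reached. This is not the algorithm from the paper, which the abstract does not specify; the function name `combine_threads`, the `comm` dictionary format, and the greedy heuristic itself are all assumptions made for illustration.

```python
from itertools import combinations


def combine_threads(comm, num_threads, target_groups):
    """Greedily merge the two groups with the heaviest mutual traffic.

    comm: dict mapping frozenset({a, b}) -> estimated communication volume
          between threads a and b (hypothetical input format).
    Returns a list of sorted thread-id lists, one per combined thread.
    """
    groups = [{t} for t in range(num_threads)]

    def traffic(g1, g2):
        # Total estimated communication between two groups of threads.
        return sum(comm.get(frozenset((a, b)), 0) for a in g1 for b in g2)

    while len(groups) > max(target_groups, 1):
        # max() keeps the first maximal pair, so ties break deterministically.
        i, j = max(
            combinations(range(len(groups)), 2),
            key=lambda p: traffic(groups[p[0]], groups[p[1]]),
        )
        groups[i] |= groups[j]  # i < j, so deleting j leaves index i valid
        del groups[j]
    return [sorted(g) for g in groups]


# Hypothetical communication estimates for four threads.
comm = {frozenset((0, 1)): 10, frozenset((2, 3)): 8, frozenset((1, 2)): 1}
print(combine_threads(comm, num_threads=4, target_groups=2))
```

In this toy input, threads 0 and 1 communicate heavily, as do threads 2 and 3, so the sketch groups them pairwise. A real system would also need to weigh cache capacity, memory bandwidth, and the load from other applications, as the abstract describes.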



Index Terms

1. Computer systems organization
  1. Architectures
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Dynamic compilers
      2. Runtime environments
  2. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        1. Memory management

Published In

ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture

June 2010

520 pages

ISBN:9781450300537

DOI:10.1145/1815961

General Chair: André Seznec (INRIA Rennes)

Program Chairs: Uri Weiser (Technion), Ronny Ronen (Intel)

ACM SIGARCH Computer Architecture News Volume 38, Issue 3
ISCA '10
June 2010
508 pages
ISSN:0163-5964
DOI:10.1145/1816038

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

In-Cooperation

IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ISCA '10: The 37th Annual International Symposium on Computer Architecture

June 19 - 23, 2010

Saint-Malo, France

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Article Metrics

Total Citations: 103
Total Downloads: 1,085
Downloads (Last 12 months): 35
Downloads (Last 6 weeks): 7

Reflects downloads up to 09 Nov 2024

Cited By

Moori M, Rocha H, Lorenzon A, Beck A (2024). Efficient Thread Tuning for Asymmetric Multicores. 2024 37th SBC/SBMicro/IEEE Symposium on Integrated Circuits and Systems Design (SBCCI), pages 1-5. Online publication date: 2-Sep-2024
https://doi.org/10.1109/SBCCI62366.2024.10703981
Schwarzrock J, Lorenzon A, de Souza S, Beck A (2024). Integration Framework for Online Thread Throttling with Thread and Page Mapping on NUMA Systems. 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 1189-1192. Online publication date: 27-May-2024
https://doi.org/10.1109/IPDPSW63119.2024.00202
Huang H, Zhao Y, Rao J, Wu S, Jin H, Wang D, Kun S, Pan L (2023). Adapt Burstable Containers to Variable CPU Resources. IEEE Transactions on Computers 72:3, pages 614-626. Online publication date: 1-Mar-2023
https://doi.org/10.1109/TC.2022.3174480
Moori M, Rocha H, Lorenzon A, Beck A (2023). Searching for the Ideal Number of Threads on Asymmetric Multiprocessors. 2023 XIII Brazilian Symposium on Computing Systems Engineering (SBESC), pages 1-6. Online publication date: 21-Nov-2023
https://doi.org/10.1109/SBESC60926.2023.10324167
Luan G, Pang P, Chen Q, Xue S, Song Z, Guo M (2022). Online Thread Auto-Tuning for Performance Improvement and Resource Saving. IEEE Transactions on Parallel and Distributed Systems 33:12, pages 3746-3759. Online publication date: 1-Dec-2022
https://doi.org/10.1109/TPDS.2022.3169410
Knorst T, Korol G, Jordan M, Vicenzi J, Lorenzon A, Rutzig M, Beck A (2022). On the benefits of Collaborative Thread Throttling and HLS-Versioning in CPU-FPGA Environments. 2022 35th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI), pages 1-6. Online publication date: 22-Aug-2022
https://doi.org/10.1109/SBCCI55532.2022.9893223
Huang H, Rao J, Wu S, Jin H, Jiang H, Che H, Wu X, Laure E, Markidis S, Verbanescu A, Lofstead G (2021). Towards Exploiting CPU Elasticity via Efficient Thread Oversubscription. Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, pages 215-226. Online publication date: 21-Jun-2021
https://dl.acm.org/doi/10.1145/3431379.3460641
Knorst T, Jordan M, Lorenzen A, Rutzig M, Schneider Beck A (2021). ETCG: Energy-Aware CPU Thread Throttling for CPU-GPU Collaborative Environments. 2021 34th SBC/SBMicro/IEEE/ACM Symposium on Integrated Circuits and Systems Design (SBCCI), pages 1-6. Online publication date: 23-Aug-2021
https://doi.org/10.1109/SBCCI53441.2021.9529986
Marques S, Medeiros T, Serpa M, Rossi F, Luizelli M, Navaux P, Beck A, Lorenzon A (2021). Optimizing Parallel Applications via Dynamic Concurrency Throttling and Turbo Boosting. 2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 153-160. Online publication date: Mar-2021
https://doi.org/10.1109/PDP52278.2021.00032
Marques S, Medeiros T, Rossi F, Luizelli M, Beck A, Lorenzon A (2021). Synergically Rebalancing Parallel Execution via DCT and Turbo Boosting. 2021 58th ACM/IEEE Design Automation Conference (DAC), pages 277-282. Online publication date: 5-Dec-2021
https://doi.org/10.1109/DAC18074.2021.9586201