Thread tailor: dynamically weaving threads together for efficient, adaptive parallel applications

J Lee, H Wu, M Ravichandran, N Clark - Proceedings of the 37th Annual International Symposium on Computer Architecture, 2010 - dl.acm.org
Extracting performance from modern parallel architectures requires that applications be divided into many different threads of execution. Unfortunately, selecting the appropriate number of threads for an application is a daunting task. Having too many threads can quickly saturate shared resources, such as cache capacity or memory bandwidth, thus degrading performance. On the other hand, having too few threads makes inefficient use of the resources available. Beyond static resource assignment, the program inputs and dynamic system state (e.g., what other applications are executing in the system) can have a significant impact on the right number of threads to use for a particular application.
To address this problem, we present the Thread Tailor, a dynamic system that automatically adjusts the number of threads in an application to optimize system efficiency. The Thread Tailor leverages offline analysis to estimate what types of threads will exist at runtime and the communication patterns between them. Using this information, Thread Tailor dynamically combines threads to better suit the needs of the target system. Thread Tailor adjusts not only to the architecture, but also to other applications in the system, and this paper demonstrates that this type of adjustment can lead to significantly better use of thread-level parallelism in real-world architectures.
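The idea of combining threads based on their communication patterns can be illustrated with a minimal sketch. This is not the paper's algorithm: the greedy heuristic, the function name `combine_threads`, and the `comm` and `target` parameters are all assumptions made for illustration. It supposes that an offline analysis has already produced a communication volume for each pair of threads, and it repeatedly merges the two groups that communicate most until only `target` groups remain.

```python
from itertools import combinations


def combine_threads(threads, comm, target):
    """Greedily merge thread groups until only `target` groups remain.

    threads: iterable of thread ids (hypothetical).
    comm: dict mapping frozenset({a, b}) to an assumed communication
          volume between threads a and b; missing pairs count as 0.
    target: desired number of groups (e.g. available cores), at least 1.
    """
    # Start with every thread in its own group.
    groups = [frozenset([t]) for t in threads]

    def traffic(g1, g2):
        # Total communication volume crossing the two groups.
        return sum(comm.get(frozenset((a, b)), 0) for a in g1 for b in g2)

    while len(groups) > max(target, 1):
        # Pick the pair of groups with the heaviest mutual communication.
        g1, g2 = max(combinations(groups, 2), key=lambda p: traffic(*p))
        groups = [g for g in groups if g not in (g1, g2)] + [g1 | g2]
    return groups


# Example with made-up volumes: threads 0 and 1 talk heavily, as do 2 and 3.
example_comm = {
    frozenset((0, 1)): 10,
    frozenset((2, 3)): 8,
    frozenset((1, 2)): 1,
}
example_groups = combine_threads([0, 1, 2, 3], example_comm, target=2)
```

A real system would also need to weigh cache capacity, memory bandwidth, and load from other applications, as the abstract notes; the sketch considers only pairwise communication volume.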