Support for High-Frequency Streaming in CMPs

Authors: Neil Vachharajani, Guilherme Ottoni, David I. August, George Z. N. Cai

MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture

Pages 259 - 272

https://doi.org/10.1109/MICRO.2006.47

Published: 09 December 2006

Abstract

As the industry moves toward larger-scale chip multiprocessors, the need to parallelize applications grows. High inter-thread communication delays, exacerbated by over-stressed high-latency memory subsystems and ever-increasing wire delays, require parallelization techniques to create partially or fully independent threads to improve performance. Unfortunately, developers and compilers alike often fail to find sufficient independent work of this kind. Recently proposed pipelined streaming techniques have shown significant promise for both manual and automatic parallelization. These techniques have wide-scale applicability because they embrace inter-thread dependences (albeit acyclic dependences) and tolerate long-latency communication of these dependences. This paper addresses the lack of architectural support for this type of concurrency, which has blocked its adoption and hindered related language and compiler research. We observe that both manual and automatic techniques create high-frequency streaming threads, with communication occurring every 5 to 20 instructions. Even while easily tolerating inter-thread transit delays, high-frequency communication makes thread performance very sensitive to intra-thread delays from the repeated execution of the communication operations. Using this observation, we define the design space and evaluate several mechanisms to find a better trade-off between performance and operating system, hardware, and design costs. From this, we find a light-weight streaming-aware enhancement to conventional memory subsystems that doubles the speed of these codes and is within 2% of the best-performing, but heavy-weight, hardware solution.
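To make the pipelined streaming pattern in the abstract concrete, the following is a minimal software sketch, not the paper's mechanism. It assumes a two-stage pipeline in which values flow in one direction only (an acyclic dependence) through a bounded queue. The names `producer`, `consumer`, `run_pipeline`, `SENTINEL`, and the `depth` parameter are hypothetical, and Python's `queue.Queue` stands in for whatever communication support the hardware would provide.

```python
import queue
import threading

# Hypothetical end-of-stream marker; not part of the paper.
SENTINEL = object()


def producer(items, out_q):
    """Stage 1: do a small amount of work per item, then stream the result."""
    for item in items:
        out_q.put(item * 2)  # short computation followed by a "produce"
    out_q.put(SENTINEL)


def consumer(in_q, results):
    """Stage 2: consume values in order; nothing flows back to stage 1."""
    total = 0
    while True:
        value = in_q.get()  # "consume" blocks until a value is available
        if value is SENTINEL:
            break
        total += value
    results.append(total)


def run_pipeline(items, depth=32):
    """Run both stages concurrently, decoupled by a bounded queue."""
    q = queue.Queue(maxsize=depth)
    results = []
    stage1 = threading.Thread(target=producer, args=(items, q))
    stage2 = threading.Thread(target=consumer, args=(q, results))
    stage1.start()
    stage2.start()
    stage1.join()
    stage2.join()
    return results[0]
```

Because each stage performs a queue operation after only a little work, the cost of `put` and `get` themselves dominates in this sketch, which loosely mirrors the abstract's point that high-frequency communication makes threads sensitive to the overhead of the communication operations rather than to transit delay between stages.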

Index Terms

Support for High-Frequency Streaming in CMPs
1. Hardware

Published In

MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture

December 2006

493 pages

ISBN:0769527329

Sponsors

SIGMICRO: ACM Special Interest Group on Microarchitectural Research and Processing

Publisher

IEEE Computer Society

United States

Conference

Micro-39

Sponsor:

SIGMICRO

Micro-39: The 39th Annual IEEE/ACM International Symposium on Microarchitecture

December 9 - 13, 2006

Acceptance Rates

MICRO 39 Paper Acceptance Rate 42 of 174 submissions, 24%;

Overall Acceptance Rate 484 of 2,242 submissions, 22%
