DOI: 10.1109/MICRO.2006.47

Support for High-Frequency Streaming in CMPs

Published: 09 December 2006

Abstract

As the industry moves toward larger-scale chip multiprocessors, the need to parallelize applications grows. High inter-thread communication delays, exacerbated by over-stressed high-latency memory subsystems and ever-increasing wire delays, require parallelization techniques to create partially or fully independent threads to improve performance. Unfortunately, developers and compilers alike often fail to find sufficient independent work of this kind. Recently proposed pipelined streaming techniques have shown significant promise for both manual and automatic parallelization. These techniques have wide-scale applicability because they embrace inter-thread dependences (albeit acyclic dependences) and tolerate long-latency communication of these dependences. This paper addresses the lack of architectural support for this type of concurrency, which has blocked its adoption and hindered related language and compiler research. We observe that both manual and automatic techniques create high-frequency streaming threads, with communication occurring every 5 to 20 instructions. Even while easily tolerating inter-thread transit delays, high-frequency communication makes thread performance very sensitive to intra-thread delays from the repeated execution of the communication operations. Using this observation, we define the design space and evaluate several mechanisms to find a better trade-off between performance and operating system, hardware, and design costs. From this, we find a lightweight streaming-aware enhancement to conventional memory subsystems that doubles the speed of these codes and is within 2% of the best-performing, but heavyweight, hardware solution.
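
To make the communication pattern in the abstract concrete, the following is a minimal sketch of a two-stage pipelined streaming loop. It is an illustration only, not the mechanism the paper proposes: the function name `run_pipeline`, the `queue_depth` parameter, the end-of-stream sentinel, and the arithmetic in each stage are all assumptions, and Python's `queue.Queue` stands in for whatever software or hardware queue carries values between threads. Dependences flow only from stage 1 to stage 2 (acyclic), and each stage executes one communication operation per loop iteration, which is why the cost of that operation weighs so heavily when loop bodies are only a handful of instructions long.

```python
import queue
import threading

# Hypothetical end-of-stream marker; not taken from the paper.
SENTINEL = object()


def run_pipeline(items, queue_depth=32):
    """Run a two-stage pipeline over `items` and return stage 2's outputs.

    Stage 1 may run ahead of stage 2 by up to `queue_depth` values, so a
    long transit delay between the threads is tolerated. The per-iteration
    put/get calls model the repeated communication operations.
    """
    q = queue.Queue(maxsize=queue_depth)
    results = []

    def stage1():
        for x in items:
            q.put(x * 2)          # "produce": once per iteration
        q.put(SENTINEL)           # tell stage 2 the stream has ended

    def stage2():
        while True:
            v = q.get()           # "consume": once per iteration
            if v is SENTINEL:
                break
            results.append(v + 1)

    producer = threading.Thread(target=stage1)
    consumer = threading.Thread(target=stage2)
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results
```

Calling `run_pipeline(range(5))` streams five values through both stages in order. Shrinking `queue_depth` limits how far the producer can run ahead but does not change the result.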

Published In

MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture
December 2006
493 pages
ISBN: 0769527329

Publisher

IEEE Computer Society

United States

Qualifiers

  • Article

Conference

Micro-39
Acceptance Rates

MICRO 39 Paper Acceptance Rate 42 of 174 submissions, 24%;
Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

  • (2014) Accelerating sequential programs on commodity multi-core processors. Journal of Parallel and Distributed Computing 74:4 (2257-2265). DOI: 10.1016/j.jpdc.2013.12.009. Online publication date: 1-Apr-2014
  • (2013) An automatic thread decomposition approach for pipelined multithreading. International Journal of High Performance Computing and Networking 7:3 (227-237). DOI: 10.1504/IJHPCN.2013.056526. Online publication date: 1-Sep-2013
  • (2010) On-chip communication and synchronization mechanisms with cache-integrated network interfaces. Proceedings of the 7th ACM international conference on Computing frontiers (217-226). DOI: 10.1145/1787275.1787328. Online publication date: 17-May-2010
  • (2010) Scalable Speculative Parallelization on Commodity Clusters. Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (3-14). DOI: 10.1109/MICRO.2010.19. Online publication date: 4-Dec-2010
  • (2010) ReMAP. Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (497-508). DOI: 10.1109/MICRO.2010.15. Online publication date: 4-Dec-2010
  • (2009) Throughput-driven synthesis of embedded software for pipelined execution on multicore architectures. ACM Transactions on Embedded Computing Systems 8:2 (1-35). DOI: 10.1145/1457255.1457258. Online publication date: 9-Feb-2009
  • (2008) Performance scalability of decoupled software pipelining. ACM Transactions on Architecture and Code Optimization 5:2 (1-25). DOI: 10.1145/1400112.1400113. Online publication date: 3-Sep-2008
  • (2008) Communication optimizations for global multi-threaded instruction scheduling. ACM SIGPLAN Notices 43:3 (222-232). DOI: 10.1145/1353536.1346310. Online publication date: 1-Mar-2008
  • (2008) Communication optimizations for global multi-threaded instruction scheduling. ACM SIGOPS Operating Systems Review 42:2 (222-232). DOI: 10.1145/1353535.1346310. Online publication date: 1-Mar-2008
  • (2008) Communication optimizations for global multi-threaded instruction scheduling. ACM SIGARCH Computer Architecture News 36:1 (222-232). DOI: 10.1145/1353534.1346310. Online publication date: 1-Mar-2008
