DOI: 10.5555/977395.977665

Physical Experimentation with Prefetching Helper Threads on Intel's Hyper-Threaded Processors

Published: 20 March 2004

Abstract

Pre-execution techniques have received much attention as an effective way of prefetching cache blocks to tolerate the ever-increasing memory latency. A number of pre-execution techniques based on hardware, compiler, or both have been proposed and studied extensively by researchers. They report promising results on simulators that model a Simultaneous Multithreading (SMT) processor. In this paper, we apply the helper threading idea on a real multithreaded machine, i.e., an Intel Pentium 4 processor with Hyper-Threading Technology, and show that it can indeed provide wall-clock speedup on real silicon. To achieve further performance improvements via helper threads, we investigate three helper threading scenarios that are driven by automated compiler infrastructure, and identify several key challenges and opportunities for novel hardware and software optimizations. Our study shows that program behavior changes dynamically during execution. In addition, the organizations of certain critical hardware structures in the hyper-threaded processors are either shared or partitioned in the multi-threading mode and thus, the tradeoffs regarding resource contention can be intricate. Therefore, it is essential to judiciously invoke helper threads by adapting to the dynamic program behavior so that we can alleviate potential performance degradation due to resource contention. Moreover, since adapting to the dynamic behavior requires frequent thread synchronization, having light-weight thread synchronization mechanisms is important.
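
The idea of invoking helper threads only when dynamic program behavior warrants it can be illustrated with a small sketch. This is not the mechanism used in the paper, which the abstract does not describe. It is a hypothetical hysteresis gate: the function names, the per-interval cache-miss samples, and both thresholds are assumptions made up for illustration. The helper is switched on when the sampled miss count is high, switched off when it is low, and each switch is counted as one thread synchronization event.

```python
from typing import List


def plan_helper_invocations(miss_samples: List[int],
                            on_threshold: int,
                            off_threshold: int) -> List[bool]:
    """Decide, after each sampling interval, whether the helper thread runs.

    Hypothetical policy: miss_samples holds one cache-miss count per
    interval. The helper is activated when a sample reaches on_threshold
    and deactivated when a sample falls to off_threshold or below. The gap
    between the two thresholds avoids toggling on small fluctuations.
    """
    if off_threshold > on_threshold:
        raise ValueError("off_threshold must not exceed on_threshold")

    active = False  # assume the helper thread starts out suspended
    decisions = []
    for misses in miss_samples:
        if not active and misses >= on_threshold:
            active = True   # memory-bound phase: prefetching may pay off
        elif active and misses <= off_threshold:
            active = False  # few misses: helper would only add contention
        decisions.append(active)
    return decisions


def count_sync_events(decisions: List[bool]) -> int:
    """Count activate/suspend transitions, starting from a suspended helper.

    Each transition stands for one synchronization between the main thread
    and the helper thread, which is the cost that makes light-weight
    synchronization matter.
    """
    events = 0
    previous = False
    for current in decisions:
        if current != previous:
            events += 1
        previous = current
    return events
```

For example, calling `plan_helper_invocations([10, 900, 800, 300, 50, 40, 950], 500, 100)` keeps the helper active through the interval with 300 misses, because that sample lies between the two thresholds, and `count_sync_events` on the result reports how many times the helper had to be woken or suspended. A narrower threshold gap would track program phases more closely at the price of more synchronization events.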


Published In

CGO '04: Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization
March 2004
301 pages
ISBN: 0769521029

Publisher

IEEE Computer Society

United States

Qualifiers

  • Article

Conference

CGO04

Acceptance Rates

CGO '04 Paper Acceptance Rate 25 of 79 submissions, 32%;
Overall Acceptance Rate 312 of 1,061 submissions, 29%
