Memory Ordering: A Value-Based Approach

Authors:

Harold W. Cain,

Mikko H. Lipasti

ACM SIGARCH Computer Architecture News, Volume 32, Issue 2

Page 90

https://doi.org/10.1145/1028176.1006709

Published: 02 March 2004

Abstract

Conventional out-of-order processors employ a multi-ported, fully-associative load queue to guarantee correct memory reference order both within a single thread of execution and across threads in a multiprocessor system. As improvements in process technology and pipelining lead to higher clock frequencies, scaling this complex structure to accommodate a larger number of in-flight loads becomes difficult if not impossible. Furthermore, each access to this complex structure consumes excessive amounts of energy. In this paper, we solve the associative load queue scalability problem by completely eliminating the associative load queue. Instead, data dependences and memory consistency constraints are enforced by simply re-executing load instructions in program order prior to retirement. Using heuristics to filter the set of loads that must be re-executed, we show that our replay-based mechanism enables a simple, scalable, and energy-efficient FIFO load queue design with no associative lookup functionality, while sacrificing only a negligible amount of performance and cache bandwidth.
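
The replay check described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' design: the names `LoadEntry`, `needs_replay`, and `retire_loads` are hypothetical, memory is modeled as a plain dictionary, and the filtering heuristic is reduced to a precomputed flag because the abstract does not say how the filters work.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List, Optional, Tuple


@dataclass
class LoadEntry:
    """One FIFO load-queue entry (field names are illustrative assumptions)."""
    address: int
    premature_value: int  # value the load returned when it first executed
    needs_replay: bool    # outcome of a filtering heuristic (hypothetical flag)


def retire_loads(
    load_queue: Deque[LoadEntry], memory: Dict[int, int]
) -> Tuple[List[LoadEntry], Optional[LoadEntry]]:
    """Drain the queue head-first, i.e. in program order.

    Returns the loads that retired and the first load whose replayed value
    differs from its premature value (None if every load retired).
    """
    retired: List[LoadEntry] = []
    while load_queue:
        entry = load_queue.popleft()  # FIFO access only, no address search
        if entry.needs_replay:
            replayed_value = memory[entry.address]  # re-execute the load
            if replayed_value != entry.premature_value:
                return retired, entry  # mismatch: squash from this load on
        retired.append(entry)
    return retired, None


# Toy state: the load of 0x30 saw 7 early, but memory now holds 8.
memory = {0x10: 1, 0x20: 2, 0x30: 8, 0x40: 4}
queue = deque([
    LoadEntry(0x10, 1, True),
    LoadEntry(0x20, 2, False),  # filtered out: retires without replay
    LoadEntry(0x30, 7, True),   # replay detects the stale value
    LoadEntry(0x40, 4, True),
])
retired, violation = retire_loads(queue, memory)
print(len(retired), hex(violation.address), len(queue))
```

The sketch compares values rather than addresses, so the queue is only ever accessed at its head. That is the property that lets the structure remain a FIFO with no associative lookup.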

Published In

ACM SIGARCH Computer Architecture News Volume 32, Issue 2

ISCA 2004

March 2004

373 pages

ISSN: 0163-5964

DOI: 10.1145/1028176

ISCA '04: Proceedings of the 31st annual international symposium on Computer architecture
June 2004
373 pages
ISBN: 0769521436

Copyright © 2004 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States

Bibliometrics

Article Metrics

Total Citations: 68
Total Downloads: 814
Downloads (Last 12 months): 30
Downloads (Last 6 weeks): 2

Reflects downloads up to 08 Feb 2025

Cited By

Zhang A, Goens A, Oswald N, Grosser T, Sorin D, Nagarajan V (2024). PipeGen: Automated Transformation of a Single-Core Pipeline into a Multicore Pipeline for a Given Memory Consistency Model. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques (1-13). DOI: 10.1145/3656019.3676889. Online publication date: 14-Oct-2024
https://dl.acm.org/doi/10.1145/3656019.3676889
Ando H, Shioya R (2016). Performance of Dynamic Instruction Window Resizing for a Given Power Budget under DVFS Control. IEICE Transactions on Information and Systems E99.D:2 (341-350). DOI: 10.1587/transinf.2015EDP7325. Online publication date: 2016
https://doi.org/10.1587/transinf.2015EDP7325
Michaud P, Mondelli A, Seznec A (2015). Revisiting Clustered Microarchitecture for Future Superscalar Cores. ACM Transactions on Architecture and Code Optimization 12:3 (1-22). DOI: 10.1145/2800787. Online publication date: 31-Aug-2015
https://dl.acm.org/doi/10.1145/2800787
Yi Ma, Hongliang Gao, Dimitrov M, Huiyang Zhou (2015). Submitted to IEEE Transactions on Parallel and Distributed Systems Special Issue on CMP Architectures. IEEE Transactions on Parallel and Distributed Systems (1-1). DOI: 10.1109/TPDS.2007.1080. Online publication date: 2015
https://doi.org/10.1109/TPDS.2007.1080
Hechtman B, Sorin D (2013). Exploring memory consistency for massively-threaded throughput-oriented processors. ACM SIGARCH Computer Architecture News 41:3 (201-212). DOI: 10.1145/2508148.2485940. Online publication date: 23-Jun-2013
https://dl.acm.org/doi/10.1145/2508148.2485940
Hechtman B, Sorin D, Mendelson A (2013). Exploring memory consistency for massively-threaded throughput-oriented processors. Proceedings of the 40th Annual International Symposium on Computer Architecture (201-212). DOI: 10.1145/2485922.2485940. Online publication date: 23-Jun-2013
https://dl.acm.org/doi/10.1145/2485922.2485940
Zhang Z, Wang X, Tong D, Yi J, Lu J, Wang K (2012). Active Store Window: Enabling Far Store-Load Forwarding with Scalability and Complexity-Efficiency. Journal of Computer Science and Technology 27:4 (769-780). DOI: 10.1007/s11390-012-1263-7. Online publication date: 12-Jul-2012
https://doi.org/10.1007/s11390-012-1263-7
Apolloni R, Chaver D, Rodriguez F, Pinuel L, Prieto M, Tirado F (2011). Hybrid timing-address oriented load-store queue filtering for an x86 architecture. IET Computers & Digital Techniques 5:2 (145). DOI: 10.1049/iet-cdt.2010.0004. Online publication date: 2011
https://doi.org/10.1049/iet-cdt.2010.0004
Ahn W, Qi S, Nicolaides M, Torrellas J, Lee J, Fang X, Midkiff S, Wong D, Albonesi D, Martonosi M, August D, Martínez J (2009). BulkCompiler. Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (133-144). DOI: 10.1145/1669112.1669131. Online publication date: 12-Dec-2009
https://dl.acm.org/doi/10.1145/1669112.1669131
Pericàs M, Cristal A, Cazorla F, González R, Veidenbaum A, Jiménez D, Valero M (2008). A Two-Level Load/Store Queue Based on Execution Locality. ACM SIGARCH Computer Architecture News 36:3 (25-36). DOI: 10.1145/1394608.1382171. Online publication date: 1-Jun-2008
https://dl.acm.org/doi/10.1145/1394608.1382171
