Microarchitecture Optimizations for Exploiting Memory-Level Parallelism

Authors:

Santosh Abraham

ISCA '04: Proceedings of the 31st annual international symposium on Computer architecture

Page 76

Published: 02 March 2004

Abstract

The performance of memory-bound commercial applications such as databases is limited by increasing memory latencies. In this paper, we show that exploiting memory-level parallelism (MLP) is an effective approach for improving the performance of these applications and that microarchitecture has a profound impact on achievable MLP. Using the epoch model of MLP, we reason how traditional microarchitecture features such as out-of-order issue and state-of-the-art microarchitecture techniques such as runahead execution affect MLP. Simulation results show that a moderately aggressive out-of-order issue processor improves MLP over an in-order issue processor by 12-30%, and that aggressive handling of loads, branches and serializing instructions is needed to attain the full benefits of large out-of-order instruction windows. The results also show that a processor's issue window and reorder buffer should be decoupled to exploit MLP more efficiently. In addition, we demonstrate that runahead execution is highly effective in enhancing MLP, potentially improving the MLP of the database workload by 82% and its overall performance by 60%. Finally, our limit study shows that there is considerable headroom in improving MLP and overall performance by implementing effective instruction prefetching, more accurate branch prediction and better value prediction in addition to runahead execution.
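
As a rough illustration of the quantity the abstract discusses, the sketch below computes an average MLP figure from a trace of off-chip miss intervals. It assumes a commonly used definition (the average number of misses outstanding over the cycles in which at least one miss is outstanding); the abstract itself does not spell out a definition, and the function name `average_mlp` and the `(issue_cycle, complete_cycle)` interval format are hypothetical choices made for this example.

```python
def average_mlp(misses):
    """Average number of outstanding misses while at least one is in flight.

    misses: list of (issue_cycle, complete_cycle) half-open intervals.
    This is an assumed, simplified definition for illustration only.
    """
    if not misses:
        return 0.0

    # Total miss-cycles: each miss contributes its own latency.
    total_outstanding = sum(end - start for start, end in misses)

    # Merge overlapping intervals to count cycles with >= 1 miss in flight.
    busy = 0
    cur_start, cur_end = None, None
    for start, end in sorted(misses):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start

    # Guard against a trace made only of zero-length intervals.
    if busy == 0:
        return 0.0
    return total_outstanding / busy


if __name__ == "__main__":
    print(average_mlp([(0, 100), (100, 200)]))  # back-to-back misses
    print(average_mlp([(0, 100), (0, 100)]))    # fully overlapped misses
```

Under this assumed definition, two back-to-back misses give an MLP of 1.0 and two fully overlapped misses give 2.0, which is the sense in which overlapping long-latency accesses hides memory latency.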

Information

Published In

ISCA '04: Proceedings of the 31st annual international symposium on Computer architecture

June 2004

373 pages

ISBN:0769521436

ACM SIGARCH Computer Architecture News Volume 32, Issue 2
ISCA 2004
March 2004
373 pages
ISSN:0163-5964
DOI:10.1145/1028176
Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

IEEE Computer Society

United States

Publication History

Published: 02 March 2004

Qualifiers

Article

Conference

ISCA04

Sponsor:

SIGARCH

ISCA04: The 31st Annual International Symposium on Computer Architecture 2004

June 19 - 23, 2004

München, Germany

Acceptance Rates

ISCA '04 Paper Acceptance Rate 31 of 217 submissions, 14%;

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 178
Total Downloads: 1,892
Downloads (Last 12 months): 90
Downloads (Last 6 weeks): 2

Reflects downloads up to 05 Mar 2025

Citations

Cited By

Gupta S, Bhattacharyya A, Oh Y, Bhattacharjee A, Falsafi B, Payer M, Martínez J, Duato J, John L (2021) Rebooting virtual memory with midgard. Proceedings of the 48th Annual International Symposium on Computer Architecture (512-525). DOI: 10.1109/ISCA52012.2021.00047. Online publication date: 14-Jun-2021
https://dl.acm.org/doi/10.1109/ISCA52012.2021.00047
Ghose S, Li T, Hajinazar N, Cali D, Mutlu O (2019) Demystifying Complex Workload-DRAM Interactions. Proceedings of the ACM on Measurement and Analysis of Computing Systems 3:3 (1-50). DOI: 10.1145/3366708. Online publication date: 17-Dec-2019
https://dl.acm.org/doi/10.1145/3366708
Radulovic M, Sánchez Verdejo R, Carpenter P, Radojković P, Jacob B, Ayguadé E (2019) PROFET. Proceedings of the ACM on Measurement and Analysis of Computing Systems 3:2 (1-33). DOI: 10.1145/3341617.3326149. Online publication date: 19-Jun-2019
https://dl.acm.org/doi/10.1145/3341617.3326149
Tang X, Kandemir M, Karakoy M, Arunachalam M, McKinley K, Fisher K (2019) Co-optimizing memory-level parallelism and cache-level parallelism. Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation (935-949). DOI: 10.1145/3314221.3314599. Online publication date: 8-Jun-2019
https://dl.acm.org/doi/10.1145/3314221.3314599
Alipour M, Carlson T, Black-Schaffer D, Kaxiras S (2019) Maximizing Limited Resources. Journal of Signal Processing Systems 91:3-4 (379-397). DOI: 10.1007/s11265-018-1369-4. Online publication date: 1-Mar-2019
https://dl.acm.org/doi/10.1007/s11265-018-1369-4
Srinivasa S, Ramanathan A, Li X, Chen W, Hsueh F, Yang C, Shen C, Shieh J, Gupta S, Chang M, Ghosh S, Sampson J, Narayanan V (2018) A Monolithic-3D SRAM Design with Enhanced Robustness and In-Memory Computation Support. Proceedings of the International Symposium on Low Power Electronics and Design (1-6). DOI: 10.1145/3218603.3218645. Online publication date: 23-Jul-2018
https://dl.acm.org/doi/10.1145/3218603.3218645
Van Den Steen S, Eeckhout L (2018) Modeling Superscalar Processor Memory-Level Parallelism. IEEE Computer Architecture Letters 17:1 (9-12). DOI: 10.1109/LCA.2017.2701370. Online publication date: 1-Jan-2018
https://dl.acm.org/doi/10.1109/LCA.2017.2701370
Tran K, Carlson T, Koukos K, Själander M, Spiliopoulos V, Kaxiras S, Jimborean A, Reddi V, Smith A, Tang L (2017) Clairvoyance: look-ahead compile-time scheduling. Proceedings of the 2017 International Symposium on Code Generation and Optimization (171-184). DOI: 10.5555/3049832.3049852. Online publication date: 4-Feb-2017
https://dl.acm.org/doi/10.5555/3049832.3049852
Tang X, Kandemir M, Yedlapalli P, Kotra J, Hsu W, Yang C, Lipasti M, Lee H (2016) Improving bank-level parallelism for irregular applications. The 49th Annual IEEE/ACM International Symposium on Microarchitecture (1-12). DOI: 10.5555/3195638.3195708. Online publication date: 15-Oct-2016
https://dl.acm.org/doi/10.5555/3195638.3195708
Lee D, Ghose S, Pekhimenko G, Khan S, Mutlu O (2016) Simultaneous Multi-Layer Access. ACM Transactions on Architecture and Code Optimization 12:4 (1-29). DOI: 10.1145/2832911. Online publication date: 6-Jan-2016
https://dl.acm.org/doi/10.1145/2832911
