research-article

Branch vanguard: decomposing branch functionality into prediction and resolution instructions

Authors:

Daniel S. McFarlin,

Craig ZillesAuthors Info & Claims

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

Pages 323 - 335

https://doi.org/10.1145/2749469.2750400

Published: 13 June 2015 Publication History

Abstract

While control speculation is highly effective for generating good schedules in out-of-order processors, it is less effective for in-order processors because compilers have trouble scheduling in the presence of unbiased branches, even when those branches are highly predictable. In this paper, we demonstrate a novel architectural branch decomposition that separates the prediction and deconvergence point of a branch from its resolution, which enables the compiler to profitably schedule across predictable, but unbiased branches. We show that the hardware support for this branch architecture is a trivial extension of existing systems and describe a simple code transformation for exploiting this architectural support. As architectural changes are required, this technique is most compelling for a dynamic binary translation-based system like Project Denver.

We evaluate the performance improvements enabled by this transformation for several in-order configurations across the SPEC 2006 benchmark suites. We show that our technique produces a Geomean speedup of 11% for SPEC 2006 Integer, with speedups as large as 35%. As floating point benchmarks contain fewer unbiased, but predictable branches, our Geomean speedup on SPEC 2006 FP is 7%, with a maximum speedup of 26%.

References

[1]

J. R. Allen, K. Kennedy, C. Porterfield, and J. Warren, "Conversion of control dependence to data dependence," in Proceedings of the 10th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages, ser. POPL '83. New York, NY, USA: ACM, 1983, pp. 177--189. {Online}. Available: http://doi.acm.org/10.1145/567067.567085

Digital Library

[2]

D. I. August, D. A. Connors, J. C. Gyllenhaal, and W.-m. W. Hwu, "Architectural support for compiler-synthesized dynamic branch prediction strategies: Rationale and initial results," in Proceedings of the 3rd IEEE Symposium on High-Performance Computer Architecture, ser. HPCA '97. Washington, DC, USA: IEEE Computer Society, 1997, pp. 84--. {Online}. Available: http://dl.acm.org/citation.cfm?id=548716.822702

Digital Library

[3]

E. Brunvand, "The nsr processor," in System Sciences, 1993, Proceeding of the Twenty-Sixth Hawaii International Conference on, vol. i, Jan 1993, pp. 428--435 vol.1.

[4]

H. W. Cain and P. Nagpurkar, "Runahead execution vs. conventional data prefetching in the ibm power6 microprocessor," in ISPASS, 2010, pp. 203--212.

[5]

M. Charney, "Intel software development emulator." {Online}. Available: https://software.intel.com/en-us/articles/pintool

[6]

R. P. Colwell, R. P. Nix, J. J. O. Donnell, D. B. Papworth, and P. K. Rodman, "A vliw architecture for a trace scheduling compiler," in Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, Oct. 1987, pp. 180--192.

[7]

B. Dally, ""project denver"processor to usher in a new era of computing," Jan. 2011. {Online}. Available: http://blogs.nvidia.com/blog/2011/01/05/project-denver-processor-to-usher-in-new-era-of-computing

[8]

J. W. Davidson and D. B. Whalley, "Reducing the cost of branches by using registers," in Proceedings of the 17th Annual International Symposium on Computer Architecture, ser. ISCA '90. New York, NY, USA: ACM, 1990, pp. 182--191. {Online}. Available: http://doi.acm.org/10.1145/325164.325138

Digital Library

[9]

J. C. Dehnert, B. K. Grant, J. P. Banning, R. Johnson, T. Kistler, A. Klaiber, and J. Mattson, "The Transmeta Code Morphing Software: Using Speculation, Recovery, and Adaptive Retranslation to Address Real-life Challenges," in Proceedings of the International Symposium on Code Generation and Optimization, 2003, pp. 15--24.

Digital Library

[10]

J. Dundas and T. Mudge, "Improving data cache performance by pre-executing instructions under a cache miss," in Proceedings of the 11th International Conference on Supercomputing, ser. ICS '97. New York, NY, USA: ACM, 1997, pp. 68--75. {Online}. Available: http://doi.acm.org/10.1145/263580.263597

Digital Library

[11]

J. Edmondson, P. Rubinfeld, R. Preston, and V. Rajagopalan, "Superscalar instruction execution in the 21164 alpha microprocessor," Micro, IEEE, vol. 15, no. 2, pp. 33--43, Apr 1995.

Digital Library

[12]

M. Farrens and A. Pleszhun, "Implementation of the pipe processor," Computer, vol. 24, no. 1, pp. 65--70, Jan 1991.

Digital Library

[13]

B. A. Fields, S. Rubin, and R. Bodik, "Focusing processor policies via Critical-Path prediction," in Proceedings of the 28th Annual International Symposium on Computer Architecture, Jul. 2001, pp. 74--85. {Online}. Available: http://www.cs.wisc.edu/~bodik/research/isca01a.pdf

Digital Library

[14]

J. A. Fisher, "Trace scheduling: a technique for global microcode compaction," vol. 30(7), pp. 478--490, 1981.

Digital Library

[15]

J. Fritts and W. Wolf, "Evaluation of static and dynamic scheduling for media processors," in Proceedings of the 2nd Workshop on Media Processors and DSPs, ser. Micro '00, 2000.

[16]

J. R. Goodman, J.-t. Hsieh, K. Liou, A. R. Pleszkun, P. B. Schechter, and H. C. Young, "Pipe: A vlsi decoupled architecture," SIGARCH Comput. Archit. News, vol. 13, no. 3, pp. 20--27, Jun. 1985. {Online}. Available: http://doi.acm.org/10.1145/327070.327117

Digital Library

[17]

M. Gschwind, H. P. Hofstee, B. Flachs, M. Hopkins, Y. Watanabe, and T. Yamazaki, "Synergistic processing in cell's multicore architecture," IEEE Micro, vol. 26, no. 2, pp. 10--24, Mar. 2006. {Online}. Available: http://dx.doi.org/10.1109/MM.2006.41

Digital Library

[18]

J. Hennessy, N. Jouppi, F. Baskett, T. Gross, and J. Gill, "Hardware/software tradeoffs for increased performance," in Proceedings of the First International Symposium on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS I. New York, NY, USA: ACM, 1982, pp. 2--11. {Online}. Available: http://doi.acm.org/10.1145/800050.801820

Digital Library

[19]

A. Hilton, S. Nagarakatte, and A. Roth, "icfp: Tolerating all-level cache misses in in-order processors," IEEE Micro, vol. 30, no. 1, pp. 12--19, Jan. 2010. {Online}. Available: http://dx.doi.org/10.1109/MM.2010.20

Digital Library

[20]

P. Y. T. Hsu and E. S. Davidson, "Highly concurrent scalar processing," in Proceedings of the 13th Annual International Symposium on Computer Architecture, ser. ISCA '86. Los Alamitos, CA, USA: IEEE Computer Society Press, 1986, pp. 386--395. {Online}. Available: http://dl.acm.org/citation.cfm?id=17407.17401

Digital Library

[21]

W. M. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. O. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery, "The Superblock: An Effective Technique for VLIW and Superscalar Compilation," Journal of Supercomputing, vol. 7, no. 1, pp. 229--248, Mar 1993. {Online}. Available: http://www.crhc.uiuc.edu/IMPACT/ftp/journal/jsc.superblock.93.pdf

Digital Library

[22]

Intel, "Intel itanium processor 9500 series refence manual. software development and optimization guide," Intel Technical Manual, 2012.

[23]

A. Jaleel, "Memory characterization of workloads using instrumentation-driven simulation: A pin-based memory characterization of the spec cpu2000 and spec cpu2006 benchmark suites." {Online}. Available: http://www.jaleels.org/ajaleel/workload/SPECanalysis.pdf

[24]

V. Kathail, M. Schlansker, and B. Rau, "HPL PlayDoh architecture specification: Version 1.0," Hewlett-Packard Laboratories, Tech. Rep. HPL-93-80, Feb. 1993.

[25]

H. Kim, J. Joao, O. Mutlu, and Y. N. Patt, "Profile-assisted compiler support for dynamic predication in diverge-merge processors," in Proceedings of the International Symposium on Code Generation and Optimization, ser. CGO '07. Washington, DC, USA: IEEE Computer Society, 2007, pp. 367--378. {Online}. Available: http://dx.doi.org/10.1109/CGO.2007.31

Digital Library

[26]

H. Kim, J. A. Joao, O. Mutlu, and Y. N. Patt, "Diverge-merge processor: Generalized and energy-efficient dynamic predication," IEEE Micro, vol. 27, no. 1, pp. 94--104, Jan. 2007. {Online}. Available: http://dx.doi.org/10.1109/MM.2007.9

Digital Library

[27]

H. Kim, O. Mutlu, J. Stark, and Y. Patt, "Wish branches: combining conditional branching and predication for adaptive predicated execution," in Microarchitecture, 2005. MICRO-38. Proceedings. 38th Annual IEEE/ACM International Symposium on, Nov 2005, pp. 12 pp.--54.

Digital Library

[28]

A. Klauser, T. Austin, D. Grunwald, and B. Calder, "Dynamic hammock predication for non-predicated instruction set architectures," in Parallel Architectures and Compilation Techniques, 1998. Proceedings. 1998 International Conference on, Oct 1998, pp. 278--285.

Digital Library

[29]

S. Mahlke and B. Natarajan, "Compiler synthesized dynamic branch prediction," in Microarchitecture, 1996. MICRO-29.Proceedings of the 29th Annual IEEE/ACM International Symposium on, Dec 1996, pp. 153--164.

Digital Library

[30]

S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann, "Effective compiler support for predicated execution using the hyperblock," in In Proceedings of the 25th International Symposium on Microarchitecture, 1992, pp. 45--54.

Digital Library

[31]

D. S. McFarlin, C. Tucker, and C. Zilles, "Discerning the dominant out-of-order performance advantage: Is it speculation or dynamism?" in Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '13. New York, NY, USA: ACM, 2013, pp. 241--252. {Online}. Available: http://doi.acm.org/10.1145/2451116.2451143

Digital Library

[32]

C. McNairy and D. Soltis, "Itanium 2 processor microarchitecture," IEEE Micro, vol. 23, no. 2, pp. 44--55, Mar. 2003. {Online}. Available: http://dx.doi.org/10.1109/MM.2003.1196114

Digital Library

[33]

A. S. Nadkarni and A. Tyagi, "A trace based evaluation of speculative branch decoupling," in Computer Design, 2000. Proceedings. 2000 International Conference on. IEEE, 2000, pp. 300--307.

Digital Library

[34]

N. Neelakantam, R. Rajwar, S. Srinivas, U. Srinivasan, and C. Zilles, "Hardware atomicity for reliable software speculation," in Proceedings of the 34th International Symposium on Computer Architecture, 2007, pp. 174--185.

Digital Library

[35]

A. Seznec, "A new case for the tage branch predictor," in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-44. New York, NY, USA: ACM, 2011, pp. 117--127. {Online}. Available: http://doi.acm.org/10.1145/2155620.2155635

Digital Library

[36]

R. Sheikh, J. Tuck, and E. Rotenberg, "Control-flow decoupling," in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, ser. MICRO-45. Washington, DC, USA: IEEE Computer Society, 2012, pp. 329--340. {Online}. Available: http://dx.doi.org/10.1109/MICRO.2012.38

Digital Library

[37]

G. Shobaki, K. Wilken, and M. Heffernan, "Optimal trace scheduling using enumeration," ACM Trans. Archit. Code Optim., vol. 5, no. 4, pp. 19:1--19:32, Mar. 2009. {Online}. Available: http://doi.acm.org/10.1145/1498690.1498694

Digital Library

[38]

M. Smotherman, "Documentation project for the IBM ACS-1 Supercomputer," Jun. 2010. {Online}. Available: http://www.cs.clemson.edu/~mark/acs.html

[39]

A. Srivastava and A. Despain, "Prophetic branches: a branch architecture for code compaction and efficient execution," in Microarchitecture, 1993., Proceedings of the 26th Annual International Symposium on, Dec 1993, pp. 94--99.

Digital Library

[40]

A. Tyagi, H.-C. Ng, and P. Mohapatra, "Dynamic branch decoupled architecture," in Computer Design, 1999.(ICCD'99) International Conference on. IEEE, 1999, pp. 442--450.

Digital Library

[41]

W. J. Watson, "The ti asc: A highly modular and flexible super computer architecture," in Proceedings of the December 5-7, 1972, Fall Joint Computer Conference, Part I, ser. AFIPS '72 (Fall, part I). New York, NY, USA: ACM, 1972, pp. 221--228. {Online}. Available: http://doi.acm.org/10.1145/1479992.1480022

Digital Library

[42]

C. Young and M. D. Smith, "Improving the accuracy of static branch prediction using branch correlation," in Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS VI. New York, NY, USA: ACM, 1994, pp. 232--241. {Online}. Available: http://doi.acm.org/10.1145/195473.195549

Digital Library

[43]

H. C. Young, "Code scheduling methods for some architectural features in pipe," Microprocessing and Microprogramming, vol. 22, no. 1, pp. 39--63, 1988. {Online}. Available: http://www.sciencedirect.com/science/article/pii/0165607488900063

Digital Library

[44]

M. Yourst, "Ptlsim: A cycle accurate full system x86-64 microarchitectural simulator," in Performance Analysis of Systems Software, 2007. ISPASS 2007. IEEE International Symposium on, April 2007, pp. 23--34.

Cited By

Hajiabadi ADiavastos ACarlson TSherwood TBerger EKozyrakis C(2021)NOREBA: a compiler-informed non-speculative out-of-order commit processorProceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3445814.3446726(182-193)Online publication date: 19-Apr-2021
https://dl.acm.org/doi/10.1145/3445814.3446726
Chen LZhu JDeng YLi ZChen JJiang XYin SWei SLiu L(2021)An Elastic Task Scheduling Scheme on Coarse-Grained Reconfigurable ArchitecturesIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.308480432:12(3066-3080)Online publication date: 1-Dec-2021
https://doi.org/10.1109/TPDS.2021.3084804
Adileh ALilja DEeckhout LOskin MInoue K(2018)Architectural support for probabilistic branchesProceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO.2018.00018(108-120)Online publication date: 20-Oct-2018
https://dl.acm.org/doi/10.1109/MICRO.2018.00018
Show More Cited By

Index Terms

Branch vanguard: decomposing branch functionality into prediction and resolution instructions
1. Hardware
  1. Hardware validation
  2. Very large scale integration design
    1. Application-specific VLSI designs
      1. Application specific instruction set processors
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language features
        Control structures

Recommendations

Branch vanguard: decomposing branch functionality into prediction and resolution instructions
ISCA'15

While control speculation is highly effective for generating good schedules in out-of-order processors, it is less effective for in-order processors because compilers have trouble scheduling in the presence of unbiased branches, even when those branches ...
Classification-directed branch predictor design
A latency-conscious SMT branch prediction architecture

Executing multiple threads has proved to be an effective solution to partially hide latencies that appear in a processor. When a thread is stalled because of a long-latency operation is being processed, such as a memory access or a floating-point ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

June 2015

768 pages

ISBN:9781450334020

DOI:10.1145/2749469

General Chair:
Debbie Marr
Intel
,
Program Chair:
David Albonesi
Cornell

ACM SIGARCH Computer Architecture News Volume 43, Issue 3S
ISCA'15
June 2015
745 pages
ISSN:0163-5964
DOI:10.1145/2872887
Editor:
Doug DeGroot
acm dot org
Issue’s Table of Contents

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 13 June 2015

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Research-article

Conference

ISCA '15

Sponsor:

IEEE TCCA
SIGARCH

ISCA '15: The 42nd Annual International Symposium on Computer Architecture

June 13 - 17, 2015

Oregon, Portland

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Upcoming Conference

ISCA '25

Sponsor:
sigarch

The 52nd Annual International Symposium on Computer Architecture

June 21 - 25, 2025

Tokyo , Japan

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

9
Total Citations
View Citations
625
Total Downloads

Downloads (Last 12 months)20
Downloads (Last 6 weeks)2

Reflects downloads up to 11 Aug 2024

Other Metrics

View Author Metrics

Citations

Cited By

Hajiabadi ADiavastos ACarlson TSherwood TBerger EKozyrakis C(2021)NOREBA: a compiler-informed non-speculative out-of-order commit processorProceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems10.1145/3445814.3446726(182-193)Online publication date: 19-Apr-2021
https://dl.acm.org/doi/10.1145/3445814.3446726
Chen LZhu JDeng YLi ZChen JJiang XYin SWei SLiu L(2021)An Elastic Task Scheduling Scheme on Coarse-Grained Reconfigurable ArchitecturesIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.308480432:12(3066-3080)Online publication date: 1-Dec-2021
https://doi.org/10.1109/TPDS.2021.3084804
Adileh ALilja DEeckhout LOskin MInoue K(2018)Architectural support for probabilistic branchesProceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO.2018.00018(108-120)Online publication date: 20-Oct-2018
https://dl.acm.org/doi/10.1109/MICRO.2018.00018
Huang ZHilton ALee B(2016)Decoupling loads for nano-instruction set computersACM SIGARCH Computer Architecture News10.1145/3007787.300118144:3(406-417)Online publication date: 18-Jun-2016
https://dl.acm.org/doi/10.1145/3007787.3001181
Laurenzano MZhang YChen JTang LMars J(2016)PowerChopACM SIGARCH Computer Architecture News10.1145/3007787.300115244:3(140-152)Online publication date: 18-Jun-2016
https://dl.acm.org/doi/10.1145/3007787.3001152
Huang ZHilton ALee BMin SLoh G(2016)Decoupling loads for nano-instruction set computersProceedings of the 43rd International Symposium on Computer Architecture10.1109/ISCA.2016.43(406-417)Online publication date: 18-Jun-2016
https://dl.acm.org/doi/10.1109/ISCA.2016.43
Laurenzano MZhang YChen JTang LMars JMin SLoh G(2016)PowerChopProceedings of the 43rd International Symposium on Computer Architecture10.1109/ISCA.2016.22(140-152)Online publication date: 18-Jun-2016
https://dl.acm.org/doi/10.1109/ISCA.2016.22
Nowatzki TGovindaraju VSankaralingam K(2015)A Graph-Based Program Representation for Analyzing Hardware Specialization ApproachesIEEE Computer Architecture Letters10.1109/LCA.2015.247680114:2(94-98)Online publication date: 1-Jul-2015
https://dl.acm.org/doi/10.1109/LCA.2015.2476801
Laurenzano MZhang YChen JTang LMars JMin SLoh G(2016)PowerChopProceedings of the 43rd International Symposium on Computer Architecture10.1109/ISCA.2016.22(140-152)Online publication date: 18-Jun-2016
https://dl.acm.org/doi/10.1109/ISCA.2016.22

View Options

Get Access

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents