DOI: 10.1145/2749469.2750400
research-article

Branch vanguard: decomposing branch functionality into prediction and resolution instructions

Published: 13 June 2015
    Abstract

    While control speculation is highly effective for generating good schedules in out-of-order processors, it is less effective for in-order processors because compilers have trouble scheduling in the presence of unbiased branches, even when those branches are highly predictable. In this paper, we demonstrate a novel architectural branch decomposition that separates the prediction and deconvergence point of a branch from its resolution, which enables the compiler to profitably schedule across predictable but unbiased branches. We show that the hardware support for this branch architecture is a trivial extension of existing systems and describe a simple code transformation for exploiting this architectural support. Because architectural changes are required, this technique is most compelling for a dynamic binary translation-based system like Project Denver.
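    The distinction between an unbiased branch and a predictable one, and the idea of splitting a branch into a prediction step and a later resolution step, can be illustrated with a toy Python model. This is only a sketch under stated assumptions, not the paper's instruction set or code transformation: the function names (`condition`, `conventional`, `decomposed`, `simulate`), the alternating condition, and the one-entry-per-previous-outcome predictor table are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


def condition(x: int) -> bool:
    """Stand-in for a branch condition that becomes available late.

    On consecutive x it alternates, so it is taken half the time
    (unbiased) yet follows a simple pattern (predictable).
    """
    return x % 2 == 0


def conventional(x: int) -> int:
    """Single branch: direction is chosen and checked at the same point."""
    return x + 10 if condition(x) else x - 10


def decomposed(x: int, guess: bool) -> Tuple[int, bool]:
    """Hypothetical predict/resolve split. Returns (result, mispredicted)."""
    # Prediction step: choose a path from the guess alone, so work on
    # that path could be scheduled before the condition is known.
    speculative = x + 10 if guess else x - 10
    # Resolution step: compare the guess with the real outcome.
    actual = condition(x)
    if actual == guess:
        return speculative, False
    # Correction path: discard the speculative value, redo the other side.
    return (x + 10 if actual else x - 10), True


def simulate(xs: List[int]) -> Tuple[float, float]:
    """Return (taken fraction, prediction accuracy) over the inputs xs."""
    table: Dict[bool, bool] = {}  # previous outcome -> predicted next outcome
    prev = False
    taken = 0
    correct = 0
    for x in xs:
        guess = table.get(prev, False)
        result, mispredicted = decomposed(x, guess)
        assert result == conventional(x)  # the split preserves semantics
        actual = condition(x)
        taken += actual
        correct += not mispredicted
        table[prev] = actual  # learn what followed the previous outcome
        prev = actual
    return taken / len(xs), correct / len(xs)


if __name__ == "__main__":
    bias, accuracy = simulate(list(range(100)))
    print(f"taken fraction: {bias:.2f}, prediction accuracy: {accuracy:.2f}")
```

    In this model the branch is taken half the time, so a static "likely direction" is of little use, while a predictor with one bit of history gets almost every instance right. That combination is the case the abstract describes as predictable but unbiased.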
    We evaluate the performance improvements enabled by this transformation for several in-order configurations across the SPEC 2006 benchmark suites. We show that our technique produces a geometric mean speedup of 11% for SPEC 2006 Integer, with speedups as large as 35%. Because floating-point benchmarks contain fewer unbiased but predictable branches, our geometric mean speedup on SPEC 2006 FP is 7%, with a maximum speedup of 26%.
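    For readers unfamiliar with the suite-level metric, a geometric mean of per-benchmark speedup ratios can be computed as in the minimal sketch below. The helper name `geomean` and the ratios in `example_speedups` are made-up placeholders for illustration; they are not per-benchmark results from the paper.

```python
import math
from typing import Sequence


def geomean(ratios: Sequence[float]) -> float:
    """Geometric mean of positive ratios, computed in log space."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical speedup ratios (baseline time / new time); illustrative only.
example_speedups = [1.35, 1.02, 1.10, 1.00]

if __name__ == "__main__":
    pct = (geomean(example_speedups) - 1.0) * 100.0
    print(f"geometric mean speedup: {pct:.1f}%")
```

    The geometric mean is the usual choice for averaging ratios because a single large speedup pulls it up less than it would an arithmetic mean, which is why a maximum of 35% can coexist with a suite-level figure of 11%.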

    Published In

    ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
    June 2015
    768 pages
    ISBN: 9781450334020
    DOI: 10.1145/2749469
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • Research-article

    Conference

    ISCA '15
    Acceptance Rates

    Overall Acceptance Rate 543 of 3,203 submissions, 17%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 20
    • Downloads (Last 6 weeks): 2
    Reflects downloads up to 11 Aug 2024

    Citations

    Cited By

    • (2021) NOREBA: a compiler-informed non-speculative out-of-order commit processor. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 182-193. DOI: 10.1145/3445814.3446726. Online publication date: 19-Apr-2021
    • (2021) An Elastic Task Scheduling Scheme on Coarse-Grained Reconfigurable Architectures. IEEE Transactions on Parallel and Distributed Systems, 32:12, pp. 3066-3080. DOI: 10.1109/TPDS.2021.3084804. Online publication date: 1-Dec-2021
    • (2018) Architectural support for probabilistic branches. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 108-120. DOI: 10.1109/MICRO.2018.00018. Online publication date: 20-Oct-2018
    • (2016) Decoupling loads for nano-instruction set computers. ACM SIGARCH Computer Architecture News, 44:3, pp. 406-417. DOI: 10.1145/3007787.3001181. Online publication date: 18-Jun-2016
    • (2016) PowerChop. ACM SIGARCH Computer Architecture News, 44:3, pp. 140-152. DOI: 10.1145/3007787.3001152. Online publication date: 18-Jun-2016
    • (2016) Decoupling loads for nano-instruction set computers. Proceedings of the 43rd International Symposium on Computer Architecture, pp. 406-417. DOI: 10.1109/ISCA.2016.43. Online publication date: 18-Jun-2016
    • (2016) PowerChop. Proceedings of the 43rd International Symposium on Computer Architecture, pp. 140-152. DOI: 10.1109/ISCA.2016.22. Online publication date: 18-Jun-2016
    • (2015) A Graph-Based Program Representation for Analyzing Hardware Specialization Approaches. IEEE Computer Architecture Letters, 14:2, pp. 94-98. DOI: 10.1109/LCA.2015.2476801. Online publication date: 1-Jul-2015
