SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores

Authors:

Alexandra Jimborean,

Trevor E. Carlson,

Konstantinos Koukos,

Magnus Själander,

Stefanos Kaxiras

ACM SIGPLAN Notices, Volume 53, Issue 4

Pages 328 - 343

https://doi.org/10.1145/3296979.3192393

Published: 11 June 2018

Abstract

Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve both performance and high energy efficiency.

In this paper, we address one of the main performance bottlenecks, last-level cache misses, through a software-hardware co-design. Our approach hides memory latency and attains increased memory- and instruction-level parallelism by orchestrating a non-speculative, execute-ahead paradigm in software (SWOOP). While out-of-order (OoO) architectures attempt to hide memory latency by dynamically reordering instructions, they do so through expensive, power-hungry, speculative mechanisms. We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching. In contrast to previous approaches, we do not rely on either software or hardware speculation, which can be detrimental to efficiency. Our SWOOP compiler is enhanced with lightweight architectural support and is thus able to transform applications that include highly complex control flow and indirect memory accesses.
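The abstract names decoupled access-execution as one of the techniques the approach builds on. As a rough illustration only, and not the SWOOP compiler's actual transformation, the sketch below splits a loop with indirect accesses into an access phase (address computation and loads only) and an execute phase (computation on values already loaded). The function names, the chunk size, and the cycle figures in the toy cost model are all assumptions made for this example; the model simply assumes that independent loads issued back to back overlap their miss latencies on an in-order core.

```python
from typing import List, Sequence

# Hypothetical figures for a toy cost model; not taken from the paper.
MISS_LATENCY = 100  # assumed cycles for one last-level cache miss
COMPUTE_COST = 4    # assumed cycles of computation per element


def sum_indirect_baseline(data: Sequence[int], index: Sequence[int]) -> int:
    """Original loop: each iteration loads data[index[i]] and uses it at once."""
    total = 0
    for i in index:
        total += data[i] * 2
    return total


def sum_indirect_decoupled(data: Sequence[int], index: Sequence[int], chunk: int) -> int:
    """Same computation, restructured into access and execute phases per chunk."""
    total = 0
    for start in range(0, len(index), chunk):
        # Access phase: only the loads for this chunk, no dependent computation.
        loaded: List[int] = [data[i] for i in index[start:start + chunk]]
        # Execute phase: only computation, on values that are already loaded.
        for value in loaded:
            total += value * 2
    return total


def baseline_cycles(n: int) -> int:
    """Toy model: every iteration stalls for a full miss, then computes."""
    return n * (MISS_LATENCY + COMPUTE_COST)


def decoupled_cycles(n: int, chunk: int) -> int:
    """Toy model: a chunk's loads issue one per cycle and their misses overlap,
    so each chunk pays roughly one miss latency instead of one per element."""
    total = 0
    for start in range(0, n, chunk):
        size = min(chunk, n - start)
        access = size + MISS_LATENCY      # issue all loads, then wait once
        execute = size * COMPUTE_COST     # compute on the loaded values
        total += access + execute
    return total


if __name__ == "__main__":
    data = list(range(100))
    index = [7, 3, 99, 0, 42, 7, 15, 8, 61]
    print(sum_indirect_baseline(data, index), sum_indirect_decoupled(data, index, 4))
    print(baseline_cycles(64), decoupled_cycles(64, 8))
```

Both functions compute the same result because the loads in the access phase do not depend on the execute phase; the reordering needs no speculation in this simple case. Real programs with complex control flow and aliasing require the compiler analysis and architectural support that the paper describes.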

Supplementary Material

WEBM File (p328-tran.webm), 116.52 MB


Index Terms

SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores
1. Hardware
  1. Power and energy
2. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation
  2. Software organization and properties
    1. Extra-functional properties
      1. Software performance

Information & Contributors

Information

Published In

ACM SIGPLAN Notices Volume 53, Issue 4

PLDI '18

April 2018

834 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/3296979

Editor:
Matthew Fluet
Rochester Institute of Technology

PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2018
825 pages
ISBN:9781450356985
DOI:10.1145/3192366
General Chair:
Jeffrey S. Foster
University of Maryland at College Park, USA
Program Chair:
Dan Grossman
University of Washington, USA

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

Vetenskapsrådet

Cited By

Deshmukh A, Patt Y (2021). Criticality Driven Fetch. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 380-391. Online publication date: 18-Oct-2021.
https://dl.acm.org/doi/10.1145/3466752.3480115

Wang L, Zhang X, Wang S, Jiang Z, Lu T, Chen M, Luo S, Huang K (2024). Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access. ACM Transactions on Architecture and Code Optimization 21:3, pp. 1-28. Online publication date: 9-May-2024.
https://dl.acm.org/doi/10.1145/3663479

Roelandts J, Naithani A, Ainsworth S, Jones T, Eeckhout L (2024). Scalar Vector Runahead. 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1367-1381. Online publication date: 2-Nov-2024.
https://doi.org/10.1109/MICRO61859.2024.00101

Chen D, Zhang T, Huang Y, Zhu J, Liu Y, Gou P, Feng C, Li B, Wei S, Liu L, Solihin Y, Heinrich M (2023). Orinoco: Ordered Issue and Unordered Commit with Non-Collapsible Queues. Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-14. Online publication date: 17-Jun-2023.
https://dl.acm.org/doi/10.1145/3579371.3589046

Diavastos A, Carlson T (2022). Efficient Instruction Scheduling Using Real-time Load Delay Tracking. ACM Transactions on Computer Systems 40:1-4, pp. 1-21. Online publication date: 24-Nov-2022.
https://dl.acm.org/doi/10.1145/3548681

Lakshminarasimhan K, Naithani A, Feliu J, Eeckhout L (2022). The Forward Slice Core: A High-Performance, Yet Low-Complexity Microarchitecture. ACM Transactions on Architecture and Code Optimization 19:2, pp. 1-25. Online publication date: 31-Jan-2022.
https://dl.acm.org/doi/10.1145/3499424

Orenes-Vera M, Manocha A, Balkind J, Gao F, Aragón J, Wentzlaff D, Martonosi M, Salapura V, Zahran M, Chong F, Tang L (2022). Tiny but mighty. Proceedings of the 49th Annual International Symposium on Computer Architecture, pp. 817-830. Online publication date: 18-Jun-2022.
https://dl.acm.org/doi/10.1145/3470496.3527400

Cheng L, Pan P, Zhao Z, Ranjan K, Weber J, Veluri B, Ehsani S, Ruttenberg M, Jung D, Ivanov P, Richmond D, Taylor M, Zhang Z, Batten C (2022). A Tensor Processing Framework for CPU-Manycore Heterogeneous Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 41:6, pp. 1620-1635. Online publication date: Jun-2022.
https://doi.org/10.1109/TCAD.2021.3103825

Lakshminarasimhan K, Naithani A, Feliu J, Eeckhout L, Sarkar V, Kim H (2020). The Forward Slice Core Microarchitecture. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 361-372. Online publication date: 30-Sep-2020.
https://dl.acm.org/doi/10.1145/3410463.3414629
