research-article

SWOOP: software-hardware co-design for non-speculative, execute-ahead, in-order cores

Published: 11 June 2018 Publication History

Abstract

Increasing demands for energy efficiency constrain emerging hardware. These new hardware trends challenge the established assumptions in code generation and force us to rethink existing software optimization techniques. We propose a cross-layer redesign of the way compilers and the underlying microarchitecture are built and interact, to achieve both performance and high energy efficiency.
In this paper, we address one of the main performance bottlenecks, last-level cache misses, through a software-hardware co-design. Our approach hides memory latency and attains increased memory- and instruction-level parallelism by orchestrating a non-speculative, execute-ahead paradigm in software (SWOOP). While out-of-order (OoO) architectures attempt to hide memory latency by dynamically reordering instructions, they do so through expensive, power-hungry, speculative mechanisms. We aim to shift this complexity into software, and we build upon compilation techniques inherited from VLIW, software pipelining, modulo scheduling, decoupled access-execution, and software prefetching. In contrast to previous approaches, we do not rely on either software or hardware speculation, which can be detrimental to efficiency. Our SWOOP compiler is enhanced with lightweight architectural support and is thus able to transform applications that include highly complex control flow and indirect memory accesses.

Published In

ACM SIGPLAN Notices, Volume 53, Issue 4
PLDI '18
April 2018
834 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3296979
  • PLDI 2018: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation
    June 2018
    825 pages
    ISBN: 9781450356985
    DOI: 10.1145/3192366
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. compilers
  2. hardware-software co-design
  3. memory level parallelism

Qualifiers

  • Research-article
