
An Event-Triggered Programmable Prefetcher for Irregular Workloads

Published: 19 March 2018

Abstract

Many modern workloads compute on large amounts of data, often with irregular memory accesses. Current architectures perform poorly for these workloads, as existing prefetching techniques cannot capture the memory access patterns; these applications end up heavily memory-bound as a result. Although a number of techniques exist to explicitly configure a prefetcher with traversal patterns, gaining significant speedups, they do not generalise beyond their target data structures. Instead, we propose an event-triggered programmable prefetcher combining the flexibility of a general-purpose computational unit with an event-based programming model, along with compiler techniques to automatically generate events from the original source code with annotations. This allows more complex fetching decisions to be made, without needing to stall when intermediate results are required. Using our programmable prefetching system, combined with small prefetch kernels extracted from applications, we achieve an average 3.0x speedup in simulation for a variety of graph, database and HPC workloads.
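
As a rough illustration of the event-triggered idea described in the abstract, the sketch below models a loop of the form `use(B[A[i]])` in a toy cycle-count simulation. It is not the paper's hardware design, programming model, or compiler output. The function `simulate`, the constants `MEM_LATENCY` and `WORK`, the flat miss latency, and the unbounded cache are all assumptions made for this example. Two small kernels stand in for prefetch events: one runs when the core loads `A[i]` and fetches `A[i + lookahead]`; the other runs when that data arrives and uses the fetched value to fetch the matching element of `B`. Neither kernel waits for memory, which is the property the abstract refers to as not stalling on intermediate results.

```python
import heapq
import random

MEM_LATENCY = 100  # assumed cycles for any miss; real latencies vary
WORK = 10          # assumed compute cycles per loop iteration


def simulate(index, lookahead=0):
    """Cycles to run `for i: use(B[A[i]])`; lookahead=0 disables prefetching."""
    ready_at = {}  # (array, position) -> cycle at which the data is present
    events = []    # min-heap of (cycle, j): "prefetched A[j] has arrived"
    now = 0

    def fetch(cycle, key):
        # Non-blocking fetch: record the arrival time and return it at once.
        return ready_at.setdefault(key, cycle + MEM_LATENCY)

    def run_kernels(until):
        # Arrival-triggered kernel: read the arrived A[j], then fetch B[A[j]].
        while events and events[0][0] <= until:
            cycle, j = heapq.heappop(events)
            fetch(cycle, ("B", index[j]))

    def demand_load(key):
        nonlocal now
        run_kernels(now)                 # handle events that fired so far
        now = max(now, fetch(now, key))  # the core stalls until data is there

    for i, target in enumerate(index):
        j = i + lookahead
        if lookahead and j < len(index):
            # Load-triggered kernel: fetch A[j], request an event on arrival.
            heapq.heappush(events, (fetch(now, ("A", j)), j))
        demand_load(("A", i))
        demand_load(("B", target))
        now += WORK
    return now


# Hypothetical workload: a shuffled index array, so accesses to B are irregular.
index = list(range(1000))
random.Random(0).shuffle(index)
baseline = simulate(index)
prefetched = simulate(index, lookahead=32)
print(baseline, prefetched)
```

In this model the prefetched run only pays full miss latency during the first `lookahead` iterations, so it finishes in fewer cycles than the baseline. The absolute numbers carry no meaning beyond the toy model and should not be compared with the speedups reported in the paper.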

      Published In

      ACM SIGPLAN Notices, Volume 53, Issue 2
      ASPLOS '18
      February 2018
      809 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/3296957

      ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
      March 2018
      827 pages
      ISBN: 9781450349116
      DOI: 10.1145/3173162
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 19 March 2018
      Published in SIGPLAN Volume 53, Issue 2

      Author Tag

      1. prefetching

      Qualifiers

      • Research-article

      Funding Sources

      • ARM Ltd.
      • EPSRC

      Bibliometrics & Citations

      Bibliometrics

      • Downloads (last 12 months): 100
      • Downloads (last 6 weeks): 10
      Reflects downloads up to 01 Nov 2024

      Citations

      Cited By

      • (2024) Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses. ACM Transactions on Architecture and Code Optimization 21:2 (1-26). DOI: 10.1145/3641853. Online publication date: 22-Jan-2024
      • (2023) A Tensor Marshaling Unit for Sparse Tensor Algebra on General-Purpose Processors. Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (1332-1346). DOI: 10.1145/3613424.3614284. Online publication date: 28-Oct-2023
      • (2023) A Survey on the Proposed Architectures for Efficient Execution of Irregular Applications Using Pipeline Parallelism. 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE) (2080-2087). DOI: 10.1109/CSCE60160.2023.00342. Online publication date: 24-Jul-2023
      • (2022) APT-GET. Proceedings of the Seventeenth European Conference on Computer Systems (747-764). DOI: 10.1145/3492321.3519583. Online publication date: 28-Mar-2022
      • (2022) Tiny but mighty. Proceedings of the 49th Annual International Symposium on Computer Architecture (817-830). DOI: 10.1145/3470496.3527400. Online publication date: 18-Jun-2022
      • (2022) Crescent. Proceedings of the 49th Annual International Symposium on Computer Architecture (962-977). DOI: 10.1145/3470496.3527395. Online publication date: 18-Jun-2022
      • (2021) Automatic Sublining for Efficient Sparse Memory Accesses. ACM Transactions on Architecture and Code Optimization 18:3 (1-23). DOI: 10.1145/3452141. Online publication date: 10-May-2021
      • (2021) Performance Evaluation of Intel Optane Memory for Managed Workloads. ACM Transactions on Architecture and Code Optimization 18:3 (1-26). DOI: 10.1145/3451342. Online publication date: 22-Apr-2021
      • (2021) GraphPEG. ACM Transactions on Architecture and Code Optimization 18:3 (1-24). DOI: 10.1145/3450440. Online publication date: 10-May-2021
      • (2021) Spot the Difference. ACM Transactions on Applied Perception 18:2 (1-15). DOI: 10.1145/3449064. Online publication date: 11-May-2021
