research-article

Open access

Optimizing Indirect Memory References with milk

Authors:

Vladimir Kiriansky,

Saman AmarasingheAuthors Info & Claims

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

Pages 299 - 312

https://doi.org/10.1145/2967938.2967948

Published: 11 September 2016 Publication History

Abstract

Modern applications such as graph and data analytics, when operating on real world data, have working sets much larger than cache capacity and are bottlenecked by DRAM. To make matters worse, DRAM bandwidth is increasing much slower than per CPU core count, while DRAM latency has been virtually stagnant. Parallel applications that are bound by memory bandwidth fail to scale, while applications bound by memory latency draw a small fraction of much-needed bandwidth. While expert programmers may be able to tune important applications by hand through heroic effort, traditional compiler cache optimizations have not been sufficiently aggressive to overcome the growing DRAM gap.

In this paper, we introduce milk - a C/C++ language extension that allows programmers to annotate memory-bound loops concisely. Using optimized intermediate data structures, random indirect memory references are transformed into batches of efficient sequential DRAM accesses. A simple semantic model enhances programmer productivity for efficient parallelization with OpenMP.

We evaluate the MILK compiler on parallel implementations of traditional graph applications, demonstrating performance gains of up to 3x.

References

[1]

Intel 64 and IA-32 Architectures Optimization Reference Manual. http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html.

[2]

Intel Core i7--4790K Processor (8M Cache, up to 4.40 GHz). http://ark.intel.com/products/80807/Intel-Core-i7-4790K-Processor-8M-Cache-up-to-4_40-GHz.

[3]

OpenMP Application Program Interface 4.0. http://openmp.org/wp/openmp-specifications/, 2013.

[4]

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Available at tensorflow.org.

[5]

M.-C. Albutiu, A. Kemper, and T. Neumann. Massively parallel sort-merge joins in main memory multi-core database systems. Proc. VLDB Endow., 5(10):1064--1075, June 2012.

Digital Library

[6]

L. Backstrom, P. Boldi, M. Rosa, J. Ugander, and S. Vigna. Four degrees of separation. In Proceedings of the 4th Annual ACM Web Science Conference, WebSci '12, pages 33--42, New York, NY, USA, 2012. ACM.

Digital Library

[7]

D. A. Bader, G. Cong, and J. Feo. On the architectural requirements for efficient execution of graph algorithms. In Parallel Processing, 2005. ICPP 2005. International Conference on, pages 547--556, June 2005.

Digital Library

[8]

C. Balkesen, J. Teubner, G. Alonso, and M. T. Özsu. Main-memory hash joins on modern processor architectures. IEEE Trans. Knowl. Data Eng., 27(7):1754--1766, 2015.

Digital Library

[9]

S. Beamer, K. Asanović, and D. Patterson. Direction-optimizing breadth-first search. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, pages 12:1--12:10, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press.

Digital Library

[10]

S. Beamer, K. Asanović, and D. A. Patterson. The GAP benchmark suite. CoRR, abs/1508.03619, 2015.

[11]

S. Beamer, K. Asanović, and D. A. Patterson. Locality exists in graph processing: Workload characterization on an Ivy Bridge server. In 2015 IEEE International Symposium on Workload Characterization, IISWC 2015, Atlanta, GA, USA, October 4-6, 2015, pages 56--65, 2015.

Digital Library

[12]

G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and J. Shun. Internally deterministic parallel algorithms can be fast. In Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 2012, New Orleans, LA, USA, February 25--29, 2012, pages 181--192, 2012.

Digital Library

[13]

D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, April 22-24, 2004, pages 442--446, 2004.

[14]

A. Ching, S. Edunov, M. Kabiljo, D. Logothetis, and S. Muthukrishnan. One trillion edges: Graph processing at Facebook-scale. Proceedings of the VLDB Endowment, 8(12):1804--1815, 2015.

Digital Library

[15]

T. Coffman, S. Greenblatt, and S. Marcus. Graph-based technologies for intelligence analysis. Commun. ACM, 47(3):45--47, Mar. 2004.

Digital Library

[16]

J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the 6th Conference on Symposium on Opearting Systems Design & Implementation - Volume 6, OSDI'04, pages 10--10, Berkeley, CA, USA, 2004. USENIX Association.

Digital Library

[17]

C. Ding and K. Kennedy. Improving cache performance in dynamic applications through data and computation reorganization at run time. In ACM SIGPLAN Notices, volume 34, pages 229--241. ACM, 1999.

Digital Library

[18]

W. Ding, M. Kandemir, D. Guttman, A. Jog, C. R. Das, and P. Yedlapalli. Trading cache hit rate for memory performance. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 357--368, New York, NY, USA, 2014. ACM.

Digital Library

[19]

M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI '98, pages 212--223, New York, NY, USA, 1998. ACM.

Digital Library

[20]

R. Garg and Y. Sabharwal. Optimizing the HPCC RandomAccess benchmark on Blue Gene/L supercomputer. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '06/Performance '06, pages 369--370, New York, NY, USA, 2006. ACM.

Digital Library

[21]

Graph500. Graph 500 benchmark. http://www.graph500.org/specifications.

[22]

H. Han and C.-W. Tseng. Improving compiler and run-time support for irregular reductions using local writes. In Languages and Compilers for Parallel Computing, pages 181--196. Springer, 1998.

Digital Library

[23]

H. Han and C.-W. Tseng. Exploiting locality for irregular scientific codes. IEEE Transactions on Parallel and Distributed Systems, 17(7):606--618, July 2006.

Digital Library

[24]

JEDEC. DDR3 SDRAM Standard. http://www.jedec.org/standards-documents/docs/jesd-79--3d.

[25]

W. Jung, J. Park, and J. Lee. Versatile and scalable parallel histogram construction. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, PACT '14, pages 127--138, New York, NY, USA, 2014. ACM.

Digital Library

[26]

C. Kim, T. Kaldewey, V. W. Lee, E. Sedlar, A. D. Nguyen, N. Satish, J. Chhugani, A. Di Blas, and P. Dubey. Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs. Proc. VLDB Endow., 2(2):1378--1389, Aug. 2009.

Digital Library

[27]

C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis and transformation. CGO '04, San Jose, CA, USA, Mar 2004.

Digital Library

[28]

F. Liljeros, C. R. Edling, L. A. N. Amaral, H. E. Stanley, and Y. Åberg. The web of human sexual contacts. Nature, 411(6840):907--908, 2001.

[29]

Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. Proc. VLDB Endow., 5(8):716--727, Apr. 2012.

Digital Library

[30]

S. Manegold, P. Boncz, and M. Kersten. Optimizing main-memory join on modern hardware. IEEE Trans. on Knowl. and Data Eng., 14(4):709--730, July 2002.

Digital Library

[31]

U. Meyer and P. Sanders. Δ-stepping: A parallelizable shortest path algorithm. J. Algorithms, 49(1):114--152, Oct. 2003.

Digital Library

[32]

N. Mitchell, L. Carter, and J. Ferrante. Localizing non-affine array references. In Parallel Architectures and Compilation Techniques, 1999. Proceedings. 1999 International Conference on, pages 192--202. IEEE, 1999.

Digital Library

[33]

MLC. Intel Memory Latency Checker v3.0. https://software.intel.com/en-us/articles/intelr-memory-latency-checker.

[34]

J. Nelson, B. Holt, B. Myers, P. Briggs, L. Ceze, S. Kahan, and M. Oskin. Latency-tolerant software distributed shared memory. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC '15, pages 291--305, Berkeley, CA, USA, 2015. USENIX Association.

Digital Library

[35]

M. E. J. Newman. Detecting community structure in networks. The European Physical Journal B - Condensed Matter and Complex Systems, 38(2):321--330, 2004.

[36]

D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP '13, pages 456--471, New York, NY, USA, 2013. ACM.

Digital Library

[37]

L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab, November 1999.

[38]

I. Papazian, S. Kottapalli, J. Baxter, J. Chamberlain, G. Vedaraman, and B. Morris. Ivy Bridge server: A converged design. Micro, IEEE, 35(2):16--25, Mar 2015.

[39]

K. Pingali, D. Nguyen, M. Kulkarni, M. Burtscher, M. A. Hassaan, R. Kaleem, T.-H. Lee, A. Lenharth, R. Manevich, M. Méndez-Lojo, D. Prountzos, and X. Sui. The tao of parallelism in algorithms. In Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI '11, pages 12--25, New York, NY, USA, 2011. ACM.

Digital Library

[40]

J. Saltz, K. Crowley, R. Michandaney, and H. Berryman. Run-time scheduling and execution of loops on message passing machines. Journal of Parallel and Distributed Computing, 8(4):303--312, 1990.

Digital Library

[41]

Samsung. DDR3L SDRAM 240pin Registered DIMM M393B2G70BH0 datasheet:. http://www.samsung.com/global/business/semiconductor/file/product/ds_ddr3_4gb_b-die_based_1_35v_rdimm_rev16-5.pdf, 2012.

[42]

Samsung. DDR4 SDRAM 288pin Registered DIMM M393A2G40DB1 datasheet. http://www.samsung.com/semiconductor/global/file/product/DS_8GB_DDR4_4Gb_D_die_RegisteredDIMM_Rev15.pdf, 2015.

[43]

N. Satish, N. Sundaram, M. M. A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sengupta, Z. Yin, and P. Dubey. Navigating the maze of graph analytics frameworks using massive graph datasets. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 979--990, New York, NY, USA, 2014. ACM.

Digital Library

[44]

H. Schweizer, M. Besta, and T. Hoefler. Evaluating the cost of atomic operations on modern architectures. San Francisco, CA, 2015. Parallel Architectures and Compilation Techniques (PACT'15).

Digital Library

[45]

A. Shatdal, C. Kant, and J. F. Naughton. Cache conscious algorithms for relational query processing. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, pages 510--521, San Francisco, CA, USA, 1994. Morgan Kaufmann Publishers Inc.

Digital Library

[46]

Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. Journal of Algorithms, 3(1), 1982.

[47]

J. Shun and G. E. Blelloch. Ligra: A lightweight graph processing framework for shared memory. In Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '13, pages 135--146, New York, NY, USA, 2013. ACM.

Digital Library

[48]

J. Su and K. Yelick. Automatic support for irregular computations in a high-level language. In Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers - Volume 01, IPDPS '05, Washington, DC, USA, 2005. IEEE Computer Society.

Digital Library

[49]

J. Teubner, G. Alonso, C. Balkesen, and M. T. Ozsu. Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware. In Proceedings of the 2013 IEEE International Conference on Data Engineering, ICDE '13, pages 362--373, Washington, DC, USA, 2013. IEEE Computer Society.

Digital Library

[50]

L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, Aug. 1990.

Digital Library

[51]

J. Wassenberg and P. Sanders. Engineering a multi-core radix sort. In Proceedings of the 17th International Conference on Parallel Processing - Volume Part II, Euro-Par'11, pages 160--169, Berlin, Heidelberg, 2011. Springer-Verlag.

Digital Library

Cited By

Isaac–Chassande VEvans ADurand YRousseau F(2024)Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A SurveyACM Transactions on Architecture and Code Optimization10.1145/364054221:2(1-26)Online publication date: 17-Jan-2024
https://dl.acm.org/doi/10.1145/3640542
Schwedock BBeckmann N(2024)Leviathan: A Unified System for General-Purpose Near-Data Computing2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00095(1278-1294)Online publication date: 2-Nov-2024
https://doi.org/10.1109/MICRO61859.2024.00095
Dalmia PShashi Kumar RSinclair M(2024)CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00058(700-717)Online publication date: 2-Nov-2024
https://doi.org/10.1109/MICRO61859.2024.00058
Show More Cited By

Index Terms

Optimizing Indirect Memory References with milk

Recommendations

Optimizing irregular shared-memory applications for distributed-memory systems
PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming

In prior work, we have proposed techniques to extend the ease of shared-memory parallel programming to distributed-memory platforms by automatic translation of OpenMP programs to MPI. In the case of irregular applications, the performance of this ...
Characterizing Memory Write References for Efficient Management of Hybrid PCM and DRAM Memory
MASCOTS '11: Proceedings of the 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems

In order to reduce the energy dissipation in main memory of computer systems, phase change memory (PCM) has emerged as one of the most promising technologies to incorporate into the memory hierarchy. However, PCM has two critical weaknesses to ...
Large-Scale BSP Graph Processing in Distributed Non-Volatile Memory
GRADES'15: Proceedings of the GRADES'15

Processing large graphs is becoming increasingly important for many domains. Large-scale graph processing requires a large-scale cluster system, which is very expensive. Thus, for high-performance large-scale graph processing in small clusters, we have ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

PACT '16: Proceedings of the 2016 International Conference on Parallel Architectures and Compilation

September 2016

474 pages

ISBN:9781450341219

DOI:10.1145/2967938

General Chairs:
Ayal Zaks
Intel, Israel
,
Bilha Mendelson
Optitura, Israel
,
Program Chairs:
Lawrence Rauchwerger
Texas A&M University, USA
,
Wen-mei W. Hwu
University of Illinois at Urbana-Champaign, USA

Copyright © 2016 Owner/Author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

IFIP WG 10.3: IFIP WG 10.3
IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 11 September 2016

Check for updates

Author Tags

Qualifiers

Research-article

Funding Sources

Conference

PACT '16

Sponsor:

IFIP WG 10.3
IEEE TCCA
SIGARCH
IEEE CS TCPP

PACT '16: International Conference on Parallel Architectures and Compilation

September 11 - 15, 2016

Haifa, Israel

Acceptance Rates

PACT '16 Paper Acceptance Rate 31 of 119 submissions, 26%;

Overall Acceptance Rate 121 of 471 submissions, 26%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

31
Total Citations
View Citations
4,393
Total Downloads

Downloads (Last 12 months)141
Downloads (Last 6 weeks)16

Reflects downloads up to 14 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Isaac–Chassande VEvans ADurand YRousseau F(2024)Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A SurveyACM Transactions on Architecture and Code Optimization10.1145/364054221:2(1-26)Online publication date: 17-Jan-2024
https://dl.acm.org/doi/10.1145/3640542
Schwedock BBeckmann N(2024)Leviathan: A Unified System for General-Purpose Near-Data Computing2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00095(1278-1294)Online publication date: 2-Nov-2024
https://doi.org/10.1109/MICRO61859.2024.00095
Dalmia PShashi Kumar RSinclair M(2024)CPElide: Efficient Multi-Chiplet GPU Implicit Synchronization2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO)10.1109/MICRO61859.2024.00058(700-717)Online publication date: 2-Nov-2024
https://doi.org/10.1109/MICRO61859.2024.00058
McMichen TGreiner NZhong PSossai FPatel ACampanoni SGrosser TDubach CSteuwer MXue JOttoni GQuintão Pereira F(2024)Representing Data Collections in an SSA FormProceedings of the 2024 IEEE/ACM International Symposium on Code Generation and Optimization10.1109/CGO57630.2024.10444817(308-321)Online publication date: 2-Mar-2024
https://dl.acm.org/doi/10.1109/CGO57630.2024.10444817
Yesil SHeidarshenas AMorrison ATorrellas JDehnavi MKulkarni MKrishnamoorthy S(2023)WISEProceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming10.1145/3572848.3577506(329-341)Online publication date: 25-Feb-2023
https://dl.acm.org/doi/10.1145/3572848.3577506
Xie YQiu JDhulipala LYu WTang JPeng RWang C(2023)Towards Lightweight and Automated Representation Learning System for NetworksIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2023.324316935:9(9613-9627)Online publication date: 1-Sep-2023
https://doi.org/10.1109/TKDE.2023.3243169
Balaji VCrago NJaleel AKeckler S(2023)Community-based Matrix Reordering for Sparse Linear Algebra Optimization2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)10.1109/ISPASS57527.2023.00029(214-223)Online publication date: Apr-2023
https://doi.org/10.1109/ISPASS57527.2023.00029
Schwedock BYoovidhya PSeibert JBeckmann NSalapura VZahran MChong FTang L(2022)täkōProceedings of the 49th Annual International Symposium on Computer Architecture10.1145/3470496.3527379(42-58)Online publication date: 18-Jun-2022
https://dl.acm.org/doi/10.1145/3470496.3527379
Balaji VLucia B(2022)Improving Locality of Irregular Updates with Hardware Assisted Propagation Blocking2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA)10.1109/HPCA53966.2022.00047(543-557)Online publication date: Apr-2022
https://doi.org/10.1109/HPCA53966.2022.00047
Suzuki THiwada KKajihara HSano SNomura SShiozawa T(2021)Approaching DRAM performance by using microsecond-latency flash memory for small-sized random read accessesProceedings of the VLDB Endowment10.14778/3457390.345739714:8(1311-1324)Online publication date: 1-Apr-2021
https://dl.acm.org/doi/10.14778/3457390.3457397
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents