Bandwidth-Aware Loop Tiling for DMA-Supported Scratchpad Memory

Authors:

Xiaobing Feng

PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

Pages 97 - 109

https://doi.org/10.1145/3410463.3414637

Published: 30 September 2020

Abstract

Scratchpad memory (SPM) is widely used in emerging domain-specific architectures and accelerators to improve energy efficiency and time predictability. Typically, SPM-based architectures use DMA to fetch data from off-chip memory and global load instructions to load fine-grained data directly into registers. For such architectures, neither capacity-only nor bandwidth-only loop tiling uses both the bandwidth and the SPM efficiently. This paper introduces a bandwidth-aware loop tiling approach that enables a tradeoff between SPM space utilization and bandwidth utilization by leveraging a runtime tiling framework and a cross-host-kernel interprocedural analysis (IPA). Experimental results show that our approach achieves performance improvements of up to 4x, with a geometric mean of 26%.
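As background for the abstract, the following is a minimal sketch of generic loop tiling through a software-managed buffer. It is not the paper's algorithm: the function `tiled_scale`, its parameters, and the use of list slicing as a stand-in for DMA transfers are all assumptions made for illustration. It only shows why tile size matters: larger tiles occupy more of the scratchpad but need fewer transfers.

```python
def tiled_scale(src, factor, tile_elems, spm_capacity_elems):
    """Scale every element of `src`, staging one tile at a time through a
    small buffer that stands in for the scratchpad memory (SPM).

    All names here are hypothetical; slicing models a DMA transfer.
    Returns the result list and the number of modelled transfers.
    """
    if not 0 < tile_elems <= spm_capacity_elems:
        raise ValueError("tile must be non-empty and fit in the scratchpad")

    out = [0] * len(src)
    transfers = 0
    for start in range(0, len(src), tile_elems):
        end = min(start + tile_elems, len(src))
        spm = src[start:end]              # stand-in for a DMA "get" of one tile
        transfers += 1
        spm = [x * factor for x in spm]   # compute on scratchpad-resident data
        out[start:end] = spm              # stand-in for a DMA "put" of the result
        transfers += 1
    return out, transfers


if __name__ == "__main__":
    data = list(range(10))
    for tile in (2, 4, 8):
        _, n = tiled_scale(data, 2, tile, spm_capacity_elems=8)
        print(f"tile={tile} elements -> {n} modelled transfers")
```

In this toy model a capacity-only policy would always pick the largest tile that fits, whereas a real system also has to account for how transfer size affects achieved bandwidth, which is the tradeoff the abstract refers to.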

Index Terms

Bandwidth-Aware Loop Tiling for DMA-Supported Scratchpad Memory
1. Software and its engineering
  1. Software notations and tools
    1. Compilers
      1. Source code generation


Published In

PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

September 2020

505 pages

ISBN:9781450380751

DOI:10.1145/3410463

General Chair: Vivek Sarkar (Georgia Institute of Technology)

Program Chair: Hyesoon Kim (Georgia Institute of Technology)

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

the National Natural Science Foundation of China
the National Key Research and Development Program of China
CCF-Tencent Open Research Fund
Australian Research Council grant

Conference

PACT '20

Sponsor: SIGARCH

PACT '20: International Conference on Parallel Architectures and Compilation Techniques

October 3 - 7, 2020

GA, Virtual Event, USA

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%

Bibliometrics

Total Citations: 7
Total Downloads: 382
Downloads (Last 12 months): 72
Downloads (Last 6 weeks): 6

Reflects downloads up to 22 Dec 2024

Citations

Cited By

Shang H, Liu Y, Wu Z, Chen Z, Liu J, Shao M, Li Y, Kan B, Cui H, Feng X, Zhang Y, Truhlar D, An H, He X, Yang J. (2024). Pushing the Limit of Quantum Mechanical Simulation to the Raman Spectra of a Biological System with 100 Million Atoms. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-12. DOI: 10.1109/SC41406.2024.00011. Online publication date: 17-Nov-2024
https://dl.acm.org/doi/10.1109/SC41406.2024.00011
Du Y, Sha E, Song Y, Guo Y, Xu L, Zhuge Q. (2024). MuDP: multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency. Frontiers of Computer Science 19:5. DOI: 10.1007/s11704-023-3566-y. Online publication date: 22-Nov-2024
https://doi.org/10.1007/s11704-023-3566-y
Singer A, Wang K, Egger B, Lee D. (2023). Tiling for DMA-Based Hardware Accelerators (WIP). Proceedings of the 24th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, 138-142. DOI: 10.1145/3589610.3596283. Online publication date: 13-Jun-2023
https://dl.acm.org/doi/10.1145/3589610.3596283
Wu Z, Wu Y, Liu Y, Shang H, Gao Y, Zhang Z, Zhang Y, Long Y, Feng X, Cui H, Mohror K, Arnold D, Badia R. (2023). Portable and Scalable All-Electron Quantum Perturbation Simulations on Exascale Supercomputers. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. DOI: 10.1145/3581784.3607085. Online publication date: 12-Nov-2023
https://dl.acm.org/doi/10.1145/3581784.3607085
Wu M, Wu Y, Shang H, Liu Y, Cui H, Li F, Duan X, Zhang Y, Feng X. (2022). Scaling Poisson Solvers on Many Cores via MMEwald. IEEE Transactions on Parallel and Distributed Systems 33:8, 1888-1901. DOI: 10.1109/TPDS.2021.3127138. Online publication date: 1-Aug-2022
https://doi.org/10.1109/TPDS.2021.3127138
Shang H, Li F, Zhang Y, Liu Y, Zhang L, Wu M, Wu Y, Wei D, Cui H, Liu X, Wang F, Ye Y, Gao Y, Ni S, Chen X, Chen D, de Supinski B, Hall M, Gamblin T. (2021). Accelerating all-electron ab initio simulation of raman spectra for biological systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1145/3458817.3476160. Online publication date: 14-Nov-2021
https://dl.acm.org/doi/10.1145/3458817.3476160
Abdelaal K, Kong M, Zhou H, Moreira J, Mueller F, Etsion Y. (2021). Tile size selection of affine programs for GPGPUs using polyhedral cross-compilation. Proceedings of the 35th ACM International Conference on Supercomputing, 13-26. DOI: 10.1145/3447818.3460369. Online publication date: 3-Jun-2021
https://dl.acm.org/doi/10.1145/3447818.3460369
