DOI: 10.1145/3410463.3414637
research-article

Bandwidth-Aware Loop Tiling for DMA-Supported Scratchpad Memory

Published: 30 September 2020

Abstract

Scratchpad Memory (SPM) is widely used in emerging domain-specific architectures and accelerators to improve energy efficiency and time predictability. Typically, SPM-based architectures use DMA to fetch data from off-chip memory and global load instructions to load fine-grained data directly into registers. On such architectures, neither capacity-only nor bandwidth-only loop tiling uses both the bandwidth and the SPM efficiently. This paper introduces a bandwidth-aware loop tiling approach that trades off SPM space utilization against bandwidth utilization by leveraging a runtime tiling framework and a cross-host-kernel interprocedural analysis (IPA). Experimental results show that the approach achieves performance improvements of up to 4x, with a geometric mean of 26%.
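
The tension between SPM capacity and DMA bandwidth can be illustrated with a toy cost model. The sketch below is not the paper's algorithm. It assumes a row-major 2D array, a stencil-style halo of `halo` rows above and below each tile, and a DMA engine that pays a fixed startup cost per contiguous row segment. The function names, parameters, and default values (`pick_tile`, `dma_time`, `startup_s`, `peak_bytes_per_s`) are hypothetical. Under these assumptions, wider tiles amortize DMA startup cost, taller tiles reduce redundant halo traffic, and both compete for the same SPM space.

```python
from math import ceil


def dma_time(rows, cols, elem_bytes, startup_s, peak_bytes_per_s):
    """Modeled time to fetch `rows` contiguous segments of `cols` elements.

    Assumption: each row segment is one DMA request that pays a fixed
    startup latency plus its size divided by the peak bandwidth.
    """
    return rows * (startup_s + cols * elem_bytes / peak_bytes_per_s)


def pick_tile(n_rows, n_cols, spm_bytes, elem_bytes=8, halo=1,
              startup_s=1e-6, peak_bytes_per_s=32e9):
    """Search power-of-two tile shapes and return (time, rows, cols).

    Returns the shape with the lowest modeled total DMA time among those
    whose footprint (tile plus halo rows) fits in the SPM, or None if no
    candidate fits. All hardware numbers are illustrative placeholders.
    """
    best = None
    r = 1
    while r <= n_rows:
        c = 1
        while c <= n_cols:
            loaded_rows = r + 2 * halo          # tile rows plus halo rows
            footprint = loaded_rows * c * elem_bytes
            if footprint <= spm_bytes:          # capacity constraint
                tiles = ceil(n_rows / r) * ceil(n_cols / c)
                t = tiles * dma_time(loaded_rows, c, elem_bytes,
                                     startup_s, peak_bytes_per_s)
                if best is None or t < best[0]:
                    best = (t, r, c)
            c *= 2
        r *= 2
    return best


if __name__ == "__main__":
    result = pick_tile(4096, 4096, spm_bytes=64 * 1024)
    print(result)
```

Varying `startup_s` or `halo` in this model shifts the preferred shape between wide and tall tiles, which is the kind of tradeoff a capacity-only or bandwidth-only choice cannot capture.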

    Published In

    PACT '20: Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques
    September 2020
    505 pages
    ISBN:9781450380751
    DOI:10.1145/3410463
    • General Chair: Vivek Sarkar
    • Program Chair: Hyesoon Kim

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. compiler
    2. dma
    3. loop tiling
    4. scratchpad memory

    Qualifiers

    • Research-article

    Funding Sources

    • the National Natural Science Foundation of China
    • the National Key Research and Development Program of China
    • CCF-Tencent Open Research Fund
    • Australian Research Council grant

    Conference

    PACT '20

    Acceptance Rates

    Overall Acceptance Rate 121 of 471 submissions, 26%

    Article Metrics

    • Downloads (last 12 months): 72
    • Downloads (last 6 weeks): 6
    Reflects downloads up to 22 Dec 2024

    Cited By

    • (2024) Pushing the Limit of Quantum Mechanical Simulation to the Raman Spectra of a Biological System with 100 Million Atoms. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, 1-12. DOI: 10.1109/SC41406.2024.00011. Online publication date: 17-Nov-2024
    • (2024) MuDP: multi-granularity data placement for uniform loops on SPM-DRAM architectures to minimize latency. Frontiers of Computer Science 19:5. DOI: 10.1007/s11704-023-3566-y. Online publication date: 22-Nov-2024
    • (2023) Tiling for DMA-Based Hardware Accelerators (WIP). Proceedings of the 24th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems, 138-142. DOI: 10.1145/3589610.3596283. Online publication date: 13-Jun-2023
    • (2023) Portable and Scalable All-Electron Quantum Perturbation Simulations on Exascale Supercomputers. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-13. DOI: 10.1145/3581784.3607085. Online publication date: 12-Nov-2023
    • (2022) Scaling Poisson Solvers on Many Cores via MMEwald. IEEE Transactions on Parallel and Distributed Systems 33:8, 1888-1901. DOI: 10.1109/TPDS.2021.3127138. Online publication date: 1-Aug-2022
    • (2021) Accelerating all-electron ab initio simulation of raman spectra for biological systems. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 1-15. DOI: 10.1145/3458817.3476160. Online publication date: 14-Nov-2021
    • (2021) Tile size selection of affine programs for GPGPUs using polyhedral cross-compilation. Proceedings of the 35th ACM International Conference on Supercomputing, 13-26. DOI: 10.1145/3447818.3460369. Online publication date: 3-Jun-2021
