DOI: 10.1145/3373376.3378508

FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System

Published: 13 March 2020

Abstract

Tensor computation plays a paramount role in a broad range of domains, including machine learning, data analytics, and scientific computing. The wide adoption of tensor computation and its huge computational cost have created high demand for flexible, portable, and high-performance library implementations on heterogeneous hardware accelerators such as GPUs and FPGAs. However, current tensor libraries largely require programmers to design low-level implementations by hand and to optimize them from the algorithm, architecture, and compilation perspectives. Such a manual development process often takes months or even years, falling far behind the rapid evolution of application algorithms.
In this paper, we introduce FlexTensor, a schedule exploration and optimization framework for tensor computation on heterogeneous systems. FlexTensor can optimize tensor computation programs without human intervention, allowing programmers to work only with high-level programming abstractions, without considering hardware platform details. FlexTensor systematically explores optimization design spaces composed of many different schedules for different hardware. It then combines several exploration techniques, including heuristic and machine-learning methods, to find an optimized schedule configuration. Finally, based on the exploration results, customized schedules are automatically generated for each hardware target. In our experiments, we test 12 different kinds of tensor computations with hundreds of test cases in total. FlexTensor achieves an average 1.83x speedup over cuDNN on an NVIDIA V100 GPU; a 1.72x speedup over MKL-DNN on an Intel Xeon CPU for 2D convolution; a 1.5x speedup over OpenCL baselines on a Xilinx VU9P FPGA; and a 2.21x speedup over the state-of-the-art on an NVIDIA V100 GPU.
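The exploration loop the abstract describes — heuristic search over a schedule design space guided by measured performance — can be sketched in miniature. Everything below is illustrative, not FlexTensor's actual implementation: `measure_cost` is a toy stand-in for compiling a schedule and timing it on real hardware, and the two tile-size parameters are a hypothetical slice of a real design space. The paper combines such heuristics with machine-learning methods; this sketch shows only the simulated-annealing half.

```python
import math
import random

# Toy cost model standing in for on-device measurement. A real system
# would compile each schedule configuration and time it on the target
# hardware; here we simply pretend a (32, 16) tiling is optimal.
def measure_cost(config):
    tile_x, tile_y = config
    return abs(tile_x - 32) + abs(tile_y - 16) + 1.0

def neighbors(config, choices):
    """All configurations that differ from `config` in one tile size."""
    tile_x, tile_y = config
    return ([(c, tile_y) for c in choices if c != tile_x]
            + [(tile_x, c) for c in choices if c != tile_y])

def explore(choices, steps=200, seed=0):
    """Simulated-annealing walk over the tile-size design space."""
    rng = random.Random(seed)
    current = (rng.choice(choices), rng.choice(choices))
    current_cost = measure_cost(current)
    best, best_cost = current, current_cost
    temperature = 10.0
    for _ in range(steps):
        candidate = rng.choice(neighbors(current, choices))
        cost = measure_cost(candidate)
        # Always accept improvements; occasionally accept regressions
        # (with probability shrinking as the temperature cools) so the
        # search can escape local minima.
        if cost < current_cost or rng.random() < math.exp((current_cost - cost) / temperature):
            current, current_cost = candidate, cost
            if cost < best_cost:
                best, best_cost = candidate, cost
        temperature *= 0.98
    return best, best_cost

if __name__ == "__main__":
    config, cost = explore([1, 2, 4, 8, 16, 32, 64])
    print("best schedule:", config, "cost:", cost)
```

The same skeleton generalizes: the design space grows with unroll factors, thread bindings, and memory-scope choices per hardware target, and the cost model is replaced by real measurements or a learned predictor.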



        Published In

        ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems
        March 2020
        1412 pages
        ISBN:9781450371025
        DOI:10.1145/3373376
        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. code generation
        2. compiler optimization
        3. heterogeneous systems
        4. machine learning
