DOI: 10.1145/3575693.3575747
Research article · Open access

FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks

Published: 30 January 2023

Abstract

Attention mechanisms, primarily designed to capture pairwise correlations between words, have become the backbone of machine learning, expanding beyond natural language processing into other domains. This growth in adoption comes at the cost of prohibitively large memory requirements and computational complexity, especially as the number of input elements grows. The limitation stems from inherently limited data reuse opportunities and quadratic growth in memory footprint, leading to severe memory-boundedness and poor scalability in the number of input elements. This work addresses these challenges by devising a tailored dataflow optimization, called FLAT, for attention mechanisms without altering their functionality. FLAT processes costly attention operations through a unique fusion mechanism, transforming the quadratic growth of the memory footprint into a merely linear one. To realize the full potential of this bespoke mechanism, we propose a tiling approach that enhances data reuse across attention operations. Our method both mitigates the off-chip bandwidth bottleneck and reduces the on-chip memory requirement. FLAT delivers 1.94x (1.76x) speedup and 49% (42%) energy savings over state-of-the-art Edge (Cloud) accelerators with no customized dataflow optimization. When on-chip resources are scarce (20 KB-200 KB), FLAT yields, on average, a 1.5x end-to-end latency reduction across a diverse range of conventional attention-based models with input sequence lengths ranging from 512 to 64K tokens. Our evaluations demonstrate that state-of-the-art DNN dataflows applied to attention operations reach their efficiency limit for inputs beyond 512 elements. In contrast, FLAT unblocks transformer models for inputs of up to 64K elements.
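
To make the fusion idea concrete, the following is a minimal NumPy sketch of row-blocked attention, assuming a simple row-granularity fusion: each block of query rows computes its slice of the logit matrix, applies softmax, and multiplies by V before the next block is processed, so the full N x N logit matrix is never materialized. The function name fused_rowwise_attention, the row_block parameter, and the use of NumPy are illustrative choices rather than the paper's implementation; FLAT's actual dataflow additionally tiles along the key/value dimension and targets accelerator on-chip buffers.

import numpy as np

def fused_rowwise_attention(Q, K, V, row_block=128):
    """Attention computed one block of query rows at a time, so only a
    (row_block, N) slice of the logit matrix exists at any point instead
    of the full (N, N) matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty_like(Q)
    for start in range(0, n, row_block):
        q = Q[start:start + row_block]                # (b, d) block of queries
        logits = (q @ K.T) * scale                    # (b, N) slice of Q @ K^T
        logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        out[start:start + row_block] = probs @ V      # fuse softmax(L) @ V per block
    return out

# Sanity check against the unfused computation that materializes the full logit matrix.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
L = (Q @ K.T) / np.sqrt(64)
P = np.exp(L - L.max(axis=1, keepdims=True))
reference = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(fused_rowwise_attention(Q, K, V), reference, atol=1e-6)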



    Published In

    ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
    January 2023
    947 pages
    ISBN: 9781450399166
    DOI: 10.1145/3575693
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 January 2023


    Author Tags

    1. Attention
    2. DNN Accelerators
    3. Dataflow
    4. Transformer

    Qualifiers

    • Research-article

    Conference

    ASPLOS '23

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%

