DOI: 10.1145/3575693.3575747
Research article · Open access

FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks

Published: 30 January 2023

Abstract

Attention mechanisms, primarily designed to capture pairwise correlations between words, have become the backbone of machine learning, expanding beyond natural language processing into other domains. This growth in adoption comes at the cost of prohibitively large memory requirements and computational complexity, especially as the number of input elements grows. The limitation stems from inherently limited data reuse opportunities and quadratic growth in memory footprint, leading to severe memory-boundedness and poor scalability in the number of input elements. This work addresses these challenges by devising a tailored dataflow optimization, called FLAT, for attention mechanisms without altering their functionality. FLAT processes costly attention operations through a unique fusion mechanism, transforming the quadratic growth of the memory footprint into a merely linear one. To realize the full potential of this bespoke mechanism, we propose a tiling approach that enhances data reuse across attention operations. Our method both mitigates the off-chip bandwidth bottleneck and reduces the on-chip memory requirement. FLAT delivers 1.94x (1.76x) speedup and 49% (42%) energy savings over state-of-the-art Edge (Cloud) accelerators with no customized dataflow optimization. When on-chip resources are scarce (20 KB-200 KB), FLAT yields, on average, a 1.5x end-to-end latency reduction across a diverse range of conventional attention-based models with input sequence lengths ranging from 512 to 64K tokens. Our evaluations demonstrate that state-of-the-art DNN dataflows applied to attention operations reach their efficiency limit for inputs beyond 512 elements. In contrast, FLAT unblocks transformer models for inputs of up to 64K elements.
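
To make the fusion idea concrete, the following is a minimal NumPy sketch of row-blocked attention, assuming a simple row-granularity fusion: each block of query rows computes its slice of the logit matrix, applies softmax, and multiplies by V before the next block is processed, so the full N x N logit matrix is never materialized. The function name fused_rowwise_attention, the row_block parameter, and the use of NumPy are illustrative choices rather than the paper's implementation; FLAT's actual dataflow additionally tiles along the key/value dimension and targets accelerator on-chip buffers.

import numpy as np

def fused_rowwise_attention(Q, K, V, row_block=128):
    """Attention computed one block of query rows at a time, so only a
    (row_block, N) slice of the logit matrix exists at any point instead
    of the full (N, N) matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty_like(Q)
    for start in range(0, n, row_block):
        q = Q[start:start + row_block]                # (b, d) block of queries
        logits = (q @ K.T) * scale                    # (b, N) slice of Q @ K^T
        logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        out[start:start + row_block] = probs @ V      # fuse softmax(L) @ V per block
    return out

# Sanity check against the unfused computation that materializes the full logit matrix.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((1024, 64)) for _ in range(3))
L = (Q @ K.T) / np.sqrt(64)
P = np.exp(L - L.max(axis=1, keepdims=True))
reference = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(fused_rowwise_attention(Q, K, V), reference, atol=1e-6)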



    Published In

    ASPLOS 2023: Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2
    January 2023
    947 pages
    ISBN: 9781450399166
    DOI: 10.1145/3575693
    This work is licensed under a Creative Commons Attribution 4.0 International License.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 January 2023


    Author Tags

    1. Attention
    2. DNN Accelerators
    3. Dataflow
    4. Transformer

    Qualifiers

    • Research-article

    Conference

    ASPLOS '23

    Acceptance Rates

    Overall Acceptance Rate 535 of 2,713 submissions, 20%

