DOI: 10.1145/3526241.3530314

MI2D: Accelerating Matrix Inversion with 2-Dimensional Tile Manipulations

Published: 06 June 2022

Abstract

Matrix inversion is critical in mathematics and scientific applications. Large-scale dense matrix inversion is especially challenging for modern computers because of the heavy dependencies among matrix elements and the poor temporal data locality of the computation. In this paper, we propose a novel accelerator termed MI2D, which converts matrix inversion into regular matrix multiplications using 2-dimensional cross-tile operations and novel algorithms for efficient data reuse and computation. Our evaluations show that MI2D can be easily integrated with the existing matrix engines of modern high-end CPUs and NPUs, and effectively accelerates matrix inversion, achieving a 2.7× speedup over an Intel Skylake CPU and a 24× speedup over an NVIDIA RTX 2080 Ti GPU.
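
The MI2D tile algorithms themselves are not reproduced on this page. As a rough illustration of the idea the abstract states, namely that a large inversion can be re-expressed as regular matrix multiplications over tiles, the sketch below uses the classic Schur-complement block (tile) inversion in Python/NumPy. The function name blockwise_inverse, the tile size, and the recursive structure are illustrative assumptions and are not the authors' MI2D method.

```python
# A minimal, hypothetical sketch (not the MI2D algorithm): Schur-complement
# block inversion turns one large inversion into smaller tile inversions plus
# ordinary matrix multiplications, the workload a tile-based matrix engine
# handles well.
import numpy as np

def blockwise_inverse(A: np.ndarray, tile: int) -> np.ndarray:
    """Invert A by recursively splitting it into 2x2 blocks of roughly `tile` size."""
    n = A.shape[0]
    if n <= tile:
        return np.linalg.inv(A)          # base case: invert a small tile directly
    k = n // 2
    A11, A12 = A[:k, :k], A[:k, k:]
    A21, A22 = A[k:, :k], A[k:, k:]
    A11_inv = blockwise_inverse(A11, tile)
    S = A22 - A21 @ A11_inv @ A12        # Schur complement of A11
    S_inv = blockwise_inverse(S, tile)
    T = A11_inv @ A12                    # intermediate tile products, reused below
    U = A21 @ A11_inv
    out = np.empty_like(A)
    out[:k, :k] = A11_inv + T @ S_inv @ U
    out[:k, k:] = -T @ S_inv
    out[k:, :k] = -S_inv @ U
    out[k:, k:] = S_inv
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((256, 256)) + 256 * np.eye(256)  # well-conditioned test matrix
    err = np.max(np.abs(blockwise_inverse(A, tile=64) @ A - np.eye(256)))
    print(f"max |A^-1 A - I| = {err:.2e}")
```

In this formulation almost all of the work lands in tile-by-tile matrix products, which standard matrix engines execute efficiently, while the small direct inversions are confined to individual tiles.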

Supplementary Material

MP4 File (GLSVLSI22-099.mp4)
Presentation video for the paper "MI2D: Accelerating Matrix Inversion with 2-Dimensional Tile Manipulations" at GLSVLSI 2022, introducing a new architecture for fast matrix inversion. We propose a novel multi-function on-chip network for flexible tile operations, together with a novel matrix inversion method that maximizes data reuse. The architecture can be integrated into an existing standard matrix processing unit (MPU) to provide high-performance matrix inversion. The experiments show superior performance over high-end CPU and GPU processors.


Published In

GLSVLSI '22: Proceedings of the Great Lakes Symposium on VLSI 2022
June 2022
560 pages
ISBN: 9781450393225
DOI: 10.1145/3526241
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. accelerator
  2. architecture
  3. matrix inversion
  4. on-chip network

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Program of China
  • Key Research and Development Program of Shaanxi

Conference

GLSVLSI '22

Acceptance Rates

Overall Acceptance Rate 312 of 1,156 submissions, 27%

