Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                

SFC: Achieve Accurate Fast Convolution under Low-precision Arithmetic

Liulu He    Yufei Zhao    Rui Gao    Yuan Du    Li Du
Abstract

Fast convolution algorithms, including Winograd and FFT, can efficiently accelerate convolution operations in deep models. However, these algorithms depend on high-precision arithmetic to maintain inference accuracy, which conflicts with the model quantization. To resolve this conflict and further improve the efficiency of quantized convolution, we proposes SFC, a new algebra transform for fast convolution by extending the Discrete Fourier Transform (DFT) with symbolic computing, in which only additions are required to perform the transformation at specific transform points, avoiding the calculation of irrational number and reducing the requirement for precision. Additionally, we enhance convolution efficiency by introducing correction terms to convert invalid circular convolution outputs of the Fourier method into effective ones. The numerical error analysis is presented for the first time in this type of work and proves that our algorithms can provide a 3.68× multiplication reduction for 3×3 convolution, while the Winograd algorithm only achieves a 2.25× reduction with similarly low numerical errors. Experiments carried out on benchmarks and FPGA show that our new algorithms can further improve the computation efficiency of quantized models while maintaining accuracy, surpassing both the quantization-alone method and existing works on fast convolution quantization.

Machine Learning, ICML

1 Introduction

Convolution operations are a crucial component of many deep learning models. Due to their intensive computation requirements, optimizing convolution calculations is key to improving model deployment efficiency. Fast Convolution Algorithms and Quantization are two distinct approaches to mitigate the computational burden. Fast convolution algorithms can reduce the arithmetic complexity by multiplying inputs and kernel weights in the transformation domain. A 3×3 convolution performed with Winograd F(2×2, 3×3) algorithm consumes just 12.2512.25\frac{1}{2.25}divide start_ARG 1 end_ARG start_ARG 2.25 end_ARG multiplications compared to direct computing. Whereas, model quantization reduces the cost of a single arithmetic operation and data transmission by converting high-precision floating numbers to low-precision integers. An int8 multiply-and-accumulate operation consumes only 116116\frac{1}{16}divide start_ARG 1 end_ARG start_ARG 16 end_ARG of the energy compared to a fp32 operation.

However, when attempting to combine existing fast convolution algorithms and model quantization to further improve computational efficiency, the model accuracy could be severely degraded. This is due to the significant increase in numerical error when the two methods are used together. For example, Winograd, a well-known fast convolution algorithm for small filter sizes (Lavin & Gray, 2016), has ill-conditioned transformation matrices (with high condition numbers), necessitating the use of high-precision arithmetic to avoid numerical issues (Barabasz et al., 2020). Another renowned fast algorithm for convolution is the Fast Fourier transform (FFT). While its transformation is well-conditioned, its irrational coefficients can introduce significant rounding errors, particularly under low-precision representation. Number Theoretic Transform (NTT) can achieve precise computation for integer convolution, but it involves module operations and the transformations for inputs and filters would extend the bit-width to be equivalent to that of the outputs.

At present, two research approaches have been developed to tackle the aforementioned problem. One approach involves customizing the quantization method specifically optimized for fast convolution algorithms (Li et al., 2021; Chikin & Kryzhanovskiy, 2022; Andri et al., 2022; Tianqi et al., 2023). However, this approach struggles to maintain satisfactory accuracy under int8 quantization for faster algorithms such as Winograd F(4×4, 3×3). The other approach is to explore new fast algorithms that are better fit for quantization (Liu & Mattina, 2020; Alam et al., 2022). Nevertheless, these emerging algorithms encounter challenges in achieving low arithmetic complexity. These facts show that simultaneously achieving low arithmetic complexity and low-precision quantization while maintaining model accuracy is a persistent challenge, which, to our knowledge, no existing work has overcome.

This paper aims to develop a new efficient fast convolution algorithm with high numerical accuracy, which is compatible with model quantization techniques. We note that all the complex coefficients in the Fourier Transform have constant modulus 1, so the quantization error can be reduced by bypassing the direct calculation on these coefficients. To achieve this, we introduce symbolic computing to avoid the involvement of irrational numbers, where the transformation process contains only additions. This advance adapts the Fourier transform to low-precision arithmetic. We also discovered that conventional Fourier convolution does not fully utilize its circular convolution outputs, which significantly affects its efficiency, so we add correction terms to convert these neglected outputs into valid results. Moreover, we investigate the error generation mechanism of convolution algorithms and compare the numerical errors of direct convolution, Winograd, and our algorithms. The key contributions can be summarized as follows:

  • 1.

    We formulate an efficient and quantization-friendly fast convolution algorithm extended from Fourier Convolution. We employ symbolic computing to perform Discret Fourier Transformation by just additions, and introduce correction terms to fully utilize its circular convolution outputs and further enhance its efficiency.

  • 2.

    We analyse the numerical error of direct convolution and other fast convolution algorithms. Our method can achieve 3.68× arithmetic reduction while the Winogard algorithm only achieves 2.25× at equivalent numerical accuracy. Through error analysis and observation of the energy distribution in the frequency domain, we also present a frequency-wise quantization strategy to improve model accuracy at low-bitwidth.

  • 3.

    Experiments on the ImageNet dataset validate that our method can achieve less than 0.2% accuracy degradation with int8 post-training quantization. In the same model accuracy, our method achieves up to 2.5× bit-operations reduction compared to Winograd convolution or direct convolution under quantization. FPGA implementation shows that our algorithms can significantly improve model inference throughput combined with low-precision arithmetic.

2 Related Work

Fast Convolution Algorithms. The FFT was the first utilized algorithm (Mathieu et al., 2014) to fast the training of convolutions. Subsequently, for small convolutions, the Winograd minimum filtering algorithm (Lavin & Gray, 2016) was found to outperform the Fourier-based method due to its real field arithmetic operations, whereas the Fourier method requires more inefficient arithmetic in complex field. Additionally, the NTT has also been proposed to accelerate convolutions (Hong et al., 2022). However, when combining quantization and fast convolution algorithms, the challenge of potential model accuracy degradation arises. The Winograd algorithm is susceptible to numerical instability due to the ill-conditioned Vandermonde matrix in the transformation (Vincent et al., 2017; Barabasz et al., 2020). Fourier-based methods demand a high-precision format to accurately represent irrational numbers. NTT methods can offer accurate integer computing, but involve a large number of modulo operations and high-bitwidth intermediate data representations, reducing computation efficiency.

Quantization for Fastconv. Some approaches aim to optimize the quantization method to regain model accuracy. For example, LoWino (Li et al., 2021) presents a post-training quantization (PTQ) method for Winograd, optimizing the scaling factor by minimizing the KL distance between the quantized and original vectors. Another PTQ work (Chikin & Kryzhanovskiy, 2022) introduces a balancing operation between the filter and input channels to enhance bit-width utilization and improve the quality of quantization for Winograd. Additionally, a full quantization method with optimizing the transformation matrix in Winograd has been proposed (Tianqi et al., 2023), which successfully restores model accuracy when employing the Winograd F(4×4444\tiny{\times}44 × 4, 3×3333\tiny{\times}33 × 3) algorithm with int8 quantization. Nevertheless, the methods above tend to struggle to achieve satisfactory accuracy recovery under sub-int8 quantization.

Numerical Accuracy for Fastconv. Another approaches focus on improving the intrinsic properties of the fast algorithm itself. As Winograd algorithms can be defined by root points, a bilinear approach that strikes a balance between computational complexity and numerical accuracy has been proposed (Barabasz & Gregg, 2019). Additionally, two existing works (Barabasz et al., 2020; Alam et al., 2022) aimed to discover more effective polynomial root points to improve numerical accuracy. The Winograd algorithms have also been extended to the Residue Number System (RNS) (Liu & Mattina, 2020), decomposing single high-precision intermediate multiplications into multiple low-precision arithmetics (e.g., 8-bit). However, these all come at the cost of increased computational complexity.

3 Preliminaries

Algorithms Construction. Fast convolution algorithms, including Winograd, Fourier Transform, and Number Theoretic Transform all employ a three-stage computing process: transformations of filters and inputs, element-wise multiplication, and a transformation for generating outputs. The generalized form for fast 2D convolution is as follows:

y=AT[[GfGT][BTxB]]A𝑦superscript𝐴𝑇delimited-[]direct-productdelimited-[]𝐺𝑓superscript𝐺𝑇delimited-[]superscript𝐵𝑇𝑥𝐵𝐴y=A^{T}[[GfG^{T}]\odot[B^{T}xB]]Aitalic_y = italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT [ [ italic_G italic_f italic_G start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ] ⊙ [ italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_x italic_B ] ] italic_A (1)

where direct-product\odot denotes element-wise multiplication, and B𝐵Bitalic_B, G𝐺Gitalic_G, and A𝐴Aitalic_A represent the linear transformations of the input, filter, and multiplication result.

For one specific algorithm (whether Winograd, FFT or NTT), the G𝐺Gitalic_G, B𝐵Bitalic_B and A𝐴Aitalic_A are all derive from a Vandermonde matrix V𝑉Vitalic_V, which consist of a set of root points s0..sN1s_{0}..s_{N-1}italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT . . italic_s start_POSTSUBSCRIPT italic_N - 1 end_POSTSUBSCRIPT:

V=[1s01s02s0N11s11s12s1N1........1sN11sN12sN1N1]𝑉matrix1superscriptsubscript𝑠01superscriptsubscript𝑠02superscriptsubscript𝑠0𝑁11superscriptsubscript𝑠11superscriptsubscript𝑠12superscriptsubscript𝑠1𝑁1absentabsentabsentabsent1superscriptsubscript𝑠𝑁11superscriptsubscript𝑠𝑁12superscriptsubscript𝑠𝑁1𝑁1V=\begin{bmatrix}1&s_{0}^{1}&s_{0}^{2}&...&s_{0}^{N-1}\\ 1&s_{1}^{1}&s_{1}^{2}&...&s_{1}^{N-1}\\ ..&..&..&...&..\\ 1&s_{N-1}^{1}&s_{N-1}^{2}&...&s_{N-1}^{N-1}\end{bmatrix}italic_V = [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT end_CELL start_CELL italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL … end_CELL start_CELL italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N - 1 end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT end_CELL start_CELL italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL … end_CELL start_CELL italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N - 1 end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL . . end_CELL start_CELL . . end_CELL start_CELL . . end_CELL start_CELL … end_CELL start_CELL . . end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL italic_s start_POSTSUBSCRIPT italic_N - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT end_CELL start_CELL italic_s start_POSTSUBSCRIPT italic_N - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL … end_CELL start_CELL italic_s start_POSTSUBSCRIPT italic_N - 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N - 1 end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] (2)

A N×N𝑁𝑁N\times Nitalic_N × italic_N matrix V𝑉Vitalic_V and its inverse V1superscript𝑉1V^{-1}italic_V start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT can construct a fast convolution algorithm for R×R𝑅𝑅R\times Ritalic_R × italic_R filter accommodating N×N𝑁𝑁N\times Nitalic_N × italic_N inputs with M×M𝑀𝑀M\times Mitalic_M × italic_M outputs, or M×M𝑀𝑀M\times Mitalic_M × italic_M inputs with N×N𝑁𝑁N\times Nitalic_N × italic_N outputs, where N=M+R1𝑁𝑀𝑅1N=M+R-1italic_N = italic_M + italic_R - 1.

Difference. The fundamental difference among various algorithms lies in the number field of V𝑉Vitalic_V and the chosen Snsubscript𝑆𝑛S_{n}italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT. In the Winograd algorithm (also known as Toom-Cook algorithm), the Snsubscript𝑆𝑛S_{n}italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT are N𝑁Nitalic_N interpolation points selected in the real number field \mathbb{R}blackboard_R. Similarly, all arithmetic operations are performed in the \mathbb{R}blackboard_R. As a comparison, all arithmetic in Fourier convolution is defined in the complex field \mathbb{C}blackboard_C. And the matrix V𝑉Vitalic_V is the discrete Fourier transform matrix, where Sn=ej2πnNsubscript𝑆𝑛superscript𝑒𝑗2𝜋𝑛𝑁S_{n}=e^{-j\frac{2\pi*n}{N}}italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_e start_POSTSUPERSCRIPT - italic_j divide start_ARG 2 italic_π ∗ italic_n end_ARG start_ARG italic_N end_ARG end_POSTSUPERSCRIPT. The number theoretic transformation is similar in structure to the Fourier transform, but it operates in a finite field denoted as 𝔽psubscript𝔽𝑝\mathbb{F}_{p}blackboard_F start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT.

Arithmetic Complexity Reduction. Assuming these transformations are lightweight compared to element-wise multiply and can be amortized due to channel reuse, the fast algorithms consume N2=(M+R1)2superscript𝑁2superscript𝑀𝑅12N^{2}=(M+R-1)^{2}italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = ( italic_M + italic_R - 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT multiplications to generate M2superscript𝑀2M^{2}italic_M start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT outputs, where the arithmetic complexity reduction is M2R2(M+R1)2superscript𝑀2superscript𝑅2superscript𝑀𝑅12\frac{M^{2}R^{2}}{(M+R-1)^{2}}divide start_ARG italic_M start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_R start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG ( italic_M + italic_R - 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG. However, convolution operations in CNNs are generally defined in \mathbb{R}blackboard_R, so employing fast algorithms defined in \mathbb{C}blackboard_C or 𝔽psubscript𝔽𝑝\mathbb{F}_{p}blackboard_F start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT (such as FFT and NTT) would lead to waste in the calculation. Hence, Winograd, defined in \mathbb{R}blackboard_R, is the most popular algorithm for model acceleration.

Precision Requirement. For Winograd, the extremum of a row in V𝑉Vitalic_V is [1,SnN1]1superscriptsubscript𝑆𝑛𝑁1[1,S_{n}^{N-1}][ 1 , italic_S start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_N - 1 end_POSTSUPERSCRIPT ]. So the required arithmetic precision grows exponentially with N𝑁Nitalic_N. Thus only the Winograd algorithm with small N𝑁Nitalic_N is practical. In comparison, the FFT method performs a significant numerical advantage when dealing with large filters due to its numerically stable V𝑉Vitalic_V matrix. However, performing accurate Fourier transforms necessitates high-precision arithmetic. NTT method provide a bit-correct result for integral convolution. However, when using NTT to perform a calculation with N𝑁Nitalic_N-bit inputs and 2N2𝑁2N2 italic_N-bit outputs, the transformed inputs must have a datawidth of at least 2N2𝑁2N2 italic_N-bit, which increases the datawidth of direct-product\odot.

In summary, no matter which type of algorithm is chosen, achieving both robust computation and significant arithmetic reduction under low-precision operations remains a challenge.

4 Symbolic Fourier Convolution Algorithm

It is worth noting that the Fourier transform has better numerical stability, as all its root points are distributed on a circle of radius 1 in the complex plane. When dealing with larger N𝑁Nitalic_N, it is more accurate than Winograd. However, Fourier transform has two serious drawbacks. First, its irrational coefficients are not friendly for low-precision format and would give more computation burden for transformation calculation.

In addition, the FFT is not as efficient as Winograd. There are two reasons for this. Firstly the FFT is computed using complex numbers and even after utilizing the Hamiltonian symmetry with real sequences and the fast complex multiplication, each complex multiplication still requires 1.5 real multiplications. Secondly, the direct calculation result of the FFT is a circular convolution, so padding the sequence with zeros to achieve a linear convolution is required, which would wasted computation.

We propose two key improvement strategies to address these drawbacks:

  • 1.

    We employ symbolic computation rather than numerical computation to implement the discrete Fourier transform (DFT). By selecting an appropriate number of DFT points, we can avoid or minimize the irrational values involved in computing. All complex points can be represented by first order integer coefficient polynomials under both 4 and 6 DFT points.

  • 2.

    We introduce correction terms to fully exploit the cyclic convolution output generated by the Fourier method to enhance computing utilization, and the smaller number of transformation points we chose also helps to reduce the proportion of complex arithmetic.

4.1 Symbolic Computing for DFT

Generally, the coefficients of the N-point DFT are derived from:

e2πnNj=cos(2πnN)+jsin(2πnN),n=0,1,..,N1e^{\frac{2\pi n}{N}j}=cos(\frac{2\pi n}{N})+jsin(\frac{2\pi n}{N}),n=0,1,..,N-1italic_e start_POSTSUPERSCRIPT divide start_ARG 2 italic_π italic_n end_ARG start_ARG italic_N end_ARG italic_j end_POSTSUPERSCRIPT = italic_c italic_o italic_s ( divide start_ARG 2 italic_π italic_n end_ARG start_ARG italic_N end_ARG ) + italic_j italic_s italic_i italic_n ( divide start_ARG 2 italic_π italic_n end_ARG start_ARG italic_N end_ARG ) , italic_n = 0 , 1 , . . , italic_N - 1

when nN{0,14,12,34}𝑛𝑁0141234\frac{n}{N}\notin\{0,\frac{1}{4},\frac{1}{2},\frac{3}{4}\}divide start_ARG italic_n end_ARG start_ARG italic_N end_ARG ∉ { 0 , divide start_ARG 1 end_ARG start_ARG 4 end_ARG , divide start_ARG 1 end_ARG start_ARG 2 end_ARG , divide start_ARG 3 end_ARG start_ARG 4 end_ARG }, irrational values will be introduced. To eliminate the rounding errors that arise from these irrational values, we employ symbolic computation rather than numerical calculation. This approach represents irrational values in polynomials with integer coefficients.

To illustrate, we consider the 3-point DFT. For a real input sequence x=(x0,x1,x2)T𝑥superscriptsubscript𝑥0subscript𝑥1subscript𝑥2𝑇x=(x_{0},x_{1},x_{2})^{T}italic_x = ( italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT, its transform processing can be represented as:

[X0X1X2]=[1111ss21s2s][x0x1x2],s=e2π3jformulae-sequencematrixsubscript𝑋0subscript𝑋1subscript𝑋2matrix1111𝑠superscript𝑠21superscript𝑠2𝑠matrixsubscript𝑥0subscript𝑥1subscript𝑥2𝑠superscript𝑒2𝜋3𝑗\begin{bmatrix}X_{0}\\ X_{1}\\ X_{2}\\ \end{bmatrix}=\begin{bmatrix}1&1&1\\ 1&s&s^{2}\\ 1&s^{2}&s\\ \end{bmatrix}\begin{bmatrix}x_{0}\\ x_{1}\\ x_{2}\\ \end{bmatrix},s=e^{\frac{2\pi}{3}j}[ start_ARG start_ROW start_CELL italic_X start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] = [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL italic_s end_CELL start_CELL italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL italic_s end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] , italic_s = italic_e start_POSTSUPERSCRIPT divide start_ARG 2 italic_π end_ARG start_ARG 3 end_ARG italic_j end_POSTSUPERSCRIPT (3)
Refer to caption
Figure 1: The geometric symmetry in DFT-3, DFT-4 and DFT-6.

We do not substitute the numerical value of s𝑠sitalic_s into the calculation. Instead, we represent and compute subsequent variables using the polynomial form of s. This allows us to express the DFT outputs Xnsubscript𝑋𝑛X_{n}italic_X start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT as Xn=Xn,0+Xn,1s+Xn,2s2subscript𝑋𝑛subscript𝑋𝑛0subscript𝑋𝑛1𝑠subscript𝑋𝑛2superscript𝑠2X_{n}=X_{n,0}+X_{n,1}\cdot s+X_{n,2}\cdot s^{2}italic_X start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT = italic_X start_POSTSUBSCRIPT italic_n , 0 end_POSTSUBSCRIPT + italic_X start_POSTSUBSCRIPT italic_n , 1 end_POSTSUBSCRIPT ⋅ italic_s + italic_X start_POSTSUBSCRIPT italic_n , 2 end_POSTSUBSCRIPT ⋅ italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT.

By exploiting the geometric symmetry among 1,s,s21𝑠superscript𝑠21,s,s^{2}1 , italic_s , italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT, the s2superscript𝑠2s^{2}italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT can be expressed as s2=(s0+s1)superscript𝑠2superscript𝑠0superscript𝑠1s^{2}=-(s^{0}+s^{1})italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = - ( italic_s start_POSTSUPERSCRIPT 0 end_POSTSUPERSCRIPT + italic_s start_POSTSUPERSCRIPT 1 end_POSTSUPERSCRIPT ), as Figure 1. Further, the Hamiltonian symmetry of the real signal Fourier transform can reduce the number of components by almost half. Thus we can rearrange Equation 3 as following:

[X0X1X2]=[10001s01s2][111101011][x0x1x2]matrixsubscript𝑋0subscript𝑋1subscript𝑋2matrix10001𝑠01superscript𝑠2matrix111101011matrixsubscript𝑥0subscript𝑥1subscript𝑥2\begin{bmatrix}X_{0}\\ X_{1}\\ X_{2}\\ \end{bmatrix}=\begin{bmatrix}1&0&0\\ 0&1&s\\ 0&1&s^{2}\\ \end{bmatrix}\begin{bmatrix}1&1&1\\ 1&0&-1\\ 0&1&-1\\ \end{bmatrix}\begin{bmatrix}x_{0}\\ x_{1}\\ x_{2}\\ \end{bmatrix}[ start_ARG start_ROW start_CELL italic_X start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_X start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_X start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] = [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL italic_s end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] (4)

In the above formula, s=e2π3j𝑠superscript𝑒2𝜋3𝑗s=e^{\frac{2\pi}{3}j}italic_s = italic_e start_POSTSUPERSCRIPT divide start_ARG 2 italic_π end_ARG start_ARG 3 end_ARG italic_j end_POSTSUPERSCRIPT is not need to be explicitly included in the calculation but serves as a notation from the outset. Similar to complex number, the production of a0+a1ssubscript𝑎0subscript𝑎1𝑠a_{0}+a_{1}sitalic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_s and b0+b1ssubscript𝑏0subscript𝑏1𝑠b_{0}+b_{1}sitalic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_s can be calculated as:

(a0+a1s)subscript𝑎0subscript𝑎1𝑠\displaystyle(a_{0}+a_{1}s)( italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_s ) (b0+b1s)=c0+c1s,[c0c1]=\displaystyle(b_{0}+b_{1}s)=c_{0}+c_{1}s,\quad\begin{bmatrix}c_{0}\\ c_{1}\end{bmatrix}=( italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_b start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_s ) = italic_c start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_s , [ start_ARG start_ROW start_CELL italic_c start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] = (5)
[110101]matrix110101\displaystyle\begin{bmatrix}1&-1&0\\ 1&0&1\end{bmatrix}[ start_ARG start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] ([100111][a0a1][100111][b0b1])direct-productmatrix100111matrixsubscript𝑎0subscript𝑎1matrix100111matrixsubscript𝑏0subscript𝑏1\displaystyle(\begin{bmatrix}1&0\\ 0&1\\ -1&1\end{bmatrix}\begin{bmatrix}a_{0}\\ a_{1}\end{bmatrix}\odot\begin{bmatrix}1&0\\ 0&1\\ 1&-1\end{bmatrix}\begin{bmatrix}b_{0}\\ b_{1}\end{bmatrix})( [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] ⊙ [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL italic_b start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_b start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] )

Through enumeration, we can identify that 6666 and 4444 are suitable choices for DFT points in small-size convolution applications.

1) DFT-6

Considering DFT-6, its transformation coefficients consist of six values: 1,ejπ3,ej2π3,1,ej4π3,ej5π31superscript𝑒𝑗𝜋3superscript𝑒𝑗2𝜋31superscript𝑒𝑗4𝜋3superscript𝑒𝑗5𝜋31,e^{j\frac{\pi}{3}},e^{j\frac{2\pi}{3}},-1,e^{j\frac{4\pi}{3}},e^{j\frac{5\pi% }{3}}1 , italic_e start_POSTSUPERSCRIPT italic_j divide start_ARG italic_π end_ARG start_ARG 3 end_ARG end_POSTSUPERSCRIPT , italic_e start_POSTSUPERSCRIPT italic_j divide start_ARG 2 italic_π end_ARG start_ARG 3 end_ARG end_POSTSUPERSCRIPT , - 1 , italic_e start_POSTSUPERSCRIPT italic_j divide start_ARG 4 italic_π end_ARG start_ARG 3 end_ARG end_POSTSUPERSCRIPT , italic_e start_POSTSUPERSCRIPT italic_j divide start_ARG 5 italic_π end_ARG start_ARG 3 end_ARG end_POSTSUPERSCRIPT. Defining s=eπ3j𝑠superscript𝑒𝜋3𝑗s=e^{\frac{\pi}{3}j}italic_s = italic_e start_POSTSUPERSCRIPT divide start_ARG italic_π end_ARG start_ARG 3 end_ARG italic_j end_POSTSUPERSCRIPT, thus s2=s1superscript𝑠2𝑠1s^{2}=s-1italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = italic_s - 1, which allows all coefficients to be expressed as first-order polynomials of s𝑠sitalic_s: 1,s,s1,1,s,1s1𝑠𝑠11𝑠1𝑠1,s,s-1,-1,-s,1-s1 , italic_s , italic_s - 1 , - 1 , - italic_s , 1 - italic_s. When two first-degree polynomials are multiplied, any quadratic term can be reduced to a first-degree term using the rule s2=s1superscript𝑠2𝑠1s^{2}=s-1italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT = italic_s - 1. Therefore, the DFT-6 transform processing under symbolic computation is as follows:

DFT6(x)=S6F6x=[10000001s0000001s000000101s20000001s20]\displaystyle DFT6(x)=S_{6}F_{6}x=\begin{bmatrix}1&0&0&0&0&0\\ 0&1&s&0&0&0\\ 0&0&0&1&s&0\\ 0&0&0&0&0&1\\ 0&1&-s^{2}&0&0&0\\ 0&0&0&1&-s^{2}&0\\ \end{bmatrix}\cdotitalic_D italic_F italic_T 6 ( italic_x ) = italic_S start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT italic_x = [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL italic_s end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL italic_s end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_CELL start_CELL 0 end_CELL end_ROW end_ARG ] ⋅ (6)
[111111110110011011101101011011111111][x0x1x2x3x4x5]matrix111111110110011011101101011011111111matrixsubscript𝑥0subscript𝑥1subscript𝑥2subscript𝑥3subscript𝑥4subscript𝑥5\displaystyle\begin{bmatrix}1&1&1&1&1&1\\ 1&1&0&-1&-1&0\\ 0&-1&-1&0&1&1\\ 1&0&-1&1&0&-1\\ 0&-1&1&0&-1&1\\ 1&-1&1&-1&1&-1\\ \end{bmatrix}\cdot\begin{bmatrix}x_{0}\\ x_{1}\\ x_{2}\\ x_{3}\\ x_{4}\\ x_{5}\\ \end{bmatrix}[ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL end_ROW end_ARG ] ⋅ [ start_ARG start_ROW start_CELL italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 5 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ]

Here, S6subscript𝑆6S_{6}italic_S start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT represents the format transition from symbolic to numerical computing without any arithmetic, and F6subscript𝐹6F_{6}italic_F start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT signifies the Fourier transform under symbolic computing. We refer to the intermediate matrix as SFT-6 (Symbolic Fourier Transform-6), as its coefficients consist solely of 1, -1, and 0. The inverse transform iDFT6 can be rearranged in a similar way. F6subscript𝐹6F_{6}italic_F start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT has its fast algorithm by decomposing it into DFT-3 and DFT-2. Therefore, only 14 addition operations are needed to perform SFT-6 (subtraction can be considered as addition of complement). The inverse transformation matrix iF6𝑖subscript𝐹6iF_{6}italic_i italic_F start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT can be derived in a similar way:

iF6=16[111111112112121121112112211211111111]𝑖subscript𝐹616matrix111111112112121121112112211211111111iF_{6}=\frac{1}{6}\begin{bmatrix}1&1&1&1&1&1\\ 1&-1&-2&-1&1&2\\ -1&-2&-1&1&2&1\\ -1&-1&2&-1&-1&2\\ -2&1&1&-2&1&1\\ -1&1&-1&1&-1&1\\ \end{bmatrix}italic_i italic_F start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG 6 end_ARG [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL - 2 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 2 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL - 2 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 2 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 2 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 2 end_CELL end_ROW start_ROW start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] (7)

Note that 1616\frac{1}{6}divide start_ARG 1 end_ARG start_ARG 6 end_ARG can be fused into floating model without having to compute that division operation during inference stage.

In the element-wise multiplication steps, multiplications are performed in polynomial form. Multiplying two 1st-order polynomials requires 4 real number multiplications. We can utilize a short fast convolution algorithm to reduce it to 3:

(a0+a1s)subscript𝑎0subscript𝑎1𝑠\displaystyle(a_{0}+a_{1}s)( italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_s ) (w0+w1s)=o0+o1s,[o0o1]=\displaystyle(w_{0}+w_{1}s)=o_{0}+o_{1}s,\quad\begin{bmatrix}o_{0}\\ o_{1}\end{bmatrix}=( italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_s ) = italic_o start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_o start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_s , [ start_ARG start_ROW start_CELL italic_o start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_o start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] = (8)
[110101]matrix110101\displaystyle\begin{bmatrix}1&-1&0\\ -1&0&1\end{bmatrix}[ start_ARG start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] ([100111][a0a1][100111][w0w1])direct-productmatrix100111matrixsubscript𝑎0subscript𝑎1matrix100111matrixsubscript𝑤0subscript𝑤1\displaystyle(\begin{bmatrix}1&0\\ 0&1\\ 1&1\end{bmatrix}\begin{bmatrix}a_{0}\\ a_{1}\end{bmatrix}\odot\begin{bmatrix}1&0\\ 0&1\\ 1&1\end{bmatrix}\begin{bmatrix}w_{0}\\ w_{1}\end{bmatrix})( [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] ⊙ [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] )

2) DFT-4

Similarly, the DFT-4 can be constructed in the same manner.

DFT4(x)=S4F4x=[100001j0000101j0]\displaystyle DFT4(x)=S_{4}F_{4}x=\begin{bmatrix}1&0&0&0\\ 0&1&j&0&\\ 0&0&0&1\\ 0&1&-j&0\\ \end{bmatrix}\cdotitalic_D italic_F italic_T 4 ( italic_x ) = italic_S start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT 4 end_POSTSUBSCRIPT italic_x = [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL italic_j end_CELL start_CELL 0 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - italic_j end_CELL start_CELL 0 end_CELL end_ROW end_ARG ] ⋅ (9)
[1111101001011111][x0x1x2x3]matrix1111101001011111matrixsubscript𝑥0subscript𝑥1subscript𝑥2subscript𝑥3\displaystyle\begin{bmatrix}1&1&1&1\\ 1&0&-1&0\\ 0&-1&0&1\\ 1&-1&1&-1\\ \end{bmatrix}\begin{bmatrix}x_{0}\\ x_{1}\\ x_{2}\\ x_{3}\\ \end{bmatrix}[ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_x start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ]

The multiplication on the DFT-4 convolution can be performed as:

(a0+a1s)subscript𝑎0subscript𝑎1𝑠\displaystyle(a_{0}+a_{1}s)( italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_s ) (w0+w1s)=o0+o1s,[o0o1]=\displaystyle(w_{0}+w_{1}s)=o_{0}+o_{1}s,\quad\begin{bmatrix}o_{0}\\ o_{1}\end{bmatrix}=( italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_s ) = italic_o start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT + italic_o start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_s , [ start_ARG start_ROW start_CELL italic_o start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_o start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] = (10)
[110111]matrix110111\displaystyle\begin{bmatrix}1&-1&0\\ -1&-1&1\end{bmatrix}[ start_ARG start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] ([100111][a0a1][100111][w0w1])direct-productmatrix100111matrixsubscript𝑎0subscript𝑎1matrix100111matrixsubscript𝑤0subscript𝑤1\displaystyle(\begin{bmatrix}1&0\\ 0&1\\ 1&1\end{bmatrix}\begin{bmatrix}a_{0}\\ a_{1}\end{bmatrix}\odot\begin{bmatrix}1&0\\ 0&1\\ 1&1\end{bmatrix}\begin{bmatrix}w_{0}\\ w_{1}\end{bmatrix})( [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] ⊙ [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] [ start_ARG start_ROW start_CELL italic_w start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] )

If we want to compute A((Gf)(Bx))𝐴direct-product𝐺𝑓𝐵𝑥A((Gf)\odot(Bx))italic_A ( ( italic_G italic_f ) ⊙ ( italic_B italic_x ) ) directly in the real number field, akin to the Winograd algorithm, without involving polynomial multiplication, we can integrate Eq. 8 or Eq. 10 into the SFT matrix Eq. 6 or Eq. 9. In the 1D case, this does not impact efficiency. However, in the 2D case, it introduces slight redundant components and marginally reduces the acceleration ratio. We list these algorithms with polynomial multiplication integrated in the appendix.

4.2 Achieving Efficient Linear Convolution

The conventional Fourier Transform method inherently generates cyclic convolution. As a consequence, only (NR+1)2superscript𝑁𝑅12(N-R+1)^{2}( italic_N - italic_R + 1 ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT components are valid for the intended N2superscript𝑁2N^{2}italic_N start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT linear convolution. However, it’s noteworthy that the invalid results are not entirely useless. They actually contain partial sums that can be effectively utilized. By intelligently applying correction terms to these partial sums, it is possible to convert them into valid outputs. This approach significantly enhances the efficiency of the convolution process.

Figure 2 illustrates an example of Fourier-based cyclic convolution for N=6𝑁6N=6italic_N = 6 and R=3𝑅3R=3italic_R = 3. The first term o1csuperscriptsubscript𝑜1𝑐o_{1}^{c}italic_o start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT is equal to a6w1+a1w2+a2w1subscript𝑎6subscript𝑤1subscript𝑎1subscript𝑤2subscript𝑎2subscript𝑤1a_{6}w_{1}+a_{1}w_{2}+a_{2}w_{1}italic_a start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT, but the desired output is o1=a0w1+a1w2+a2w1subscript𝑜1subscript𝑎0subscript𝑤1subscript𝑎1subscript𝑤2subscript𝑎2subscript𝑤1o_{1}=a_{0}w_{1}+a_{1}w_{2}+a_{2}w_{1}italic_o start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT + italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT + italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT. To align o1subscript𝑜1o_{1}italic_o start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT with o1csuperscriptsubscript𝑜1𝑐o_{1}^{c}italic_o start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT, a corrective term is introduced to obtain the desired output: o1=o1c+(a0a6)w1subscript𝑜1superscriptsubscript𝑜1𝑐subscript𝑎0subscript𝑎6subscript𝑤1o_{1}=o_{1}^{c}+(a_{0}-a_{6})w_{1}italic_o start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_o start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_c end_POSTSUPERSCRIPT + ( italic_a start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT - italic_a start_POSTSUBSCRIPT 6 end_POSTSUBSCRIPT ) italic_w start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT. This adjustment allows us to obtain an additional correct result by adding just one MAC operation, thus utilizing the Fourier convolution output more efficiently compared to discarding erroneous terms.

Refer to caption
Figure 2: Converting invalid cyclic convolution to linear convolution, which is a process that can also be used to adjust the input tile sizes.

To unambiguously represent a particular algorithm, we employ the notation SFC-N𝑁Nitalic_N(M𝑀Mitalic_M, R𝑅Ritalic_R), where N𝑁Nitalic_N represents the length of the SFT transformation, M𝑀Mitalic_M represents the feature tile size, and R𝑅Ritalic_R represents the kernel size. For example, the SFC-6(6×6, 3×3) algorithm is constructed based on a 6666-point Fourier transform, employing a 3×3 kernel size and a 6×6 feature tile size.

By introducing correction terms, we can also adjust input tile size M𝑀Mitalic_M independently. For example, as the images in the ImageNet dataset are in size 224×224, the feature map size in the model has a common factor of 7. Utilizing the SFC-6(7×7, 3×3) algorithm to infer models trained on Imagenet would have higher tiling efficiency without the need for zero padding. The transformation matrix integrated polynomial multiplication of the SFC-6(7×7, 3×3) is as follows:

BT=[011111100011011000001101100010110100010110100001101100011011000011111100100000100010000010010000010001000001],superscript𝐵𝑇matrix011111100011011000001101100010110100010110100001101100011011000011111100100000100010000010010000010001000001\displaystyle{\tiny B^{T}=\begin{bmatrix}0&1&1&1&1&1&1&0&0\\ 0&1&1&0&-1&-1&0&0&0\\ 0&0&-1&-1&0&1&1&0&0\\ 0&1&0&-1&-1&0&1&0&0\\ 0&1&0&-1&1&0&-1&0&0\\ 0&0&-1&1&0&-1&1&0&0\\ 0&1&-1&0&1&-1&0&0&0\\ 0&1&-1&1&-1&1&-1&0&0\\ 1&0&0&0&0&0&-1&0&0\\ 0&-1&0&0&0&0&0&1&0\\ 0&-1&0&0&0&0&0&1&0\\ 0&0&-1&0&0&0&0&0&1\\ \end{bmatrix}},italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = [ start_ARG start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] , (11)
A=16[111111121121121111121121121112112111121121212211211111116000000000006000000060000006],𝐴16matrix111111121121121111121121121112112111121121212211211111116000000000006000000060000006\displaystyle{\tiny A=\frac{1}{6}\begin{bmatrix}1&1&1&1&1&1&1\\ 2&1&-1&-2&-1&1&2\\ -1&1&1&1&-1&-2&-1\\ -1&-2&-1&1&2&1&-1\\ 1&-2&1&1&-2&1&1\\ 1&1&-2&1&1&-2&1\\ -2&1&2&-2&1&1&-2\\ -1&1&-1&1&-1&1&-1\\ 6&0&0&0&0&0&0\\ 0&0&0&0&0&6&0\\ 0&0&0&0&0&0&6\\ 0&0&0&0&0&0&6\\ \end{bmatrix}},italic_A = divide start_ARG 1 end_ARG start_ARG 6 end_ARG [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 2 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL - 2 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 2 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL - 2 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL - 2 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 2 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 2 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 6 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 6 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 6 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 6 end_CELL end_ROW end_ARG ] ,
G=[111011110101101110011111100001010001]𝐺matrix111011110101101110011111100001010001\displaystyle{\tiny G=\begin{bmatrix}1&1&1\\ 0&1&1\\ -1&-1&0\\ -1&0&1\\ -1&0&1\\ 1&-1&0\\ 0&-1&1\\ 1&-1&1\\ 1&0&0\\ 0&0&1\\ 0&1&0\\ 0&0&1\\ \end{bmatrix}}italic_G = [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ]

The SFC-6(6×6, 3×3) algorithm can reduce 73% multiplications for 3×3 convolutions. And multiplications can be optimized by 81% and 79% for 5×5 and 7×7 convolutions, respectively. A selection of achievable Symbolic Fourier Convolution algorithms is listed in Table 1. For comparison, we also list their efficiencies and the numerical errors obtained from simulation and theoretical analyses (Detailed in Section 5). In Table 1, we can see that for 3×3 convolution the SFC-6666(6×6666\tiny{\times}66 × 6, 3×3333\tiny{\times}33 × 3) algorithm is 1.64× faster than the Winograd (2×2, 3×3) algorithm, while maintaining nearly the same numerical error.

Further, for larger kernel size, the Winograd algorithm is no longer able to construct linear transformations with sufficiently low numerical errors, no matter how the root points are chosen. Although there exists work by splitting the core, making it possible to compute large size convolutions with the Winograd(2×2, 3×3) algorithm with low numerical error (Huang et al., 2020), it is not possible to achieve a lower arithmetic complexity than Winograd(2×2, 3×3). Whereas our algorithm provides higher speedups and keeps the high numerical stability.

Table 1: Comparison of different fast convolution algorithms. Mean square error was measured on randomly generated data, and κ(AT)𝜅superscript𝐴𝑇\kappa(A^{T})italic_κ ( italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) was calculated using the singular value of ATsuperscript𝐴𝑇A^{T}italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT.
Algorithm
 Mean Square
Error
κ(AT)𝜅superscript𝐴𝑇\kappa(A^{T})italic_κ ( italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT )
 Arithmetic
Complexity
direct convolution 1.0 1.0 100%
Wino(2×2222\tiny{\times}22 × 2, 3×3333\tiny{\times}33 × 3) 2.2 2.4 44.4%
Wino(3×3333\tiny{\times}33 × 3, 3×3333\tiny{\times}33 × 3) 6.4 14.5 30.4%
Wino(4×4444\tiny{\times}44 × 4, 3×3333\tiny{\times}33 × 3) 10.5 20.1 25%
SFC-4(4×4444\tiny{\times}44 × 4, 3×3333\tiny{\times}33 × 3) 2.4 2.7  31.94%
SFC-6(6×6666\tiny{\times}66 × 6, 3×3333\tiny{\times}33 × 3) 2.4 3.3 27.16%
SFC-6(7×7777\tiny{\times}77 × 7, 3×3333\tiny{\times}33 × 3) 2.6 3.4 29.93%
Wino(2×2222\tiny{\times}22 × 2, 5×5555\tiny{\times}55 × 5) 10.5 20.1 36%
SFC-6(6×6666\tiny{\times}66 × 6, 5×5555\tiny{\times}55 × 5) 3.6 3.5 20.44%
Wino(2×2222\tiny{\times}22 × 2, 7×7777\tiny{\times}77 × 7) 28.1 31.0 32.6%
SFC-6(4×4444\tiny{\times}44 × 4, 7×7777\tiny{\times}77 × 7) 3.6 3.5 21.99%

5 Error Analysis and Frequency-wise Quantization

This section would analyze the numerical error of fast convolution algorithms, which can be used to guide the development of quantization methods and serve as a benchmark for assessing the numerical stability across various fast algorithms.

To cover the direct convolution into the same error analysis model, we can consider it as a fast convolution with R=3𝑅3R=3italic_R = 3, M=1𝑀1M=1italic_M = 1. For derivation convenience, we use the overlapped output form, in which A𝐴Aitalic_A is a reversible square matrix.

[y0y1y2]T=[111](([111][f0f1f2])([1][x0]))superscriptmatrixsubscript𝑦0subscript𝑦1subscript𝑦2𝑇matrix1missing-subexpressionmissing-subexpressionmissing-subexpression1missing-subexpressionmissing-subexpressionmissing-subexpression1direct-productmatrix1missing-subexpressionmissing-subexpressionmissing-subexpression1missing-subexpressionmissing-subexpressionmissing-subexpression1matrixsubscript𝑓0subscript𝑓1subscript𝑓2matrix1matrixsubscript𝑥0\begin{bmatrix}y_{0}\\ y_{1}\\ y_{2}\end{bmatrix}^{T}=\begin{bmatrix}1&&\\ &1&\\ &&1\end{bmatrix}\cdot((\begin{bmatrix}1&&\\ &1&\\ &&1\end{bmatrix}\cdot\begin{bmatrix}f_{0}\\ f_{1}\\ f_{2}\end{bmatrix})\odot(\begin{bmatrix}1\end{bmatrix}\cdot\begin{bmatrix}x_{0% }\end{bmatrix}))[ start_ARG start_ROW start_CELL italic_y start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_y start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_y start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL 1 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] ⋅ ( ( [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL 1 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] ⋅ [ start_ARG start_ROW start_CELL italic_f start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_f start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_CELL end_ROW start_ROW start_CELL italic_f start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] ) ⊙ ( [ start_ARG start_ROW start_CELL 1 end_CELL end_ROW end_ARG ] ⋅ [ start_ARG start_ROW start_CELL italic_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT end_CELL end_ROW end_ARG ] ) ) (12)

We denote the quantized element-wise multiply as Qsubscriptdirect-product𝑄\odot_{Q}⊙ start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT, through which the operands are quantized and multiplied. δy𝛿𝑦\delta yitalic_δ italic_y represents the calculation error of the output y𝑦yitalic_y. The error forward propagation process can be described as:

y+δy=AT((Gf)Q(BTx))𝑦𝛿𝑦superscript𝐴𝑇subscriptdirect-product𝑄𝐺𝑓superscript𝐵𝑇𝑥y+\delta y=A^{T}\cdot((G\cdot f)\odot_{Q}(B^{T}\cdot x))italic_y + italic_δ italic_y = italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ⋅ ( ( italic_G ⋅ italic_f ) ⊙ start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT ( italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ⋅ italic_x ) ) (13)

We assume these transformations to be accurate, and the quantized operator direct-product\odot introduces rounding errors. These errors would be subsequently magnified by the matrix multiplying ATA^{T}\cdotitalic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ⋅.

Set s+δs=(Gf)Q(BTx)𝑠𝛿𝑠subscriptdirect-product𝑄𝐺𝑓superscript𝐵𝑇𝑥s+\delta s=(G\cdot f)\odot_{Q}(B^{T}\cdot x)italic_s + italic_δ italic_s = ( italic_G ⋅ italic_f ) ⊙ start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT ( italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ⋅ italic_x ), by substituting y=ATs𝑦superscript𝐴𝑇𝑠y=A^{T}\cdot sitalic_y = italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ⋅ italic_s into the Equation 13, we can obtain:

δy=(AT)1δs𝛿𝑦superscriptsuperscript𝐴𝑇1𝛿𝑠\delta y=(A^{T})^{-1}\cdot\delta sitalic_δ italic_y = ( italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ⋅ italic_δ italic_s (14)

Applying the properties of the paradigm ||||||*||| | ∗ | | yields:

δy<=(AT)δsnorm𝛿𝑦normsuperscript𝐴𝑇norm𝛿𝑠||\delta y||<=||(A^{T})||\cdot||\delta s||| | italic_δ italic_y | | < = | | ( italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) | | ⋅ | | italic_δ italic_s | | (15)

Similar, bringing in s<=(AT)1ynorm𝑠normsuperscriptsuperscript𝐴𝑇1norm𝑦||s||<=||(A^{T})^{-1}||\cdot||y||| | italic_s | | < = | | ( italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT | | ⋅ | | italic_y | |, we have:

δyy<=(AT)1(AT)δssnorm𝛿𝑦norm𝑦normsuperscriptsuperscript𝐴𝑇1normsuperscript𝐴𝑇norm𝛿𝑠norm𝑠\frac{||\delta y||}{||y||}<=||(A^{T})^{-1}||\cdot||(A^{T})||\cdot\frac{||% \delta s||}{||s||}divide start_ARG | | italic_δ italic_y | | end_ARG start_ARG | | italic_y | | end_ARG < = | | ( italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT | | ⋅ | | ( italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) | | ⋅ divide start_ARG | | italic_δ italic_s | | end_ARG start_ARG | | italic_s | | end_ARG (16)

Here we can perform an analysis of Eq.16. The first term (AT)1(AT)normsuperscriptsuperscript𝐴𝑇1normsuperscript𝐴𝑇||(A^{T})^{-1}||\cdot||(A^{T})||| | ( italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT | | ⋅ | | ( italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) | | is the condition number of matrix ATsuperscript𝐴𝑇A^{T}italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT, denoted as κ(AT)𝜅superscript𝐴𝑇\kappa(A^{T})italic_κ ( italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ). This term indicates the amplification factor applied to the error δs𝛿𝑠\delta sitalic_δ italic_s. The condition number of the standard orthogonal matrix is constant 1, like A𝐴Aitalic_A in direct convolution (Eq.1) and vanilla Fourier convolution, which would not amplify δs𝛿𝑠\delta sitalic_δ italic_s anymore. While the κ(AT)𝜅superscript𝐴𝑇\kappa(A^{T})italic_κ ( italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) for Winograd convolution can reach up to 31.0, as listed in Table 1. However, in our method, the κ(AT)𝜅superscript𝐴𝑇\kappa(A^{T})italic_κ ( italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ) takes the values of 2.7, 3.3, and 3.5 for SFC4(4,3), SFC6(6,3), and SFC6(4,7), respectively, which is much less than that of Winograd.

The final term δssnorm𝛿𝑠norm𝑠\frac{||\delta s||}{||s||}divide start_ARG | | italic_δ italic_s | | end_ARG start_ARG | | italic_s | | end_ARG in Eq.16 is the error caused by the quantized operator Qsubscriptdirect-product𝑄\odot_{Q}⊙ start_POSTSUBSCRIPT italic_Q end_POSTSUBSCRIPT, which depends on the data width, quantization method, and data distribution. For floating numbers, the error is fairly stationary as every operand has its own exponent code(scaling factor). We conduct numerical experiments under single precision (fp16) in Table 1, and it is find that the numerical error is highly correlated to the κ(AT)𝜅superscript𝐴𝑇\kappa(A^{T})italic_κ ( italic_A start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT ), in accordance with our analysis.

Refer to caption
Figure 3: The energy distribution in the frequency domain of the 9-th layer in Resnet18. Energy is concentrated in the low frequencies.

For model quantization, numbers in a group share the same scaling factor to achieve integer representation. The granularity of the scaling factor group should fit with the data distribution to achieve low quantization error. Some work has found that in Winograd convolution, scaling factor grouping based on the transformation domain coordinates can recover model accuracy effectively. Here we give a brief theoretical explanation. Assume that the input x𝑥xitalic_x has a fixed energy x=1norm𝑥1||x||=1| | italic_x | | = 1, the maximum possible value in the transformation domain is (1,1,1,8,8,1)111881(1,1,1,8,8,1)( 1 , 1 , 1 , 8 , 8 , 1 ) for F(4,3) algorithm, which would be quadratic in 2D convolution. Grouping one tensor to one scaling factor would cause a waste of 3/6 bit in 1D/2D. For Fourier convolution, each frequency would have the same maximum value. However, due to the frequency properties of the model inputs (image or audio), there will be a tendency for the energy to converge towards lower frequencies, as Figure 3 shows. Thus grouping scaling factors by frequency can also reduce the error of Fourier convolution, but not as necessary as Winograd. This is because the aforementioned tendency is only significant in the first few layers, and the magnitude variance is not as large as Winograd.

y=Cin(sTxBTxB/sTxintNTxsTfGfGT/sTfintNTf)\small y=\sum_{C_{in}}(s_{Tx}\left\lceil B^{T}xB/s_{Tx}\right\rfloor_{intN_{Tx% }}\odot s_{Tf}\left\lceil GfG^{T}/s_{Tf}\right\rfloor_{intN_{Tf}})italic_y = ∑ start_POSTSUBSCRIPT italic_C start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT italic_T italic_x end_POSTSUBSCRIPT ⌈ italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_x italic_B / italic_s start_POSTSUBSCRIPT italic_T italic_x end_POSTSUBSCRIPT ⌋ start_POSTSUBSCRIPT italic_i italic_n italic_t italic_N start_POSTSUBSCRIPT italic_T italic_x end_POSTSUBSCRIPT end_POSTSUBSCRIPT ⊙ italic_s start_POSTSUBSCRIPT italic_T italic_f end_POSTSUBSCRIPT ⌈ italic_G italic_f italic_G start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT / italic_s start_POSTSUBSCRIPT italic_T italic_f end_POSTSUBSCRIPT ⌋ start_POSTSUBSCRIPT italic_i italic_n italic_t italic_N start_POSTSUBSCRIPT italic_T italic_f end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) (17)

The scaling factor sTxsubscript𝑠𝑇𝑥s_{Tx}italic_s start_POSTSUBSCRIPT italic_T italic_x end_POSTSUBSCRIPT is of size [T×T]delimited-[]𝑇𝑇[T\tiny{\times}T][ italic_T × italic_T ], where T𝑇Titalic_T is the size of the transform domain. For the scaling factor sTfsubscript𝑠𝑇𝑓s_{Tf}italic_s start_POSTSUBSCRIPT italic_T italic_f end_POSTSUBSCRIPT of weights, considering that per-channel quantization can achieve better results in direct convolutions, we suggest combining per-frequency quantization and per-channel quantization whose sTfsubscript𝑠𝑇𝑓s_{Tf}italic_s start_POSTSUBSCRIPT italic_T italic_f end_POSTSUBSCRIPT is of size [OC×T×T]delimited-[]𝑂𝐶𝑇𝑇[OC\tiny{\times}T\tiny{\times}T][ italic_O italic_C × italic_T × italic_T ] to achieve higher accuracy.

6 Experimental Evaluation

We conducted experiments on image classification tasks to demonstrate the effectiveness of our algorithms. To comprehensively evaluate the computation cost of models accelerated by fast convolution algorithm and quantization, it is critical to consider both the reduction in arithmetic complexity afforded by fast convolution and the reduction in arithmetic data width introduced by quantization. Consequently, we adopt bit-operations (BOPs) as a metric of computation cost, diverging from the traditional floating-operations (FLOPs). In this metric, an n-bit addition operation requires n BOPs, whereas an n-bit multiplication costs n(n-1) BOPs. This is because an n-bit multiplication can be decomposed into n-1 instances of n-bit additions. The transformation cost of fast algorithms is also taken into account.

6.1 Post-training Quantization on Image Classification Benchmarks

Experiments were conducted on the ImageNet dataset, which contains 1.4 million images of size 224×224×3, distributed across 1,000 classes. We randomly selected 500 images from training set to create the calibration set for PTQ fine-tuning. Model accuracy was evaluated on the validation set. We utilized pre-trained fp32 models from TorchVision as our benchmarks. All batch normalization layers were fused into the preceding convolution layers prior to quantization.

We conducted quantization on the following algorithms: 1) Direct convolution, 2) The well-known Winograd F(4×4, 3×3) algorithms, which have been extensively researched for their quantization methods in recent years, and 3) our proposed SFC algorithms, including 1D and 2D format. For all these methods, all 3×3 convolution layers with a stride of 1 were replaced by the corresponding algorithm. Direct convolution and our SFC algorithms were quantized using AdaQuant (Hubara et al., 2020), while the Winograd algorithm was processed with Scaling Gradient Backward (Jain et al., 2020), due to observed convergence differences with AdaQuant in the Winograd F(4×4, 3×3). All the data in the spatial domain are quantized to int8, and the data in the transformation domain are quantized from int8 to int4. As mentioned in Section 5, storing spatial domain data in external storage is unnecessary; instead, data can be stored in the transform domain to avoid errors caused by multiple quantization operators.

Refer to caption
Figure 4: Model Accuracy vs. Computation Cost. We done post-training quantization for Resnet18 on Imagenet dataset from int8 to int4.
Table 2: Post-training quantization experiments on ImageNet.
Resnet18 Bits Top-1 ΔΔ\Deltaroman_Δ
Channel Balancing Wino(4×4, 3×3) 8 67.54 -2.22
Full Quant. Wino(4×4, 3×3) 8 68.16 -1.60
Full Quant. Wino(4×4, 3×3) 6 64.34 -5.42
Ours SFC6(7×7, 3×3) 8 69.59 -0.17
Ours SFC6(7×7, 3×3) 6 68.80 -0.96
Resnet34 Bits Top-1 ΔΔ\Deltaroman_Δ
Channel Balancing Wino(4×4, 3×3) 8 71.86 -1.44
Full Quant. Wino(4×4, 3×3) 8 71.75 -1.55
Full Quant. Wino(4×4, 3×3) 6 68.80 -4.50
Ours SFC6(7×7, 3×3) 8 73.14 -0.16
Ours SFC6(7×7, 3×3) 6 72.40 -0.90
Resnet50 Bits Top-1 ΔΔ\Deltaroman_Δ
Channel Balancing Wino(4×4, 3×3) 8 75.84 -0.29
Full Quant. Wino(4×4, 3×3) 8 75.74 -0.40
Full Quant. Wino(4×4, 3×3) 6 74.75 -1.39
Ours SFC6(7×7, 3×3) 8 76.02 -0.12
Ours SFC6(7×7, 3×3) 6 75.54 -0.60

Since it is not feasible to align a specific category of data, such as ensuring uniform computational costs across a column to compare accuracy in the table, we have illustrated the accuracy curves in relation to computation cost (BOPs) in Figure 4 for a visual comparison of different algorithms. To ensure a fair and comprehensive comparison, we have included the state-of-the-art Winograd-quantization work in the figure, including Channel Balancing (Chikin & Kryzhanovskiy, 2022) and Full Quantization (Tianqi et al., 2023). A subset of the detailed data is presented in Table 2. As depicted in Figure 4, the SFC-6(7×7, 3×3) algorithm demonstrates a computation cost reduction of 1.6× to 2.5× compared to both the Winograd F(4×4, 3×3) and direct convolution algorithms, while maintaining equivalent model accuracy. The experiments confirm that the SFC algorithms achieve a reduction in arithmetic complexity comparable to Winograd, with model accuracy similar to that of direct convolution (quantization-alone).

Refer to caption
Figure 5: The MSE for different algorithms under int8 PTQ. Lower numerical error results to better PTQ results.

We also plot the mean squared error (MSE) between accelerated layers to fp32 layers for different methods under int8 quantization. The experimental results, presented in Figure 5, match perfectly with the numerical error analysis conducted in Section 5 and Table 1.

6.2 FPGA Simulation

We develop RTL code for the convolution accelerator based on the SFC-6(7×7,3×3) algorithm. The resource consumption and timing report are synthesized using Xilinx Vivado tools. The parallelism of our design is configured at [4×4×7×7], indicating that one convolution operator with 4 input channels, 4 output channels, and 7×7 feature map are processed simultaneously. The VGG-16 model is taken as an example, whose convolution layers all have 3×3 filters with stride=1, making it well-suited for fast convolution. All components in our datapath are quantized to int8, and all computing stages in fast convolution are designed to operate in a full pipeline architecture. The DSP48 hardcore is a crucial resource on FPGA, as it can efficiently deploy multiply operations. One DSP48 unit can implement two int8 multipliers or one int16 multipliers, which means that our implementation consumes only 1056 DSPs (calculated as 4×4×132×0.5). In comparison, Winograd-based or NTT-based accelerators (Liang et al., 2020; Prasetiyo et al., 2023) require more DSPs for high precision multipliers, and accelerators for direct convolution (Huang et al., 2022) need more DSPs due to a higher complexity of multiplications. The result highlights the efficiency of integrating fast convolution with model quantization.

Table 3: FPGA synthesis results.
Works Liang et al., 2020 Prasetiyo et al., 2023 Huang et al., 2022 Ours
Algorithm Winograd NTT direct conv SFC
Platform zcu102 xc7vx980t alveo U50 xczu19eg
Precision 16bit 8bit/21bit 8bit 8bit
LUTs - 601.7K 335K 221K
DSPs 2304 4100 3395 1056
Clock 200M 200M 200M 200M
Throughput (GOPs) 2601 2859.5 1000 2129
GOPs/DSPs /Clock 5.64 3.48 1.96 10.08

6.3 Ablation Study on Quantization Granularity

In Section 5, we theoretically predict that frequency-wise quantization will produce less error than tensor-wise quantization. Here, we provide an ablation experiment on Resnet-18 by enumerating combinations of different quantization granularity in Table 4 and Table 5. The results underscore that the Winograd algorithm exhibits more sensitivity to quantization granularity, requiring the finest granularity even with int8 quantization. In contrast, the SFC maintains acceptable accuracy under int8 quantization without specialized quantization. However, at lower bitwidths, it is still necessary to employ frequency-wise quantization for activation tensors.

Table 4: Ablation study for Resnet18 on int8 quantization.
Algorithm Activation Granularity Filter Granularity Accuracy
SFC-6(7×7, 3×3) Tensor Channel 69.18
Freq. Channel 69.54
Freq. Freq. 69.55
Freq. Channel+Freq. 69.58
Wino(4×4, 3×3) Tensor Channel 57.40
Freq. Channel+Freq. 67.62
Table 5: Ablation study for Resnet-18 with SFC(7×7, 3×3).
Quant. Granularity int8 int6 int4
A: Tensor, W:Channel 69.18 65.42 17.81
A: Freq., W:Channel 69.54 68.34 31.10
A: Freq., W:Freq.+Channel 69.58 68.83 55.82

7 Conclusion

We propose a novel fast convolution algorithm extended from the Fourier transform, which solves the accuracy problem of the conventional fast convolution algorithm applied to quantized models. According to experiment results, our algorithms outperform state-of-the-art fast convolution quantization works on both model accuracy and computation cost reduction. Our algorithms share exactly the same computational flow as the Winograd algorithm, which means that they can be deployed on general-purpose processors (CPUs, GPUs) and hardware accelerator design conveniently by following the existing works.

Acknowledgements

This work was supported in part by the National Key Research and Development Program of China under Grant 2022YFB4400900, in part by the Natural Science Foundation of China under Grants 62211530492 and 62371223.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

  • Alam et al. (2022) Alam, S. A., Anderson, A., Barabasz, B., and Gregg, D. Winograd convolution for deep neural networks: Efficient point selection. ACM Transactions on Embedded Computing Systems, 21(6):1–28, 2022.
  • Andri et al. (2022) Andri, R., Bussolino, B., Cipolletta, A., Cavigelli, L., and Wang, Z. Going further with winograd convolutions: Tap-wise quantization for efficient inference on 4x4 tiles. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp.  582–598. IEEE, 2022.
  • Barabasz & Gregg (2019) Barabasz, B. and Gregg, D. Winograd convolution for dnns: Beyond linear polynomials. In International Conference of the Italian Association for Artificial Intelligence, pp.  307–320. Springer, 2019.
  • Barabasz et al. (2020) Barabasz, B., Anderson, A., Soodhalter, K. M., and Gregg, D. Error analysis and improving the accuracy of winograd convolution for deep neural networks. ACM Transactions on Mathematical Software (TOMS), 46(4):1–33, 2020.
  • Chikin & Kryzhanovskiy (2022) Chikin, V. and Kryzhanovskiy, V. Channel balancing for accurate quantization of winograd convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  12507–12516, 2022.
  • Hong et al. (2022) Hong, S., Arthanto, Y. F., Kim, J.-Y., et al. Accelerating deep convolutional neural networks using number theoretic transform. IEEE Transactions on Circuits and Systems I: Regular Papers, 70(1):315–326, 2022.
  • Huang et al. (2020) Huang, D., Zhang, X., Zhang, R., Zhi, T., He, D., Guo, J., Liu, C., Guo, Q., Du, Z., Liu, S., et al. Dwm: A decomposable winograd method for convolution acceleration. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp.  4174–4181, 2020.
  • Huang et al. (2022) Huang, W., Wu, H., Chen, Q., Luo, C., Zeng, S., Li, T., and Huang, Y. Fpga-based high-throughput cnn hardware accelerator with high computing resource utilization ratio. IEEE Transactions on Neural Networks and Learning Systems, 33(8):4069–4083, 2022. doi: 10.1109/TNNLS.2021.3055814.
  • Hubara et al. (2020) Hubara, I., Nahshan, Y., Hanani, Y., Banner, R., and Soudry, D. Improving post training neural quantization: Layer-wise calibration and integer programming. arXiv.org, 2020.
  • Jain et al. (2020) Jain, S., Gural, A., Wu, M., and Dick, C. Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks. Proceedings of Machine Learning and Systems, 2:112–128, 2020.
  • Langley (2000) Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp.  1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
  • Lavin & Gray (2016) Lavin, A. and Gray, S. Fast algorithms for convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  4013–4021, 2016.
  • Li et al. (2021) Li, G., Jia, Z., Feng, X., and Wang, Y. Lowino: Towards efficient low-precision winograd convolutions on modern cpus. In Proceedings of the 50th International Conference on Parallel Processing, pp.  1–11, 2021.
  • Liang et al. (2020) Liang, Y., Lu, L., Xiao, Q., and Yan, S. Evaluating fast algorithms for convolutional neural networks on fpgas. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 39(4):857–870, 2020. doi: 10.1109/tcad.2019.2897701.
  • Liu & Mattina (2020) Liu, Z.-G. and Mattina, M. Efficient residue number system based winograd convolution. In European Conference on Computer Vision, pp.  53–68. Springer, 2020.
  • Mathieu et al. (2014) Mathieu, M., Henaff, M., and LeCun, Y. Fast training of convolutional networks through ffts: international conference on learning representations (iclr2014), cbls, april 2014. In 2nd International Conference on Learning Representations, ICLR 2014, 2014.
  • Prasetiyo et al. (2023) Prasetiyo, Hong, S., Arthanto, Y. F., and Kim, J.-Y. Accelerating deep convolutional neural networks using number theoretic transform. IEEE Transactions on Circuits and Systems I: Regular Papers, 70(1):315–326, 2023. doi: 10.1109/tcsi.2022.3214528.
  • Tianqi et al. (2023) Tianqi, C., Xu, W., Chen, W., Wang, P., and Cheng, J. Towards efficient and accurate winograd convolution via full quantization. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Vincent et al. (2017) Vincent, K., Stephano, K., Frumkin, M., Ginsburg, B., and Demouth, J. On improving the numerical stability of winograd convolutions. 2017.

Appendix A Typical Symbolic Fourier Convolution Algorithms

For SFC-4(4×4444\times 44 × 4, 3×3333\times 33 × 3) algorithm, the matrices (with polynomial multiplication integrated) are:

BT=(011110011110011110001010010100100010010001),superscript𝐵𝑇matrix011110011110011110001010010100100010010001\displaystyle B^{T}=\begin{pmatrix}0&1&1&1&1&0\\ 0&-1&1&-1&1&0\\ 0&1&-1&-1&1&0\\ 0&0&-1&0&1&0\\ 0&1&0&-1&0&0\\ 1&0&0&0&-1&0\\ 0&-1&0&0&0&1\\ \end{pmatrix},italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = ( start_ARG start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ) , (18)
G=(111111111101010100001),A=14(1111111102022222222240000004)formulae-sequence𝐺matrix111111111101010100001𝐴14matrix1111111102022222222240000004\displaystyle G=\begin{pmatrix}1&1&1\\ 1&-1&1\\ 1&-1&-1\\ 1&0&-1\\ 0&-1&0\\ 1&0&0\\ 0&0&1\\ \end{pmatrix},A=\frac{1}{4}\begin{pmatrix}1&1&1&1\\ 1&-1&1&-1\\ 0&2&0&-2\\ 2&-2&-2&2\\ -2&-2&2&2\\ 4&0&0&0\\ 0&0&0&4\end{pmatrix}italic_G = ( start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ) , italic_A = divide start_ARG 1 end_ARG start_ARG 4 end_ARG ( start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 2 end_CELL start_CELL 0 end_CELL start_CELL - 2 end_CELL end_ROW start_ROW start_CELL 2 end_CELL start_CELL - 2 end_CELL start_CELL - 2 end_CELL start_CELL 2 end_CELL end_ROW start_ROW start_CELL - 2 end_CELL start_CELL - 2 end_CELL start_CELL 2 end_CELL start_CELL 2 end_CELL end_ROW start_ROW start_CELL 4 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 4 end_CELL end_ROW end_ARG )

It costs 49 multiplications to generate 16 outputs. And only 46 multiplications are consumed when Hermitian symmetry is fully considered.

The SFC-6(6×6666\times 66 × 6, 3×3333\times 33 × 3) algorithm costs 100/88 multiplications to generate 36 outputs. The matrices are:

BT=[01111110011011000011011001011010010110100011011001101100011111101000001001000001],superscript𝐵𝑇matrix01111110011011000011011001011010010110100011011001101100011111101000001001000001\displaystyle B^{T}=\begin{bmatrix}0&1&1&1&1&1&1&0\\ 0&1&1&0&-1&-1&0&0\\ 0&0&-1&-1&0&1&1&0\\ 0&1&0&-1&-1&0&1&0\\ 0&1&0&-1&1&0&-1&0\\ 0&0&-1&1&0&-1&1&0\\ 0&1&-1&0&1&-1&0&0\\ 0&1&-1&1&1&-1&1&0\\ 1&0&0&0&0&0&-1&0\\ 0&-1&0&0&0&0&0&1\\ \end{bmatrix},italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = [ start_ARG start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] , (19)
G=[111011110101101110011111100001],A=16[111111211211112112121121121121112112211211111111600000000006]formulae-sequence𝐺matrix111011110101101110011111100001𝐴16matrix111111211211112112121121121121112112211211111111600000000006\displaystyle G=\begin{bmatrix}1&1&1\\ 0&1&1\\ -1&-1&0\\ -1&0&1\\ -1&0&1\\ 1&-1&0\\ 0&-1&1\\ 1&-1&1\\ 1&0&0\\ 0&0&1\\ \end{bmatrix},A=\frac{1}{6}\begin{bmatrix}1&1&1&1&1&1\\ 2&1&-1&-2&-1&1\\ -1&1&2&1&-1&-2\\ -1&-2&-1&1&2&1\\ 1&-2&1&1&-2&1\\ 1&1&-2&1&1&-2\\ -2&1&1&-2&1&1\\ -1&1&-1&1&-1&1\\ 6&0&0&0&0&0\\ 0&0&0&0&0&6\\ \end{bmatrix}italic_G = [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] , italic_A = divide start_ARG 1 end_ARG start_ARG 6 end_ARG [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 2 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL - 2 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 2 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL - 2 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL - 2 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 2 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL end_ROW start_ROW start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 6 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 6 end_CELL end_ROW end_ARG ]

The SFC-6(7×7777\times 77 × 7, 3×3333\times 33 × 3) algorithm costs 144/132 multiplications to generate 49 outputs. Its transformation matrices are:

BT=[011111100011011000001101100010110100010110100001101100011011000011111100100000100010000010010000010001000001],superscript𝐵𝑇matrix011111100011011000001101100010110100010110100001101100011011000011111100100000100010000010010000010001000001\displaystyle B^{T}=\begin{bmatrix}0&1&1&1&1&1&1&0&0\\ 0&1&1&0&-1&-1&0&0&0\\ 0&0&-1&-1&0&1&1&0&0\\ 0&1&0&-1&-1&0&1&0&0\\ 0&1&0&-1&1&0&-1&0&0\\ 0&0&-1&1&0&-1&1&0&0\\ 0&1&-1&0&1&-1&0&0&0\\ 0&1&-1&1&-1&1&-1&0&0\\ 1&0&0&0&0&0&-1&0&0\\ 0&-1&0&0&0&0&0&1&0\\ 0&-1&0&0&0&0&0&1&0\\ 0&0&-1&0&0&0&0&0&1\\ \end{bmatrix},italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = [ start_ARG start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] , (20)
G=[111011110101101110011111100001010001],A=16[111111121121121121121121121112112111121121211211211111116000000000006000000060000006]formulae-sequence𝐺matrix111011110101101110011111100001010001𝐴16matrix111111121121121121121121121112112111121121211211211111116000000000006000000060000006\displaystyle G=\begin{bmatrix}1&1&1\\ 0&1&1\\ -1&-1&0\\ -1&0&1\\ -1&0&1\\ 1&-1&0\\ 0&-1&1\\ 1&-1&1\\ 1&0&0\\ 0&0&1\\ 0&1&0\\ 0&0&1\\ \end{bmatrix},A=\frac{1}{6}\begin{bmatrix}1&1&1&1&1&1&1\\ 2&1&-1&-2&-1&1&2\\ -1&1&2&1&-1&-2&-1\\ -1&-2&-1&1&2&1&-1\\ 1&-2&1&1&-2&1&1\\ 1&1&-2&1&1&-2&1\\ -2&1&1&-2&1&1&-2\\ -1&1&-1&1&-1&1&-1\\ 6&0&0&0&0&0&0\\ 0&0&0&0&0&6&0\\ 0&0&0&0&0&0&6\\ 0&0&0&0&0&0&6\\ \end{bmatrix}italic_G = [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] , italic_A = divide start_ARG 1 end_ARG start_ARG 6 end_ARG [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 2 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL - 2 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 2 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 2 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL - 2 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL - 2 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 2 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 6 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 6 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 6 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 6 end_CELL end_ROW end_ARG ]

The SFC-6(6×6666\times 66 × 6, 5×5555\times 55 × 5) algorithm costs 196/184 multiplications to generate 36 outputs. Its transformation matrices are:

BT=[00111111000011011000000110110000101101000010110100000110110000110110000011111100100000100001000001000100000100001000001000100000100001000001],superscript𝐵𝑇matrix00111111000011011000000110110000101101000010110100000110110000110110000011111100100000100001000001000100000100001000001000100000100001000001\displaystyle B^{T}=\begin{bmatrix}0&0&1&1&1&1&1&1&0&0\\ 0&0&1&1&0&-1&-1&0&0&0\\ 0&0&0&-1&-1&0&1&1&0&0\\ 0&0&1&0&-1&-1&0&1&0&0\\ 0&0&1&0&-1&1&0&-1&0&0\\ 0&0&0&-1&1&0&-1&1&0&0\\ 0&0&1&-1&0&1&-1&0&0&0\\ 0&0&1&-1&1&-1&1&-1&0&0\\ 1&0&0&0&0&0&-1&0&0&0\\ 0&1&0&0&0&0&0&-1&0&0\\ 0&1&0&0&0&0&0&-1&0&0\\ 0&0&-1&0&0&0&0&0&1&0\\ 0&0&-1&0&0&0&0&0&1&0\\ 0&0&0&-1&0&0&0&0&0&1\\ \end{bmatrix},italic_B start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT = [ start_ARG start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] , (21)
G=[1111111011101100110101101101101101111111100001000001000000100000100001],A=16[11111111121121121121221121112112111121121211211211111111600000060000000600000000006000000060000006]formulae-sequence𝐺matrix1111111011missing-subexpression101100110101101101101101111111100001000001000000100000100001𝐴16matrix11111111121121121121221121112112111121121211211211111111600000060000000600000000006000000060000006\displaystyle G=\begin{bmatrix}1&1&1&1&1\\ -1&-1&0&1&1&\\ 1&0&-1&-1&0\\ 0&-1&-1&0&1\\ 0&1&-1&0&1\\ -1&0&1&-1&0\\ -1&1&0&-1&1\\ 1&-1&1&-1&1\\ 1&0&0&0&0\\ 1&0&0&0&0\\ 0&1&0&0&0\\ 0&0&0&1&0\\ 0&0&0&0&1\\ 0&0&0&0&1\\ \end{bmatrix},A=\frac{1}{6}\begin{bmatrix}1&1&1&1&1&1&1\\ 1&-1&-2&-1&1&2&-1\\ 1&2&1&-1&-2&-1&2\\ -2&-1&1&2&1&-1&-1\\ -2&1&1&-2&1&1&1\\ 1&-2&1&1&-2&1&-2\\ 1&1&-2&1&1&-2&1\\ 1&-1&1&-1&1&-1&-1\\ 6&0&0&0&0&0&0\\ 6&0&0&0&0&0&0\\ 0&6&0&0&0&0&0\\ 0&0&0&0&0&6&0\\ 0&0&0&0&0&0&6\\ 0&0&0&0&0&0&6\\ \end{bmatrix}italic_G = [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 1 end_CELL end_ROW end_ARG ] , italic_A = divide start_ARG 1 end_ARG start_ARG 6 end_ARG [ start_ARG start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL - 2 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 2 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 2 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL - 2 end_CELL start_CELL - 1 end_CELL start_CELL 2 end_CELL end_ROW start_ROW start_CELL - 2 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL 2 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL start_CELL 1 end_CELL start_CELL - 2 end_CELL start_CELL 1 end_CELL end_ROW start_ROW start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL 1 end_CELL start_CELL - 1 end_CELL start_CELL - 1 end_CELL end_ROW start_ROW start_CELL 6 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 6 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 6 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 6 end_CELL start_CELL 0 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 6 end_CELL end_ROW start_ROW start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 0 end_CELL start_CELL 6 end_CELL end_ROW end_ARG ]

Appendix B Applying the SFC Algorithm to Large-size Convolution Kernels

Large-kernel convolutional neural networks have recently received extensive research attention in recent years, and the kernel sizes ranging from 7×7 to 51×51. The vanilla Fast Fourier Transform (FFT) is an option available. Since large-size convolutions commonly use depth-wise convolutions, this makes the multiplication complexity of the algorithm n2lognsuperscript𝑛2𝑙𝑜𝑔𝑛n^{2}lognitalic_n start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_l italic_o italic_g italic_n, i.e., the computational complexity of the FFT itself will become dominant.

As a comparison, our algorithm is accomplished using only addition in the transform stage, so applying our algorithm to large-size convolutional kernels is promising. However, our algorithm itself is not applicable to scaling large numbers of transformation points, which can lead to the appearance of higher-order terms further increasing the real multiplication times. We therefore consider an iterative convolution approach to the operation.

Considering a convolution with a 29×29 kernel size and a 26×26 feature map size, we will describe the computational process of iterative convolution. We split the feature map into 5×5 tiles with 6×6 size, while splite the kernel into 6×6 tiles with 5×5 size. Convolution operation will be performed between each feature map tile and each kernel tile, so we use SFC(6×6, 5×5) to speed it up. Note that at this time the partial convolution results of the each tiles are still summed up in the same pattern as the convolution window sliding, which allows us to use the SFC(5×5, 6×6) algorithm to speed up the process as the feature map has been split into 5×5 tiles, and the kernel has been split into 6×6 tiles in the first iteration. The number of multiplications required by above two iterations is the product of the multiplications in the two SFC algorithms, i.e., 132 × 132 = 17,424 multiplications. Our approach reduces the number of multiplications to just 3% of what is required by direct convolution.

We can increass the number of iterations to computing convolution with more larger sizes. Since the SFC algorithm uses only addition for the transformation, it has an easier, efficient and flexible deployment compared to the FFT method. However, when the convolution size is large enough to require 3 or more iterations, the FFT method will appear to be more advantageous in terms of theoretical computational efficiency.