Serial parallel dataflow-pipelined processing architecture-based accelerator for 2D transform-quantization in video coder and decoder
Corresponding Author:
Sumalatha Shivarudraiah
Research Scholar, Department of Electronics and Communication Engineering
Acharya Institute of Technology
Bangalore, India
Email: sumalatha.disha@gmail.com
1. INTRODUCTION
Video coder and decoder (CODEC) standards, from the latest versatile video coding (VVC, 2020) down to the earlier high efficiency video coding (HEVC, 2013) and advanced video coding (AVC, 2003), share the goal of achieving high video quality while reducing bandwidth and storage requirements [1],
[2]. Reducing the bit rate by 30% to 50% over previous standards while maintaining high video quality is possible only with sophisticated coding tools and algorithms. Adopting the newer coding tools, such as an increased number of intra and inter prediction modes, primary transforms like discrete cosine transform (DCT)-type-II/V/VIII and discrete sine transform (DST)-type-I/VII, secondary low frequency non-separable transformation (LFNST) types with rectangular transforms, and frequency-dependent and rate-dependent perceptual quantization, increases the computational complexity [3]. To perform these computationally intensive transformation and quantization operations on video frames efficiently, hardware architecture implementations have reported performance gains over software-only solutions, especially for real-time processing on embedded platforms. The new era of system-on-chip (SoC) field programmable gate arrays (FPGAs), ranging from low end to high end, comes with hybrid processing elements such as digital signal processors (DSPs), GPUs, and CPUs that support hardware/software co-design: an architect can partition a complex video CODEC implementation so that the CPU handles the more sequential, dataflow- and control-intensive parts while the FPGA handles the reconfigurable transform and quantization acceleration tasks [4].
Thus, the positional importance of the transformed pixels, measured as the Euclidean distance from the DC coefficient to every AC coefficient in the Luma and Chroma (Cb and Cr) transform blocks, together with the normalized display-resolution hypotenuse, is considered for QM derivation [21], [22]. By fitting the complex Barten CSF model to the Daly model, suitable for high dynamic range (HDR) ultra-high-definition video, [23] developed CSF-tuned frequency weighting matrices (FWM) for both Luma and Chroma for the 8×8 transform unit (TU) size. This matrix can then be up- and down-sampled to derive QMs for other sizes. The first QM for Luma and Chroma coding was investigated by considering the CSF of DCT subbands for RGB videos [24]. The R, G, and B channels were combined in a 1:1:4 ratio, with high priority assigned to the G-channel, for QM derivation. Visual quality metrics and the corresponding experimental works, most of them HVS-based, are reviewed in detail in [25]. The authors also suggested that learning-based adaptive quantization can advance performance and would also suit machine vision applications.
The literature review reveals that most researchers have proposed different architectures for 2D transformation of square and rectangular block sizes. They have also proposed integer approximations of the transform kernel coefficients, suitable for VLSI implementation, to avoid the complexity of handling real values; defined various methods for transposing the intermediate 1D DCT result for 2D transformation; and applied quantization to remove perceptual redundancy and support entropy coding. In all this work, moving multiple data values from the external source or storage to the processing unit and processing data in parallel within a pipelined architecture remains challenging. Quantizing transform coefficients to strike a better balance between redundancy removal and visual quality on an HD display is also a domain of research interest. The main contributions of the proposed work addressing these gaps are: i) selection of the data size, based on block size and image width, to transform multiple data in parallel; this approach also allows flexible transformation of the square and rectangular data sizes specified in the new CODEC standards; ii) keeping data near the processing unit by loading it into line buffers, and a new technique for transposing the 1D-DCT result using a demultiplexer and line buffers, which accelerates the transformation and quantization operations; and iii) an adaptive quantization method based on display resolution that trades off visual quality against the number of encoding bits per pixel during entropy coding.
The rest of the paper is organized as follows. Section 3 outlines the proposed method and gives the detailed mathematical model, with an architectural framework, of the integer-approximated 2D-DCT transformation and perceptually optimized adaptive quantization modules. Section 4 describes the data flow acceleration and pipelined operation of the proposed architecture, with simulation and implementation results on the targeted FPGA evaluation boards. Finally, the conclusion is given in section 5.
3. PROPOSED METHODOLOGY
The architectural framework in Figure 1 outlines the processing sub-modules and the data flow acceleration. The test input is loaded serially into the on-chip line buffers for 1D transformation along the rows. The number of line buffers instantiated depends on the selected transform size, and the depth of each line buffer equals the image width, as shown in Figure 1.
Figure 1. Proposed serial parallel data-flow pipelined processing architecture-hardware accelerator framework
These line buffers achieve the main objective of keeping the required data available near the processing unit. The transform kernel coefficients T are loaded into a read-only memory (ROM) based on the selected transform size. The first 1D transformation along the rows and the second transformation along the columns are performed by the general matrix multiplication method. After every multiplication, the usual data truncation is performed by scaling, to maintain a 16-bit data size after every step. The scaling factor depends on the pixel depth 'B' and on 'K', the base-2 logarithm of the selected block size 'N'. The 1D transformation output is transposed by a DEMUX and line buffers and fed to the second, column-wise transformation by multiplication with T′. This scheme supports both square and rectangular transformations of various sizes, as shown in Figure 1. The 2D transformation output is then quantized, based on the quantization parameter, block size, and frame size configured by the control unit, to obtain the transform coefficient levels. As the proposed architecture is a unified structure, the forward transformation process is performed in reverse order for the inverse transform and inverse quantization. The serial and parallel dataflow through the sub-modules of the architecture, with four-stage pipelined operation, is described in detail in the next sections.
$$T_{i,j} = \begin{cases} \sqrt{\dfrac{1}{N}}, & \text{if } i = 0 \\[6pt] \sqrt{\dfrac{2}{N}} \cos\dfrac{(2j+1)i\pi}{2N}, & \text{if } i > 0 \end{cases} \qquad (1)$$
In (1), $T_{i,j}$ represents the real-valued element of the transformation coefficient matrix; $i$ is the row index, $j$ is the column index, $N$ is the transform size; and $i, j = 0, 1, \ldots, N-1$. To process the transformation in integer arithmetic, the integer approximation of the real coefficients in (1) is obtained by (2), defined as:
$$T^{int}_{i,j} = \operatorname{round}\!\left(2^{n} \times T_{i,j}\right) \qquad (2)$$

where $n = 6 + \frac{\log_2 N}{2} = 6 + \frac{K}{2}$ and $N$ is the transformation size.
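As a concrete illustration, the following Python sketch (our own reference model, not the RTL) builds the real kernel of (1) and its integer approximation per (2). For N = 4 it reproduces the first row [64, 64, 64, 64] of the HEVC-style kernel; other entries may differ by ±1 from the standard's hand-tuned matrices, since (2) is a plain rounding.

```python
import numpy as np

def dct_kernel(N):
    """Real-valued N-point DCT-II kernel T of (1)."""
    T = np.zeros((N, N))
    for i in range(N):
        scale = np.sqrt(1.0 / N) if i == 0 else np.sqrt(2.0 / N)
        for j in range(N):
            T[i, j] = scale * np.cos((2 * j + 1) * i * np.pi / (2 * N))
    return T

def int_kernel(N):
    """Integer approximation per (2): round(2^n * T), n = 6 + log2(N)/2."""
    n = 6 + np.log2(N) / 2
    return np.round((2 ** n) * dct_kernel(N)).astype(int)

print(int_kernel(4))  # first row: [64 64 64 64]
```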
The two-dimensional transform of an M×N residual block can be achieved by (3): first a one-dimensional transform is applied to each row of the block, and then another one-dimensional transform is applied to each column of the row-transformed result.
$$Y = T X T' \qquad (3)$$
$$P = T \times X = \begin{bmatrix} P_{00} & P_{01} & P_{02} & P_{03} \\ P_{10} & P_{11} & P_{12} & P_{13} \\ P_{20} & P_{21} & P_{22} & P_{23} \\ P_{30} & P_{31} & P_{32} & P_{33} \end{bmatrix} = \begin{bmatrix} T_{00} & T_{01} & T_{02} & T_{03} \\ T_{10} & T_{11} & T_{12} & T_{13} \\ T_{20} & T_{21} & T_{22} & T_{23} \\ T_{30} & T_{31} & T_{32} & T_{33} \end{bmatrix} \times \begin{bmatrix} X_{00} & X_{01} & X_{02} & X_{03} \\ X_{10} & X_{11} & X_{12} & X_{13} \\ X_{20} & X_{21} & X_{22} & X_{23} \\ X_{30} & X_{31} & X_{32} & X_{33} \end{bmatrix} \qquad (4)$$
As shown in (4), X represents the pixel residual matrix, T is the 4-point transform kernel matrix, and P is the 1D-transformed resultant matrix. The second level of transformation, which produces the 2D-transform output, can be expressed as in (5).
$$Y = P \times T' = \begin{bmatrix} Y_{00} & Y_{01} & Y_{02} & Y_{03} \\ Y_{10} & Y_{11} & Y_{12} & Y_{13} \\ Y_{20} & Y_{21} & Y_{22} & Y_{23} \\ Y_{30} & Y_{31} & Y_{32} & Y_{33} \end{bmatrix} = \begin{bmatrix} P_{00} & P_{01} & P_{02} & P_{03} \\ P_{10} & P_{11} & P_{12} & P_{13} \\ P_{20} & P_{21} & P_{22} & P_{23} \\ P_{30} & P_{31} & P_{32} & P_{33} \end{bmatrix} \times \begin{bmatrix} T_{00} & T_{10} & T_{20} & T_{30} \\ T_{01} & T_{11} & T_{21} & T_{31} \\ T_{02} & T_{12} & T_{22} & T_{32} \\ T_{03} & T_{13} & T_{23} & T_{33} \end{bmatrix} \qquad (5)$$
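For reference, a behavioral Python model of the two-stage transform of (3)-(5) with intermediate truncation might look as follows. The shift amounts shift1 = K + B − 9 and shift2 = K + 6 are HEVC-style choices and are our assumption here, since the text states only that the scaling depends on B and K.

```python
import numpy as np

# HEVC-style 4-point integer kernel (matches the standard's matrix).
T = np.array([[64,  64,  64,  64],
              [83,  36, -36, -83],
              [64, -64, -64,  64],
              [36, -83,  83, -36]])

def forward_2d_transform(X, T, B=8):
    """Y = T * X * T' per (3)-(5), truncating after each stage so the
    intermediate data stays within 16 bits (assumed HEVC-style shifts)."""
    K = int(np.log2(T.shape[0]))
    shift1 = K + B - 9        # first-stage truncation (assumption)
    shift2 = K + 6            # second-stage truncation (assumption)
    P = (T @ X) >> shift1     # row-wise 1D transform, as in (4)
    Y = (P @ T.T) >> shift2   # column-wise transform by T', as in (5)
    return Y

X = np.random.randint(-255, 256, (4, 4))   # residual block
Y = forward_2d_transform(X, T)
```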
Based on the above computation steps, the architecture is designed as shown in Figure 2. The complete pipelined architecture has four main modules: i) input pixel control unit, ii) line buffers, iii) processing unit, and iv) output buffer. The transform accelerator module has a very regular structure for both the 1D and 2D forward/inverse transforms, which yields more efficient pipelining of the sub-modules as well as a higher maximum operating frequency.
3.1.2. Line-buffers
As shown in the hardware architecture framework in Figure 2, line buffers are used at the input stages of the 1D and 2D transforms. The total depth of the line buffers at the input of both transforms equals the column count of the input frame. The line buffer before the 1D transformation is 8 bits wide, as it stores input pixel values, while the line buffer at the input of the 2D transformation is 16 bits wide, as it stores 1D-DCT outputs. The register transfer level (RTL) elaborated line buffer module is shown in Figure 4.
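As a rough behavioral sketch of these buffers (our own model, not the RTL of Figure 4): N line buffers of depth equal to the image width are written serially, row by row, and then read N values at a time, column-wise, by the transform stage.

```python
class LineBuffers:
    """Behavioral model: n_rows buffers, each image_width deep."""
    def __init__(self, n_rows, image_width):
        self.n_rows, self.width = n_rows, image_width
        self.rows = [[0] * image_width for _ in range(n_rows)]
        self.wr_count = 0

    def write(self, value):
        """Serial write: fill buffer 0 across its full width, then 1, ..."""
        row, col = divmod(self.wr_count, self.width)
        self.rows[row][col] = value
        self.wr_count = (self.wr_count + 1) % (self.n_rows * self.width)

    def read_column(self, col):
        """Parallel read of one n_rows-high column for the transform."""
        return [self.rows[r][col] for r in range(self.n_rows)]

lb = LineBuffers(n_rows=4, image_width=512)   # 4x4 blocks, 512-wide frame
```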
$$h = \frac{h_{actual}}{h_{theoretical}} \in [0,1] \qquad (6)$$

where $h_{theoretical}$ is the theoretical maximum hypotenuse value in pixels and $h_{actual}$ is the actual maximum hypotenuse value in pixels.
Calculating $h$ for the resolution of a display unit depends on $h_{theoretical}$ and $h_{actual}$, defined by (7) and (8):

$$h_{theoretical} = \sqrt{x_{max}^2 + y_{max}^2} \qquad (7)$$

$$h_{actual} = \sqrt{x^2 + y^2} \qquad (8)$$
The theoretical maximum pixel values $x_{max}$ and $y_{max}$ are based on the maximum possible image size permitted in the JPEG standard, 65535×65535 [14]. Substituting $x_{max} = 65535$ and $y_{max} = 65535$ into (7) gives $h_{theoretical} = 92680.4858$. The values of x and y for 2K and 4K resolution displays, as per the standard HD sizes considered in this work, are 1920×1080 and 3840×2160, respectively. Table 1 shows the $h_{actual}$ and $h$ values for both 2K and 4K resolutions.
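A quick numeric check of (6)-(8) for the two display sizes considered (these values should match Table 1):

```python
import math

h_theoretical = math.hypot(65535, 65535)      # (7): 92680.4858...
for x, y in [(1920, 1080), (3840, 2160)]:     # 2K and 4K displays
    h_actual = math.hypot(x, y)               # (8)
    h = h_actual / h_theoretical              # (6)
    print(f"{x}x{y}: h_actual = {h_actual:.2f}, h = {h:.5f}")
# 1920x1080: h_actual = 2202.91, h = 0.02377
# 3840x2160: h_actual = 4405.81, h = 0.04754
```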
The resolution parameter $w$, expressed in terms of the normalized hypotenuse $h$, is quantified in (9). From (9) it is clear that $w$ is controlled entirely by the normalized distribution of the $h_{theoretical}$ and $h$ values.

$$w = \frac{h}{h_{theoretical}} \in [0,1] \qquad (9)$$
The perceptual importance of each transform coefficient position is measured by its normalized Euclidean distance from the DC coefficient, given by (10):

$$Eud_{i,j} = \frac{\sqrt{(i_2 - i_1)^2 + (j_2 - j_1)^2}}{\sqrt{(i_{max} - i_1)^2 + (j_{max} - j_1)^2}} \qquad (10)$$

where $(i_1, j_1)$ is the position of the DC coefficient, $(i_2, j_2)$ is the position of the current AC coefficient, and $(i_{max}, j_{max})$ is the position of the AC coefficient farthest from the DC coefficient for the considered block size. Finally, a parameter $M_{i,j}$ that relates the display resolution and the perceptually important transform coefficient positions in the considered block size is given by (11).
$$M_{i,j}(Eud_{i,j}, w) = e^{-\left[\frac{Eud_{i,j}}{1+w}\right]} \in [0,1] \qquad (11)$$
Now this $M_{i,j}$ is applied to each element of the matrix $H(x, y)$ located at position $(i, j)$, denoted $H_{i,j}$ as defined in [21], to produce an adaptive 2D FWM $H'(i, j)$.
$$H'_{i,j} = H_{i,j}^{\,M_{i,j}} \qquad (12)$$
$$QM = \frac{QP}{H'(x,y)} \qquad (13)$$
Finally, the adaptive QM in (13) is obtained from (12) and the quantization parameter, which controls the quantization step size. This matrix is used to quantize the transform coefficients by right-shift operations instead of regular element-wise divisions. The same QM is shared between the forward and inverse paths, as shown in Figure 1. Hence, during inverse quantization, each transform coefficient level is shifted left by the corresponding value specified in the QM instead of using regular multiplications.
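Putting (9)-(13) together, a minimal sketch of the QM derivation is given below. The base FWM H is defined in [21] and not reproduced in this paper, so `base_fwm` is a hypothetical placeholder; the mapping of QM entries to integer right-shift amounts by rounding log2(QM) is likewise our assumption, since the text states only that quantization uses right shifts.

```python
import numpy as np

def derive_qm(base_fwm, QP, h, h_theoretical):
    """Adaptive QM per (9)-(13) for one square block size."""
    N = base_fwm.shape[0]
    w = h / h_theoretical                           # (9)
    d_max = np.hypot(N - 1, N - 1)                  # DC at (0,0), last AC at (N-1,N-1)
    qm = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            eud = np.hypot(i, j) / d_max            # (10)
            m = np.exp(-eud / (1 + w))              # (11)
            qm[i, j] = QP / (base_fwm[i, j] ** m)   # (12) and (13)
    return qm

def qm_to_shifts(qm):
    """Assumed mapping of QM entries to right-shift amounts."""
    return np.maximum(np.round(np.log2(qm)), 0).astype(int)

base_fwm = np.full((4, 4), 2.0)   # hypothetical placeholder for H of [21]
shifts = qm_to_shifts(derive_qm(base_fwm, QP=32, h=0.02377,
                                h_theoretical=92680.4858))
# Quantizing a 2D-transformed block Y then reduces to: level = Y >> shifts
```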
The pixel values of the image are read as hexadecimal characters using the file operations supported in Verilog HDL, included in the test bench code. The simulation waveforms obtained from the Vivado Simulator 2022.2 are shown starting from the data read into the line buffers, through the transformation and quantization modules, to the write into the output buffer. In the following simulation waveforms, the stage-wise operation of the accelerator modules is illustrated for a block size of four and an image size of 512×512 with a pixel depth of eight bits. The line buffers are loaded with input pixels row-wise, sequentially one after the other, as shown in Figure 7(a); reading four pixel values column-wise to feed the transformation operation is shown in Figure 7(b).
Figure 7. Data movement from source to processing unit through line buffers (a) serial data writing to line
buffers and (b) parallel data reading from line buffers
As the input image has 512 columns, a total of 512×4 = 2048 pixels is first written into the line buffers and then processed by the 1D transform module as per (4). As shown in Figure 8, the 1D transformation of the first four pixels takes three clock cycles: multiplication and accumulation of the four partial products happens in one clock cycle, and moving the data through the pipeline registers takes two more clock cycles. Thus, obtaining one set of four row-transformed DCT outputs takes seven clock cycles. Each of these 1D-DCT row values is then transposed using the de-multiplexer and second-stage line buffers, as shown in Figure 9. Based on the de-multiplexer select input, the de-multiplexer output is fed to the line buffers at every clock cycle and accessed by the 2D-transformation module for column-wise transformation as per (5), as shown in the architecture of Figure 2. At every clock cycle, the 1D-transformed pixels are multiplied with the transposed kernel coefficients to produce four partial-product columns, as shown in Figure 10. All these partial products are accumulated row-wise at the fourth clock cycle to output four 2D-transformed results.
Figure 9. 1D transformation output transpose using DEMUX and second stage line-buffers
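A behavioral sketch of this transpose step (the routing below is our model, not the RTL of Figure 9):

```python
def transpose_via_demux(row_outputs, n):
    """Each cycle one 1D-DCT output row arrives; the DEMUX select steers
    element r of the row into line buffer r, so buffer r accumulates
    column r of the 1D result, i.e. row r of its transpose."""
    buffers = [[] for _ in range(n)]        # second-stage line buffers
    for row in row_outputs:                 # one row per clock cycle
        for sel, value in enumerate(row):   # DEMUX select = element index
            buffers[sel].append(value)
    return buffers

P = [[83, 29, -12, 4], [64, -18, 7, -2],
     [35, 11, -9, 6], [21, -5, 3, -1]]      # example 1D-DCT rows
PT = transpose_via_demux(P, 4)              # PT[r] = column r of P
```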
Hence, in the proposed architecture, multiple pixel data are made available near the processing module within the line buffers, based on the selected block size, and then processed to produce multiple outputs in parallel. The QM obtained from (13) for the selected quantization parameter is used to quantize every 2D-transformed output pixel by a right-shift operation. The proposed design thus has four pipeline stages: 1D transformation, data transpose, 2D transformation, and quantization. In the proposed pipelined architecture, processing the selected 4×4 block size, shown in Figure 11, has a latency of 15 clock cycles from input to output. Table 2 shows the total clock cycles required to process the transformation and quantization of block sizes from 4×4 to 32×32, with the pipeline latency of each stage.
Figure 11. 2D-transformation and quantization pipelined operation output for block size 4×4
made available in advance near the transformation module by having sufficient on-chip line buffers, and the second is the processing of multiple data in parallel by the four-stage pipelined architecture. The flip-flop and BRAM resource utilization of the proposed architecture in Table 4 is due to the line buffers at the input and output stages of the transformation and quantization operations. The synthesis results of the 2D unified architecture supporting the VVC multiple transform cores in [14] show higher resource utilization and more clock cycles to process the maximum block size of 32×32 on the targeted Arria-10 SoC FPGA.
5. CONCLUSION
The proposed pipelined architecture presents an integer-approximated 2D transformation with perceptual models integrated into the quantization process to enhance the visual quality of compressed videos. The reconfigurable architecture supports various block sizes for unified transformation and quantization using only limited hardware resources on the targeted FPGA. The overall latency from input to output for processing the maximum block size is low due to the novel data acceleration technique. In future work, the goal is to increase performance by enhancing the data access and data processing methodology to support real-time processing of 2K and 4K videos with run-time selection of the processing parameters. The high-level implementation of the transformation and quantization accelerators, and its testing on a heterogeneous platform with performance analysis, is also planned as future work.
REFERENCES
[1] G. J. Sullivan, J. R. Ohm, W. J. Han, and T. Wiegand, “Overview of the high efficiency video coding (HEVC) standard,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, 2012, doi:
10.1109/TCSVT.2012.2221191.
[2] B. Bross et al., “Overview of the versatile video coding (VVC) standard and its applications,” IEEE Transactions on Circuits and
Systems for Video Technology, vol. 31, no. 10, pp. 3736–3764, 2021, doi: 10.1109/TCSVT.2021.3101953.
[3] F. Bossen, K. Suhring, A. Wieckowski, and S. Liu, “VVC complexity and software implementation analysis,” IEEE Transactions
on Circuits and Systems for Video Technology, vol. 31, no. 10, pp. 3765–3778, 2021, doi: 10.1109/TCSVT.2021.3072204.
[4] Xilinx, “Zynq UltraScale+ MPSoC: embedded design tutorial,” UG1209, Xilinx, 2018.
[5] M. Chen, Y. Zhang, and C. Lu, “Efficient architecture of variable size HEVC 2D-DCT for FPGA platforms,” AEU - International
Journal of Electronics and Communications, vol. 73, pp. 1–8, 2017, doi: 10.1016/j.aeue.2016.12.024.
[6] P. K. Meher, S. Y. Park, B. K. Mohanty, K. S. Lim, and C. Yeo, “Efficient integer DCT architectures for HEVC,” IEEE Transactions
on Circuits and Systems for Video Technology, vol. 24, no. 1, pp. 168–178, 2014, doi: 10.1109/TCSVT.2013.2276862.
[7] W. Zhao, T. Onoye, and T. Song, “High-performance multiplierless transform architecture for HEVC,” in 2013 IEEE International
Symposium on Circuits and Systems (ISCAS2013), IEEE, 2013, pp. 1668–1671, doi: 10.1109/ISCAS.2013.6572184.
[8] R. Conceição, J. C. D. Souza, R. Jeske, B. Zatt, M. Porto, and L. Agostini, “Low-cost and high-throughput hardware design for the
HEVC 16×16 2-D DCT transform,” Journal of Integrated Circuits and Systems, vol. 9, no. 1, pp. 25–35, 2014, doi:
10.29292/jics.v9i1.386.
[9] A. Kammoun, W. Hamidouche, F. Belghith, J. F. Nezan, and N. Masmoudi, “Hardware design and implementation of adaptive
multiple transforms for the versatile video coding standard,” IEEE Transactions on Consumer Electronics, vol. 64, no. 4, pp. 424–
432, 2018, doi: 10.1109/TCE.2018.2875528.
[10] M. J. Garrido, F. Pescador, M. Chavarrias, P. J. Lobo, and C. Sanz, “A 2-D multiple transform processor for the versatile video
coding standard,” IEEE Transactions on Consumer Electronics, vol. 65, no. 3, pp. 274–283, 2019, doi: 10.1109/TCE.2019.2913327.
[11] M. J. Garrido, F. Pescador, M. Chavarrias, P. J. Lobo, C. Sanz, and P. Paz, “An FPGA-based architecture for the versatile video
coding multiple transform selection core,” IEEE Access, vol. 8, pp. 81887–81903, 2020, doi: 10.1109/ACCESS.2020.2991299.
[12] A. C. Mert, E. Kalali, and I. Hamzaoglu, “High performance 2D transform hardware for future video coding,” IEEE Transactions
on Consumer Electronics, vol. 63, no. 2, pp. 117–125, 2017, doi: 10.1109/TCE.2017.014862.
[13] Y. Fan, Y. Zeng, H. Sun, J. Katto, and X. Zeng, “A pipelined 2D transform architecture supporting mixed block sizes for the VVC
standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 9, pp. 3289–3295, 2020, doi:
10.1109/TCSVT.2019.2934752.
[14] A. Kammoun et al., “Forward-inverse 2D hardware implementation of approximate transform core for the VVC standard,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 30, no. 11, pp. 4340–4354, 2020, doi:
10.1109/TCSVT.2019.2954749.
[15] J. Chen, S. Liu, G. Deng, and S. Rahardja, “Hardware efficient integer discrete cosine transform for efficient image/video
compression,” IEEE Access, vol. 7, pp. 152635–152645, 2019, doi: 10.1109/ACCESS.2019.2947269.
[16] P. K. Meher, S. K. Lam, T. Srikanthan, D. H. Kim, and S. Y. Park, “Area-time efficient two-dimensional reconfigurable integer
DCT architecture for HEVC,” Electronics, vol. 10, no. 5, pp. 1–11, 2021, doi: 10.3390/electronics10050603.
[17] D. F. Chiper and A. Cracan, “An efficient algorithm and architecture for the VLSI implementation of integer DCT that allows an
efficient incorporation of the hardware security with a low overhead,” Applied Sciences, vol. 13, no. 12, 2023, doi:
10.3390/app13126927.
[18] C. Y. Wang, S. M. Lee, and L. W. Chang, “Designing JPEG quantization tables based on human visual system,” Signal Processing:
Image Communication, vol. 16, no. 5, pp. 501–506, 2001, doi: 10.1016/S0923-5965(00)00012-6.
[19] M. Budagavi, A. Fuldseth, and G. Bjøntegaard, “HEVC transform and quantization,” in High Efficiency Video Coding (HEVC):
Algorithms and Architectures, 2014, pp. 141–169, doi: 10.1007/978-3-319-06895-4_6.
[20] N. Casali, M. Naccari, M. Mrak, and R. Leonardi, “Adaptive quantisation in HEVC for contouring artefacts removal in UHD
content,” in 2015 IEEE International Conference on Image Processing (ICIP), IEEE, 2015, pp. 2577–2581, doi:
10.1109/ICIP.2015.7351268.
[21] L. Prangnell and V. Sanchez, “Adaptive quantization matrices for HD and UHD resolutions in scalable HEVC,” in 2016 Data
Compression Conference (DCC), IEEE, 2016, pp. 626–626, doi: 10.1109/DCC.2016.47.
[22] L. Prangnell, “Frequency-dependent perceptual quantization for visually lossless compression applications,” arXiv - Computer
Science, pp. 1–26, 2019.
[23] D. Grois and A. Giladi, “Perceptual quantization matrices for high dynamic range h.265/MPEG-HEVC video coding,” in
Applications of Digital Image Processing XLII, SPIE, 2020, p. 24, doi: 10.1117/12.2525406.
[24] X. Shang, G. Wang, X. Zhao, Y. Zuo, J. Liang, and I. V. Bajić, “Weighting quantization matrices for HEVC/H.265-coded RGB
videos,” IEEE Access, vol. 7, pp. 36019–36032, 2019, doi: 10.1109/ACCESS.2019.2902173.
[25] Y. Zhang, L. Zhu, G. Jiang, S. Kwong, and C. C. J. Kuo, “A survey on perceptually optimized video coding,” ACM Computing
Surveys, vol. 55, no. 12, 2023, doi: 10.1145/3571727.
BIOGRAPHIES OF AUTHORS