An Efficient Multi-Level 2D DWT Architecture for Parallel Tile Block Processing with Integrated Quantization Modules

Li, Qitao; Zhang, Wei; Wu, Zhuolun; Dai, Yuzhou; Liu, Yanyan

doi:10.3390/electronics13234668


by Qitao Li ¹, Wei Zhang ¹,*, Zhuolun Wu ¹, Yuzhou Dai ¹ and Yanyan Liu ²

¹ School of Microelectronics, Tianjin University, Tianjin 300072, China

² College of Electronic Information and Optical Engineering, Nankai University, Tianjin 300071, China

* Author to whom correspondence should be addressed.

Electronics 2024, 13(23), 4668; https://doi.org/10.3390/electronics13234668

Submission received: 23 October 2024 / Revised: 18 November 2024 / Accepted: 25 November 2024 / Published: 26 November 2024


Abstract: A multi-level 2D discrete wavelet transform (DWT) architecture for JPEG2000 is proposed that increases speed by processing multiple tile blocks in parallel. Based on the lifting scheme, a folded architecture and an unfolded architecture, each with a critical path delay of only one multiplier, are designed to increase the throughput rate. Connecting the folded and unfolded architectures through a pipeline ensures uniform throughput rates across all DWT levels within a single clock domain. Computational resource consumption is reduced by adjusting the timing so that one folded architecture processes levels three to five of the DWT for three tile blocks, and a transposing module requiring merely six registers is devised to decrease storage resource consumption. The quantization module, crucial for code-word control in JPEG2000, is integrated into the scaling module with minimal additional resource expenditure. Compared to existing architectures, the analysis demonstrates that the proposed architecture exhibits enhanced hardware efficiency, with a reduction in transistor-delay-product (TDP) of no less than 14.69%. Synthesis results further reveal an area reduction of at least 26.64% and a decrease in area-delay-product (ADP) of at least 29.89%. Results from FPGA implementation indicate a significant decrease in resource utilization.

Keywords:

2D DWT; multi-level DWT; lifting scheme; quantization; JPEG2000

1. Introduction

JPEG2000 stands as a highly effective image compression standard [1], extensively utilized in the realms of medical and satellite imagery. The fundamental enhancement in JPEG2000 is the substitution of the discrete cosine transform (DCT) with the discrete wavelet transform (DWT), as it demonstrates an improvement in coding efficiency and image quality [2]. DWT is a multiresolution analysis tool adept at decomposing a signal into distinct subbands with both time and frequency information [3], which finds extensive application across image processing and image encryption domains [4,5,6]. The performance of 9/7 DWT hardware depends on the precision and efficiency of the quantization filters. Higher precision ensures compression quality close to unquantized data but requires more hardware and processing time [7].

In JPEG2000, images are typically segmented into smaller tile blocks with a typical size of $128 \times 128$. Subsequently, these tiles undergo compression using a five-level two-dimensional (2D) DWT to achieve optimal compression efficacy [8]. However, to meet real-time computation demands, contemporary multi-level 2D DWT architectures face challenges with heightened computational complexity and increased memory resource requirements [9]. Consequently, researching and developing a more efficient VLSI architecture for multi-level DWT is essential.

In recent years, DWT architectures have been categorized into the convolution-based scheme [9,10,11] and lifting-based scheme [12,13,14,15,16,17]. The lifting-based architecture is favored over its convolution-based counterpart due to its reduced computational complexity, narrower internal bit widths, and diminished memory requirements [18,19]. In the domain of single-level DWT, George [10] introduced architectures termed horizontal-FrWF and vertical-FrWF based on a convolutional framework. These architectures eliminate the dependency of FrWF computation’s memory requirements on image resolution, effectively reducing both area and memory needs. Naseer [12] employed a lifting-based architecture with a CSD-DA structure to decrease the complexity of multiplication operations, enhancing the clock frequency. Mohamed [17] achieved a significant reduction in resource utilization and obtained a higher PSNR through hardware–software codesign with partial reconfiguration. However, the resource demands of single-level architectures increase exponentially with the number of levels when processing multi-level DWT, significantly reducing encoding efficiency.

Multi-level DWT architectures have also been extensively researched. Mohanty [20] proposed a line-based parallel lifting modular and pipelined architecture, effectively obviating the requirement for line and frame memories. However, this design results in a critical path delay (CPD) of $T_m + 2T_a$, where $T_a$ represents the delay of an adder and $T_m$ signifies the delay associated with a multiplier. Wu [21] utilized a CSD multiplier to lower the critical path delay to $T_a$, though it complicates the internal structure and control signals. Zhang [22] developed a parallel architecture integrating first-stage unfolding and internal semi-folding, combined with folding, to achieve synchronization with the rate and high throughput requirements of three-level DWT. However, this architecture lacks flexibility in modifying parallelism and DWT levels. Furthermore, when managing multi-level DWT with a high number of levels, Refs. [21] and [22] both encountered the issue of multiple clock domains. This results in a structure that fails to reach optimal efficiency, consequently consuming additional clock cycles. Hu [23] proposed a scanning method based on overlapping stripes and developed a scalable pipelined multilevel DWT architecture, which eliminates the need for frame memory, yet the seven-pixel overlap in scans significantly increases the required input RAM size.

Quantization modules have been thoroughly examined, with a primary focus on minimizing the loss of image quality during the quantization process. Kotteri [7] explored a two-stage cascading quantization structure that enhances filter performance by compensating quantization coefficients. Chaker [24] employed preprocessing steps to extract image features, enabling uniform scalar quantization that reduces loss in image compression quality. Moreno [25] introduced alternative estimations and normalized quantization coefficients to improve the quality of image encoding. However, the studies in [7,24,25] improved quality at the cost of increased hardware resource consumption. Furthermore, in these designs the quantization module functions as a separate component, leading to additional clock cycles and further hardware overhead.

In this article, an efficient multi-level 2D DWT architecture using a lifting scheme capable of processing multiple tile blocks concurrently is proposed, with the quantization module integrated into the scaling module with minimal additional resource consumption. The proposed architecture synchronizes throughput within a single clock domain through the integration of both folded and unfolded architectures, which reduce the CPD to $T_m$. In the architecture, level 1 employs a three-way parallel unfolded architecture, whereas level 2 utilizes a three-way parallel folded architecture. For levels 3 to 5, only one folded architecture is implemented to compute three-way parallel DWT, diminishing the consumption of computational resources, and a simplified control module is designed to minimize system complexity. The quantization module is integrated with the scaling module by adding only a few selectors, effectively avoiding additional hardware resource consumption and clock cycle overhead. Additionally, a novel transposing module requiring merely six registers is introduced to minimize the requirements of the transposing buffer. Consequently, this approach markedly enhances hardware efficiency.

The rest of this article is organized as follows. Section 2 explains the lifting algorithm, DWT process, and the integration of the quantization formula. Section 3 introduces the proposed multi-level 2D DWT architecture. Section 4 discusses implementation results and comparisons. Conclusions and a discussion are in Section 5.

2. Refinements to DWT and Quantization Formulas

2.1. Algorithm of Lifting-Based 2D DWT

The column and row filters of the 2D DWT apply the same lifting scheme for the 9/7 filter, once in each direction, encompassing two distinct lifting steps and one scaling step. To shorten the critical path delay and avoid additional registers, Huang [26] introduced a formula tailored for the flipping scheme, delineated as follows:

$$\frac{1}{α} y(2n+1) = \frac{1}{α} x(2n+1) + x(2n) + x(2n+2) \tag{1}$$

$$\frac{1}{αβ} y(2n) = \frac{1}{αβ} x(2n) + \frac{1}{α} y(2n-1) + \frac{1}{α} y(2n+1) \tag{2}$$

$$\frac{1}{αβγ} H(2n+1) = \frac{1}{αβγ} y(2n+1) + \frac{1}{αβ} y(2n) + \frac{1}{αβ} y(2n+2) \tag{3}$$

$$\frac{1}{αβγδ} L(2n) = \frac{1}{αβγδ} y(2n) + \frac{1}{αβγ} H(2n-1) + \frac{1}{αβγ} H(2n+1) \tag{4}$$

$$S_{H}(2n+1) = αβγK \cdot \frac{1}{αβγ} H(2n+1) \tag{5}$$

$$S_{L}(2n) = \frac{αβγδ}{K} \cdot \frac{1}{αβγδ} L(2n) \tag{6}$$

where $α$, $β$, $γ$, and $δ$ are the lifting coefficients, with $α = -1.586134342$, $β = -0.052980118$, $γ = 0.882911075$, and $δ = 0.443506852$; $K = 1.230174105$ is the scaling coefficient; and $x(n)$ represents the pixel value of the input image, with $x(2n+1)$ and $x(2n)$ corresponding to the odd- and even-indexed inputs, respectively. The intermediate variables of the lifting process are represented by $y(n)$, $H(2n+1)$, and $L(2n)$. The lifting process culminates in $S_{H}(2n+1)$ and $S_{L}(2n)$, which are the high-frequency and low-frequency components output by the 9/7 wavelet transform, respectively.
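The flipping form of Equations (1)–(6) can be checked numerically against the conventional 9/7 lifting recurrence. The following is a minimal Python sketch, not the hardware datapath: the function names are illustrative, and whole-sample symmetric boundary extension is an assumption, since the boundary rule is not stated here. Each flipped step uses one constant multiplication and two additions, which is the property the architecture relies on.

```python
# Lifting and scaling coefficients quoted in the text.
ALPHA, BETA, GAMMA, DELTA = -1.586134342, -0.052980118, 0.882911075, 0.443506852
K = 1.230174105


def mirror(i, n):
    """Whole-sample symmetric extension of index i into 0..n-1 (assumed boundary rule)."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def lift_conventional(x):
    """Conventional 9/7 lifting: v[i] += c * (v[i-1] + v[i+1]), then scaling."""
    n, v = len(x), list(x)
    for c, parity in ((ALPHA, 1), (BETA, 0), (GAMMA, 1), (DELTA, 0)):
        for i in range(parity, n, 2):
            v[i] += c * (v[mirror(i - 1, n)] + v[mirror(i + 1, n)])
    low = [v[i] / K for i in range(0, n, 2)]
    high = [v[i] * K for i in range(1, n, 2)]
    return low, high


def lift_flipped(x):
    """Flipping form, Equations (1)-(6): one multiply and two adds per step."""
    n, v = len(x), list(x)
    # Multipliers applied to the stored (already scaled) centre sample in
    # Equations (1)-(4); the neighbours are added without multiplication.
    steps = (
        (1 / ALPHA, 1),
        (1 / (ALPHA * BETA), 0),
        (1 / (BETA * GAMMA), 1),
        (1 / (GAMMA * DELTA), 0),
    )
    for m, parity in steps:
        for i in range(parity, n, 2):
            v[i] = m * v[i] + v[mirror(i - 1, n)] + v[mirror(i + 1, n)]
    abg = ALPHA * BETA * GAMMA
    low = [abg * DELTA / K * v[i] for i in range(0, n, 2)]  # Equation (6)
    high = [abg * K * v[i] for i in range(1, n, 2)]         # Equation (5)
    return low, high
```

For any input of length two or more, both functions should agree to within floating-point rounding, and a constant input should give high-frequency outputs that are zero up to the precision of the quoted coefficients.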

The execution of the 2D DWT necessitates an initial column transformation succeeded by a subsequent row transformation, thereby requiring a pair of sequential applications of the flipping operation. Within the second iteration, the high-frequency component $H(2n+1)$ and the low-frequency component $L(2n)$, derived from the initial calculation, serve as distinct inputs. Specifically, the high-frequency output from $H(2n+1)$ constitutes the high-high ($HH$) component, whereas its low-frequency counterpart results in the high-low ($HL$) component. Similarly, the high-frequency output from $L(2n)$ forms the low-high ($LH$) component and the corresponding low-frequency output generates the low-low ($LL$) component.

In 2D DWT, the $LL$ component contains the majority of the signal’s energy and information, and the multi-level DWT methodology involves employing this $LL$ component from a given level as the input data $x(n)$ for the ensuing level’s analysis. By iteratively applying the 2D DWT, one acquires the subband components $HH$, $HL$, $LH$, and $LL$ at higher hierarchical levels. Figure 1 elucidates the principle of the 2D DWT and simultaneously depicts the decomposition process via an exemplification of a two-level DWT.
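The column-then-row decomposition and the reuse of $LL$ as the next level's input can be sketched in software as follows. This is an illustrative Python model only: it uses the conventional form of the 9/7 lifting recurrence rather than the flipping form, assumes even dimensions and symmetric boundary extension, and all function names are hypothetical.

```python
ALPHA, BETA, GAMMA, DELTA = -1.586134342, -0.052980118, 0.882911075, 0.443506852
K = 1.230174105


def dwt97_1d(x):
    """One 1D 9/7 lifting pass; returns (low, high). Even length >= 2 assumed."""
    n, v = len(x), list(x)
    for c, parity in ((ALPHA, 1), (BETA, 0), (GAMMA, 1), (DELTA, 0)):
        for i in range(parity, n, 2):
            left = v[i - 1] if i > 0 else v[1]           # mirror at the start
            right = v[i + 1] if i + 1 < n else v[n - 2]  # mirror at the end
            v[i] += c * (left + right)
    return [s / K for s in v[0::2]], [s * K for s in v[1::2]]


def dwt97_2d(img):
    """One 2D level: column transform first, then row transform."""
    cols = [dwt97_1d(list(col)) for col in zip(*img)]      # filter every column
    L = [list(r) for r in zip(*(low for low, _ in cols))]  # back to row-major
    H = [list(r) for r in zip(*(high for _, high in cols))]
    rows_l = [dwt97_1d(r) for r in L]
    rows_h = [dwt97_1d(r) for r in H]
    LL = [lo for lo, _ in rows_l]   # low-frequency output of L
    LH = [hi for _, hi in rows_l]   # high-frequency output of L
    HL = [lo for lo, _ in rows_h]   # low-frequency output of H
    HH = [hi for _, hi in rows_h]   # high-frequency output of H
    return LL, LH, HL, HH


def multilevel_dwt(tile, levels):
    """Feed the LL subband of each level into the next level."""
    details, ll = [], tile
    for _ in range(levels):
        ll, lh, hl, hh = dwt97_2d(ll)
        details.append((lh, hl, hh))
    return ll, details
```

With this model, a $128 \times 128$ tile decomposed over five levels would leave a $4 \times 4$ $LL$ subband, since each level halves both dimensions.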

2.2. Integration of Quantization and Scaling Formulas

Quantization is a crucial aspect of bit-rate control in JPEG2000, where it compresses wavelet coefficients further to eliminate redundancy in image details imperceptible to the human eye. The JPEG2000 standard specifies a method of dead-zone scalar quantization [27], as demonstrated in Equation (7). This method relies on subband information derived from wavelet transformations to determine the quantization step sizes appropriate for different subbands:

$$q = \operatorname{sign}(x) \left\lfloor \frac{|x|}{Δ} \right\rfloor \tag{7}$$

where $q$ is the quantization result, $x$ represents the components of each level of subbands processed by DWT, $\lfloor \cdot \rfloor$ represents rounding down to the nearest integer, $Δ$ denotes the quantization step size for the subbands, and $\operatorname{sign}(x)$ indicates the sign of $x$.
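As a small illustration of Equation (7), the dead-zone behaviour can be written directly in Python. The function name and the example step size are illustrative assumptions, not values from the paper.

```python
import math


def deadzone_quantize(x, step):
    """Dead-zone scalar quantization, Equation (7): sign(x) * floor(|x| / step)."""
    sign = -1 if x < 0 else 1
    return sign * math.floor(abs(x) / step)
```

Because the magnitude is rounded toward zero, every coefficient with $|x| < Δ$ maps to 0, so the zero bin is twice as wide as the others; this is the "dead zone".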

$$q = h \times S(n) \tag{8}$$

$$q(HH) = h_{HH}^{*} \times \frac{1}{αβγ} HH(2n+1) \tag{9}$$

$$q(LH) = h_{LH}^{*} \times \frac{1}{αβγ} LH(2n+1) \tag{10}$$

$$q(HL) = h_{HL}^{*} \times \frac{1}{αβγδ} HL(2n) \tag{11}$$

$$q(LL) = h_{LL}^{*} \times \frac{1}{αβγδ} LL(2n) \tag{12}$$

$$h_{HH}^{*} = (αβγK)^{2} \times h_{HH} \tag{13}$$

$$h_{LH}^{*} = (αβγ)^{2} δ \times h_{LH} \tag{14}$$

$$h_{HL}^{*} = (αβγ)^{2} δ \times h_{HL} \tag{15}$$

$$h_{LL}^{*} = \left(\frac{αβγδ}{K}\right)^{2} \times h_{LL} \tag{16}$$

In a five-level 2D DWT, each wavelet subband has different quantization step sizes, though the step size remains constant within each subband. The quantization result is obtained by multiplying the DWT wavelet coefficients by the corresponding quantization coefficients, converting the formula from Equation (7) to Equation (8). In Equation (8), $S(n)$ represents the subband components processed by DWT and $h$ denotes the quantization coefficient. The quantization coefficients for different levels and subbands are shown in Table 1. Since Equations (5), (6), and (8) all involve multiplication, their coefficients can be combined into a single operation. By replacing Equations (5) and (6) in the column transformation with Equations (9)–(12), the scaling and quantization steps can be performed simultaneously. In Equations (9) to (12), the $HH(2n+1)$ and $HL(2n)$ components are derived by applying row transformation to the column transformed $H$ component. Similarly, the $LH(2n+1)$ and $LL(2n)$ components result from row transformation on the column transformed $L$ component. In the scaling process following DWT encoding, the $HH$ subband component is multiplied twice by the coefficient $αβγK$ from Equation (5), while the $LL$ subband component is multiplied twice by the coefficient in Equation (6). The $LH$ and $HL$ subband components are each multiplied once by the coefficients in Equations (5) and (6), respectively. The combined coefficient is obtained by applying these multiplications to the original quantization coefficient, as shown in Equations (13)–(16). The specific values of the combined coefficients $h^{*}$ for each subband are provided in Table 1.

By combining quantization with the scaling step in the column transformation of DWT, as described in Equations (9)–(16), both DWT and quantization in JPEG2000 encoding are completed without adding hardware resources or clock cycles to the original DWT module. This approach effectively improves the efficiency of JPEG2000 encoding. Furthermore, encoding with different quantization step sizes can be achieved by modifying the $h$ values in Table 1 and computing the corresponding $h^{*}$ values using Equations (13)–(16) for substitution, demonstrating the flexibility of this approach.
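The folding of the two scaling multiplications into the quantization coefficient, Equations (13)–(16), can be sketched as below. The function name is illustrative and the $h$ values are hypothetical placeholders, since the entries of Table 1 are not reproduced here; only the algebra is taken from the text.

```python
# Lifting and scaling coefficients quoted in the text.
ALPHA, BETA, GAMMA, DELTA = -1.586134342, -0.052980118, 0.882911075, 0.443506852
K = 1.230174105


def combined_coefficients(h):
    """Map per-subband quantization coefficients h to h* per Equations (13)-(16)."""
    abg = ALPHA * BETA * GAMMA
    return {
        "HH": (abg * K) ** 2 * h["HH"],          # Equation (13)
        "LH": abg ** 2 * DELTA * h["LH"],        # Equation (14)
        "HL": abg ** 2 * DELTA * h["HL"],        # Equation (15)
        "LL": (abg * DELTA / K) ** 2 * h["LL"],  # Equation (16)
    }


# Hypothetical quantization coefficients, for illustration only.
h_example = {"HH": 0.25, "LH": 0.5, "HL": 0.5, "LL": 1.0}
h_star = combined_coefficients(h_example)
```

Multiplying an unscaled coefficient once by the matching entry of `h_star` should give the same value as applying the column scaling, the row scaling, and the quantization coefficient as three separate multiplications, which is why a single multiplier suffices for both steps.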

3. Proposed Architecture for Multi-Level 2D DWT

3.1. Folded and Unfolded Architecture

Since each flipping equation is computed with two additions and one multiplication, and assuming $T_m \approx 2T_a$, a three-input basic processing unit (BPU) has been designed. This BPU, delineated by the dashed box in Figure 2 and Figure 3, executes two additions and one multiplication concurrently, thereby reducing the CPD to $T_m$. Notably, the ∗ sign in the figures signifies the necessity for the multiplier input to precede the adder input by one clock cycle, ensuring accurate timing synchronization.

The proposed folded architecture and unfolded architecture are based on the BPUs proposed above as shown in Figure 2 and Figure 3, where only row filters include the scaling module. The folded architecture is characterized by its utilization of a solitary BPU, which is complemented by MUXs to modulate the outputs from the adder and multiplier to sequentially address the four lifting equations. Additionally, the unfolded architecture is constructed upon a four-stage pipeline framework, whereby each stage is equipped with a distinct BPU that is tasked with the computation of one of Equations (1) through (4), respectively. Since only one BPU is utilized, the folded architecture necessitates four clock cycles to compute four lifting equations and sequentially deliver two lifting results over two cycles. In contrast, the unfolded architecture, leveraging a pipelined design, is capable of outputting two lifting results per clock cycle. Consequently, this architectural distinction results in a data throughput ratio of 1:4 between the folded and unfolded architectures.

Furthermore, the scaling module tailored for the unfolded architecture, owing to timing constraints, necessitates the deployment of two multipliers to execute the scaling function. In contrast, the scaling module designed for the folded architecture efficiently utilizes a single multiplier, which alternately scales the H and L frequency components.

3.2. Data Input Method

The proposed architecture uses a three-input scanning method with a one-pixel overlap, a technique widely applied in studies such as [21]. This method was chosen for several reasons: both folded and unfolded architectures are three-input systems, so the three-input scanning method, which reads three pixels at once, is ideal for providing input to the encoding module. Additionally, this method only requires overlapping one row per scan, unlike the seven-column overlap used in [23]. This minimizes redundant data encoding, thereby improving encoding efficiency. Furthermore, the three-input scanning method requires less input memory and reduces the internal temporal memory within the architecture to only three units, thereby lowering hardware resource consumption.

As illustrated in Figure 4, our approach adopts a Z-type pattern for reading image pixels, wherein data from three rows are simultaneously accessed. To ensure the timely arrival of $x(2n+1)$, a staggered data input method is employed, as depicted in Figure 5, where the data of identical color are involved in concurrent operations. Additionally, the final line of data, indicated by a dashed circle in the figure, is subjected to overlap scanning.
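One way to picture the three-row scan with a one-row overlap is as a list of row triples, where the last row of each triple is read again as the first row of the next. The sketch below is an interpretation rather than the authors' address generator: it assumes each scan supplies rows $2n$, $2n+1$, and $2n+2$ (the three inputs of Equation (1)) and that the row beyond the tile edge is mirrored; the function name is hypothetical.

```python
def stripe_rows(height):
    """Row triples (2n, 2n+1, 2n+2) read together in each three-row scan."""
    stripes = []
    for top in range(0, height - 1, 2):
        bottom = top + 2
        if bottom >= height:
            # Past the last row: mirror back into the tile (assumed boundary rule).
            bottom = 2 * (height - 1) - bottom
        stripes.append((top, top + 1, bottom))
    return stripes
```

For an 8-row tile this gives the triples (0, 1, 2), (2, 3, 4), (4, 5, 6), and (6, 7, 6), so consecutive scans share exactly one row.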

3.3. Overall DWT Hardware Architecture

The overall architecture of the 2D DWT is presented in Figure 6, and all levels of DWT are predicated upon this overall architecture. Input RAM stores the input image and frame RAM saves the $LL$ components from a low level, while column and row filters execute the lifting process. Based on the sequence of data input, the column filter needs temporal memory for intermediate variables and the row filter merely requires a two-unit buffer. The transposing module rearranges and synchronizes the L and H components from the column filter. Lastly, the scaling and quantization module scales and quantizes the data, obtaining the final DWT results with redundant precision removed. These results are then used for subsequent encoding in JPEG2000.

For the first-level 2D DWT, the influx of raw input data occurs at a rate of three units per clock cycle. To accommodate this rate and ensure the completion of data processing within each cycle, an unfolded architecture for the column and row filters is utilized, as illustrated in Figure 3.

Given that only the $LL$ frequency component from the first-level DWT output requires processing by the second-level DWT, the data volume entering the second-level DWT is a mere quarter of that fed into the first-level DWT. Consequently, for the second-level 2D DWT, the folded architecture is used in the column and row filters to maintain throughput consistency between the two DWT levels. However, a complication arises as the $LL$ components produced by the first-level DWT are delivered every two cycles, while the second-level DWT reads data every four cycles. To bridge this temporal gap, a frame memory storing two rows of data is required.

For the third and subsequent DWT levels, the folded architecture cannot be subdivided further to match the reduced data rate, so an alternative strategy is needed. To address this challenge, a novel multi-level DWT architecture that matches the required throughput is proposed, as depicted in Figure 7. The first and second levels both employ a three-way parallel architecture. For levels 3–5, only one folded architecture is deployed, which can process components from three tile blocks by timing adjustments as depicted in Figure 8.

Figure 8 elucidates the timing and sequential data processing of levels 3–5. To ensure staggered data flow, the input to the first-level DWT modules is sequentially offset by four cycles. The initial 12 cycles are dedicated to the uninterrupted processing of data from the third-level DWT modules. Subsequently, a slot every 16 cycles is allocated to fourth-level DWT data, and a slot every 64 cycles to fifth-level DWT data. In scenarios limited to five levels of DWT, a pattern emerges where 4 idle cycles are observed after every set of 256 cycles, during which no data processing occurs. However, if additional levels of DWT computation are required, these idle intervals can be strategically utilized for such calculations, though this would increase the clock cycles needed to obtain the encoded data. For an N-level DWT beyond three levels, the system must wait for $4^{(N-2)}$ clock cycles before producing each wavelet coefficient. For example, a 5-level DWT requires a wait of 64 clock cycles, while a 6-level DWT requires 256 clock cycles, which may impact subsequent encoding. Therefore, the proposed architecture is optimally suited for up to 5-level DWT. Furthermore, although a recurring pattern prevents the proposed architecture from achieving 100% hardware utilization, the first two levels reach full utilization, and the final folded architecture achieves 98.44%, leading to an overall hardware utilization above 99.9%. Consequently, the proposed architectural framework demonstrates high efficiency in managing multiple DWT levels, even as the required clock cycles increase for processing beyond five levels.
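The cycle budget of the shared folded unit can be tallied with a few lines of arithmetic. This sketch assumes, as one reading of the description above, that each tile block receives a 4-cycle slot and that level-3, level-4, and level-5 slots recur every 16, 64, and 256 cycles, respectively; the function names are illustrative.

```python
def wait_cycles(level):
    """Cycles between successive coefficients at DWT level N >= 3: 4 ** (N - 2)."""
    return 4 ** (level - 2)


def slot_budget(period=256, tiles=3, slot=4):
    """Busy cycles per level, idle cycles, and utilization over one period."""
    per_level = {
        3: tiles * slot * (period // 16),   # 12 of every 16 cycles
        4: tiles * slot * (period // 64),   # one slot per tile every 64 cycles
        5: tiles * slot * (period // 256),  # one slot per tile every 256 cycles
    }
    busy = sum(per_level.values())
    return per_level, period - busy, busy / period
```

Under these assumptions the 256-cycle period holds 192 + 48 + 12 = 252 busy cycles and 4 idle cycles, i.e. 252/256 ≈ 98.44% utilization, matching the figure quoted for the final folded architecture.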

Assume that the folded architecture has a maximum throughput of 1. Given that only the $LL$ component from the second level is propagated to the third-level DWT, the collective output of the three second-level modules presents the third level with an input data volume of $3/4$. This cascading effect proportionally reduces the data volume conveyed to the fourth and fifth levels of DWT processing, yielding fractions of $3/16$ and $3/64$, respectively. From this pattern, it is evident that even if DWT processing were extended beyond five levels, the total data volume would remain below unity, as shown in Equation (17), because the geometric series on the left-hand side sums to $1 - 4^{-n}$. Under this inequality, 3 is the largest integer multiplier for which the left-hand side stays below 1 for every $n$, making three-way parallelism the maximum achievable level of parallelism. Because this aggregate is less than the maximum throughput of the folded architecture, one folded architecture can manage the entire data processing workload effectively.

$$3 \times \left(\frac{1}{4} + \frac{1}{16} + \frac{1}{64} + \cdots + \frac{1}{4^{n}}\right) < 1 \tag{17}$$
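Equation (17) and the choice of three-way parallelism can be verified with exact rational arithmetic. The helper name below is illustrative; the load is expressed in units of the folded architecture's maximum throughput.

```python
from fractions import Fraction


def shared_load(parallel, n):
    """Left-hand side of Equation (17) with the factor 3 replaced by `parallel`."""
    return parallel * sum(Fraction(1, 4 ** k) for k in range(1, n + 1))
```

For three tile blocks and three shared levels (levels 3–5) the load is 63/64, and in general it equals $1 - 4^{-n}$, which never reaches 1. With four tile blocks the load already reaches 5/4 once two levels share the unit, so a fourth parallel tile would not fit.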

3.4. Simplified Control Module

A simplified control module is designed for the folded architecture that handles DWT levels 3–5. It manages image data from different tiles and levels, ensuring DWT computations are executed in the intended sequence. The control module incorporates only three distinct counting signals: cnt3, cnt4, and cnt5. The cnt3 signal is a 4-bit counter with a range from 0 to 15, while cnt4 and cnt5 are 2-bit counters, each with a counting range from 0 to 3. The corresponding counting signals initiate their counting process immediately when the $LL$ component from the preceding level DWT starts being fed into this folded architecture.

As presented in Table 2, when cnt3 is within the ranges of 0–3, 4–7, and 8–11, the third-level DWT is computed for the first, second, and third tile blocks, respectively. When cnt3 ranges from 12 to 15, the fourth-level DWT is executed for the first, second, and third tile blocks, contingent upon cnt4 values of 1, 2, and 3, respectively. Likewise, with cnt3 ranging from 12 to 15 and cnt4 set to 4, the fifth-level DWT is performed on the first, second, and third tile blocks based on cnt5 values of 1, 2, and 3. No calculations are conducted when cnt3 is between 12 and 15 and both cnt4 and cnt5 are set to 4. In Table 2, the symbol 'x' denotes an arbitrary value, while 'N' indicates that no operation is performed in this instance. Equipped with only three counting signals, the control module effectively reduces hardware complexity. This simplification is further elucidated in the subsequent resource comparison section.
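The decoding rule of Table 2 can be modelled as a small function. This is a sketch under two assumptions that are not confirmed by the text: the counters are treated as 0-based (so the cnt4 and cnt5 values 1–4 quoted above correspond to 0–3 here, consistent with the stated 2-bit range), and cnt4 and cnt5 are taken to advance once every 16 and 64 cycles, respectively. The function names are hypothetical.

```python
def decode(cnt3, cnt4, cnt5):
    """Return (dwt_level, tile_block) for the current cycle, or None when idle."""
    if cnt3 < 12:
        return 3, cnt3 // 4 + 1   # cnt3 0-3, 4-7, 8-11 -> tile blocks 1, 2, 3
    if cnt4 < 3:
        return 4, cnt4 + 1        # shared slot used by the fourth level
    if cnt5 < 3:
        return 5, cnt5 + 1        # shared slot used by the fifth level
    return None                   # no operation ('N' in Table 2)


def schedule(cycles=256):
    """Decode every cycle of one period from a free-running cycle index."""
    return [decode(t % 16, (t // 16) % 4, (t // 64) % 4) for t in range(cycles)]
```

Under these assumptions, one 256-cycle period contains exactly four idle cycles, in line with the timing described for Figure 8.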

3.5. Transposing Module

A novel transposing buffer module has been developed to align with the temporal requirements of both the unfolded and folded architectural designs. In the unfolded architecture, the continuous flow of column filter data necessitates merely four registers to achieve transposition as Figure 9 shows. Conversely, due to the variable interval arrival of column filter data at different levels within the folded architecture, the transposing buffer, as depicted in Figure 10, employs six registers to store the H and L frequency components separately.

4. Results

In the context of processing input images or tile blocks with dimensions $N \times N$, Table 3 delineates the hardware resource allocation for the proposed multi-level 2D DWT architecture across varying DWT levels. Transposition memory, temporal memory, and frame memory are included in the temporal RAM, and the parallelism $S$ for the proposed architecture signifies the count of tile blocks processed simultaneously.

The transistor-delay-product (TDP) introduced in [21] serves as a viable metric for assessing the efficiency of hardware architectures. Concurrently, the area-delay-product (ADP) proposed in [23] provides a visual framework for comparing the efficiency of different architectures, utilizing synthesis results as a basis. Lower TDP and ADP values signify a more efficient hardware architecture. Equations (18)–(20) detail the calculation involving transistor count (TC), active cycle time (ACT), and data arrival time (DAT), and the TC is calculated using a method from [20]. A lower TC indicates reduced resource consumption and better architectural efficiency; a lower ACT suggests fewer clock cycles required to process an image, reflecting an improved design; and a lower DAT implies reduced latency, indicating a more optimized architecture. Additionally, it is posited that $T_m = 2T_a = 6.02$ ns.

$$\mathrm{TDP} = \mathrm{TC} \times \mathrm{CPD} \times \mathrm{ACT} \tag{18}$$

$$\mathrm{ACT} = N^{2} / \mathrm{throughput} \tag{19}$$

$$\mathrm{ADP} = \mathrm{Area} \times \mathrm{ACT} \times \mathrm{DAT} \tag{20}$$
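Equations (18) and (19) translate directly into code. In the sketch below the function names are illustrative, throughput is assumed to be expressed in samples per clock cycle, and any transistor counts passed in are the caller's own figures (none are taken from Table 4).

```python
T_M_NS = 6.02        # multiplier delay posited in the text
T_A_NS = T_M_NS / 2  # adder delay, from T_m = 2 T_a


def act(n, throughput):
    """Active cycle time, Equation (19): N^2 / throughput."""
    return n * n / throughput


def tdp(tc, cpd_ns, n, throughput):
    """Transistor-delay-product, Equation (18): TC x CPD x ACT."""
    return tc * cpd_ns * act(n, throughput)


def reduction_percent(reference, proposed):
    """Relative reduction of a metric with respect to a reference design."""
    return 100.0 * (reference - proposed) / reference
```

For example, `reduction_percent(3 * T_A_NS, T_M_NS)` evaluates to about 33.33, the CPD reduction obtained when a critical path of $3T_a$ is shortened to $T_m$.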

Table 4 presents a comparison of hardware performance between existing architecture and the proposed architecture. Notably, while other studies focus solely on DWT encoding, the proposed architecture additionally performs quantization calculations. Refs. [10] and [12] described high-performing single-level DWT architectures recently introduced, assuming a configuration that employs five single-level architectures in series for processing a 5-level DWT. However, accurately estimating the requisite cache resources for multi-level DWT remains problematic. The data presented in Table 4 indicate that although throughput rate and CPD are enhanced in these architectures, a considerable increase in computational resources is required. Consequently, the proposed architecture outperforms existing single-level DWT designs in handling multi-level DWT.

For architectures specifically designed for multi-level DWT, the proposed architecture demonstrates notable improvements. Compared to [14], the proposed architecture features a shorter critical path and a more efficient multi-level structure, achieving twice the throughput, along with a 74.84% reduction in TC and an 87.42% decrease in TDP. Ref. [20] employed excessive registers for caching and exhibits a higher CPD. In contrast, the proposed architecture reduces TC by 9.67%, lowers TDP by 54.83%, and doubles throughput relative to [20]. In Ref. [22], CPD is reduced to $2T_a$, and parallel processing eliminates intermediate variable storage, minimizing RAM usage. However, for DWT levels exceeding three, Ref. [22] required cross-clock domain processing to match throughput, effectively halving the clock frequency and doubling encoding time compared to the proposed architecture. Against [22], the proposed architecture shows a 12.55% increase in TC but a 50% reduction in ACT and a 43.73% decrease in TDP. The input scheme in [23] involves scanning seven rows of pixels repetitively, significantly increasing input memory. Additionally, a lack of register segmentation in the adders and multipliers results in a CPD of $3T_a$. In comparison, the proposed architecture increases TC by 27.97% but lowers CPD by 33.33%, enhancing throughput by 50% and reducing TDP by 14.69%. Overall, for a 5-level DWT, although the proposed architecture slightly increases storage resource usage, it significantly reduces computational resource consumption and CPD. The proposed architecture demonstrates superior hardware efficiency, with at least a 14.69% improvement in TDP over existing designs.

Table 5 presents the synthesis results and comparisons excluding RAM. Compared to [14] and [20], the proposed architecture achieves only half the CPD, with a reduction in area exceeding 67.81% and an 81.31% decrease in ADP, indicating significantly higher hardware efficiency. When compared to [22], the proposed design uses fewer computational resources, leading to a 26.64% reduction in area and a 29.89% decrease in ADP. These findings are consistent with the theoretical analysis presented in Table 4. However, the proposed architecture exhibits slightly higher power consumption than the referenced designs. This is primarily due to the frequent data reads and writes to RAM from three distinct tile blocks during encoding. Additionally, the reduced need for repetitive scans results in more frequent internal signal transitions, which may further contribute to the increased power usage. Overall, the analysis indicates that although the proposed architecture does not achieve peak hardware efficiency at the configuration of $N = 512$, 3-level, and $S = 8$, it achieves a 26.64% reduction in area and a 29.89% decrease in ADP compared to the current best-performing design, albeit with increased power consumption.

The proposed architecture was synthesized and implemented on Xilinx FPGA platforms using ISE 14.7; the results are presented in Table 6. The configuration with S = 3 and level = 5 is the most efficient architecture in this study. For levels 1 and 2, the FPGA synthesis results of the proposed architecture include both the encoding section and the frame RAM for each level. The study in [4] implemented a watermarking DWT algorithm that lowers the CPD using a convolution scheme. However, the hardware inefficiency of convolution leads to substantial resource consumption. In comparison, on the same XC6VSX315T device, the proposed architecture reduces register usage by 70.09% and LUT usage by 64.34%, at the cost of a 51.86% lower maximum clock frequency. In [5], a Haar DWT formula based solely on addition and subtraction achieves a lower CPD, but its complex watermarking module results in higher resource usage. Compared to [5] on the XC5VLX330T device, the proposed architecture reduces register usage by 23.24% and LUT usage by 22.03%, with a 15.67% decrease in maximum clock frequency. Note that the results in [4,5] were obtained using the System Generator tool within a MATLAB environment, which differs from the tools used in this study. This difference may introduce discrepancies, so the figures from [4,5] are provided for reference only. In [17], the DWT design is implemented with hardware–software codesign and partial reconfiguration on the Vivado platform, which effectively reduces resource usage. However, the extended combinational logic path significantly increases the CPD. Relative to this design on the XC7Z020CLG484 device, the proposed architecture reduces LUT usage by 46.30% and raises the maximum clock frequency from 34.83 MHz to 143.70 MHz (an increase of about 3.13 times the original value), although register usage increases by 92.35%.
Furthermore, the peak efficiency of the proposed architecture, documented in Table 5 and Table 6, corresponds to the configuration with N = 128, 5-level, and S = 3. Overall, considering resource consumption and maximum clock frequency, the FPGA synthesis results suggest that the proposed architecture offers higher hardware efficiency for multi-level DWT and is well-suited for practical applications such as JPEG2000.
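The percentages quoted in the FPGA comparison can be recomputed from the entries of Table 6. The following is a minimal sketch, not part of the original study; the helper names `percent_change` and `compare` are hypothetical, and each reference design is paired with the proposed result on the same device and with the same S and level.

```python
def percent_change(reference, proposed):
    """Signed percent change of `proposed` relative to `reference`."""
    return (proposed - reference) / reference * 100.0


# (Fmax in MHz, registers, LUTs), copied from Table 6.
# "proposed" is the row for the same device, S, and level as the reference.
TABLE6 = {
    "[4]": {"ref": (344.34, 3922, 4708), "proposed": (165.75, 1173, 1679)},
    "[5]": {"ref": (224.00, 1536, 2092), "proposed": (188.89, 1179, 1631)},
    "[17]": {"ref": (34.83, 327, 1121), "proposed": (143.70, 629, 602)},
}


def compare(design):
    """Return the percent change of each metric versus the given design."""
    names = ("fmax", "registers", "luts")
    ref = TABLE6[design]["ref"]
    new = TABLE6[design]["proposed"]
    return {n: percent_change(r, p) for n, r, p in zip(names, ref, new)}


if __name__ == "__main__":
    for design in TABLE6:
        changes = compare(design)
        print(design, {k: f"{v:+.2f}%" for k, v in changes.items()})
```

Negative values indicate a reduction relative to the reference design. Small differences in the last digit with respect to the quoted percentages can arise depending on whether the values are rounded or truncated.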

5. Conclusions and Discussion

This article introduces a novel lifting-based VLSI architecture for multi-level 2D DWT that processes multiple tile blocks in parallel. The proposed architecture combines folded and unfolded architectures, which reduces the CPD to $T_m$, ensures rate consistency across all DWT levels, and improves the throughput rate. Additionally, using only one folded architecture to process the 3–5 level DWT significantly reduces the computational resources required. The quantization module, which is critical in JPEG2000 encoding, has been integrated into the DWT module with minimal additional resource use. Enhancements to the transposing module reduce storage resource requirements. The results show that the proposed architecture reduces the TDP by at least 14.69%, the required area by at least 26.64%, and the ADP by at least 29.89%, and uses substantially fewer FPGA resources, demonstrating enhanced hardware efficiency. Overall, the proposed architecture enables more efficient DWT and quantization in JPEG2000 encoding.

In the future, we will continue to investigate efficient architectures for the DWT in JPEG2000, aiming to further reduce the computational and storage resources consumed by the DWT and quantization modules. Additionally, research will be expanded to enhance the processing accuracy of these modules. Improvements to the interface between the quantization module and the subsequent JPEG2000 encoding modules will also be pursued to achieve rate matching, thereby enhancing the overall hardware efficiency of JPEG2000 encoding. Exploring the application of the proposed DWT and quantization modules to other image encoding protocols also represents a future research direction for our work.

Author Contributions

Conceptualization, Q.L. and Z.W.; methodology, Q.L. and W.Z.; software, Q.L. and Y.D.; validation, Y.D. and W.Z.; formal analysis, Q.L. and W.Z.; investigation, W.Z. and Y.L.; resources, W.Z. and Y.L.; data curation, Q.L. and Y.D.; writing—original draft preparation, Q.L. and Y.D.; writing—review and editing, W.Z., Z.W. and Y.L.; visualization, Y.L.; supervision, W.Z.; project administration, W.Z. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Christopoulos, C.; Skodras, A.; Ebrahimi, T. The JPEG2000 still image coding system: An overview. IEEE Trans. Consum. Electron. 2000, 46, 1103–1127.
2. Li, B.F.; Dou, Y.; Shao, Q. Efficient Memory Subsystem for High Throughput JPEG2000 2D-DWT Encoder. In Proceedings of the 2008 Congress on Image and Signal Processing, Sanya, China, 27–30 May 2008; Volume 1, pp. 529–533.
3. Jain, N.; Singh, M.; Mishra, B. Image Compression Using 2D-Discrete Wavelet Transform on a Light Weight Reconfigurable Hardware. In Proceedings of the 2018 31st International Conference on VLSI Design and 2018 17th International Conference on Embedded Systems (VLSID), Pune, India, 6–10 January 2018; pp. 61–66.
4. Karthigaikumar, P.; Anumol; Baskaran, K. FPGA Implementation of High Speed Low Area DWT Based Invisible Image Watermarking Algorithm. Procedia Eng. 2012, 30, 266–273.
5. Hajjaji, M.A.; Gafsi, M.; Ben Abdelali, A.; Mtibaa, A. FPGA Implementation of Digital Images Watermarking System Based on Discrete Haar Wavelet Transform. Secur. Commun. Networks 2019, 2019, 17.
6. Oweiss, K.G.; Mason, A.; Suhail, Y.; Kamboh, A.M.; Thomson, K.E. A Scalable Wavelet Transform VLSI Architecture for Real-Time Signal Processing in High-Density Intra-Cortical Implants. IEEE Trans. Circuits Syst. I Regul. Pap. 2007, 54, 1266–1278.
7. Kotteri, K.; Bell, A.; Carletta, J. Design of multiplierless, high-performance, wavelet filter banks with image compression applications. IEEE Trans. Circuits Syst. I Regul. Pap. 2004, 51, 483–494.
8. Taubman, D.S.; Marcellin, M.W. JPEG2000 Image Compression Fundamentals, Standards and Practice; Kluwer Academic Publishers: New York, NY, USA, 2002; pp. 429–430.
9. Wu, P.C.; Chen, L.G. An efficient architecture for two-dimensional discrete wavelet transform. IEEE Trans. Circuits Syst. Video Technol. 2001, 11, 536–545.
10. George, A.; P, J.E. Hardware-Efficient DWT Architecture for Image Processing in Visual Sensors Networks. IEEE Sens. J. 2023, 23, 5382–5390.
11. Cheng, C.; Parhi, K.K. High-Speed VLSI Implementation of 2-D Discrete Wavelet Transform. IEEE Trans. Signal Process. 2008, 56, 393–403.
12. Naseer, R.A.; Nasim, M.; Sohaib, M.; Younis, C.J.; Mehmood, A.; Alam, M.; Massoud, Y. VLSI architecture design and implementation of 5/3 and 9/7 lifting Discrete Wavelet Transform. Integration 2022, 87, 253–259.
13. Zhang, W.; Jiang, Z.; Gao, Z.; Liu, Y. An Efficient VLSI Architecture for Lifting-Based Discrete Wavelet Transform. IEEE Trans. Circuits Syst. II Express Briefs 2012, 59, 158–162.
14. Tian, X.; Wu, L.; Tan, Y.H.; Tian, J.W. Efficient Multi-Input/Multi-Output VLSI Architecture for Two-Dimensional Lifting-Based Discrete Wavelet Transform. IEEE Trans. Comput. 2011, 60, 1207–1211.
15. Mohanty, B.K. Approximate Lifting 2-D DWT Hardware Design for Image Encoder of Wireless Visual Sensors. IEEE Sens. J. 2023, 23, 7868–7878.
16. Darji, A.; Agrawal, S.; Oza, A.; Sinha, V.; Verma, A.; Merchant, S.N.; Chandorkar, A.N. Dual-Scan Parallel Flipping Architecture for a Lifting-Based 2-D Discrete Wavelet Transform. IEEE Trans. Circuits Syst. II Express Briefs 2014, 61, 433–437.
17. Bharadwaja, P. Efficient FPGA Implementations of Lifting based DWT using Partial Reconfiguration. In Proceedings of the 2023 36th International Conference on VLSI Design and 2023 22nd International Conference on Embedded Systems (VLSID), Hyderabad, India, 8–12 January 2023; pp. 319–324.
18. Xiong, C.Y.; Tian, J.W.; Liu, J. Efficient high-speed/low-power line-based architecture for two-dimensional discrete wavelet transform using lifting scheme. IEEE Trans. Circuits Syst. Video Technol. 2006, 16, 309–316.
19. Kotteri, K.; Barua, S.; Bell, A.; Carletta, J. A comparison of hardware implementations of the biorthogonal 9/7 DWT: Convolution versus lifting. IEEE Trans. Circuits Syst. II Express Briefs 2005, 52, 256–260.
20. Mohanty, B.K.; Meher, P.K. Memory Efficient Modular VLSI Architecture for Highthroughput and Low-Latency Implementation of Multilevel Lifting 2-D DWT. IEEE Trans. Signal Process. 2011, 59, 2072–2084.
21. Wu, C.; Zhang, W.; Jia, Q.; Liu, Y. Hardware efficient multiplier-less multi-level 2D DWT architecture without off-chip RAM. IET Image Process. 2017, 11, 362–369.
22. Zhang, W.; Wu, C.; Zhang, P.; Liu, Y. An Internal Folded Hardware-Efficient Architecture for Lifting-Based Multi-Level 2-D 9/7 DWT. Appl. Sci. 2019, 9, 4635.
23. Hu, Y.; Jong, C.C. A Memory-Efficient High-Throughput Architecture for Lifting-Based Multi-Level 2-D DWT. IEEE Trans. Signal Process. 2013, 61, 4975–4987.
24. Chaker, A.; Kaaniche, M.; Benazza-Benyahia, A.; Antonini, M. An efficient statistical-based retrieval approach for JPEG2000 compressed images. In Proceedings of the 2015 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 31 August–4 September 2015; pp. 1830–1834.
25. Moreno-Escobar, J.J.; Morales-Matamoros, O.; Tejeida-Padilla, R. SQbSN: JPEG2000 scalar quantizer implemented by means a statistical normalization. In Proceedings of the 2017 Intelligent Systems Conference (IntelliSys), London, UK, 7–8 September 2017; pp. 576–584.
26. Huang, C.T.; Tseng, P.C.; Chen, L.G. Flipping structure: An efficient VLSI architecture for lifting-based discrete wavelet transform. IEEE Trans. Signal Process. 2004, 52, 1080–1089.
27. Bartrina-Rapesta, J.; Aulí-Llinàs, F. Cell-Based Two-Step Scalar Deadzone Quantization for High Bit-Depth Hyperspectral Image Coding. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1893–1897.

Figure 1. Multi-level 2D DWT Process.

Figure 2. Folded architecture with scaling module.

Figure 3. Unfolded architecture with scaling module.

Figure 4. Z-scan.

Figure 5. Repeated scanning for three inputs.

Figure 6. Proposed 2D DWT overall architecture.

Figure 7. Proposed multi-level 2D DWT architecture.

Figure 8. Internal data processing sequence of 3–5 level DWT.

Figure 9. Transposing buffer of the unfolded architecture.

Figure 10. Transposing buffer of the folded architecture.

Table 1. Quantization coefficients for different subbands.

Subband	h	$h^{*}$
HH1	1.040466	0.008674635
HL1	1.011230	0.002468401
LH1	1.011230	0.002468401
HH2	1.934753	0.016130537
HL2	1.997070	0.004874826
LH2	1.997070	0.004874826
HH3	4.160461	0.034686842
HL3	4.183838	0.010212703
LH3	4.183838	0.010212703
HH4	8.605042	0.071742466
HL4	8.542175	0.020851355
LH4	8.542175	0.020851355
HH5	17.392761	0.145007958
HL5	17.173950	0.041921423
LH5	17.173950	0.041921423
LL5	16.995850	0.012146502
LL1-LL4	1	0.000714675

Table 2. Count signals for 3–5 level DWT.

cnt3[3:0]	cnt4[1:0]	cnt5[1:0]	Level	Module
0–3	x	x	3	1
4–7	x	x	3	2
8–11	x	x	3	3
9–12	1	x	4	1
9–12	2	x	4	2
9–12	3	x	4	3
9–12	4	1	5	1
9–12	4	2	5	2
9–12	4	3	5	3
9–12	4	4	N	N

Table 3. Hardware resources for the proposed multi-level 2D DWT architecture.

Level	Multiplier	Adder	Register	Temporal RAM	Parallelism
1	10	16	26	$3 N$	S
2	3	4	16	$5 N / 2$	S
3	3	4	40	$5 N / 4$	$S / 3$
3, 4, 5	3	4	112	$35 N / 16$	$S / 3$
1–5	42	64	238	$299 N / 16$	$3 S$
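The totals in the last row of Table 3 can be reproduced from the per-level rows. The sketch below is illustrative only: it assumes that, for three tile blocks, the level-1 and level-2 modules are each instantiated three times while a single folded module is shared for levels 3–5. These instance counts are an assumption that happens to match the tabulated totals; the names `PER_INSTANCE`, `COPIES`, and `totals` are hypothetical.

```python
from fractions import Fraction

# Per-instance resources from Table 3:
# (multipliers, adders, registers, temporal RAM as a multiple of N).
PER_INSTANCE = {
    "level 1": (10, 16, 26, Fraction(3)),
    "level 2": (3, 4, 16, Fraction(5, 2)),
    "levels 3-5": (3, 4, 112, Fraction(35, 16)),  # the "3, 4, 5" row
}

# Assumed number of instances when three tile blocks are processed.
COPIES = {"level 1": 3, "level 2": 3, "levels 3-5": 1}


def totals():
    """Sum each resource over all instances of all modules."""
    sums = [0, 0, 0, Fraction(0)]
    for name, resources in PER_INSTANCE.items():
        for i, value in enumerate(resources):
            sums[i] += COPIES[name] * value
    return tuple(sums)


multipliers, adders, registers, ram_per_n = totals()
print(multipliers, adders, registers, ram_per_n)  # RAM is a coefficient of N
print(ram_per_n * 128)  # temporal RAM words for N = 128
```

Exact fractions are used so that the RAM coefficient can be compared directly with the $299 N / 16$ entry, and evaluating it at N = 128 gives the internal memory figure listed for the proposed design in Table 4.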

Table 4. The hardware performance for 5-level 2D DWT on a 128 × 128 image with $S = 3$.

Architecture	Multiplier	Adder	Register	Internal Memory	Input Memory	Throughput Rate	CPD	TC $(\times 10^{5})$	ACT	TDP
[10]	195	270	540	x	x	$6 / T_a$	$2 T_a$	x	x	x
[12]	0	330	x	x	x	$6 / T_a$	$T_a$	x	x	x
[14]	60	120	3567	12,310	1152	$3 / 2 T_a$	$4 T_a$	25.71	2730.67	84.53
[20]	43	76	1410	128	896	$3 / 2 T_a$	$4 T_a$	7.16	2730.67	23.55
[22]	56	88	318	816	896	$3 / T_a$	$2 T_a$	5.75	5461.33	18.90
[23]	46	76	90	527	1664	$2 / T_a$	$3 T_a$	5.06	2730.67	12.47
Proposed	42	64	238	2392	1152	$3 / T_a$	$2 T_a$	6.47	2730.67	10.63

x: the data are not available in the cited sources.
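As a sanity check on the TDP column, the sketch below assumes that TDP is proportional to the product TC × ACT × CPD. This proportionality is an assumption, not a definition restated in this section, although the ratio between that product and the tabulated TDP is nearly constant across the rows of Table 4. The function names are hypothetical.

```python
# (TC in units of 1e5, ACT, CPD in units of T_a, tabulated TDP), from Table 4.
TABLE4 = {
    "[14]": (25.71, 2730.67, 4, 84.53),
    "[20]": (7.16, 2730.67, 4, 23.55),
    "[22]": (5.75, 5461.33, 2, 18.90),
    "[23]": (5.06, 2730.67, 3, 12.47),
    "Proposed": (6.47, 2730.67, 2, 10.63),
}


def tdp_proxy(name):
    """Assumed proxy for TDP: TC * ACT * CPD (arbitrary units)."""
    tc, act, cpd, _ = TABLE4[name]
    return tc * act * cpd


def tabulated_tdp(name):
    """TDP value as printed in Table 4."""
    return TABLE4[name][3]


def reduction(baseline, value_of):
    """Percent reduction of the proposed design relative to `baseline`."""
    return (1.0 - value_of("Proposed") / value_of(baseline)) * 100.0


if __name__ == "__main__":
    for name in TABLE4:
        ratio = tdp_proxy(name) / tabulated_tdp(name)
        print(f"{name}: proxy/TDP = {ratio:.1f}")
    print(f"vs [23], proxy:     {reduction('[23]', tdp_proxy):.2f}%")
    print(f"vs [23], tabulated: {reduction('[23]', tabulated_tdp):.2f}%")
```

Both estimates of the reduction relative to [23] come out close to the reported 14.69%; the small gap may stem from rounding in the tabulated values.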

Table 5. Synthesis results for the $N \times N$ image without RAM.

Architecture	N	S	Level	DAT (ns)	Area (µm²)	Power (mW)	ADP (µm²)	EPI (µJ)
[14]	512	8	3	42.66	3,377,870.70	24.45	3098.72	26.28
[20]	512	8	3	45.58	3,104,371.05	22.59	2318.29	18.50
[22]	512	8	3	27.70	1,362,035.87	12.94	618.14	10.60
Proposed	512	8	3	26.74	999,231.90	23.70	433.35	19.42
Proposed	128	3	5	14.62	409,442.04	16.66	16.35	2.27

EPI: Energy per Image, $EPI = Power \times ACT \div Clock\ Frequency$.

Table 6. Implemented results on Xilinx FPGA.

Architecture	Device	S	Level	Fmax (MHz)	Registers	LUTs
[4]	XC6VSX315T	1	2	344.34	3922	4708
[5]	XC5VLX330T	1	2	224.00	1536	2092
[17]	XC7Z020CLG484	1	1	34.83	327	1121
Proposed	XC6VSX315T	1	2	165.75	1173	1679
	XC5VLX330T	1	2	188.89	1179	1631
	XC7Z020CLG484	1	1	143.70	629	602
	XC6VSX315T	3	5	149.12	6974	11,379
	XC5VLX330T	3	5	156.92	7064	11,546
	XC7Z020CLG484	3	5	130.14	6974	11,378
	XC7K325T	3	5	169.27	6978	11,378

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
