Article

An Efficient Multi-Level 2D DWT Architecture for Parallel Tile Block Processing with Integrated Quantization Modules

1
School of Microelectronics, Tianjin University, Tianjin 300072, China
2
College of Electronic Information and Optical Engineering, Nankai University, Tianjin 300071, China
*
Author to whom correspondence should be addressed.
Electronics 2024, 13(23), 4668; https://doi.org/10.3390/electronics13234668
Submission received: 23 October 2024 / Revised: 18 November 2024 / Accepted: 25 November 2024 / Published: 26 November 2024

Abstract

A multi-level 2D discrete wavelet transform (DWT) architecture for JPEG2000 is proposed that increases speed by processing multiple tile blocks in parallel. Based on the lifting scheme, folded and unfolded architectures whose critical path delay is that of a single multiplier are designed to increase the throughput rate. Connecting the folded and unfolded architectures through a pipeline ensures uniform throughput rates across all DWT levels within a single clock domain. Computational resource consumption is reduced by adjusting the timing so that one folded architecture processes levels three to five of the DWT for three tile blocks, and a transposing module requiring merely six registers is devised to decrease storage resource consumption. The quantization module, crucial for code-word control in JPEG2000, is integrated into the scaling module with minimal additional resource expenditure. Compared to existing architectures, the analysis demonstrates that the proposed architecture exhibits enhanced hardware efficiency, with a reduction in transistor-delay-product (TDP) of no less than 14.69%. Synthesis results further reveal an area reduction of at least 26.64% and a decrease in area-delay-product (ADP) of at least 29.89%. Results from FPGA implementation indicate a significant decrease in resource utilization.

1. Introduction

JPEG2000 stands as a highly effective image compression standard [1], extensively utilized in the realms of medical and satellite imagery. The fundamental enhancement in JPEG2000 is the substitution of the discrete cosine transform (DCT) with the discrete wavelet transform (DWT), as it demonstrates an improvement in coding efficiency and image quality [2]. DWT is a multiresolution analysis tool adept at decomposing a signal into distinct subbands with both time and frequency information [3], which finds extensive application across image processing and image encryption domains [4,5,6]. The performance of 9/7 DWT hardware depends on the precision and efficiency of the quantization filters. Higher precision ensures compression quality close to unquantized data but requires more hardware and processing time [7].
In JPEG2000, images are typically segmented into smaller tile blocks with a typical size of 128 × 128. Subsequently, these tiles undergo compression using a five-level two-dimensional (2D) DWT to achieve optimal compression efficacy [8]. However, to meet real-time computation demands, contemporary multi-level 2D DWT architectures face challenges with heightened computational complexity and increased memory resource requirements [9]. Consequently, researching and developing a more efficient VLSI architecture for multi-level DWT is essential.
In recent years, DWT architectures have been categorized into the convolution-based scheme [9,10,11] and lifting-based scheme [12,13,14,15,16,17]. The lifting-based architecture is favored over its convolution-based counterpart due to its reduced computational complexity, narrower internal bit widths, and diminished memory requirements [18,19]. In the domain of single-level DWT, George [10] introduced architectures termed horizontal-FrWF and vertical-FrWF based on a convolutional framework. These architectures eliminate the dependency of FrWF computation’s memory requirements on image resolution, effectively reducing both area and memory needs. Naseer [12] employed a lifting-based architecture with a CSD-DA structure to decrease the complexity of multiplication operations, enhancing the clock frequency. Mohamed [17] achieved a significant reduction in resource utilization and obtained a higher PSNR through hardware–software codesign with partial reconfiguration. However, the resource demands of single-level architectures increase exponentially with the number of levels when processing multi-level DWT, significantly reducing encoding efficiency.
Multi-level DWT architecture has also been extensively researched. Mohanty [20] proposed a line-based parallel lifting modular and pipelined architecture, effectively obviating the requirement for line and frame memories. However, this design results in a critical path delay (CPD) of $T_m + 2T_a$, where $T_a$ represents the delay of an adder and $T_m$ the delay of a multiplier. Wu [21] utilized a CSD multiplier to lower the critical path delay to $T_a$, though it complicates the internal structure and control signals. Zhang [22] developed a parallel architecture integrating first-stage unfolding and internal semi-folding, combined with folding, to achieve synchronization with the rate and high throughput requirements of three-level DWT. However, this architecture lacks flexibility in modifying parallelism and DWT levels. Furthermore, when managing multi-level DWT with a high number of levels, the designs in [21] and [22] both encounter the issue of multiple clock domains. This results in a structure that fails to reach optimal efficiency, consequently consuming additional clock cycles. Hu [23] proposed a scanning method based on overlapping stripes and developed a scalable pipelined multilevel DWT architecture, which eliminates the need for frame memory, yet the seven-pixel overlap in scans significantly increases the required input RAM size.
Quantization modules have been thoroughly examined, with a primary focus on minimizing the loss of image quality during the quantization process. Kotteri [7] explored a two-stage cascading quantization structure that enhances filter performance by compensating quantization coefficients. Chaker [24] employed preprocessing steps to extract image features, enabling uniform scalar quantization that reduces loss in image compression quality. Moreno [25] introduced alternative estimations and normalized quantization coefficients to improve the quality of image encoding. However, the studies in [7,24,25] improved quality at the cost of increased hardware resource consumption. Furthermore, in these designs the quantization module functions as a separate component, leading to additional clock cycles and further hardware overhead.
In this article, an efficient multi-level 2D DWT architecture using a lifting scheme capable of processing multiple tile blocks concurrently is proposed, with the quantization module integrated into the scaling module with minimal additional resource consumption. The proposed architecture synchronizes throughput within a single clock domain through the integration of both folded and unfolded architectures, which reduce the CPD to $T_m$. In the architecture, level 1 employs a three-way parallel unfolded architecture, whereas level 2 utilizes a three-way parallel folded architecture. For levels 3 to 5, only one folded architecture is implemented to compute the three-way parallel DWT, diminishing the consumption of computational resources, and a simplified control module is designed to minimize system complexity. The quantization module is integrated with the scaling module by adding only a few selectors, effectively avoiding additional hardware resource consumption and clock cycle overhead. Additionally, a novel transposing module requiring merely six registers is introduced to minimize the requirements of the transposing buffer. Consequently, this approach markedly enhances hardware efficiency.
The rest of this article is organized as follows. Section 2 explains the lifting algorithm, DWT process, and the integration of the quantization formula. Section 3 introduces the proposed multi-level 2D DWT architecture. Section 4 discusses implementation results and comparisons. Conclusions and a discussion are in Section 5.

2. Refinements to DWT and Quantization Formulas

2.1. Algorithm of Lifting-Based 2D DWT

In the implementation of the 2D DWT, the column and row filters apply the same lifting scheme for the 9/7 filter, encompassing two distinct lifting steps and one scaling step. To shorten the critical path delay and avoid additional registers, Huang [26] introduced a formula tailored for the flipping scheme, delineated as follows:
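$$\frac{1}{\alpha}\,y(2n+1) = \frac{1}{\alpha}\,x(2n+1) + x(2n) + x(2n+2) \tag{1}$$
$$\frac{1}{\alpha\beta}\,y(2n) = \frac{1}{\alpha\beta}\,x(2n) + \frac{1}{\alpha}\,y(2n-1) + \frac{1}{\alpha}\,y(2n+1) \tag{2}$$
$$\frac{1}{\alpha\beta\gamma}\,H(2n+1) = \frac{1}{\alpha\beta\gamma}\,y(2n+1) + \frac{1}{\alpha\beta}\,y(2n) + \frac{1}{\alpha\beta}\,y(2n+2) \tag{3}$$
$$\frac{1}{\alpha\beta\gamma\delta}\,L(2n) = \frac{1}{\alpha\beta\gamma\delta}\,y(2n) + \frac{1}{\alpha\beta\gamma}\,H(2n-1) + \frac{1}{\alpha\beta\gamma}\,H(2n+1) \tag{4}$$
$$SH(2n+1) = \alpha\beta\gamma K \cdot \frac{1}{\alpha\beta\gamma}\,H(2n+1) \tag{5}$$
$$SL(2n) = \frac{\alpha\beta\gamma\delta}{K} \cdot \frac{1}{\alpha\beta\gamma\delta}\,L(2n) \tag{6}$$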
where $\alpha$, $\beta$, $\gamma$, and $\delta$ are the lifting coefficients, with $\alpha = -1.586134342$, $\beta = -0.052980118$, $\gamma = 0.882911075$, and $\delta = 0.443506852$; $K = 1.230174105$ is the scaling coefficient; and $x(n)$ represents the pixel value of the input image, with $x(2n+1)$ and $x(2n)$ corresponding to the odd- and even-indexed inputs, respectively. The intermediate variables of the lifting process are represented by $y(n)$, $H(2n+1)$, and $L(2n)$. The lifting process yields $SH(2n+1)$ and $SL(2n)$, which are the high-frequency and low-frequency components output by the 9/7 wavelet transform, respectively.
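The flipping form can be checked numerically against the conventional lifting steps. The following is a minimal sketch, not the authors' hardware description: it assumes whole-sample symmetric extension at the signal boundaries, an even-length input, and a scaling step that multiplies the high-pass branch by $K$ and the low-pass branch by $1/K$. The function names `mirror`, `flip_step`, `dwt97_flipping`, and `dwt97_conventional` are hypothetical.

```python
ALPHA, BETA = -1.586134342, -0.052980118
GAMMA, DELTA = 0.882911075, 0.443506852
K = 1.230174105


def mirror(i, n):
    """Whole-sample symmetric extension of index i into range(n) (assumed boundary rule)."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def flip_step(v, parity, own_gain):
    """One flipped lifting step: v[i] <- own_gain * v[i] + v[i-1] + v[i+1] for indices of one parity."""
    n = len(v)
    out = list(v)
    for i in range(parity, n, 2):
        out[i] = own_gain * v[i] + v[mirror(i - 1, n)] + v[mirror(i + 1, n)]
    return out


def dwt97_flipping(x):
    """1D 9/7 DWT following the flipped Equations (1)-(6); returns (low, high)."""
    s = flip_step(x, 1, 1 / ALPHA)             # odd samples now hold y(2n+1)/alpha
    s = flip_step(s, 0, 1 / (ALPHA * BETA))    # even samples now hold y(2n)/(alpha*beta)
    s = flip_step(s, 1, 1 / (BETA * GAMMA))    # odd samples now hold H/(alpha*beta*gamma)
    s = flip_step(s, 0, 1 / (GAMMA * DELTA))   # even samples now hold L/(alpha*beta*gamma*delta)
    high = [ALPHA * BETA * GAMMA * K * s[i] for i in range(1, len(s), 2)]
    low = [ALPHA * BETA * GAMMA * DELTA / K * s[i] for i in range(0, len(s), 2)]
    return low, high


def dwt97_conventional(x):
    """Reference: conventional lifting, v[i] += c * (v[i-1] + v[i+1]), then scaling."""
    v = list(x)
    n = len(v)
    for parity, c in ((1, ALPHA), (0, BETA), (1, GAMMA), (0, DELTA)):
        prev = list(v)
        for i in range(parity, n, 2):
            v[i] = prev[i] + c * (prev[mirror(i - 1, n)] + prev[mirror(i + 1, n)])
    low = [v[i] / K for i in range(0, n, 2)]
    high = [K * v[i] for i in range(1, n, 2)]
    return low, high
```

In the flipped form every step adds its two neighbours without multiplying them, so the only multiplication per step acts on the sample being updated; this is the property the architecture relies on to keep one multiplier on the critical path.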
The execution of the 2D DWT necessitates an initial column transformation succeeded by a subsequent row transformation, thereby requiring a pair of sequential applications of the flipping operation. Within the second iteration, the high-frequency component $H(2n+1)$ and the low-frequency component $L(2n)$, derived from the initial calculation, serve as distinct inputs. Specifically, the high-frequency output from $H(2n+1)$ constitutes the high-high (HH) component, whereas its low-frequency counterpart results in the high-low (HL) component. Similarly, the high-frequency output from $L(2n)$ forms the low-high (LH) component and the corresponding low-frequency output generates the low-low (LL) component.
In 2D DWT, the LL component contains the majority of the signal's energy and information, and the multi-level DWT methodology involves employing this LL component from a given level as the input data $x(n)$ for the ensuing level's analysis. By iteratively applying the 2D DWT, one acquires the subband components HH, HL, LH, and LL at higher hierarchical levels. Figure 1 illustrates the principle of the 2D DWT and depicts the decomposition process using a two-level DWT as an example.
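Because each level halves both dimensions and only the LL subband is passed on, the subband sizes follow directly from the tile size. The sketch below tabulates them for a square tile; the function name `subband_sizes` is hypothetical, and the 128 × 128 tile with five levels is the typical JPEG2000 configuration mentioned in the Introduction.

```python
def subband_sizes(tile, levels):
    """Return [(level, side)] where side is the edge length of each subband at that level."""
    sizes = []
    side = tile
    for level in range(1, levels + 1):
        side //= 2                      # one 2D DWT level halves rows and columns
        sizes.append((level, side))     # HH, HL, LH and LL all have side x side samples
    return sizes


for level, side in subband_sizes(128, 5):
    print(f"level {level}: four subbands of {side} x {side}; LL feeds level {level + 1}")
```

Each level therefore receives one quarter of the samples processed by the level before it, which is the ratio the throughput analysis in Section 3.3 builds on.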

2.2. Integration of Quantization and Scaling Formulas

Quantization is a crucial aspect of bit-rate control in JPEG2000, where it compresses wavelet coefficients further to eliminate redundancy in image details imperceptible to the human eye. The JPEG2000 standard specifies a method of dead-zone scalar quantization [27], as demonstrated in Equation (7). This method relies on subband information derived from wavelet transformations to determine the quantization step sizes appropriate for different subbands:
$$q = \operatorname{sign}(x)\left\lfloor \frac{|x|}{\Delta} \right\rfloor \tag{7}$$
where $q$ is the quantization result, $x$ represents the components of each level of subbands processed by DWT, $\lfloor\cdot\rfloor$ represents rounding down to the nearest integer, $\Delta$ denotes the quantization step size for the subbands, and $\operatorname{sign}(x)$ indicates the sign of $x$.
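As a small illustration of Equation (7), the sketch below applies dead-zone scalar quantization to a single coefficient. The function name `deadzone_quantize` and the sample values are hypothetical and are not taken from Table 1.

```python
import math


def deadzone_quantize(x, step):
    """Dead-zone scalar quantization: q = sign(x) * floor(|x| / step)."""
    sign = -1 if x < 0 else 1
    return sign * math.floor(abs(x) / step)


print(deadzone_quantize(7.9, 2.0))    # magnitude 7.9 with step 2.0 falls in bin 3
print(deadzone_quantize(-1.9, 2.0))   # magnitudes below one step fall in the dead zone
```

Because the magnitude is rounded toward zero, the interval around zero that maps to $q = 0$ is twice as wide as the other bins, which is what gives the quantizer its name.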
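$$q = h \times S(n) \tag{8}$$
$$q(HH) = h'_{HH} \times \frac{1}{\alpha\beta\gamma}\,HH(2n+1) \tag{9}$$
$$q(LH) = h'_{LH} \times \frac{1}{\alpha\beta\gamma}\,LH(2n+1) \tag{10}$$
$$q(HL) = h'_{HL} \times \frac{1}{\alpha\beta\gamma\delta}\,HL(2n) \tag{11}$$
$$q(LL) = h'_{LL} \times \frac{1}{\alpha\beta\gamma\delta}\,LL(2n) \tag{12}$$
$$h'_{HH} = (\alpha\beta\gamma K)^2 \times h_{HH} \tag{13}$$
$$h'_{LH} = (\alpha\beta\gamma)^2\delta \times h_{LH} \tag{14}$$
$$h'_{HL} = (\alpha\beta\gamma)^2\delta \times h_{HL} \tag{15}$$
$$h'_{LL} = \left(\frac{\alpha\beta\gamma\delta}{K}\right)^2 \times h_{LL} \tag{16}$$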
In a five-level 2D DWT, each wavelet subband has a different quantization step size, though the step size remains constant within each subband. The quantization result is obtained by multiplying the DWT wavelet coefficients by the corresponding quantization coefficients, converting the formula from Equation (7) to Equation (8). In Equation (8), $S(n)$ represents the subband components processed by DWT and $h$ denotes the quantization coefficient. The quantization coefficients for different levels and subbands are shown in Table 1. Since Equations (5), (6), and (8) all involve multiplication, their coefficients can be combined into a single operation. By replacing Equations (5) and (6) in the column transformation with Equations (9)–(12), the scaling and quantization steps can be performed simultaneously. In Equations (9) to (12), the $HH(2n+1)$ and $HL(2n)$ components are derived by applying the row transformation to the column-transformed $H$ component. Similarly, the $LH(2n+1)$ and $LL(2n)$ components result from the row transformation of the column-transformed $L$ component. In the scaling process following DWT encoding, the HH subband component is multiplied twice by the coefficient $\alpha\beta\gamma K$ from Equation (5), while the LL subband component is multiplied twice by the coefficient in Equation (6). The LH and HL subband components are each multiplied once by the coefficients in Equations (5) and (6), respectively. The combined coefficient, written $h'$ here to distinguish it from the original quantization coefficient $h$, is obtained by applying these multiplications to the original quantization coefficient, as shown in Equations (13)–(16). The specific values of the combined coefficients $h'$ for each subband are provided in Table 1.
By combining quantization with the scaling step in the column transformation of DWT, as described in Equations (9)–(16), both DWT and quantization in JPEG2000 encoding are completed without adding hardware resources or clock cycles to the original DWT module. This approach effectively improves the efficiency of JPEG2000 encoding. Furthermore, encoding with different quantization step sizes can be achieved by modifying the original quantization coefficients $h$ in Table 1 and computing the corresponding combined coefficients using Equations (13)–(16) for substitution, demonstrating the flexibility of this approach.
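The merging of the scaling and quantization multipliers in Equations (13)–(16) can be sketched as follows. This is an illustration only: it assumes the high-pass scaling factor is $\alpha\beta\gamma K$ and the low-pass factor is $\alpha\beta\gamma\delta/K$, and the names `SCALE_H`, `SCALE_L`, and `combined_coefficients`, together with the sample coefficient values, are hypothetical rather than taken from Table 1.

```python
ALPHA, BETA = -1.586134342, -0.052980118
GAMMA, DELTA = 0.882911075, 0.443506852
K = 1.230174105

SCALE_H = ALPHA * BETA * GAMMA * K          # factor applied to a high-pass output (assumed form)
SCALE_L = ALPHA * BETA * GAMMA * DELTA / K  # factor applied to a low-pass output (assumed form)


def combined_coefficients(h):
    """Fold the two scaling multiplications of each subband into its quantization coefficient."""
    return {
        "HH": SCALE_H ** 2 * h["HH"],        # high-pass scaling applied in both passes
        "LH": SCALE_H * SCALE_L * h["LH"],   # one high-pass and one low-pass scaling; K cancels
        "HL": SCALE_H * SCALE_L * h["HL"],
        "LL": SCALE_L ** 2 * h["LL"],        # low-pass scaling applied in both passes
    }


h = {"HH": 0.25, "LH": 0.5, "HL": 0.5, "LL": 1.0}   # hypothetical original coefficients
merged = combined_coefficients(h)

unscaled = 37.5                                      # hypothetical unscaled HH sample
two_step = h["HH"] * (SCALE_H * (SCALE_H * unscaled))
one_step = merged["HH"] * unscaled
print(abs(two_step - one_step) < 1e-9)
```

A single multiplication by the merged coefficient gives the same value as scaling twice and then multiplying by the quantization coefficient, which is why the integration needs only selectors for the coefficient rather than extra multipliers.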

3. Proposed Architecture for Multi-Level 2D DWT

3.1. Folded and Unfolded Architecture

Because each flipping equation is computed with two additions and one multiplication, and $T_m \approx 2T_a$, a three-input basic processing unit (BPU) has been designed. This BPU, delineated by the dashed box in Figure 2 and Figure 3, executes the two additions and the multiplication concurrently, thereby reducing the CPD to $T_m$. Notably, the ∗ sign in the figures signifies that the multiplier input must precede the adder input by one clock cycle, ensuring accurate timing synchronization.
The proposed folded architecture and unfolded architecture are based on the BPUs proposed above as shown in Figure 2 and Figure 3, where only row filters include the scaling module. The folded architecture is characterized by its utilization of a solitary BPU, which is complemented by MUXs to modulate the outputs from the adder and multiplier to sequentially address the four lifting equations. Additionally, the unfolded architecture is constructed upon a four-stage pipeline framework, whereby each stage is equipped with a distinct BPU that is tasked with the computation of one of Equations (1) through (4), respectively. Since only one BPU is utilized, the folded architecture necessitates four clock cycles to compute four lifting equations and sequentially deliver two lifting results over two cycles. In contrast, the unfolded architecture, leveraging a pipelined design, is capable of outputting two lifting results per clock cycle. Consequently, this architectural distinction results in a data throughput ratio of 1:4 between the folded and unfolded architectures.
Furthermore, the scaling module tailored for the unfolded architecture, owing to timing constraints, necessitates the deployment of two multipliers to execute the scaling function. In contrast, the scaling module designed for the folded architecture efficiently utilizes a single multiplier, which alternately scales the H and L frequency components.

3.2. Data Input Method

The proposed architecture uses a three-input scanning method with a one-pixel overlap, a technique widely applied in studies such as [21]. This method was chosen for several reasons: both folded and unfolded architectures are three-input systems, so the three-input scanning method, which reads three pixels at once, is ideal for providing input to the encoding module. Additionally, this method only requires overlapping one row per scan, unlike the seven-column overlap used in [23]. This minimizes redundant data encoding, thereby improving encoding efficiency. Furthermore, the three-input scanning method requires less input memory and reduces the internal temporal memory within the architecture to only three units, thereby lowering hardware resource consumption.
As illustrated in Figure 4, our approach adopts a Z-type pattern for reading image pixels, wherein data from three rows are simultaneously accessed. To ensure the timely arrival of $x(2n+1)$, a staggered data input method is employed, as depicted in Figure 5, where the data of identical color are involved in concurrent operations. Additionally, the final line of data, indicated by a dashed circle in the figure, is subjected to overlap scanning.
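The row grouping implied by a three-input scan with a one-row overlap can be sketched as follows. This is an illustrative reading of the scanning scheme, not a reproduction of Figure 4: it assumes each stripe covers rows $2n$, $2n+1$, and $2n+2$ (the three inputs of Equation (1)), that the last stripe mirrors the missing row at the lower image edge, and that the pixel order within a stripe is not modelled. The function name `three_row_stripes` is hypothetical.

```python
def three_row_stripes(n_rows):
    """Yield the three row indices read together in each scan stripe (one-row overlap)."""
    for top in range(0, n_rows, 2):
        bottom = top + 2
        if bottom >= n_rows:                    # assumption: mirror at the lower image edge
            bottom = 2 * (n_rows - 1) - bottom
        yield (top, top + 1, bottom)


for stripe in three_row_stripes(8):
    print(stripe)
```

Only the last row of each stripe is read again by the next stripe, so every scan advances by two new rows for three rows read.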

3.3. Overall DWT Hardware Architecture

The overall architecture of the 2D DWT is presented in Figure 6, and all levels of DWT are predicated upon this overall architecture. Input RAM stores the input image and frame RAM saves the LL components from a lower level, while column and row filters execute the lifting process. Based on the sequence of data input, the column filter needs temporal memory for intermediate variables and the row filter merely requires a two-unit buffer. The transposing module rearranges and synchronizes the L and H components from the column filter. Lastly, the scaling and quantization module scales and quantizes the data, obtaining the final DWT results with redundant precision removed. These results are then used for subsequent encoding in JPEG2000.
For the first-level 2D DWT, the influx of raw input data occurs at a rate of three units per clock cycle. To accommodate this rate and ensure the completion of data processing within each cycle, an unfolded architecture for the column and row filters is utilized, as illustrated in Figure 3.
Given that only the LL frequency component from the first-level DWT output requires processing by the second-level DWT, the data volume entering the second-level DWT is a mere quarter of that fed into the first-level DWT. Consequently, for the second-level 2D DWT, the folded architecture is used in the column and row filters to maintain throughput consistency between the two DWT levels. However, a complication arises as the LL components produced by the first-level DWT are delivered every two cycles, while the second-level DWT reads data every four cycles. To bridge this temporal gap, a frame memory storing two rows of data is required.
For the third and subsequent DWT levels, the indivisible nature of the folded architecture necessitates an alternative strategy. To address this challenge, a novel multi-level DWT architecture that matches the required throughput is proposed, as depicted in Figure 7. The first and second levels both employ a three-way parallel architecture. For levels 3–5, only one folded architecture is deployed, which processes components from three tile blocks through timing adjustments, as depicted in Figure 8.
Figure 8 elucidates the timing and sequential data processing of levels 3–5 of the DWT. To ensure staggered data flow, the input to the first-level DWT modules is sequentially offset by four cycles. The initial 12 cycles are dedicated to the uninterrupted processing of data from the third-level DWT modules. Subsequently, every 16 cycles are allocated for processing fourth-level DWT data, and every 64 cycles address the data from the fifth-level DWT. In scenarios limited to five levels of DWT, a pattern emerges where 4 idle cycles are observed after every set of 256 cycles, during which no data processing occurs. However, if additional levels of DWT computation are required, these idle intervals can be used for such calculations, though this would increase the clock cycles needed to obtain the encoded data. For an N-level DWT beyond three levels, the system must wait for $4^{N-2}$ clock cycles before producing each wavelet coefficient. For example, a 5-level DWT requires a wait of 64 clock cycles, while a 6-level DWT requires 256 clock cycles, which may impact subsequent encoding. Therefore, the proposed architecture is best suited for up to 5-level DWT. Furthermore, although this recurring pattern prevents the proposed architecture from achieving 100% hardware utilization, the first two levels reach full utilization, and the final folded architecture achieves 98.44%, leading to an overall hardware utilization above 99.9%. Consequently, the proposed architectural framework demonstrates high efficiency in managing multiple DWT levels, even as the required clock cycles increase for processing beyond five levels.
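The two figures quoted above, the per-coefficient wait and the utilization of the shared folded architecture, follow from simple arithmetic. The sketch below reproduces them; the function names `wait_cycles` and `folded_utilization` are hypothetical.

```python
def wait_cycles(levels):
    """Clock cycles between successive coefficients of the deepest level: 4^(N-2)."""
    return 4 ** (levels - 2)


def folded_utilization(period=256, idle=4):
    """Fraction of cycles in which the shared folded architecture is busy."""
    return (period - idle) / period


print(wait_cycles(5), wait_cycles(6))          # 64 and 256 cycles, as stated in the text
print(f"{folded_utilization():.2%}")           # 252 busy cycles out of 256
```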
Assume that the folded architecture has a maximum throughput of 1. Given that only the LL component from the second level is propagated to the third-level DWT, the collective output from the three second-level modules amounts to 3/4 of this maximum throughput at the input of the third level. This cascading effect proportionally reduces the data volume conveyed to the fourth and fifth levels of DWT processing, yielding fractions of 3/16 and 3/64, respectively. From this pattern, it is evident that even extending DWT processing beyond five levels yields a total data volume that remains below unity, as shown in Equation (17). Because the bracketed sum approaches 1/3 as the number of levels grows, 3 is the largest integer multiplier for which the inequality holds, making three-way parallelism the maximum achievable level of parallelism. The aggregate is less than the maximum throughput of the folded architecture, thereby enabling one folded architecture to manage the entire data processing workload effectively.
$$3 \times \left(\frac{1}{4} + \frac{1}{16} + \frac{1}{64} + \cdots + \frac{1}{4^{n}}\right) < 1 \tag{17}$$
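Equation (17) can be evaluated exactly with rational arithmetic. The sketch below computes the load placed on one folded architecture, expressed as a fraction of its maximum throughput, for a given number of parallel tile blocks; the function name `folded_load` is hypothetical.

```python
from fractions import Fraction


def folded_load(parallel_tiles, first_level, last_level):
    """Load on the shared folded architecture from levels first_level..last_level."""
    return sum(
        Fraction(parallel_tiles, 4 ** (level - 2))   # level 3 contributes 1/4 per tile block
        for level in range(first_level, last_level + 1)
    )


print(folded_load(3, 3, 5))    # three tile blocks, levels 3 to 5
print(folded_load(4, 3, 4))    # four tile blocks already exceed capacity by level 4
```

With three tile blocks the load for levels 3 to 5 is 63/64, which matches the 98.44% utilization reported for the folded architecture, while a fourth tile block would saturate it at level 3 alone.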

3.4. Simplified Control Module

A simplified control module is designed for the folded architecture that handles levels 3–5 of the DWT. It manages image data from different tiles and levels, ensuring that DWT computations are executed in the intended sequence. The control module incorporates only three counting signals: cnt3, cnt4, and cnt5. The cnt3 signal is a 4-bit counter with a range from 0 to 15, while cnt4 and cnt5 are 2-bit counters, each with a counting range from 0 to 3. Each counting signal starts counting as soon as the LL component from the preceding DWT level begins to be fed into this folded architecture.
As presented in Table 2, when cnt3 is within the ranges of 0–3, 4–7, and 8–11, the third-level DWT is computed for the first, second, and third tile blocks. When cnt3 ranges from 12 to 15, the fourth-level DWT is executed for the first, second, and third tile blocks, contingent upon cnt4 values of 1, 2, and 3, respectively. Likewise, with cnt3 ranging from 12 to 15 and cnt4 set to 4, the fifth-level DWT is performed on the first, second, and third tile blocks based on cnt5 values of 1, 2, and 3. No calculations are conducted when cnt3 is between 12 and 15 and both cnt4 and cnt5 are set to 4. In Table 2, the symbol 'x' denotes an arbitrary value, while 'N' indicates that no operation is performed in this instance. Equipped with only three counting signals, the control module effectively reduces hardware complexity. This simplification is further elucidated in the subsequent resource comparison section.
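The slot allocation described above can be simulated cycle by cycle. The sketch below is a behavioural model under stated assumptions, not the control logic of Table 2: counter values are zero-based (0, 1, 2 select the first, second, and third tile block, and 3 hands the slot to the next level or leaves it idle), cnt4 is assumed to advance once per 16-cycle frame, and cnt5 once per 64 cycles. The function name `schedule` is hypothetical.

```python
def schedule(cycles=256):
    """Return one entry per cycle: (level, tile_index), or None for an idle cycle."""
    slots = []
    for t in range(cycles):
        cnt3 = t % 16            # 4-bit counter, 0..15
        cnt4 = (t // 16) % 4     # assumed to advance once per 16-cycle frame
        cnt5 = (t // 64) % 4     # assumed to advance once per 64 cycles
        if cnt3 < 12:
            slots.append((3, cnt3 // 4))   # third level, tile block 0, 1 or 2
        elif cnt4 < 3:
            slots.append((4, cnt4))        # fourth level uses the last 4 cycles of a frame
        elif cnt5 < 3:
            slots.append((5, cnt5))        # fifth level takes the frame the fourth level skips
        else:
            slots.append(None)             # nothing left to process
    return slots


slots = schedule()
for level in (3, 4, 5):
    print(level, sum(1 for s in slots if s is not None and s[0] == level))
print("idle", sum(1 for s in slots if s is None))
```

Over one 256-cycle period the model assigns 192, 48, and 12 cycles to levels 3, 4, and 5 and leaves 4 cycles idle, in line with the timing described in Section 3.3.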

3.5. Transposing Module

A novel transposing buffer module has been developed to align with the temporal requirements of both the unfolded and folded architectural designs. In the unfolded architecture, the continuous flow of column filter data necessitates merely four registers to achieve transposition as Figure 9 shows. Conversely, due to the variable interval arrival of column filter data at different levels within the folded architecture, the transposing buffer, as depicted in Figure 10, employs six registers to store the H and L frequency components separately.

4. Results

For input images or tile blocks with dimensions N × N, Table 3 lists the hardware resource allocation for the proposed multi-level 2D DWT architecture across varying DWT levels. Transposition memory, temporal memory, and frame memory are included in the temporal RAM, and the parallelism S for the proposed architecture signifies the count of tile blocks processed simultaneously.
The transistor-delay-product (TDP) introduced in [21] serves as a viable metric for assessing the efficiency of hardware architectures. Concurrently, the area-delay-product (ADP) proposed in [23] provides a visual framework for comparing the efficiency of different architectures, utilizing synthesis results as a basis. Lower TDP and ADP values signify a more efficient hardware architecture. Equations (18)–(20) detail the calculation involving transistor count (TC), active cycle time (ACT), and data arrival time (DAT), and the TC is calculated using a method from [20]. A lower TC indicates reduced resource consumption and better architectural efficiency; a lower ACT suggests fewer clock cycles required to process an image, reflecting an improved design; and a lower DAT implies reduced latency, indicating a more optimized architecture. Additionally, it is posited that $T_m = 2T_a = 6.02$ ns.
$$TDP = TC \times CPD \times ACT \tag{18}$$
$$ACT = N^2 / \mathrm{throughput} \tag{19}$$
$$ADP = \mathrm{Area} \times ACT \times DAT \tag{20}$$
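Equations (18)–(20) translate directly into a few helper functions. The sketch below is for illustration only: the function names are hypothetical, the throughput value in the usage line is a placeholder rather than a figure from Table 4, and the delays use the stated assumption $T_m = 2T_a = 6.02$ ns.

```python
T_M_NS = 6.02            # multiplier delay in ns, as posited in the text
T_A_NS = T_M_NS / 2      # adder delay in ns


def act(n, throughput):
    """Active cycle time, Equation (19): clock cycles needed for an N x N image."""
    return n * n / throughput


def tdp(tc, cpd_ns, act_cycles):
    """Transistor-delay-product, Equation (18)."""
    return tc * cpd_ns * act_cycles


def adp(area, act_cycles, dat):
    """Area-delay-product, Equation (20)."""
    return area * act_cycles * dat


def percent_reduction(baseline, proposed):
    """Relative reduction of a metric with respect to a baseline, in percent."""
    return 100.0 * (baseline - proposed) / baseline


print(act(128, 4))                                        # placeholder throughput of 4 samples per cycle
print(round(percent_reduction(3 * T_A_NS, T_M_NS), 2))    # CPD of 3 T_a versus T_m
```

The last line reproduces the 33.33% CPD reduction quoted in the comparison with [23], where the reference design has a CPD of $3T_a$.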
Table 4 presents a comparison of hardware performance between existing architecture and the proposed architecture. Notably, while other studies focus solely on DWT encoding, the proposed architecture additionally performs quantization calculations. Refs. [10] and [12] described high-performing single-level DWT architectures recently introduced, assuming a configuration that employs five single-level architectures in series for processing a 5-level DWT. However, accurately estimating the requisite cache resources for multi-level DWT remains problematic. The data presented in Table 4 indicate that although throughput rate and CPD are enhanced in these architectures, a considerable increase in computational resources is required. Consequently, the proposed architecture outperforms existing single-level DWT designs in handling multi-level DWT.
For architectures specifically designed for multi-level DWT, the proposed architecture demonstrates notable improvements. Compared to [14], the proposed architecture features a shorter critical path and a more efficient multi-level structure, achieving twice the throughput, along with a 74.84% reduction in TC and an 87.42% decrease in TDP. Ref. [20] employed excessive registers for caching and exhibits a higher CPD. In contrast, the proposed architecture reduces TC by 9.67%, lowers TDP by 54.83%, and doubles throughput relative to [20]. In Ref. [22], CPD is reduced to $2T_a$, and parallel processing eliminates intermediate variable storage, minimizing RAM usage. However, for DWT levels exceeding three, Ref. [22] required cross-clock domain processing to match throughput, effectively halving the clock frequency and doubling encoding time compared to the proposed architecture. Against [22], the proposed architecture shows a 12.55% increase in TC but a 50% reduction in ACT and a 43.73% decrease in TDP. The input scheme in [23] involves scanning seven rows of pixels repetitively, significantly increasing input memory. Additionally, a lack of register segmentation in the adders and multipliers results in a CPD of $3T_a$. In comparison, the proposed architecture increases TC by 27.97% but lowers CPD by 33.33%, enhancing throughput by 50% and reducing TDP by 14.69%. Overall, for a 5-level DWT, although the proposed architecture slightly increases storage resource usage, it significantly reduces computational resource consumption and CPD. The proposed architecture demonstrates superior hardware efficiency, with a TDP at least 14.69% lower than that of existing designs.
Table 5 presents the synthesis results and comparisons excluding RAM. Compared to [14] and [20], the proposed architecture achieves only half the CPD, with a reduction in area exceeding 67.81% and an 81.31% decrease in ADP, indicating significantly higher hardware efficiency. When compared to [22], the proposed design uses fewer computational resources, leading to a 26.64% reduction in area and a 29.89% decrease in ADP. These findings are consistent with the theoretical analysis presented in Table 4. However, the proposed architecture exhibits slightly higher power consumption than the referenced designs. This is primarily due to the frequent data reads and writes to RAM from three distinct tile blocks during encoding. Additionally, the reduced need for repetitive scans results in more frequent internal signal transitions, which may further contribute to the increased power usage. Overall, the analysis indicates that although the proposed architecture does not achieve peak hardware efficiency at the configuration of N = 512, 3-level, and S = 8, it achieves a 26.64% reduction in area and a 29.89% decrease in ADP compared to the current best-performing design, albeit with increased power consumption.
The proposed architecture was synthesized and implemented on Xilinx FPGA platforms using ISE 14.7, with the results presented in Table 6. The configuration with S = 3 and level = 5 represents the most efficient architecture in this study. When synthesizing for levels 1 and 2, the FPGA synthesis results of the proposed architecture include both the encoding section and the frame RAM for each level. The study in [4] implemented a watermarking DWT algorithm that lowers the CPD using a convolution scheme. However, the hardware inefficiencies associated with convolution lead to substantial resource consumption. In comparison, the proposed architecture reduces register usage by 70.09% and LUT resources by 64.34%, though it also results in a 51.86% decrease in maximum clock frequency. In [5], a Haar DWT formula based solely on addition and subtraction operations achieves a lower CPD, though its complex watermarking module results in higher resource usage. Compared to [5], the proposed architecture reduces register consumption by 23.24% and LUT resources by 22.03%, with a 15.67% decrease in maximum clock frequency. Additionally, the results from [4,5] were obtained using the System Generator tool within a MATLAB environment, which differs from the tools used in this study. This variation may introduce discrepancies in the results; therefore, the findings from [4,5] are provided for reference only. In [17], the DWT design is implemented with hardware–software codesign and partial reconfiguration on the Vivado platform, effectively reducing resource usage. However, the extended combinational logic path significantly increases CPD. Relative to this design, the proposed architecture reduces LUT resource usage by 46.30%, increases register usage by 92.35%, and increases the maximum clock frequency by more than 3.13 times.
Furthermore, the peak efficiency of the proposed architecture is documented in Table 5 and Table 6, corresponding to the configuration N = 128, 5-level, and S = 3. Overall, considering both resource consumption and maximum clock frequency, the FPGA synthesis results suggest that the proposed architecture offers higher hardware efficiency for multi-level DWT and is well suited to practical applications such as JPEG2000.
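The FPGA comparison with [4], [5], and [17] mixes percentage changes with a frequency gain expressed in "times", which is easy to misread. The sketch below recomputes both forms from the Table 6 entries for matching devices; the class `FpgaResult` and the helpers `change_pct` and `compare` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FpgaResult:
    """One row of Table 6 (illustrative container, not from the paper)."""
    fmax_mhz: float
    registers: int
    luts: int


# (reference design, proposed design on the same device), copied from Table 6.
PAIRS = {
    "[4]": (FpgaResult(344.34, 3922, 4708), FpgaResult(165.75, 1173, 1679)),
    "[5]": (FpgaResult(224.00, 1536, 2092), FpgaResult(188.89, 1179, 1631)),
    "[17]": (FpgaResult(34.83, 327, 1121), FpgaResult(143.70, 629, 602)),
}


def change_pct(reference: float, proposed: float) -> float:
    """Signed relative change in percent (negative means a reduction)."""
    return (proposed - reference) / reference * 100.0


def compare(reference: FpgaResult, proposed: FpgaResult) -> dict:
    """Relative changes of the proposed design with respect to a reference."""
    return {
        "fmax_pct": change_pct(reference.fmax_mhz, proposed.fmax_mhz),
        # Ratio form: "x times the reference frequency".
        "fmax_ratio": proposed.fmax_mhz / reference.fmax_mhz,
        "registers_pct": change_pct(reference.registers, proposed.registers),
        "luts_pct": change_pct(reference.luts, proposed.luts),
    }


comparison = {name: compare(ref, prop) for name, (ref, prop) in PAIRS.items()}

for name, c in comparison.items():
    print(
        f"vs {name}: Fmax {c['fmax_pct']:+.2f}% (x{c['fmax_ratio']:.2f}), "
        f"registers {c['registers_pct']:+.2f}%, LUTs {c['luts_pct']:+.2f}%"
    )
```

For [17], the ratio and the relative increase differ by exactly one: a clock that is about 4.13 times as fast corresponds to an increase of about 3.13 times the reference frequency.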

5. Conclusions and Discussion

This article introduces a novel VLSI architecture for multi-level 2D DWT that uses a lifting-based approach and is capable of processing multiple tile blocks in parallel. The proposed architecture integrates folded and unfolded architectures, which reduces the CPD to T_m (the delay of one multiplier), ensures rate consistency across all DWT levels, and improves the throughput rate. Additionally, using only one folded architecture to process the 3–5 level DWT significantly reduces the computational resources required. The quantization module, which is critical in JPEG2000 encoding, has been integrated into the DWT module with minimal additional resource use. Enhancements to the transposing module reduce the storage resource requirements. The results demonstrate that the proposed architecture reduces TDP by at least 14.69%, area by at least 26.64%, and ADP by at least 29.89%, and uses substantially fewer FPGA resources, thereby evidencing enhanced hardware efficiency. Overall, the proposed architecture enables more efficient DWT and quantization for JPEG2000 encoding.
In the future, we will continue to investigate efficient architectures for the DWT in JPEG2000, aiming to further reduce the computational and storage resources consumed by the DWT and quantization modules. Additionally, research will be expanded to enhance the processing accuracy of these modules. Improvements to the interface between the quantization module and the subsequent JPEG2000 encoding modules will also be pursued to achieve rate matching, thereby enhancing the overall hardware efficiency of JPEG2000 encoding. Exploring the application of the proposed DWT and quantization modules to other image encoding protocols also represents a future research direction for our work.

Author Contributions

Conceptualization, Q.L. and Z.W.; methodology, Q.L. and W.Z.; software, Q.L. and Y.D.; validation, Y.D. and W.Z.; formal analysis, Q.L. and W.Z.; investigation, W.Z. and Y.L.; resources, W.Z. and Y.L.; data curation, Q.L. and Y.D.; writing—original draft preparation, Q.L. and Y.D.; writing—review and editing, W.Z., Z.W. and Y.L.; visualization, Y.L.; supervision, W.Z.; project administration, W.Z. and Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Christopoulos, C.; Skodras, A.; Ebrahimi, T. The JPEG2000 still image coding system: An overview. IEEE Trans. Consum. Electron. 2000, 46, 1103–1127.
  2. Li, B.F.; Dou, Y.; Shao, Q. Efficient Memory Subsystem for High Throughput JPEG2000 2D-DWT Encoder. In Proceedings of the 2008 Congress on Image and Signal Processing, Sanya, China, 27–30 May 2008; Volume 1, pp. 529–533.
  3. Jain, N.; Singh, M.; Mishra, B. Image Compression Using 2D-Discrete Wavelet Transform on a Light Weight Reconfigurable Hardware. In Proceedings of the 2018 31st International Conference on VLSI Design and 2018 17th International Conference on Embedded Systems (VLSID), Pune, India, 6–10 January 2018; pp. 61–66.
  4. Karthigaikumar, P.; Anumol; Baskaran, K. FPGA Implementation of High Speed Low Area DWT Based Invisible Image Watermarking Algorithm. Procedia Eng. 2012, 30, 266–273.
  5. Hajjaji, M.A.; Gafsi, M.; Ben Abdelali, A.; Mtibaa, A. FPGA Implementation of Digital Images Watermarking System Based on Discrete Haar Wavelet Transform. Secur. Commun. Networks 2019, 2019, 17.
  6. Oweiss, K.G.; Mason, A.; Suhail, Y.; Kamboh, A.M.; Thomson, K.E. A Scalable Wavelet Transform VLSI Architecture for Real-Time Signal Processing in High-Density Intra-Cortical Implants. IEEE Trans. Circuits Syst. I Regul. Pap. 2007, 54, 1266–1278.
  7. Kotteri, K.; Bell, A.; Carletta, J. Design of multiplierless, high-performance, wavelet filter banks with image compression applications. IEEE Trans. Circuits Syst. I Regul. Pap. 2004, 51, 483–494.
  8. Taubman, D.S.; Marcellin, M.W. JPEG2000 Image Compression Fundamentals, Standards and Practice; Kluwer Academic Publishers: New York, NY, USA, 2002; pp. 429–430.
  9. Wu, P.C.; Chen, L.G. An efficient architecture for two-dimensional discrete wavelet transform. IEEE Trans. Circuits Syst. Video Technol. 2001, 11, 536–545.
  10. George, A.; P, J.E. Hardware-Efficient DWT Architecture for Image Processing in Visual Sensors Networks. IEEE Sens. J. 2023, 23, 5382–5390.
  11. Cheng, C.; Parhi, K.K. High-Speed VLSI Implementation of 2-D Discrete Wavelet Transform. IEEE Trans. Signal Process. 2008, 56, 393–403.
  12. Naseer, R.A.; Nasim, M.; Sohaib, M.; Younis, C.J.; Mehmood, A.; Alam, M.; Massoud, Y. VLSI architecture design and implementation of 5/3 and 9/7 lifting Discrete Wavelet Transform. Integration 2022, 87, 253–259.
  13. Zhang, W.; Jiang, Z.; Gao, Z.; Liu, Y. An Efficient VLSI Architecture for Lifting-Based Discrete Wavelet Transform. IEEE Trans. Circuits Syst. II Express Briefs 2012, 59, 158–162.
  14. Tian, X.; Wu, L.; Tan, Y.H.; Tian, J.W. Efficient Multi-Input/Multi-Output VLSI Architecture for Two-Dimensional Lifting-Based Discrete Wavelet Transform. IEEE Trans. Comput. 2011, 60, 1207–1211.
  15. Mohanty, B.K. Approximate Lifting 2-D DWT Hardware Design for Image Encoder of Wireless Visual Sensors. IEEE Sens. J. 2023, 23, 7868–7878.
  16. Darji, A.; Agrawal, S.; Oza, A.; Sinha, V.; Verma, A.; Merchant, S.N.; Chandorkar, A.N. Dual-Scan Parallel Flipping Architecture for a Lifting-Based 2-D Discrete Wavelet Transform. IEEE Trans. Circuits Syst. II Express Briefs 2014, 61, 433–437.
  17. Bharadwaja, P. Efficient FPGA Implementations of Lifting based DWT using Partial Reconfiguration. In Proceedings of the 2023 36th International Conference on VLSI Design and 2023 22nd International Conference on Embedded Systems (VLSID), Hyderabad, India, 8–12 January 2023; pp. 319–324.
  18. Xiong, C.Y.; Tian, J.W.; Liu, J. Efficient high-speed/low-power line-based architecture for two-dimensional discrete wavelet transform using lifting scheme. IEEE Trans. Circuits Syst. Video Technol. 2006, 16, 309–316.
  19. Kotteri, K.; Barua, S.; Bell, A.; Carletta, J. A comparison of hardware implementations of the biorthogonal 9/7 DWT: Convolution versus lifting. IEEE Trans. Circuits Syst. II Express Briefs 2005, 52, 256–260.
  20. Mohanty, B.K.; Meher, P.K. Memory Efficient Modular VLSI Architecture for Highthroughput and Low-Latency Implementation of Multilevel Lifting 2-D DWT. IEEE Trans. Signal Process. 2011, 59, 2072–2084.
  21. Wu, C.; Zhang, W.; Jia, Q.; Liu, Y. Hardware efficient multiplier-less multi-level 2D DWT architecture without off-chip RAM. IET Image Process. 2017, 11, 362–369.
  22. Zhang, W.; Wu, C.; Zhang, P.; Liu, Y. An Internal Folded Hardware-Efficient Architecture for Lifting-Based Multi-Level 2-D 9/7 DWT. Appl. Sci. 2019, 9, 4635.
  23. Hu, Y.; Jong, C.C. A Memory-Efficient High-Throughput Architecture for Lifting-Based Multi-Level 2-D DWT. IEEE Trans. Signal Process. 2013, 61, 4975–4987.
  24. Chaker, A.; Kaaniche, M.; Benazza-Benyahia, A.; Antonini, M. An efficient statistical-based retrieval approach for JPEG2000 compressed images. In Proceedings of the 2015 23rd European Signal Processing Conference (EUSIPCO), Nice, France, 31 August–4 September 2015; pp. 1830–1834.
  25. Moreno-Escobar, J.J.; Morales-Matamoros, O.; Tejeida-Padilla, R. SQbSN: JPEG2000 scalar quantizer implemented by means a statistical normalization. In Proceedings of the 2017 Intelligent Systems Conference (IntelliSys), London, UK, 7–8 September 2017; pp. 576–584.
  26. Huang, C.T.; Tseng, P.C.; Chen, L.G. Flipping structure: An efficient VLSI architecture for lifting-based discrete wavelet transform. IEEE Trans. Signal Process. 2004, 52, 1080–1089.
  27. Bartrina-Rapesta, J.; Aulí-Llinàs, F. Cell-Based Two-Step Scalar Deadzone Quantization for High Bit-Depth Hyperspectral Image Coding. IEEE Geosci. Remote Sens. Lett. 2015, 12, 1893–1897.
Figure 1. Multi-level 2D DWT Process.
Figure 2. Folded architecture with scaling module.
Figure 3. Unfolded architecture with scaling module.
Figure 4. Z-scan.
Figure 5. Repeated scanning for three inputs.
Figure 6. Proposed 2D DWT overall architecture.
Figure 7. Proposed multi-level 2D DWT architecture.
Figure 8. Internal data processing sequence of 3–5 level DWT.
Figure 9. Transposing buffer of the unfolded architecture.
Figure 10. Transposing buffer of the folded architecture.
Table 1. Quantization coefficients for different subbands.

| Subband | h | h |
|---|---|---|
| HH1 | 1.040466 | 0.008674635 |
| HL1 | 1.011230 | 0.002468401 |
| LH1 | 1.011230 | 0.002468401 |
| HH2 | 1.934753 | 0.016130537 |
| HL2 | 1.997070 | 0.004874826 |
| LH2 | 1.997070 | 0.004874826 |
| HH3 | 4.160461 | 0.034686842 |
| HL3 | 4.183838 | 0.010212703 |
| LH3 | 4.183838 | 0.010212703 |
| HH4 | 8.605042 | 0.071742466 |
| HL4 | 8.542175 | 0.020851355 |
| LH4 | 8.542175 | 0.020851355 |
| HH5 | 17.392761 | 0.145007958 |
| HL5 | 17.173950 | 0.041921423 |
| LH5 | 17.173950 | 0.041921423 |
| LL5 | 16.995850 | 0.012146502 |
| LL1-LL4 | 1 | 0.000714675 |
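Table 2. Count signals for 3–5 level DWT.

| cnt3[3:0] | cnt4[1:0] | cnt5[1:0] | Level | Module |
|---|---|---|---|---|
| 0–3 | x | x | 3 | 1 |
| 4–7 | x | x | 3 | 2 |
| 8–11 | x | x | 3 | 3 |
| 9–12 | 1 | x | 4 | 1 |
| 9–12 | 2 | x | 4 | 2 |
| 9–12 | 3 | x | 4 | 3 |
| 9–12 | 4 | 1 | 5 | 1 |
| 9–12 | 4 | 2 | 5 | 2 |
| 9–12 | 4 | 3 | 5 | 3 |
| 9–12 | 4 | 4 | N | N |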
Table 3. Hardware resources for the proposed multi-level 2D DWT architecture.

| Level | Multiplier | Adder | Register | Temporal RAM | Parallelism |
|---|---|---|---|---|---|
| 1 | 10 | 16 | 26 | 3N | S |
| 2 | 3 | 4 | 16 | 5N/2 | S |
| 3 | 3 | 4 | 40 | 5N/4 | S/3 |
| 3, 4, 5 | 3 | 4 | 112 | 35N/16 | S/3 |
| 1–5 | 42 | 64 | 238 | 299N/16 | 3S |
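The totals in the last row of Table 3 can be cross-checked against the per-level rows. The sketch below assumes, as the 3S parallelism and the totals suggest, that the level-1 and level-2 modules are instantiated once per tile block (three times in total), while a single folded module is shared by levels 3–5. The variable names and the constant `TILE_BLOCKS` are illustrative, not taken from the paper.

```python
from fractions import Fraction

# Assumption: three tile blocks are processed in parallel, each with its own
# level-1 and level-2 module; levels 3-5 share one folded module.
TILE_BLOCKS = 3

# Per-module resources from Table 3. "ram" is the coefficient of N
# in the temporal RAM column (e.g. 5N/2 -> Fraction(5, 2)).
level1 = {"multiplier": 10, "adder": 16, "register": 26, "ram": Fraction(3)}
level2 = {"multiplier": 3, "adder": 4, "register": 16, "ram": Fraction(5, 2)}
levels3to5 = {"multiplier": 3, "adder": 4, "register": 112, "ram": Fraction(35, 16)}

# Sum the replicated level-1/level-2 modules and the shared folded module.
total = {
    key: TILE_BLOCKS * level1[key] + TILE_BLOCKS * level2[key] + levels3to5[key]
    for key in level1
}

# Temporal RAM for a concrete tile size, here N = 128 as used in Table 4.
N = 128
temporal_ram_n128 = total["ram"] * N

print(total)
print(f"temporal RAM for N = {N}: {temporal_ram_n128}")
```

Under this assumption the sums reproduce the 1–5 row exactly (42 multipliers, 64 adders, 238 registers, 299N/16 temporal RAM), and for N = 128 the temporal RAM evaluates to 2392, which appears to match the internal-memory entry of the proposed design in Table 4.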
Table 4. The hardware performance for 5-level 2D DWT on a 128 × 128 image with S = 3.
ArchitectureMultiplierAdderRegisterInternal MemoryInput MemoryThroughout RateCPDTC ( × 10 5 ) ACTTDP
[10]195270540xx 6 / T a 2 T a xxx
[12]0330xxx 6 / T a T a xxx
[14]60120356712,3101152 3 / 2 T a 4 T a 25.712730.6784.53
[20]43761410128896 3 / 2 T a 4 T a 7.162730.6723.55
[22]5688318816896 3 / T a 2 T a 5.755461.3318.90
[23]4676905271664 2 / T a 3 T a 5.062730.6712.47
Proposed426423823921152 3 / T a 2 T a 6.472730.6710.63
x: the data are not available in the cited sources.
Table 5. Synthesis results for the N × N image without RAM.

| Architecture | N | S | Level | DAT (ns) | Area (μm²) | Power (mW) | ADP (μm²) | EPI (μJ) |
|---|---|---|---|---|---|---|---|---|
| [14] | 512 | 8 | 3 | 42.66 | 3,377,870.70 | 24.45 | 3098.72 | 26.28 |
| [20] | 512 | 8 | 3 | 45.58 | 3,104,371.05 | 22.59 | 2318.29 | 18.50 |
| [22] | 512 | 8 | 3 | 27.70 | 1,362,035.87 | 12.94 | 618.14 | 10.60 |
| Proposed | 512 | 8 | 3 | 26.74 | 999,231.90 | 23.70 | 433.35 | 19.42 |
| Proposed | 128 | 3 | 5 | 14.62 | 409,442.04 | 16.66 | 16.35 | 2.27 |

EPI: Energy per Image, EPI = Power × ACT ÷ Clock Frequency.
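The EPI definition above can be exercised numerically. The sketch below is illustrative only: the function names are hypothetical, the power and EPI values are taken from the proposed N = 128 row of Table 5, and the ACT value of 2730.67 cycles is taken from the proposed row of Table 4. Because the clock frequency used for the power measurement is not stated in this section, the code does not assume one; it back-solves the frequency implied by the tabulated values instead.

```python
def energy_per_image_uj(power_mw: float, act_cycles: float, clock_hz: float) -> float:
    """EPI = Power x ACT / Clock Frequency, returned in microjoules."""
    power_w = power_mw * 1e-3
    energy_j = power_w * act_cycles / clock_hz
    return energy_j * 1e6


def implied_clock_hz(power_mw: float, act_cycles: float, epi_uj: float) -> float:
    """Clock frequency that makes the EPI formula reproduce a tabulated EPI."""
    return (power_mw * 1e-3) * act_cycles / (epi_uj * 1e-6)


# Proposed architecture, N = 128, S = 3, 5-level.
POWER_MW = 16.66      # Table 5
EPI_UJ = 2.27         # Table 5
ACT_CYCLES = 2730.67  # Table 4 (proposed row)

clock_hz = implied_clock_hz(POWER_MW, ACT_CYCLES, EPI_UJ)
epi_check = energy_per_image_uj(POWER_MW, ACT_CYCLES, clock_hz)

print(f"implied clock frequency: {clock_hz / 1e6:.2f} MHz")
print(f"EPI recomputed at that frequency: {epi_check:.2f} uJ")
```

The implied frequency is an inference from rounded table entries, not a figure reported by the authors, so it should be read as a consistency check on the formula rather than as a measurement condition.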
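Table 6. Implemented results on Xilinx FPGA.

| Architecture | Device | S | Level | Fmax (MHz) | Registers | LUTs |
|---|---|---|---|---|---|---|
| [4] | XC6VSX315T | 1 | 2 | 344.34 | 3922 | 4708 |
| [5] | XC5VLX330T | 1 | 2 | 224.00 | 1536 | 2092 |
| [17] | XC7Z020CLG484 | 1 | 1 | 34.83 | 327 | 1121 |
| Proposed | XC6VSX315T | 1 | 2 | 165.75 | 1173 | 1679 |
| Proposed | XC5VLX330T | 1 | 2 | 188.89 | 1179 | 1631 |
| Proposed | XC7Z020CLG484 | 1 | 1 | 143.70 | 629 | 602 |
| Proposed | XC6VSX315T | 3 | 5 | 149.12 | 6974 | 11,379 |
| Proposed | XC5VLX330T | 3 | 5 | 156.92 | 7064 | 11,546 |
| Proposed | XC7Z020CLG484 | 3 | 5 | 130.14 | 6974 | 11,378 |
| Proposed | XC7K325T | 3 | 5 | 169.27 | 6978 | 11,378 |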
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Li, Qitao, Wei Zhang, Zhuolun Wu, Yuzhou Dai, and Yanyan Liu. 2024. "An Efficient Multi-Level 2D DWT Architecture for Parallel Tile Block Processing with Integrated Quantization Modules" Electronics 13, no. 23: 4668. https://doi.org/10.3390/electronics13234668