High Throughput DA-Based DCT With High Accuracy Error-Compensated Adder Tree

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 4, APRIL 2011

Yuan-Ho Chen, Tsin-Yuan Chang, and Chung-Yi Li

Manuscript received June 29, 2009; revised September 17, 2009 and November 05, 2009. First published January 12, 2010; current version published March 23, 2011. This work was supported in part by the National Science Council under Project NSC 98-2221-E-007-095.
The authors are with the Department of Electrical Engineering, National Tsing Hua University, Hsinchu 30013, Taiwan (e-mail: yhchen@larc.ee.nthu.edu.tw; yhchen@yard.ee.nthu.edu.tw; tyc@ee.nthu.edu.tw; cyli@larc.ee.nthu.edu.tw).
Digital Object Identifier 10.1109/TVLSI.2009.2037968
1063-8210/$26.00 © 2010 IEEE

Abstract: In this brief, by operating the shifting and addition in parallel, an error-compensated adder-tree (ECAT) is proposed to deal with the truncation errors and to achieve a low-error and high-throughput discrete cosine transform (DCT) design. Instead of the 12 bits used in previous works, a 9-bit distributed arithmetic (DA) precision is chosen for this work so as to meet peak-signal-to-noise-ratio (PSNR) requirements. Thus, an area-efficient DCT core is implemented to achieve a throughput rate of 1 Gpels/s with a gate count of 22.2 K for the PSNR requirements outlined in the previous works.

Index Terms: Distributed arithmetic (DA)-based, error-compensated adder-tree (ECAT), 2-D discrete cosine transform (DCT).

I. INTRODUCTION

The discrete cosine transform (DCT) is a widely used tool in image and video compression applications [1]. Recently, high-throughput DCT designs have been adopted to meet the requirements of real-time applications [2]–[11].

The multiplier-based DCTs were presented and implemented in [2] and [3]. To reduce area, ROM-based distributed arithmetic (DA) was applied in DCT cores [4]–[6]. Uramoto et al. [4] implemented the DA-based multipliers using ROMs to produce partial products together with adders that accumulated these partial products. In this way, instead of multipliers, the DA-based ROM can be applied in a DCT core design to reduce the area required. In addition, the symmetrical properties of the DCT transform and parallel DA architecture can be used in reducing
the ROM size in [5] and [6], respectively. Recently, ROM-free DA architectures were presented [7]–[11]. Shams et al. employed a bit-level sharing scheme to construct the adder-based butterfly matrix called new DA (NEDA) [7]. Being compressed, the butterfly-adder-matrix in [7] utilized 35 adders and 8 shift-addition elements to replace the ROM. Based on the NEDA architecture, the recursive form and arithmetic logic unit (ALU) were applied in DCT design to reduce area cost [8], [9]. Hence, the NEDA architecture is the smallest architecture for DA-based DCT core designs, but speed limitations exist in the operations of serial shifting and addition after the DA-computation. The high-throughput shift-adder-tree (SAT) and adder-tree (AT), which unroll the shifting and addition words in parallel for DA-based computation, were
introduced in [10] and [11], respectively. However, a large truncation error occurred. In order to reduce the truncation error effect, several error compensation bias methods have been presented [12]–[14] based on statistical analysis of the relationship between partial products and multiplier-multiplicand. However, the elements of the truncation part outlined in this work are independent so that the previously described compensation methods cannot be applied.

This brief addresses a DA-based DCT core with an error-compensated adder-tree (ECAT). The proposed ECAT operates shifting and addition in parallel by unrolling all the words required to be computed. Furthermore, the error-compensated circuit alleviates the truncation error for high accuracy design. Based on low-error ECAT, the DA-precision in this work is chosen to be 9 bits instead of the traditional 12 bits so as to achieve the peak-signal-to-noise-ratio (PSNR) [1] requirements. Therefore, the hardware cost is reduced, and the speed is improved using the proposed ECAT.

This brief is organized as follows. In Section II, the mathematical derivation of the distributed arithmetic is given. The proposed ECAT architecture is discussed in Section III. The proposed 8 × 8 2-D DCT core is demonstrated in Section IV. The comparisons and results are presented in Section V, and conclusions are drawn in Section VI.

II. MATHEMATICAL DERIVATION OF DISTRIBUTED ARITHMETIC

The inner product is an important tool in digital signal processing applications. It can be written as follows:

$$ Y = A^{T} X = \sum_{i=1}^{L} A_i X_i \tag{1} $$

where $A_i$, $X_i$, and $L$ are the $i$th fixed coefficient, the $i$th input data, and the number of inputs, respectively. Assume that coefficient $A_i$ is a $Q$-bit two's complement binary fraction number. Equation (1) can be expressed as follows:

$$ Y = \begin{bmatrix} 2^{0} & 2^{-1} & \cdots & 2^{-(Q-1)} \end{bmatrix} \begin{bmatrix} A_{1,0} & A_{2,0} & \cdots & A_{L,0} \\ A_{1,1} & A_{2,1} & \cdots & A_{L,1} \\ \vdots & \vdots & \ddots & \vdots \\ A_{1,(Q-1)} & A_{2,(Q-1)} & \cdots & A_{L,(Q-1)} \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_L \end{bmatrix} = \begin{bmatrix} 2^{0} & 2^{-1} & \cdots & 2^{-(Q-1)} \end{bmatrix} \begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_{(Q-1)} \end{bmatrix} \tag{2} $$

where $y_j = \sum_{i=1}^{L} A_{i,j} X_i$, $A_{i,j} \in \{0, 1\}$ for $1 \le j \le (Q-1)$, and $A_{i,j} \in \{-1, 0\}$ for $j = 0$. Note that $y_0$ may be 0 or a negative number due to two's complement representation. In (2), $y_j$ can be calculated by adding all $X_i$ values when $A_{i,j} = 1$, and then the transform output $Y$ can be obtained by shifting and adding all nonzero $y_j$ values. Thus, the inner product computation in (1) can be implemented by using shifting and adders instead of multipliers. Therefore, low hardware cost can be achieved by using DA-based architecture.

III. ECAT ARCHITECTURE

From (2), the shifting and addition computation can be written as follows:

$$ Y = \sum_{j=0}^{Q-1} y_j \cdot 2^{-j} \tag{3} $$

In general, the shifting and addition computation uses a shift-and-add operator [7] in VLSI implementation in order to reduce hardware cost. However, when the number of the shifting and addition words increases, the computation time will also increase. Therefore, the shift-adder-tree (SAT) presented in [10] operates shifting and addition in parallel by unrolling all the words needed to be computed for high-speed applications. However, a large truncation error occurs in SAT, and an ECAT architecture is proposed in this brief to compensate for the truncation error in high-speed applications.

In Fig. 1, the $Q$ $P$-bit words operate the shifting and addition in parallel by unrolling all computations. Furthermore, the operation in Fig. 1 can be divided into two parts: the main part (MP) that includes $P$ most significant bits (MSBs) and the truncation part (TP) that has $Q$ least significant bits (LSBs). Then, the shifting and addition output can be expressed as follows:

$$ Y = MP + TP \cdot 2^{-(P-2)} \tag{4} $$

The output $Y$ will obtain the $P$-bit MSBs using a rounding operation called post truncation (Post-T), which is used for high-accuracy applications. However, hardware cost increases in the VLSI design. In general, the TP is usually truncated to reduce hardware costs in parallel shifting and addition operations, known as the direct truncation (Direct-T) method. Thus, a large truncation error occurs due to the neglecting of carry propagation from the TP to MP. In order to alleviate the truncation error effect, several error compensation bias methods have been presented [12]–[14]. All previous works were only applied in the design of a fixed-width multiplier. Because the products in a multiplier have a relationship between the input multiplier and multiplicand, the compensation methods usually use the correlation of inputs to calculate a fixed [12] or an adaptive [13], [14] compensation bias using simulation or statistical analysis. Note that the addition elements $y_{qp}$ in the TP in Fig. 1 (where $1 \le q \le (Q-1)$ and $(P-q-1) \le p \le (P-1)$) are independent from each other. Therefore, the previous compensation method cannot be applied in this work, and the proposed ECAT is explained as follows.

Fig. 1. $Q$ $P$-bit words shifting and addition operations in parallel.

A. Proposed Error-Compensated Scheme

From Fig. 1, (4) can be approximated as

$$ Y \approx MP + \sigma \cdot 2^{-(P-2)} \tag{5} $$
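As a numerical aside on (2)–(5), the sketch below computes the bit-plane sums $y_j$ of (2), recombines them as in (3), and compares a Direct-T and a Post-T result. It is only an illustration under stated assumptions: the function names, the integer encoding of the coefficients, and the tiny test vector are hypothetical, and the output LSB is assumed (from the index range in (7)) to sit one bit above the LSB of $y_0$.

```python
from fractions import Fraction
import math


def da_partial_sums(coeff_codes, xs, Q):
    """Return [y_0, ..., y_{Q-1}] of (2).

    coeff_codes[i] is an integer n_i with A_i = n_i / 2**(Q - 1), so each
    A_i is a Q-bit two's-complement fraction (hypothetical encoding).
    """
    mask = (1 << Q) - 1
    ys = []
    for j in range(Q):
        # Bit plane j: A_{i,j} for every coefficient (j = 0 is the sign bit).
        bits = [((n & mask) >> (Q - 1 - j)) & 1 for n in coeff_codes]
        y_j = sum(x for bit, x in zip(bits, xs) if bit)
        ys.append(-y_j if j == 0 else y_j)  # the sign-bit plane has weight -1
    return ys


def shift_and_add(ys):
    """Exact Y = sum_j y_j * 2**(-j), as in (3)."""
    return sum(Fraction(y, 2 ** j) for j, y in enumerate(ys))


def direct_t(ys):
    """Direct-T model: discard the low j + 1 bits of word j, add what is left."""
    return sum(y >> (j + 1) for j, y in enumerate(ys))


def post_t(ys):
    """Post-T model: add at full precision, then round to the output LSB."""
    return math.floor(shift_and_add(ys) / 2 + Fraction(1, 2))


coeff_codes = [3, -5, 7]            # A = 3/8, -5/8, 7/8 with Q = 4
xs = [10, -4, 6]
ys = da_partial_sums(coeff_codes, xs, Q=4)
exact = sum(Fraction(n, 8) * x for n, x in zip(coeff_codes, xs))
mp = direct_t(ys)                   # corresponds to sigma = 0 in (5)
tp = shift_and_add(ys) / 2 - mp     # TP in units of the output LSB
print(ys, shift_and_add(ys) == exact, mp, tp, post_t(ys))
```

For this small vector the bit-plane sums are [4, 6, 12, 12], the shift-and-add result matches the direct inner product 23/2, and in output-LSB units MP = 4 and TP = 7/4. Direct-T returns 4, while Post-T returns 6, that is, MP plus the rounded TP; the gap between the two is the carry that a compensation bias in (5) tries to recover.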
In (5), $\sigma$ is the compensated bias from the TP to the MP, as listed in (6)–(8):

$$ \sigma = \mathrm{Round}(TP_{major} + TP_{minor}) \tag{6} $$

and

$$ TP_{major} = \frac{1}{2} \sum_{j=0}^{Q-1} y_{j(P-1-j)} \tag{7} $$

$$ TP_{minor} = \frac{1}{4}\left( y_{1(P-1)} + \cdots + y_{(Q-1)(P-Q+1)} \right) + \frac{1}{8}\left( y_{2(P-1)} + \cdots + y_{(Q-1)(P-Q+2)} \right) + \cdots + \left(\frac{1}{2}\right)^{Q} y_{(Q-1)(P-1)} \tag{8} $$

where Round() rounds to the nearest integer. $TP_{major}$ has more weight than $TP_{minor}$ when contributing towards $\sigma$. Therefore, the compensated bias $\sigma$ can be calculated by obtaining $TP_{major}$ and estimating $TP_{minor}$. Let the probability of $y_{qp} = 1$ be 0.5, where $1 \le q \le (Q-1)$ and $(P-q-1) \le p \le (P-1)$. Hence, (8) can be expressed as follows:

$$ TP_{minor} = \frac{1}{4}\cdot\frac{1}{2}(Q-1) + \cdots + \left(\frac{1}{2}\right)^{Q+1} = \left(\frac{1}{2}\right)^{Q+2} \sum_{n=1}^{Q-1} n\,2^{n} = \frac{1}{4}(Q-2) + \left(\frac{1}{2}\right)^{Q+1} \tag{9} $$

For a given $TP_{major}$ ($y_{j(P-1-j)}$, $0 \le j \le (Q-1)$), $\sigma$ can be obtained after rounding the sum $(TP_{major} + TP_{minor})$. In order to round the summation, $TP_{minor}$ can be divided into four parts:

$$ TP_{minor} = \begin{cases} k - \frac{1}{2} + \left(\frac{1}{2}\right)^{4k+1}, & \text{for } Q = 4k \\ k - \frac{1}{4} + \left(\frac{1}{2}\right)^{4k+2}, & \text{for } Q = 4k+1 \\ k + \left(\frac{1}{2}\right)^{4k+3}, & \text{for } Q = 4k+2 \\ k + \frac{1}{4} + \left(\frac{1}{2}\right)^{4k+4}, & \text{for } Q = 4k+3 \end{cases} \tag{10} $$

As $k \ge 1$, $TP_{minor}$ approximates

$$ TP_{minor} \approx \begin{cases} (k-1) + \frac{1}{2}, & \text{for } Q = 4k \\ (k-1) + \frac{3}{4}, & \text{for } Q = 4k+1 \\ k, & \text{for } Q = 4k+2 \\ k + \frac{1}{4}, & \text{for } Q = 4k+3 \end{cases} \tag{11} $$

Hence, $\sigma$ can be rewritten as three cases. (Since $TP_{major}$ is a multiple of 1/2, the fractional offsets 3/4 and 1/4 in (11) round the same way as 1/2 and 0, provided ties are rounded up.)

Case 1) $Q = 0, 1, 2, 3$

$$ \sigma = \mathrm{Round}(TP_{major}) \tag{12} $$

Case 2) $Q = 4k, 4k+1$ $(k \ge 1)$

$$ \sigma = (k-1) + \mathrm{Round}(TP_{major} + 0.5) \tag{13} $$

Case 3) $Q = 4k+2, 4k+3$ $(k \ge 1)$

$$ \sigma = k + \mathrm{Round}(TP_{major}) \tag{14} $$

B. Performance Simulation for an Error-Compensated Circuit

In this subsection, comparisons of the absolute average error $\varepsilon$, the maximum error $\varepsilon_{max}$, and the mean square error $\varepsilon_{mse}$ for the proposed error-compensated circuit with Direct-T and Post-T are listed in Table I. The $\varepsilon$, $\varepsilon_{max}$, and $\varepsilon_{mse}$ are defined as follows:

$$ \varepsilon = \mathrm{Avg}\{ |TP - \sigma| \} \tag{15} $$

$$ \varepsilon_{max} = \max\{ |TP - \sigma| \} \tag{16} $$

$$ \varepsilon_{mse} = \mathrm{Avg}\{ (TP - \sigma)^{2} \} \tag{17} $$

where Avg{} is the average operator.

The internal word-length usually uses 12 bits in a DCT design. Consequently, word length $P = 12$ is chosen together with different $Q$ values of 3, 6, 9, and 12, which are listed in Table I. The Post-T method provides the most accurate values for fixed-width computation nowadays. In addition, the Direct-T method has the largest inaccuracies of the errors shown in Table I for low-cost hardware design. The proposed ECAT is more accurate than Direct-T and is close to the performance of the Post-T method using a compensated circuit. Because the truncation part $TP_{minor}$ is estimated using statistical analysis, the magnitude of errors also increases as the number of shift-and-add words $Q$ increases.

TABLE I
COMPARISONS OF ABSOLUTE AVERAGE ERROR $\varepsilon$, MAXIMUM ABSOLUTE ERROR $\varepsilon_{max}$, AND MEAN SQUARE ERROR $\varepsilon_{mse}$

Fig. 2. Proposed ECAT architecture of shifting and addition operators for the $(P, Q) = (12, 6)$ example.

C. Proposed ECAT Architecture

The proposed ECAT architecture is illustrated in Fig. 2 for $(P, Q) = (12, 6)$ (case 3), where block FA indicates a full-adder cell with three inputs (a, b, and c) and two outputs, a sum (s) and a carry-out (co). Also,
block HA indicates a half-adder cell with two inputs (a and b) and two outputs, a sum (s) and a carry-out (co). The comparisons of area, delay, area-delay product, and accuracy for the proposed ECAT with other architectures are listed in Table II. The area and delay are synthesized using a Synopsys Design Compiler with the Artisan TSMC 0.18-μm standard cell library.

TABLE II
COMPARISONS OF THE PROPOSED ECAT WITH OTHER ARCHITECTURES FOR A SIX 8-BIT WORDS EXAMPLE

TABLE III
9-BIT DA-BASED COEFFICIENT MATRIX C
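The compensation rule of (12)–(14) and the error measures of (15)–(17) can be explored numerically for the $Q = 6$ configuration used in Fig. 2. The following is a minimal Monte Carlo sketch, not the authors' simulation: it assumes every TP bit is an independent fair coin (the stated probability of 0.5), models Direct-T as $\sigma = 0$ and Post-T as $\sigma = \mathrm{Round}(TP)$, rounds ties upward, and uses hypothetical function names. Its numbers are not the entries of Table I.

```python
import math
import random


def round_half_up(x):
    """Round to the nearest integer, ties going up (one reading of Round())."""
    return math.floor(x + 0.5)


def expected_tp_minor(Q):
    """Closed form of (9): E[TP_minor] when each TP bit is 1 with probability 0.5."""
    return (Q - 2) / 4 + 0.5 ** (Q + 1)


def compensation_bias(tp_major, Q):
    """Compensated bias sigma following the three cases (12)-(14)."""
    k, r = divmod(Q, 4)
    if k == 0:                                   # Case 1: Q = 0, 1, 2, 3
        return round_half_up(tp_major)
    if r in (0, 1):                              # Case 2: Q = 4k, 4k + 1
        return (k - 1) + round_half_up(tp_major + 0.5)
    return k + round_half_up(tp_major)           # Case 3: Q = 4k + 2, 4k + 3


def random_tp(Q, rng):
    """Draw random TP bits; return (TP_major, TP_minor) in units of the MP LSB."""
    # TP_major: one bit per word at weight 1/2, as in (7).
    tp_major = 0.5 * sum(rng.getrandbits(1) for _ in range(Q))
    # TP_minor: word j adds j further bits at weights 1/4, 1/8, ..., as in (8).
    tp_minor = 0.0
    for j in range(1, Q):
        for depth in range(1, j + 1):
            tp_minor += rng.getrandbits(1) * 0.5 ** (depth + 1)
    return tp_major, tp_minor


def error_stats(Q, samples=20000, seed=0):
    """Return {scheme: (avg |err|, max |err|, mean square err)} per (15)-(17)."""
    rng = random.Random(seed)
    errors = {"Direct-T": [], "ECAT": [], "Post-T": []}
    for _ in range(samples):
        major, minor = random_tp(Q, rng)
        tp = major + minor
        errors["Direct-T"].append(tp)                              # sigma = 0
        errors["ECAT"].append(tp - compensation_bias(major, Q))
        errors["Post-T"].append(tp - round_half_up(tp))
    return {
        name: (
            sum(abs(e) for e in errs) / samples,
            max(abs(e) for e in errs),
            sum(e * e for e in errs) / samples,
        )
        for name, errs in errors.items()
    }


stats = error_stats(Q=6)
for name, (avg, worst, mse) in stats.items():
    print(f"{name:9s} avg={avg:.3f} max={worst:.3f} mse={mse:.3f}")
```

Under these assumptions the average error of the compensated scheme falls between Post-T and Direct-T, which is the qualitative ordering described in Section III-B. Python's built-in `round()` is deliberately avoided because it rounds ties to even, which would not match the tie handling assumed here.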
The proposed ECAT has the highest accuracy with a moderate area-delay product. The shift-and-add [7] method has the smallest area, but its overall computation time of 10.8 ns (= 1.8 × 6) is the longest. Similarly, the SAT [10], which truncates the TP and computes in parallel, takes 3.72 ns to complete the computation and uses 406 gates, which is the best area-delay product performance. However, for system accuracy, the SAT is the worst option shown in Table II. Therefore, the ECAT is suitable for high-speed and low-error applications.

IV. PROPOSED 8 × 8 2-D DCT CORE DESIGN

The 1-D DCT employs the DA-based architecture and the proposed ECAT to achieve a high-speed, small area, and low-error design. The 1-D 8-point DCT can be expressed as follows:

$$ Z_n = \frac{1}{2} k_n \sum_{m=0}^{7} x_m \cos\left( \frac{(2m+1) n \pi}{16} \right) \tag{18} $$

where $x_m$ denotes the input data; $Z_n$ denotes the transform output; $0 \le n \le 7$; $k_n = 1/\sqrt{2}$ for $n = 0$; and $k_n = 1$ for other $n$ values. By neglecting the scaling factor 1/2, the 1-D 8-point DCT in (18) can be divided into even and odd parts: $Z_e$ and $Z_o$ as listed in (19) and (20), respectively

$$ Z_e = \begin{bmatrix} Z_0 \\ Z_2 \\ Z_4 \\ Z_6 \end{bmatrix} = \begin{bmatrix} c_4 & c_4 & c_4 & c_4 \\ c_2 & c_6 & -c_6 & -c_2 \\ c_4 & -c_4 & -c_4 & c_4 \\ c_6 & -c_2 & c_2 & -c_6 \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ a_3 \end{bmatrix} = C_e \cdot a \tag{19} $$

$$ Z_o = \begin{bmatrix} Z_1 \\ Z_3 \\ Z_5 \\ Z_7 \end{bmatrix} = \begin{bmatrix} c_1 & c_3 & c_5 & c_7 \\ c_3 & -c_7 & -c_1 & -c_5 \\ c_5 & -c_1 & c_7 & c_3 \\ c_7 & -c_5 & c_3 & -c_1 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} = C_o \cdot b \tag{20} $$

where $c_i = \cos(i\pi/16)$. Moreover, the even part $Z_e$ can be further decomposed into even and odd parts: $Z_{ee}$ and $Z_{eo}$

$$ Z_{ee} = \begin{bmatrix} Z_0 \\ Z_4 \end{bmatrix} = \begin{bmatrix} c_4 & c_4 \\ c_4 & -c_4 \end{bmatrix} \begin{bmatrix} A_0 \\ A_1 \end{bmatrix} = C_{ee} \cdot A \tag{21} $$

$$ Z_{eo} = \begin{bmatrix} Z_2 \\ Z_6 \end{bmatrix} = \begin{bmatrix} c_2 & c_6 \\ c_6 & -c_2 \end{bmatrix} \begin{bmatrix} B_0 \\ B_1 \end{bmatrix} = C_{eo} \cdot B \tag{22} $$

For the DA-based computation, the coefficient matrices $C_o$, $C_{ee}$, and $C_{eo}$ are expressed as 9-bit binary fraction numbers. Table III expresses $Z_{ee}$ ($Z_0$ and $Z_4$) in the bit level formulation. In Table III, using given input data $A_0$ and $A_1$, the transform output $Z_{ee}$ needs only one adder to compute $(A_0 + A_1)$ and two separated ECATs to obtain the results of $Z_0$ and $Z_4$. Similarly, the other transform outputs $Z_{eo}$ and $Z_o$ can be implemented in DA-based forms using 10 (= 1 + 9) adders and corresponding ECATs. Consequently, from (19)–(22), the proposed 1-D 8-point DCT architecture can be constructed as illustrated in Fig. 3 using a DA-Butterfly-Matrix, which includes two DA even processing elements (DAEs), a DA odd processing element (DAO) and 12 adders/subtractors, and 8 ECATs (one ECAT for each transform output $Z_n$). The eight separated ECATs work simultaneously, enabling high-speed applications to be achieved. After the data output from the DA-Butterfly-Matrix is completed, the transform output $Z$ will be completed during one clock cycle by the proposed ECATs. In contrast, the traditional shift-and-add architecture requires $Q$ clock cycles to complete the transform output $Z$ if the DA-precision is $Q$ bits.

With high-speed considerations in mind, the proposed 2-D DCT is designed using two 1-D DCT cores and one transpose buffer. For accuracy, the DA-precision and transpose buffer word lengths are chosen to be 9 bits and 12 bits, respectively, meaning that the system can meet the PSNR requirements outlined in previous works. Moreover, the 2-D DCT core accepts 9-bit image input and 12-bit output precision.

For the proposed 2-D DCT, the Synopsys Design Compiler was applied to synthesize the RTL design of the proposed core, and the Cadence SoC Encounter was adopted for placement and routing (P&R). Implemented in a 1.8-V TSMC 0.18-μm 1P6M CMOS process, the proposed 8 × 8 2-D DCT core has a latency of 10 clock cycles and is operated at 125 MHz. As a result of the 8 parallel outputs, the proposed 2-D DCT core can achieve a throughput rate of 1 Gpixels per second (= 8 × 125 MHz), meeting the 1080p (1920 × 1080 × 60 pixels/s) high definition television (HDTV) specifications for 200 MHz based on low power operations. The core layout and simulated characteristics are shown in Fig. 4.

V. DISCUSSION AND COMPARISONS

The test image “Lena” used to check system accuracy is comprised of 512 × 512 pixels with each pixel being represented by 8-bit 256 gray level data. After inputting the original test image pixels to the proposed 2-D DCT core, the transform output data is captured and fed into MATLAB to compute the inverse DCT using 64-bit double-precision operations. The PSNRs are close to 44 and 47 dB for the test image and for random 8-bit 256 gray level data inputs, respectively.

Table IV compares the proposed 8 × 8 2-D DCT core with previous 2-D DCT cores. In [3], a multiplier-based DCT core based on the pipeline radix-4² single delay feedback path (R4²SDF) architecture is presented to achieve a high-speed design. The ROM-based DCT core is presented in [4] to reduce hardware cost. In [7], a NEDA architecture is presented by using
adders to reduce the chip area of the DCT core. Nevertheless, the NEDA design is speed-limited by its shift-and-add operation. In [10] and [11], the SAT and AT architectures for DA-based DCTs improve the throughput rate of the NEDA method. However, DA-precision must be chosen as 13 bits to meet the system accuracy with more area overhead. The proposed DCT core uses low-error ECAT to achieve a high-speed design, and the DA-precision can be chosen as 9 bits to meet the PSNR requirements for reducing hardware costs. The proposed DCT core has the highest hardware efficiency, defined as follows (based on the accuracy required by the presented standards):

$$ \text{Hardware Efficiency}\ (10^{3}\ \text{pels/s}) = \frac{\text{Throughput Rate}}{\text{Gate Counts}} \tag{23} $$

Furthermore, the proposed 2-D DCT core is synthesized using Xilinx ISE 9.1, and on the Xilinx XC2VP30 FPGA it can achieve a throughput rate of 792 megapixels per second (M-pels/s), up to about 7 times that of previous work [16]. Table V compares the proposed 2-D DCT core with previous FPGA implementations.

Fig. 3. Architecture of the proposed 1-D 8-point DCT.

Fig. 4. Core layout and characteristics.

TABLE IV
COMPARISONS OF DIFFERENT 2-D DCT ARCHITECTURES WITH THE PROPOSED ARCHITECTURE
Table notes: ALU: Arithmetic logic unit. 4 transistors per NAND2 gate for different technology. CCITT: Consultative Committee for International Telegraph and Telephone. ECAT: The proposed error-compensated adder-tree. 77 MHz = 1 GHz/13, where the denominator 13 is the number of shifting and addition computation cycles.

TABLE V
COMPARISONS OF 2-D DCT ARCHITECTURES IN FPGAS

VI. CONCLUSION

In this brief, a high-speed and low-error 8 × 8 2-D DCT design with ECAT is proposed to improve the throughput rate significantly, by up to about 13 times at high compression rates, by operating the shifting and addition in parallel. Furthermore, the proposed error-compensated circuit alleviates the truncation error in ECAT. In this way, the DA-precision can be chosen as 9 bits instead of 12 bits so as to meet the PSNR requirements. Thus, the proposed DCT core has a higher hardware efficiency than those in previous works for the same PSNR requirements. Finally, an area-efficient 2-D DCT core is implemented using a TSMC 0.18-μm process, and the maximum throughput rate is 1 Gpels/s. In summary, the proposed architecture is suitable for high compression rate applications in VLSI designs.

REFERENCES

[1] Y. Wang, J. Ostermann, and Y. Zhang, Video Processing and Communications, 1st ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.
[2] Y. Chang and C. Wang, “New systolic array implementation of the 2-D discrete cosine transform and its inverse,” IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 2, pp. 150–157, Apr. 1995.
[3] C. T. Lin, Y. C. Yu, and L. D. Van, “Cost-effective triple-mode reconfigurable pipeline FFT/IFFT/2-D DCT processor,” IEEE Trans. Very Large Scale Integr. Syst., vol. 16, no. 8, pp. 1058–1071, Aug. 2008.
[4] S. Uramoto, Y. Inoue, A. Takabatake, J. Takeda, Y. Yamashita, H. Yerane, and M. Yoshimoto, “A 100-MHz 2-D discrete cosine transform core processor,” IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 492–499, Apr. 1992.
[5] S. Yu and E. E. S., Jr., “DCT implementation with distributed arithmetic,” IEEE Trans. Comput., vol. 50, no. 9, pp. 985–991, Sep. 2001.
[6] P. K. Meher, “Unified systolic-like architecture for DCT and DST using distributed arithmetic,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 12, pp. 2656–2663, Dec. 2006.
[7] A. M. Shams, A. Chidanandan, W. Pan, and M. A. Bayoumi, “NEDA: A low-power high-performance DCT architecture,” IEEE Trans. Signal Process., vol. 54, no. 3, pp. 955–964, Mar. 2006.
[8] M. R. M. Rizk and M. Ammar, “Low power small area high performance 2D-DCT architecture,” in Proc. Int. Design Test Workshop, 2007, pp. 120–125.
[9] Y. Chen, X. Cao, Q. Xie, and C. Peng, “An area efficient high performance DCT distributed architecture for video compression,” in Proc. Int. Conf. Adv. Comm. Technol., 2007, pp. 238–241.
[10] C. Peng, X. Cao, D. Yu, and X. Zhang, “A 250 MHz optimized distributed architecture of 2D 8 × 8 DCT,” in Proc. Int. Conf. ASIC, 2007, pp. 189–192.
[11] C. Y. Huang, L. F. Chen, and Y. K. Lai, “A high-speed 2-D transform architecture with unique kernel for multi-standard video applications,” in Proc. IEEE Int. Symp. Circuits Syst., 2008, pp. 21–24.
[12] S. S. Kidambi, F. E. Guibaly, and A. Antonious, “Area-efficient multipliers for digital signal processing applications,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 43, no. 2, pp. 90–95, Feb. 1996.
[13] K. J. Cho, K. C. Lee, J. G. Chung, and K. K. Parhi, “Design of low-error fixed-width modified booth multiplier,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 5, pp. 522–531, May 2004.
[14] L. D. Van and C. C. Yang, “Generalized low-error area-efficient fixed-width multipliers,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 8, pp. 1608–1619, Aug. 2005.
[15] C. C. Sun, P. Donner, and J. Gotze, “Low-complexity multi-purpose IP core for quantized discrete cosine and integer transform,” in Proc. IEEE Int. Symp. Circuits Syst., 2009, pp. 3014–3017.
[16] A. Tumeo, M. Monchiero, G. Palermo, F. Ferrandi, and D. Sciuto, “A pipelined fast 2D-DCT accelerator for FPGA-based SoCs,” in Proc. IEEE Comput. Soc. Annu. Symp. VLSI, 2007, pp. 331–336.
[17] S. Ghosh, S. Venigalla, and M. Bayoumi, “Design and implementation of a 2D-DCT architecture using coefficient distributed arithmetic,” in Proc. IEEE Comput. Soc. Ann. Symp. VLSI, 2005, pp. 162–166.
