An MDCT Hardware Accelerator For MP3 Audio
Abstract: With the increasing popularity of MP3 audio, there is a need to develop cost- and power-efficient architectures for the MP3 encoder and decoder. This paper describes dedicated architectures for computing the modified discrete cosine transform (MDCT) and its inverse (IMDCT). Recent profiling studies have shown that these operations represent about 30% of the total MP3 computations. The MP3 format defines two frame sizes that can occur in the same data stream. We have developed the most efficient algorithms for MDCT and IMDCT suitable for both sizes. Unlike previous algorithms, our computations can be unified in a single ASIC architecture. This unified architecture, implemented in a 90 nm TSMC library, is still 25% smaller and 25% faster than any previous single frame size architecture designed in the same technology. In addition, at 128 Kbits/sec data rates, our algorithms save nearly 1800 multiplications per second (18%), which can help reduce the power consumption.

1-4244-2334-7/08/$20.00 © 2008 IEEE

I. INTRODUCTION
The MPEG-1/2 layer-III (MP3) standard is widely employed in the music industry because of its efficient audio compression. A key enabler in MP3 coding is the perfect reconstruction (PR) cosine modulated filter bank based on the concept of time domain aliasing cancellation (TDAC) [1]. For the encoder, this analysis filter bank is realized by applying the modified discrete cosine transform (MDCT) to sliding blocks of data, while the synthesis filter bank of the decoder uses the inverse modified discrete cosine transform (IMDCT). Since the MDCT and IMDCT require intensive computations, fast algorithms and efficient implementations for these transforms are important to the realization of high quality audio compression, especially when most MP3 audio players are battery operated [2]–[4].
The N point MDCT of a sequence {x(i)} is defined as
\[ X(k) = \sum_{i=0}^{N-1} x(i) \cos\left( \frac{\pi (2i + 1 + \frac{N}{2})(2k + 1)}{2N} \right), \quad 0 \le k < N/2. \]
The inverse MDCT (IMDCT) is defined as
\[ x(i) = \frac{2}{N} \sum_{k=0}^{N/2-1} X(k) \cos\left( \frac{\pi (2i + 1 + \frac{N}{2})(2k + 1)}{2N} \right), \quad 0 \le i < N. \]
Note that the MDCT converts N signal samples into only N/2 transform samples. The MP3 standard defines two data frames of 1152 and 384 samples. These frames are further divided into 32 subbands to be processed by the MDCT/IMDCT, resulting in a long block of 36 samples and a short block of 12 samples. The switch from a long block to a short block is called window switching, and it is used to suppress distortions frequently associated with frequency domain coding of an audio signal. The ability to process both long and short blocks is crucial to efficient implementations of the MP3 audio encoder and decoder.
Because the MP3 block lengths are not powers of two, only a few fast algorithms have been published, and the focus has been on software implementations [5]–[7]. However, over the years the cost of application specific architectures has decreased substantially. In addition, such dedicated architectures can be embedded in field programmable gate arrays, making them just as flexible as software routines. Such an approach permits one to utilize a low-cost and low-power processor to work on data management and other computationally simple tasks while the dedicated hardware takes care of the computationally challenging tasks.
This paper is devoted to the development of the MDCT/IMDCT hardware accelerator for MP3 audio applications. Based on a group theoretic partitioning of the transform kernel, our solutions employ carefully crafted bilinear algorithms which can be directly mapped to hardware architectures. We show for the first time that both the long and short MP3 audio blocks can be seamlessly processed with a single hardware architecture. Our MDCT/IMDCT algorithms require only 9 multiplications for the short block and 36 multiplications for the long block. These may be compared with the current best algorithms for the same tasks, which use at least 11 and 43 multiplications respectively [5]–[7]. However, the main advantage of the proposed algorithms is their bilinearity, which implies that all the multiplications are independent and can be carried out concurrently. Thus when implemented in hardware, our algorithms have only one multiplication on the critical path for both the long and short block MDCT/IMDCT. Previously reported algorithms have at least 2 multiplications along the critical path [5]–[7]. As a result, our VLSI implementations reduce the critical path delay by about 25% while simultaneously saving about 25% of the chip area.

II. ALGORITHM AND ARCHITECTURE
A bilinear algorithm is made up of an addition stage followed by a stage of independent multiplications and a final addition stage. Our procedure consists of (a) converting the
MDCT/IMDCT computation to a DCT of type IV (using additions and subtractions only), (b) decomposing the DCT kernel into cyclic convolutions and Hankel matrix products, and finally, (c) employing bilinear algorithms for each of the convolutions and Hankel matrix products to obtain the required bilinear algorithm for the MDCT. We illustrate each of these steps in this section.

A. Conversion of MDCT into DCT
To obtain the MDCT using a DCT, introduce a new data sequence
\[ y(i) = \begin{cases} -x(i + 3N/4), & \text{if } 0 \le i < N/4, \\ x(i - N/4), & \text{if } N/4 \le i < N. \end{cases} \]
Then defining
\[ z(i) = y(i) - y(N - 1 - i), \quad 0 \le i < N/2, \]
an N point MDCT can be expressed as an N/2 point DCT-IV as
\[ X(k) = \sum_{i=0}^{N/2-1} z(i) \cos\left( \frac{\pi (2i + 1)(2k + 1)}{2N} \right), \quad 0 \le k < N/2. \]
Thus the MDCT computation of short and long blocks is transformed into DCTs of 6 and 18 points respectively.

B. Conversion of IMDCT into DCT
To obtain the IMDCT, we first compute the N/2 point type-IV DCT of X as
\[ z(i) = \frac{2}{N} \sum_{k=0}^{N/2-1} X(k) \cos\left( \frac{\pi (2i + 1)(2k + 1)}{2N} \right), \quad 0 \le i < N/2. \]
Then defining a new data sequence
\[ y(i) = \begin{cases} z(i), & \text{if } 0 \le i \le N/2 - 1, \\ -z(N - 1 - i), & \text{if } N/2 \le i < N, \end{cases} \]
one can recover the signal sequence as
\[ x(i) = \begin{cases} y(i + N/4), & \text{if } 0 \le i < 3N/4, \\ -y(i - 3N/4), & \text{if } 3N/4 \le i < N. \end{cases} \]
Thus the IMDCT computation of short and long blocks is also transformed into DCTs of 6 and 18 points respectively.

C. DCT algorithm for MP3 short block
Recall that a DCT-IV of an N point sequence {x(i)} is given by
\[ X(k) = \sum_{i=0}^{N-1} x(i) \cos\left( \frac{\pi (2i + 1)(2k + 1)}{4N} \right), \quad 0 \le k < N. \tag{1} \]
For the MP3 short block, the DCT length is N = 6. To compute (1) for this N, we partition the signal and transform indices into two sets: set S1 = {1, 4} is made up of those i's for which (2i + 1) is a multiple of 3, and set S2 = {0, 2, 3, 5} of the rest. We compute together those transform components whose indices are in the same set. Further, the summation in (1) is carried out over the two signal index sets separately. We will denote the computation of X(k) with signal component indices restricted to sets S1 and S2 by X1(k) and X2(k) respectively. Clearly, X(k) = X1(k) + X2(k).
Let p denote the DCT kernel element cos(πp/4N) and $\bar{p}$ the element −cos(πp/4N). Then from (1), the transform components X1(k), k ∈ S1 are given by:
\[ \begin{bmatrix} X_1(1) \\ X_1(4) \end{bmatrix} = \begin{bmatrix} 9 & \bar{3} \\ \bar{3} & \bar{9} \end{bmatrix} \begin{bmatrix} x(1) \\ x(4) \end{bmatrix}. \tag{2} \]
Since the constant matrix in (2) is a Hankel matrix, one can use a bilinear algorithm to obtain this product.
Similarly, using the same shorthand notation, the transform components X2(k), k ∈ S1 are given by:
\[ \begin{bmatrix} X_2(1) \\ X_2(4) \end{bmatrix} = \begin{bmatrix} 3 & 9 \\ 9 & \bar{3} \end{bmatrix} \begin{bmatrix} x(0) - x(3) \\ -x(5) - x(2) \end{bmatrix}. \tag{3} \]
Note that (3) exploits the fact that some kernel matrix elements are equal in magnitude, and therefore the signal sequence may be folded to reduce the number of multiplications. Since the constant matrix in (3) is a Hankel matrix, one can once again use a bilinear algorithm to obtain this product.
When both the signal and transform indices are restricted to set S2, the computation can be transformed into a multidimensional convolution. To do this, observe that indices in S2 are elements of A(8N), a group formed by non-negative integers less than and relatively prime to 8N under the operation of multiplication modulo 8N. The structure of this group can be used to order the indices in S2 as well as to define a sign function. By using the order {0, 5, 3, 2} suggested by A(8N) and the corresponding sign function {1, −1, 1, 1}, one can express X2(k), k ∈ S2 as:
\[ \begin{bmatrix} X_2(0) \\ -X_2(5) \\ X_2(3) \\ X_2(2) \end{bmatrix} = \left[ \begin{array}{cc|cc} 1 & \overline{11} & 7 & 5 \\ \overline{11} & \bar{1} & 5 & \bar{7} \\ \hline 7 & 5 & 1 & \overline{11} \\ 5 & \bar{7} & \overline{11} & \bar{1} \end{array} \right] \begin{bmatrix} x(0) \\ -x(5) \\ x(3) \\ x(2) \end{bmatrix}. \tag{4} \]
The structure of the matrix in (4) is predicted by the structure of the group A(8N). By partitioning the matrix in (4) as shown, one can see that it is a block cyclic matrix with each block being a Hankel matrix. Thus this computation corresponds to a two dimensional convolution with a 2 point cyclic convolution along one dimension and a 2 point Hankel product along the other. Again, appropriate bilinear algorithms for these small lengths can be combined to obtain the bilinear algorithm for (4).
Finally, when the signal indices are restricted to the set S1, the computation of X1(k), k ∈ S2 gives:
\[ \begin{bmatrix} X_1(0) \\ X_1(5) \\ X_1(3) \\ X_1(2) \end{bmatrix} = \begin{bmatrix} 3 & 9 \\ \bar{9} & 3 \\ \bar{3} & \bar{9} \\ \bar{9} & 3 \end{bmatrix} \begin{bmatrix} x(1) \\ x(4) \end{bmatrix}. \tag{5} \]
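The folding of Section II-A and the block products (2) to (5) of Section II-C can be checked numerically against the MDCT definition. The following is a minimal sketch under stated assumptions, not the authors' architecture: the function names (`mdct_direct`, `fold`, `hankel2`, `dct4_short`, `mdct12`) are hypothetical, the signs of the kernel elements are derived directly from (1), and the three-multiplication Hankel product is a standard textbook construction that may differ from the bilinear algorithms used in the paper. In particular, this sketch uses more multiplications than the 9 reported for the short block; it only confirms that the decomposition reproduces the transform.

```python
import math

def mdct_direct(x):
    """N-point MDCT evaluated straight from the definition (N/2 outputs)."""
    N = len(x)
    return [sum(x[i] * math.cos(math.pi * (2 * i + 1 + N / 2) * (2 * k + 1) / (2 * N))
                for i in range(N))
            for k in range(N // 2)]

def fold(x):
    """Section II-A: rotate x into y, then fold y into z (additions only)."""
    N = len(x)
    q = N // 4
    y = [-x[i + 3 * q] for i in range(q)] + [x[i - q] for i in range(q, N)]
    return [y[i] - y[N - 1 - i] for i in range(N // 2)]

def hankel2(a, b, c, u0, u1):
    """Product [[a, b], [b, c]] @ [u0, u1] with three multiplications.

    Assumed textbook construction: (a - b) and (c - b) are constants that
    would be precomputed, so only m1, m2 and m3 count as multiplications.
    """
    m1 = b * (u0 + u1)
    m2 = (a - b) * u0
    m3 = (c - b) * u1
    return m1 + m2, m1 + m3

def dct4_short(z):
    """6-point DCT-IV assembled from the block products (2) to (5)."""
    def c(p):
        return math.cos(math.pi * p / 24)  # kernel element "p" for N = 6

    # (2): signal indices in S1, transform indices in S1.
    x11, x14 = hankel2(c(9), -c(3), -c(9), z[1], z[4])
    # (3): signal indices in S2 (folded), transform indices in S1.
    x21, x24 = hankel2(c(3), c(9), -c(3), z[0] - z[3], -z[5] - z[2])
    # (4): block cyclic [[A, B], [B, A]] with Hankel blocks A and B,
    # each block stored as its three distinct entries (a, b, c).
    A = (c(1), -c(11), -c(1))
    B = (c(7), c(5), -c(7))
    plus = [(p + q) / 2 for p, q in zip(A, B)]
    minus = [(p - q) / 2 for p, q in zip(A, B)]
    u = [z[0], -z[5], z[3], z[2]]
    s = hankel2(*plus, u[0] + u[2], u[1] + u[3])
    d = hankel2(*minus, u[0] - u[2], u[1] - u[3])
    x20, neg_x25 = s[0] + d[0], s[1] + d[1]
    x23, x22 = s[0] - d[0], s[1] - d[1]
    # (5): signal indices in S1, transform indices in S2 (plain products).
    x10 = c(3) * z[1] + c(9) * z[4]
    x15 = -c(9) * z[1] + c(3) * z[4]
    x13, x12 = -x10, x15  # rows 3 and 4 of (5) repeat rows 1 and 2 up to sign

    # X(k) = X1(k) + X2(k), returned in natural order k = 0..5.
    return [x10 + x20, x11 + x21, x12 + x22, x13 + x23, x14 + x24, x15 - neg_x25]

def mdct12(x):
    """12-point MDCT of a short block via folding and the 6-point DCT-IV."""
    return dct4_short(fold(x))
```

Calling `mdct12` on any 12-sample block and comparing the result with `mdct_direct` on the same block should give agreement to within floating-point rounding. The block cyclic product (4) is handled here by diagonalizing the 2 point cyclic convolution with sums and differences, which leaves two independent 2 point Hankel products.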
for (4). The operation of multiplication by 2 may be counted as one addition. However, frequently in hardware design, this scaling by 2 can be realized as a trivial left shift with negligible impact on area and speed of the implementation.

[Figure: signal-flow graph labeled CGT6, with inputs x(4), x(1), x(0), x(5), x(3), x(2), multiplier constants c1 through c9, two scalings by 2 (2x), and outputs X(4), X(1), X(0), X(5), X(3), X(2).]

[Figure: plot of Area (μm²) versus Delay (ns); legend: Ref.[5], Ref.[6], Ref.[7], Bilinear.]
Fig. 2. Delay and area for various implementations of 12 point MDCT for MP3 short block.
see that the new proposed algorithm has about 20% less mul-
[Figure: plot with horizontal axis Delay (ns).]
Fig. 7. Delay and area for unified 12 and 36 point MDCT and IMDCT architectures (A, B and C) for MP3 audio, with comparison to the 36 point MDCT architectures in literature.