1 - 04 - 2019 - Design Methodology To Explore Hybrid Approximate Adders For Energy-Efficient Image and Video Processing Accelerators

This article has been accepted for inclusion in a future issue of this journal.
Content is final as presented, with the exception of pagination.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS 1
Design Methodology to Explore Hybrid Approximate Adders for Energy-Efficient Image and Video Processing Accelerators

Leonardo Bandeira Soares, Student Member, IEEE, Morgana Macedo Azevedo da Rosa, Cláudio Machado Diniz, Member, IEEE, Eduardo Antonio César da Costa, Member, IEEE, and Sergio Bampi, Senior Member, IEEE
Abstract— This paper proposes a new design methodology to explore the state-of-the-art approximate adders for accelerator architectures conceived in the realm of multiplier-less multiple constant multiplication optimization problem. The proposed methodology is composed of: 1) a search heuristic to seek faster and feasible approximate configurations for the architectures under evaluation; 2) low-power techniques regarding hybrid approximate adders design for accelerators based on trees of shift-and-add operations; 3) high-performance evaluation by exploring parallel prefix adders and low power analysis through the use of the adder optimized by a commercial synthesis tool in the precise part of the approximate adders; and 4) energy efficiency analysis by considering both the approximate techniques and voltage over scaling estimation. Furthermore, improvements are proposed for the state-of-the-art approximate adders under evaluation in this paper. Two case studies are considered to assess the proposed methodology: 1) Gaussian image filter and 2) Sobel operator. The precise and approximate image filters were described in very high-speed integrated circuits hardware description language regarding the proposed methodology. Results are shown after synthesis to a 45-nm standard cell-based technology, where energy reductions ranging from 7.7% up to 73.2% were experienced for multiple levels of quality considering the applications under analysis.

Index Terms— Image and video processing, energy efficiency, approximate computing, adders, multiplier-less multiple constant multiplication.

Manuscript received July 24, 2018; revised December 5, 2018; accepted January 4, 2019. This work was supported in part by CNPq, in part by CAPES, and in part by FAPERGS Brazilian. This paper was recommended by Associate Editor R. S. Murphy-Arteaga. (Corresponding author: Cláudio Machado Diniz.)
L. B. Soares and S. Bampi are with the Graduate Program in Microelectronics, Federal University of Rio Grande do Sul, 91501-970 Porto Alegre, Brazil (e-mail: lbsoares@inf.ufrgs.br; bampi@inf.ufrgs.br).
M. M. A. da Rosa, C. M. Diniz, and E. A. C. da Costa are with the Graduate Program on Electronic Engineering and Computer Science, Catholic University of Pelotas, 96015-560 Pelotas, Brazil (e-mail: claudio.diniz@ucpel.edu.br; eduardo.costa@ucpel.edu.br).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSI.2019.2892588
1549-8328 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

THE semiconductor industry faces challenges at each new Complementary Metal Oxide Semiconductor (CMOS) technology node. Power density increase has been experienced due to the observation that Dennard scaling is not feasible anymore [1]. Furthermore, power and thermal walls bring much more effort to designers, so that digital CMOS design is facing the so-called “Dark Silicon Era” even considering recent Fin Field-Effect Transistor (FinFET) technologies [2]. Therefore, the current and future computing scenario is characterized by the demand for numerous and ubiquitous compute-intensive applications in constrained power budget digital devices. Based on that, energy-efficient techniques (i.e. maximize the number of arithmetic operations per energy unit) are paramount to cope with the previously observed challenges. According to [3], two trending energy-efficient techniques are listed as follows: (i) accelerator-rich architectures based on Application Specific Integrated Circuits (ASICs) and (ii) Approximate Computing (AC).

Architectural heterogeneity and the use of ASIC accelerators are energy-efficient techniques to execute the most compute-intensive kernels of an application [3]. On the other hand, the remaining tasks which demand less energy consumption can be scheduled for general-purpose processors. As a result, general-purpose processors’ workload is alleviated due to the use of energy-efficient specific processing cores. Power-management schemes can be implemented to power off accelerators or general-purpose cores when not in use, thus respecting the power and thermal constraints. The works in [4] and [5] show that despite the challenges of architectural integration introduced by this new design paradigm, these accelerator-rich architectures play an essential role in energy efficiency for recent applications. For example, Hameed et al. [6] show that an ASIC solution is 500× more energy-efficient than a four core general-purpose processor when considered H.264 video coding application. One essential approach to conceive ASIC implementation of digital filters and transforms is to implement the multiplications by constants in the light of MMCM problem formulated in [7] and [8]. In other words, these architectures are designed by adopting the use of additions, subtractions, and shift in an optimized configuration.

The approximate computing paradigm emerged to increase performance and to reduce power dissipation [9]. The critical approach in approximate hardware is to reduce the computation accuracy in favor of energy-efficiency. In circuit level
design, this is performed by designing simpler circuits to speed up the critical path timing and/or to consume less power. Approximate computing techniques take advantage of approximation-tolerant applications which do not need high accuracy all the time but only “good enough” or “sufficiently good” results for output perceptual quality. In [10] is stated the following properties to define an approximation-resilient application: (i) there is not a golden or accurate result, but a range of acceptable ones and (ii) robustness to input noisy data. For example, multimedia applications (e.g., video coding, audio filtering, image processing, and so on), highly demanded by current portable devices, are intrinsically related to human senses. The multimedia signals are, in fact, approximation-tolerant applications, since in [11] is stated that human senses process analog information and have difficulty to realize the negative impact of digital approximations. It means that it is possible to adopt approximate computing techniques to improve energy efficiency in multimedia applications by adequately exploring the user experience at different profiles of quality.

The excellent point of approximate computing is that this paradigm can be adopted at any abstraction level from transistor-level up to software application [12]. Furthermore, approximate computing can be an additive design component for accelerator-rich architectures. One can consider that the use of approximate hardware accelerators brings further energy efficiency improvements [12]. In the arithmetic layer of abstraction, works in [11] and [13]–[20] have proposed approximate adders. Adders are basic building blocks for several compute-intensive multimedia applications. Therefore, approximate adders could drive energy efficiency for recent digital compute-intensive and approximation-tolerant applications.

Based on that, this work proposes a design methodology to explore state-of-the-art approximate adders for ASIC implementation of add-and-shift accelerators for image and video processing. Previous works in [22] and [23] examined the use of the state-of-the-art approximate adders for image filters. To explore approximation for the architectures, they adopted simulation-based methodologies in which search heuristics are implemented to seek for energy-efficient approximate configurations. The approximate adders taken by the previously mentioned related works are the Approximate Mirror Adder (AMA) [13] and the Error-Tolerant Adder I (ETAI) [11]. They are divided into precise and approximate parts. In both the works, only the Ripple Carry Adder (RCA) topology is explored in the precise block of those approximate adders. The same observation is valid for the case study explored in [11]: the precise block of ETA-I is only implemented with RCA topology.

This work presents four novel contributions in the scope of approximate computing:
1) A faster search heuristic and simulation-based methodology to configure feasible configurations with evaluation of multiple levels of quality;
2) A high-performance exploration of approximate hardware accelerators through the use of PPAs and low power evaluation through the optimized adder from the commercial synthesis tool in the precise part of the approximate adders;
3) Combination of different approximate adders to compose hybrid adders for the add-and-shift architectures;
4) Energy-efficiency analysis based on the approximate configurations and VOS estimation due to the insertion of PPAs.

Results show that our approach substantially reduces energy consumption ranging from 7.7% up to 73.2% for different levels of quality.

The remainder of this paper is organized as follows: Section II overviews the approximate and precise adders, as well as the background for the Gaussian and Gradient image filters. Section III presents the related works. Section IV presents the proposed design methodology to explore our hybrid approximate adders to conceive low power accelerators design. In Section V the experimental setup and results are shown. In Section VI the conclusions are drawn.

II. BACKGROUND ON APPROXIMATE ADDERS, PARALLEL PREFIX ADDERS, AND IMAGE FILTERS

A. Approximate Adders

The approximate adders can be classified as computational performance- and power-oriented designs. The former is related to adders divided into m independent blocks or sub-adders to speed up the critical path timing. The claim is that, for random and uniformly distributed pairs of operands, more extended carry propagation rarely occurs. Based on that, additional logic is necessary to speculate carry-in for each sub-adder, since this class of approximate adder breaks the carry propagation in many parts. Examples of adders which improve computational performance are the Error-Tolerant Adder II [15], Error Tolerant Adder IV [14], and the Almost Correct Adder (ACA) [16]. This class of approximate adders is also characterized by the presence of infrequent and high magnitude sum errors. Therefore, the works in [16]–[19] proposed accuracy configurable adders to cope with this error characteristics. On the other hand, more logic is added to detect and correct the sum errors.

A different philosophy is to propose power-oriented adders which generally are divided into two parts: (i) the least significant approximate part and (ii) the most significant accurate part. Examples of power-oriented approximate adders can be observed in [11], [13], and [20]. The principal idea in the approximate part is to replace the full adder cells by simpler adder circuits. Therefore, power reduction is the main objective of this class of adders. Besides, these adders also tend to reduce critical path timing, because in the approximate part there is not carry propagation scheme. One can observe that the classical truncation is a type of power-oriented adder which truncates least significant full adder cells. This class of approximate adders is also characterized by the presence of frequent and low magnitude sum errors. Such errors are of low magnitude because the bit-width of the approximate part can be controlled through an approximate parameter k. In this work, the proposed approach is to explore the power-oriented adders to give priority to power-efficiency. It is also ratified
in [21] which states that the adders focused on delay reduction cannot be used to explore power-efficiency in structures which demand the massive use of additions like the multipliers. Also, the power-oriented approximate adders enable the exploration of multiple conventional adder topologies in the precise part, which is not true for the approximate adders divided into many blocks.

Therefore, we consider in this study the exploration of the fifth approximate version of AMA [13] and the Error-Tolerant Adder I [11], because they are also explored by related works [22], [23]. The former approximate adder is renamed to “Copy adder” due to its copy function implemented by the buffers. The approximate adders which are explored in this study can be observed in Figure 1.

Fig. 1. Approximate adders: (a) Copy adder; (b) ETAI.

The “Copy adder” in Figure 1(a) has its k bits long approximate part implemented by buffers to copy the operand a to the sum. This procedure has 50% probability of getting the correct sum for each bit position. Furthermore, the carry-in estimation for the precise part is implemented by the more straightforward assignment of the input operand bit b_{k−1}. This procedure has 75% probability of getting a correct carry-in estimation to the precise block. The use of half adders implements the approximate part of the ETAI in Figure 1(b). The sum is performed in the non-conventional direction (i.e. from the most significant bit k−1 to the least significant bit position 0). The control logic block is conceived as follows: if the first carry-generate c is equal to “1,” then all the remaining least significant sum bits are set to “1.” Otherwise, the sum result is the one computed by the propagate signal. The carry-in to the precise part in ETAI is statically set to “0.” This procedure has 50% probability of getting the correct carry-in to the precise part. One can observe that for both the approximate adders, any conventional adder topology can be implemented in the precise part. Related works in [22] and [23] explore the use of the RCA. In this work, we explore the RCA, the high-performance PPAs, and the low power adder fully optimized by the commercial tool used in this study. That is why in the next subsection a brief PPA overview is developed.

B. Precise Adders

As previously mentioned, adders are fundamental building blocks in a great variety of computational applications. Based on that, many adder topologies have been proposed to deal with the tradeoff between power and computational performance. The RCA topology is characterized to present low values of power consumption, area, and computational performance. Depending on the high-performance application requirements and given that computational complexity is increasing in nowadays tasks, the critical approach is to accelerate the adder’s critical path delay (i.e. carry propagation) at the expense of higher area and power dissipation. Based on that, the Parallel Prefix Adders were proposed to deal with high-performance demands [24].

The carry propagation structure in the PPA adders is implemented by simple logic cells which tend to keep a regular connection. Based on that, the sum computation can be divided into pre-processing, prefix computation and post-processing parts, as can be seen in Figure 2.

Fig. 2. Parallel prefix adder structure.

In the pre-processing part, the propagate p_i (i.e. a_i ⊕ b_i) and generate g_i (i.e. a_i ∧ b_i) signals are computed based on the input operands a_i and b_i. In the prefix computation stage, the carry computation is accelerated by the parallel composition of the black cells which implement the group propagate P_{i:j} and generate G_{i:j} signals as well as the gray cells which compute the carry c_i. Finally, the post-processing step is given by the sum s_i = p_i ⊕ c_{i−1}.

1) Use of PPA Adders on the Precise Part of the Approximate Adders: Different configurations between the black and gray cells can be obtained. According to the PPAs taxonomy presented in [25], the different prefix cells configurations allow tradeoffs among (i) the number of logic levels, (ii) the maximum fanout, and (iii) the maximum number of horizontal wire tracks (i.e. wire density). All of these aspects affect the adder delay. Based on that, PPA topologies proposed in [26]–[30] are considered in this study. Their main
characteristics concerning logic levels, maximum fanout, and the maximum number of wiring tracks are presented in Table I [25].

TABLE I
TAXONOMY OF n-BIT PPAs

As can be seen in Table I, the Brent-Kung adder has the highest number of levels, while presents low values for both fanout and wire density. The Sklansky and Kogge-Stone have the lowest number of logic levels, but the former gives the highest fanout, and the latter has the worst wire density due to the highest number of prefix cells. The Han-Carlson is the hybrid solution between the Brent-Kung and Kogge-Stone. Therefore, this adder balance the tradeoff between the number of logic levels and wiring tracks. The Ladner-Fischer adder is the hybrid approach between Sklansky and Brent-Kung so that the tradeoff is balanced between the number of logic levels and the fanout.

C. Gaussian and Gradient Image and Video Filters

Image and Video filters are of great importance in nowadays compute-intensive applications. With the emerging Internet-of-Things (IoT) scenario, many applications rely on computer vision algorithms to extract features and to provide services in many fields such like agriculture, safety, transportation, and so forth. The challenging point is that image, and video sensors produce a large quantity of data to be processed, and, consequently, increase the computational effort. Furthermore, if considered the ubiquitous environment, then energy efficiency gains much more priority.

The Gaussian filter is a smoothing filter to remove noise. The two-dimension (i.e., x and y directions) Gaussian kernel is obtained as shown in (1).

\[ G(x, y) = \frac{1}{2\pi\sigma^{2}} \, e^{-\frac{x^{2}+y^{2}}{2\sigma^{2}}} \tag{1} \]

In (1), σ² denotes the variance, and this is one of the parameters to obtain different versions of Gaussian kernels. The other parameter is the window size which determines the number of image pixels to be convolved. Stronger smoothing is obtained through the use of larger window sizes. On the other hand, for larger window sizes, the computational cost substantially increases.

The Sobel operator is a prominent example of a gradient filter, whose pair of vertical and horizontal 3 × 3 convolution masks is presented in (2) and (3), respectively. One of the masks estimates the gradient in the y-direction (rows), while the other estimates the gradient in the x-direction (columns). The square root of the sum of the squares of horizontal and vertical derivatives calculates the magnitude of the gradient.

\[ \frac{1}{4}\begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} \tag{2} \]

\[ \frac{1}{4}\begin{bmatrix} 1 & 0 & -1 \\ 2 & 0 & -2 \\ 1 & 0 & -1 \end{bmatrix} \tag{3} \]

Both the Gaussian and Sobel operator can be combined to perform edge detection application, which enables feature extraction for computer vision algorithms [31]. Convolution operators tend to demand high computational effort. For instance, given a 512 × 512 pixels grayscale image, both the 5 × 5 Gaussian and 3 × 3 Gradient filters are responsible for more than 11.2 million multiplications and 10.4 million additions. These numbers of arithmetic operations are much higher when considering recent video and image resolutions. Therefore, these filters can be regarded as compute-intensive applications, and hardware accelerator implementation should be investigated.

An energy-efficient approach to conceive hardware accelerators for digital image and video filters is by solving the MMCM problem [7], [8]. Hence, the hardware filters can be implemented by a tree of adder and shifts in a parallel topology increasing computational performance as well as sharing arithmetic operators to reduce area and power. In the next section, a brief review of previously proposed design methodologies, adder topologies exploration in approximate computing scope, and the hybrid approximate adder are presented.

III. RELATED WORK ON DESIGN METHODOLOGIES FOR APPROXIMATE IMAGE FILTERS AND HYBRID APPROXIMATE ADDERS

Given that an effective approach to conceiving image filters is the add-and-shift implementation, the work in [22] explores the use of Copy adder and ETAI inside add-and-shift accelerators for Gaussian and Gradient image filtering process. Therefore, their approach is to adopt a search heuristic, based on the expected output magnitude, to combine different approximate parameters inside the architectures. Each configuration of approximate adders with different approximation parameters is simulated to evaluate the quality response of these filters. After that, quality constraints are imposed to select a set of configurations which are synthesized and analyzed concerning energy efficiency. The work in [22] presents interesting power results for iso-performance analysis. The power reduction for the best case is about 50% considering 45 nm technology. On the other hand, limited quality and real-time evaluation are presented. Their work did not explore maximum frequency analysis, and the precise part of the adders are implemented with only RCA.

In [23] the ETAI is explored, but the authors propose the use of OR gate instead of XOR ones for the approximate part. They adopt a search heuristic based on the number of adder steps in critical path considering the trees of add-and-shift for Finite Impulse Response (FIR) image filters. Their design methodology is also evaluated for different FIR filters and “Lena” benchmark. Although the authors use a simulation-based analysis, the application quality analysis is not comprehensively explored. Maximum energy reduction of 50.7% is observed for 65 nm ASIC implementation. No observation is performed about different adder topologies exploration in the precise part of the approximate adders. Furthermore, there is not information about real-time analysis of the approximate filters.
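The OR-for-XOR substitution attributed to [23] can be checked with a small behavioral model. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the ETAI control logic described in Section II-A forces the bit position of the first carry-generate, together with every less significant position, to 1. Under that assumption the two gate choices give the same approximate sum, because XOR and OR differ only where both operand bits are 1.

```python
from itertools import product


def etai_lower_part(a: int, b: int, k: int, use_or: bool = False) -> int:
    """Behavioral model of the k-bit approximate part of ETAI (assumed reading).

    Bits are scanned from position k-1 down to 0. Once a position with
    a_i = b_i = 1 (carry-generate) is met, that position and all lower
    positions are forced to 1. Before that, each sum bit is a_i XOR b_i,
    or a_i OR b_i when use_or is True.
    """
    result = 0
    saturate = False
    for i in range(k - 1, -1, -1):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if not saturate and (ai & bi):
            saturate = True  # first carry-generate found
        if saturate:
            bit = 1
        else:
            bit = (ai | bi) if use_or else (ai ^ bi)
        result |= bit << i
    return result


def or_matches_xor(k: int) -> bool:
    """Exhaustively compare the XOR and OR variants for all k-bit operand pairs."""
    return all(
        etai_lower_part(a, b, k, use_or=False) == etai_lower_part(a, b, k, use_or=True)
        for a, b in product(range(1 << k), repeat=2)
    )
```

Calling `or_matches_xor(k)` for small values of k compares every operand pair; if the control logic were instead assumed to leave the generate position itself untouched, the two variants would differ at that bit.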
From the related works, one can conclude that exploring different conventional adder topologies in the precise part is an open research subject. The work in [32] examined different PPA topologies for the computational performance-oriented designs. This is because the ACA structure presented in [16] is based on the Kogge-Stone adder. Therefore, in [32] other PPAs are evaluated for carry speculation adders.
From a different point of view, our approach gives priority to controlling power efficiency and error magnitude by adopting power-oriented approximate adders, and further foresees that performance can be recovered by exploring faster topologies in the precise part of the adders under evaluation.
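The error-magnitude argument can be illustrated with a behavioral model of the Copy adder from Section II-A. This is a minimal sketch under stated assumptions: the function names are hypothetical, operands are unsigned, and the result width is unbounded so overflow is ignored. The low k bits copy operand a, and the precise part adds the remaining bits with the carry-in estimated as bit k−1 of operand b. In this model the error equals b_{k−1}·2^k minus the low k bits of b, so its magnitude never exceeds 2^(k−1).

```python
def copy_adder(a: int, b: int, k: int) -> int:
    """Copy adder model: approximate low k bits, precise upper part.

    Assumptions (not taken from the paper's RTL): unsigned operands and
    an unbounded result width, so no overflow handling is modeled.
    """
    if k == 0:
        return a + b  # no approximate part
    mask = (1 << k) - 1
    carry_in = (b >> (k - 1)) & 1            # carry-in estimate b_{k-1}
    high = (a >> k) + (b >> k) + carry_in    # precise part
    return (high << k) | (a & mask)          # low part copies operand a


def max_abs_error(n_bits: int, k: int) -> int:
    """Exhaustive worst-case |approximate - exact| over all n_bits operands."""
    worst = 0
    for a in range(1 << n_bits):
        for b in range(1 << n_bits):
            worst = max(worst, abs(copy_adder(a, b, k) - (a + b)))
    return worst
```

For example, `copy_adder(13, 6, 2)` returns 21 while the exact sum is 19, an error of 2 = 2^(k−1). The bound depends only on k, which is the sense in which the approximation parameter keeps the error magnitude under control.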
The work in [33] explores the combination of performance- and power-oriented adders to conceive hybrid approaches. They evaluate these hybrid adders in a 40 nm ASIC implementation of a Single Instruction Multiple Data (SIMD) Central Processing Unit (CPU). Their approach is validated for a Sobel operator where limited real-time analysis is performed. According to the authors, the minimum time of 190 μs is taken per 512 × 512 gray scale images, with a maximum energy reduction of 15%. Furthermore, the authors propose a new metric in the arithmetic layer which considers both the probability and magnitude of errors which are directly related to the two classes of approximate adders. On the other hand, quality metrics in edge detection level of abstraction is not observed in their work. The interesting hybrid solution proposed by Najafi et al. [33] presents essential contributions to the scope of approximate computing. However, when adopting performance-oriented approximate adders, the adder topology turns out to be static.

Based on that, our proposed approach is not directly in the same scope as the previous work in [33]. Our solution focuses on exploring different precise adder topologies to rescue performance and further evaluate VOS estimation while maintaining the magnitude sum errors at a low and controlled level, which is desired in many image applications. Furthermore, this work proposes the combination of truncation, Copy adder and ETAI (i.e. power-oriented adders) to reduce power dissipation in parallel shift-and-add ASIC hardware accelerators. Our first explorations by using PPAs in approximate adders regarding Sobel operation were presented in [34] and [35]. On the other hand, these works did not present a complete design methodology either do not explore a hybrid solution for approximate adders considering power efficient solution for the image filter architectures under evaluation.

IV. PROPOSED DESIGN METHODOLOGY

This section presents the proposed design methodology to explore our hybrid approximate adders. The basic design flow is shown in Figure 3. The first step is to simulate the application under evaluation by adopting MatLab or C models considering both the architectures and approximate adders. The hybrid approximate adder models are implemented by an overloaded function in MatLab and C language. Therefore, add-and-shift filter structures are developed in simulation scripts. The function is called to perform the necessary additions, with the appropriate parameters to select between the type of hybrid approximate adder, the number of bits of the approximate part, and so forth. The quality metrics are generated by considering the use of real test-cases. The next stage is related to the application quality evaluation for all the exercised approximate configurations. After that, different levels of quality are selected to generate the Register Transfer Level (RTL) designs automatically. The logic synthesis is performed for each RTL file, by using the standard cell technology library files (i.e., .lib, .lef, cap tables, and so forth). Then, the mapped gate-level netlist is created to enable the next step of post-synthesis simulation. During the simulation, standard cell technology library files (i.e., the Verilog file with the behavioral model of the standard cells) and real test cases are used to capture switching activity which is saved in Value Change Dump (VCD) or Toggle Count Format (TCF) files. The Verilog gate-level netlist, standard cell technology library files, and VCD or TCF files are then used to estimate power. In the next subsections the proposed hybrid approximate adders, the accelerator architectures under analysis, and the proposed heuristic adopted to perform simulation are shown.

Fig. 3. Proposed design methodology.

A. Proposed Hybrid Approximate Adders for Shift-and-Add Architectures

The 5 × 5 Gaussian and the 3 × 3 Sobel image filter architectures under evaluation are similar to the ones adopted
by Oliveira et al. [22]. The difference is observed in the Gaussian architecture, where the partial terms of the adder tree were reorganized to enable left shift overlapping regions in the operands. It is performed to leverage the power efficiency provided by the proposed hybrid adders and to improve the proposed search heuristic which will be presented in the next subsection.

The Gaussian and Gradient architectures can be observed in the Figure 4 and Figure 5, respectively [22]. One can observe that these architectures are implemented by the shift operations, adders, and subtractors. There are two observable configurations in which the proposed hybrid approximate adders can be adopted: (i) both the operands present overlapped number of left shift operations, (ii) one operand present excessive number of left shift operations than the other operand. These aspects can be observed in Figure 6.

Fig. 4. Gaussian image filter architecture.

Fig. 5. Gradient image filter architecture.

As can be seen in Figure 6, if there is overlapping between the number of left shifts in both the operands, then the proposed approach considers the use of truncation adder in the overlapped and least significant region. The excessive amount of left shift operations in one of the operands, when compared to the other can be leveraged with the use of the Copy adder. One can notice in the examples in Figure 6(a) and (b) that these procedures do not produce sum errors. The remaining most significant region in the adder can further be explored with the state-of-the-art Copy adder and ETAI (i.e., precise plus approximate adder).

The example in Figure 6(a) shows the proposed hybrid scheme of copy-copy-truncation. The operands are left shifted two and four times, respectively. One can see that there are three approximation parameters: (i) k1 = 2 which controls the truncation in the overlapped shift region, (ii) k2 = 2 which controls the copy adder in the excessive shift region, and (iii) k3 = 4 which controls the approximate part of the copy adder in the non-shifted region. Almost the same observations can be made in Figure 6(b) for the configuration ETAI-copy-truncation. The only difference is that in the non-shifted region the ETAI is adopted instead of the Copy adder. For the ETAI, we adopted the modification proposed by Kang et al. [23]: the use of OR logic gates instead of XOR. It occurs because there is not difference regarding produced sum result, while the former gate has less area than the latter. Besides, this work also considers the use of carry-in estimation performed in Copy adder for the ETAI. As previously mentioned, this procedure has more probability of getting correct estimation than statically set the carry-in to “0.” If a given adder of the architecture has not left shifted operands, then the approximation is not hybrid (i.e., k1 = 0 and k2 = 0). On the other hand, the copy adder or ETAI can be explored in this adder (i.e., k3 ≥ 0).

B. Proposed Search Heuristic

The exhaustive search in simulation-based methodologies tends to be time consuming or prohibitive to find the most energy-efficient configuration. Therefore, the use of search heuristics is essential in this scenario. As previously shown, related works in [22] and [23] proposed search heuristics to seek for energy-efficient accelerators. In this work, the proposed approach is to first establish the k1 and k2 parameters
Fig. 6. Proposed hybrid approximate adders. (a) copy-copy-truncation adder. (b) ETAI-copy-truncation adder.

Algorithm 1 The Simulation-Based Search Heuristic
Input: The array structure T containing the pairs a and b of left shifted bits per adder in the tree (right shifts and no shifts are treated as zero)
Input: The maximum number of configurations N to be tested
Input: The data set Y of real test cases
Output: The array structure S containing the values for k1 and k2 per adder node in the tree
Output: The 2-D array Q containing the values for k3 per adder node in the tree and N tested configurations
Output: The array P with length N containing the application metric per tested configurations
 1  initialization of the hybrid approximation;
 2  for i ← 0 to length(T) − 1 do
 3      if T[i].a ≥ T[i].b then
 4          S[i].k1 ← T[i].b;
 5          S[i].k2 ← T[i].a − T[i].b;
 6      else
 7          S[i].k1 ← T[i].a;
 8          S[i].k2 ← T[i].b − T[i].a;
 9      end
10      for j ← 0 to N − 1 do
11          Q[j][i] ← j;
12      end
13  end
14  initialization of the iterative search;
15  for j ← 0 to N − 1 do
16      P[j] ← evaluateApplication(S, Q[j], Y);
17  end
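Algorithm 1 can be sketched in a few lines of Python. This is an illustrative transcription rather than the authors' MatLab or C implementation: `search_heuristic` and its argument layout are hypothetical names, T is assumed to be a list of (a, b) left-shift pairs, and `evaluate_application` stands in for the application-level simulation that returns a quality metric such as PSNR.

```python
from typing import Any, Callable, List, Sequence, Tuple

Shifts = Tuple[int, int]    # (a, b): left-shift amounts of the two operands
HybridK = Tuple[int, int]   # (k1, k2) per adder node


def search_heuristic(
    T: Sequence[Shifts],
    N: int,
    Y: Any,
    evaluate_application: Callable[[List[HybridK], List[int], Any], float],
) -> Tuple[List[HybridK], List[List[int]], List[float]]:
    """Sketch of Algorithm 1 (hypothetical Python transcription)."""
    # Lines 1-9: k1 is the overlapped shift amount, k2 the excess shift
    # of the more shifted operand.
    S = [(min(a, b), abs(a - b)) for a, b in T]
    # Lines 10-12: configuration j assigns k3 = j to every adder node.
    Q = [[j] * len(T) for j in range(N)]
    # Lines 14-17: evaluate each configuration on the real test cases.
    P = [evaluate_application(S, Q[j], Y) for j in range(N)]
    return S, Q, P


def toy_metric(S: List[HybridK], k3_per_adder: List[int], Y: Any) -> float:
    """Placeholder metric (assumption): quality falls as k3 grows."""
    return 100.0 - float(sum(k3_per_adder))
```

With `T = [(2, 4), (0, 0), (3, 1)]`, the first adder gets k1 = 2 and k2 = 2, which matches the operands shifted two and four times in the Figure 6(a) example. Because k1 and k2 are fixed up front, only N configurations are simulated instead of every combination of per-adder parameters.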
for the overlapped and excessive shift regions. After that, the next step is to seek different k3 parameters for all the adders in the tree through an iterative process. The algorithm for the heuristic is shown in Algorithm 1. One can observe that the k1 and k2 parameters are fixed, while k3 is iterated to seek different approximate configurations. During the initialization step from lines 1 to 9, the k1 and k2 parameters are determined solely by the overlapped regions previously shown in Figure 6. The iteration in lines 10 to 12 creates and initializes the data structure which stores the k3 values for each adder in the filter architectures. The search for k3 parameters is performed in lines 14 to 17. During the search, the quality metric is stored after running the application for each k3 parameter value. The application quality is evaluated, and the objective metric is calculated, by using real test cases.

Based on the heuristic, eight greyscale images (i.e., 7 MATLAB built-in images plus the “Lena” standard test image) were used to evaluate the Gaussian and Gradient quality. The quality analysis for different k3 configurations is shown in Figure 7. One can observe that the quality metrics decrease when k3 increases. The bars represent the average μ, while the error bars represent one standard deviation (1σ). At the top of each error bar, the coefficient of variation (i.e., σ/μ) is denoted in percentage. The gray and white bars represent the copy-copy-truncation and ETAI-copy-truncation hybrid exploration, respectively. The objective metrics are analyzed by considering the precise filter as the reference.

Figure 7(a) shows the Peak Signal-to-Noise Ratio (PSNR) levels per iterated k3 parameter. One can observe that a high PSNR level is obtained when k3 = 0. In this case, the approximation is only provided by the hybrid approximate adders, as can be seen in Figure 4 and Figure 5. Based on that, one can conclude that the approximation provided by the k1 and k2 parameters inserts a shallow level of errors. Since the hybrid copy-truncation scheme does not produce addition errors, the low PSNR degradation is due to an error in carry-in estimation for the precise part of the hybrid approximate adders. As expected, the average PSNR level decays for both the copy-copy-truncation and the ETAI-copy-truncation when k3 increases. Furthermore, the coefficient of variation increases when the approximation is progressively explored. In Figure 7(a), the variability remains lower than 15% for all the cases.

In Figure 7(b), the Performance Conformance is observed for edge detection regarding the Sobel operator in Figure 5. The Performance Conformance is defined as in (4), (5), and (6) [36].

$$\mathrm{recall} = \frac{tp}{tp + fn} \tag{4}$$
$$\mathrm{precision} = \frac{tp}{tp + fp} \tag{5}$$

$$PC = \min(\mathrm{recall}, \mathrm{precision}) \tag{6}$$

In (4), the recall is defined as the number of pixels which are correctly detected as edges by the approximate solution over the number of pixels which should be detected as edges. Therefore, tp and fn refer to the true positives and the false negatives. In (5), the precision is defined as the number of pixels correctly detected as edges over the total number of pixels detected as edges by the approximate solution. The term fp refers to the false positives. Therefore, the Performance Conformance PC is defined as the minimum between the recall and the precision, and the result can range from 0 to 1.

Fig. 7. Quality Analysis. (a) Gaussian filter. (b) Gradient filter.

One can observe in Figure 7(b) that the PC also decays when k3 increases. The first configuration, in which k3 = 0, presents results near the maximum possible level with very low variability. This can be explained because there are only two hybrid adders approximated by the k2 parameter, as can be observed in Figure 5. It occurs because there is no overlap of left-shifted operations in these adders. The degradation in terms of PC is more aggressive for k3 ≥ 6. At this point, the variability also grows substantially, and these values of k3 should be avoided in architectural design.

In this study, three levels of quality are selected to implement the Gaussian filter architecture: (i) PSNR near and higher than 50 dB, (ii) PSNR near and higher than 40 dB, and (iii) PSNR near and higher than 30 dB. These levels are empirically selected after subjective observation of the output filtered images from the benchmark. For the Gradient filter, the following levels were selected based on the same analysis: (i) PC near and above 0.95 and (ii) PC near and above 0.85. In [36], the maximum achieved PC for the edge detection application is 0.914. The adopted k3 parameters which respect the selected quality levels are shown in Table II.

TABLE II
k3 PARAMETERS FOR THE SELECTED QUALITY PROFILES

V. RESULTS
In this section, the experimental setup, quality results for the two evaluated case studies, energy efficiency analysis, and area evaluation are presented.

A. Experimental Setup
In order to evaluate the quality results, all the approximate configurations were simulated by considering eight images from the Berkeley Segmentation Dataset benchmark [37]. This new set of images is adopted to compare quality metrics with the previous analysis performed during the approximation search shown in Figure 7.

The precise and approximate designs were described in VHDL based on the approximation parameters shown in Table II. For the Gaussian filter case study, three quality levels are under evaluation. Besides, two versions of approximate hybrid adders are adopted. Therefore, this results in six approximate designs. The proposed approach in this work considers the exploration of seven conventional adder topologies for the precise part (i.e., RCA plus 5 PPAs and the adder optimized by the synthesis tool). Thus, the total number of approximate designs under evaluation for the Gaussian filter is 42. The seven precise architectures are also considered in this work since they are the baselines to compare to the approximate solutions. It results in 49 described designs considering the precise and approximate ones. This same analysis is performed on the Sobel operator case study, where a total of 35 designs are explored. Therefore, the total number of described designs is 84.

All the designs are synthesized by adopting the RTL Compiler tool from Cadence, and they are mapped to the 45 nm Nangate Free PDK. The cells’ structures from the designs are preserved to avoid distortions in the adder topologies.
The only exception is for the design where the behavioral
adder description implements the precise part (i.e., the “+”
operator in Hardware Description Language). For this case,
the synthesis tool is allowed to optimize this adder component
for low power. In this work, this optimized implementation is
identified as the “tool adder.” As presented in the proposed
design methodology in Figure 3, the switching activity is
extracted by considering 10 000 blocks of size 5 × 5 and 3 × 3
from images of the tested benchmark. Maximum frequency
analysis is performed by the use of the bisection search
method. In addition to the maximum frequency, two clock
frequency constraints are also adopted during synthesis:
(i) 63 MHz and (ii) 249 MHz. These frequencies are the
minimum clock frequency targets to achieve real-time for
Gaussian and Sobel operation considering Full High Defini-
tion (1920×1080) and Ultra High Definition 4K (3840×2160)
video resolutions at 30 frames per second. The Mean Energy
per Operation (MEOp) is obtained based on the period targets
and the estimated power dissipation. Energy reduction is
also evaluated for all the approximate designs relative to their
corresponding precise versions. The maximum and minimum
energy reductions are selected per filter design and frequency
target as shown in Figure 9. Therefore, they are obtained
considering all the approximate versions under analysis.
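The two clock targets and the energy figures above can be reproduced with simple arithmetic. The sketch below is illustrative only and rests on two assumptions that are not stated explicitly in the text: the accelerator produces one output pixel per clock cycle, and the Mean Energy per Operation is the estimated power multiplied by the clock period. The function names and the power values in the usage lines are hypothetical.

```python
import math


def min_clock_hz(width: int, height: int, fps: int, pixels_per_cycle: int = 1) -> float:
    """Minimum clock frequency for real-time filtering.

    Assumption: `pixels_per_cycle` output pixels are produced per clock cycle.
    """
    return width * height * fps / pixels_per_cycle


def mean_energy_per_op_pj(power_w: float, clock_hz: float) -> float:
    """Mean Energy per Operation in pJ, taken as power times clock period (assumption)."""
    return power_w / clock_hz * 1e12


def energy_reduction_pct(e_precise: float, e_approx: float) -> float:
    """Energy reduction of an approximate design relative to its precise version."""
    return 100.0 * (1.0 - e_approx / e_precise)


if __name__ == "__main__":
    full_hd = min_clock_hz(1920, 1080, 30)  # 62 208 000 Hz
    uhd_4k = min_clock_hz(3840, 2160, 30)   # 248 832 000 Hz
    # Rounded up to whole MHz, these match the 63 MHz and 249 MHz targets.
    print(math.ceil(full_hd / 1e6), math.ceil(uhd_4k / 1e6))  # 63 249

    # Hypothetical power values, for illustration only.
    e_precise = mean_energy_per_op_pj(power_w=2.0e-3, clock_hz=249e6)
    e_approx = mean_energy_per_op_pj(power_w=0.8e-3, clock_hz=249e6)
    print(round(energy_reduction_pct(e_precise, e_approx), 1))  # 60.0
```

Under the one-pixel-per-cycle assumption, the pixel rates of both resolutions at 30 frames per second fall just below 63 MHz and 249 MHz, which is consistent with the constraints adopted during synthesis.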
B. Gaussian and Gradient Quality Results
As can be seen in Figure 8, the PSNR and Performance Conformance metrics tend to be respected when a different benchmark is adopted to evaluate the approximate configurations. The lowest quality targets of 30 dB and 0.85 are the ones which present higher degradation and coefficient of variation. When quality increases for both applications, the variability is reduced. These results show that the Gaussian and Gradient responses follow the same behavior as the previous analysis to seek approximate parameters shown in Figure 7.

Fig. 8. Quality Results for the BSD benchmark. (a) Gaussian filter. (b) Gradient filter.

The proposed design methodology enabled the comparison between the copy- and ETAI-based hybrid approximations. One can notice that our approach can be used to compare other approximate adders in a simulation-based design flow. One can conclude that the copy-based approximation tends to present better results than the ETAI. The exception is for the 30 dB and 0.85 targets, where the ETAI-based approximation presents the better results. It can be explained by the k3 parameter choice shown in Table II. For example, k3 = 7 was the selected parameter for the Gaussian filter with a target of 30 dB. On the other hand, higher parameters tend to suffer from higher variability, and this may produce lower or higher results.

C. Energy Efficiency Results
Figure 9(a) and (b) show the Mean Energy per Operation in pJ for the Gaussian and Gradient filters, respectively. The columns represent the precise filter results plus the following approximate versions: (i) ETAI-copy-truncation with PSNR and Performance Conformance targets of 50 dB and 0.95, respectively, and (ii) copy-copy-truncation with PSNR and Performance Conformance targets of 30 dB and 0.85, respectively. Those two approximate profiles represent the highest and the lowest Mean Energy per Operation consumption among all the approximate versions for all the cases. The operating frequencies are 63 MHz and 249 MHz, to filter videos at 30 fps at the Full HD and Ultra HD 4K video resolutions.

As expected, the copy-copy-truncation presents higher energy reductions than the ETAI. It occurs due to two significant observations: (i) the ETAI has more logic complexity than the copy adder, and (ii) the copy adder presents less quality distortion than the ETAI, which enables more approximation. When considering the precise versions as the baseline architecture, one can observe that a wide range of energy savings can be analyzed. In the Gaussian architecture, the energy reduction is more expressive than in the Gradient filter. It is due to the massive presence of adders in the Gaussian adder tree shown in Figure 4 and Figure 5. The energy reductions in Figure 9(a) for 63 MHz range from 8.3% to 73.2%, while for 249 MHz they are from 7.7% up to 70.9%. The energy reductions in Figure 9(b) for 63 MHz range from 18.7% to 57.2%, while for 249 MHz they are from
13.4% up to 52.3%. These results clearly show that our design methodology and proposed hybrid approximate adders provide energy efficiency for compute-intensive image/video filters.

Fig. 9. Mean Energy per Operation. (a) Gaussian filter. (b) Gradient filter.

The Gaussian and Gradient filters fully implemented by the precise “tool adder” present the lowest energy consumption when compared to the PPA and RCA versions. One can conclude that the synthesis tool makes substantial effort to build low power designs. Also, one can observe that our proposed approximate approach further improves the energy efficiency in this scenario. For instance, energy reductions provided by the hybrid approximate solution range from 6.3% up to 64.4% when considering the Gaussian filter with the precise part implemented by the “tool adder” and an operating frequency of 249 MHz. These energy reductions are of up to 66.2% when the clock frequency target for the Gaussian filter is 63 MHz.

As expected, the PPA versions consume more energy than the RCA-based version, as can be seen in Figure 9. On the other hand, Table III and Table IV show the maximum frequency reached by the precise Gaussian and Sobel operator implemented by the 5 PPAs and the RCA topology, respectively. This result shows that the RCA-based filter is the slowest hardware, and, therefore, the PPA-based designs are evaluated under VOS operation considering the frequency of 249 MHz.

TABLE III
VOLTAGE OVER SCALING ANALYSIS @ 249 MHz FOR GAUSSIAN FILTER

TABLE IV
VOLTAGE OVER SCALING ANALYSIS @ 249 MHz FOR GRADIENT FILTER

As shown in [13], the VOS technique consists of scaling down VDD without scaling the clock frequency accordingly. The circuit delay (as in all digital CMOS) is inversely proportional to the voltage supply VDD, as demonstrated in [13]. Therefore, they propose a VOS model, shown in (7), to calculate the lower boundary of the scaled VDD (VDDscaled) which still avoids timing-induced errors. We consider this model to estimate additional dynamic power reduction when applying VOS in adder topologies which are faster than the RCA. This reduction is significant, as the dynamic power is directly proportional to VDD². In (7), the term slack refers to the difference between the minimum period achieved by a given PPA topology and the baseline RCA. The term Tc
denotes the period of the operating clock, which in this part of the study is 1/249000000 seconds. Since under the VOS condition the clock frequency is not scaled accordingly when VDD is reduced, there is no performance penalty. Also, most of the faster PPAs present the dynamic power (Pdyn) reduction shown in Tables III and IV when compared to the RCA baseline filter. One can conclude that the dynamic power reduction reaches up to 17.4% (Ladner-Fischer version) and 19.3% (Kogge-Stone version) when considering the Gaussian and Gradient operators, respectively. Besides the additional dynamic power savings of the PPA adders, they can accomplish higher frame rates to process higher video resolutions than the RCA. Therefore, the use of PPA adders may be preferable to the RCA, depending on the observed scope.

$$VDD_{scaled} = VDD\left(1 - \frac{slack}{T_c}\right) \tag{7}$$

The “tool adder” is not exercised in this context because the commercial synthesis tool tends to push the limits to achieve the highest possible clock frequency. This procedure may conceive a gate-level netlist which is substantially different from the one synthesized for 249 MHz. Based on that, the maximum frequency may not represent a fair analysis considering VOS operation.

D. Area Analysis
The area analysis is shown in Table V for the Gaussian and Gradient filters with a clock frequency of 249 MHz. The number of cells and area (μm²) are shown for the precise designs plus the same approximate configurations previously shown in Figure 9 and enumerated as follows: (i) the copy-copy-truncation with PSNR and Performance Conformance targets of 30 dB and 0.85, and (ii) the ETAI-copy-truncation with PSNR and Performance Conformance targets of 50 dB and 0.95. These configurations were chosen because they present the maximum and minimum area reductions among all the approximate designs. The results are organized per image filter and approximate configuration. One can observe that the “tool adder” and the Kogge-Stone (KS) topology implemented in the precise parts of the hybrid approximate adders represent the minimum and maximum area reductions for almost all the cases, respectively. It is expected since the “tool adder” is optimized for low power, while the KS has the highest area. Based on that, these reductions are of 67.4% up to 73.8% for the Gaussian filter, and 39.5% up to 48.8% for the Sobel operator implemented by the copy-copy-truncation. When considering the ETAI-copy-truncation, the reductions are of 6.9% up to 14.1% for the Gaussian filter, and of 8.2% up to 21.8% for the Gradient filter. Following the same conclusions made in the Energy Efficiency analysis, the area reductions ratify the contributions of this study.

TABLE V
AREA (μm²) ANALYSIS @ 249 MHz

E. Energy Efficiency Vs. Application Quality
In this subsection, the main objective is to evaluate the relationship between application quality and energy consumption. One can observe in Figure 10(a) and (b) that energy consumption rises when the application quality is improved. Figure 10(a) and (b) show all the evaluated approximate configurations for the Kogge-Stone and “tool adder” precise parts. These precise parts were selected because they represent the highest and the lowest energy consumption among all the conventional topologies. The results for the Gaussian filter are shown in Figure 10(a). One can observe that, for all the approximate versions, the energy consumption increases when higher PSNR quality is demanded. The same can be observed for the Gradient filter in Figure 10(b). These results are expected since higher quality profiles are associated with designs which are less approximated.

VI. COMPARISON WITH RELATED WORK
As previously mentioned in the related work section, essential contributions were given in [22], [23], and [33]. In [22], the maximum energy reductions of 26.9% and 60% are provided for the 45 nm ASIC implementation of Gaussian and Gradient filters, respectively. On the other hand, the quality analysis is limited, since the Sobel operator is evaluated by adopting the PSNR quality metric instead of Performance Conformance, which is more appropriate for the edge detection scope. Also, only
the RCA is explored in the precise part of approximate adders, and no maximum frequency analysis is presented.

Fig. 10. Mean energy per operation vs. application quality. (a) Gaussian filter. (b) Gradient filter.

In [23], the maximum energy reduction is of 50.7% for a set of 65 nm ASIC FIR filters approximated by their proposed synthesis flow. On the other hand, the quality analysis is given regarding accuracy, and no evaluation at the application layer is observed. Furthermore, the authors did not evaluate a real-time scenario for the FIR filters under analysis.

In [33], the maximum energy reduction is of 15% considering the Sobel operator application being processed by their proposed 40 nm ASIC SIMD co-processor. On the other hand, no quality evaluation is performed at the application level.

The design methodology proposed in this work uses the same case studies and technology exercised in [22]. Based on that, one can conclude that our proposed approach presents substantial reductions of up to 73.2% and 57.2% for the Gaussian and Gradient filters, respectively.

In the scope of search heuristic and design methodology, this work also introduces new contributions when compared to the related works. The energy efficiency results validate the effort of seeking hybrid approximate adders inside shift-and-add accelerators. Table VI shows an overall comparison among this work and the related ones.

TABLE VI
COMPARISON WITH RELATED WORK

Among the related works, [33] is the only one which proposes a real-time analysis for the Sobel operation. According to the authors, the precise approach reaches a maximum performance of 215 μs per 512 × 512 greyscale frame, while the energy consumption is approximately 9 μJ per frame. When considering the most energy-efficient approximate approach, the authors show that the maximum performance is of 195 μs, while the energy consumption is approximately 8.2 μJ per frame. In this work, the most energy-efficient approach without considering VOS estimation is the RCA topology. One can conclude, by examining the frequency of 249 MHz, that the precise Sobel operator, based on RCA, takes 1052 μs to process a 512 × 512 greyscale image, while the energy consumption is of 0.34 μJ per frame. The most energy-efficient approximate Sobel architecture is the copy-copy-truncation with a Performance Conformance target of 0.85. The approximate approach takes the same time to process 512 × 512 greyscale images, while the energy consumption is of 0.2 μJ per frame. When comparing precise approaches (i.e., the same level of quality), our 45 nm accelerator architecture presents an energy reduction of 96.2% when compared to the precise 40 nm SIMD co-processor running the Sobel operator in [33].

The limitations of this work are related to the filter designs being implemented in the application-specific architecture scope. Therefore, changes in filter kernels are not possible after the ASIC is fabricated. When a general-purpose scenario is required, the work in [33] is of remarkable use and importance.

On the other hand, our approach is of notable contribution in the design of ASIC accelerators which can be integrated into heterogeneous architectures (i.e., integration of general-purpose processors plus ASIC accelerators in a chip). It is because the results of this work indicate that substantial energy reduction is observed every time a predefined and static filter kernel is required during a given application.

VII. CONCLUSION
This work proposed a novel design flow methodology to cope with energy efficiency in CMOS technology. The proposed solution is focused on exploring approximation in add-and-shift architectures to reduce power consumption and increase computational performance. The proposed search heuristic and hybrid approximate adders presented substantial energy reductions of up to 73.2% when compared to the precise and baseline architectures. PPA topologies were explored, to rescue performance, in addition to the RCA-based approach, where VOS estimation shows additional dynamic power reduction of up to 19.3%. Area reduction up to 73.8% is also observed in this study. The proposed design methodology
also enabled a more comprehensive observation of application quality by considering average results and variability. Comparison with state-of-the-art related work is provided, showing the contributions of this work for low power digital CMOS design and approximate computing scope. Future work and effort are focused on giving configurable capabilities to the filter images under analysis, thus enabling the exploration of different power-performance profiles during the execution time.

REFERENCES
[1] R. H. Dennard, “Past progress and future challenges in LSI Technology: From DRAM and scaling to ultra-low-power CMOS,” IEEE Solid-State Circuits Mag., vol. 7, no. 2, pp. 29–38, 2015.
[2] J. Henkel, H. Khdr, S. Pagani, and M. Shafique, “New trends in dark silicon,” in Proc. 52nd ACM/EDAC/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2015, pp. 1–6.
[3] M. Shafique, S. Garg, J. Henkel, and D. Marculescu, “The EDA challenges in the dark silicon era,” in Proc. 51st ACM/EDAC/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2014, pp. 1–6.
[4] R. Iyer, “Accelerator-rich architectures: Implications, opportunities and challenges,” in Proc. 17th Asia South Pacific Design Automat. Conf., Sydney, NSW, Australia, Jan./Feb. 2012, pp. 106–107.
[5] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, K. Gururaj, and G. Reinman, “Accelerator-rich architectures: Opportunities and progresses,” in Proc. 51st ACM/EDAC/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2014, pp. 1–6.
[6] R. Hameed et al., “Understanding sources of inefficiency in general-purpose chips,” ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, pp. 37–47, Jun. 2010.
[7] Y. Voronenko and M. Püschel, “Multiplierless multiple constant multiplication,” ACM Trans. Algorithms, vol. 3, no. 2, p. 11, May 2007.
[8] L. Aksoy, E. Costa, P. Flores, and J. Monteiro, “Optimization of area and delay at gate-level in multiple constant multiplications,” in Proc. 13th Euromicro Conf. Digit. Syst. Design: Architectures, Methods Tools, Lille, France, Sep. 2010, pp. 3–10.
[9] J. Han and M. Orshansky, “Approximate computing: An emerging paradigm for energy-efficient design,” in Proc. 18th IEEE Eur. Test Symp. (ETS), Avignon, France, May 2013, pp. 1–6.
[10] S. Venkataramani, S. T. Chakradhar, K. Roy, and A. Raghunathan, “Approximate computing and the quest for computing efficiency,” in Proc. 52nd ACM/EDAC/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2015, pp. 1–6.
[11] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, “Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 8, pp. 1225–1229, Aug. 2010.
[12] Q. Xu, T. Mytkowicz, and N. S. Kim, “Approximate computing: A survey,” IEEE Design Test, vol. 33, no. 1, pp. 8–22, Feb. 2016.
[13] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, “Low-power digital signal processing using approximate adders,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124–137, Jan. 2013.
[14] N. Zhu, W. L. Goh, G. Wang, and K. S. Yeo, “Enhanced low-power high-speed adder for error-tolerant application,” in Proc. Int. SoC Design Conf. (ISOCC), Seoul, South Korea, Nov. 2010, pp. 323–327.
[15] N. Zhu, W. L. Goh, and K. S. Yeo, “An enhanced low-power high-speed adder for error-tolerant application,” in Proc. 12th Int. Symp. Integr. Circuits (ISIC), Singapore, Dec. 2009, pp. 69–72.
[16] A. K. Verma, P. Brisk, and P. Ienne, “Variable latency speculative addition: A new paradigm for arithmetic circuit design,” in Proc. Design, Automat. Test Eur. (DATE), Munich, Germany, 2008, pp. 1250–1255.
[17] R. Ye, T. Wang, F. Yuan, R. Kumar, and Q. Xu, “On reconfiguration-oriented approximate adder design and its application,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), San Jose, CA, USA, Nov. 2013, pp. 48–54.
[18] M. Shafique, W. Ahmad, R. Hafiz, and J. Henkel, “A low latency generic accuracy configurable adder,” in Proc. 52nd ACM/EDAC/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2015, pp. 1–6.
[19] A. B. Kahng and S. Kang, “Accuracy-configurable adder for approximate arithmetic designs,” in Proc. Design Autom. Conf. (DAC), San Francisco, CA, USA, Jun. 2012, pp. 820–825.
[20] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, “Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 4, pp. 850–862, Apr. 2010.
[21] S. Rehman, W. El-Harouni, M. Shafique, A. Kumar, and J. Henkel, “Architectural-space exploration of approximate multipliers,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Austin, TX, USA, Nov. 2016, pp. 1–8.
[22] J. de Oliveira, L. Soares, E. Costa, and S. Bampi, “Exploiting approximate adder circuits for power-efficient Gaussian and Gradient filters for Canny edge detector algorithm,” in Proc. IEEE 7th Latin Amer. Symp. Circuits Systems (LASCAS), Florianopolis, Brazil, Feb./Mar. 2016, pp. 379–382.
[23] Y. Kang, J. Kim, and S. Kang, “Novel approximate synthesis flow for energy-efficient FIR filter,” in Proc. IEEE 34th Int. Conf. Comput. Design (ICCD), Scottsdale, AZ, USA, Oct. 2016, pp. 96–102.
[24] A. Beaumont-Smith and C.-C. Lim, “Parallel prefix adder design,” in Proc. 15th IEEE Symp. Comput. Arithmetic, Vail, CO, USA, Jun. 2001, pp. 218–225.
[25] D. L. Harris, “Parallel prefix networks that make tradeoffs between logic levels, fanout and wiring tracks,” U.S. Patent 7 152 089 B2, Dec. 19, 2006.
[26] R. P. Brent and H. T. Kung, “A regular layout for parallel adders,” IEEE Trans. Comput., vol. C-31, no. 3, pp. 260–264, Mar. 1982.
[27] P. M. Kogge and H. S. Stone, “A parallel algorithm for the efficient solution of a general class of recurrence equations,” IEEE Trans. Comput., vol. C-22, no. 8, pp. 786–793, Aug. 1973.
[28] T. Han and D. A. Carlson, “Fast area-efficient VLSI adders,” in Proc. IEEE 8th Symp. Comput. Arithmetic, Como, Italy, May 1987, pp. 49–56.
[29] R. E. Ladner and M. J. Fischer, “Parallel prefix computation,” J. ACM, vol. 27, no. 4, pp. 831–838, Oct. 1980.
[30] J. Sklansky, “Conditional-sum addition logic,” IRE Trans. Electron. Comput., vol. EC-9, no. 2, pp. 226–231, Jun. 1960.
[31] J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.
[32] D. Esposito, D. De Caro, and A. G. M. Strollo, “Variable latency speculative parallel prefix adders for unsigned and signed operands,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 8, pp. 1200–1209, Aug. 2016.
[33] A. Najafi, M. Weißbrich, G. P. Vayá, and A. Garcia-Ortiz, “Coherent design of hybrid approximate adders: Unified design framework and metrics,” IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 4, pp. 736–745, Dec. 2018.
[34] M. Macedo, L. Soares, B. Silveira, C. M. Diniz, and E. A. C. da Costa, “Exploring the use of parallel prefix adder topologies into approximate adder circuits,” in Proc. 24th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Batumi, Georgia, Dec. 2017, pp. 298–301.
[35] L. B. Soares, M. M. A. da Rosa, C. M. Diniz, E. A. C. da Costa, and S. Bampi, “Exploring power-performance-quality tradeoff of approximate adders for energy efficient sobel filtering,” in Proc. IEEE 9th Latin Amer. Symp. Circuits Syst. (LASCAS), Puerto Vallarta, Mexico, Feb. 2018, pp. 1–4.
[36] J. Lee, H. Tang, and J. Park, “Energy efficient canny edge detector for advanced mobile vision applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 4, pp. 1037–1046, Apr. 2018.
[37] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. 8th IEEE Int. Conf. Comput. Vis. (ICCV), Vancouver, BC, Canada, Jul. 2001, pp. 416–423.

Leonardo Bandeira Soares (S’12) received the Engineering degree in computer engineering from the Federal University of Rio Grande, Rio Grande, Brazil, in 2010, and the M.Sc. and Ph.D. degrees in microelectronics from the Federal University of Rio Grande do Sul, Porto Alegre, Brazil, in 2013 and 2018, respectively. He is currently a Professor with the Federal Institute of Technology of Rio Grande do Sul. His research interests are very large-scale integration architectures, approximate computing, video coding, digital signal processing, and energy efficiency in complementary metal–oxide–semiconductor design.
Morgana Macedo Azevedo da Rosa received the degree in computer engineering from the Catholic University of Pelotas, Pelotas, Brazil, where she is currently pursuing the master’s degree in electronic engineering and computing. Her research interests are arithmetic circuits and very large-scale integration design.

Cláudio Machado Diniz (S’08–M’15) received the degree in computer engineering from the Federal University of Rio Grande, Brazil, in 2007, and the M.Sc. and Ph.D. degrees in computer science from the Federal University Rio Grande do Sul, Brazil, in 2009 and 2015, respectively. He was an Intern Researcher with the Chair for Embedded Systems, Karlsruhe Institute of Technology, Karlsruhe, Germany. He is currently an Assistant Professor with the Catholic University of Pelotas, Pelotas, Brazil. His research interests include image and video processing algorithms, architectures, and very large-scale integration design.

Eduardo Antonio César da Costa (M’13) received the Engineering degree in electrical engineering from the University of Pernambuco, Recife, Brazil, in 1988, the M.Sc. degree in electrical engineering from the Federal University of Paraiba, Campina Grande, Brazil, in 1991, and the Ph.D. degree in computer science from the Federal University of Rio Grande do Sul, Porto Alegre, Brazil, in 2002. Part of his doctoral work was developed at the Instituto de Engenharia de Sistemas Computadores, Lisbon, Portugal. He is currently a Full Professor with the Catholic University of Pelotas (UCPel), Pelotas, Brazil. He is a Co-Founder and a Coordinator of the Graduate Program on Electronic Engineering and Computing, UCPel. His research interests are very large-scale integration architectures and low-power design.

Sergio Bampi (M’86–SM’17) received the Electronics Engineer degree and the B.Sc. degree in physics from the Federal University of Rio Grande do Sul in 1979 and the M.S.E.E. degree and the Ph.D. degree in electrical engineering from Stanford University in 1982 and 1986, respectively. In 1981, he joined the Informatics Institute, Federal University of Rio Grande do Sul, Brazil, where he is currently a Full Professor. He has published more than 380 research papers in conferences and journals, in the fields of complementary metal–oxide–semiconductor analog, digital, and RF design, video coding algorithms, and dedicated hardware architectures. He was the President of the Brazilian Microelectronics Society and the FAPERGS Brazilian research funding agency, and the CEITEC Technical Director. He was a Distinguished Lecturer of the IEEE CAS Society from 2009 to 2010. He served as the Technical Program Chair for SBCCI in 1997 and 2005, the IEEE LASCAS in 2013, VARI in 2015, and SBMICRO Congress in 1989 and 1995, and served on TPC committees for ICCAD, SBCCI, ICM, LASCAS, VLSI-SoC, ICECS, and many other international conferences.
