1 - 04 - 2019 - Design Methodology To Explore Hybrid Approximate Adders For Energy-Efficient Image and Video Processing Accelerators
Abstract— This paper proposes a new design methodology to explore the state-of-the-art approximate adders for accelerator architectures conceived in the realm of multiplier-less multiple constant multiplication optimization problem. The proposed methodology is composed of: 1) a search heuristic to seek faster and feasible approximate configurations for the architectures under evaluation; 2) low-power techniques regarding hybrid approximate adders design for accelerators based on trees of shift-and-add operations; 3) high-performance evaluation by exploring parallel prefix adders and low power analysis through the use of the adder optimized by a commercial synthesis tool in the precise part of the approximate adders; and 4) energy efficiency analysis by considering both the approximate techniques and voltage over scaling estimation. Furthermore, improvements are proposed for the state-of-the-art approximate adders under evaluation in this paper. Two case studies are considered to assess the proposed methodology: 1) Gaussian image filter and 2) Sobel operator. The precise and approximate image filters were described in very high-speed integrated circuits hardware description language regarding the proposed methodology. Results are shown after synthesis to a 45-nm standard cell-based technology, where energy reductions ranging from 7.7% up to 73.2% were experienced for multiple levels of quality considering the applications under analysis.

Index Terms— Image and video processing, energy efficiency, approximate computing, adders, multiplier-less multiple constant multiplication.

I. INTRODUCTION

anymore [1]. Furthermore, power and thermal walls bring much more effort to designers, so that digital CMOS design is facing the so-called “Dark Silicon Era” even considering recent Fin Field-Effect Transistor (FinFET) technologies [2]. Therefore, the current and future computing scenario is characterized by the demand for numerous and ubiquitous compute-intensive applications in constrained power budget digital devices. Based on that, energy-efficient techniques (i.e. maximize the number of arithmetic operations per energy unit) are paramount to cope with the previously observed challenges. According to [3], two trending energy-efficient techniques are listed as follows: (i) accelerator-rich architectures based on Application Specific Integrated Circuits (ASICs) and (ii) Approximate Computing (AC).

Architectural heterogeneity and the use of ASIC accelerators are energy-efficient techniques to execute the most compute-intensive kernels of an application [3]. On the other hand, the remaining tasks which demand less energy consumption can be scheduled for general-purpose processors. As a result, general-purpose processors’ workload is alleviated due to the use of energy-efficient specific processing cores. Power-management schemes can be implemented to power off accelerators or general-purpose cores when not in use, thus respecting the power and thermal constraints. The works in [4] and [5] show that despite the challenges
design, this is performed by designing simpler circuits to speed up the critical path timing and/or to consume less power. Approximate computing techniques take advantage of approximation-tolerant applications which do not need high accuracy all the time but only “good enough” or “sufficiently good” results for output perceptual quality. The work in [10] states the following properties to define an approximation-resilient application: (i) there is no golden or accurate result, but a range of acceptable ones, and (ii) robustness to noisy input data. For example, multimedia applications (e.g., video coding, audio filtering, image processing, and so on), highly demanded by current portable devices, are intrinsically related to human senses. Multimedia applications are, in fact, approximation-tolerant, since [11] states that human senses process analog information and have difficulty perceiving the negative impact of digital approximations. This means that it is possible to adopt approximate computing techniques to improve energy efficiency in multimedia applications by adequately exploring the user experience at different profiles of quality.

A strong point of approximate computing is that this paradigm can be adopted at any abstraction level, from transistor level up to software application [12]. Furthermore, approximate computing can be an additive design component for accelerator-rich architectures. One can consider that the use of approximate hardware accelerators brings further energy efficiency improvements [12]. In the arithmetic layer of abstraction, works in [11] and [13]–[20] have proposed approximate adders. Adders are basic building blocks for several compute-intensive multimedia applications. Therefore, approximate adders could drive energy efficiency for recent digital compute-intensive and approximation-tolerant applications.

Based on that, this work proposes a design methodology to explore state-of-the-art approximate adders for ASIC implementation of add-and-shift accelerators for image and video processing. Previous works in [22] and [23] examined the use of the state-of-the-art approximate adders for image filters. To explore approximation for the architectures, they adopted simulation-based methodologies in which search heuristics are implemented to seek energy-efficient approximate configurations. The approximate adders adopted by these related works are the Approximate Mirror Adder (AMA) [13] and the Error-Tolerant Adder I (ETAI) [11]. They are divided into precise and approximate parts. In both works, only the Ripple Carry Adder (RCA) topology is explored in the precise block of those approximate adders. The same observation is valid for the case study explored in [11]: the precise block of the ETAI is only implemented with the RCA topology.

This work presents four novel contributions in the scope of approximate computing:
1) A faster search heuristic and simulation-based methodology to find feasible configurations with evaluation of multiple levels of quality;
2) A high-performance exploration of approximate hardware accelerators through the use of parallel prefix adders (PPAs) and low power evaluation through the optimized adder from the commercial synthesis tool in the precise part of the approximate adders;
3) Combination of different approximate adders to compose hybrid adders for the add-and-shift architectures;
4) Energy-efficiency analysis based on the approximate configurations and voltage over scaling (VOS) estimation due to the insertion of PPAs.
Results show that our approach substantially reduces energy consumption, with reductions ranging from 7.7% up to 73.2% for different levels of quality.

The remainder of this paper is organized as follows: Section II overviews the approximate and precise adders, as well as the background for the Gaussian and Gradient image filters. Section III presents the related works. Section IV presents the proposed design methodology to explore our hybrid approximate adders to conceive low power accelerators design. In Section V the experimental setup and results are shown. In Section VI the conclusions are drawn.

II. BACKGROUND ON APPROXIMATE ADDERS, PARALLEL PREFIX ADDERS, AND IMAGE FILTERS

A. Approximate Adders

The approximate adders can be classified as computational performance-oriented and power-oriented designs. The former is related to adders divided into m independent blocks or sub-adders to speed up the critical path timing. The claim is that, for random and uniformly distributed pairs of operands, long carry propagation chains rarely occur. Based on that, additional logic is necessary to speculate the carry-in for each sub-adder, since this class of approximate adder breaks the carry propagation in many parts. Examples of adders which improve computational performance are the Error-Tolerant Adder II [15], Error-Tolerant Adder IV [14], and the Almost Correct Adder (ACA) [16]. This class of approximate adders is also characterized by the presence of infrequent and high magnitude sum errors. Therefore, the works in [16]–[19] proposed accuracy-configurable adders to cope with these error characteristics. On the other hand, more logic is added to detect and correct the sum errors.

A different philosophy is to propose power-oriented adders, which generally are divided into two parts: (i) the least significant approximate part and (ii) the most significant accurate part. Examples of power-oriented approximate adders can be observed in [11], [13], and [20]. The principal idea in the approximate part is to replace the full adder cells by simpler adder circuits. Therefore, power reduction is the main objective of this class of adders. Besides, these adders also tend to reduce critical path timing, because there is no carry propagation scheme in the approximate part. One can observe that the classical truncation is a type of power-oriented adder which truncates the least significant full adder cells. This class of approximate adders is also characterized by the presence of frequent and low magnitude sum errors. Such errors are of low magnitude because the bit-width of the approximate part can be controlled through an approximation parameter k. In this work, the proposed approach is to explore the power-oriented adders to give priority to power-efficiency. It is also ratified
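As a minimal sketch of the power-oriented class described above, the Python model below imitates classical truncation: the k least significant bits are dropped and only the upper bits are added precisely. The function names, the default bit-width, and the choice to output zeros in the truncated positions are illustrative assumptions, not details taken from the cited designs.

```python
def truncation_add(a: int, b: int, k: int, width: int = 8) -> int:
    """Behavioural model (assumed) of a truncation adder.

    The k least significant bits of both operands are ignored, so no carry
    is generated below bit k; the remaining upper bits are added exactly.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    return ((a >> k) + (b >> k)) << k


def max_abs_error(k: int, width: int = 8) -> int:
    """Exhaustively find the largest |exact - approximate| for a given k."""
    worst = 0
    for a in range(1 << width):
        for b in range(1 << width):
            error = (a + b) - truncation_add(a, b, k, width)
            worst = max(worst, abs(error))
    return worst


if __name__ == "__main__":
    # Sweep the approximation parameter k on a small 6-bit adder.
    for k in range(5):
        print(f"k={k}: worst-case error = {max_abs_error(k, width=6)}")
```

In this model the error is exactly the sum of the discarded low bits of both operands, so it is bounded by 2^(k+1) − 2. Errors occur for most operand pairs but stay small, and k sets how many full adder cells are removed.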
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
Fig. 5. Gradient image filter architecture.

by Oliveira et al. [22]. The difference is observed in the Gaussian architecture, where the partial terms of the adder tree were reorganized to enable overlapping left-shift regions in the operands. This is done to leverage the power efficiency provided by the proposed hybrid adders and to improve the proposed search heuristic, which will be presented in the next subsection.

The Gaussian and Gradient architectures can be observed in Figure 4 and Figure 5, respectively [22]. One can observe that these architectures are implemented by shift operations, adders, and subtractors. There are two observable configurations in which the proposed hybrid approximate adders can be adopted: (i) both operands present an overlapped number of left shift operations, and (ii) one operand presents a larger number of left shift operations than the other operand. These aspects can be observed in Figure 6.

As can be seen in Figure 6, if there is overlapping between the number of left shifts in both operands, then the proposed approach considers the use of a truncation adder in the overlapped, least significant region. The excess of left shift operations in one of the operands, when compared to the other, can be leveraged with the use of the Copy adder. One can notice in the examples in Figure 6(a) and (b) that these procedures do not produce sum errors. The remaining most significant region in the adder can further be explored with the state-of-the-art Copy adder and ETAI (i.e., precise plus approximate adder).

The example in Figure 6(a) shows the proposed hybrid scheme of copy-copy-truncation. The operands are left shifted two and four times, respectively. One can see that there are three approximation parameters: (i) k1 = 2, which controls the truncation in the overlapped shift region, (ii) k2 = 2, which controls the copy adder in the excessive shift region, and (iii) k3 = 4, which controls the approximate part of the copy adder in the non-shifted region. Almost the same observations can be made in Figure 6(b) for the configuration ETAI-copy-truncation. The only difference is that in the non-shifted region the ETAI is adopted instead of the Copy adder. For the ETAI, we adopted the modification proposed by Kang et al. [23]: the use of OR logic gates instead of XOR. This is possible because there is no difference in the produced sum result, while the former gate has less area than the latter. Besides, this work also considers for the ETAI the carry-in estimation performed in the Copy adder. As previously mentioned, this procedure has a higher probability of producing a correct estimation than statically setting the carry-in to “0.” If a given adder of the architecture does not have left-shifted operands, then the approximation is not hybrid (i.e., k1 = 0 and k2 = 0). On the other hand, the copy adder or ETAI can be explored in this adder (i.e., k3 ≥ 0).

B. Proposed Search Heuristic

The exhaustive search in simulation-based methodologies tends to be time consuming or prohibitive to find the most energy-efficient configuration. Therefore, the use of search heuristics is essential in this scenario. As previously shown, related works in [22] and [23] proposed search heuristics to seek energy-efficient accelerators. In this work, the proposed approach is to first establish the k1 and k2 parameters
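The hybrid scheme of the preceding subsection makes two claims that a small behavioural model can check: truncating the overlapped shift region and copying in the excess shift region introduce no sum error, and replacing XOR by OR in the ETAI approximate part does not change the result. The Python sketch below is illustrative only. It assumes the usual ETAI rule (scan the approximate bits from most to least significant; at the first position where both bits are 1, set that bit and all lower bits to 1), omits the carry-in estimation, and models the non-shifted region as fully precise (k3 = 0). All function names are hypothetical.

```python
def shifted_regions_add(x: int, y: int, s1: int, s2: int) -> int:
    """Add (x << s1) + (y << s2) using truncation and copy in the shifted bits.

    k1 = min(s1, s2) low bits are zero in both operands -> truncated (zeros).
    k2 = |s1 - s2| next bits are zero in one operand    -> copied from the other.
    The remaining upper bits are added precisely (k3 = 0 in this sketch).
    """
    a, b = x << s1, y << s2
    k1 = min(s1, s2)
    k2 = abs(s1 - s2)
    low = k1 + k2
    less_shifted = a if s1 < s2 else b
    copied = less_shifted & (((1 << k2) - 1) << k1)   # copy region, bits k1..low-1
    upper = ((a >> low) + (b >> low)) << low           # precise region
    return upper | copied


def etai_lower(a: int, b: int, k: int, use_or: bool) -> int:
    """Assumed ETAI-style approximate part for the k low bits (no carry chain)."""
    result = 0
    for i in range(k - 1, -1, -1):                     # scan from MSB to LSB
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if ai & bi:
            result |= (1 << (i + 1)) - 1               # this bit and all lower bits = 1
            break
        bit = (ai | bi) if use_or else (ai ^ bi)
        result |= bit << i
    return result


if __name__ == "__main__":
    # Shifts of two and four, as in the copy-copy-truncation example (k1 = 2, k2 = 2).
    exact = all(shifted_regions_add(x, y, 2, 4) == (x << 2) + (y << 4)
                for x in range(64) for y in range(64))
    same = all(etai_lower(a, b, 4, True) == etai_lower(a, b, 4, False)
               for a in range(16) for b in range(16))
    print("shifted regions error-free:", exact)
    print("OR and XOR variants agree:", same)
```

The OR/XOR agreement follows from the control rule: the two gates differ only when both input bits are 1, and in that case the sum bit is forced to 1 anyway.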
Fig. 7. Quality Analysis. (a) Gaussian filter. (b) Gradient filter.

$$\mathrm{precision} = \frac{tp}{tp + fp} \tag{5}$$

$$PC = \min(\mathrm{recall}, \mathrm{precision}) \tag{6}$$

In (4), the recall is defined as the number of pixels which are correctly detected as edges by the approximate solution over the number of pixels which should be detected as edges. Therefore, tp and fn refer to the true positives and the false negatives. In (5), the precision is defined as the number of pixels correctly detected as edges over the total number of pixels detected as edges by the approximate solution. The term fp refers to the false positives. Therefore, the Performance Conformance PC is defined as the minimum between the recall and precision, where the result can range from 0 to 1.

One can observe in Figure 7(b) that the PC also decays when k3 increases. The first configuration, in which k3 = 0, presents results near the maximum possible level with very low variability. This can be explained because there are only two hybrid adders approximated by the k2 parameter, as can be observed in Figure 5. This occurs because there is no overlapping of left-shifted operations in these adders. The degradation in

TABLE II
k3 PARAMETERS FOR THE SELECTED QUALITY PROFILES

V. RESULTS

In this section, the experimental setup, quality results for the two evaluated case studies, energy efficiency analysis, and area evaluation are presented.

A. Experimental Setup

In order to evaluate the quality results, all the approximate configurations were simulated by considering eight images from the Berkeley Segmentation Dataset benchmark [37]. This new set of images is adopted to compare quality metrics with the previous analysis performed during the approximation search shown in Figure 7.

The precise and approximate designs were described in VHDL based on the approximation parameters shown in Table II. For the Gaussian filter case study, three quality levels are under evaluation. Besides, two versions of approximate hybrid adders are adopted. Therefore, this results in six approximate designs. The proposed approach in this work considers the exploration of seven conventional adder topologies for the precise part (i.e., RCA plus 5 PPAs and the adder optimized by the synthesis tool). Thus, the total number of approximate designs under evaluation for the Gaussian filter is 42. The seven precise architectures are also considered in this work since they are the baselines to compare to the approximate solutions. This results in 49 described designs considering the precise and approximate ones. The same analysis is performed on the Sobel operator case study, where a total of 35 designs are explored. Therefore, the total number of described designs is 84.

All the designs are synthesized by adopting the RTL Compiler tool from Cadence, and they are mapped to the 45 nm Nangate Free PDK. The cells’ structures from the designs are preserved to avoid distortions in the adder topologies.
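For the quality evaluation of the edge-detection case study, the Performance Conformance metric of (4)-(6) can be sketched in a few lines of Python. This is only an illustrative model: the function name, the representation of edge maps as nested lists of 0/1 values, and the convention of returning 1.0 when a denominator is zero are assumptions, not details taken from the paper.

```python
from typing import Sequence


def performance_conformance(reference: Sequence[Sequence[int]],
                            approximate: Sequence[Sequence[int]]) -> float:
    """Return PC = min(recall, precision) for two binary edge maps.

    `reference` is the edge map of the precise filter and `approximate`
    the edge map of the approximate one; a non-zero pixel marks an edge.
    """
    tp = fp = fn = 0
    for ref_row, app_row in zip(reference, approximate):
        for ref_px, app_px in zip(ref_row, app_row):
            if app_px and ref_px:
                tp += 1            # edge correctly detected
            elif app_px and not ref_px:
                fp += 1            # edge reported where there is none
            elif ref_px and not app_px:
                fn += 1            # edge missed by the approximate filter
    # Assumed convention: an empty denominator counts as perfect agreement.
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return min(recall, precision)


if __name__ == "__main__":
    precise_edges = [[1, 1, 0],
                     [0, 1, 0]]
    approx_edges = [[1, 0, 0],
                    [1, 1, 1]]
    print(performance_conformance(precise_edges, approx_edges))
```

Taking the minimum penalizes both missed edges and spurious edges, so a configuration cannot score well by over-detecting or under-detecting.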
Fig. 9. Mean Energy per Operation. (a) Gaussian filter. (b) Gradient filter.

13.4% up to 52.3%. These results clearly show that our design methodology and proposed hybrid approximate adders provide energy efficiency for compute-intensive image/video filters.

The Gaussian and Gradient filters fully implemented by the precise “tool adder” present the lowest energy consumption when compared to the PPA adders and the RCA versions. One can conclude that the synthesis tool makes substantial effort to build low power designs. Also, one can observe that our proposed approximate approach further improves the energy efficiency in this scenario. For instance, energy reductions provided by the hybrid approximate solution range from 6.3% up to 64.4% when considering the Gaussian filter with precise part implemented by the “tool adder” and an operating frequency of 249 MHz. These energy reductions reach up to 66.2% when the target clock frequency for the Gaussian filter is 63 MHz.

As expected, the PPA versions consume more energy than the RCA-based version, as can be seen in Figure 9. On the other hand, Table III and Table IV show the maximum frequency reached by the precise Gaussian and Sobel operator implemented by the 5 PPAs and the RCA topology, respectively. This result shows that the RCA-based filter is the slowest hardware, and, therefore, the PPA-based designs are evaluated under VOS operation considering the frequency of 249 MHz.

As shown in [13], the VOS technique consists of scaling down VDD without scaling the clock frequency accordingly. The circuit delay (as in all digital CMOS) is inversely proportional to the supply voltage VDD, as demonstrated in [13]. Therefore, they propose a VOS model, shown in (7), to calculate the lower bound of the scaled VDD (VDDscaled) which still avoids timing-induced errors. We consider this model to estimate additional dynamic power reduction when applying VOS in adder topologies which are faster than the RCA. This reduction is significant, as the dynamic power is directly proportional to VDD^2. In (7), the term slack refers to the difference between the minimum period achieved by a given PPA topology and that of the baseline RCA. The term Tc
denotes the period of the operating clock, which in this part of the study is 1/249000000 seconds. Since under VOS the clock frequency is not scaled accordingly when VDD is reduced, there is no performance penalty. Also, most of the faster PPAs present the dynamic power (Pdyn) reduction shown in Tables III and IV when compared to the RCA baseline filter. One can conclude that the dynamic power reduction reaches up to 17.4% (Ladner-Fischer version) and 19.3% (Kogge-Stone version) when considering the Gaussian and Gradient operators, respectively. Besides the additional dynamic power savings of the PPA adders, they can accomplish higher frame rates to process higher video resolutions than the RCA. Therefore, the use of PPA adders may be preferable to the RCA, depending on the observed scope.

$$V_{DD\mathrm{scaled}} = V_{DD}\left(1 - \frac{\mathit{slack}}{T_c}\right) \tag{7}$$

The “tool adder” is not exercised in this context because the commercial synthesis tool tends to push the limits to achieve the highest possible clock frequency. This procedure may conceive a gate-level netlist which is substantially different from the one synthesized for 249 MHz. Based on that, the maximum frequency may not represent a fair analysis considering VOS operation.

D. Area Analysis

TABLE V
AREA (μm²) ANALYSIS @ 249 MHz

The area analysis is shown in Table V for the Gaussian and Gradient filters with a clock frequency of 249 MHz. The number of cells and area (μm²) are shown for the precise designs plus the same approximate configurations previously shown in Figure 9 and enumerated as follows: (i) the copy-copy-truncation with PSNR and Performance Conformance targets of 30 dB and 0.85, and (ii) the ETAI-copy-truncation with PSNR and Performance Conformance targets of 50 dB and 0.95. These configurations were chosen because they present the maximum and minimum area reductions among all the approximate designs. The results are organized per image filter and approximate configuration. One can observe that the “tool adder” and the Kogge-Stone (KS) topology implemented in the precise parts of the hybrid approximate adders represent the minimum and maximum area reductions for almost all the cases, respectively. This is expected since the “tool adder” is optimized for low power, while the KS has the highest area. Based on that, these reductions are of 67.4% up to 73.8% for the Gaussian filter, and 39.5% up to 48.8% for the Sobel operator implemented by the copy-copy-truncation. When considering the ETAI-copy-truncation, the reductions are of 6.9% up to 14.1% for the Gaussian filter, and of 8.2% up to 21.8% for the Gradient filter. Following the same conclusions made in the Energy Efficiency analysis, the area reductions ratify the contributions of this study.

E. Energy Efficiency Vs. Application Quality

In this subsection, the main objective is to evaluate the relationship between application quality and energy consumption. One can observe in Figure 10(a) and (b) that energy consumption rises when the application quality is improved. Figure 10(a) and (b) show all the evaluated approximate configurations for the Kogge-Stone and “tool adder” precise parts. These precise parts were selected because they represent the highest and the lowest energy consumption among all the conventional topologies. The results for the Gaussian filter are shown in Figure 10(a). One can observe that for all the approximate versions, the energy consumption increases when higher PSNR quality is demanded. The same can be observed for the Gradient filter in Figure 10(b). These results are expected since higher quality profiles are associated with designs which are less approximated.

VI. COMPARISON WITH RELATED WORK

As previously mentioned in the related work section, essential contributions were given in [22], [23], and [33]. In [22] the maximum energy reductions of 26.9% and 60% are provided for the 45 nm ASIC implementation of Gaussian and Gradient filters, respectively. On the other hand, the quality analysis is limited, since the Sobel operator is evaluated by adopting PSNR quality metric instead of Performance Conformance which is more appropriate for edge detection scope. Also, only
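Returning to the VOS model in (7), the short Python sketch below shows how a timing slack translates into a scaled supply voltage and an estimated dynamic power saving. It is a sketch under stated assumptions: only the VDD^2 dependence of dynamic power is modelled (switched capacitance and clock frequency are held constant), and the nominal supply of 1.1 V and the 10% slack are placeholder values, not figures reported in this paper. The function names are hypothetical.

```python
def vdd_scaled(vdd: float, slack: float, t_clk: float) -> float:
    """Scaled supply from (7): VDDscaled = VDD * (1 - slack / Tc)."""
    if not 0.0 <= slack < t_clk:
        raise ValueError("slack must satisfy 0 <= slack < t_clk")
    return vdd * (1.0 - slack / t_clk)


def dynamic_power_saving(vdd: float, slack: float, t_clk: float) -> float:
    """Fractional dynamic power reduction, assuming Pdyn is proportional to VDD^2."""
    ratio = vdd_scaled(vdd, slack, t_clk) / vdd
    return 1.0 - ratio ** 2


if __name__ == "__main__":
    t_clk = 1.0 / 249e6        # clock period for the 249 MHz operating point
    vdd_nominal = 1.1          # volts; placeholder value (assumption)
    slack = 0.10 * t_clk       # hypothetical slack of a faster PPA versus the RCA

    print(f"VDDscaled = {vdd_scaled(vdd_nominal, slack, t_clk):.3f} V")
    print(f"estimated dynamic power saving = "
          f"{100 * dynamic_power_saving(vdd_nominal, slack, t_clk):.1f}%")
```

Because the saving depends only on the ratio slack/Tc in this model, the nominal supply value cancels out of the percentage; it matters only for the absolute scaled voltage.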
TABLE VI
COMPARISON WITH RELATED WORK

also enabled a more comprehensive observation of application quality by considering average results and variability. Comparison with state-of-the-art related work is provided showing the contributions of this work for low power digital CMOS design and approximate computing scope. Future work and effort are focused on giving configurable capabilities to the image filters under analysis, thus enabling the exploration of different power-performance profiles during the execution time.

REFERENCES

[1] R. H. Dennard, “Past progress and future challenges in LSI Technology: From DRAM and scaling to ultra-low-power CMOS,” IEEE Solid-State Circuits Mag., vol. 7, no. 2, pp. 29–38, 2015.
[2] J. Henkel, H. Khdr, S. Pagani, and M. Shafique, “New trends in dark silicon,” in Proc. 52nd ACM/EDAC/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2015, pp. 1–6.
[3] M. Shafique, S. Garg, J. Henkel, and D. Marculescu, “The EDA challenges in the dark silicon era,” in Proc. 51st ACM/EDAC/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2014, pp. 1–6.
[4] R. Iyer, “Accelerator-rich architectures: Implications, opportunities and challenges,” in Proc. 17th Asia South Pacific Design Automat. Conf., Sydney, NSW, Australia, Jan./Feb. 2012, pp. 106–107.
[5] J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, K. Gururaj, and G. Reinman, “Accelerator-rich architectures: Opportunities and progresses,” in Proc. 51st ACM/EDAC/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2014, pp. 1–6.
[6] R. Hameed et al., “Understanding sources of inefficiency in general-purpose chips,” ACM SIGARCH Comput. Archit. News, vol. 38, no. 3, pp. 37–47, Jun. 2010.
[7] Y. Voronenko and M. Püschel, “Multiplierless multiple constant multiplication,” ACM Trans. Algorithms, vol. 3, no. 2, p. 11, May 2017.
[8] L. Aksoy, E. Costa, P. Flores, and J. Monteiro, “Optimization of area and delay at gate-level in multiple constant multiplications,” in Proc. 13th Euromicro Conf. Digit. Syst. Design: Architectures, Methods Tools, Lille, France, Sep. 2010, pp. 3–10.
[9] J. Han and M. Orshansky, “Approximate computing: An emerging paradigm for energy-efficient design,” in Proc. 18th IEEE Eur. Test Symp. (ETS), Avignon, France, May 2013, pp. 1–6.
[10] S. Venkataramani, S. T. Chakradhar, K. Roy, and A. Raghunathan, “Approximate computing and the quest for computing efficiency,” in Proc. 52nd ACM/EDAC/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2015, pp. 1–6.
[11] N. Zhu, W. L. Goh, W. Zhang, K. S. Yeo, and Z. H. Kong, “Design of low-power high-speed truncation-error-tolerant adder and its application in digital signal processing,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 8, pp. 1225–1229, Aug. 2010.
[12] Q. Xu, T. Mytkowicz, and N. S. Kim, “Approximate computing: A survey,” IEEE Design Test, vol. 33, no. 1, pp. 8–22, Feb. 2016.
[13] V. Gupta, D. Mohapatra, A. Raghunathan, and K. Roy, “Low-power digital signal processing using approximate adders,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 32, no. 1, pp. 124–137, Jan. 2013.
[14] N. Zhu, W. L. Goh, G. Wang, and K. S. Yeo, “Enhanced low-power high-speed adder for error-tolerant application,” in Proc. Int. SoC Design Conf. (ISOCC), Seoul, South Korea, Nov. 2010, pp. 323–327.
[15] N. Zhu, W. L. Goh, and K. S. Yeo, “An enhanced low-power high-speed adder for error-tolerant application,” in Proc. 12th Int. Symp. Integr. Circuits (ISIC), Singapore, Dec. 2009, pp. 69–72.
[16] A. K. Verma, P. Brisk, and P. Ienne, “Variable latency speculative addition: A new paradigm for arithmetic circuit design,” in Proc. Design, Automat. Test Eur. (DATE), Munich, Germany, 2008, pp. 1250–1255.
[17] R. Ye, T. Wang, F. Yuan, R. Kumar, and Q. Xu, “On reconfiguration-oriented approximate adder design and its application,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), San Jose, CA, USA, Nov. 2013, pp. 48–54.
[18] M. Shafique, W. Ahmad, R. Hafiz, and J. Henkel, “A low latency generic accuracy configurable adder,” in Proc. 52nd ACM/EDAC/IEEE Design Automat. Conf. (DAC), San Francisco, CA, USA, Jun. 2015, pp. 1–6.
[19] A. B. Kahng and S. Kang, “Accuracy-configurable adder for approximate arithmetic designs,” in Proc. Design Autom. Conf. (DAC), San Francisco, CA, USA, Jun. 2012, pp. 820–825.
[20] H. R. Mahdiani, A. Ahmadi, S. M. Fakhraie, and C. Lucas, “Bio-inspired imprecise computational blocks for efficient VLSI implementation of soft-computing applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 57, no. 4, pp. 850–862, Apr. 2010.
[21] S. Rehman, W. El-Harouni, M. Shafique, A. Kumar, and J. Henkel, “Architectural-space exploration of approximate multipliers,” in Proc. IEEE/ACM Int. Conf. Comput.-Aided Design (ICCAD), Austin, TX, USA, Nov. 2016, pp. 1–8.
[22] J. de Oliveira, L. Soares, E. Costa, and S. Bampi, “Exploiting approximate adder circuits for power-efficient Gaussian and Gradient filters for Canny edge detector algorithm,” in Proc. IEEE 7th Latin Amer. Symp. Circuits Systems (LASCAS), Florianopolis, Brazil, Feb./Mar. 2016, pp. 379–382.
[23] Y. Kang, J. Kim, and S. Kang, “Novel approximate synthesis flow for energy-efficient FIR filter,” in Proc. IEEE 34th Int. Conf. Comput. Design (ICCD), Scottsdale, AZ, USA, Oct. 2016, pp. 96–102.
[24] A. Beaumont-Smith and C.-C. Lim, “Parallel prefix adder design,” in Proc. 15th IEEE Symp. Comput. Arithmetic, Vail, CO, USA, Jun. 2001, pp. 218–225.
[25] D. L. Harris, “Parallel prefix networks that make tradeoffs between logic levels, fanout and wiring racks,” U.S. Patent 7 152 089 B2, Dec. 19, 2006.
[26] R. P. Brent and H. T. Kung, “A regular layout for parallel adders,” IEEE Trans. Comput., vol. C-31, no. 3, pp. 260–264, Mar. 1982.
[27] P. M. Kogge and H. S. Stone, “A parallel algorithm for the efficient solution of a general class of recurrence equations,” IEEE Trans. Comput., vol. C-22, no. 8, pp. 786–793, Aug. 1973.
[28] T. Han and D. A. Carlson, “Fast area-efficient VLSI adders,” in Proc. IEEE 8th Symp. Comput. Arithmetic, Como, Italy, May 1987, pp. 49–56.
[29] R. E. Ladner and M. J. Fischer, “Parallel prefix computation,” J. ACM, vol. 27, no. 4, pp. 831–838, Oct. 1980.
[30] J. Sklansky, “Conditional-sum addition logic,” IRE Trans. Electron. Comput., vol. EC-9, no. 2, pp. 226–231, Jun. 1960.
[31] J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.
[32] D. Esposito, D. De Caro, and A. G. M. Strollo, “Variable latency speculative parallel prefix adders for unsigned and signed operands,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 63, no. 8, pp. 1200–1209, Aug. 2016.
[33] A. Najafi, M. Weißbrich, G. P. Vayá, and A. Garcia-Ortiz, “Coherent design of hybrid approximate adders: Unified design framework and metrics,” IEEE Trans. Emerg. Sel. Topics Circuits Syst., vol. 8, no. 4, pp. 736–745, Dec. 2018.
[34] M. Macedo, L. Soares, B. Silveira, C. M. Diniz, and E. A. C. da Costa, “Exploring the use of parallel prefix adder topologies into approximate adder circuits,” in Proc. 24th IEEE Int. Conf. Electron., Circuits Syst. (ICECS), Batumi, Georgia, Dec. 2017, pp. 298–301.
[35] L. B. Soares, M. M. A. da Rosa, C. M. Diniz, E. A. C. da Costa, and S. Bampi, “Exploring power-performance-quality tradeoff of approximate adders for energy efficient sobel filtering,” in Proc. IEEE 9th Latin Amer. Symp. Circuits Syst. (LASCAS), Puerto Vallarta, Mexico, Feb. 2018, pp. 1–4.
[36] J. Lee, H. Tang, and J. Park, “Energy efficient canny edge detector for advanced mobile vision applications,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 4, pp. 1037–1046, Apr. 2018.
[37] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. 8th IEEE Int. Conf. Comput. Vis. (ICCV), Vancouver, BC, Canada, Jul. 2001, pp. 416–423.

Leonardo Bandeira Soares (S’12) received the Engineering degree in computer engineering from the Federal University of Rio Grande, Rio Grande, Brazil, in 2010, and the M.Sc. and Ph.D. degrees in microelectronics from the Federal University of Rio Grande do Sul, Porto Alegre, Brazil, in 2013 and 2018, respectively. He is currently a Professor with the Federal Institute of Technology of Rio Grande do Sul. His research interests are very large-scale integration architectures, approximate computing, video coding, digital signal processing, and energy efficiency in complementary metal–oxide–semiconductor design.
Morgana Macedo Azevedo da Rosa received the degree in computer engineering from the Catholic University of Pelotas, Pelotas, Brazil, where she is currently pursuing the master’s degree in electronic engineering and computing. Her research interests are arithmetic circuits and very large-scale integration design.

Eduardo Antonio César da Costa (M’13) received the Engineering degree in electrical engineering from the University of Pernambuco, Recife, Brazil, in 1988, the M.Sc. degree in electrical engineering from the Federal University of Paraiba, Campina Grande, Brazil, in 1991, and the Ph.D. degree in computer science from the Federal University of Rio Grande do Sul, Porto Alegre, Brazil, in 2002. Part of his doctoral work was developed at the Instituto de Engenharia de Sistemas Computadores, Lisbon, Portugal. He is currently a Full Professor with the Catholic University of Pelotas (UCPel), Pelotas, Brazil. He is a Co-Founder and a Coordinator of the Graduate Program on Electronic Engineering and Computing, UCPel. His research interests are very large-scale integration architectures and low-power design.