Fully Parallel Stochastic Computing Hardware Implementation of Convolutional Neural Networks for Edge Computing Applications
Abstract— Edge artificial intelligence (AI) is receiving a tremendous amount of interest from the machine learning community due to the ever-increasing popularization of the Internet of Things (IoT). Unfortunately, incorporating AI capabilities into edge computing devices is hampered by the power and area demands of typical deep learning techniques such as convolutional neural networks (CNNs). In this work, we propose a power- and area-efficient architecture based on the exploitation of the correlation phenomenon in stochastic computing (SC) systems. The proposed architecture solves the challenges that a CNN implementation with SC (SC-CNN) may present, such as the high resources used in binary-to-stochastic conversion, the inaccuracy produced by undesired correlation between signals, and the complexity of the stochastic maximum function implementation. To prove that our architecture meets the requirements of edge intelligence realization, we embed a fully parallel CNN in a single field-programmable gate array (FPGA) chip. The results obtained show better performance than traditional binary logic and other SC implementations. In addition, we performed a full VLSI synthesis of the proposed design, showing that it presents better overall characteristics than other recently published VLSI architectures.

Index Terms— Convolutional neural networks (CNNs), edge computing (EC), stochastic computing (SC).

Manuscript received October 20, 2020; revised September 28, 2021 and January 23, 2022; accepted April 1, 2022. This work was supported in part by the Ministerio de Ciencia e Innovación; in part by the European Union NextGenerationEU/PRTR; in part by the European Regional Development Fund (ERDF) under Grant TEC2017-84877-R, Grant PID2019-105556GB-C31, Grant PCI2019-111826-2, Grant PID2020-120075RB-I00, and Grant PDC2021-121847-I00; in part by the ERDF A way of making Europe under Grant MCIN/AEI/10.13039/501100011033; and in part by the European Union NextGenerationEU/PRTR. The work of Pablo Linares-Serrano was supported by the CSIC JAE-Intro-ICU 2019 Scholarship, Instituto de Microelectrónica de Sevilla (IMSE). (Corresponding author: Josep L. Rosselló.)

Christiam F. Frasser, Alejandro Morán, and Erik S. Skibinsky-Gitlin are with the Electronics Engineering Group, Industrial Engineering and Construction Department, University of Balearic Islands, 07122 Palma, Spain.

Pablo Linares-Serrano, Iván Díez de los Ríos, and Teresa Serrano-Gotarredona are with the Instituto de Microelectrónica de Sevilla (IMSE-CNM), CSIC, 41092 Seville, Spain.

Joan Font-Rosselló, Vincent Canals, Miquel Roca, and Josep L. Rosselló are with the Electronics Engineering Group, Industrial Engineering and Construction Department, University of Balearic Islands, 07122 Palma, Spain, and also with the Balearic Islands Health Research Institute (IdISBa), 07120 Palma, Spain (e-mail: j.rossello@uib.es).

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNNLS.2022.3166799.

Digital Object Identifier 10.1109/TNNLS.2022.3166799

I. INTRODUCTION

EDGE computing (EC) is characterized by implementing data processing at the edge of the network [1] instead of doing it at the server level. This has brought about great interest in the microelectronics industry due to the proliferation of the Internet of Things (IoT). At the same time, incorporating artificial intelligence (AI) capabilities into everyday devices has been in the spotlight in recent times, making the development of new techniques to extend AI to edge applications a must [2], [3]. The idea behind these research efforts is to help EC devices further reduce their dependence on cloud processing and the energy associated with data transmission. However, research on edge intelligence is still in its early days, since edge nodes normally present considerable limitations in terms of area and power consumption, which restricts the incorporation of typical state-of-the-art deep learning implementations into embedded devices. Therefore, new solutions for efficient hardware implementations of machine learning applications, such as neuromorphic hardware [4], [5] or convolutional neural networks (CNNs) [6], [7], have recently become a trending topic.

Stochastic computing (SC) is an approximate computing technique that has attracted increasing interest over the last decade. Its capacity to compress complex functions into a low number of logic gates has motivated different proposals for pattern recognition applications [8], hardware ANN accelerators [9]–[14], random vector functional link (RVFL) networks [15], and, more specifically, CNNs [7], [9], [10], [16], [17]. Nonetheless, some realization challenges remain, such as the high resources needed to implement independent random number generators (RNGs), the accuracy degradation produced by the lack of full decorrelation between signals, and the compactness of the convolutional and max-pooling (MP) functions. Tackling these issues is not trivial. Lee et al. [18] approached them by implementing only the first convolutional layer
TABLE I
COMPARISON BETWEEN STOCHASTIC AND TC TECHNIQUES FOR THE PRODUCT OPERATION

Fig. 4. Stochastic diagrams for the estimation of the stochastic function of a logic gate depending on whether signals are (a) uncorrelated or (b) completely correlated.

Fig. 5. FPGA outcome difference when operating with the maximum- and minimum-correlation inputs for the AND and OR gates using 8-bit precision.
z∗ = (N1 − N0)/(N1 + N0) can be easily obtained by estimating these areas, providing the result z∗ = x∗ y∗.

In the case of complete correlation (Rx = Ry), we can follow a similar reasoning but using a single axis instead of two, since both pseudorandom numbers are the same. To create a clear diagram without overlapping areas, and without loss of generality, we can order the input signal values, thus defining the boundaries x = max(X, Y) and y = min(X, Y). Ordering the inputs allows the different areas related to the pair of signals (max{x(t), y(t)}, min{x(t), y(t)}) to be identified, as shown in Fig. 4(b). For the case of the XNOR gate, only the shaded area is related to a high output (z = 1), which corresponds to the bipolar value z∗ = (N1 − N0)/(N1 + N0) = 1 − (x − y)/2^(N−1) = 1 − |x∗ − y∗|. For the AND and OR gates, we can follow a similar procedure, obtaining

    AND(x∗, y∗) = (x∗ y∗ + x∗ + y∗ − 1)/2,   if Rx ≠ Ry
    AND(x∗, y∗) = min(x∗, y∗),               if Rx = Ry        (1)

    OR(x∗, y∗)  = (1 + x∗ + y∗ − x∗ y∗)/2,   if Rx ≠ Ry
    OR(x∗, y∗)  = max(x∗, y∗),               if Rx = Ry.       (2)

Fig. 5 shows the FPGA outcome when operating with maximum-correlation (Rx = Ry) and minimum-correlation (Rx ≠ Ry) inputs for the AND and OR gates using 8-bit precision. As can be observed, the results are completely different depending on the correlation level of the input signals. Notice the difference for the OR gate, where the maximum operation is carried out as long as the inputs are totally correlated. This interesting feature can be exploited to implement essential functions for deep learning applications, such as the rectified linear unit (ReLU) and MP operations employed in CNNs, leading to high-performance architectures that reduce the area and power of hardware implementations.

E. Stochastic Addition

Accurate implementation of stochastic addition continues to be a challenge. Different circuits have been put forward to achieve the task: a simple OR gate, a multiplexer, and an accumulative parallel counter (APC) [25], [26]. Fig. 6 shows the different stochastic addition circuits, where, for the sake of clarity, stochastic signals are hereafter denoted by lowercase letters without the time reference (t). The OR gate used as an adder [Fig. 6(a)] is the smallest circuit in terms of hardware footprint, yet it has two drawbacks: it is inaccurate when the input values are relatively high, and it is very sensitive to correlation. This is why its use as a stochastic addition circuit is ruled out in most applications. The multiplexer [Fig. 6(b)] is one of the most popular circuits for addition. It is low cost in terms of area, and its precision is not affected by the correlation among the inputs. Its main disadvantage is that the inaccuracy increases as the number of inputs grows, making it unsuitable for deep learning implementations, where a high number of inputs per neuron is demanded. The last case is the APC [Fig. 6(c)], which counts the number of high pulses at the inputs and accumulates the counted value over a period of time, producing a binary-coded output.
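To make the correlation-dependent gate behavior of (1) and (2) concrete, the following short simulation (a software sketch with an arbitrary stream length and arbitrary test values, not part of the hardware design) generates bipolar stochastic streams by comparison against either a shared or an independent pseudorandom sequence and decodes the AND, OR, and XNOR gate outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
N_BITS = 1 << 16                     # stream length (arbitrary; longer streams reduce estimation noise)

def to_stream(value, R):
    """Bipolar encoding: P(bit = 1) = (value + 1) / 2, generated by comparison against R in [0, 1)."""
    return ((value + 1) / 2 > R).astype(np.uint8)

def to_value(s):
    """Bipolar decoding: z* = (N1 - N0) / (N1 + N0)."""
    return 2 * s.mean() - 1

x_star, y_star = 0.3, -0.6           # arbitrary test values
Rx, Ry = rng.random(N_BITS), rng.random(N_BITS)

# Uncorrelated streams (Rx != Ry): first branches of (1) and (2); XNOR gives the bipolar product.
xu, yu = to_stream(x_star, Rx), to_stream(y_star, Ry)
print(to_value(xu & yu), (x_star * y_star + x_star + y_star - 1) / 2)   # AND
print(to_value(xu | yu), (1 + x_star + y_star - x_star * y_star) / 2)   # OR
print(to_value(1 - (xu ^ yu)), x_star * y_star)                         # XNOR

# Fully correlated streams (same Rx): AND -> min, OR -> max (second branches of (1) and (2)).
xc, yc = to_stream(x_star, Rx), to_stream(y_star, Rx)
print(to_value(xc & yc), min(x_star, y_star))
print(to_value(xc | yc), max(x_star, y_star))
```

With finite-length streams the uncorrelated results match (1) and (2) only to within the usual sampling error, whereas the correlated min/max results follow directly from the bitwise behavior of the gates.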
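The tradeoffs among the three adders just described can also be checked numerically. The sketch below is a behavioral model under assumed conditions (unipolar streams, sixteen small input values, an arbitrary stream length); it is not a description of the circuits in Fig. 6:

```python
import numpy as np

rng = np.random.default_rng(1)
N_BITS, N_INPUTS = 1 << 14, 16
values = rng.random(N_INPUTS) * 0.1                      # small unipolar inputs (assumed)
streams = (values[:, None] > rng.random((N_INPUTS, N_BITS))).astype(np.uint8)
exact_sum = values.sum()

# (a) OR gate: p_out = 1 - prod(1 - p_i), which underestimates the sum unless all
#     inputs are very small, and it is also sensitive to correlation.
or_sum = np.bitwise_or.reduce(streams, axis=0).mean()

# (b) Multiplexer: one input selected at random per cycle -> unbiased scaled sum,
#     but the selection noise grows with the number of inputs.
sel = rng.integers(0, N_INPUTS, N_BITS)
mux_sum = streams[sel, np.arange(N_BITS)].mean() * N_INPUTS

# (c) APC: counts the high pulses of all inputs each cycle and accumulates the count
#     over the whole stream -> exact up to stream quantization.
apc_sum = streams.sum() / N_BITS

print(f"exact {exact_sum:.3f} | OR {or_sum:.3f} | MUX {mux_sum:.3f} | APC {apc_sum:.3f}")
```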
Fig. 8. Fully parallel stochastic CNN architecture. Only two unique pseudorandom number generators are employed. All neurons work simultaneously in parallel due to the exploitation of the correlation phenomenon. 5 × 5 kernels are used for the convolution layers. Shared signals among neurons are depicted by dashed lines.
[Fig. 7(b)], saving precious area resources, latency time, and energy consumption.

C. Full CNN Architecture

Fig. 8 shows how the whole system is connected to reproduce a CNN design (the LeNet-5). As noted, only two unique pseudorandom number generators [Rx(t) and Rw(t)] are needed to accomplish the overall computation, considerably saving area and power in the design. This is possible thanks to the stochastic neuron design, which exploits correlation and decorrelation for computing. LFSR1 is used to generate Rx(t) and is connected to the input image comparator, the 0∗ signal reference generator, and every stochastic neuron in the whole design [for the APC stochastic generator, see Fig. 7(a)]. LFSR2 is used to generate Rw(t), which is only used to produce the stochastic weights. In this way, each stochastic signal generated by LFSR1 is totally uncorrelated with those generated by LFSR2, allowing neuron inputs to be multiplied by weights with the highest precision. Moreover, the proposed architecture allows neuron outputs from layer li to be connected to the neuron inputs of the next layer li+1 without any risk of signal degradation. Since the li neuron outputs are generated from the first LFSR block [Rx(t)] and the li+1 weights from the second LFSR [Rw(t)], the error induced from layer to layer by the appearance of uncontrolled correlation between signals is totally avoided.

It is important to note that no pruning, weight sharing, or clustering has been carried out; the overall array of weights has been embedded in the design.

As indicated by the dashed lines, Rx(t) and 0∗ are shared throughout the whole network, saving plenty of resources and enabling all neurons to work simultaneously in parallel. Power consumption plummets since no memory accesses for reading and writing intermediate results are necessary.

IV. EXPERIMENTAL RESULTS

A. FEB Evaluation

The feature extraction block (FEB) is defined as the union of convolutional and pooling neurons that generates a single feature point. This block is the base of every convolutional layer and the minimum block required in case a bigger CNN architecture is needed. For this reason, an efficient FEB reverberates in the efficiency of the whole network.

Table II shows different FEB designs found in the SC literature. Each macro-column represents one of the main operations carried out in the FEB: multiplication, addition, activation function, and pooling. For each operation, we show the type, the circuit implemented, and the result of the operation (approximate or exact). As shown, the proposed FEB design has three operations providing exact results while occupying a low area in the pooling and activation function blocks, since only a single OR gate is needed.

In Table III, we compare the frequency, area, power, and energy with respect to the reference HEIF [10] results. The 64-input FEB is made of 4 ReLU neurons of 16 inputs each and a 4-to-1 MP block. The design has been synthesized in TSMC 40-nm CMOS technology using the Cadence Genus tool. We have synthesized two FEB designs: with and without pipelining. The pipeline is accomplished by inserting DFFs in critical paths of the design to improve the frequency of the system. Pipelining is essential if a more complex architecture is required: it allows the whole network to be split into smaller processing elements that can fit in the device and operate in a sequential manner (tiling technique).

Comparing the two proposed FEB designs (pipelined versus nonpipelined), the pipeline optimization achieves 1.6× more clock speed while increasing the area by 1.3×, the power by 1.9×, and the energy by 1.2×. Comparing the proposed pipelined design with the proposal taken from HEIF [10], this work presents a 1.8× increase in processing speed and 3.9× more energy efficiency. The advantage is mainly due to the exploitation of the correlation phenomenon, which achieves exact ReLU and MP functions while reducing the total circuit path delay. The difference in area comes from the APC design used in [10], an approximate APC (AAPC) developed in [29]. That block is considerably smaller than an exact APC, although it is more imprecise.

B. SC-CNN Evaluation

To evaluate the proposed SC design, we have implemented two different CNN architectures: the LeNet-5 and a 30M-operation CNN capable of processing the CIFAR-10 dataset.
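The two-generator scheme described above can be reproduced with a small behavioral model. The sketch below is a software illustration with assumed LFSR polynomials, seeds, and toy values (it is not the authors' RTL): streams built by comparison against the shared sequence Rx(t) are maximally correlated, so an OR gate returns their maximum (the basis of the ReLU and MP blocks), whereas streams built from Rx(t) and the weight sequence Rw(t) are mutually uncorrelated, so an XNOR gate approximates their bipolar product:

```python
import numpy as np

def lfsr8(seed, taps, n=255):
    """Software model of an 8-bit maximal-length LFSR; returns n samples in 1..255."""
    state, out = seed & 0xFF, []
    for _ in range(n):
        out.append(state)
        fb = 0
        for t in taps:                       # XOR of the tapped bits feeds the shift register
            fb ^= (state >> t) & 1
        state = ((state << 1) | fb) & 0xFF
    return np.array(out)

def to_stream(value, R):
    """Bipolar stream with 8-bit precision: P(bit = 1) = (value + 1) / 2."""
    return ((value + 1) / 2 * 255 > R).astype(np.uint8)

def to_value(s):
    return 2 * s.mean() - 1                  # z* = (N1 - N0) / (N1 + N0)

Rx = lfsr8(seed=0xA5, taps=(7, 5, 4, 3))     # shared by the image comparator, the 0* reference, and all neurons
Rw = lfsr8(seed=0x3C, taps=(7, 3, 2, 1))     # used only to generate the stochastic weights

a, b, w = 0.5, -0.25, -0.75                  # toy activation, activation, and weight values

# Streams sharing Rx are fully correlated: OR gives the max (ReLU/MP), AND gives the min.
sa, sb = to_stream(a, Rx), to_stream(b, Rx)
print(to_value(sa | sb), max(a, b))
print(to_value(sa & sb), min(a, b))

# Streams from the two different LFSRs are uncorrelated: XNOR approximates the bipolar product
# (255-bit streams, so some quantization error is expected).
sw = to_stream(w, Rw)
print(to_value(1 - (sa ^ sw)), a * w)
```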
TABLE II
NEURON DESIGN COMPARISON WITH OTHER SC MODELS

TABLE III
FEB PERFORMANCE COMPARISON FOR PIPELINED AND NONPIPELINED ARCHITECTURES
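As a rough consistency check on the pipelined-versus-nonpipelined figures quoted in Section IV-A (an interpretation, assuming that the energy per operation scales as power divided by clock frequency and that throughput scales with the clock):

E_pipelined / E_nonpipelined ≈ 1.9 (power) / 1.6 (clock speed) ≈ 1.19,

in line with the reported 1.2× energy increase.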
TABLE IV
COMPARISON WITH OTHER FPGA LENET-5 IMPLEMENTATIONS

TABLE V
COMPARISON WITH OTHER VLSI LENET-5 IMPLEMENTATIONS
and without a permanent memory access, its comparison with the other works can be considered a worst-case scenario. This is one of the drawbacks of layer-accelerator implementations, where the outputs of intermediate computations must be saved and then read back to carry out the whole processing. By contrast, parallel pipelined architectures do not suffer from this phenomenon because all the parameters are embedded in the system and the intermediate results are directly connected to the next layer, making RAM transactions unnecessary.

To the best of our knowledge, this is the first time an entire fully parallel SC-CNN has been embedded in a single FPGA. This feature is in stark contrast to the studies presented, where the inference operations are realized by using a loop-tiling technique (an optimization approach that reuses the same hardware resources recursively).

In our design, DSP blocks are avoided since an unconventional computing technique (SC) is used instead of traditional binary logic. At the same time, memory blocks are not required since the computation is not performed in a tile-loop manner, thereby getting rid of the principal source of power consumption, which comes from memory access operations.

The complete SC-CNN architecture has also been synthesized in TSMC 40-nm CMOS technology and UMC 250-nm technology using the Cadence Genus tool. The implemented design comprises a total of 913,906 combinational elementary cells (NAND, NOR, and inverter gates) and 104,317 sequential cells. The total area of the full design is 10.88 mm2 in the UMC 250-nm technology node. The design synthesized in TSMC 40 nm takes up a total area of 2.01 mm2 and consumes 651 mW operating at 200 MHz.

Table V summarizes and compares the performance of the synthesized SC-CNN LeNet-5 with other implementations published in the literature. Compared with state-of-the-art implementations of the LeNet-5 using nontraditional logic, the proposed system achieves 1.4× more computational density evaluated in MOPS/mm2, 1.28× more throughput measured in TOPS, 1.58× more energy efficiency expressed in TOPS/watt, 3.8× more area efficiency expressed in TOPS/mm2, 10.4× more throughput measured in images/microsecond, 2× more energy efficiency expressed in images/microjoule, and 3.6× more area efficiency expressed in images/(µs · mm2), compared to the best reference in each case. This is due to the compact implementation of the ReLU function and the MP operation by adequately exploiting the signal correlations. Furthermore, the use of correlated signals allows the architecture to be implemented with a very reduced number of pseudorandom number generators.

2) CIFAR-10 CNN Implementation: In addition, we also present the VLSI synthesis of a bigger CNN, which is able to process the CIFAR-10 dataset. CIFAR-10 consists of 60k 32 × 32 RGB images of real objects, categorized into ten different classes. For training, we use 50k images, leaving 10k for testing. The CNN architecture is formed by two blocks, each containing two convolutional layers plus one MP layer, followed by two FC layers.
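Schematically, the topology just described can be written as the following layer list. This is only an illustration: the structure (two conv-conv-MP blocks plus two FC layers) follows the text, while every channel width and kernel size shown is a hypothetical placeholder, since they are not specified here:

```python
# Illustrative layer list for the CIFAR-10 SC-CNN described above.
# Only the block structure is taken from the text; widths/kernels are hypothetical.
cifar10_cnn = [
    ("conv", {"out_channels": 32, "kernel": 5}),   # block 1, conv 1 (hypothetical width)
    ("conv", {"out_channels": 32, "kernel": 5}),   # block 1, conv 2
    ("maxpool", {"size": 2}),                      # block 1, MP
    ("conv", {"out_channels": 64, "kernel": 5}),   # block 2, conv 1 (hypothetical width)
    ("conv", {"out_channels": 64, "kernel": 5}),   # block 2, conv 2
    ("maxpool", {"size": 2}),                      # block 2, MP
    ("fc", {"out_features": 128}),                 # FC 1 (hypothetical width)
    ("fc", {"out_features": 10}),                  # FC 2: one output per CIFAR-10 class
]
```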
[15] D. Kleyko, M. Kheffache, E. P. Frady, U. Wiklund, and E. Osipov, "Density encoding enables resource-efficient randomly connected neural networks," IEEE Trans. Neural Netw. Learn. Syst., vol. 32, no. 8, pp. 3777–3783, Aug. 2021.
[16] H. Sim and J. Lee, "Cost-effective stochastic MAC circuits for deep neural networks," Neural Netw., vol. 117, pp. 152–162, Sep. 2019.
[17] J. Yu, K. Kim, J. Lee, and K. Choi, "Accurate and efficient stochastic computing hardware for convolutional neural networks," in Proc. IEEE Int. Conf. Comput. Design (ICCD), Nov. 2017, pp. 105–112.
[18] V. T. Lee, A. Alaghi, J. P. Hayes, V. Sathe, and L. Ceze, "Energy-efficient hybrid stochastic-binary neural networks for near-sensor computing," in Proc. Design, Autom. Test Eur. Conf. Exhib. (DATE), Mar. 2017, pp. 13–18.
[19] H. Sim, D. Nguyen, J. Lee, and K. Choi, "Scalable stochastic-computing accelerator for convolutional neural networks," in Proc. 22nd Asia South Pacific Design Automat. Conf. (ASP-DAC), Jan. 2017, pp. 696–701.
[20] W. Huang et al., "FPGA-based high-throughput CNN hardware accelerator with high computing resource utilization ratio," IEEE Trans. Neural Netw. Learn. Syst., early access, Feb. 15, 2021, doi: 10.1109/TNNLS.2021.3055814.
[21] S. Liu, H. Fan, M. Ferianc, X. Niu, H. Shi, and W. Luk, "Toward full-stack acceleration of deep convolutional neural networks on FPGAs," IEEE Trans. Neural Netw. Learn. Syst., early access, Feb. 12, 2021, doi: 10.1109/TNNLS.2021.3055240.
[22] S. Han et al., "EIE: Efficient inference engine on compressed deep neural network," ACM SIGARCH Comput. Archit. News, vol. 44, no. 3, pp. 243–254, 2016.
[23] A. Zhakatayev, S. Lee, H. Sim, and J. Lee, "Sign-magnitude SC: Getting 10X accuracy for free in stochastic computing for deep neural networks," in Proc. 55th ACM/ESDA/IEEE Design Autom. Conf. (DAC), Jun. 2018, pp. 1–6.
[24] A. Morro et al., "A stochastic spiking neural network for virtual screening," IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 4, pp. 1371–1375, Apr. 2018.
[25] B. Parhami and C.-H. Yeh, "Accumulative parallel counters," in Proc. Conf. Rec. 29th Asilomar Conf. Signals, Syst. Comput., vol. 2, 1995, pp. 966–970.
[26] E. E. Swartzlander, "Parallel counters," IEEE Trans. Comput., vol. C-22, no. 11, pp. 1021–1024, Nov. 1973.
[27] P. K. Muthappa, F. Neugebauer, I. Polian, and J. P. Hayes, "Hardware-based fast real-time image classification with stochastic computing," in Proc. IEEE 38th Int. Conf. Comput. Design (ICCD), Oct. 2020, pp. 340–347.
[28] Y. Zhang, X. Zhang, J. Song, Y. Wang, R. Huang, and R. Wang, "Parallel convolutional neural network (CNN) accelerators based on stochastic computing," in Proc. IEEE Int. Workshop Signal Process. Syst. (SiPS), Oct. 2019, pp. 19–24.
[29] K. Kim, J. Lee, and K. Choi, "Approximate de-randomizer for stochastic circuits," in Proc. Int. SoC Design Conf. (ISOCC), Nov. 2015, pp. 123–124.
[30] F. Neugebauer, I. Polian, and J. P. Hayes, "On the maximum function in stochastic computing," in Proc. 16th ACM Int. Conf. Comput. Frontiers, Apr. 2019, pp. 59–66.
[31] Y. LeCun. The MNIST Database of Handwritten Digits. [Online]. Available: http://yann.lecun.com/exdb/mnist/ and https://ci.nii.ac.jp/naid/10027939599/en/
[32] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[33] Gidel Company. Proc10A Board Image. Accessed: Jun. 10, 2020. [Online]. Available: https://www.intel.com/content/dam/altera-www/global/en_US/portal/dsn/3/boardimage-us-dsnbk-3-3405483112768-proc10agxplatform.jpg
[34] Z. Liu et al., "Throughput-optimized FPGA accelerator for deep convolutional neural networks," ACM Trans. Reconfigurable Technol. Syst., vol. 10, no. 3, p. 17, 2017.
[35] Z. Li et al., "Laius: An 8-bit fixed-point CNN hardware inference engine," in Proc. IEEE ISPA/IUCC, Dec. 2017, pp. 143–150.
[36] S.-S. Park, K.-B. Park, and K. Chung, "Implementation of a CNN accelerator on an embedded SoC platform using SDSoC," in Proc. 2nd Int. Conf. Digit. Signal Process., Feb. 2018, pp. 161–165.
[37] A. Sayal, S. S. T. Nibhanupudi, S. Fathima, and J. P. Kulkarni, "A 12.08-TOPS/W all-digital time-domain CNN engine using bi-directional memory delay lines for energy efficient edge computing," IEEE J. Solid-State Circuits, vol. 55, no. 1, pp. 60–75, Jan. 2020.
[38] H. T. Kung, B. McDanel, and S. Q. Zhang, "Packing sparse convolutional neural networks for efficient systolic array implementations: Column combining under joint optimization," in Proc. 24th Int. Conf. Architectural Support Program. Lang. Operating Syst., Apr. 2019, pp. 821–834, doi: 10.1145/3297858.3304028.
[39] M. Dhouibi, A. K. Ben Salem, A. Saidi, and S. Ben Saoud, "Accelerating deep neural networks implementation: A survey," IET Comput. Digit. Techn., vol. 15, no. 2, pp. 79–96, Mar. 2021.

Christiam F. Frasser received the B.Sc. degree in electronics engineering from the University of Libertadores, Bogotá, Colombia, in 2010, and the M.S. degree in electronics systems for smart environments from the University of Málaga, Málaga, Spain, in 2017. He is currently pursuing the Ph.D. degree with the Industrial Engineering and Construction Department, University of the Balearic Islands, Palma, Spain. His current research interests include machine learning implementations in embedded devices.

Pablo Linares-Serrano received the B.Sc. and M.Sc. degrees in telecommunication engineering from the University of Seville, Seville, Spain, in 2019 and 2021, respectively. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA. His research interests include analog and mixed-signal circuit designs, switched-capacitor filters, design of bioinspired vision sensors, and signal processing. Mr. Linares-Serrano served as the Chairman for the IEEE Student Branch at the University of Seville from 2018 to 2019 and from 2020 to 2021.

Iván Díez de los Ríos (Student Member, IEEE) received the B.Sc. degree in telecommunication technologies engineering and the M.Sc. degree in telecommunication engineering from the University of Seville, Seville, Spain, in 2019 and 2022, respectively. He is currently pursuing the Ph.D. degree in physical sciences and technologies with the Institute of Microelectronics of Seville, Spanish National Research Council (IMSE-CNM-CSIC), Seville, and the University of Seville. His research interests include neural networks, neuromorphic systems, memristors, field-programmable gate arrays (FPGAs), and application-specific integrated circuit (ASIC) design.

Alejandro Morán (Student Member, IEEE) received the B.Sc. degree in physics from the University of the Balearic Islands (UIB), Palma, Spain, in 2016, the M.Sc. degree in physics of complex systems from the Institute for Cross-Disciplinary Physics and Complex Systems (CSIC-UIB), UIB, in 2017, and the Ph.D. degree from UIB in 2022. He is currently a Teaching Assistant with the Industrial Engineering and Construction Department and a Researcher with the Electronic Engineering Group, UIB. His research interests include machine learning in general and machine learning hardware based on unconventional computing techniques, neuromorphic architectures, embedded systems, and field-programmable gate arrays (FPGAs).
Erik S. Skibinsky-Gitlin (Member, IEEE) received the B.Sc. degree in physics from the University of Salamanca, Salamanca, Spain, in 2016, and the M.Sc. and Ph.D. degrees in physics from the University of Granada, Granada, Spain, in 2017 and 2021, respectively. He is currently a contracted Research Scientist with the Electronic Engineering Group, University of the Balearic Islands, Palma, Spain. His research interests include machine learning based on unconventional computing techniques, neuromorphic hardware, embedded systems, and field-programmable gate arrays (FPGAs).

Joan Font-Rosselló received the Telecommunication Engineering degree from the Polytechnic University of Catalonia (UPC), Barcelona, Spain, in 1994, and the Ph.D. degree from the University of the Balearic Islands (UIB), Palma, Spain, and UPC in 2009. He is currently an Associate Professor in electronic technology with the Industrial Engineering and Construction Department and a Researcher with the Electronic Engineering Group, UIB. He has worked on oscillation-based predictive testing and neural networks. His current work focuses on nonconventional neural networks and neuromorphic hardware.

Miquel Roca (Member, IEEE) received the B.Sc. degree in physics and the Ph.D. degree from the University of the Balearic Islands, Palma, Spain, in 1990. After a research period at the Electronic Engineering Department, Polytechnic University of Catalonia, Barcelona, Spain, and a research stage at the Department of Electrical Engineering and Computer Science, INSA, Toulouse, France, he obtained a post of Associate Professor at the University of the Balearic Islands, where he is currently a Full Professor with the Electronic Engineering Research Group and the Head of the Industrial Engineering and Construction Department. He has worked on microelectronic design and test and on radiation dosimeter design. His current research interests include neural network-based systems, neuromorphic hardware based on field-programmable gate arrays (FPGAs), and neural network applications.

Teresa Serrano-Gotarredona received the B.S. degree in electronics physics and the Ph.D. degree in VLSI neural categorizers from the University of Seville, Seville, Spain, in 1992 and 1996, respectively, and the M.Sc. degree from the Department of Electrical and Computer Engineering, Johns Hopkins University, Baltimore, MD, USA, in 1997. She is currently a tenured Researcher at the Seville Microelectronics Institute (IMSE-CNM-CSIC), Seville, Spain. She is also a part-time Professor at the University of Seville, Seville. Her research interests include analog circuit design of linear and nonlinear circuits, VLSI neural-based pattern recognition systems, VLSI implementations of neural computing and sensory systems, transistor parameter mismatch characterization, bioinspired circuits, nanoscale memristor-type address event representation (AER), and real-time vision sensing and processing chips. Dr. Serrano-Gotarredona has served as the Chair for the Sensory Systems Technical Committee of the IEEE Circuits and Systems Society and the IEEE Circuits and Systems Spain Chapter. She was an Academic Editor of PLOS ONE and an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS. She is serving as an Associate Editor for Frontiers in Neuromorphic Engineering and a Senior Editor of the IEEE JOURNAL ON EMERGING TECHNOLOGIES ON CIRCUITS AND SYSTEMS.

Josep L. Rosselló (Member, IEEE) received the Ph.D. degree in physics from the University of the Balearic Islands (UIB), Palma, Spain, in 2002. He has been a Full Professor of electronic technology with the Industrial Engineering and Construction Department, UIB, since 2021. He is currently the Principal Investigator of the Electronic Engineering Group, Industrial Engineering and Construction Department, UIB. His current research interests include neuromorphic hardware, edge computing, stochastic computing, and high-performance data mining for drug discovery. Dr. Rosselló also serves as an AI consultant and developer for different technological companies and is part of the Organizing Committee of several conferences, such as the Power and Timing Modeling Optimization and Simulation Conference and the International Joint Conference on Neural Networks.