Sabo Gal 2019
Abstract—Recent advancements in deep learning present new opportunities for enhanced scientific methods, autonomous operations, and intelligent applications for space missions. Semantic segmentation is a powerful computer-vision process using convolutional neural networks (CNNs) to classify objects within an image. Semantic segmentation has numerous space-science and defense applications, from semantic labeling of Earth observations for insights about our changing planet, to monitoring natural disasters for damage control, to gathering intelligence for national defense and security. Despite these advantages, CNNs can be computationally expensive and prohibited on traditional radiation-hardened space processors, which are often generations behind their commercial-off-the-shelf counterparts in terms of performance and energy-efficiency. FPGA-based hybrid System-on-Chips (SoCs), which combine fixed-logic CPUs with reconfigurable-logic FPGAs, present numerous architectural advantages well-suited to address the computational capabilities required for high-performance, intelligent spacecraft. To enable semantic segmentation for on-board space processing, we propose a hybrid (hardware/software partitioned) approach using our reconfigurable CNN accelerator (ReCoN) for accelerating CNN inference on hybrid SoCs. When evaluated on the Xilinx Zynq SoC and Xilinx Zynq UltraScale+ MPSoC platforms, our hybrid approach demonstrates an improvement in performance and energy-efficiency of up to two orders of magnitude compared to a software-only baseline on the hybrid SoC. Furthermore, fault injection and wide-spectrum neutron beam-testing were performed to characterize the ReCoN architectural response to injected errors and its susceptibility to neutron irradiation.

Keywords—Deep Learning; Convolutional Neural Networks; Semantic Segmentation; Hybrid System-on-Chip; Hybrid Space Computing; Fault Injection; Radiation-beam Testing

I. INTRODUCTION

Recent advancements in deep learning present new opportunities to enhance scientific methods, autonomous operations, and intelligent applications for space missions. The National Academies' Space Studies Board (SSB) issued a report for the 2017-2027 decadal strategy on Earth science and applications from space, providing recommendations for NASA, NOAA, and USGS. In their survey, the SSB highlighted the need for advanced methodologies to analyze and convert data from Earth observations (EO) into scientific knowledge [1]. The SSB also identified machine learning as a scientific and technological opportunity to extend the reach of Earth science through more efficient uses of limited resources. Semantic segmentation is a deep-learning algorithm, based on convolutional neural networks (CNNs), that learns to infer dense labels for every pixel of an image. Semantic segmentation has numerous space applications, from semantic labeling of Earth's features for insights about our changing planet, to monitoring natural disasters, to gathering intelligence for national security.

Due to ongoing innovations in both sensor technology and spacecraft autonomy, on-board space processing continues to be outpaced by the computational demands required for future missions. The application of deep-learning concepts for on-board processing can enable spacecraft to efficiently process immense volumes of raw sensor data into actionable data to overcome limitations in downlink communication. However, spacecraft designers are challenged to create high-performance, intelligent space computers subject to unique requirements, with stringent constraints in size, weight, power, and cost (SWaP-C), and unique hazards, including radiation, thermal, vibration, and vacuum. Spacecraft often employ radiation-hardened (rad-hard) processors to satisfy reliability constraints and overcome space-radiation challenges. However, rad-hard processors are often generations behind their commercial-off-the-shelf (COTS) counterparts, which tend to offer superior performance and energy-efficiency but are more susceptible to space radiation. Despite the high applicability of deep learning for spaceflight, advanced deep-learning algorithms, such as semantic segmentation, are computationally expensive and prohibited on traditional rad-hard processors. Currently, the application of deep learning for space missions relies on high-performance computing (HPC) resources, such as GPU clusters, to analyze downlinked data.

Small satellites (SmallSats) and CubeSats are small form-factor spacecraft emerging as high-risk, low-cost platforms enabled by the miniaturization of electronics, sensors, and instruments. In their 2016 report, the SSB identified CubeSats as a disruptive innovation for space-science technology and concluded that CubeSat missions were already meeting
(STP-H5-CSP) and STP-H6 Spacecraft Supercomputing for Image and Video Processing (STP-H6-SSIVP) [9] experiments. CSPv1 was flown on the NASA CeREs heliophysics-science CubeSat and will be featured on the Lockheed-Martin LunIR lunar-flyby CubeSat, the NASA Mass Spectrometer observing lunar operations (MSolo) instrument, and several other planned missions. Other space computers based on the Zynq7 devices include Innoflight's Compact Flight Computer (CFC-300), GomSpace's Nanomind Z7000, and Xiphos' Q7. Space computers based on the ZynqMP devices include Innoflight's Compact Heterogeneous-processor Array for Multi-Parametric Sensing (CHAMPS), and Xiphos' Q8.

In [10], a framework was developed for analyzing potential processor architectures for on-board space computing. Using this framework, the computational density (CD), measured in giga operations per second (GOPS), and computational density per Watt (CD/W) were calculated for the Z7020 and several state-of-the-art rad-hard processors. In this comparison, the Z7020 demonstrated significant improvements versus the rad-hard processors in both metrics. Due to the immense computational demands of deep-learning algorithms, which are not achievable by currently available rad-hard processors, semantic segmentation is only viable on space platforms by leveraging the performance benefits of SoCs and relying on fault-tolerant mitigation techniques for radiation effects.

B. Radiation Effects

In the near-Earth space environment, radiation sources include galactic cosmic rays, solar particle events, and charged particles trapped within the Van Allen radiation belts. Radiation presents numerous challenges for electronic devices in space [11]. Radiation effects include long-term cumulative effects, such as total ionizing dose and displacement damage dose, and transient single-event effects (SEEs). Non-destructive SEEs include upsets, transients, and functional interrupts. These effects are extensively covered in [12]. Radiation-beam testing is often exercised to characterize device susceptibility to radiation, and to determine whether the device is suitable for the space radiation environment.

C. Convolutional Neural Networks (CNNs)

Convolutional neural networks (CNNs) have become increasingly popular in the computer-vision community for classification, detection, localization, and segmentation applications. CNNs are a form of classical supervised learning algorithms with a feed-forward process for inference and a backpropagation process for training [13]. CNNs typically contain convolutional, activation, pooling, and fully connected layers. Convolutional layers are used for extracting features in the input and producing feature maps for subsequent layers. Each convolution operation contains a set of learnable weights, the kernel and bias, which are formulated during training. The initial convolutional layers detect low-level features (e.g., lines, corners, etc.) and the deeper layers extract more complex structures and patterns. Activation layers are used to introduce nonlinearity into the network to allow for the approximation of nonlinear patterns. Examples of activation functions include sigmoid, tanh, and rectified linear unit (ReLU), with ReLU often preferred for faster training [14]. Pooling layers are used to downsample the spatial resolution of the input to reduce the number of parameters and amount of processing. Examples of pooling functions include max-pooling and average-pooling. The fully connected layer, often at the end of the CNN, performs classification and maps features extracted from previous layers into an output vector of classes. The arguments of the maxima (argmax) specify the most probable classification of the input and a class label is assigned. CNNs may append a softmax layer to convert the output vector into a discrete probability distribution vector specifying the confidence of the classification. Batch normalization (batch-norm) is another layer that may be inserted between convolutional and activation layers to accelerate training by normalizing and scaling the inputs to reduce the covariate shift [15].

D. Semantic Segmentation

Semantic segmentation is a computer-vision process that learns to label each pixel of an image, where pixels with the same label share semantic characteristics. We selected SegNet [16] as the baseline model for evaluating our hybrid approach. SegNet uses pooling indices obtained from max-pooling layers for upsampling feature maps in max-unpooling layers, removing the need for fully connected layers. As a result, the SegNet model substantially reduces the number of weights and functions to be accelerated, which is desirable for resource-constrained systems. The SegNet model has been applied to semantic segmentation of EO imagery in [17], which we leverage as a case study to facilitate the evaluation of our hybrid approach.

Figure 1. SegNet model. (Encoder-decoder network with pooling indices; layer types: convolution + batch normalization + ReLU, pool, unpool, softmax.)

SegNet uses an encoder-decoder network architecture, as illustrated in Figure 1. SegNet is symmetrical and contains five encoder and decoder blocks, each with two or three convolutional layers followed by batch-norm and a ReLU operation. Each encoder block is followed by a max-pooling layer which produces two outputs: discretized feature-maps and pooling indices. Each decoder block begins with a
max-unpooling layer which uses the pooling indices of the corresponding encoder block to upsample smaller feature-maps back to the original spatial resolution. An optional softmax layer can be appended at the end of the network to convert the output volume of the final convolution layer into a volume where each pixel contains decimal probabilities about its classification. The argmax of the output layer can also be used to assign the most probable label for each pixel.

[Figure: ReCoN system architecture — software side (ReCoN application, configuration manager, libaccel, CRAM scrubber, network definition, trained weights, partial bitstreams, DMA buffers) connected over AXI-MM/AXIS to the FPGA side (scatter-gather DMA, TMR-protected peripherals, and a partial-reconfiguration region hosting the reconfigurable CNN accelerator, ReCoN).]

E. Related Works

[…]
Figure 3. ReCoN4 accelerator functions: (a) convolution, (b) vector sum, (c) batch-norm and ReLU, (d) max-pooling, and (e) max-unpooling.

Figure 4. 4×4 convolution function.

For the remainder of this article, we use the subscript notation (ReCoN_N) to denote that the ReCoN accelerator is scaled by a factor of N. The quantization parameter specifies the data-type representation used by ReCoN, which includes single-precision floating-point or arbitrary-precision fixed-point. Arbitrary-precision fixed-point provides substantial improvements in area for a minimal trade-off in inference accuracy. Quantization optimizations are discussed later in this section. ReCoN is generated using Vivado High-Level Synthesis (HLS), which is a high-productivity tool for translating synthesizable functions written in high-level programming languages (such as C or C++) into a register-transfer level (RTL) representation for FPGAs. Using Vivado HLS, numerous compiler directives are available to further tune ReCoN, such as trading between resource sharing, improved timing, and area.

ReCoN is also designed to support numerous run-time parameters to accommodate various network shapes and trained weights. […] packets. The packet structure includes three sections: an accelerator header, with arguments for input resolution and accelerator function; a function-specific section, with trained weights for convolution and batch-norm functions only; and finally the data section containing feature maps.

2) Accelerator Functions: ReCoN consolidates multiple functions into one accelerator, with each one having equal data-widths for both input and output streaming interfaces. ReCoN_N includes:
• One N×N convolution function with N single-pixel input and N single-pixel output channels
• N/2 vector-sum functions, each with two single-pixel input and one double-pixel output channels
• N batch-norm and ReLU functions, each with one single-pixel input and one single-pixel output channel
• N/2 max-pool functions, each with one double-pixel input and two single-pixel output channels
• N/2 max-unpool functions, each with two single-pixel input and one double-pixel output channels
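To make the max-pool/max-unpool pairing concrete, the following is a minimal software model (a sketch, not the ReCoN HLS implementation) of a 2×2 max-pooling that emits both maxima and pooling indices, and of the max-unpooling that scatters the maxima back to their recorded positions:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Simplified software model of SegNet-style 2x2 max-pooling:
// emits the maxima and the flat index of each maximum so that
// max-unpooling can later restore values to their original positions.
std::pair<std::vector<float>, std::vector<std::size_t>>
max_pool_2x2(const std::vector<float>& in, std::size_t h, std::size_t w) {
    std::vector<float> maxima;
    std::vector<std::size_t> indices;
    for (std::size_t r = 0; r < h; r += 2) {
        for (std::size_t c = 0; c < w; c += 2) {
            std::size_t best = r * w + c;
            for (std::size_t dr = 0; dr < 2; ++dr)
                for (std::size_t dc = 0; dc < 2; ++dc) {
                    std::size_t idx = (r + dr) * w + (c + dc);
                    if (in[idx] > in[best]) best = idx;
                }
            maxima.push_back(in[best]);
            indices.push_back(best);
        }
    }
    return {maxima, indices};
}

// Max-unpooling: scatter each pooled maximum back to its recorded
// index of the (h x w) upsampled map; all other positions stay zero.
std::vector<float> max_unpool_2x2(const std::vector<float>& maxima,
                                  const std::vector<std::size_t>& indices,
                                  std::size_t h, std::size_t w) {
    std::vector<float> out(h * w, 0.0f);
    for (std::size_t i = 0; i < maxima.size(); ++i)
        out[indices[i]] = maxima[i];
    return out;
}
```

For an H×W input, the pooled maps are half the size in each dimension; the indices stream is what allows SegNet's decoder to upsample without fully connected layers.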
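The arbitrary-precision fixed-point representation discussed above can be modeled in software as follows. This is an illustrative sketch of the Q9.16 format (9 signed-integer bits, 16 fractional bits) rather than ReCoN's hardware implementation, and round-to-nearest quantization is an assumption:

```cpp
#include <cmath>
#include <cstdint>

// Illustrative Q9.16 model: values are scaled by 2^16 and stored in
// 32-bit integers for simplicity (the hardware uses the 25-bit DSP
// operand). Round-to-nearest on quantization is an assumption.
constexpr int kFracBits = 16;

int32_t quantize_q9_16(float x) {
    return static_cast<int32_t>(std::lround(x * (1 << kFracBits)));
}

float dequantize_q9_16(int32_t q) {
    return static_cast<float>(q) / (1 << kFracBits);
}

// Fixed-point multiply: the 64-bit product carries 32 fractional
// bits and must be shifted back down to 16.
int32_t mul_q9_16(int32_t a, int32_t b) {
    return static_cast<int32_t>((static_cast<int64_t>(a) * b) >> kFracBits);
}
```

Quantizing a value such as 0.1 yields 6554/65536 ≈ 0.100006, illustrating the small per-value precision error that can accumulate across layers.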
3) Convolutional Layer: Each convolutional layer converts an input volume (H×W×Din) into an output volume (H×W×Dout), with Din×Dout convolutions required to produce Din×Dout intermediate convolution outputs. Next, each output dimension (of Dout) requires Din−1 vector sums to add Din convolution outputs and produce one complete output, so each convolutional layer also requires Dout×(Din−1) vector sums.

The convolution function contains N×N convolutions, operates on N inputs, and produces N outputs, as illustrated in Figures 3(a) and 4. The convolution function reuses each of the N input channels across N convolutions, each with a different set of trained weights (convolution kernel and bias), to produce a total of N² intermediate outputs. For each of the N output channels, the intermediate outputs are added to produce N partial vector-sums. Next, the vector-sum function, as illustrated in Figure 3(b), is used to add all partial vector-sums to produce the complete convolutional layer output. Since each convolutional layer requires Din×Dout convolutions and Dout×(Din−1) vector sums, the convolution and vector-sum functions must be invoked (Din/N)×(Dout/N) and 2(Dout/N)×((Din/N)−1) times, respectively, to process one convolutional layer.

The N×N convolution function design has a quadratic relationship between the number of channels and the number of convolutions. When N doubles, the number of channels doubles and the number of convolutions quadruples, or equivalently, the number of invocations is quartered. Therefore, when N doubles, the processing capability of the convolution function improves by a factor of four. However, this improvement only holds if the interconnect bandwidth can satisfy the doubled bandwidth requirement.

4) Batch Normalization and ReLU Layers: The batch-norm and ReLU operations each convert an input volume (H×W×D) into an output volume (H×W×D). Both operations are merged into one function, as illustrated in Figure 3(c), because both have the same access-pattern and one always precedes the other. This function requires four additional weights: running mean (E[x]), running variance (Var[x]), scale (γ), and shift (β), for the batch-norm operation. Since the batch-norm and ReLU function depends on a single dimension of the input volume, the N instances of the function can run in parallel to use the full input and output interconnect bandwidths. This function requires D/N invocations to perform all operations.

5) Pooling Layers: The pooling layer performs max-pooling with a filter size of 2×2 and converts an input volume (H×W×D) into two output volumes: the maxima (H/2×W/2×D) and the pooling indices (H/2×W/2×D). Max-pooling quarters the input spatial-resolution and produces a combined output volume that is half the input volume. The max-pooling function, as illustrated in Figure 3(d), contains N/2 instances of the max-pool operation. Each max-pool operation converts one input (as two channels packed into the input stream) into two outputs. Collectively, the max-pool function uses the full input bandwidth but can only use half the output bandwidth because the output stream size is half the input stream size.

6) Unpooling Layers: The unpooling layer performs max-unpooling with a filter size of 2×2 and converts two input volumes, feature maps (H×W×D) and indices (H×W×D), into one output volume (2H×2W×D). Max-unpooling quadruples the input spatial-resolution and produces an output volume that is double the combined input volume. The max-unpooling function, as illustrated in Figure 3(e), contains N/2 instances of the max-unpool operation. Each max-unpool operation converts two inputs (feature maps and pooling indices) into one output (as two channels packed into the output stream). Collectively, the max-unpool function uses the full output bandwidth but can only use half the input bandwidth because the input stream size is half the output stream size.

7) Quantization Optimizations: ReCoN supports two data-type representations: single-precision floating-point and arbitrary-precision fixed-point. When configured to use floating-point, the ReCoN output is identical to the software output. However, floating-point arithmetic incurs a high area overhead and may require resource sharing to fit ReCoN into resource-constrained FPGAs at the cost of decreased performance. To improve performance, quantization can be used to constrain feature maps and trained weights to arbitrary-precision fixed-point values. Generally, fixed-point arithmetic is substantially more area-efficient than floating-point arithmetic, and can allow for a greater scaling factor. Quantization also has the benefit of reducing the storage size of trained weights, which is useful for resource-constrained systems. However, due to the loss of precision in arbitrary-precision fixed-point, the ReCoN output may deviate slightly from the software output as the precision error accumulates. Nevertheless, the average error is negligible and may justify the trade-off for area, performance, and energy-efficiency benefits.

The digital signal processing (DSP) slices in the Zynq7 (DSP48E1) feature 25-bit×18-bit multipliers. ReCoN uses the 25-bit operand for feature maps using the Q9.16 (9 signed-integer bits and 16 fractional bits) fixed-point format and the 18-bit operand for trained weights. The fixed-point format for trained weights varies by type (e.g., convolutional weights, convolutional bias, etc.) and is selected by fitting the minima and maxima of each type to an arbitrary-precision fixed-point format that maximizes precision. In the ZynqMP, the DSP slices (DSP48E2) feature 27-bit×18-bit multipliers and can use the Q9.18 fixed-point format for a slight improvement in precision.

8) Scatter-Gather Streaming Data-Flow Optimizations: Although ReCoN will accelerate the processing of CNN layers, the communication cost associated with streaming data between the CPU and FPGA subsystems is also essential for
reducing the overall execution time. We developed a scatter-gather DMA (SGDMA) to facilitate the parallel streaming of multi-dimensional data through ReCoN. The SGDMA provides three major functions: scatter-gather streaming, stream-size parameterization, and decoupling logic for PR. The SGDMA is full-duplex and converts an AXI interface into two AXI-Stream interfaces: one for DMA-to-accelerator (D2A) streaming and one for accelerator-to-DMA (A2D) streaming. For each direction, the SGDMA supports three run-time parameters: the number of channels to use, the data-width of the channels, and the length of the stream. The stream-size parameterization capability allows the SGDMA to interface to each accelerator function in ReCoN.

The scatter-gather streaming data-flow provided by the SGDMA has the advantage of creating an interleaving architecture. During a scatter-gather transfer, the SGDMA will rotate between AXI descriptors and complete one AXI burst transfer per channel before proceeding to the next one. Each AXI descriptor points to a DMA buffer, allowing the SGDMA to access multiple DMA buffers. Since the SGDMA effectively rotates between DMA buffers, the scatter-gather flow will seamlessly interleave buffer data in the D2A direction and deinterleave stream data in the A2D direction. The principal benefit is that multi-dimensional data can remain deinterleaved in DMA buffers, and the SGDMA will automatically perform the data-interleaving preprocess and data-deinterleaving postprocess in hardware. As a result, the SGDMA completely eliminates the overhead of software memory interleaving and deinterleaving.

Furthermore, since all accelerator functions operate on data streams, the AXI descriptors can be configured to reuse DMA buffers to perform the accelerator functions in-place. This access pattern has the advantage of reducing the memory overhead required for DMA buffers and significantly reduces the amount of software memory copies. The only software memory copies required are those that specify the header and function-specific section of the stream packet, which are negligible in terms of size compared to the data.

C. ReCoN Control-Software

The ReCoN control-software provides the control-flow operations required for hybrid semantic segmentation. The control software allocates DMA buffers, loads input image data and trained weights, and invokes the SGDMA to asynchronously stream buffer data through ReCoN.

The control software is parameterizable to support arbitrary image volumes (spatial-resolution and dimension) and network shapes of the SegNet model to accommodate various space applications and imaging sensors (e.g., multispectral, hyperspectral, etc.). When initialized, the control software references two resources: the network definition, which specifies the network shape and the arrangement of layers, and the corresponding trained weights. Both resources are obtained after network development (testing, analysis, and training) and are uploaded to the spacecraft for deployment. For training, the dataset can be constructed using downlinked sensor data or approximated by using or modifying existing datasets.

IV. EVALUATION

To evaluate our hybrid approach for semantic segmentation, we experimentally recorded accuracy, resource utilization, performance, and energy-efficiency metrics by running our hybrid architecture on two hardware platforms at various configurations. In this section, we describe our experimental setup, target platforms, and application case study, and analyze our results.

A. Platforms

Our framework was realized on two hardware platforms: the Xilinx ZC706 (Z7045) and the Xilinx ZCU102 (ZU9EG). The system specifications for these platforms are detailed in Table I [6], [7]. For both platforms, Vivado 2018.2 was used to synthesize ReCoN and generate system bitstreams (using default synthesis and implementation settings), and Petalinux 2018.2 was used to deploy an embedded Linux operating system.

Table I
EVALUATION PLATFORMS.

                        Xilinx ZC706 (Z7045)          Xilinx ZCU102 (ZU9EG)
Processing System (PS)
CPU                     ARM Cortex-A9 (dual-core)     ARM Cortex-A53 (quad-core)
L1 cache                32KB/32KB I/D per core        32KB/32KB I/D per core
L2 cache                512KB unified                 1MB unified
Frequency               667MHz                        1.2GHz
Programmable Logic (PL)
FPGA                    Kintex 7 (28 nm)              UltraScale architecture (16 nm)
LUTs                    218600                        274080
FFs                     437200                        548160
BRAM                    545                           912
DSPs                    900                           2520
Frequency               100MHz/200MHz                 100MHz/300MHz
PS-PL Interface
Interface               AXI3 (64-bit/16-beat burst)   AXI4 (128-bit/256-beat burst)
Acceleration Framework
DMA                     8-channel SGDMA               8-channel SGDMA
Accelerator             ReCoN2/2-TMR/4/8              ReCoN2/2-TMR/4/8
Quantization            fixed-point (Q9.16)           fixed-point (Q9.18)

For our semantic segmentation application, we selected the Potsdam dataset from the ISPRS commission II/4 benchmark for 2D semantic labeling [23]. This dataset provides EO imagery in RGB (red-green-blue) and IRRG (infrared-red-green) formats, with ground-truth labels including six classes: roads, buildings, low vegetation, trees, automobiles, and clutter. We resized the dataset to 512×512 images, and then partitioned this dataset into 70% for training and 30% for testing. We trained three different network shapes: Net (86 layers, 7376806 weights), Net1⁄2 (86 layers,
1849814 weights), and Net1⁄4 (86 layers, 465262 weights), where Net1⁄2 and Net1⁄4 halve or quarter the dimension of each layer in Net, respectively. We use the single-threaded, software-only results as the baseline for our comparisons.

B. Accuracy

In the context of semantic segmentation, accuracy refers to the rate at which pixels of an image are assigned the correct label. Accuracy depends on several factors (e.g., network shape, training method, dataset, etc.). Using the test set, we calculated the accuracy for all three sample networks for each image format (RGB and IRRG), as shown in Table II. As noted previously, the floating-point version produces an output identical to the software version, but the fixed-point version has minor deviations due to accumulation of precision error. The average error for ZynqMP platforms using the Q9.18 format is slightly improved compared to Zynq7 platforms using the Q9.16 format.

Table II
INFERENCE ACCURACY.

Inference Accuracy/Error              Net       Net1⁄2    Net1⁄4
Inference Accuracy (RGB)              90.17%    89.63%    88.30%
Inference Accuracy (IRRG)             90.00%    89.95%    88.92%
Accelerator Error (floating-point)    0.00%     0.00%     0.00%
Accelerator Error (Q9.16)             0.73%     0.40%     0.30%
Accelerator Error (Q9.18)             0.72%     0.39%     0.29%

C. Resource Utilization

The resource utilization of the HARFT framework and ReCoN are separately shown in Table III. These numbers were obtained using the Vivado design tools post-implementation with default synthesis and implementation settings. When configured for efficient quantization, the scalability of ReCoN is bounded by the number of DSP slices available in the FPGA. ReCoN_N requires 9N² + 2N DSP slices. The ZC706 and ZCU102 platforms provide enough DSP slices to support the 8-channel SGDMA and ReCoN8 accelerator. Our CSPv1 space computer (Z7020) provides enough DSP slices for the 4-channel SGDMA and ReCoN4 accelerator.

Table III
RESOURCE UTILIZATION.

Xilinx ZC706 (Z7045)
Subsystem      Slices (218600)   FFs (437200)   BRAM (545)   DSPs (900)
Framework      3.87%             1.14%          6.33%        0.00%
ReCoN2         1.62%             2.03%          1.84%        4.44%
ReCoN2-TMR     6.78%             6.05%          5.50%        13.33%
ReCoN4         3.81%             5.94%          3.49%        16.89%
ReCoN8         12.02%            20.39%         6.79%        65.78%

Xilinx ZCU102 (ZU9EG)
Subsystem      Slices (274080)   FFs (548160)   BRAM (912)   DSPs (2520)
Framework      4.23%             1.00%          7.57%        0.04%
ReCoN2         0.93%             1.24%          1.09%        1.59%
ReCoN2-TMR     4.59%             3.70%          3.29%        4.76%
ReCoN4         2.14%             3.57%          2.08%        6.03%
ReCoN8         6.45%             11.64%         4.05%        23.49%

D. Performance

For performance, the average execution times were measured for several configurations of the software and hybrid versions of the semantic segmentation application. The software-only version was compiled using GCC with O2 optimizations and NEON single-instruction, multiple-data intrinsics enabled. For multi-threading, OpenMP, an application programming interface for shared-memory multiprocessing, was used to parallelize all CNN functions. For the hybrid version, the performance was measured for varied scaling factors and FPGA operating frequencies, as detailed in Table I. Table IV lists the execution times and the performance improvements compared to the baseline. In all situations, the hybrid version outperforms the software version by up to two orders of magnitude, depending on the network and system configuration.

E. Energy Efficiency

Power and energy consumption are essential metrics for space systems. For a fair comparison, the FPGA was not programmed when running the software versions, to assume a CPU-only system. Using a power meter, the overall system power was measured when idle (7.13W for the ZC706 and 21.60W for the ZCU102) and when actively processing the application. The dynamic-power and dynamic-energy consumption can be calculated using the following equations:

Dynamic Power = Active Power − Idle Power

Dynamic Energy = Execution Time × Dynamic Power

Table IV lists the dynamic-power and dynamic-energy consumptions, and the energy-efficiency improvements compared to the baseline. Although the hybrid versions often have a higher peak power-consumption, the reduced execution times result in significant improvements in overall dynamic-energy consumption, up to two orders of magnitude compared to the baseline. To accommodate space applications with stricter power requirements, the FPGA operating frequency and ReCoN configuration can be reduced at the cost of decreased performance.

V. RELIABILITY EXPERIMENTS

This section describes the fault-injection and radiation-beam experiments performed to analyze the architectural response of ReCoN to both injected and radiation-induced errors. The objective of these experiments was to analyze the vulnerability of two designs: the simplex, quad-channel ReCoN4 and the TMR, dual-channel ReCoN2-TMR. Both accelerators have similar resource utilizations but introduce a trade-off
Table IV
P ERFORMANCE AND ENERGY-E FFICIENCY.
in performance and reliability. For both experiments, we executions until an erroneous output is expected, capturing
used a reconfigurable-system design, with TMR-protected 4- the trade-off between performance and reliability [27]. AVF
channel SGDMA and CRAM scrubber residing in the static and MWTF are calculated as follows:
region, and either ReCoN4 or ReCoN2-TMR residing in the
Number of Erroneous Executions
PRR. The BL-TMR tool, a highly user-configurable tool for AVF =
selective replication of FPGA designs, was used to apply Number of Fault Injections
low-level TMR [24]. The SGDMA was modified to compute Number of Correct Executions
MWTF =
XOR-based checksums on both the D2A and A2D streams. Number of Erroneous Executions
Since the execution of the SegNet model is deterministic, the To avoid modifying the design, the Processor Configura-
output and intermediate checksums can be compared against tion Access Port (PCAP) is used for injecting faults into the
golden checksums to determine the execution outcome and CRAM. The CRAM scrubber is inactive for this experiment
to identify exactly which layers were affected due to injected because the fault-injection procedure is controlled (i.e., one
or radiation-induced errors, respectively. error per iteration). In our fault-injection procedure, each
iteration begins with the random selection of a layer and
A. Fault Injection CRAM bit location (frame address, word, and bit). Next,
Fault injection was performed to observe the architectural the application is executed until it reaches the randomly
response of each design to errors injected into CRAM. selected layer, where the execution is halted, the fault is
For this experiment, the objective was to measure the injected via the PCAP, and the execution is resumed. When
architectural vulnerability factor (AVF) and mean-work-to- the randomly selected layer is complete, the execution is
failure (MWTF) metrics for each design, and the tolerance of halted, the fault is repaired, and the execution is resumed
erroneous outputs. In this context, the AVF is the probability until completion. The error is restricted to the execution of
that an injected error will manifest into an erroneous output the selected layer to focus on the vulnerability of that layer,
[26], and MWTF describes the average number of correct as well as to represent the behavior of the CRAM scrubber
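The injection loop and both metrics can be sketched as follows; the layer names, CRAM address ranges, and the `execute` stand-in are illustrative assumptions, not the actual DUT harness:

```python
import random

# Illustrative layer list for a SegNet-style model (names are assumptions).
LAYERS = [f"conv{i}" for i in range(1, 9)]

def select_fault():
    """Randomly select a target layer and a CRAM bit location
    (frame address, word, bit); the address ranges are placeholders."""
    layer = random.choice(LAYERS)
    location = (random.randrange(2**17),  # frame address
                random.randrange(101),    # word within frame
                random.randrange(32))     # bit within word
    return layer, location

def run_campaign(n_injections, execute):
    """One campaign: execute(layer, location) stands in for the
    halt/inject/resume/repair iteration and returns True on a correct run."""
    errors = sum(0 if execute(*select_fault()) else 1
                 for _ in range(n_injections))
    avf = errors / n_injections                # erroneous / injections
    mwtf = ((n_injections - errors) / errors   # correct / erroneous
            if errors else float("inf"))
    return avf, mwtf
```

For example, a deterministic stub for `execute` that fails once every ten runs yields AVF = 0.1 and MWTF = 9.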
Table V
FAULT INJECTION AND WIDE-SPECTRUM NEUTRON-BEAM TEST RESULTS.
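The cross-sections and confidence intervals reported in Table V follow from error counts and effective fluence; below is a minimal sketch with hypothetical counts, using a Gaussian approximation to the Poisson interval in place of the exact method of [25]:

```python
import math

def cross_section(errors, fluence):
    """Cross-section (cm^2) = erroneous executions / effective fluence (n/cm^2)."""
    return errors / fluence

def cross_section_ci95(errors, fluence):
    """Approximate 95% confidence interval on the cross-section, using a
    Gaussian approximation (k +/- 1.96*sqrt(k)) to the Poisson count;
    [25] specifies the exact intervals used for the reported results."""
    half = 1.96 * math.sqrt(errors)
    return (errors - half) / fluence, (errors + half) / fluence

# Hypothetical example: 25 erroneous executions over 1e10 n/cm^2.
sigma = cross_section(25, 1e10)            # 2.5e-9 cm^2
low, high = cross_section_ci95(25, 1e10)   # (1.52e-9, 3.48e-9) cm^2
```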
The CRAM scrubber was used to prevent the accumulation of errors in CRAM. However, the radiation beam can expose the DUTs to error modes that cannot be directly compared with our fault-injection procedure (e.g., multi-bit upsets, CPU or memory errors, an overwhelmed scrub-rate, etc.).

In our radiation-beam test procedure, the DUTs continuously ran semantic segmentation on either version of ReCoN. DUT-management software was used to automate the power-cycling of DUTs when it was detected that the DUT had hung (failed to signal a heartbeat before timeout), reported consecutive errors (counted as one error), or detected a failure of the CRAM scrubber. Golden checksums were used to test the execution outcomes (correct, error, or hang), which were recorded with timestamps. The 4FP30R/ICE-II instrument contains a dosimeter that records the integrated neutron flux (above 10 MeV) with timestamps. The neutron fluence (above 10 MeV) can be calculated by integrating the neutron flux over the time interval that the DUT was active. The cross-section and corresponding 95% confidence interval are calculated as specified in [25]:

Cross-section = Number of Erroneous Executions / Effective Fluence

Figure 6. Experimental setup at LANSCE 4FP30R/ICE-II.

For the ZedBoard DUTs, the DDR memory was configured with ECC enabled and the unified L2 caches were disabled to prevent the high neutron-flux from overwhelming the DUTs and to minimize CPU-related errors and failures. For the UltraZed-EG DUTs, the DDR memory was also configured with ECC enabled, but the caches were kept enabled because the ZynqMP CPU demonstrates high resilience to SEUs [30]. The designs were alternated between DUTs, and the recorded fluence was adjusted to account for the distance between the DUT and the beam source. The experimental results are detailed in Table V. For both sets of DUTs, the cross-section and MWTF improvements reaffirm the advantage of ReCoN2-TMR, which reliably executes more inferences than ReCoN4, despite its performance and energy-efficiency trade-offs. The dissimilarity in the cross-section magnitudes between the two sets of DUTs can be attributed to architecture and process-technology differences between them.

VI. CONCLUSIONS

Despite the high applicability of deep learning for spaceflight, deep-learning algorithms such as CNNs are computationally expensive and prohibitive on traditional rad-hard processors. Commercial hybrid SoCs present numerous architectural advantages that address on-board processing challenges. However, effective use of both the CPU and FPGA subsystems is required to reliably maximize the benefits provided by the hybrid architecture.

In this article, we introduced our hybrid approach for semantic segmentation on hybrid SoCs. When evaluated on the Xilinx Zynq SoC and Xilinx Zynq UltraScale+ MPSoC platforms, our hybrid approach demonstrates an improvement in performance and energy-efficiency of up to two orders of magnitude compared to a software-only baseline on the hybrid SoC. Due to its significant performance speedup and reduced energy consumption, our hybrid approach can be an enabling technology for applying semantic segmentation and other CNN algorithms to future space missions. For future work, we will investigate new optimizations in ReCoN to further enhance performance and energy-efficiency.

Additionally, fault-injection and radiation-beam testing were performed to characterize the architectural response of two versions of ReCoN (one simplex and high-performance, one TMR and lower-performance) to injected and neutron-induced errors. In our CRAM fault-injection experiment, we measured the AVF and MWTF of both designs and identified a pattern in error tolerance across layers of the SegNet model. In our radiation-beam test, we measured the cross-section and MWTF for both designs under wide-spectrum neutron irradiation. These experiments are the basis for future work in adaptive CNNs, which alternate between high-performance and high-reliability versions of ReCoN across layers with varied susceptibilities, to maximize inference performance subject to reliability constraints.

The HARFT SoC reliability framework is currently integrated into the STP-H6-SSIVP experiment on board the ISS [9]. Using EO imagery captured by SSIVP, a dataset can be constructed or approximated, and a CNN based on the SegNet model can be developed and trained using the new dataset. Finally, the network definition, trained weights, and ReCoN (PR bitstream) can be uploaded to flight-qualify hybrid semantic segmentation for on-board processing.

ACKNOWLEDGMENTS

This work was supported by SHREC industry and agency members and by the IUCRC Program of the National Science Foundation under Grant No. CNS-1738783. The authors would like to thank Christopher Wilson, James MacKinnon, and Daniel Sabogal. The authors would also
like to thank the U.S. Department of Energy and LANSCE for the invaluable beam time.

REFERENCES

[1] National Academies of Sciences, Engineering, and Medicine, Thriving on Our Changing Planet: A Decadal Strategy for Earth Observation from Space. Washington, DC: The National Academies Press, 2018. [Online]. Available: https://www.nap.edu/catalog/24938/thriving-on-our-changing-planet-a-decadal-strategy-for-earth

[2] National Academies of Sciences, Engineering, and Medicine, Achieving Science with CubeSats: Thinking Inside the Box. Washington, DC: The National Academies Press, 2016. [Online]. Available: https://www.nap.edu/catalog/23503/achieving-science-with-cubesats-thinking-inside-the-box

[3] DARPA BAA, "Blackjack (BAA HR001118S0032)," 2018.

[4] A. D. George and C. M. Wilson, "Onboard processing with hybrid and reconfigurable computing on small satellites," Proceedings of the IEEE, vol. 106, no. 3, pp. 458–470, March 2018.

[5] S. Sabogal, A. D. George, and G. Crum, "Hybrid semantic image segmentation using deep learning for on-board space processing," in NASA Goddard Workshop on Artificial Intelligence, 2018. [Online]. Available: https://asd.gsfc.nasa.gov/conferences/ai/program/031-NASA-GSFC-AI-Workshop-ssabogal.pdf

[6] Xilinx, Zynq-7000 All Programmable SoC Technical Reference Manual, Xilinx User Guide (UG585), Dec 2017.

[7] Xilinx, Zynq UltraScale+ Device Technical Reference Manual, Xilinx User Guide (UG1085), Dec 2017.

[8] C. Wilson and A. George, "CSP hybrid space computing," Journal of Aerospace Information Systems, vol. 15, no. 4, pp. 215–227, Feb 2018. [Online]. Available: https://doi.org/10.2514/1.I010572

[9] S. Sabogal, P. Gauvin, B. Shea, D. Sabogal, A. Gillette, C. Wilson, A. Barchowsky, A. D. George, G. Crum, and T. Flatley, "SSIVP: Spacecraft supercomputing experiment for STP-H6," in Proceedings of the 31st Annual AIAA/USU Conference on Small Satellites. Logan, UT: AIAA, 2017, pp. 1–12.

[10] T. M. Lovelly and A. D. George, "Comparative analysis of present and future space-grade processors with device metrics," Journal of Aerospace Information Systems, vol. 14, no. 3, pp. 184–197, Mar 2017. [Online]. Available: https://doi.org/10.2514/1.I010472

[11] K. A. LaBel, "Radiation effects on electronics 101," NASA Electronic Parts and Packaging Program (NEPP), Apr 2004.

[12] R. H. Maurer, M. E. Fraeman, M. N. Martin, and D. R. Roth, "Harsh environments: Space radiation environment, effects, and mitigation," Johns Hopkins APL Technical Digest, vol. 28, no. 1, pp. 17–29, 2008.

[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533–536, Oct 1986. [Online]. Available: http://dx.doi.org/10.1038/323533a0

[14] D. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," CoRR, vol. abs/1511.07289, 2015. [Online]. Available: http://arxiv.org/abs/1511.07289

[15] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," CoRR, vol. abs/1502.03167, 2015. [Online]. Available: http://arxiv.org/abs/1502.03167

[16] V. Badrinarayanan, A. Handa, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling," CoRR, vol. abs/1505.07293, 2015. [Online]. Available: http://arxiv.org/abs/1505.07293

[17] N. Audebert, B. L. Saux, and S. Lefèvre, "Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks," ISPRS Journal of Photogrammetry and Remote Sensing, 2017.

[18] K. Abdelouahab, M. Pelcat, J. Serot, and F. Berry, "Accelerating CNN inference on FPGAs: A survey," 2018.

[19] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '15). New York, NY, USA: ACM, 2015, pp. 161–170. [Online]. Available: http://doi.acm.org/10.1145/2684746.2689060

[20] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, Jan 2017.

[21] C. Farabet, B. Martini, B. Corda, P. Akselrod, E. Culurciello, and Y. LeCun, "NeuFlow: A runtime reconfigurable dataflow processor for vision," in 2011 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW 2011), 2011.

[22] C. Wilson, S. Sabogal, A. George, and A. Gordon-Ross, "Hybrid, adaptive, and reconfigurable fault tolerance," in 2017 IEEE Aerospace Conference, March 2017, pp. 1–11.

[23] ISPRS, "2D semantic labeling dataset (Potsdam)," 2018. [Online]. Available: http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-potsdam.html

[24] J. M. Johnson and M. J. Wirthlin, "Voter insertion algorithms for FPGA designs using triple modular redundancy," in Proceedings of the 18th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '10). New York, NY, USA: ACM, 2010, pp. 249–258. [Online]. Available: http://doi.acm.org/10.1145/1723112.1723154

[25] H. Quinn, "Challenges in testing complex systems," IEEE Transactions on Nuclear Science, vol. 61, no. 2, pp. 766–786, April 2014.

[26] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin, "A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor," in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-36), Dec 2003, pp. 29–40.

[27] G. A. Reis, J. Chang, N. Vachharajani, S. S. Mukherjee, R. Rangan, and D. I. August, "Design and evaluation of hybrid fault-detection systems," in 32nd International Symposium on Computer Architecture (ISCA '05), June 2005, pp. 148–159.

[28] R. Le, "Soft error mitigation using prioritized essential bits," Xilinx XAPP538 (v1.0), 2012.

[29] S. F. Nowicki, S. A. Wender, and M. Mocko, "The Los Alamos Neutron Science Center spallation neutron sources," Physics Procedia, vol. 90, pp. 374–380, 2017.

[30] J. D. Anderson, J. C. Leavitt, and M. J. Wirthlin, "Neutron radiation beam results for the Xilinx UltraScale+ MPSoC," in 2018 IEEE Nuclear and Space Radiation Effects Conference (NSREC 2018), July 2018, pp. 1–7.