
LiteCON: An All-photonic Neuromorphic Accelerator for Energy-efficient Deep Learning

Published: 01 September 2022

Abstract

    Deep learning is highly pervasive in today's data-intensive era. In particular, convolutional neural networks (CNNs) are being widely adopted in a variety of fields for superior accuracy. However, computing deep CNNs on traditional CPUs and GPUs brings several performance and energy pitfalls. Several novel approaches based on ASIC, FPGA, and resistive-memory devices have been recently demonstrated with promising results. Most of them target only the inference (testing) phase of deep learning. There have been very limited attempts to design a full-fledged deep learning accelerator capable of both training and inference. It is due to the highly compute- and memory-intensive nature of the training phase. In this article, we propose LiteCON, a novel analog photonics CNN accelerator. LiteCON uses silicon microdisk-based convolution, memristor-based memory, and dense-wavelength-division-multiplexing for energy-efficient and ultrafast deep learning. We evaluate LiteCON using a commercial CAD framework (IPKISS) on deep learning benchmark models including LeNet and VGG-Net. Compared to the state of the art, LiteCON improves the CNN throughput, energy efficiency, and computational efficiency by up to 32×, 37×, and 5×, respectively, with trivial accuracy degradation.

    1 Introduction

Convolutional neural networks (CNNs) have become the go-to solution for a wide range of problems, such as object recognition [1], speech processing, and machine translation. Deep CNN models trained with large datasets are highly relevant and critical to ever-growing cloud services, such as face identification (e.g., Apple iPhoto and Google Picasa) and speech recognition (e.g., Apple Siri and Google Assistant). However, a CNN algorithm involves a huge volume of computationally intensive convolutions. For example, AlexNet [2], a basic CNN model introduced in 2012, requires 724 M floating-point multiply-accumulate (MAC) operations just for inference. Training a CNN generally requires more than 20 times as many floating-point MACs, and this increase, compounded over many training iterations, makes training roughly three orders of magnitude more compute- and memory-intensive than inference [5]. As a result, traditional CPUs and GPUs struggle to achieve high processing throughput per watt [3] for CNN applications. To address this, several FPGA [4] and ASIC [5] approaches have been proposed to accomplish large-scale CNN acceleration.
A CNN comprises two stages: training and inference. Most hardware accelerators for CNNs in the prior literature focus only on the inference stage, while training is done offline using GPUs. However, training a CNN is up to several hundred times more compute- and power-intensive than inference [5]. Moreover, for many applications training is not a one-time activity, especially under changing environmental and system conditions, where re-training the CNN at regular intervals is essential to maintain prediction accuracy over time. This calls for an energy-efficient training accelerator in addition to an inference accelerator.
Training a CNN generally employs the backpropagation algorithm, which demands high memory locality and compute parallelism. Recently, a few resistive-memory (ReRAM or memristor crossbar)-based training accelerators have been demonstrated for CNNs, e.g., ISAAC [5], PipeLayer [6], and RCP [7]. ISAAC and RCP use highly parallel memristor crossbar arrays to address the need for parallel computation in CNNs. In addition, ISAAC uses a very deep pipeline to improve system throughput; however, this is only beneficial when a large number of consecutive images can be fed into the architecture. Unfortunately, during training, only a limited number of consecutive images can often be processed before each weight update, and the deep pipeline in ISAAC also introduces frequent pipeline bubbles. Compared to ISAAC, PipeLayer demonstrates an improved pipeline approach to enhance throughput. However, RCP, DPE [8], ISAAC, and PipeLayer all involve several analog-to-digital (AD) and digital-to-analog (DA) conversions, which become a performance bottleneck in addition to their large power consumption. Also, training in these accelerators involves sequential weight updates from one layer to another, which incurs inter-layer waiting time for synchronization and reduces overall performance. This calls for an analog accelerator that can drastically reduce the number of AD/DA conversions and the inter-layer waiting time. It has recently been demonstrated that a completely analog matrix-vector multiplication is 100× more efficient than its digital counterpart implemented with an ASIC, FPGA, or GPU [8]. Vandoorne et al. [9] have demonstrated a small-scale, efficient recurrent neural network using analog photonic computing, and a few efficient on-chip photonic inference accelerators have also been proposed in References [10, 11, 23]. However, a full-fledged analog deep learning (or, to be precise, CNN) accelerator capable of both training and inference is yet to be demonstrated.
In this article, we propose LiteCON, a novel silicon photonics-based neuromorphic CNN accelerator. It comprises silicon photonic microdisk-based convolution, memristive memory, high-speed photonic waveguides, and analog amplifiers. LiteCON works completely in the analog domain, hence the term neuromorphic (a neuromorphic system is made up of analog components that mimic brain-like behavior, in this case a CNN, an artificial neural network). The low footprint, low-power characteristics, and ultrafast nature of silicon microdisks enhance the efficiency of LiteCON. LiteCON is a first-of-its-kind memristor-integrated silicon photonic CNN accelerator for end-to-end analog training and inference. It is intended to perform highly energy-efficient and ultrafast training for deep learning applications with state-of-the-art prediction accuracy. The main contributions of this article are summarized as follows:
    We propose LiteCON, a fully analog and scalable silicon photonics-based CNN accelerator for energy-efficient training;
    We introduce a novel compute and energy-efficient silicon microdisk-based convolution and backpropagation architecture;
    We demonstrate a pipelined data distribution approach for high throughput training with LiteCON;
We synthesize the LiteCON architecture using a photonic CAD framework (IPKISS [19]). The synthesized LiteCON is used to execute four variants of VGG-Net [16] and two variants of LeNet [18], demonstrating up to 30×, 34×, and 4.5× improvements during training, and up to 34×, 40×, and 5.5× during inference, in throughput, energy efficiency, and computational efficiency per watt, respectively, compared to state-of-the-art CNN accelerators.
    The rest of the article is organized as follows. Section 2 presents a brief overview of CNNs and prior art. Section 3 provides a gentle introduction of the components used in LiteCON. The details of the LiteCON architecture are described in Section 4. Section 5 illustrates an example design of LiteCON, followed by Section 6, which contains the experimental setup, results, and comparative analysis. Last, we present concluding remarks in Section 7.

    2 Background and Prior Art

    2.1 Convolution Neural Networks

CNNs are a class of deep learning networks commonly used for analyzing visual imagery in image classification and object detection tasks. A CNN comprises three types of layers: convolution layers (CONV), pooling layers (POOL), and fully connected layers (FC). Generally, CONV is accompanied by a non-linear activation function, such as ReLU, Tanh, or Sigmoid. A CNN operates in two stages: training and inference (testing). In the training phase, the filter weights (and biases) in the CONV and FC layers are learned using the backpropagation (BP) algorithm. The BP algorithm involves a forward and a backward pass through the deep network. Given a training sample x, in the forward pass the weighted input sum (convolution) z is computed for the neurons in each layer l with some initial filter weights w (and bias b), followed by the neural activation \( \sigma (z) \) (ReLU(z) in our work) and POOL. The final layer L computes the output label of the overall network for every forward pass. This can be summarized as follows:
    Forward Pass: For each layer l,
    \( \begin{equation} {z}^{x,l} \leftarrow {w}^l{a}^{x,l - 1} + {b}^l, \end{equation} \)
    (1)
    \( \begin{equation} {a}^{x,l} \leftarrow \sigma ({z}^{x,l}). \end{equation} \)
    (2)
    The output error in the final prediction \( {\delta }^{x,L} \) is a result of errors induced by the neurons in each hidden layer during the forward pass. To determine the error contribution of a neuron in the previous layer, i.e., \( \ {\delta }^{x,l} \) , the final error is back propagated through the network starting from the output layer. This can be summarized as follows:
    Output error: At the final layer L,
    \( \begin{equation} {\delta }^{x,L} \leftarrow {\nabla }_a{C}_x \odot \sigma '({z}^{x,L}). \end{equation} \)
    (3)
    Backward Pass: For each layer l,
    \( \begin{equation} {\delta }^{x,l} \leftarrow ({({w}^{l + 1})}^T \times {\delta }^{x,l + 1}) \odot \sigma '({z}^{x,l}). \end{equation} \)
    (4)
Here, \( {\nabla }_a{C}_x \) is the gradient of the cost function \( {C}_x \) with respect to the output activations, and \( \sigma '({z}^{x,L}) \) is the derivative of \( \sigma ({z}^{x,L}) \). These error contributions are necessary to update the filter weights w and biases b in the respective layers using a gradient descent method. In gradient descent, the forward and backward passes are repeated iteratively until the cost function is minimized and the network is trained. This can be summarized as follows:
    Gradient Descent: For each layer l and m training samples with learning rate \( \eta \) ,
    \( \begin{equation} {w}^l \leftarrow \ {w}^l - \ \frac{\eta }{m}\mathop \sum \limits_x {\delta }^{x,l} \times {({a}^{x,l - 1})}^T, \end{equation} \)
    (5)
    \( \begin{equation} {b}^l \leftarrow \ {b}^l - \ \frac{\eta }{m}\mathop \sum \limits_x {\delta }^{x,l}. \end{equation} \)
    (6)
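To make Equations (1)-(6) concrete, the following minimal NumPy sketch runs one forward pass, one backward pass, and one gradient-descent update for a toy two-layer fully connected network; the layer sizes, the quadratic cost, and all variable names are illustrative assumptions, not part of the LiteCON design.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_prime(z):
    return (z > 0).astype(z.dtype)

rng = np.random.default_rng(0)
w1, b1 = 0.1 * rng.standard_normal((16, 8)), np.zeros(16)      # hidden layer l = 1
w2, b2 = 0.1 * rng.standard_normal((4, 16)), np.zeros(4)       # output layer l = L
x, y = rng.standard_normal(8), np.array([1.0, 0.0, 0.0, 0.0])  # one training sample
eta = 0.01                                                     # learning rate

# Forward pass, Eqs. (1)-(2)
z1 = w1 @ x + b1
a1 = relu(z1)
z2 = w2 @ a1 + b2
a2 = relu(z2)

# Output error, Eq. (3), with a quadratic cost so that grad_a C = (a2 - y)
delta2 = (a2 - y) * relu_prime(z2)

# Backward pass, Eq. (4)
delta1 = (w2.T @ delta2) * relu_prime(z1)

# Gradient descent, Eqs. (5)-(6), for a single sample (m = 1)
w2 -= eta * np.outer(delta2, a1)
b2 -= eta * delta2
w1 -= eta * np.outer(delta1, x)
b1 -= eta * delta1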
Section 4 describes how LiteCON performs these computations in the analog photonic domain.

    2.2 Prior Art

To achieve high-speed and energy-efficient deep learning, researchers have recently demonstrated photonic accelerators based on microring weight banks [12, 21], Mach-Zehnder interferometers [10, 22], and multilayer diffractive optical elements [13]. Compared to these optical devices, the silicon photonic microdisk (MD) has a smaller chip area, faster operation, and lower power consumption [14]. Moreover, most silicon photonic accelerators cannot achieve state-of-the-art inference accuracy even for small datasets. For example, none of these photonic accelerators exceeds 97% accuracy on the small MNIST dataset, while recent CNNs easily reach >99.77% accuracy [2] on the same dataset. This is due to the large noise accumulated during fully optical inference. Our proposed MD-based design addresses these bottlenecks.

    3 Components Overview

CMOS-compatible components such as photonic waveguides, silicon MDs, photodiodes, and multi-wavelength LED arrays are used for on-chip photonic signaling [15]. An MD is a circular photonic structure used to modulate an electronic signal onto a photonic carrier at the transmission source of a waveguide. MDs are also used to couple or filter out light from the waveguide at the destination. Each MD modulates light of a specific wavelength, and its geometry (its radius, to be precise) determines its wavelength selectivity. We can also inject (or remove) charge carriers into (from) an MD, or heat it, to alter its operating wavelength.
In a typical high-bandwidth photonic link, an LED array (either on the board or on a 2.5D interposer) generates multiple wavelengths, which are coupled by an optical grating coupler into an on-chip photonic waveguide. The technique of using multiple wavelengths to transmit many data streams simultaneously is referred to as dense wavelength division multiplexing (DWDM). To enable processing of these photonic signals, the on-chip photonic waveguide propagates the input optical power to the destination, where it is captured by photodiodes and converted to electronic data. These components are the building blocks of the proposed LiteCON architecture.

4 LiteCON Architecture

Overview: Our proposed LiteCON architecture is a fully analog, scalable silicon photonics-based CNN accelerator design. Unlike previously proposed CNN accelerators [5, 6], the LiteCON accelerator enables fully analog end-to-end training and inference for CNNs. Figure 1 gives a high-level overview of the LiteCON architecture. As shown in the figure, LiteCON comprises four major parts: a feature extractor, a feature classifier, a backpropagation accelerator, and a weight-update unit. The feature extractor (FE) and the feature classifier (FC) are made up of multiple silicon microdisk-based convolution layers, operational amplifier (OPAMP)-based ReLU layers, and pooling layers. Together, the FE and FC make up the feedforward CNN accelerator. The backpropagation accelerator is built using silicon microdisks, splitters, and multiplexers, and LiteCON's weight-update unit is built from a group of memristors.
    Fig. 1. An overview of LiteCON architecture.

4.1 Feedforward Accelerator

In this article, we consider image datasets as input and image classification as the task to be performed by LiteCON. The digital input data is stored in SRAM. The feedforward accelerator in the LiteCON architecture (see Figure 1) performs feedforward feature extraction followed by feature classification of input images. It operates in four stages: (a) data reading, (b) feature extraction, (c) feature classification, and (d) data writeback. The details are as follows.

    4.1.1 Data Reading.

LiteCON is designed to convolve an input of 28 × 28 pixels at a time, i.e., in one LiteCON cycle. Therefore, it requires 64 LiteCON cycles to execute a 224 × 224 image (the typical size of an ImageNet image). Note that a LiteCON cycle is different from its clock cycle: one LiteCON cycle refers to the complete feature extraction and feature classification of a 28 × 28 image. The SRAM in LiteCON is of size 256 KB (dual data rate, 64 bits), enough to store five images of size 224 × 224. In a pipelined fashion, four blocks of 28 × 28 pixels are written into the memristor crossbar (capable of storing four 28 × 28 pixel blocks) via an n-channel DAC and memristor controller (Figure 2). The crossbar can be thought of as a high-speed cache for LiteCON.
    Fig. 2. Data Reading from SRAM to memristor crossbar.
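As a quick illustration of the data-reading granularity, the short sketch below tiles a 224 × 224 image into the 28 × 28 blocks that LiteCON convolves per cycle; the array contents and names are placeholders, not output of the LiteCON toolchain.

import numpy as np

image = np.random.rand(224, 224)   # one ImageNet-sized input image
tile = 28                          # pixels convolved in one LiteCON cycle
blocks = [image[r:r + tile, c:c + tile]
          for r in range(0, 224, tile)
          for c in range(0, 224, tile)]
print(len(blocks))                 # 64 LiteCON cycles per 224 x 224 image
# With four 28 x 28 blocks buffered in the memristor crossbar at a time,
# the crossbar is refilled 64 / 4 = 16 times per image.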

    4.1.2 Feature Extraction.

The FE in our architecture is carried out using multiple FE stages \( (F{E}_i) \) . Each FE stage comprises multiple photonic convolution layers (PConv), an analog amplifier (OPAMP)-based ReLU layer, another OPAMP-based pooling (POOL) layer, and an interface layer. LiteCON's FE adopts a completely analog computing paradigm, avoiding the inter-layer analog-to-digital (A-to-D) and digital-to-analog (D-to-A) conversions required by state-of-the-art CNN accelerators [5, 6], which use analog memristive convolution but digital CPU/GPU-based ReLU and pooling.
Photonic Convolution (PConv): PConv is the first layer of an FE stage. PConv is based on the principle of analog multiplication using silicon microdisks [14]. A silicon microdisk is used for analog amplitude modulation of a light carrier; in its simplest terms, analog amplitude modulation is the multiplication of a scalar input with an analog signal. The authors of Reference [14] have demonstrated photonic modulator-based analog multipliers. In our design, a PConv convolves 28 × 28 pixels at a time and can be scaled up depending on requirements. A PConv comprises (i) an array of LEDs capable of generating up to N wavelength carriers; (ii) a DWDM multiplexer, splitter, and waveguide arrangement to accommodate all the carriers in one channel; (iii) N × (M + 1) microdisks (N × M for microdisk multiplications and another N for weight modulation); and (iv) N photodiodes.
Convolution in deep learning operates with kernels (or filters) of various sizes, such as 1 × 1, 2 × 2, 3 × 3, and 4 × 4. The widely adopted models that we consider in this article (Table 1) use 3 × 3 filters; hence, Figure 3(a) depicts photonic convolution based on a 3 × 3 filter. To start with, each of the N wavelength channels from the LED array is integrated with a microdisk of the respective wavelength. All microdisks are divided into K groups (K × 9 = N), each having 9 microdisks. These groups of microdisks are then modulated with weight values \( (w_{11}^L,\ w_{12}^L, \ldots, w_{33}^L) \) stored in the memristor crossbar (one part of the memristor crossbar stores weights, and another part stores input data or features obtained in hidden layers). Here, \( w_{ij}^L \) is the weight (i, j) of a filter in the Lth convolution layer. All N modulated wavelengths are multiplexed into one waveguide by a DWDM multiplexer, after which the multiplexed light is split into P equal channels, each carrying all the modulated wavelengths. Each channel is equipped with 784 microdisks (this can be scaled up or down depending on the input size, here 28 × 28). Now, in each channel, pixel values stored in the memristor crossbar are modulated onto individual wavelengths by the microdisks. As shown in Figure 3(a), the first group of 9 pixels (a 3 × 3 patch) is modulated by the first group of 9 microdisks. The pixels for a channel are chosen such that no pixel is modulated onto two wavelengths in the same channel (to avoid data collision). In this way, in each wavelength carrier, using the multiplication principle of a microdisk, an input pixel \( I{n}_{xy} \) is multiplied by a weight value \( w_{ij}^L \) . Note that convolution at its core is nothing but a sum of input-weight products. Finally, the multiplexed light from each channel is captured by an array of photodiodes, each designed to capture nine consecutive wavelengths. For example, the first photodiode integrated with the first channel captures \( (w_{11}^L \times I{n}_{11} + w_{12}^L \times I{n}_{12} + \cdots + w_{33}^L \times I{n}_{33}) \), which is the first convolved matrix. Similarly, the other convolved matrices are captured.
Table 1.
        | FE1           | FE2            | FE3                          | FE4                          | FE5                          | FC
VGG-A   | 3 × 3, 64, 1  | 3 × 3, 128, 1  | 3 × 3, 256, 2                | 3 × 3, 512, 2                | 3 × 3, 512, 2                | FC-4096, 2; FC-1000, 1
VGG-B   | 3 × 3, 64, 2  | 3 × 3, 128, 2  | 3 × 3, 256, 2; 1 × 1, 256, 1 | 3 × 3, 512, 2; 1 × 1, 256, 1 | 3 × 3, 512, 2; 1 × 1, 256, 1 |
VGG-C   | 3 × 3, 64, 2  | 3 × 3, 128, 2  | 3 × 3, 256, 3                | 3 × 3, 512, 3                | 3 × 3, 512, 3                |
VGG-D   | 3 × 3, 64, 2  | 3 × 3, 128, 2  | 3 × 3, 256, 4                | 3 × 3, 512, 4                | 3 × 3, 512, 4                |
LeNet-A | 3 × 3, 6, 1   | 3 × 3, 6, 1    | 3 × 3, 16, 2                 | 3 × 3, 16, 4                 | 3 × 3, 120, 1                | FC-84, 1
LeNet-B | 3 × 3, 6, 1   | 3 × 3, 6, 1    | 3 × 3, 256, 1                | 3 × 3, 16, 6                 | 3 × 3, 120, 1                |
Table 1. CNN Benchmark Configuration for VGG and LeNet
Read (i × j, m, k) as: filter size, i × j; number of such filters, m; and number of back-to-back convolutions in a layer, k.
    Fig. 3. (a) Logical microarchitecture of MD-based photonic convolution; (b) OPAMP-based ReLU layer; and (c) four-input OPAMP-based POOL layer.
Example of a simple PConv: Let us assume there are 9 pixels in an input. The nine pixels are stored as analog inputs in the memristor crossbar as \( I{n}_{11} \) , \( I{n}_{12} \) , …, \( I{n}_{33} \) . For simplicity, suppose there are 9 weights, i.e., a 3 × 3 filter: \( {w}_{11} \) , \( {w}_{12} \) , …, \( {w}_{33} \) . The weights are modulated onto 9 wavelength channels and then passed through a single multiplexed waveguide. Now, each input pixel \( I{n}_{xy} \) is modulated by a microdisk onto the corresponding weight-carrying channel. For instance, the violet microdisk modulates \( I{n}_{11} \) onto the channel carrying weight \( {w}_{11} \) . The microdisk thus performs amplitude modulation, in other words multiplication, so that the channel now carries \( {w}_{11} \times I{n}_{11} \) . Similarly, the other channels end up carrying \( {w}_{12} \times I{n}_{12}, \ldots, {w}_{33} \times I{n}_{33} \) . At the end, all these photonic signals are captured together by a photodiode as a sum total, i.e., \( {w}_{11} \times I{n}_{11} + {w}_{12} \times I{n}_{12} + \cdots + {w}_{33} \times I{n}_{33} \) . That is how photonic convolution works.
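Numerically, the photodiode output in this example reduces to a 9-element dot product. The following sketch (purely illustrative values) emulates the weight modulation, pixel modulation, and photodiode summation described above.

import numpy as np

w = np.random.rand(3, 3)     # 3 x 3 filter (w_11 ... w_33) from the memristor crossbar
inp = np.random.rand(3, 3)   # 3 x 3 input patch (In_11 ... In_33)

# Each microdisk multiplies one weight-carrying wavelength by one pixel
# (amplitude modulation); the photodiode then sums all nine wavelengths.
photodiode_output = np.sum(w * inp)

# This equals the electronic multiply-accumulate operation the PConv layer replaces.
assert np.isclose(photodiode_output, w.flatten() @ inp.flatten())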
Electronic ReLU and Pooling: Neural activation in a CNN can be performed by a variety of non-linear functions, such as Sigmoid, Tanh, and ReLU (rectified linear unit). ReLU is widely used for its simplicity of implementation and exemplary performance; therefore, we consider a ReLU-based neural activation circuit. The following equation describes the operation of a ReLU unit:
    \( \begin{align} ReLU(z) &= z\ if\ z > 0\nonumber\\ & = 0\ if\ z\ \le 0 \end{align} \)
    (7)
We deploy an operational amplifier (OPAMP) to mimic this function, as shown in Figure 3(b), because Equation (7) can be viewed as a comparator, which is exactly what an analog OPAMP implements: it takes two inputs and generates an output based on their comparison. The photodiode output from Figure 3(a) is fed as input to the OPAMP for the ReLU operation. Note that the OPAMP circuitry can be reconfigured to mimic other neural activation functions; the details are omitted for brevity.
The next operation in FE is pooling, which reduces the feature size while keeping spatial invariance. It does so by taking the average or the maximum of multiple elements of a feature vector; we choose the maximum for its superior accuracy in a variety of applications. Pooling is also, at a fundamental level, a comparator function. Four or nine outputs (2 × 2 or 3 × 3 being the typical pooling sizes) from the ReLU units are fed as inputs to an OPAMP-based comparator, which selects the maximum value as the spatially invariant pooling output (Figure 3(c)). The outputs from all the comparators are the extracted features, which are stored back in the memristor crossbar for the next FE stage. Once the features have gone through all the FE stages, they are stored back in SRAM.
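Functionally, the OPAMP-based ReLU and POOL stages are comparators. The sketch below models their input-output behavior; the 2 × 2 pooling window and the sample values are assumptions for illustration only.

import numpy as np

def opamp_relu(z):
    # Comparator behavior of Equation (7): pass z if positive, else output 0.
    return z if z > 0 else 0.0

def opamp_max_pool(window):
    # OPAMP comparator selecting the maximum of a pooling window.
    return max(window)

feature = np.array([[-0.3, 0.8], [0.1, -0.5]])   # one 2 x 2 region of a feature map
activated = np.vectorize(opamp_relu)(feature)    # element-wise ReLU
pooled = opamp_max_pool(activated.flatten())     # single pooled output: 0.8
print(pooled)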

    4.1.3 Photonic Feature Classification.

After feature extraction is performed by the FE stages (PConv, ReLU, and pooling), the features are brought back from the SRAM via the memristor crossbar to undergo the feature classification phase. In a CNN, the feature classification segment can be seen as a special case of convolution in which each extracted feature map uses the largest possible kernel. In other words, feature classification comprises one or more fully connected (FC) layers (in an FC layer, each element of one layer is connected to all the elements of the next layer).
LiteCON employs a microdisk-based matrix-vector multiplier (M-MVM) to implement the FC layer, identical in principle to PConv (Section 4.1.2). In FC, each wavelength channel is modulated with a different weight value (unlike the groups of 9 weights in PConv) to ensure a fully connected network. When all the features from the feature extraction stage have been brought back from the SRAM and are available in the memristor crossbar, they are fed to the FC layer. As an example, consider 512 features coming from the feature extraction (FE) stages. VGG and LeNet operate on a 7 × 7 kernel in FC, so each feature is a 7 × 7 matrix. Therefore, 49 wavelength carriers from an LED array are modulated with 49 weights by microdisks. After multiplexing and splitting into 512 equal channels (similar to PConv), each channel is matrix multiplied with one feature. The output obtained at the photodiode is fed to ReLU followed by pooling (if required). The results are then fed to the next FC layer (if present in the model). After the features go through all the FC layers, we obtain the classified output. During training, the classified outputs from the final FC layer and the target outputs are fed to an analog subtraction unit, whose result (the error vector) is fed to the backpropagation architecture, discussed next.
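Viewed numerically, the FC layer computed by the M-MVM is a matrix-vector product over the flattened features followed by ReLU. The sketch below illustrates this with 512 features of size 7 × 7; the reduced output width (64 instead of 4096) and the random weights are assumptions to keep the example small.

import numpy as np

features = np.random.rand(512, 7, 7)         # outputs of the last FE stage
x = features.reshape(-1)                     # 512 * 49 = 25,088 inputs to the FC layer
n_out = 64                                   # stand-in for the FC-4096 layer width
w_fc = 0.01 * np.random.rand(n_out, x.size)  # assumed FC weight matrix

# Each output neuron is the photodiode sum of all weight-modulated wavelengths,
# i.e., one row of a matrix-vector product, followed by the OPAMP ReLU.
fc_out = np.maximum(w_fc @ x, 0.0)
print(fc_out.shape)                          # (64,)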

    4.2 Backpropagation Accelerator

LiteCON's backpropagation (BP) accelerator employs silicon microdisks, photodiodes, multiplexers, and splitters to perform completely analog matrix multiplication and other arithmetic operations, similar to PConv. In contrast, previously proposed CNN accelerators [5, 6] adopt a hybrid approach, using analog memristors for matrix multiplications and a digital CPU/GPU for other arithmetic operations, which requires performance-hindering A-to-D and D-to-A conversions.
Figure 4(a) illustrates the microarchitecture of the proposed BP accelerator design, which is based on photonic matrix-vector multiplication using silicon microdisks (MDs). We use MDs for their small footprint, high accuracy and quality factor, and low-power nature. We now explain the operation of the proposed BP architecture. As given in Equation (3), the error at the final layer (l = L) of BP is \( {\delta }^{x,L} \leftarrow {\nabla }_a{C}_x \odot \sigma '({z}^{x,L}) \) . Here, \( {\nabla }_a{C}_x \) is the rate of change of the cost with respect to the output activation (i.e., the difference between the actual classified output of the feedforward accelerator and the target output stored in the memristor crossbar), and \( \sigma '({z}^{x,L}) \) is the derivative of the ReLU layer in the final FC stage of the CNN architecture. Outputs from the final FC stage of the CNN architecture are fed to an analog subtraction and multiplication unit (microdisk multiplier) to determine \( {\delta }^{x,L} \) . Applying Equation (4) with the computed \( {\delta }^{x,L} \) , we calculate the error for the (L − 1)th layer as follows:
    \( \begin{equation} {\delta }^{x,L - 1} \leftarrow ({({w}^L)}^T \times {\delta }^{x, L}) \odot \sigma '({z}^{x,L - 1}), \end{equation} \)
    (8)
where \( {w}^L \) is the weight matrix (stored in the memristor crossbar) obtained from the Lth layer of the feedforward CNN architecture. Figure 4(a) shows the backpropagation between the final layer l = L and its penultimate layer l = L − 1. As illustrated, there are N wavelength carriers coming from an LED array. The value of N for a layer equals the output feature size of the corresponding layer in the feedforward accelerator; e.g., N equals 49 (7 × 7) for the last layer. Each wavelength in layer L is modulated with the error \( {\delta }^{x,L} \) by an MD tuned to that wavelength. In Figure 4(a), the violet MD is tuned to modulate \( {\lambda }_1 \) . Let us suppose the jth MD's output is \( M{D}_j = \delta _j^{x,L}*A\sin ( {\frac{{2\pi }}{{{\lambda }_j}}t + \emptyset } ) \) (where \( A\sin ( {\frac{{2\pi }}{{{\lambda }_j}}t + \emptyset } ) \) represents the photonic carrier with wavelength \( {\lambda }_j \) and phase \( \emptyset \) ). Each \( M{D}_j \) output is split into two equal parts. The first part is sent to the weight-update circuitry (explained at the end of this section) to update the corresponding weights in the feedforward accelerator. The other part is fed to a DWDM multiplexer, which combines multiple light wavelengths into a single multi-wavelength carrier. After multiplexing, the multiplexed photonic data is split into M parts by an optical splitter, where M equals the number of neurons in layer L − 1. Each part is fed to a multi-wavelength waveguide. As a result, each waveguide carries N wavelengths, each carrying data \( \delta _{j,n}^{x,L}*B\sin ( {\frac{{2\pi }}{{{\lambda }_j}}t + \emptyset } ) \) , where \( 1 \le n \le N,\ B = \frac{A}{{2N}} \) . Each weight \( w_{ij}^L \) of the transpose of \( {w}^L \) , obtained from the memristor crossbar, is modulated by an MD onto a light carrier. This results in
    \( \begin{equation} {D}_{i,n} = w_{ij}^L*\delta _{j,n}^{x,L}*A\sin \left( {\frac{{2\pi }}{{{\lambda }_j}}t + \emptyset } \right)\!. \end{equation} \)
    (9)
    Fig. 4. Schematic view of LiteCON’s backpropagation accelerator.
    Now, each \( {D}_{i,n} \) is modulated with \( a_n^L \) , which is a derivative of the ReLU functions of layer L − 1 (equal to \( \sigma '( {{z}^{x,L - 1}} ) \) in Equation (8)). Then, \( {D}_{i,n} \) becomes
    \( \begin{equation} {D}_{i,n} = w_{ij}^L*\delta _{j,n}^{x,L}*a_n^L*A\sin \left( {\frac{{2\pi }}{{{\lambda }_j}}t + \emptyset } \right)\!. \end{equation} \)
    (10)
Next, a photodiode is used to demodulate the photonic data from each waveguide. The photodiode captures the combined output \( {D}_{i,n} \) over all wavelengths in a waveguide, which is exactly the matrix-vector-multiplied error vector of Equation (8). The output of each photodiode is passed through a signal conditioning circuit to remove unwanted noise (details of the conditioning circuit are omitted for brevity). The output of the signal conditioning circuit is
    \( \begin{equation} {\delta }^{x,L - 1} = ({({w}^L)}^T \times {\delta }^{x,L}) \odot {a}^L, \end{equation} \)
    (11)
where \( {\delta }^{x,L - 1} \) is the error to be propagated from layer (L − 1) to layer (L − 2). The above procedure continues until the first layer of LiteCON is reached. During backpropagation, the error value in each layer is also sent to the corresponding weight-update circuit, which is discussed in more detail below.
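Numerically, the chain of modulations in Equations (9)-(11) implements the standard error-propagation step. The sketch below reproduces it for one layer pair; the layer sizes and random values are illustrative assumptions.

import numpy as np

N, M = 49, 128                               # neurons in layers L and L - 1 (assumed)
delta_L = np.random.rand(N)                  # error at layer L, Eq. (3)
w_L = np.random.rand(N, M)                   # weights between layers L - 1 and L
a_deriv = (np.random.rand(M) > 0.5) * 1.0    # ReLU derivative of layer L - 1

# Eq. (11): each photodiode sums (w^L)^T * delta over the N wavelengths in its
# waveguide; the result is then gated by the ReLU derivative.
delta_Lm1 = (w_L.T @ delta_L) * a_deriv
print(delta_Lm1.shape)                       # (128,)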
Weight-update circuitry: For the weight update, each element of a weight kernel in any layer l of the CNN architecture can be written as \( w_{k,j}^l \) (note that l = L for the final layer). Each \( w_{k,j}^l \) is stored in a memristor cell of the layer-l memristor crossbar as \( C_{k,j}^l \) , the conductance of that cell. Following Equation (5), the weight-update equation for \( w_{k,j}^l \) (or \( C_{k,j}^l \) ) can be written as follows:
    \( \begin{equation} C_{new\left( {k,j} \right)}^l \leftarrow \ C_{old\left( {k,j} \right)}^l - \ \frac{\eta }{m} \times \delta _k^l \times O_j^{l - 1}, \end{equation} \)
    (12)
where \( O_j^{l - 1} \) is the jth POOL output of layer (l − 1) of the CNN architecture. Figure 5 illustrates the weight-update circuitry for any layer l. As shown, \( \delta _k^l \) is obtained from the BP architecture as a photonic signal. \( O_j^{l - 1} \) , collected from the memristor crossbar (for data storage), is used to modulate the light carrier carrying the error value \( \delta _k^l \) . The modulated output is demodulated by a photodiode and then sent to a signal conditioning circuit, where the analog signal is first filtered (to remove noise) and then passed through a subtractor to obtain the new \( C_{k,j}^l \) as per Equation (12). The previous conductance (weight) value \( C_{old( {k,j} )}^l \) is fed to the subtractor circuit from the lth-layer BP architecture. The new conductance value \( C_{k,j}^l \) is then fed to the corresponding memristor control circuit to update its weight value. The conditioning circuit and the memristor control circuit are adapted from Reference [7].
    Fig. 5. Weight-update circuitry for any layer l.
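Equation (12) amounts to a scaled outer-product update of the stored conductances. A minimal sketch of that update follows; the crossbar size, learning rate, and batch size are placeholder values.

import numpy as np

eta, m = 0.01, 4                  # learning rate and number of training samples
C = np.random.rand(49, 128)       # memristor conductances storing the weights w^l
delta = np.random.rand(49)        # layer-l errors from the BP accelerator
O_prev = np.random.rand(128)      # POOL outputs of layer l - 1 from the crossbar

# Eq. (12): new conductance = old conductance - (eta / m) * delta_k * O_j
C -= (eta / m) * np.outer(delta, O_prev)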

    5 LiteCON Case Study

In this section, we present the operation of the proposed pipelined LiteCON architecture for the CNN benchmark VGG [16] on the ImageNet dataset [17]. In our experiments, we consider all variants of the VGG [16] and LeNet [18] benchmarks shown in Table 1. For this case study, we integrate the PConv, ReLU, POOL, and FC layers based on the VGG-A model, as shown in Figure 6(a). Note that for one convolution in \( {\rm{F}}{{\rm{E}}}_1 \) of VGG-A (Table 1), there is an equivalent PConv in \( {\rm{F}}{{\rm{E}}}_1 \) of Figure 6(a); similarly, for two back-to-back convolutions in \( {\rm{F}}{{\rm{E}}}_3 \) , there are two back-to-back PConv layers in \( {\rm{F}}{{\rm{E}}}_3 \) of Figure 6(a). The backpropagation accelerator is connected to the feedforward accelerator as follows: BP-1 with \( {\rm{F}}{{\rm{E}}}_1 \) , BP-2 with \( {\rm{F}}{{\rm{E}}}_2 \) , BP-3 with \( {\rm{F}}{{\rm{E}}}_3 \) , and so on. The rest of the section discusses how LiteCON mimics VGG-A.
    Fig. 6. (a) VGG-A implemented on LiteCON. (b) Pipelined dataflow in feedforward operation in LiteCON.
    VGG for the ImageNet dataset operates on a 224 × 224 image input. As mentioned earlier, LiteCON is designed to convolve 28 × 28 pixels at a time, i.e., one LiteCON cycle. Therefore, it requires 64 LiteCON cycles to execute a 224 × 224 image. The SRAM register array in LiteCON is of size 256 KB to store five images of size 224 × 224. PConv performs feature extraction on a 28 × 28 input data at a time in a pipelined manner.
Figure 6(b) demonstrates the pipelined dataflow of the feedforward operation in LiteCON. We consider a 2.5 GHz clock; therefore, the clock cycle period is \( {{\rm{T}}}_{{\rm{sm}}} = 400 \) ps. As shown in Figure 6(b), at \( t = {{\rm{T}}}_{{\rm{sm}}} \) , the first set of 28 × 28 pixels from SRAM (i.e., A) is convolved (64 filters/features) and stored in memristor crossbars (for data storage); the yellow interface module in Figure 6(a) represents data transfer into the memristors in the peripheral circuit. To illustrate the pipelined approach, we also follow the convolution of another three sets of 28 × 28 pixels, namely, B, C, and D. Note that PConv convolves a 28 × 28 input in one clock cycle (Section 4.1.2). As \( {\rm{F}}{{\rm{E}}}_1 \) of VGG-A consists of one convolution layer (see Table 1), the convolved photonic outputs of PConv-1 of \( {\rm{F}}{{\rm{E}}}_1 \) are sent to the ReLU layer through the photodiode, followed by the POOL layer. The time required for convolved data of one FE stage to arrive at the next FE stage is \( {T}_{FE} \) = photodiode conversion time + ReLU time + POOL time + interface time = 20 ps + 10 ps + 10 ps + 10 ps = 50 ps. From \( t = {T}_{sm} \) to \( t = 2{T}_{sm} \) , the PConv(A) outputs from the peripheral circuit of \( F{E}_1 \) are photodiode-converted, ReLU'ed, and POOL'ed, and then fed to \( {\rm{F}}{{\rm{E}}}_2 \) .
There can be 8 such data movements since \( \frac{{{T}_{sm}}}{{{T}_{FE}}} = 8 \) . In one data movement, four 28 × 28 features can be processed. Therefore, at \( t = 2{T}_{sm} \) , 32 PConv(A) features arrive at \( {\rm{F}}{{\rm{E}}}_2 \) . Similarly, from \( t = 2{T}_{sm} \) to \( t = 3{T}_{sm} \) , 32 PConv(B) features; from \( t = 3{T}_{sm} \) to \( t = 4{T}_{sm} \) , 32 PConv(C) features; and from \( t = 4{T}_{sm} \) to \( t = 5{T}_{sm} \) , 32 PConv(D) features are convolved and stored in the peripheral circuit of \( F{E}_2 \) . After this, from \( t = 5{T}_{sm} \) to \( t = 6{T}_{sm} \) , the remaining 32 PConv(A) features in \( F{E}_1 \) are convolved in \( F{E}_2 \) . In this way, by \( t = 6{T}_{sm} \) , all 64 PConv(A) features in \( F{E}_1 \) have been convolved with the 128 \( F{E}_2 \) filters to produce 128 features, which are stored in the memristors of its peripheral circuit. Similarly, the remaining 32 B, C, and D features are convolved and stored (Figure 6(b)) by \( t = 7{T}_{sm} \) , \( t = 8{T}_{sm} \) , and \( t = 9{T}_{sm} \) , respectively. Per the VGG-A configuration (Table 1), \( F{E}_1 \) has 64 features, \( F{E}_2 \) has 128 features, \( F{E}_3 \) has 256 features, and so on. It is important to note that the 64 PConv(A) features from \( F{E}_1 \) are convolved with 128 kernels/filters to produce 128 PConv(A) features for \( F{E}_2 \) ; similarly, the 128 PConv(A) features from \( {\rm{F}}{{\rm{E}}}_2 \) are convolved with 256 kernels to produce 256 PConv(A) features for \( F{E}_3 \) .
A, B, C, and D are convolved separately until \( t = 10{T}_{sm} \) , when all of them arrive at \( F{E}_3 \) as 256 7 × 7 features each. Now, all of these features are merged together to form 256 28 × 28 features. Therefore, it requires another \( 8{T}_{sm} \) (i.e., from \( t = 10{T}_{sm} \) to \( t = 18{T}_{sm} \) ) to send the 256 28 × 28 features from \( F{E}_3 \) and convolve them into 512 14 × 14 features at \( F{E}_4 \) . Similarly, convolution, ReLU, and POOL are performed in \( F{E}_4 \) and \( F{E}_5 \) . As illustrated in Figure 6(b), at \( t = 24{T}_{sm} \) , 512 features are obtained from \( F{E}_5 \) for 56 × 56 pixels. As shown in Figure 6(a), features from \( F{E}_5 \) are stored in SRAM until all 224 × 224 pixels have been processed. For 224 × 224 pixels, this takes \( 16 \times 24{T}_{sm} = 384{T}_{sm} = 153.6 \) ns (in \( 24{T}_{sm} \) , four 28 × 28 pixel blocks are convolved; therefore, \( 16 \times 24{T}_{sm} \) is needed for 224 × 224 pixels). After this, all the features are retrieved from SRAM and fed to the FC layers for feature classification. The first FC operation requires \( ({T}_{sm} + {T}_{FE}) \) time, as FC is identical to FE. The second FC operation requires \( {T}_{FE} \) time, as no further SRAM or memristor read is needed. This means that LiteCON requires 153.6 ns (for FE) \( + \ {T}_{sm} + 2{T}_{FE} \approx 154 \) ns for one forward pass. After a forward pass, the FC output is sent to the BP architecture for backpropagation. Each layer in BP requires \( {T}_b \) units of time, where \( {T}_b \) = (error modulation onto light carrier) + (split time) + (WDM multiplexing time) + (split time) + (weight modulation time) + (ReLU function derivative modulation time) + (photodiode time) = 10 ps + 10 ps + 10 ps + 10 ps + 10 ps + 10 ps + 20 ps = 80 ps. It takes \( 6{T}_b \) to complete one backward pass. In summary, LiteCON requires about 154 ns for one forward pass and 480 ps ( \( 6{T}_b \) ) for a backward pass. The ultra-fast nature of photonic interconnects allows for high-speed backpropagation in LiteCON.
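The latency figures quoted above follow directly from the per-stage delays stated in this section; the short script below reproduces the arithmetic for the VGG-A case study (it is a timing model only, not a simulation of the hardware).

T_sm = 400e-12                        # LiteCON clock period at 2.5 GHz (s)
T_fe = (20 + 10 + 10 + 10) * 1e-12    # photodiode + ReLU + POOL + interface (s)
T_b = (6 * 10 + 20) * 1e-12           # per-layer backpropagation delay (s)

fe_time = 16 * 24 * T_sm              # 24 T_sm per four 28 x 28 tiles, 16 tile groups
forward = fe_time + T_sm + 2 * T_fe   # add first FC (T_sm + T_fe) and second FC (T_fe)
backward = 6 * T_b                    # six BP layers for VGG-A

print(f"forward pass  ~ {forward * 1e9:.1f} ns")    # ~154 ns
print(f"backward pass ~ {backward * 1e12:.0f} ps")  # 480 ps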

    6 Experimental Analyses

6.1 Design Methodology

We use IPKISS [19], a commercial photonic CAD toolchain, to design and synthesize all of the photonic components in LiteCON, and the synthesized components are integrated together to build LiteCON. For all of the photonic components, we consider a 32 nm IPKISS library. We also developed a C++-based architectural simulator, which takes device- and link-level parameters from IPKISS, to estimate the performance of the LiteCON accelerator on several benchmarks.

    6.1.1 Power, Area, and Performance Models.

We use Caphe [19] to model the power and area of all photonic elements such as microdisks, DWDM multiplexers, waveguides, and LEDs. The energy, timing, and area parameters for memristor crossbars are obtained from Reference [6]. For the DAC, we deploy an integrate-and-fire mechanism identical to PipeLayer [6] in our design; the power, latency, and area models are adapted accordingly from PipeLayer. The power, timing, and area parameters of the ADC used in the FC layer of LiteCON are obtained from Reference [5]. All these parameters are listed in Table 2.
Table 2.
Component | Parameter | Value | Power (mW) | Area (mm2)
SRAM register | Size | 2 KB | 10 | 0.2
SRAM register | Count | 128 | |
DAC | Resolution | 8-bit | 4.374 | 0.000208
DAC | Frequency | 1.2 Gbps | |
DAC | Channels | 64 | |
DAC | Count | 208 | |
ADC | Resolution | 8-bit | 490 | 0.294
ADC | Frequency | 1.2 Gbps | |
ADC | Count | 245 | |
Memristor crossbar (for weights and data) | Size | 64 KB | 30 | 0.5
Microdisk | Time | 20 ps | 1080.8 | 39.38
Microdisk | Count | 62720 | |
Photodiode | Time | 20 ps | 1080.8 | 39.38
Photodiode | Count | 62720 | |
Trans-impedance amplifier (TIA) | Time | 10 ps | 0.18 pJ/bit | 0.28
Trans-impedance amplifier (TIA) | Count | 62720 | |
WDM coupler | Count | 16 | 0 | 0.00028
WDM decoupler | Count | 16 | 0 | 0.00028
OPAMP | Time | 20 ps | 0.05 | 0.0045
OPAMP | Count | 980 | |
LED | Wavelengths | 16 | 32000 | 0.384
LED | Count | 6 | |
Waveguide | DWDM | 16 | 0 | 80
Waveguide | Width | 450 nm | |
Waveguide | Count | 520 | |
Table 2. Parametric Details
We use TensorFlow [20], a widely used deep learning framework, to train on the datasets in conjunction with the photonic component results from IPKISS. We manually map each of our benchmarks onto the waveguides, ReLU, max-pool, and FC units of LiteCON; this ensures zero pipeline hazards between any two layers in LiteCON. We compare the performance of LiteCON with a state-of-the-art CNN accelerator, PipeLayer [6], and a recent GPU (results obtained from Reference [6]).
For comparison, we evaluate the following metrics: Throughput is the total number of operations per unit time (GOPS/s); computational efficiency per watt represents throughput per unit area per watt (GOPS/s/W/mm2); energy efficiency is the number of fixed-point operations performed per second per watt (GOPS/s/W); and prediction error rate is the percentage of inference errors on a given dataset. Note that all the results in our analysis are based on an 8-bit weight resolution, as the ADC/DAC are of 8-bit resolution.
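For clarity, the sketch below shows how the three efficiency metrics are derived from raw measurements; the operation count, runtime, power, and area values are placeholders, not measured results.

ops = 1.0e12       # fixed-point operations executed (placeholder)
runtime = 0.01     # seconds (placeholder)
power = 50.0       # watts (placeholder)
area = 100.0       # mm^2 (placeholder)

throughput = ops / runtime / 1e9            # GOPS/s
energy_efficiency = throughput / power      # GOPS/s/W
cepw = throughput / power / area            # GOPS/s/W/mm^2
print(throughput, energy_efficiency, cepw)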

    6.1.2 Benchmarks and Datasets.

We execute two widely used CNN benchmarks, VGG-Net [16] and LeNet [18], on LiteCON. We consider four variants of the VGG benchmark (VGG-A, VGG-B, VGG-C, and VGG-D) and two variants of LeNet (LeNet-A and LeNet-B), as depicted in Table 1. For a fair comparison, the configuration of all stages of the VGG-Net and LeNet benchmarks is identical to Reference [6]. For VGG, we use the ImageNet dataset [17] with 224 × 224 images; we consider a subset of ImageNet, i.e., 1M images with 1,000 labels. For LeNet, we use 60,000 28 × 28 images of the MNIST dataset [18] for training and 10,000 28 × 28 images for testing, with 10 labels.

    6.2 Performance Analysis

    Figures 7(a) and 7(b) present throughput of the proposed LiteCON and PipeLayer [6] compared to the baseline GPU implementation results, also from Reference [6], during training and inference, respectively. The GPU-based accelerator performs with an average training throughput of 306 GOPS/s and an average inference throughput of 347 GOPS/s. PipeLayer shows an average training throughput of 2,923 GOPS/s and an average inference throughput of 3,102 GOPS/s. The proposed LiteCON performs with an average training throughput of 90,853 GOPS/s and an average inference throughput of 98,958 GOPS/s. The superior performance of LiteCON is due to the intelligent integration of ultra-fast memristors and high-speed photonic components such as MDs, photodiodes, and DWDM waveguides.
    Fig. 7. (a) Throughput comparison across accelerators during training; (b) throughput comparison across accelerators during inference; (c) speedup of LiteCON compared to GPU w.r.t. weight resolution.
    The overall throughput of PipeLayer is affected by inter-layer data conversion with relatively slow ADCs. Also, PipeLayer spends most of its time in sequential weight updates during training. However, LiteCON has an inherent advantage due to its photonic parallel weight update mechanism. On average, LiteCON outperforms PipeLayer and GPU by 32× and 292× in terms of speedup, respectively. Finally, for the results presented in Figures 7(a) and 7(b), the variance of throughput across benchmarks is 1,650 with a standard deviation of 40.02, which is negligible considering the extreme scale throughput of LiteCON.
    Figure 7(c) illustrates the effects of weight resolution on overall speedup of LiteCON compared to GPU. In general, weight resolution has negligible effect on the speedup of LiteCON. This is due to the fact that the data conversion (A-D or D-A) is done either at the beginning or at the end of the forward pass in LiteCON. Further, we see a slightly decreasing trend of speedup from VGG-A to VGG-D in Figure 7(c). This is due to the increase in total number of convolution layers from VGG-A to VGG-D.
Figure 8 illustrates the computational efficiency per watt (CEPW) comparison of the proposed LiteCON, the memristor crossbar-based PipeLayer [6], and the baseline GPU. The CEPW trend is similar for both training and inference, so we show only one plot. PipeLayer uses memristor crossbars, which have a CEPW of 120 GOPS/s/W/mm2, for the bulk of its arithmetic operations. However, the overall CEPW of PipeLayer comes down to 106 GOPS/s/W/mm2 due to its extensive use of data conversions. Also, ReLU and POOL are performed by a digital ALU in PipeLayer, which requires more memory to store intra-layer data for synchronization with its pipeline mechanism. The superiority of LiteCON comes from the fact that it is a completely analog accelerator: it does not involve inter-layer data conversions or storage for synchronization, and AD and DA conversions are done only at the beginning or end of feature extraction. In addition to compute-efficient memristors, LiteCON also uses high-speed OPAMPs for ReLU. As shown in Figure 8, LiteCON has 5× and 60× higher computational efficiency than PipeLayer and GPU, respectively. The proposed LiteCON architecture shows a CEPW variance of 80.22 (standard deviation of 8.95 GOPS/s/W/mm2), which is reasonable considering its high computational efficiency.
    Fig. 8. Computational efficiency per watt comparison of LiteCON, PipeLayer [6], and GPU [6].

    6.3 Energy Savings

We compare the energy efficiency of LiteCON with PipeLayer and GPU, as depicted in Figures 9(a) and 9(b). For the VGG-Net benchmarks, the average energy efficiency of PipeLayer is 31.3 and 33.2 GOPS/s/W during training and inference, respectively, which is 1.5× and 1.7× higher than the GPU-based accelerator. For the LeNet benchmarks, PipeLayer shows 21× and 22.7× higher energy efficiency than the GPU. Unlike PipeLayer, LiteCON performs uniformly across both the VGG-Net and LeNet benchmarks, with energy efficiencies of 1,027.5 and 1,096.5 GOPS/s/W during training and inference, respectively. PipeLayer replicates its early feature extraction layers many times (close to 50 K times) to maintain a balanced pipeline, which involves excessive use of power-hungry data conversions. LiteCON, in contrast, uses passive optical components such as waveguides and microdisks, in addition to energy-efficient components such as photodiodes and memristors, and it uses very few ADCs/DACs compared to PipeLayer. As shown in Figure 9(a), for demanding benchmarks such as VGG-Net, we obtain 37× and 45× improvements in energy efficiency for LiteCON compared to PipeLayer and GPU, respectively. Overall, LiteCON outperforms PipeLayer and GPU across all benchmarks by 5× and 43×, respectively.
    Fig. 9. (a) Energy efficiency comparison across accelerators during training for VGG-Net and LeNet; (b) average energy efficiency comparison across accelerators during inference.

    6.4 Comparisons with Latest Photonic Accelerators

    Most of the photonic accelerators today deal only with inference. Therefore, we choose to compare the inference speedup with two promising photonic CNN accelerators [30, 31]. The comparison is illustrated in Table 3.
Table 3.
Attribute | [30] | [31] | LiteCON
Type of accelerator | Inference only | Inference only | Complete accelerator (both inference and training)
Fully analog, hybrid, or digital | Fully analog | Hybrid (analog convolution; A/D conversion to DRAM; D/A conversion for analog convolution in the next layer) | Fully analog
Components | Convolution: Star coupler; Pooling: Star coupler; Activation: Opto-electric component | Convolution: Memristor banks | Convolution: Memristor; Pooling: Optical comparator; Activation: Optical amplifier
Compatibility | Only CNN | Only CNN | CNN (can be extended to DNN by making changes at design time)
Modulation involved | Both phase and amplitude | Only amplitude | Only amplitude
Type of activation | ReLU | ReLU | ReLU
Datasets | MNIST | MNIST | MNIST and ImageNet
Speedup w.r.t. state-of-the-art GPU | Up to 65× | Up to 165× (for the small-scale MNIST dataset); up to 78× (for the large-scale ImageNet dataset) | Up to 350× (for all datasets)
Table 3. Comparisons with the Latest Photonic CNN Accelerators [30, 31]

    6.5 Prediction Accuracy

We performed a sensitivity analysis to investigate the impact of weight resolution on average prediction accuracy. Our design shows a prediction accuracy of 98% (slightly lower than the state-of-the-art GPU accuracy of 99.3% and the PipeLayer accuracy of 98.8%) for an 8-bit weight resolution. We choose this weight resolution because we use 8-bit DACs/ADCs in our design. The prediction accuracy can be enhanced further by adopting an AD/DA mechanism with higher resolution; we choose not to at present to remain on the conservative side from a CAD design standpoint. We also consider other sensitivity analyses, such as the effects of noise, propagation losses, photonic intrinsic losses, quantization error (in the ADC/DAC), and the quality factor of photonic components on prediction accuracy. The dominant factor among them is propagation loss, which accrues as the light signal traverses from its source to its destination. Figure 10 shows the impact of propagation loss on accuracy.
    Fig. 10. Impact of propagation loss on prediction accuracy.
For a 16-bit AD/DA resolution, LiteCON achieves 99.2% prediction accuracy for VGG and LeNet at the cost of a 9% reduction in energy efficiency. Even so, the energy efficiency with 16-bit resolution is still higher than that of the state-of-the-art PipeLayer and GPU (3.5× and 33×, respectively, on average across benchmarks).
For a fair comparison, we also brought PipeLayer's accuracy down to 98% by considering a 4-bit AD/DA resolution. This enhances its energy efficiency by 10%, i.e., from 273 GOPS/s/W (average) to 300 GOPS/s/W. The energy efficiency of the GPU is not affected by changing the resolution, as it is a completely digital system. The resulting 300 GOPS/s/W is still well below LiteCON's average energy efficiency of 1,132.85 GOPS/s/W.
Another factor that affects the prediction accuracy of LiteCON is the finesse of the microdisks (MDs) used in the system. Finesse determines the quality and operational accuracy of a microdisk and depends on the MD's intrinsic losses. Figure 11 shows the impact of intrinsic losses (in dB/cm) on the finesse of MDs of various sizes. The intrinsic loss of an MD depends on the materials used; we assume an intrinsic loss of 2.5 dB/cm in our design.
    Fig. 11. Impact of intrinsic losses on the Finesse of a microdisk.
Effects of component noise/error on accuracy: The error/noise encountered by individual components plays a role in determining the overall prediction error rate (PER). (1) Each memristor can have 1,000 quantized states; the quantization error due to this limited number of memristor states contributes up to 1.2% of PER. (2) The signal-to-noise ratio (SNR) of the microdisks used in LiteCON is 10 dB, adapted from Reference [28]; the MDs' contribution to the overall PER is 2.35%. (3) Each OPAMP in LiteCON has an SNR of 30 dB [29], which accounts for a PER of 0.85%. (4) The memristor-photonic interface is noisy: signals going from the memristors to the modulators encounter noise with an SNR of 25 dB, which leads to a PER of 1.45%. We obtained these numbers through detailed optoelectronic synthesis using the IPKISS tool.
    Please note that silicon photonic technology keeps evolving at a fast rate. With future improvement in microdisk finesse, intrinsic losses, and propagation loss, we can see accuracy close to state-of-the-art GPU accelerators (99.7% and beyond).
Further improvement in accuracy by incremental training: Incremental training is a proven approach to further enhance accuracy and reduce training time. We performed incremental learning with LiteCON based on Reference [27]; with this approach, LiteCON's accuracy increases from 98% to 98.7%. One key requirement of incremental training is storing the previously learned model parameters in memory for use in the next learning phase; to support this, we added 64 KB of SRAM.

    6.6 Discussion 1: LiteCON with Complex Models

Nowadays, more complex deep learning models are emerging, such as GoogleNet [24], Transformer [25], and BERT [26]. In terms of architecture and characteristics, they are extremely complex, with 150+ hidden layers and millions of parameters. However, at the core of their functionality, all of them comprise softmax, activation functions, fully connected layers, and masking units. LiteCON contains the fundamental photonic components (Section 4) to emulate these functionalities. For VGG and LeNet, we consider a ReLU activation; however, the activation circuit in LiteCON can be configured at design time to perform other neural activations as required by today's more complex deep learning models. One challenge LiteCON would face while executing these large models is performing multiple cycles of training without long wait times. That can be avoided by considering a multi-core LiteCON architecture connected by an optical on-chip network.

    6.7 Discussion 2: Effects of Memristor Aging on LiteCON

Memristors play a major role in LiteCON, i.e., transferring analog data to the photonic realm in a pipelined fashion, which underpins LiteCON's speedup. However, like any electrical device, a memristor has many non-linear characteristics and is prone to degradation with aging. In LiteCON, we account for how aging, an important non-ideality, affects the performance of a memristor device. Being a non-reversible and inevitable process, aging challenges the reliability of a memristor crossbar. We modeled an aging function to capture the effect of aging in a memristive device and introduce a novel system-level aging model for memristor crossbars; such a model can be integrated into any memristor CAD tool to investigate performance accurately. In addition, we deploy an aging-aware memristor training scheme called skewed weight training, which incorporates the age of each memristor cell to adjust its conductance and current values dynamically, thereby maintaining accuracy and energy efficiency. To the best of our knowledge, this is the first scheme of its kind. Experiments with a standard CAD tool demonstrate a 25% increase in the lifetime of a memristor crossbar when this scheme is incorporated. The details of this work are omitted for brevity.

    7 Conclusions

This article demonstrates a fully analog CNN accelerator called LiteCON that optimally integrates low-area, ultra-fast, and energy-efficient photonic components such as microdisks, waveguides, photodiodes, and splitters. LiteCON comprises a completely analog photonic backpropagation architecture. Further, the proposed architecture deploys (i) a scalable photonic convolution design based on microdisks in each CNN layer to emulate a range of sample CNN models and (ii) a pipelined dataflow approach for high throughput. Compared to PipeLayer [6] and a GPU, the LiteCON architecture shows higher computational and energy efficiency due to its use of energy-efficient microdisks, high-speed memristor crossbars, and a fully analog feature extraction method. We demonstrated that the proposed design has the potential to achieve up to 30×, 34×, and 4.5× improvements during training, and up to 34×, 40×, and 5.5× during inference, in throughput, energy efficiency, and computational efficiency per watt, respectively, compared to the state of the art, with little reduction in accuracy. Our future work will address how LiteCON can be extended for broader applicability to other types of deep learning models, e.g., deep neural networks (DNNs).

    References

    [1]
    W. Li, K. Liu, L. Yan, et al. 2019. FRD-CNN: Object detection based on small-scale convolutional neural networks and feature reuse. Sci. Rep. 9 (2019), 16294.
    [2]
Alex Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS’12).
    [3]
    Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of MICRO-47.
    [4]
    C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong. 2016. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’16).
    [5]
    A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of the IEEE/ACM International Symposium on Computer Architecture (ISCA’16).
    [6]
    L. Song, X. Qian, H. Li, and Y. Chen. 2017. PipeLayer: A pipelined ReRAM-based accelerator for deep learning. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’17).
    [7]
    T. Gokmen and Y. Vlasov. 2016. Acceleration of deep neural network training with resistive cross-point devices: Design considerations. Front. Neurosci. 10 (July 2016), 333.
    [8]
    Miao Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams. 2016. Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication. In Proceedings of the IEEE/ACM Design Automation Conference (DAC’16).
    [9]
    K. Vandoorne, J. Dambre, D. Verstraeten, B. Schrauwen, and P. Bienstman. 2011. Parallel reservoir computing using optical amplifiers. IEEE Trans. Neural Netw. 22, 9 (Sept. 2011), 1469–1481.
    [10]
    Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić. 2017. Deep learning with coherent nanophotonic circuits. Nature Photon. 11 (Jun. 2017), 441–446.
    [11]
    D. Dang, J. Dass, and R. Mahapatra. 2017. ConvLight: A convolutional accelerator with memristor integrated photonic computing. In Proceedings of the IEEE International Conference on High Performance Computing (HiPC’17).
    [12]
    A. N. Tait et al. 2017. Neuromorphic photonic networks using silicon photonic weight banks. Sci. Rep. 7 (2017), 7430.
    [13]
    X. Lin et al. 2018. All-optical machine learning using diffractive deep neural networks. Science 361 (2018), 1004–1008.
    [14]
    Z. Ying et al. 2018. Electro-optic ripple-carry adder in integrated silicon photonics for optical computing. IEEE J. Select. Top. Quant. Electron. 24, 6 (2018).
    [15]
    Y. Long, L. Zhou, and Jian Wang. 2016. Photonic-assisted microwave signal multiplication and modulation using a silicon Mach–Zehnder modulator. Sci. Rep. 6 (Feb. 2016), Art. No. 20215.
    [16]
    K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR’15).
    [17]
    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. 2015. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 3 (Dec. 2015), 211–252.
    [18]
    Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (Nov 1998), 2278–2324.
    [19]
    IPKISS-Photonic Framework. 2018. Retrieved from www.lucedaphotonics.com.
    [20]
    Martin Abadi et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). USENIX Association, 265–283.
    [21]
    Mengkun Li and Yongjian Wang. 2020. An energy-efficient silicon photonic-assisted deep learning accelerator for big data. In Proceedings of the Conference on Wireless Communications and Mobile Computing.
    [22]
    B. J. Shastri, A. N. Tait, T. Ferreira de Lima, et al. 2021. Photonics for artificial intelligence and neuromorphic computing. Nat. Photon. 15 (2021), 102–114.
    [23]
    D. Dang, S. Taheri, B. Lin, and D. Sahoo. 2020. MEMTONIC: A neuromorphic accelerator for energy efficient deep learning. In Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC’20). 1–2.
    [24]
    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15).
    [25]
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates, Red Hook, NY, 6000–6010.
    [26]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT’19). 4171–4186.
    [27]
    German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. 2019. Continual lifelong learning with neural networks: A review. Neural Netw. 113 (May 2019), 54–71.
    [28]
    Farzaneh Zokaee, Qian Lou, Nathan Youngblood, Weichen Liu, Yiyuan Xie, and Lei Jiang. 2020. LightBulb: A photonic-nonvolatile-memory-based accelerator for binarized convolutional neural networks. In Proceedings of the 23rd Conference on Design, Automation and Test in Europe (DATE’20). EDA Consortium, San Jose, CA, 1438–1443.
    [29]
    M. Liu, P. Mak, Z. Yan, and R. P. Martins. 2011. A high-voltage-enabled recycling folded cascode OpAmp for nanoscale CMOS technologies. In Proceedings of the IEEE International Symposium of Circuits and Systems (ISCAS’11). 33–36.
    [30]
    J. R. Ong, C. C. Ooi, T. Y. L. Ang, S. T. Lim, and C. E. Png. 2020. Photonic convolutional neural networks using integrated diffractive optics. IEEE J. Select. Topics Quant. Electron. 26, 5 (Sept./Oct. 2020).
    [31]
    A. Mehrabian, Y. Al-Kabani, V. J. Sorger, and T. El-Ghazawi. 2018. PCNNA: A photonic convolutional neural network accelerator. In Proceedings of the IEEE Symposium on Cloud Computing (SOCC’18).
