
LiteCON: An All-photonic Neuromorphic Accelerator for Energy-efficient Deep Learning

Published: 01 September 2022

Abstract

    Deep learning is highly pervasive in today's data-intensive era. In particular, convolutional neural networks (CNNs) are being widely adopted in a variety of fields for superior accuracy. However, computing deep CNNs on traditional CPUs and GPUs brings several performance and energy pitfalls. Several novel approaches based on ASIC, FPGA, and resistive-memory devices have been recently demonstrated with promising results. Most of them target only the inference (testing) phase of deep learning. There have been very limited attempts to design a full-fledged deep learning accelerator capable of both training and inference. It is due to the highly compute- and memory-intensive nature of the training phase. In this article, we propose LiteCON, a novel analog photonics CNN accelerator. LiteCON uses silicon microdisk-based convolution, memristor-based memory, and dense-wavelength-division-multiplexing for energy-efficient and ultrafast deep learning. We evaluate LiteCON using a commercial CAD framework (IPKISS) on deep learning benchmark models including LeNet and VGG-Net. Compared to the state of the art, LiteCON improves the CNN throughput, energy efficiency, and computational efficiency by up to 32×, 37×, and 5×, respectively, with trivial accuracy degradation.

    1 Introduction

Convolutional neural networks (CNNs) have become the go-to solution for a wide range of problems, such as object recognition [1], speech processing, and machine translation. Deep CNN models trained with large datasets are highly relevant and critical to ever-growing cloud services, such as face identification (e.g., Apple iPhoto and Google Picasa) and speech recognition (e.g., Apple Siri and Google Assistant). However, a CNN algorithm involves a huge volume of computationally intensive convolutions. For example, AlexNet [2], a basic CNN model introduced in 2012, requires 724 M floating-point multiply-accumulate (MAC) operations just for inference. Training a CNN generally requires more than 20 times as many floating-point MACs, and this increase, compounded over many training iterations, makes training roughly three orders of magnitude more compute- and memory-intensive than inference [5]. As a result, traditional CPUs and GPUs struggle to achieve high processing throughput per watt [3] for CNN applications. To address this, several FPGA [4] and ASIC [5] approaches have been proposed to accomplish large-scale CNN acceleration.
A CNN comprises two stages: training and inference. Most hardware accelerators for CNNs in the prior literature focus only on the inference stage, while training is done offline using GPUs. However, training a CNN is up to several hundred times more compute- and power-intensive than inference [5]. Moreover, for many applications training is not a one-time activity, especially under changing environmental and system conditions, where re-training the CNN at regular intervals is essential to maintain prediction accuracy over time. This calls for an energy-efficient training accelerator in addition to an inference accelerator.
Training a CNN generally employs the backpropagation algorithm, which demands high memory locality and compute parallelism. Recently, a few resistive-memory (ReRAM or memristor crossbar)-based training accelerators have been demonstrated for CNNs, e.g., ISAAC [5], PipeLayer [6], and RCP [7]. ISAAC and RCP use highly parallel memristor crossbar arrays to address the need for parallel computation in CNNs. In addition, ISAAC uses a very deep pipeline to improve system throughput; however, this is only beneficial when a large number of consecutive images can be fed into the architecture. Unfortunately, during training, only a limited number of consecutive images can often be processed before each weight update, and the deep pipeline in ISAAC also introduces frequent pipeline bubbles. Compared to ISAAC, PipeLayer demonstrates an improved pipeline approach to enhance throughput. However, RCP, DPE [8], ISAAC, and PipeLayer all involve several analog-to-digital (AD) and digital-to-analog (DA) conversions, which become a performance bottleneck in addition to their large power consumption. Also, training in these accelerators involves sequential weight updates from one layer to another, which incurs inter-layer waiting time for synchronization and reduces overall performance. This calls for an analog accelerator that can drastically reduce the number of AD/DA conversions and the inter-layer waiting time. It has recently been demonstrated that a completely analog matrix-vector multiplication is 100× more efficient than its digital counterpart implemented with an ASIC, FPGA, or GPU [8]. Vandoorne et al. [9] have demonstrated a small-scale, efficient recurrent neural network using analog photonic computing, and a few efficient on-chip photonic inference accelerators have also been proposed in References [10, 11, 23]. However, a full-fledged analog deep learning (or, to be precise, CNN) accelerator capable of both training and inference is yet to be demonstrated.
In this article, we propose LiteCON, a novel silicon photonics-based neuromorphic CNN accelerator. It comprises silicon photonic microdisk-based convolution, memristive memory, high-speed photonic waveguides, and analog amplifiers. LiteCON works completely in the analog domain, hence the term neuromorphic (a neuromorphic system is made up of analog components that mimic brain-like behavior, in this case a CNN, an artificial neural network). The low footprint, low-power characteristics, and ultrafast nature of silicon microdisks enhance the efficiency of LiteCON. LiteCON is a first-of-its-kind memristor-integrated silicon photonic CNN accelerator for end-to-end analog training and inference. It is intended to perform highly energy-efficient and ultrafast training for deep learning applications with state-of-the-art prediction accuracy. The main contributions of this article are summarized as follows:
    We propose LiteCON, a fully analog and scalable silicon photonics-based CNN accelerator for energy-efficient training;
    We introduce a novel compute and energy-efficient silicon microdisk-based convolution and backpropagation architecture;
    We demonstrate a pipelined data distribution approach for high throughput training with LiteCON;
We synthesize the LiteCON architecture using a photonic CAD framework (IPKISS [19]). The synthesized LiteCON is used to execute four variants of VGG-Net [16] and two variants of LeNet [18], demonstrating up to 30×, 34×, and 4.5× improvements during training, and up to 34×, 40×, and 5.5× during inference, in throughput, energy efficiency, and computational efficiency per watt, respectively, compared to state-of-the-art CNN accelerators.
    The rest of the article is organized as follows. Section 2 presents a brief overview of CNNs and prior art. Section 3 provides a gentle introduction of the components used in LiteCON. The details of the LiteCON architecture are described in Section 4. Section 5 illustrates an example design of LiteCON, followed by Section 6, which contains the experimental setup, results, and comparative analysis. Last, we present concluding remarks in Section 7.

    2 Background and Prior Art

    2.1 Convolution Neural Networks

CNNs are a class of deep learning networks commonly used for analyzing visual imagery in image classification and object detection tasks. A CNN comprises three types of layers: convolution layers (CONV), pooling layers (POOL), and fully connected layers (FC). Generally, CONV is accompanied by a non-linear activation function, such as ReLU, Tanh, or Sigmoid. A CNN operates in two stages: training and inference (testing). In the training phase, the filter weights (and biases) in the CONV and FC layers are learned using the backpropagation (BP) algorithm. The BP algorithm involves a forward and a backward pass through the deep network. Given a training sample x, in the forward pass the weighted input sum (convolution) z is computed for the neurons in each layer l with some initial filter weights w (and bias b), followed by the neural activation \( \sigma (z) \) (ReLU(z) in our work) and POOL. The final layer L computes the output label of the overall network for every forward pass. This can be summarized as follows:
    Forward Pass: For each layer l,
    \( \begin{equation} {z}^{x,l} \leftarrow {w}^l{a}^{x,l - 1} + {b}^l, \end{equation} \)
    (1)
    \( \begin{equation} {a}^{x,l} \leftarrow \sigma ({z}^{x,l}). \end{equation} \)
    (2)
    The output error in the final prediction \( {\delta }^{x,L} \) is a result of errors induced by the neurons in each hidden layer during the forward pass. To determine the error contribution of a neuron in the previous layer, i.e., \( \ {\delta }^{x,l} \) , the final error is back propagated through the network starting from the output layer. This can be summarized as follows:
    Output error: At the final layer L,
    \( \begin{equation} {\delta }^{x,L} \leftarrow {\nabla }_a{C}_x \odot \sigma '({z}^{x,L}). \end{equation} \)
    (3)
    Backward Pass: For each layer l,
    \( \begin{equation} {\delta }^{x,l} \leftarrow ({({w}^{l + 1})}^T \times {\delta }^{x,l + 1}) \odot \sigma '({z}^{x,l}). \end{equation} \)
    (4)
Here, \( {\nabla }_a{C}_x \) is the gradient of the cost function \( {C}_x \) with respect to the output activations, and \( \sigma '({z}^{x,L}) \) is the derivative of \( \sigma ({z}^{x,L}) \). These error contributions are necessary to update the filter weights w and biases b in the respective layers using a gradient descent method. In gradient descent, the forward and backward passes are repeated iteratively until the cost function is minimized and the network is trained. This can be summarized as follows:
    Gradient Descent: For each layer l and m training samples with learning rate \( \eta \) ,
    \( \begin{equation} {w}^l \leftarrow \ {w}^l - \ \frac{\eta }{m}\mathop \sum \limits_x {\delta }^{x,l} \times {({a}^{x,l - 1})}^T, \end{equation} \)
    (5)
    \( \begin{equation} {b}^l \leftarrow \ {b}^l - \ \frac{\eta }{m}\mathop \sum \limits_x {\delta }^{x,l}. \end{equation} \)
    (6)
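To make Equations (1)-(6) concrete, the following minimal NumPy sketch runs one forward pass, one backward pass, and one gradient-descent update for a toy two-layer fully connected network; the layer sizes, the quadratic cost, and all variable names are illustrative assumptions, not part of the LiteCON design.

import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_prime(z):
    return (z > 0).astype(z.dtype)

rng = np.random.default_rng(0)
w1, b1 = 0.1 * rng.standard_normal((16, 8)), np.zeros(16)      # hidden layer l = 1
w2, b2 = 0.1 * rng.standard_normal((4, 16)), np.zeros(4)       # output layer l = L
x, y = rng.standard_normal(8), np.array([1.0, 0.0, 0.0, 0.0])  # one training sample
eta = 0.01                                                     # learning rate

# Forward pass, Eqs. (1)-(2)
z1 = w1 @ x + b1
a1 = relu(z1)
z2 = w2 @ a1 + b2
a2 = relu(z2)

# Output error, Eq. (3), with a quadratic cost so that grad_a C = (a2 - y)
delta2 = (a2 - y) * relu_prime(z2)

# Backward pass, Eq. (4)
delta1 = (w2.T @ delta2) * relu_prime(z1)

# Gradient descent, Eqs. (5)-(6), for a single sample (m = 1)
w2 -= eta * np.outer(delta2, a1)
b2 -= eta * delta2
w1 -= eta * np.outer(delta1, x)
b1 -= eta * delta1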
Section 4 describes how LiteCON performs these computations in the analog photonic domain.

    2.2 Prior Art

To achieve high-speed and energy-efficient deep learning, researchers have recently demonstrated photonic accelerators based on microring weight banks [12, 21], Mach-Zehnder interferometers [10, 22], and multilayer diffractive optical elements [13]. Compared to these optical devices, the silicon photonic microdisk (MD) has a smaller chip area, faster operation, and lower power consumption [14]. Moreover, most silicon photonic accelerators cannot achieve state-of-the-art inference accuracy even for small datasets. For example, none of these photonic accelerators exceeds 97% accuracy on the small MNIST dataset, while recent CNNs easily reach >99.77% accuracy [2] on the same dataset. This is due to the large noise accumulated during fully optical inference. Our proposed MD-based design addresses these bottlenecks.

    3 Components Overview

CMOS-compatible components such as photonic waveguides, silicon MDs, photodiodes, and multi-wavelength LED arrays are used for on-chip photonic signaling [15]. An MD is a circular photonic structure used to modulate an electronic signal onto a photonic carrier at the transmission source of a waveguide. MDs are also used to couple or filter out light from the waveguide at the destination. Each MD modulates light of a specific wavelength, and its geometry (its radius, to be precise) determines its wavelength selectivity. We can also inject (or remove) charge carriers into (from) an MD, or heat it, to alter its operating wavelength.
In a typical high-bandwidth photonic link, an LED array (either on the board or on a 2.5D interposer) generates multiple wavelengths, which are coupled by an optical grating coupler into an on-chip photonic waveguide. The technique of using multiple wavelengths to transmit many data streams simultaneously is referred to as dense wavelength division multiplexing (DWDM). To enable processing of these photonic signals, the on-chip photonic waveguide propagates the input optical power to the destination, where it is captured by photodiodes and converted to electronic data. These components are the building blocks of the proposed LiteCON architecture.

4 LiteCON Architecture

Overview: Our proposed LiteCON architecture is a fully analog, scalable silicon photonics-based CNN accelerator design. Unlike previously proposed CNN accelerators [5, 6], the LiteCON accelerator enables fully analog end-to-end training and inference for CNNs. Figure 1 gives a high-level overview of the LiteCON architecture. As shown in the figure, LiteCON comprises four major parts: a feature extractor, a feature classifier, a backpropagation accelerator, and a weight-update unit. The feature extractor (FE) and the feature classifier (FC) are made up of multiple silicon microdisk-based convolution layers, operational amplifier (OPAMP)-based ReLU layers, and pooling layers. Together, the FE and FC make up the feedforward CNN accelerator. The backpropagation accelerator is built using silicon microdisks, splitters, and multiplexers, and LiteCON's weight-update unit is built from a group of memristors.
    Fig. 1. An overview of LiteCON architecture.

4.1 Feedforward Accelerator

In this article, we consider image datasets as input and image classification as the task to be performed by LiteCON. The digital input data is stored in SRAM. The feedforward accelerator in the LiteCON architecture (see Figure 1) performs feedforward feature extraction followed by feature classification of input images. It operates in four stages: (a) data reading, (b) feature extraction, (c) feature classification, and (d) data writeback. The details are as follows.

    4.1.1 Data Reading.

LiteCON is designed to convolve an input of 28 × 28 pixels at a time, i.e., in one LiteCON cycle. Therefore, it requires 64 LiteCON cycles to execute a 224 × 224 image (the typical size of an ImageNet image). Note that a LiteCON cycle is different from its clock cycle: one LiteCON cycle refers to the complete feature extraction and feature classification of a 28 × 28 image. The SRAM in LiteCON is of size 256 KB (dual data rate, 64 bits), enough to store five images of size 224 × 224. In a pipelined fashion, four blocks of 28 × 28 pixels are written into the memristor crossbar (capable of storing four 28 × 28 pixel blocks) via an n-channel DAC and memristor controller (Figure 2). The crossbar can be thought of as a high-speed cache for LiteCON.
    Fig. 2. Data Reading from SRAM to memristor crossbar.
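As a quick illustration of the data-reading granularity, the short sketch below tiles a 224 × 224 image into the 28 × 28 blocks that LiteCON convolves per cycle; the array contents and names are placeholders, not output of the LiteCON toolchain.

import numpy as np

image = np.random.rand(224, 224)   # one ImageNet-sized input image
tile = 28                          # pixels convolved in one LiteCON cycle
blocks = [image[r:r + tile, c:c + tile]
          for r in range(0, 224, tile)
          for c in range(0, 224, tile)]
print(len(blocks))                 # 64 LiteCON cycles per 224 x 224 image
# With four 28 x 28 blocks buffered in the memristor crossbar at a time,
# the crossbar is refilled 64 / 4 = 16 times per image.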

    4.1.2 Feature Extraction.

The FE in our architecture is carried out using multiple FE stages \( (F{E}_i) \) . Each FE stage comprises multiple photonic convolution layers (PConv), an analog amplifier (OPAMP)-based ReLU layer, another OPAMP-based pooling (POOL) layer, and an interface layer. LiteCON's FE adopts a completely analog computing paradigm, avoiding the inter-layer analog-to-digital (A-to-D) and digital-to-analog (D-to-A) conversions required by state-of-the-art CNN accelerators [5, 6], which use analog memristive convolution but digital CPU/GPU-based ReLU and pooling.
Photonic Convolution (PConv): PConv is the first layer of an FE stage. PConv is based on the principle of analog multiplication using silicon microdisks [14]. A silicon microdisk is used for analog amplitude modulation of a light carrier; in its simplest terms, analog amplitude modulation is the multiplication of a scalar input with an analog signal. The authors of Reference [14] have demonstrated photonic modulator-based analog multipliers. In our design, a PConv convolves 28 × 28 pixels at a time and can be scaled up depending on requirements. A PConv comprises (i) an array of LEDs capable of generating up to N wavelength carriers; (ii) a DWDM multiplexer, splitter, and waveguide arrangement to accommodate all the carriers in one channel; (iii) N × (M + 1) microdisks (N × M for microdisk multiplications and another N for weight modulation); and (iv) N photodiodes.
Convolution in deep learning operates with kernels (or filters) of various sizes, such as 1 × 1, 2 × 2, 3 × 3, and 4 × 4. The widely adopted models that we consider in this article (Table 1) use 3 × 3 filters; hence, Figure 3(a) depicts photonic convolution based on a 3 × 3 filter. To start with, each of the N wavelength channels from the LED array is integrated with a microdisk of the respective wavelength. All microdisks are divided into K groups (K × 9 = N), each having 9 microdisks. These groups of microdisks are then modulated with weight values \( (w_{11}^L,\ w_{12}^L, \ldots, w_{33}^L) \) stored in the memristor crossbar (one part of the memristor crossbar stores weights, and another part stores input data or features obtained in hidden layers). Here, \( w_{ij}^L \) is the weight (i, j) of a filter in the Lth convolution layer. All N modulated wavelengths are multiplexed into one waveguide by a DWDM multiplexer, after which the multiplexed light is split into P equal channels, each carrying all the modulated wavelengths. Each channel is equipped with 784 microdisks (this can be scaled up or down depending on the input size, here 28 × 28). Now, in each channel, pixel values stored in the memristor crossbar are modulated onto individual wavelengths by the microdisks. As shown in Figure 3(a), the first group of 9 pixels (a 3 × 3 patch) is modulated by the first group of 9 microdisks. The pixels for a channel are chosen such that no pixel is modulated onto two wavelengths in the same channel (to avoid data collision). In this way, in each wavelength carrier, using the multiplication principle of a microdisk, an input pixel \( I{n}_{xy} \) is multiplied by a weight value \( w_{ij}^L \) . Note that convolution at its core is nothing but a sum of input-weight products. Finally, the multiplexed light from each channel is captured by an array of photodiodes, each designed to capture nine consecutive wavelengths. For example, the first photodiode integrated with the first channel captures \( (w_{11}^L \times I{n}_{11} + w_{12}^L \times I{n}_{12} + \cdots + w_{33}^L \times I{n}_{33}) \), which is the first convolved matrix. Similarly, the other convolved matrices are captured.
Table 1.
        | FE1           | FE2            | FE3                          | FE4                          | FE5                          | FC
VGG-A   | 3 × 3, 64, 1  | 3 × 3, 128, 1  | 3 × 3, 256, 2                | 3 × 3, 512, 2                | 3 × 3, 512, 2                | FC-4096, 2; FC-1000, 1
VGG-B   | 3 × 3, 64, 2  | 3 × 3, 128, 2  | 3 × 3, 256, 2; 1 × 1, 256, 1 | 3 × 3, 512, 2; 1 × 1, 256, 1 | 3 × 3, 512, 2; 1 × 1, 256, 1 |
VGG-C   | 3 × 3, 64, 2  | 3 × 3, 128, 2  | 3 × 3, 256, 3                | 3 × 3, 512, 3                | 3 × 3, 512, 3                |
VGG-D   | 3 × 3, 64, 2  | 3 × 3, 128, 2  | 3 × 3, 256, 4                | 3 × 3, 512, 4                | 3 × 3, 512, 4                |
LeNet-A | 3 × 3, 6, 1   | 3 × 3, 6, 1    | 3 × 3, 16, 2                 | 3 × 3, 16, 4                 | 3 × 3, 120, 1                | FC-84, 1
LeNet-B | 3 × 3, 6, 1   | 3 × 3, 6, 1    | 3 × 3, 256, 1                | 3 × 3, 16, 6                 | 3 × 3, 120, 1                |
Table 1. CNN Benchmark Configuration for VGG and LeNet
Read (i × j, m, k) as: filter size, i × j; number of such filters, m; and number of back-to-back convolutions in a layer, k.
    Fig. 3. (a) Logical microarchitecture of MD-based photonic convolution; (b) OPAMP-based ReLU layer; and (c) four-input OPAMP-based POOL layer.
Example of a simple PConv: Let us assume there are 9 pixels in an input. The nine pixels are stored as analog inputs in the memristor crossbar as \( I{n}_{11} \) , \( I{n}_{12} \) , …, \( I{n}_{33} \) . For simplicity, suppose there are 9 weights, i.e., a 3 × 3 filter: \( {w}_{11} \) , \( {w}_{12} \) , …, \( {w}_{33} \) . The weights are modulated onto 9 wavelength channels and then passed through a single multiplexed waveguide. Now, each input pixel \( I{n}_{xy} \) is modulated by a microdisk onto the corresponding weight-carrying channel. For instance, the violet microdisk modulates \( I{n}_{11} \) onto the channel carrying weight \( {w}_{11} \) . The microdisk thus performs amplitude modulation, in other words multiplication, so that the channel now carries \( {w}_{11} \times I{n}_{11} \) . Similarly, the other channels end up carrying \( {w}_{12} \times I{n}_{12}, \ldots, {w}_{33} \times I{n}_{33} \) . At the end, all these photonic signals are captured together by a photodiode as a sum total, i.e., \( {w}_{11} \times I{n}_{11} + {w}_{12} \times I{n}_{12} + \cdots + {w}_{33} \times I{n}_{33} \) . That is how photonic convolution works.
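Numerically, the photodiode output in this example reduces to a 9-element dot product. The following sketch (purely illustrative values) emulates the weight modulation, pixel modulation, and photodiode summation described above.

import numpy as np

w = np.random.rand(3, 3)     # 3 x 3 filter (w_11 ... w_33) from the memristor crossbar
inp = np.random.rand(3, 3)   # 3 x 3 input patch (In_11 ... In_33)

# Each microdisk multiplies one weight-carrying wavelength by one pixel
# (amplitude modulation); the photodiode then sums all nine wavelengths.
photodiode_output = np.sum(w * inp)

# This equals the electronic multiply-accumulate operation the PConv layer replaces.
assert np.isclose(photodiode_output, w.flatten() @ inp.flatten())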
Electronic ReLU and Pooling: Neural activation in a CNN can be performed by a variety of non-linear functions, such as Sigmoid, Tanh, and ReLU (rectified linear unit). ReLU is widely used for its simplicity of implementation and exemplary performance; therefore, we consider a ReLU-based neural activation circuit. The following equation describes the operation of a ReLU unit:
    \( \begin{align} ReLU(z) &= z\ if\ z > 0\nonumber\\ & = 0\ if\ z\ \le 0 \end{align} \)
    (7)
We deploy an operational amplifier (OPAMP) to mimic this function, as shown in Figure 3(b), because Equation (7) can be viewed as a comparator, which is exactly what an analog OPAMP implements: it takes two inputs and generates an output based on their comparison. The photodiode output from Figure 3(a) is fed as input to the OPAMP for the ReLU operation. Note that the OPAMP circuitry can be reconfigured to mimic other neural activation functions; the details are omitted for brevity.
The next operation in FE is pooling, which reduces the feature size while keeping spatial invariance. It does so by taking the average or the maximum of multiple elements of a feature vector; we choose the maximum for its superior accuracy in a variety of applications. Pooling is also, at a fundamental level, a comparator function. Four or nine outputs (2 × 2 or 3 × 3 being the typical pooling sizes) from the ReLU units are fed as inputs to an OPAMP-based comparator, which selects the maximum value as the spatially invariant pooling output (Figure 3(c)). The outputs from all the comparators are the extracted features, which are stored back in the memristor crossbar for the next FE stage. Once the features have gone through all the FE stages, they are stored back in SRAM.
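Functionally, the OPAMP-based ReLU and POOL stages are comparators. The sketch below models their input-output behavior; the 2 × 2 pooling window and the sample values are assumptions for illustration only.

import numpy as np

def opamp_relu(z):
    # Comparator behavior of Equation (7): pass z if positive, else output 0.
    return z if z > 0 else 0.0

def opamp_max_pool(window):
    # OPAMP comparator selecting the maximum of a pooling window.
    return max(window)

feature = np.array([[-0.3, 0.8], [0.1, -0.5]])   # one 2 x 2 region of a feature map
activated = np.vectorize(opamp_relu)(feature)    # element-wise ReLU
pooled = opamp_max_pool(activated.flatten())     # single pooled output: 0.8
print(pooled)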

    4.1.3 Photonic Feature Classification.

After feature extraction is performed by the FE stages (PConv, ReLU, and pooling), the features are brought back from the SRAM via the memristor crossbar to undergo the feature classification phase. In a CNN, the feature classification segment can be seen as a special case of convolution in which each extracted feature map uses the largest possible kernel. In other words, feature classification comprises one or more fully connected (FC) layers (in an FC layer, each element of one layer is connected to all the elements of the next layer).
LiteCON employs a microdisk-based matrix-vector multiplier (M-MVM) to implement the FC layer, identical in principle to PConv (Section 4.1.2). In FC, each wavelength channel is modulated with a different weight value (unlike the groups of 9 weights in PConv) to ensure a fully connected network. When all the features from the feature extraction stage have been brought back from the SRAM and are available in the memristor crossbar, they are fed to the FC layer. As an example, consider 512 features coming from the feature extraction (FE) stages. VGG and LeNet operate on a 7 × 7 kernel in FC, so each feature is a 7 × 7 matrix. Therefore, 49 wavelength carriers from an LED array are modulated with 49 weights by microdisks. After multiplexing and splitting into 512 equal channels (similar to PConv), each channel is matrix multiplied with one feature. The output obtained at the photodiode is fed to ReLU followed by pooling (if required). The results are then fed to the next FC layer (if present in the model). After the features go through all the FC layers, we obtain the classified output. During training, the classified outputs from the final FC layer and the target outputs are fed to an analog subtraction unit, whose result (the error vector) is fed to the backpropagation architecture, discussed next.
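Viewed numerically, the FC layer computed by the M-MVM is a matrix-vector product over the flattened features followed by ReLU. The sketch below illustrates this with 512 features of size 7 × 7; the reduced output width (64 instead of 4096) and the random weights are assumptions to keep the example small.

import numpy as np

features = np.random.rand(512, 7, 7)         # outputs of the last FE stage
x = features.reshape(-1)                     # 512 * 49 = 25,088 inputs to the FC layer
n_out = 64                                   # stand-in for the FC-4096 layer width
w_fc = 0.01 * np.random.rand(n_out, x.size)  # assumed FC weight matrix

# Each output neuron is the photodiode sum of all weight-modulated wavelengths,
# i.e., one row of a matrix-vector product, followed by the OPAMP ReLU.
fc_out = np.maximum(w_fc @ x, 0.0)
print(fc_out.shape)                          # (64,)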

    4.2 Backpropagation Accelerator

LiteCON's backpropagation (BP) accelerator employs silicon microdisks, photodiodes, multiplexers, and splitters to perform completely analog matrix multiplication and other arithmetic operations, similar to PConv. In contrast, previously proposed CNN accelerators [5, 6] adopt a hybrid approach, using analog memristors for matrix multiplications and a digital CPU/GPU for other arithmetic operations, which requires performance-hindering A-to-D and D-to-A conversions.
Figure 4(a) illustrates the microarchitecture of the proposed BP accelerator design, which is based on photonic matrix-vector multiplication using silicon microdisks (MDs). We use MDs for their small footprint, high accuracy and quality factor, and low-power nature. We now explain the operation of the proposed BP architecture. As given in Equation (3), the error at the final layer (l = L) of BP is \( {\delta }^{x,L} \leftarrow {\nabla }_a{C}_x \odot \sigma '({z}^{x,L}) \) . Here, \( {\nabla }_a{C}_x \) is the rate of change of the cost with respect to the output activation (i.e., the difference between the actual classified output of the feedforward accelerator and the target output stored in the memristor crossbar), and \( \sigma '({z}^{x,L}) \) is the derivative of the ReLU layer in the final FC stage of the CNN architecture. Outputs from the final FC stage of the CNN architecture are fed to an analog subtraction and multiplication unit (microdisk multiplier) to determine \( {\delta }^{x,L} \) . Applying Equation (4) with the computed \( {\delta }^{x,L} \) , we calculate the error for the (L − 1)th layer as follows:
    \( \begin{equation} {\delta }^{x,L - 1} \leftarrow ({({w}^L)}^T \times {\delta }^{x, L}) \odot \sigma '({z}^{x,L - 1}), \end{equation} \)
    (8)
where \( {w}^L \) is the weight matrix (stored in the memristor crossbar) obtained from the Lth layer of the feedforward CNN architecture. Figure 4(a) shows the backpropagation between the final layer l = L and its penultimate layer l = L − 1. As illustrated, there are N wavelength carriers coming from an LED array. The value of N for a layer equals the output feature size of the corresponding layer in the feedforward accelerator; e.g., N equals 49 (7 × 7) for the last layer. Each wavelength in layer L is modulated with the error \( {\delta }^{x,L} \) by an MD tuned to that wavelength. In Figure 4(a), the violet MD is tuned to modulate \( {\lambda }_1 \) . Let us suppose the jth MD's output is \( M{D}_j = \delta _j^{x,L}*A\sin ( {\frac{{2\pi }}{{{\lambda }_j}}t + \emptyset } ) \) (where \( A\sin ( {\frac{{2\pi }}{{{\lambda }_j}}t + \emptyset } ) \) represents the photonic carrier with wavelength \( {\lambda }_j \) and phase \( \emptyset \) ). Each \( M{D}_j \) output is split into two equal parts. The first part is sent to the weight-update circuitry (explained at the end of this section) to update the corresponding weights in the feedforward accelerator. The other part is fed to a DWDM multiplexer, which combines multiple light wavelengths into a single multi-wavelength carrier. After multiplexing, the multiplexed photonic data is split into M parts by an optical splitter, where M equals the number of neurons in layer L − 1. Each part is fed to a multi-wavelength waveguide. As a result, each waveguide carries N wavelengths, each carrying data \( \delta _{j,n}^{x,L}*B\sin ( {\frac{{2\pi }}{{{\lambda }_j}}t + \emptyset } ) \) , where \( 1 \le n \le N,\ B = \frac{A}{{2N}} \) . Each weight \( w_{ij}^L \) of the transpose of \( {w}^L \) , obtained from the memristor crossbar, is modulated by an MD onto a light carrier. This results in
    \( \begin{equation} {D}_{i,n} = w_{ij}^L*\delta _{j,n}^{x,L}*A\sin \left( {\frac{{2\pi }}{{{\lambda }_j}}t + \emptyset } \right)\!. \end{equation} \)
    (9)
    Fig. 4. Schematic view of LiteCON’s backpropagation accelerator.
    Now, each \( {D}_{i,n} \) is modulated with \( a_n^L \) , which is a derivative of the ReLU functions of layer L − 1 (equal to \( \sigma '( {{z}^{x,L - 1}} ) \) in Equation (8)). Then, \( {D}_{i,n} \) becomes
    \( \begin{equation} {D}_{i,n} = w_{ij}^L*\delta _{j,n}^{x,L}*a_n^L*A\sin \left( {\frac{{2\pi }}{{{\lambda }_j}}t + \emptyset } \right)\!. \end{equation} \)
    (10)
Next, a photodiode is used to demodulate the photonic data from each waveguide. The photodiode captures the combined output \( {D}_{i,n} \) over all wavelengths in a waveguide, which is exactly the matrix-vector-multiplied error vector of Equation (8). The output of each photodiode is passed through a signal conditioning circuit to remove unwanted noise (details of the conditioning circuit are omitted for brevity). The output of the signal conditioning circuit is
    \( \begin{equation} {\delta }^{x,L - 1} = ({({w}^L)}^T \times {\delta }^{x,L}) \odot {a}^L, \end{equation} \)
    (11)
where \( {\delta }^{x,L - 1} \) is the error to be propagated from layer (L − 1) to layer (L − 2). The above procedure continues until the first layer of LiteCON is reached. During backpropagation, the error value in each layer is also sent to the corresponding weight-update circuit, which is discussed in more detail below.
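Numerically, the chain of modulations in Equations (9)-(11) implements the standard error-propagation step. The sketch below reproduces it for one layer pair; the layer sizes and random values are illustrative assumptions.

import numpy as np

N, M = 49, 128                               # neurons in layers L and L - 1 (assumed)
delta_L = np.random.rand(N)                  # error at layer L, Eq. (3)
w_L = np.random.rand(N, M)                   # weights between layers L - 1 and L
a_deriv = (np.random.rand(M) > 0.5) * 1.0    # ReLU derivative of layer L - 1

# Eq. (11): each photodiode sums (w^L)^T * delta over the N wavelengths in its
# waveguide; the result is then gated by the ReLU derivative.
delta_Lm1 = (w_L.T @ delta_L) * a_deriv
print(delta_Lm1.shape)                       # (128,)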
Weight-update circuitry: For the weight update, each element of a weight kernel in any layer l of the CNN architecture can be written as \( w_{k,j}^l \) (note that l = L for the final layer). Each \( w_{k,j}^l \) is stored in a memristor cell of the layer-l memristor crossbar as \( C_{k,j}^l \) , the conductance of that cell. Following Equation (5), the weight-update equation for \( w_{k,j}^l \) (or \( C_{k,j}^l \) ) can be written as follows:
    \( \begin{equation} C_{new\left( {k,j} \right)}^l \leftarrow \ C_{old\left( {k,j} \right)}^l - \ \frac{\eta }{m} \times \delta _k^l \times O_j^{l - 1}, \end{equation} \)
    (12)
where \( O_j^{l - 1} \) is the jth POOL output of layer (l − 1) of the CNN architecture. Figure 5 illustrates the weight-update circuitry for any layer l. As shown, \( \delta _k^l \) is obtained from the BP architecture as a photonic signal. \( O_j^{l - 1} \) , collected from the memristor crossbar (for data storage), is used to modulate the light carrier carrying the error value \( \delta _k^l \) . The modulated output is demodulated by a photodiode and then sent to a signal conditioning circuit, where the analog signal is first filtered (to remove noise) and then passed through a subtractor to obtain the new \( C_{k,j}^l \) as per Equation (12). The previous conductance (weight) value \( C_{old( {k,j} )}^l \) is fed to the subtractor circuit from the lth-layer BP architecture. The new conductance value \( C_{k,j}^l \) is then fed to the corresponding memristor control circuit to update its weight value. The conditioning circuit and the memristor control circuit are adapted from Reference [7].
    Fig. 5. Weight-update circuitry for any layer l.
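Equation (12) amounts to a scaled outer-product update of the stored conductances. A minimal sketch of that update follows; the crossbar size, learning rate, and batch size are placeholder values.

import numpy as np

eta, m = 0.01, 4                  # learning rate and number of training samples
C = np.random.rand(49, 128)       # memristor conductances storing the weights w^l
delta = np.random.rand(49)        # layer-l errors from the BP accelerator
O_prev = np.random.rand(128)      # POOL outputs of layer l - 1 from the crossbar

# Eq. (12): new conductance = old conductance - (eta / m) * delta_k * O_j
C -= (eta / m) * np.outer(delta, O_prev)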

    5 LiteCON Case Study

In this section, we present the operation of the proposed pipelined LiteCON architecture for the CNN benchmark VGG [16] on the ImageNet dataset [17]. In our experiments, we consider all variants of the VGG [16] and LeNet [18] benchmarks shown in Table 1. For this case study, we integrate the PConv, ReLU, POOL, and FC layers based on the VGG-A model, as shown in Figure 6(a). Note that for one convolution in \( {\rm{F}}{{\rm{E}}}_1 \) of VGG-A (Table 1), there is an equivalent PConv in \( {\rm{F}}{{\rm{E}}}_1 \) of Figure 6(a); similarly, for two back-to-back convolutions in \( {\rm{F}}{{\rm{E}}}_3 \) , there are two back-to-back PConv layers in \( {\rm{F}}{{\rm{E}}}_3 \) of Figure 6(a). The backpropagation accelerator is connected to the feedforward accelerator as follows: BP-1 with \( {\rm{F}}{{\rm{E}}}_1 \) , BP-2 with \( {\rm{F}}{{\rm{E}}}_2 \) , BP-3 with \( {\rm{F}}{{\rm{E}}}_3 \) , and so on. The rest of the section discusses how LiteCON mimics VGG-A.
    Fig. 6. (a) VGG-A implemented on LiteCON. (b) Pipelined dataflow in feedforward operation in LiteCON.
    VGG for the ImageNet dataset operates on a 224 × 224 image input. As mentioned earlier, LiteCON is designed to convolve 28 × 28 pixels at a time, i.e., one LiteCON cycle. Therefore, it requires 64 LiteCON cycles to execute a 224 × 224 image. The SRAM register array in LiteCON is of size 256 KB to store five images of size 224 × 224. PConv performs feature extraction on a 28 × 28 input data at a time in a pipelined manner.
Figure 6(b) demonstrates the pipelined dataflow of the feedforward operation in LiteCON. We consider a 2.5 GHz clock; therefore, the clock cycle period is \( {{\rm{T}}}_{{\rm{sm}}} = 400 \) ps. As shown in Figure 6(b), at \( t = {{\rm{T}}}_{{\rm{sm}}} \) , the first set of 28 × 28 pixels from SRAM (i.e., A) is convolved (64 filters/features) and stored in memristor crossbars (for data storage); the yellow interface module in Figure 6(a) represents data transfer into the memristors in the peripheral circuit. To illustrate the pipelined approach, we also follow the convolution of another three sets of 28 × 28 pixels, namely, B, C, and D. Note that PConv convolves a 28 × 28 input in one clock cycle (Section 4.1.2). As \( {\rm{F}}{{\rm{E}}}_1 \) of VGG-A consists of one convolution layer (see Table 1), the convolved photonic outputs of PConv-1 of \( {\rm{F}}{{\rm{E}}}_1 \) are sent to the ReLU layer through the photodiode, followed by the POOL layer. The time required for convolved data of one FE stage to arrive at the next FE stage is \( {T}_{FE} \) = photodiode conversion time + ReLU time + POOL time + interface time = 20 ps + 10 ps + 10 ps + 10 ps = 50 ps. From \( t = {T}_{sm} \) to \( t = 2{T}_{sm} \) , the PConv(A) outputs from the peripheral circuit of \( F{E}_1 \) are photodiode-converted, ReLU'ed, and POOL'ed, and then fed to \( {\rm{F}}{{\rm{E}}}_2 \) .
There can be 8 such data movements since \( \frac{{{T}_{sm}}}{{{T}_{FE}}} = 8 \) . In one data movement, four 28 × 28 features can be processed. Therefore, at \( t = 2{T}_{sm} \) , 32 PConv(A) features arrive at \( {\rm{F}}{{\rm{E}}}_2 \) . Similarly, from \( t = 2{T}_{sm} \) to \( t = 3{T}_{sm} \) , 32 PConv(B) features; from \( t = 3{T}_{sm} \) to \( t = 4{T}_{sm} \) , 32 PConv(C) features; and from \( t = 4{T}_{sm} \) to \( t = 5{T}_{sm} \) , 32 PConv(D) features are convolved and stored in the peripheral circuit of \( F{E}_2 \) . After this, from \( t = 5{T}_{sm} \) to \( t = 6{T}_{sm} \) , the remaining 32 PConv(A) features in \( F{E}_1 \) are convolved in \( F{E}_2 \) . In this way, by \( t = 6{T}_{sm} \) , all 64 PConv(A) features in \( F{E}_1 \) have been convolved with the 128 \( F{E}_2 \) filters to produce 128 features, which are stored in the memristors of its peripheral circuit. Similarly, the remaining 32 B, C, and D features are convolved and stored (Figure 6(b)) by \( t = 7{T}_{sm} \) , \( t = 8{T}_{sm} \) , and \( t = 9{T}_{sm} \) , respectively. Per the VGG-A configuration (Table 1), \( F{E}_1 \) has 64 features, \( F{E}_2 \) has 128 features, \( F{E}_3 \) has 256 features, and so on. It is important to note that the 64 PConv(A) features from \( F{E}_1 \) are convolved with 128 kernels/filters to produce 128 PConv(A) features for \( F{E}_2 \) ; similarly, the 128 PConv(A) features from \( {\rm{F}}{{\rm{E}}}_2 \) are convolved with 256 kernels to produce 256 PConv(A) features for \( F{E}_3 \) .
A, B, C, and D are convolved separately until \( t = 10{T}_{sm} \) , when all of them arrive at \( F{E}_3 \) as 256 7 × 7 features each. Now, all of these features are merged together to form 256 28 × 28 features. Therefore, it requires another \( 8{T}_{sm} \) (i.e., from \( t = 10{T}_{sm} \) to \( t = 18{T}_{sm} \) ) to send the 256 28 × 28 features from \( F{E}_3 \) and convolve them into 512 14 × 14 features at \( F{E}_4 \) . Similarly, convolution, ReLU, and POOL are performed in \( F{E}_4 \) and \( F{E}_5 \) . As illustrated in Figure 6(b), at \( t = 24{T}_{sm} \) , 512 features are obtained from \( F{E}_5 \) for 56 × 56 pixels. As shown in Figure 6(a), features from \( F{E}_5 \) are stored in SRAM until all 224 × 224 pixels have been processed. For 224 × 224 pixels, this takes \( 16 \times 24{T}_{sm} = 384{T}_{sm} = 153.6 \) ns (in \( 24{T}_{sm} \) , four 28 × 28 pixel blocks are convolved; therefore, \( 16 \times 24{T}_{sm} \) is needed for 224 × 224 pixels). After this, all the features are retrieved from SRAM and fed to the FC layers for feature classification. The first FC operation requires \( ({T}_{sm} + {T}_{FE}) \) time, as FC is identical to FE. The second FC operation requires \( {T}_{FE} \) time, as no further SRAM or memristor read is needed. This means that LiteCON requires 153.6 ns (for FE) \( + \ {T}_{sm} + 2{T}_{FE} \approx 154 \) ns for one forward pass. After a forward pass, the FC output is sent to the BP architecture for backpropagation. Each layer in BP requires \( {T}_b \) units of time, where \( {T}_b \) = (error modulation onto light carrier) + (split time) + (WDM multiplexing time) + (split time) + (weight modulation time) + (ReLU function derivative modulation time) + (photodiode time) = 10 ps + 10 ps + 10 ps + 10 ps + 10 ps + 10 ps + 20 ps = 80 ps. It takes \( 6{T}_b \) to complete one backward pass. In summary, LiteCON requires about 154 ns for one forward pass and 480 ps ( \( 6{T}_b \) ) for a backward pass. The ultra-fast nature of photonic interconnects allows for high-speed backpropagation in LiteCON.
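The latency figures quoted above follow directly from the per-stage delays stated in this section; the short script below reproduces the arithmetic for the VGG-A case study (it is a timing model only, not a simulation of the hardware).

T_sm = 400e-12                        # LiteCON clock period at 2.5 GHz (s)
T_fe = (20 + 10 + 10 + 10) * 1e-12    # photodiode + ReLU + POOL + interface (s)
T_b = (6 * 10 + 20) * 1e-12           # per-layer backpropagation delay (s)

fe_time = 16 * 24 * T_sm              # 24 T_sm per four 28 x 28 tiles, 16 tile groups
forward = fe_time + T_sm + 2 * T_fe   # add first FC (T_sm + T_fe) and second FC (T_fe)
backward = 6 * T_b                    # six BP layers for VGG-A

print(f"forward pass  ~ {forward * 1e9:.1f} ns")    # ~154 ns
print(f"backward pass ~ {backward * 1e12:.0f} ps")  # 480 ps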

    6 Experimental Analyses

6.1 Design Methodology

We use IPKISS [19], a commercial photonic CAD toolchain, to design and synthesize all of the photonic components in LiteCON, and the synthesized components are integrated together to build LiteCON. For all of the photonic components, we consider a 32 nm IPKISS library. We also developed a C++-based architectural simulator, which takes device- and link-level parameters from IPKISS, to estimate the performance of the LiteCON accelerator on several benchmarks.

    6.1.1 Power, Area, and Performance Models.

We use Caphe [19] to model the power and area of all photonic elements such as microdisks, DWDM multiplexers, waveguides, and LEDs. The energy, timing, and area parameters for memristor crossbars are obtained from Reference [6]. For the DAC, we deploy an integrate-and-fire mechanism identical to PipeLayer [6] in our design; the power, latency, and area models are adapted accordingly from PipeLayer. The power, timing, and area parameters of the ADC used in the FC layer of LiteCON are obtained from Reference [5]. All these parameters are listed in Table 2.
Table 2.
Component | Parameter | Value | Power (mW) | Area (mm2)
SRAM register | Size | 2 KB | 10 | 0.2
SRAM register | Count | 128 | |
DAC | Resolution | 8-bit | 4.374 | 0.000208
DAC | Frequency | 1.2 Gbps | |
DAC | Channels | 64 | |
DAC | Count | 208 | |
ADC | Resolution | 8-bit | 490 | 0.294
ADC | Frequency | 1.2 Gbps | |
ADC | Count | 245 | |
Memristor crossbar (for weights and data) | Size | 64 KB | 30 | 0.5
Microdisk | Time | 20 ps | 1080.8 | 39.38
Microdisk | Count | 62720 | |
Photodiode | Time | 20 ps | 1080.8 | 39.38
Photodiode | Count | 62720 | |
Trans-impedance amplifier (TIA) | Time | 10 ps | 0.18 pJ/bit | 0.28
Trans-impedance amplifier (TIA) | Count | 62720 | |
WDM coupler | Count | 16 | 0 | 0.00028
WDM decoupler | Count | 16 | 0 | 0.00028
OPAMP | Time | 20 ps | 0.05 | 0.0045
OPAMP | Count | 980 | |
LED | Wavelengths | 16 | 32000 | 0.384
LED | Count | 6 | |
Waveguide | DWDM | 16 | 0 | 80
Waveguide | Width | 450 nm | |
Waveguide | Count | 520 | |
Table 2. Parametric Details
We use TensorFlow [20], a widely used deep learning framework, to train on the datasets in conjunction with the photonic component results from IPKISS. We manually map each of our benchmarks onto the waveguides, ReLU, max-pool, and FC units of LiteCON; this ensures zero pipeline hazards between any two layers in LiteCON. We compare the performance of LiteCON with a state-of-the-art CNN accelerator, PipeLayer [6], and a recent GPU (results obtained from Reference [6]).
For comparison, we evaluate the following metrics: Throughput is the total number of operations per unit time (GOPS/s); computational efficiency per watt represents throughput per unit area per watt (GOPS/s/W/mm2); energy efficiency is the number of fixed-point operations performed per second per watt (GOPS/s/W); and prediction error rate is the percentage of inference errors on a given dataset. Note that all the results in our analysis are based on an 8-bit weight resolution, as the ADC/DAC are of 8-bit resolution.
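For clarity, the sketch below shows how the three efficiency metrics are derived from raw measurements; the operation count, runtime, power, and area values are placeholders, not measured results.

ops = 1.0e12       # fixed-point operations executed (placeholder)
runtime = 0.01     # seconds (placeholder)
power = 50.0       # watts (placeholder)
area = 100.0       # mm^2 (placeholder)

throughput = ops / runtime / 1e9            # GOPS/s
energy_efficiency = throughput / power      # GOPS/s/W
cepw = throughput / power / area            # GOPS/s/W/mm^2
print(throughput, energy_efficiency, cepw)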

    6.1.2 Benchmarks and Datasets.

We execute two widely used CNN benchmarks, VGG-Net [16] and LeNet [18], on LiteCON. We consider four variants of the VGG benchmark (VGG-A, VGG-B, VGG-C, and VGG-D) and two variants of LeNet (LeNet-A and LeNet-B), as depicted in Table 1. For a fair comparison, the configuration of all stages of the VGG-Net and LeNet benchmarks is identical to Reference [6]. For VGG, we use the ImageNet dataset [17] with 224 × 224 images; we consider a subset of ImageNet, i.e., 1M images with 1,000 labels. For LeNet, we use 60,000 28 × 28 images of the MNIST dataset [18] for training and 10,000 28 × 28 images for testing, with 10 labels.

    6.2 Performance Analysis

    Figures 7(a) and 7(b) present throughput of the proposed LiteCON and PipeLayer [6] compared to the baseline GPU implementation results, also from Reference [6], during training and inference, respectively. The GPU-based accelerator performs with an average training throughput of 306 GOPS/s and an average inference throughput of 347 GOPS/s. PipeLayer shows an average training throughput of 2,923 GOPS/s and an average inference throughput of 3,102 GOPS/s. The proposed LiteCON performs with an average training throughput of 90,853 GOPS/s and an average inference throughput of 98,958 GOPS/s. The superior performance of LiteCON is due to the intelligent integration of ultra-fast memristors and high-speed photonic components such as MDs, photodiodes, and DWDM waveguides.
    Fig. 7. (a) Throughput comparison across accelerators during training; (b) throughput comparison across accelerators during inference; (c) speedup of LiteCON compared to GPU w.r.t. weight resolution.
    The overall throughput of PipeLayer is affected by inter-layer data conversion with relatively slow ADCs. Also, PipeLayer spends most of its time in sequential weight updates during training. However, LiteCON has an inherent advantage due to its photonic parallel weight update mechanism. On average, LiteCON outperforms PipeLayer and GPU by 32× and 292× in terms of speedup, respectively. Finally, for the results presented in Figures 7(a) and 7(b), the variance of throughput across benchmarks is 1,650 with a standard deviation of 40.02, which is negligible considering the extreme scale throughput of LiteCON.
    Figure 7(c) illustrates the effects of weight resolution on overall speedup of LiteCON compared to GPU. In general, weight resolution has negligible effect on the speedup of LiteCON. This is due to the fact that the data conversion (A-D or D-A) is done either at the beginning or at the end of the forward pass in LiteCON. Further, we see a slightly decreasing trend of speedup from VGG-A to VGG-D in Figure 7(c). This is due to the increase in total number of convolution layers from VGG-A to VGG-D.
Figure 8 illustrates the computational efficiency per watt (CEPW) comparison of the proposed LiteCON, the memristor crossbar-based PipeLayer [6], and the baseline GPU. The CEPW trend is similar for both training and inference, so we show only one plot. PipeLayer uses memristor crossbars, which have a CEPW of 120 GOPS/s/W/mm2, for the bulk of its arithmetic operations. However, the overall CEPW of PipeLayer comes down to 106 GOPS/s/W/mm2 due to its extensive use of data conversions. Also, ReLU and POOL are performed by a digital ALU in PipeLayer, which requires more memory to store intra-layer data for synchronization with its pipeline mechanism. The superiority of LiteCON comes from the fact that it is a completely analog accelerator: it does not involve inter-layer data conversions or storage for synchronization, and AD and DA conversions are done only at the beginning or end of feature extraction. In addition to compute-efficient memristors, LiteCON also uses high-speed OPAMPs for ReLU. As shown in Figure 8, LiteCON has 5× and 60× higher computational efficiency than PipeLayer and GPU, respectively. The proposed LiteCON architecture shows a CEPW variance of 80.22 (standard deviation of 8.95 GOPS/s/W/mm2), which is reasonable considering its high computational efficiency.
    Fig. 8. Computational efficiency per watt comparison of LiteCON, PipeLayer [6], and GPU [6].

    6.3 Energy Savings

We compare the energy efficiency of LiteCON with PipeLayer and GPU, as depicted in Figures 9(a) and 9(b). For the VGG-Net benchmarks, the average energy efficiency of PipeLayer is 31.3 and 33.2 GOPS/s/W during training and inference, respectively, which is 1.5× and 1.7× higher than the GPU-based accelerator. For the LeNet benchmarks, PipeLayer shows 21× and 22.7× higher energy efficiency than the GPU. Unlike PipeLayer, LiteCON performs uniformly across both the VGG-Net and LeNet benchmarks, with energy efficiencies of 1,027.5 and 1,096.5 GOPS/s/W during training and inference, respectively. PipeLayer replicates its early feature extraction layers many times (close to 50 K times) to maintain a balanced pipeline, which involves excessive use of power-hungry data conversions. LiteCON, in contrast, uses passive optical components such as waveguides and microdisks, in addition to energy-efficient components such as photodiodes and memristors, and it uses very few ADCs/DACs compared to PipeLayer. As shown in Figure 9(a), for demanding benchmarks such as VGG-Net, we obtain 37× and 45× improvements in energy efficiency for LiteCON compared to PipeLayer and GPU, respectively. Overall, LiteCON outperforms PipeLayer and GPU across all benchmarks by 5× and 43×, respectively.
    Fig. 9. (a) Energy efficiency comparison across accelerators during training for VGG-Net and LeNet; (b) average energy efficiency comparison across accelerators during inference.

    6.4 Comparisons with Latest Photonic Accelerators

    Most of the photonic accelerators today deal only with inference. Therefore, we choose to compare the inference speedup with two promising photonic CNN accelerators [30, 31]. The comparison is illustrated in Table 3.
Table 3.
Attribute | [30] | [31] | LiteCON
Type of accelerator | Inference only | Inference only | Complete accelerator (both inference and training)
Fully analog, hybrid, or digital | Fully analog | Hybrid (analog convolution; A/D conversion to DRAM; D/A conversion for analog convolution in the next layer) | Fully analog
Components | Convolution: Star coupler; Pooling: Star coupler; Activation: Opto-electric component | Convolution: Memristor banks | Convolution: Memristor; Pooling: Optical comparator; Activation: Optical amplifier
Compatibility | Only CNN | Only CNN | CNN (can be extended to DNN by making changes at design time)
Modulation involved | Both phase and amplitude | Only amplitude | Only amplitude
Type of activation | ReLU | ReLU | ReLU
Datasets | MNIST | MNIST | MNIST and ImageNet
Speedup w.r.t. state-of-the-art GPU | Up to 65× | Up to 165× (for the small-scale MNIST dataset); up to 78× (for the large-scale ImageNet dataset) | Up to 350× (for all datasets)
Table 3. Comparisons with the Latest Photonic CNN Accelerators [30, 31]

    6.5 Prediction Accuracy

We performed a sensitivity analysis to investigate the impact of weight resolution on average prediction accuracy. Our design shows a prediction accuracy of 98% (slightly lower than the state-of-the-art GPU accuracy of 99.3% and the PipeLayer accuracy of 98.8%) for an 8-bit weight resolution. We choose this weight resolution because we use 8-bit DACs/ADCs in our design. The prediction accuracy can be enhanced further by adopting an AD/DA mechanism with higher resolution; we choose not to at present to remain on the conservative side from a CAD design standpoint. We also consider other sensitivity analyses, such as the effects of noise, propagation losses, photonic intrinsic losses, quantization error (in the ADC/DAC), and the quality factor of photonic components on prediction accuracy. The dominant factor among them is propagation loss, which accrues as the light signal traverses from its source to its destination. Figure 10 shows the impact of propagation loss on accuracy.
    Fig. 10. Impact of propagation loss on prediction accuracy.
For a 16-bit AD/DA resolution, LiteCON achieves 99.2% prediction accuracy for VGG and LeNet at the cost of a 9% reduction in energy efficiency. Even so, the energy efficiency with 16-bit resolution is still higher than that of the state-of-the-art PipeLayer and GPU (3.5× and 33×, respectively, on average across benchmarks).
For a fair comparison, we also brought PipeLayer's accuracy down to 98% by considering a 4-bit AD/DA resolution. This enhances its energy efficiency by 10%, i.e., from 273 GOPS/s/W (average) to 300 GOPS/s/W. The energy efficiency of the GPU is not affected by changing the resolution, as it is a completely digital system. The resulting 300 GOPS/s/W is still well below LiteCON's average energy efficiency of 1,132.85 GOPS/s/W.
Another factor that affects the prediction accuracy of LiteCON is the finesse of the microdisks (MDs) used in the system. Finesse determines the quality and operational accuracy of a microdisk and depends on the MD's intrinsic losses. Figure 11 shows the impact of intrinsic losses (in dB/cm) on the finesse of MDs of various sizes. The intrinsic loss of an MD depends on the materials used; we assume an intrinsic loss of 2.5 dB/cm in our design.
    Fig. 11. Impact of intrinsic losses on the Finesse of a microdisk.
Effects of component noise/error on accuracy: The error/noise encountered by individual components plays a role in determining the overall prediction error rate (PER). (1) Each memristor can have 1,000 quantized states; the quantization error due to this limited number of memristor states contributes up to 1.2% of PER. (2) The signal-to-noise ratio (SNR) of the microdisks used in LiteCON is 10 dB, adapted from Reference [28]; the MDs' contribution to the overall PER is 2.35%. (3) Each OPAMP in LiteCON has an SNR of 30 dB [29], which accounts for a PER of 0.85%. (4) The memristor-photonic interface is noisy: signals going from the memristors to the modulators encounter noise with an SNR of 25 dB, which leads to a PER of 1.45%. We obtained these numbers through detailed optoelectronic synthesis using the IPKISS tool.
    Please note that silicon photonic technology keeps evolving at a fast rate. With future improvement in microdisk finesse, intrinsic losses, and propagation loss, we can see accuracy close to state-of-the-art GPU accelerators (99.7% and beyond).
Further improvement in accuracy by incremental training: Incremental training is a proven approach to further enhance accuracy and reduce training time. We performed incremental learning with LiteCON based on Reference [27]; with this approach, LiteCON's accuracy increases from 98% to 98.7%. One key requirement of incremental training is storing the previously learned model parameters in memory for use in the next learning phase; to support this, we added 64 KB of SRAM.

    6.6 Discussion 1: LiteCON with Complex Models

Nowadays, more complex deep learning models are emerging, such as GoogleNet [24], Transformer [25], and BERT [26]. In terms of architecture and characteristics, they are extremely complex, with 150+ hidden layers and millions of parameters. However, at the core of their functionality, all of them comprise softmax, activation functions, fully connected layers, and masking units. LiteCON contains the fundamental photonic components (Section 4) to emulate these functionalities. For VGG and LeNet, we consider a ReLU activation; however, the activation circuit in LiteCON can be configured at design time to perform other neural activations as required by today's more complex deep learning models. One challenge LiteCON would face while executing these large models is performing multiple cycles of training without long wait times. That can be avoided by considering a multi-core LiteCON architecture connected by an optical on-chip network.

    6.7 Discussion 2: Effects of Memristor Aging on LiteCON

Memristors play a major role in LiteCON, i.e., transferring analog data to the photonic realm in a pipelined fashion, which underpins LiteCON's speedup. However, like any electrical device, a memristor has many non-linear characteristics and is prone to degradation with aging. In LiteCON, we account for how aging, an important non-ideality, affects the performance of a memristor device. Being a non-reversible and inevitable process, aging challenges the reliability of a memristor crossbar. We modeled an aging function to capture the effect of aging in a memristive device and introduce a novel system-level aging model for memristor crossbars; such a model can be integrated into any memristor CAD tool to investigate performance accurately. In addition, we deploy an aging-aware memristor training scheme called skewed weight training, which incorporates the age of each memristor cell to adjust its conductance and current values dynamically, thereby maintaining accuracy and energy efficiency. To the best of our knowledge, this is the first scheme of its kind. Experiments with a standard CAD tool demonstrate a 25% increase in the lifetime of a memristor crossbar when this scheme is incorporated. The details of this work are omitted for brevity.

    7 Conclusions

This article demonstrates a fully analog CNN accelerator called LiteCON that optimally integrates low-area, ultra-fast, and energy-efficient photonic components such as microdisks, waveguides, photodiodes, and splitters. LiteCON comprises a completely analog photonic backpropagation architecture. Further, the proposed architecture deploys (i) a scalable photonic convolution design based on microdisks in each CNN layer to emulate a range of sample CNN models and (ii) a pipelined dataflow approach for high throughput. Compared to PipeLayer [6] and a GPU, the LiteCON architecture shows higher computational and energy efficiency due to its use of energy-efficient microdisks, high-speed memristor crossbars, and a fully analog feature extraction method. We demonstrated that the proposed design has the potential to achieve up to 30×, 34×, and 4.5× improvements during training, and up to 34×, 40×, and 5.5× during inference, in throughput, energy efficiency, and computational efficiency per watt, respectively, compared to the state of the art, with little reduction in accuracy. Our future work will address how LiteCON can be extended for broader applicability to other types of deep learning models, e.g., deep neural networks (DNNs).

    References

    [1]
    W. Li, K. Liu, L. Yan, et al. 2019. FRD-CNN: Object detection based on small-scale convolutional neural networks and feature reuse. Sci. Rep. 9 (2019), 16294.
    [2]
Alex Krizhevsky, I. Sutskever, and G. E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS’12).
    [3]
    Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, et al. 2014. DaDianNao: A machine-learning supercomputer. In Proceedings of MICRO-47.
    [4]
    C. Zhang, Z. Fang, P. Zhou, P. Pan, and J. Cong. 2016. Caffeine: Towards uniformed representation and acceleration for deep convolutional neural networks. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD’16).
    [5]
    A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar. 2016. ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars. In Proceedings of the IEEE/ACM International Symposium on Computer Architecture (ISCA’16).
    [6]
    L. Song, X. Qian, H. Li, and Y. Chen. 2017. PipeLayer: A pipelined ReRAM-based accelerator for deep learning. In Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA’17).
    [7]
    T. Gokmen and Y. Vlasov. 2016. Acceleration of deep neural network training with resistive cross-point devices: Design considerations. Front. Neurosci. 10 (July 2016), 333.
    [8]
    Miao Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C. Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams. 2016. Dot-product engine for neuromorphic computing: Programming 1T1M crossbar to accelerate matrix-vector multiplication. In Proceedings of the IEEE/ACM Design Automation Conference (DAC’16).
    [9]
    K. Vandoorne, J. Dambre, D. Verstraeten, B. Schrauwen, and P. Bienstman. 2011. Parallel reservoir computing using optical amplifiers. IEEE Trans. Neural Netw. 22, 9 (Sept. 2011), 1469–1481.
    [10]
    Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and M. Soljačić. 2017. Deep learning with coherent nanophotonic circuits. Nature Photon. 11 (Jun. 2017), 441–446.
    [11]
    D. Dang, J. Dass, and R. Mahapatra. 2017. ConvLight: A convolutional accelerator with memristor integrated photonic computing. In Proceedings of the IEEE International Conference on High Performance Computing (HiPC’17).
    [12]
    A. N. Tait et al. 2017. Neuromorphic photonic networks using silicon photonic weight banks. Sci. Rep. 7 (2017), 7430.
    [13]
    X. Lin et al. 2018. All-optical machine learning using diffractive deep neural networks. Science 361 (2018), 1004–1008.
    [14]
    Z. Ying et al. 2018. Electro-optic ripple-carry adder in integrated silicon photonics for optical computing. IEEE J. Select. Top. Quant. Electron. 24, 6 (2018).
    [15]
    Y. Long, L. Zhou, and Jian Wang. 2016. Photonic-assisted microwave signal multiplication and modulation using a silicon Mach–Zehnder modulator. Sci. Rep. 6 (Feb. 2016), Art. No. 20215.
    [16]
    K. Simonyan and A. Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR’15).
    [17]
    O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. 2015. ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 3 (Dec. 2015), 211–252.
    [18]
    Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (Nov 1998), 2278–2324.
    [19]
    IPKISS-Photonic Framework. 2018. Retrieved from www.lucedaphotonics.com.
    [20]
    Martin Abadi et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). USENIX Association, 265–283.
    [21]
    Mengkun Li and Yongjian Wang. 2020. An energy-efficient silicon photonic-assisted deep learning accelerator for big data. In Proceedings of the Conference on Wireless Communications and Mobile Computing.
    [22]
    B. J. Shastri, A. N. Tait, T. Ferreira de Lima, et al. 2021. Photonics for artificial intelligence and neuromorphic computing. Nat. Photon. 15 (2021), 102–114.
    [23]
    D. Dang, S. Taheri, B. Lin, and D. Sahoo. 2020. MEMTONIC: A neuromorphic accelerator for energy efficient deep learning. In Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC’20). 1–2.
    [24]
    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR’15).
    [25]
    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17). Curran Associates, Red Hook, NY, 6000–6010.
    [26]
    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT’19). 4171–4186.
    [27]
    German I. Parisi, Ronald Kemker, Jose L. Part, Christopher Kanan, and Stefan Wermter. 2019. Continual lifelong learning with neural networks: A review. Neural Netw. 113 (May 2019), 54–71.
    [28]
    Farzaneh Zokaee, Qian Lou, Nathan Youngblood, Weichen Liu, Yiyuan Xie, and Lei Jiang. 2020. LightBulb: A photonic-nonvolatile-memory-based accelerator for binarized convolutional neural networks. In Proceedings of the 23rd Conference on Design, Automation and Test in Europe (DATE’20). EDA Consortium, San Jose, CA, 1438–1443.
    [29]
    M. Liu, P. Mak, Z. Yan, and R. P. Martins. 2011. A high-voltage-enabled recycling folded cascode OpAmp for nanoscale CMOS technologies. In Proceedings of the IEEE International Symposium of Circuits and Systems (ISCAS’11). 33–36.
    [30]
    J. R. Ong, C. C. Ooi, T. Y. L. Ang, S. T. Lim, and C. E. Png. 2020. Photonic convolutional neural networks using integrated diffractive optics. IEEE J. Select. Topics Quant. Electron. 26, 5 (Sept./Oct. 2020).
    [31]
    A. Mehrabian, Y. Al-Kabani, V. J. Sorger, and T. El-Ghazawi. 2018. PCNNA: A photonic convolutional neural network accelerator. In Proceedings of the IEEE Symposium on Cloud Computing (SOCC’18).
