1 Introduction
In the machine learning (ML) community, convolutional neural network (CNN) accelerator design has recently gained significant popularity [35]. Due to the high computational complexity of CNNs, developing a sophisticated hardware accelerator is essential to improve the energy efficiency and throughput of the targeted task. Custom accelerators, also sometimes called application-specific integrated circuit (ASIC)-based accelerators, outperform general-purpose processors like central processing units (CPUs) and graphics processing units (GPUs) [12, 83]. However, designing ASIC-based accelerators comes with its own challenges, such as design complexity and long development times. Researchers have constantly strived to design more efficient accelerators in terms of throughput, energy consumption, chip area, and model performance. This has led to a large number of CNN accelerators, each incorporating different design choices and unique hardware modules to optimize certain operations. The result is a plethora of operation-specific hardware modules, sophisticated acceleration methods, and many other hyperparameters that together define a vast design space in the hardware domain. Similarly, on the software side, the many design choices for a CNN model give rise to an enormous design space as well. A challenge that remains is to efficiently find an optimal pair of software and hardware hyperparameters, that is, a CNN–accelerator pair that performs well while also meeting the required constraints.
The accelerator design space is immense. Many hyperparameters must be assigned values when designing an accelerator for a given application [63]. They include the number and size of processing elements (PEs) and on-chip buffers, the dataflow, the size and type of main memory, and many more domain-specific modules, including those for sparsity-aware computation and reduced-precision design (more details in Section 2.2). Similarly, the CNN design space is also huge. Many hyperparameters come into play while designing a CNN, including the number of layers, convolution type and size, normalization type, pooling type and size, structure of the final multi-layer perceptron (MLP) head, activation function, training recipe, and many more (further details in Section 2.1). For a given task, one may search for a particular CNN architecture with the best performance. However, this architecture may not be able to meet user constraints on power, energy, latency, and chip area.
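To make the two design spaces concrete, the snippet below sketches a hypothetical encoding of a single point in the joint space. The hyperparameter names and values are illustrative only and do not reflect the exact schema used by our framework.

```python
# Hypothetical encoding of one CNN–accelerator pair in the joint design space.
# All names and values are illustrative, not the framework's actual schema.
cnn_config = {
    "num_conv_layers": 12,
    "conv_types": ["3x3", "depthwise-separable", "grouped"],  # per-layer choices
    "normalization": "batch-norm",
    "pooling": {"type": "max", "size": 2},
    "mlp_head": [512, 10],                 # structure of the final classifier
    "activation": "relu",
    "training_recipe": {"optimizer": "sgd", "lr": 0.05, "epochs": 200},
}

accelerator_config = {
    "pe_array": (16, 16),                  # number of processing elements
    "pe_buffer_kb": 2,                     # per-PE buffer size
    "on_chip_buffer_kb": 512,
    "dataflow": "output-stationary",
    "main_memory": {"type": "DRAM", "size_mb": 512},
    "sparsity_aware": True,
    "weight_precision_bits": 8,
}
```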
In the above context, given a fixed hardware platform, many works instead aim to tune the CNN design to optimize performance [19]. However, this limits exploration to the CNN design space alone, with no tuning possible on the hardware side. Conversely, for a fixed CNN, one could optimize the hardware (called automatic accelerator synthesis) when searching the CNN space is too expensive, especially for large datasets. This hardware optimization, however, requires a thorough knowledge of the architecture and design of ASIC-based accelerators, leading to long design cycles. Exploring the CNN design space, on the other hand, falls under the domain of neural architecture search (NAS). Advancements in this domain have led to many NAS algorithms, including those employing reinforcement learning (RL), Bayesian optimization, structure adaptation, and so on [18, 59, 74]. However, these approaches have many limitations, including suboptimal convergence, slow regression performance, and restriction to fixed training recipes (details in Section 3.1). Tuli et al. [67] recently proposed a model for NAS in the space of transformer ML models [70]. It overcomes many of these challenges by leveraging a heteroscedastic surrogate model to search over the model's design decisions and its training recipe. However, this technique is not amenable to co-design between the two (namely, the accelerator and CNN) design spaces (details in Section 3.3.1).
As we also show in this work, such one-sided searches lead to suboptimal CNN–accelerator pairs. This limitation has motivated recent advancements in the co-design of both software and hardware. Researchers often leverage RL techniques to search for an optimal CNN–accelerator pair [4, 34, 91]. However, most co-design works only use local search (mutation and crossover) and/or have limited search spaces; for example, they only search over selected field-programmable gate arrays (FPGAs) or microcontrollers [4, 42]. Some works have also leveraged differentiable search of the CNN architecture [16, 41]. However, recent surveys have shown that these methods are much slower than surrogate-based methods and fail to explore promising clusters of higher-performing models [57]. Limited search spaces, as shown by some very recent works [26, 67], often lead to suboptimal neural network models and even suboptimal CNN–accelerator pairs (or, in general, suboptimal combinations of the hardware architecture and the software algorithm). Thus, expanding existing design spaces in both the hardware and the software regimes is necessary. However, blindly growing these design spaces further prolongs design times and exponentially increases compute resource requirements.
To tackle the above challenges, we propose a framework for comprehensively and simultaneously exploring massive CNN architecture and accelerator design spaces, leveraging a novel co-design workflow to obtain, for the targeted application, an optimal CNN–accelerator pair that meets user-specified constraints. Target applications include not just edge deployments with highly constrained power envelopes, but also server settings where model accuracy is of utmost importance.
Our optimal CNN–accelerator pair outperforms the state-of-the-art pair [83], achieving 1.4% higher model accuracy on the CIFAR-10 dataset [21, 36], while enabling 59.1% lower latency and 60.8% lower energy consumption, with only a 17.1% increase in chip area. This pair also achieves 3.7% higher Top-1 accuracy on the ImageNet dataset while incurring 43.8% lower latency and 11.2% lower energy consumption (with the same increase in chip area). Using our proposed framework on expanded design spaces that include popular CNNs and accelerators, our CNN–accelerator pair achieves a 1.5% improvement in model accuracy on the CIFAR-10 dataset, along with 11.0× lower energy-delay product (EDP), 34.7× higher throughput, and 4.0× lower chip area relative to a state-of-the-art co-design framework, namely Auto-NBA [26]. We plan to release the trained CNN models, accelerator architecture simulations, and our framework to enable future benchmarking.
The main contributions of this article are summarized next.
•
We expand on previously proposed CNN design spaces and present a new tool, called CNNBench, to characterize the vast design space of CNN architectures, which includes a diverse set of supported convolution operations, unlike any previous work [41, 44, 80]. We propose CNN2vec, which employs similarity measures between the computational graphs of CNN models to obtain dense embeddings whose Euclidean distances reflect architectural similarity (a minimal sketch of one such graph-similarity-based embedding is given after this list). We also leverage a new NAS technique, Bayesian Optimization using Second-order Gradients and Heteroscedastic Surrogate Model for Neural Architecture Search (BOSHNAS), for searching our expanded design space. CNNBench also leverages similarity between neighboring CNN computational graphs to transfer weights and speed up the training process. Thanks to its massive design space, along with CNN2vec embeddings, weight transfer from previously trained neighbors, and simultaneous optimization of the CNN architecture and the training recipe, CNNBench achieves state-of-the-art performance while requiring fewer search iterations than previous works [72].
•
We survey popular accelerators proposed in the literature and encapsulate their design decisions in a unified framework. This gives rise to a benchmarking tool, AccelBench, which runs inference of CNN models on any accelerator within the design space using cycle-accurate simulations. AccelBench incorporates accelerators with diverse memory configurations that one could use for future benchmarking, rather than relying on traditional ASIC templates [26, 78]. With the goal of reaping the benefits of vast design spaces [26, 67], AccelBench is the first benchmarking tool for diverse ASIC-based accelerators, supporting the variegated design decisions found in modern accelerator deployments. Unlike previous works that use FPGAs or off-the-shelf ASIC templates, it builds accelerator designs from the ground up, mapping each CNN in a modular and efficient fashion. It supports \(2.28 \times 10^8\) unique accelerators, a design space much more extensive than that investigated in any previous work.
•
To efficiently search the proposed massive design space, we present a novel co-design method, Bayesian Optimization using Second-order Gradients and Heteroscedastic Surrogate Model for Co-Design of CNNs and Accelerators (BOSHCODE). To make the search of such a vast design space tractable, BOSHCODE incorporates numerous novelties, including a hierarchical search technique that gradually increases the granularity of the hyperparameters searched, a neural network surrogate model that leverages gradients with respect to the input for reliable query prediction, and an active learning pipeline that makes the search more efficient (a minimal sketch of this surrogate-driven loop is given after this list). Here, by gradients with respect to the input, we mean gradients with respect to the representation of the CNN–accelerator pair to be simulated in the next iteration of the active learning loop. BOSHCODE is a fundamental pillar for the joint exploration of vast hardware–software design spaces. CODEBench, our proposed framework, combines CNNBench (which trains any queried CNN and obtains its accuracy), AccelBench (which simulates the hardware performance of any queried accelerator architecture), and the proposed co-design method, BOSHCODE, to find the optimal CNN–accelerator pair, given a set of user-defined constraints. Figure 1 presents an overview of the proposed framework. Figure 1(a) shows how the CNNBench and AccelBench simulation pipelines output the performance values for every CNN–accelerator pair. Figure 1(b) shows how BOSHCODE learns a surrogate model for this mapping from CNN–accelerator pairs to their simulated performance values. CNNBench trains the CNN model to obtain model accuracy and feeds the checkpoints to AccelBench, which obtains the other performance measures using a cycle-accurate simulator.
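As a minimal sketch of the graph-similarity-based embedding idea behind CNN2vec, the snippet below embeds toy CNN computational graphs so that Euclidean distances between embeddings approximate pairwise graph distances. It assumes graph edit distance as the similarity measure and classical multidimensional scaling for the embedding step; the actual CNN2vec pipeline may use a different distance measure and optimization procedure.

```python
# Hedged sketch: embed CNN computational graphs into a Euclidean space so that
# embedding distances approximate pairwise graph distances.
# Assumptions: graph edit distance as the similarity measure and classical MDS
# for the embedding; CNN2vec itself may differ in both choices.
import networkx as nx
import numpy as np
from sklearn.manifold import MDS

def toy_cnn_graph(layer_ops):
    """Build a chain-structured computational graph with one node per operation."""
    g = nx.DiGraph()
    for i, op in enumerate(layer_ops):
        g.add_node(i, op=op)
        if i > 0:
            g.add_edge(i - 1, i)
    return g

graphs = [
    toy_cnn_graph(["conv3x3", "bn", "relu", "maxpool", "fc"]),
    toy_cnn_graph(["conv5x5", "bn", "relu", "avgpool", "fc"]),
    toy_cnn_graph(["dwconv3x3", "bn", "relu", "maxpool", "fc"]),
]

# Pairwise graph edit distances (matching node labels incur zero substitution cost).
n = len(graphs)
dist = np.zeros((n, n))
node_match = lambda a, b: a["op"] == b["op"]
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = nx.graph_edit_distance(
            graphs[i], graphs[j], node_match=node_match)

# Dense embeddings whose Euclidean distances approximate the graph distances.
embeddings = MDS(n_components=2, dissimilarity="precomputed").fit_transform(dist)
print(embeddings)
```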
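Similarly, the sketch below illustrates the flavor of the surrogate-driven active learning loop in BOSHCODE: a neural surrogate is fit to the performance of already-evaluated CNN–accelerator pairs, and gradients with respect to the input embedding are used to propose the next pair to simulate. All function and variable names are hypothetical, and the real method additionally models heteroscedastic and epistemic uncertainty and uses second-order gradients.

```python
# Hedged sketch of a surrogate-driven active learning loop in the spirit of
# BOSHCODE. Names are hypothetical; the actual method also models aleatoric and
# epistemic uncertainty and uses second-order gradients for the input search.
import torch
import torch.nn as nn

class Surrogate(nn.Module):
    """Predicts a scalar performance score from a joint CNN + accelerator embedding."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                                 nn.Linear(64, 64), nn.ReLU(),
                                 nn.Linear(64, 1))
    def forward(self, x):
        return self.net(x).squeeze(-1)

def fit_surrogate(model, pairs, scores, epochs=200, lr=1e-3):
    """Fit the surrogate to the performance of already-evaluated pairs."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        nn.functional.mse_loss(model(pairs), scores).backward()
        opt.step()

def propose_next_pair(model, seed, steps=100, lr=1e-2):
    """Gradient ascent on the *input* embedding to find a promising next query."""
    x = seed.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-model(x)).backward()   # maximize predicted performance
        opt.step()
    return x.detach()            # snapped to the nearest valid design point in practice

# Toy active learning loop on random data (stand-ins for CNNBench/AccelBench results).
dim = 8
pairs = torch.randn(16, dim)              # embeddings of evaluated CNN–accelerator pairs
scores = torch.randn(16)                  # their simulated/trained performance values
model = Surrogate(dim)
for _ in range(5):                        # each iteration would trigger new simulations
    fit_surrogate(model, pairs, scores)
    nxt = propose_next_pair(model, pairs[scores.argmax()])
    new_score = torch.randn(())           # placeholder for CNNBench + AccelBench evaluation
    pairs = torch.cat([pairs, nxt.unsqueeze(0)])
    scores = torch.cat([scores, new_score.unsqueeze(0)])
```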
The rest of the article is organized as follows. Section 2 discusses the CNN and accelerator design spaces and highlights the advantages of co-design over one-sided optimization approaches. Section 3 presents the co-design framework that uses BOSHCODE to search for an optimal CNN–accelerator pair. Section 4 describes the experimental setup and baselines considered. Section 5 discusses the results. Finally, Section 6 concludes the article.