# Statistical Hardware Design With Multi-model Active Learning

Alireza Ghaffari, Masoud Asgharian and Yvon Savaria.

Abstract-With the rising complexity of numerous novel applications that serve our modern society comes the strong need to design efficient computing platforms. Designing efficient hardware is, however, a complex multi-objective problem that deals with multiple parameters and their interactions. Given that there are a large number of parameters and objectives involved in hardware design, synthesizing all possible combinations is not a feasible method to find the optimal solution. One promising approach to tackle this problem is statistical modeling of a desired hardware performance. Here, we propose a model-based active learning approach to solve this problem. Our proposed method uses Bayesian models to characterize various aspects of hardware performance. We also use transfer learning and Gaussian regression bootstrapping techniques in conjunction with active learning to create more accurate models. Our proposed statistical modeling method provides hardware models that are sufficiently accurate to perform design space exploration as well as performance prediction simultaneously. We use our proposed method to perform design space exploration and performance prediction for various hardware setups, such as micro-architecture design and OpenCL kernels for FPGA targets. Our experiments show that the number of samples required to create performance models significantly reduces while maintaining the predictive power of our proposed statistical models. For instance, in our performance prediction setting, the proposed method needs 65% fewer samples to create the model, and in the design space exploration setting, our proposed method can find the best parameter settings by exploring less than 50 samples.

Index Terms—Statistical modeling, Design space exploration, Hardware performance prediction, Transfer learning, Active learning, Bayesian models, Gaussian regression bootstrap.

### I. INTRODUCTION

FFICIENT hardware design is essential for various industrial and scientific applications. With the emergence of new applications such as machine learning, the Internet of Things (IoT), and 5G wireless communication, there is a strong need to design computing platforms that provide the required computational power. Designing efficient hardware is challenging and requires considering various hardware and software parameters and their interactions. In most cases, designing efficient digital hardware is a multi-objective optimization problem where the designers try to optimize multiple competing objectives, such as energy consumption and throughput. Moreover, performing a brute-force search to find the optimum design is not feasible, because synthesizing complex hardware designs is very time-consuming. In addition, designers must

Alireza Ghaffari and Yvon Savaria are with the department of Electrical Engineering Polytechnique Montreal, and Masoud Asgharian is with the department of Mathematics and Statistics, McGill University. *Corresponding Author: seyed-alireza.ghaffari@polymtl.ca*.

test a large number of parameter combinations to find the optimum solution.

Another prominent challenge of efficient hardware design is that the objective function is not smooth, and most parameters are discrete. For example, the memory delay, the number of floating-point arithmetic units, and clock frequencies are parameters that need to be tuned for throughput. Since these parameters are discrete, the objective function is not differentiable. Thus using gradient-based methods such as Stochastic Gradient Descent (SGD) to find the optimum parameter set is not possible.

For years, meta-heuristic methods, such as Genetic Algorithm, were used to perform design space exploration [1], [2]. Although these meta-heuristic algorithms can find the optimum setup in various settings, they cannot provide a model that predicts the performance. Moreover, the heuristic methods cannot provide an abstract model to study the interactions and inter-dependencies between the parameters and their effect on the desired performance metrics. Researchers proposed Bayesian models for design space exploration tasks to solve this issue. Bayesian optimization is not only capable of finding the optimum parameter setup but is also able to reveal the interactions and inter-dependencies between the parameters of the design.

Although research on Bayesian models has produced remarkable results for design space exploration as well as abstract modeling of parameter inter-dependencies, they incorporate two fundamental shortcomings. First, there is a need for a feedback mechanism from a real-world application to update the Bayesian model to improve accuracy. Second, Bayesian models are application specific, meaning that a model trained for a particular task cannot be used for another. Thus, there is a need to propose a *transfer learning* strategy to use currently trained models to predict the performance of applications that were not previously exposed to the model. Using transfer learning is crucial in hardware design since it provides the ability to perform design space exploration for emerging applications using prior knowledge acquired from previous experiences and applications.

Here we propose a novel methodology that solves the two shortcomings mentioned above. Besides the Bayesian modeling approach, we suggest using Bayesian active learning ([3], [4]) to increase our proposed statistical model accuracy. Bayesian active learning is an iterative machine learning algorithm that actively selects some informative samples to construct the Bayesian model of the objective function. The iterative nature of the Bayesian active learning method helps the Bayesian model to learn the underlying structure of the

data gradually. The ultimate goal of Bayesian active learning is to provide a surrogate function that closely resembles the system's behavior with the minimum number of samples. In our proposed methodology, once a Bayesian model is created, it can be updated using a feedback mechanism that is provided by a real-world application. For instance, a Bayesian surrogate model can be updated using a processor or computing cluster performance metrics reflecting its performance for different computation loads. Furthermore, we propose using the Bayesian transfer learning [5], where the information of one domain can be transferred to another domain to improve the accuracy of the statistical model of the target task. Transfer learning is particularly useful in hardware design in scenarios where the target application is novel and there are limited information and data on the performance of the hardware in that application. In such cases, the information of previously trained hardware models can be leveraged to infer the performance of the new application.

To further reduce the number of samples needed to make hardware performance prediction, we suggest using *Gaussian regression bootstrapping*. Once we have a Bayesian model of the hardware, it is possible to re-sample the model in order to acquire new composite samples to perform other types of statistical analysis. In this paper, we used Gaussian regression bootstrapping to generate new data to perform various regression analyses of the hardware model and showed that the results based on bootstrapped samples could closely resemble the analyses done on the real data.

To summarize, this paper makes the following contributions:

- We propose a model-based active learning framework that generates statistical models of different aspects of hardware performance.
- We propose *Bayesian transfer learning* to effectively perform hardware design space exploration using prior hardware performance information.
- We use a Gaussian regression bootstrapping method to effectively estimate the behavior of the hardware from the limited samples acquired from real-world applications.

To the best of our knowledge, this is the first time that model-based active learning has been used in conjunction with Bayesian transfer learning to provide accurate statistical hardware models for design-space exploration. Furthermore, this is the first time that Bayesian regression bootstrapping is used to estimate the behavior of specific hardware better.

The rest of the paper is structured as follows. Section II reviews related works in the hardware design space exploration and performance prediction. We introduce our methodology in Section III. In Section IV, we distinguish two important settings, notably design space exploration and performance modeling. In Section V, we show that our proposed methodology is effective for both design space exploration and performance modeling.

### II. RELATED WORKS

Exploring various design parameters in hardware design is a ubiquitous problem. The most obvious but inefficient solution is brute-force analysis or exhaustive search. The complexity of brute-force analysis increases exponentially with the number of parameters, making this method infeasible in practical problems. For years, researchers have used meta-heuristic methods such as Genetic Algorithms (GAs) [2] to perform design space exploration and obtain the Pareto optimal solutions. Although meta-heuristic methods effectively find optimum solutions, they do not provide a mathematical model reflecting interdependencies between parameters and their effects on the objectives.

Another valuable design space exploration approach is Bayesian Optimization (BO) [4]. Bayesian Optimization is especially interesting for design space exploration applications since (i) it provides a Bayesian surrogate model that helps to understand the inter-dependencies of the parameters and (ii) it is an iterative method that incorporates information from previously observed samples in the surrogate model. BO has been widely used in hardware design space exploration. For instance, in [6], BO was found to have superior performance for design space exploration of C/C++ High-Level Synthesis (HLS) compared to traditional search methods. Likewise, BO was used in [7] to optimize hardware accelerators for deep neural networks. The authors reported that the BO method outperforms traditional design space methods in all metrics of interest, such as accuracy and energy consumption. Furthermore, in [8], BO is deployed to tune HLS directives to achieve minimum latency.

Inspired by traditional meta-heuristic approaches, researchers have combined BO and Grey Wolf Optimization (GWO) [9] to perform design space exploration for HLS of OpenCL kernels [10]. The hybrid GWO-BO method was shown to outperform both BO and traditional search methods.

Other iterative methods such as Reinforcement Learning (RL) [11] are also effective for design space exploration applications. An RL agent tunes hardware parameters through some iterations to maximize the reward function related to metrics of interest. For example, in [12], the RL method is used to explore parameters that optimize the latency of deep neural networks on the ARM-Cortex-A CPUs. Likewise, in [13], a time-limited reinforcement learning [14] is used to perform design space exploration for deeply pipelined OpenCL kernels of the convolutional neural networks. In [15], an Advantage Actor Critic (A2C) [16] design space exploration is proposed to minimize area subject to a timing constraint in logic synthesis. Additionally, in [17], the authors proposed an RL-based design space exploration method and used the performance predicted by their cost model to compute the reward values. Moreover, they augmented their RL algorithm with a genetic algorithm to improve the design space exploration results.

Furthermore, using other statistical methods for hardware analysis and design is commonplace in the hardware community. For instance, in [18], authors studied accurate regression modeling of micro-architecture energy consumption and throughput. Moreover, in [19], Random Forest was used for HLS design space exploration. Similarly, in [20], Random Forest regression is used for predicting instruction per cycle for near-memory computing applications. In [21] Markov decision process was found effective for the design space exploration of multi-processor platforms. In [22], parametric bootstrapping

was used to perform stochastic ordering deep learning models that run efficiently on different hardware, while the inference efficiency and training efficiency are chosen as the metrics of interest.

Our proposed method is different from the previous Bayesian methods ([6], [7], [8]) in the way that it uses Active Learning in conjunction with Bayesian Optimization to create more accurate Bayesian models. We present results showing that our method is more effective than the hybrid GWO-BO method proposed in [10]. Furthermore, we use Bayesian Transfer Learning (TL) [5] and show that the performance of the parameter search is significantly improved by incorporating information from other correlated hardware design tasks. Our proposed method differs from [22] since it can produce Bayesian models capable of being used for Gaussian regression bootstrapping [23]. Also, after performing Gaussian regression bootstrapping, it is possible to produce composite data from our Bayesian surrogate model and perform statistical regression analysis similar to ([18], [21]) without requiring real-world data samples.

### III. METHODOLOGY

Statistical modeling of hardware and related applications such as design space exploration and performance estimation can be classified as Black Box Optimization (BBO) [24] problems. This is because the objective function and its behavior, such as smoothness, continuity, and convexity, are unknown. Thus, gradient-based methods such as Gradient Descent (GD) and Stochastic Gradient Descent (SGD) are not the best choice for these applications. On the other hand, derivative-free methods such as meta-heuristic methods and Bayesian Optimization are among the best choices for these applications.

Here we propose a *Model-Based Active Learning* framework for statistical hardware performance modeling. We further analyze this method in the context of design space exploration and performance prediction (regression analysis). In our proposed method, we use Bayesian Optimization with Gaussian Processes. We also use Bayesian Transfer Learning and Bayesian regression bootstrapping methods to further improve such models in practical applications. Fig. 1 shows an overview of our proposed framework. In this figure, Bayesian models are de facto Gaussian processes that are updated iteratively using active learning. These Bayesian models can be used in design space exploration or for performance prediction. The details of our mathematical modeling are elaborated on in the following sections.

### A. Multi-Model Active Learning

We use active learning to update the Gaussian process models. Our proposed framework uses multiple Bayesian models for various objectives. For instance, power consumption and throughput are each associated with their corresponding models. The algorithm starts from a random set of parameters for all models. It then queries the synthesis tool and creates initial Gaussian processes. Next, the algorithm evaluates the Gaussian processes and gathers each model parameters candidate

to be sent to the synthesis tool in the next query. This process continues iteratively until a particular stopping criterion is met. Furthermore, when used with active learning, the algorithm chooses the next set of parameters using the Gaussian models that are also affected by other pre-trained Gaussian processes, as explained in Section III-C. Also, note that our proposed algorithm considers the direction of optimization of each objective function. For instance, designers usually aim to minimize power consumption while maximizing the throughput. Therefore, our proposed algorithm chooses the sets of parameters that respect the optimization constraints of each objective function. Having multiple models associated with multiple objectives enables us to study the Pareto efficiency of the solutions. Section V-A1 reports experimental results that validate the proposed method. Algorithm 1 provides a pseudocode for our proposed algorithm.

### B. Bayesian Optimization with Gaussian Processes

Bayesian Optimization [4] is a powerful tool to find the extrema of an objective function that is hard to evaluate or does not have a close mathematical form. Statistical modeling of hardware implementations deals with both problems simultaneously. It is well-known that synthesizing hardware models takes a very long time, and generally, there is no closed-form expression of the power consumption and latency of the resulting hardware models.

In Bayesian Optimization, the goal is to find a surrogate function  $f_s$  that closely resembles the behavior of the desired objective function O. Note that in Bayesian Optimization,  $f_s$  is a stochastic process.

Let us assume  $\mathbf{x}_i$  is the  $i^{th}$  sample, and  $O(\mathbf{x}_i)$  is the evaluation of the objective function for that sample. Using Bayes theorem, for a collection of N samples such as  $C_{1:N} = \{\mathbf{x}_{1:N}, O(\mathbf{x}_{1:N})\}$ , we have

$$P(O|C_{1:N}) \propto P(C_{1:N}|O)P(O),$$
 (1)

where  $P(O|C_{1:N})$  is the posterior distribution. The posterior distribution P incorporates probabilistic information about the

### Algorithm 1: Multi-Model Bayesian Active Learning.

**Step 1.** Randomly choose the first set of parameters  $\mathbf{x}_{1:N} = {\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N}$  for query.

while stop criterion is not satisfied do

**Step 2.** Get the query Objectives  $O(\mathbf{x}_{1:N})$ .

**Step 3.** For all  $\{\mathcal{GP}_1, \mathcal{GP}_2, \dots \mathcal{GP}_M\}$ , calculate new covariance matrix using eq (3)

**Step 4.** Update all  $\{\mathcal{GP}_1, \mathcal{GP}_2, \dots \mathcal{GP}_M\}$  using eq (2)

**Step 5.** If transfer learning is needed: Update  $\{\mathcal{GP}_1, \mathcal{GP}_2, \dots \mathcal{GP}_M\}$  mean vector and covariance matrix using eq (4)

**Step 6.** Evaluate all  $\{\mathcal{GP}_1, \mathcal{GP}_2, \dots \mathcal{GP}_M\}$  and update the set  $\mathbf{x}_{1:N}$ 

end return  $\{\mathcal{GP}_1, \mathcal{GP}_2, \dots \mathcal{GP}_M\}$ 



Fig. 1. Statistical modeling of hardware using Multi-model Bayesian Active Learning. The framework also incorporates an optional Bayesian Transfer Learning setup to improve the accuracy of the Bayesian models.

objective function O. Also, the commonplace assumption in Bayesian optimization is that the objective function is continuous. This assumption does not hold in our experimental setup. However, our goal is to perform Bayesian optimization on a discrete objective function by fitting a continuous posterior distribution P that passes through the discrete points of the objective function O, see [25].

We note that the surrogate function  $f_s$  is a stochastic process. Let us further assume that  $f_s$  is a Gaussian Process that provides a mean and a standard deviation at each point. i.e.

$$f_s(\mathbf{x}) = \mathcal{GP}(\mu(\mathbf{x}), \mathbf{K} = k(\mathbf{x}, \mathbf{x}'))).$$
 (2)

This assumption implies that the sampling process of  $O(\mathbf{x}_i)$  is noisy with Gaussian distribution  $\mathcal{N}(0, \sigma^2)$ .

In eq (2),  $\mathcal{GP}$  denotes a Gaussian process with a mean function  $\mu$  and a covariance operator  $\mathbf{K}$ . Here, we use the squared exponential covariance kernel k defined below to construct the covariance matrix  $\mathbf{K}$ 

$$\mathbf{K}_{ij} = k(x_i, x_j) = \exp(-\frac{|x_i - x_j|^2}{2})$$
 (3)

There are other types of covariance kernels such as *Matern kernels* [26] that are suitable for non-smooth data. Although studying the effect of various covariance kernels is not the main focus of this work, we show in Appendix B that our proposed method for hardware performance modeling is robust in terms of changing the covariance kernel to Matern.

### C. Transfer Learning

We propose using  $Transfer\ Learning$  to transfer prior information from a source surrogate function to a target function. Let us denote source domain  $\mathcal{D}_{source}$  that contains design parameters of our hardware design problem  $\mathbf{x} = \{x_1, x_2 \dots x_p\}$ . The surrogate function of source task  $\mathcal{T}_{source}$  is also denoted as  $f_{source}$ . The goal is to improve the learning process of a target task  $\mathcal{T}_{source}$  with the surrogate function  $f_{target}$  and domain  $\mathcal{D}_{target}$ . Note that our application considers  $Inductive\ Transfer\ Learning$ , where  $\mathcal{T}_{source} \neq \mathcal{T}_{target}$  and  $\mathcal{D}_{source} = \mathcal{D}_{target}$ . This

means that in our experimental setup, although the tasks and their surrogate functions are different, the design parameters  $\mathbf{x} = \{x_1, x_2 \dots x_p\}$  of our hardware design problems are the same

In our proposed method, we consider *Inductive Gaussian Process Transfer Learning* where we allow the mean function and covariance operator of the source surrogate function to affect the target surrogate function. To this end, we define

$$\begin{cases} \mu^{\text{TL}}(\mathbf{x}) = \mu_{\text{target}}(\mathbf{x}) + \lambda_1 \mu_{\text{source}}(\mathbf{x}) \\ \mathbf{K}^{\text{TL}} = \mathbf{K}^{\text{target}} + \lambda_2 \mathbf{K}^{\text{source}} \end{cases}$$
(4)

where  $\lambda_1, \lambda_2$  are hyper-parameters that regularize the magnitude of the source information transferred to the target.

We emphasize that we only use transfer learning for design-space exploration experiments where finding the position of the objective function optimum point is more important than its value, see Section IV. This method is beneficial when the  $\mathcal{T}_{\mathrm{source}}$  and  $\mathcal{T}_{\mathrm{target}}$  are correlated. We will show in Section V-A2 that this assumption is true for our intended application.

### D. Gaussian Regression Bootstrapping

In statistical hardware modeling problems, the aim is to predict the behavior of an objective function O. Examples of such objective functions are power consumption and latency. In practice, we have access to limited samples of this objective function because synthesizing a hardware design is a very time-consuming process. Thus, we can use the Bayesian surrogate function  $f_s$  to evaluate the behavior of the objective function O for new sets of parameters. This method is commonly known as Gaussian Regression Bootstrapping [23] and its theoretical aspect is studied in [27]. This is specifically beneficial because, unlike O, sampling  $f_s$  is inexpensive. Note that  $f_s$  is a Gaussian process incorporating mean and variance information. Here, we can use the mean function  $\mu$  of  $f_s$  for resampling the objective function O

$$O(\mathbf{x}) = \mu(\mathbf{x}) + \varepsilon(\mathbf{x}) \tag{5}$$

which provides an inexpensive evaluation of O for hardware parameter setups that are not actually synthesized. Note that  $\epsilon$  denotes the error of a  $\mathcal{GP}$ 's with a mean function equal to zero and a covariance operator equal to K.

## IV. DESIGN SPACE EXPLORATION VS PERFORMANCE MODELING SETTINGS

Hardware design space exploration and hardware performance modeling are two related but distinct tasks in hardware design. Our proposed statistical modeling methodology provides a unified framework for both applications that have been crucial for hardware designers historically. In the *Performance Modeling* application, the goal is to accurately *predict* the hardware performance (or value of the objective function O) for a specific hardware design parameter setting. Thus, the performance modeling application is essentially a regression problem and the accuracy of the model can be measured with *Mean Squared Error (MSE)* between the surrogate function  $f_s$  and the actual objective function O over N parameter sets  $\mathbf{x}_{1:N} = \{\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N\}$ 

MSE = 
$$\frac{1}{N} \sum_{i=1}^{N} (O(\mathbf{x}_i) - f_s(\mathbf{x}_i))^2 = \frac{1}{N} \sum_{i=1}^{N} \varepsilon_i^2$$
. (6)

Fig. 2 (a) demonstrates the performance modeling setup. The black curve is the actual objective function, while the dotted red curve is the regression model. Note that in our experiments, we report  $\sqrt{\text{MSE}}/\mu_{f_s(\mathbf{x}_i)}$  to normalize MSE with respect to the mean of the surrogate function  $\mu_{f_s(\mathbf{x}_i)}$ .

On the other hand, in the *Design Space Exploration (DSE)* application, the goal is to find the hardware design parameter set that maximizes the objective function. Moreover, in the case of multi-objective optimization, DSE aims to find the dominant or Pareto optimal solutions. Fig. 2 (b) shows the DSE setup. The model is accurate if

$$\arg\max(f_s) = \arg\max(O),\tag{7}$$

meaning that the surrogate function  $f_s$  can accurately predict the parameter set that maximizes the objective function. Also, note that here argmax is used for the objectives that need to be maximized, such as throughput. Likewise, the goal for the objective functions that are to be minimized, e.g.



Fig. 2. (a) Performance modeling (regression) setting where the accuracy of the model is evaluated with the Mean Squared Error (MSE) and (b) Design space exploration setting where the algorithm tries to find the parameters that correspond to the design with maximum performance.

power consumption, is the same, except that  $\arg\max$  should be replaced by  $\arg\min$ .

### V. RESULTS AND DISCUSSIONS

Here, we discuss the performance of our proposed method and compare it with previous works. We show that our proposed method not only outperforms other algorithms for design space exploration applications, but it also provides a viable performance prediction in the regression setting.

### A. Design Space Exploration (DSE) Setting

We study the performance of our algorithm for DSE setting for two applications. In Section V-A1, we study the multiobjective optimization of OpenCL kernels on FPGA targets, and in Section V-A2 we show the effect of transfer learning on design space exploration of the CPU micro-architecture design.

1) Case Study I: Multi-Objective Optimization of OpenCL kernels on FPGA targets: As studied previously in ([29], [10]), design space exploration of the OpenCL kernels on FPGA targets is a challenging task that involves optimizing multiple competing performance objectives such as throughput and area utilization using various design parameters. These design parameters include, but are not limited to, the number of work-groups, the number of compute units, SIMD size, unrolling, etc. Furthermore, such kernels must be optimized for both area and throughput. Having two objectives, i.e., logic utilization and throughput, makes the problem a multi-objective problem where the Pareto Frontier must be found by the proposed algorithm. To do so, we create a Bayesian active learning multi-model for throughput and area according to Algorithm 1.

The idea behind using our proposed method is to iteratively select new design parameters and configure the application based on them in order to find the optimal set of parameters that achieve the required performance. As shown in Algorithm 1, the Bayesian models of area utilization and latency are trained on a set of initial design parameters and their corresponding performance metrics using Gaussian processes. Then, these models are used to model the performance of other design configurations and also as a guide for the selection of new parameters to be evaluated by the Bayesian model. By using active learning, the optimization process proceeds to update the Bayesian model based on the new samples evaluated by the model and it continues until satisfactory results are achieved for the performance metrics. For this experiment, we retain the five candidate solutions defined by a set of design parameters that give the best score according to the surrogate function. More generically, there could be more than one objective function, in which case each objective function would point to the five best designs. Finally, the associated parameter sets producing the best performance in the surrogate models are used to update the Bayesian models in each active learning round.

Fig. 3 shows the objective space of the matrix multiplication OpenCL kernel<sup>1</sup>. The red triangles show the set of dominant

 $<sup>^{1}\</sup>mbox{For}$  a complete list of matrix multiplication HLS design parameters, refer to  $\mbox{\bf Appendix}~\mbox{\bf A}.$ 



Fig. 3. Objective space of the Matrix Multiplication OpenCL kernel is demonstrated by the blue dots. The Pareto frontier estimated by our proposed multi-model active learning method is depicted with red triangles.

solutions, also known as Pareto frontier [30] that is estimated by our proposed method. The estimated Pareto frontier resembles the actual Pareto frontier in [29].

Table 1 shows the performance of the proposed method when optimizing the OpenCL kernels for latency. Our proposed method outperforms all the previously reported methods to find the optimum parameters for achieving the best latency results in various OpenCL benchmarks.

2) Case Study II: Design Space Exploration of Micro-Architectures using Transfer Learning: Design space exploration of micro-architectures is one of the most prominent problems of digital hardware design. Similar to other design space exploration problems, the number of possible designs grows exponentially with the number of parameters (See Appendix A for the list of design parameters). As such, finding the optimal design parameters that meet the performance requirement is a challenging and time-consuming task that requires significant computational and synthesis resources. Thus, there is a need to find performance models with a minimum number of queries for the synthesis tool. In order to show that our proposed algorithm performs efficiently in this setup,

similar to [18], we consider the SPEC2k [31] benchmarks to evaluate our proposed methodology. More specifically, we use the ammp, applu, equake, gcc, gzip, mesa, twolf and jbb benchmarks. The SPEC2k dataset exposes twelve parameters that can be tuned to optimize performance. These parameters are included but not limited to the cache size, the memory latency, the fixed-point and floating-point latency, etc. [18]. Moreover, considering the number of parameters, the design space for each benchmark is about *one Billion* design points, which makes it impossible to do design space exploration using exhaustive search methods such as brute force.

Similar to Section V-A1, we can consider the design space exploration of the micro-architectures as a multi-objective problem and illustrate the dominant solutions using a Pareto frontier. Fig. 4 shows a selected group of samples for objective space for SPEC2k 3-D graphics benchmark (mesa). In this figure, the Pareto frontier shows the trade-off between the Billion Instructions per Second (BIPS) rating and the normalized power, confirming that a faster microprocessor needs to consume more power. Also note that the red dots in Fig. 4 are determined by our proposed Bayesian multi-model active learning of power consumption and instructions per second (BIPS) rating created using 600 samples. As discussed in Section V-B, our proposed method considerably reduces the number of samples needed to perform regression analysis compared to the method proposed by [18].

Now we show that Bayesian Transfer Learning, as introduced in Section III-C, can further reduce the number of samples required for design space exploration. We use transfer learning as a technique to leverage knowledge gained from training Bayesian models on a source task to improve the modeling process of a target task. Let us consider the SPEC2k Java Business Benchmark (jbb) as the source task  $\mathcal{T}_{source}$  and show that the information of its Gaussian surrogate function can be used to reduce the number of samples needed to perform design space exploration for other benchmark tasks. Our approach for transfer learning in this context is to use a Bayesian model that is trained using SPEC2k Java Business Benchmark (jbb) and use its Gaussian process information to affect the training of other target tasks as explained in Section III-C. By doing so, the number of design evaluations is reduced, which leads to faster and more efficient design

TABLE 1
DESIGN SPACE EXPLORATION USING BAYESIAN ACTIVE LEARNING AND COMPARISON WITH PREVIOUS WORKS.

| Benchmark Function                  | Random<br>Search [28]<br>Average<br>Latency(ms) | LASSO model<br>(ms) [29] | hybrid<br>GWO-BO [10]<br>Average<br>Latency (ms) | This work<br>Latency (ms) |
|-------------------------------------|-------------------------------------------------|--------------------------|--------------------------------------------------|---------------------------|
| <b>Breath-First Search Dense</b>    | 4.372                                           | 4.3825                   | 4.329                                            | 4.2804                    |
| <b>Breath-First Search Sparse</b>   | 13.371                                          | 13.2857                  | 12.5151                                          | 12.5151                   |
| Discrete Cosine Transform           | 2.904                                           | 2.9522                   | 2.5891                                           | 2.1916                    |
| FIR                                 | $1.8 \times 10^{-3}$                            | $2.4 \times 10^{-3}$     | $1.6 \times 10^{-3}$                             | $7.11 \times 10^{-4}$     |
| Histogram                           | 1.6632                                          | 1.5715                   | 1.5652                                           | 1.5481                    |
| Mergesort                           | 1.9955                                          | 2.0441                   | 1.9386                                           | 1.9269                    |
| Matrix Multiplication               | 40.9993                                         | 40.6412                  | 36.8913                                          | 34.198                    |
| Normal Estimation                   | 6.5449                                          | 6.4763                   | 6.4315                                           | 6.2779                    |
| Sobel Filter                        | 1.8448                                          | 1.9331                   | 1.7434                                           | 1.719                     |
| Sparse Matrix Vector Multiplication | 0.1729                                          | 0.1798                   | 0.1671                                           | 0.1661                    |



Fig. 4. Selected samples of the objective space for the SPEC2k 3-D graphics benchmark (mesa) demonstrated by the blue dots. The estimated Pareto frontier obtained with our proposed multi-model active learning method is depicted with red triangles.

TABLE 2
CORRELATION OF THE SPEC2k BENCHMARKS WITH SPEC2k JAVA BUSINESS BENCHMARK (JBB).

| Benchmark Task (BIPS) | <b>Correlation</b> $(\rho)$ | pvalue                  |
|-----------------------|-----------------------------|-------------------------|
| ammp                  | 0.6395                      | $3.25 \times 10^{-230}$ |
| applu                 | 0.7523                      | 0                       |
| equake                | 0.9100                      | 0                       |
| gcc                   | 0.9397                      | 0                       |
| gzip                  | 0.8469                      | 0                       |
| mesa                  | 0.8720                      | 0                       |
| twolf                 | 0.8920                      | 0                       |

space exploration.

Note that the performance results of a micro-architecture on various benchmark tasks are strongly correlated. This means that we expect a powerful microprocessor to perform better in all tasks. This correlation is a crucial characteristic of design space exploration that can be exploited by inductive transfer learning. Table 2 shows this phenomenon for our selected SPEC2k benchmark tasks with the SPEC2k jbb benchmark. Note that small *pvalues* in Table 2 shows that the correlation results are statistically significant. In other words, it confirms the hypothesis that there is a strong correlation between the metric performance of the considered tasks.

Table 3 shows the experimental results for design space exploration using transfer learning for different target tasks where the source task is the SPEC2k jbb benchmark. The numbers reported in the table indicate a considerably faster convergence to the desired BIPS. For this experiment,  $\lambda_2=0$  and  $\lambda_1$ , in eq (4), are both set to 0.5 at the beginning of the iterations and linearly decreased to zero through the course of iterations. Also note that each iteration is a query to the synthesis environment, each of which consists of five samples, to create the Gaussian processes.

# B. Gaussian Regression Bootstrapping and Performance Prediction Models

Bootstrap is a resampling technique in statistics that allows one to create simulated data from a given set of samples. In our

TABLE 3
DESIGN SPACE EXPLORATION OF THE SPEC2K BENCHMARKS FOR THE BILLIONS OF INSTRUCTIONS PER SECONDS (BIPS) PERFORMANCE.

|                |       |                   | Number of Samples |  |
|----------------|-------|-------------------|-------------------|--|
|                |       | without           | with              |  |
| Benchmark Task | BIPS  | Transfer Learning | Transfer Learning |  |
| ammp           | 1.302 | 35                | 25                |  |
| applu          | 1.060 | 115               | 100               |  |
| equake         | 1.027 | 15                | 15                |  |
| gcc            | 1.009 | 20                | 15                |  |
| gzip           | 1.005 | 25                | 15                |  |
| mesa           | 2.012 | 25                | 20                |  |
| twolf          | 1.138 | 20                | 15                |  |

proposed method, when the Bayesian models are created using Gaussian processes, it is possible to use such probabilistic methods to create simulated data that closely resemble the behavior of the hardware. Gaussian regression bootstrapping can be used in various contexts such as bioinformatics [23].

The Gaussian regression bootstrapping involves fitting Gaussian process regression to a set of performance data and using this model to resample simulated data to create new statistical models. This allows quantifying the uncertainty in the data and improves the accuracy of the statistical modeling for performance prediction models. The simulated data that is generated by Gaussian regression bootstrapping can be used in a variety of statistical analyses to predict the behavior of the hardware. For instance, we show that Gaussian regression bootstrapping can generate robust simulated data that can be efficiently used in various regression analyses such as linear regression, LASSO, and Random Forest regression.

Here we explore the effectiveness of our proposed active learning multi-model to predict the performance of hardware benchmarks. As mentioned previously in Section IV, the regression setting is substantially different from design space exploration and requires more samples from the actual hardware to provide acceptable models. For instance, in the case of SPEC2k benchmarks, the authors of [18] show that 2000 samples are enough to create performance regression models. However, using our proposed methodology, it is possible to reduce this number to 700 samples using Gaussian regression bootstrapping which is 65% sample reduction.

- 1) Linear Regression: Linear regression is the most basic tool for the statistical modeling of hardware. The underlying assumption, of course, is that the performance of the hardware under study is a linear function of the design parameters. Linear regression can be used to develop performance models of hardware based on tuning various design parameters such as cache size, floating point delay, etc. Table 4 shows the comparison of  $\sqrt{\text{MSE}}/\mu$  for linear regression using 2000 samples suggested in [18] and 700 samples using our proposed Gaussian regression bootstrapping method.
- 2) Random Forest: Random forest [32] is a tree-based regression method that has been used previously for design space exploration of hardware [19]. Given the non-linear nature of hardware design, non-linear regression methods, such as Random Forest, might behave better in such scenarios. In random forest regression, multiple decision trees are created, and each tree chooses a random subset of design parameters

TABLE 4 Comparison of the Predictive power of regression models with or without using Gaussian regression bootstrapping on the SPECT2 $\kappa$  AMMP benchmar $\kappa$ .

|                                           |                           | Regression using          |
|-------------------------------------------|---------------------------|---------------------------|
|                                           | Original                  | Gaussian Bootstrapping    |
|                                           | regression setting        | simulated data            |
|                                           | $\sqrt{\mathrm{MSE}}/\mu$ | $\sqrt{\mathrm{MSE}}/\mu$ |
| Linear Regression Instructions Per Second | 0.056                     | 0.061                     |
| Linear Regression Power Consumption (mW)  | 0.45                      | 0.46                      |
| Random Forest Instructions Per Second     | 0.017                     | 0.055                     |
| Random Forest Power Consumption (mW)      | 0.050                     | 0.14                      |



Fig. 5. (a) Original LASSO model with a shrinkage factor  $\lambda$  for the coefficients of the power consumption model produced using 4000 samples (b) LASSO shrinkage factor  $\lambda$  effect on the coefficients of the power consumption model that is created using 700 samples and Gaussian regression bootstrapping. The comparison shows the extent to which Gaussian regression bootstrapping is efficient in producing simulated data samples to create statistical hardware performance prediction models.

from the training data to fit a statistical model. The main advantage of Random Forest regression over linear regression is that it can model the complex and nonlinear relation between the parameters and the desired performance objective. Table 4 reports that the performance of the random forest regression based on simulated data using the Gaussian regression bootstrapping is close to the original Random Forest regression. This means that the random forest regression on the simulated data and actual data produces similar results.

Note that the  $\sqrt{\text{MSE}}/\mu$  values in Table 4 are reported for the case when the covariance kernel function is squared exponential. We show in **Appendix B** that using other types of covariance kernel (such as Matern covariance kernels) for this specific application has a negligible effect on the reported results.

3) LASSO: The Least Absolute Shrinkage and Selection Operator (LASSO) [33] is a regularization and variable selection method that is used in regression analysis. LASSO is also used in [29] for performance modeling of hardware. Furthermore, LASSO can be used in hardware performance prediction models to identify the most important design parameters by leveraging the variable selection property of the LASSO. The variable selection of the LASSO is particularly important for hardware performance prediction models when working with large datasets with many design parameters since the inclusion of irrelevant parameters in the performance model can lead to over-fitting the hardware performance model. Fig. 5.a shows that as the LASSO shrinkage factor  $\lambda$  increases, the design parameter coefficients shrink to zero. The first coefficient that collapses to zero identifies the least significant parameter in the design space and also the last parameter that collapses to zero

is the most important parameter in the design space. The top axis in this figure shows the number of variables that are not zero at the given value of  $\lambda$ . Fig. 5.b depicts the behavior of the design parameter coefficients for the regression model created from simulated data using Gaussian regression bootstrapping. Comparing Fig. 5.a and Fig. 5.b reveals that the behavior of the Bayesian model created using simulated data and the original LASSO model are similar for different values of the LASSO shrinkage factor,  $\lambda$ . For instance, in both LASSO models, 'Fixed point latency' and 'Number of reservation stations' are the last design parameter coefficients that are shrunk to zero. Also, note a complete list of design parameters is reported in **Appendix A**. This shows that Gaussian regression bootstrapping is efficient in producing simulated data samples to create a reliable statistical model for hardware performance prediction.

### VI. CONCLUSION

We proposed a multi-model Bayesian active learning framework to statistically model digital hardware performance. Our proposed model uses Gaussian processes to characterize the performance of hardware designs. We also use active learning to iteratively guide the sampling process in order to reduce the number of samples needed to create the statistical model. Furthermore, we used Bayesian transfer learning in order to incorporate the prior design knowledge acquired in a source application to better model a target application with fewer data samples in the design space exploration task. We further propose using the Gaussian regression bootstrapping method to reduce the number of samples required for the performance prediction task. To show the effectiveness of our proposed method, we performed design space exploration and performance prediction for various hardware setups, such as micro-architecture design and OpenCL kernels. For OpenCL kernels on FPGA targets, our approach was able to identify the minimum latency in all benchmarks as well as predict the correct Pareto frontier. Our experiments show that our approach can detect the optimal design parameters for processors' micro-architectures with less than 50 samples in most cases. Our experiments also show that the number of samples required to create performance models significantly reduces, by 65%, while maintaining the model's predictive power. We have also performed commonplace regression analysis, such as linear regression, LASSO, and random forest regressions, using simulated data generated by Gaussian regression bootstrapping and demonstrated that our proposed statistical hardware models closely follow the actual hardware behavior in various scenarios.

### VII. ACKNOWLEDGMENTS

The authors would like to thank Dr. Vahid Partovi Nia whose constructive comments improved the presentation and exposition of this manuscript.

### REFERENCES

 K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: Nsga-ii," *IEEE transactions on evolu*tionary computation, vol. 6, no. 2, pp. 182–197, 2002.

- [2] M. Palesi and T. Givargis, "Multi-objective design space exploration using genetic algorithms," in *Proceedings of the tenth international* symposium on Hardware/software codesign, 2002, pp. 67–72.
- [3] K. Kandasamy, J. Schneider, and B. Póczos, "Bayesian active learning for posterior estimation," in *Twenty-Fourth International Joint Confer*ence on Artificial Intelligence, 2015.
- [4] E. Brochu, V. M. Cora, and N. De Freitas, "A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning," arXiv preprint arXiv:1012.2599, 2010.
- [5] G. Skolidis, "Transfer learning with gaussian processes," 2012.
- [6] A. Mehrabi, A. Manocha, B. C. Lee, and D. J. Sorin, "Bayesian optimization for efficient accelerator synthesis," ACM Transactions on Architecture and Code Optimization (TACO), vol. 18, no. 1, pp. 1–25, 2020.
- [7] B. Reagen, J. M. Hernández-Lobato, R. Adolf, M. Gelbart, P. Whatmough, G.-Y. Wei, and D. Brooks, "A case for efficient accelerator design space exploration via bayesian optimization," in 2017 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED). IEEE, 2017, pp. 1–6.
- [8] C. Lo and P. Chow, "Model-based optimization of high level synthesis directives," in 2016 26th International Conference on Field Programmable Logic and Applications (FPL). IEEE, 2016, pp. 1–10.
- [9] S. Mirjalili, S. M. Mirjalili, and A. Lewis, "Grey wolf optimizer," Advances in engineering software, vol. 69, pp. 46–61, 2014.
- [10] A. Ghaffari and Y. Savaria, "Efficient design space exploration of opencl kernels for fpga targets using black box optimization," *IEEE Access*, vol. 9, pp. 136819–136830, 2021.
- [11] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," *Journal of artificial intelligence research*, vol. 4, pp. 237–285, 1996.
- [12] M. de Prado, A. Mundy, R. Saeed, M. Denna, N. Pazos, and L. Benini, "Automated design space exploration for optimised deployment of dnn on arm cortex-a cpus," *IEEE Transactions on Computer-Aided Design* of Integrated Circuits and Systems, 2020.
- [13] A. Ghaffari and Y. Savaria, "Cnn2gate: An implementation of convolutional neural networks inference on fpgas with automated design space exploration," *Electronics*, vol. 9, no. 12, p. 2200, 2020.
- [14] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," in *International conference on machine learning*. PMLR, 2016, pp. 1928–1937.
- [15] A. Hosny, S. Hashemi, M. Shalan, and S. Reda, "Drills: Deep reinforcement learning for logic synthesis," in 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2020, pp. 581–586.
- [16] V. R. Konda and J. N. Tsitsiklis, "Onactor-critic algorithms," SIAM journal on Control and Optimization, vol. 42, no. 4, pp. 1143–1166, 2003.
- [17] S.-C. Kao, G. Jeong, and T. Krishna, "Confuciux: Autonomous hardware resource assignment for dnn accelerators using reinforcement learning," in 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 2020, pp. 622–636.
- [18] B. C. Lee and D. M. Brooks, "Accurate and efficient regression modeling for microarchitectural performance and power prediction," ACM SIGOPS operating systems review, vol. 40, no. 5, pp. 185–194, 2006.
- [19] H.-Y. Liu and L. P. Carloni, "On learning-based methods for design-space exploration with high-level synthesis," in *Proceedings of the 50th annual design automation conference*, 2013, pp. 1–7.
- [20] G. Singh, J. Gómez-Luna, G. Mariani, G. F. Oliveira, S. Corda, S. Stuijk, O. Mutlu, and H. Corporaal, "Napel: Near-memory computing application performance prediction via ensemble learning," in *Proceedings of* the 56th annual design automation conference 2019, 2019, pp. 1–6.
- [21] G. Beltrame, L. Fossati, and D. Sciuto, "Decision-theoretic design space exploration of multiprocessor platforms," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 29, no. 7, pp. 1083–1095, 2010.
- [22] V. Partovi Nia, A. Ghaffari, M. Zolnouri, and Y. Savaria, "Rethinking pareto frontier for performance evaluation of deep neural networks," *International Conference on Machine Learning (ICML) HAET workshop.*, 2022.
- [23] P. D. Kirk and M. P. Stumpf, "Gaussian process regression bootstrapping: exploring the effects of uncertainty in time course data," *Bioinformatics*, vol. 25, no. 10, pp. 1300–1306, 2009.
- [24] C. Audet and W. Hare, "Derivative-free and blackbox optimization,"

- [25] P. Luong, S. Gupta, D. Nguyen, S. Rana, and S. Venkatesh, "Bayesian optimization with discrete variables," in *Australasian Joint Conference* on Artificial Intelligence. Springer, 2019, pp. 473–484.
- [26] M. G. Genton, "Classes of kernels for machine learning: a statistics perspective," *Journal of machine learning research*, vol. 2, no. Dec, pp. 299–312, 2001.
- [27] K. Hayashi, M. Imaizumi, and Y. Yoshida, "On random subsampling of gaussian process regression: A graphon-based analysis," in *International Conference on Artificial Intelligence and Statistics*. PMLR, 2020, pp. 2055–2065.
- [28] J. Bergstra and Y. Bengio, "Random search for hyper-parameter optimization." *Journal of machine learning research*, vol. 13, no. 2, 2012.
- [29] Q. Gautier, A. Althoff, P. Meng, and R. Kastner, "Spector: An opencl fpga benchmark suite," in 2016 International Conference on Field-Programmable Technology (FPT). IEEE, 2016, pp. 141–148.
- [30] P. Ngatchou, A. Zarei, and A. El-Sharkawi, "Pareto multi objective optimization," in *Proceedings of the 13th International Conference on, Intelligent Systems Application to Power Systems*. IEEE, 2005, pp. 84-01.
- [31] "Spec2k: Spec cpu2000 v1.3," https://www.spec.org/cpu2000/, accessed: 2023-01-01.
- [32] A. Liaw, M. Wiener et al., "Classification and regression by randomforest," R news, vol. 2, no. 3, pp. 18–22, 2002.
- [33] R. Tibshirani, "Regression shrinkage and selection via the lasso," *Journal of the Royal Statistical Society: Series B (Methodological)*, vol. 58, no. 1, pp. 267–288, 1996.

TABLE 5
OPENCL MATRIX MULTIPLICATION DESIGN PARAMETERS.

| # | Parameter               |
|---|-------------------------|
| 1 | number of blocks        |
| 2 | Sub-dimension x         |
| 3 | Sub-dimension y         |
| 4 | Manual SIMD x           |
| 5 | Manual SIMD y           |
| 6 | SIMD                    |
| 7 | Number of compute units |
| 8 | Enable Unroll           |
| 9 | Unroll factor           |

TABLE 6
SPEC2K DESIGN PARAMETERS.

| #  | Parameter                      |
|----|--------------------------------|
| 1  | FO4 depth                      |
| 2  | Load/store queue width         |
| 3  | Number of physical registers   |
| 4  | Number of reservation stations |
| 5  | i-L1 cache                     |
| 6  | d-L1 cache                     |
| 7  | L2 cache                       |
| 8  | Control latency                |
| 9  | Floating point latency         |
| 10 | Fixed point latency            |
| 11 | Load/store latency             |
| 12 | Memory latency                 |

### VIII. APPENDIX A: HARDWARE DESIGN PARAMETERS

The HLS design parameters of OpenCL matrix multiplication kernel are listed in Table 5. Additionally, the design parameters of SPEC2k benchmark are reported in Table 6.

### IX. APPENDIX B: BAYESIAN OPTIMIZATION USING MATERN COVARIANCE KERNEL

In Bayesian optimization, choosing a proper covariance kernel is essential for accuracy of the algorithm. Matern kernels are a class of stationary covariance function that depends on two hyper-parameters known as the length scale l and the smoothness  $\nu$ . By choosing a suitable value for  $\nu$  the Bayesian model can capture the smoothness and roughness of the objective function. The Matern class of covariance kernels is defined as

$$\mathbf{K}_{ij} = k_{\text{Matern}}(x_i, x_j)$$

$$= \frac{1}{2^{\nu - 1} \Gamma(\nu)} \left( \frac{\sqrt{2\nu} |x_i - x_j|}{l} \right)^{\nu} H_{\nu} \left( \frac{\sqrt{2\nu} |x_i - x_j|}{l} \right),$$
(8)

where  $\Gamma$  is the Gamma function and  $H_{v}$  is modified Bessel function of the second kind of order  $\nu$ . A common choice for  $\nu$  are  $\frac{3}{2}, \frac{5}{2}$  and also when  $\nu \to \infty$  the Matern kernels becomes the squared exponential kernel [26].

We repeated the experiments in Table 4 using Matern covariance kernels with  $\nu \in \left\{\frac{3}{2}, \frac{5}{2}\right\}$  and reported the results in Table 7. The reported  $\sqrt{\text{MSE}}/\mu$  in Table 7 shows that our method exhibits a robust behavior in terms of choosing various covariance kernels for modeling SPECT2k ammp benchmark. Note that the choice of covariance kernel depends on the nature of the data and can vary in different applications.

TABLE 7

Comparison of Matern Kernels vs squared exponential Kernel on Gaussian regression bootstrapping on the SPECT2k AMMP BENCHMARK.

| Covariance Kernel                         |                           | Squared Exponential       | Matern ( $\nu = 3/2$ )    | $(Matern \ \nu = 5/2)$    |
|-------------------------------------------|---------------------------|---------------------------|---------------------------|---------------------------|
|                                           | Original                  | Gaussian Bootstrapping    | Gaussian Bootstrapping    | Gaussian Bootstrapping    |
|                                           | $\sqrt{\mathrm{MSE}}/\mu$ | $\sqrt{\mathrm{MSE}}/\mu$ | $\sqrt{\mathrm{MSE}}/\mu$ | $\sqrt{\mathrm{MSE}}/\mu$ |
| Linear Regression Instructions Per Second | 0.056                     | 0.061                     | 0.061                     | 0.062                     |
| Linear Regression Power Consumption (mW)  | 0.45                      | 0.46                      | 0.46                      | 0.46                      |
| Random Forest Instructions Per Second     | 0.017                     | 0.055                     | 0.052                     | 0.056                     |
| Random Forest Power Consumption (mW)      | 0.050                     | 0.14                      | 0.13                      | 0.13                      |