1. Introduction
In recent years, the rapid advancement of integrated circuit technology has led to a significant increase in the complexity of very large-scale integrated (VLSI) circuits. According to Moore’s law [1], the number of transistors on a chip doubles roughly every 18 months, and the scale and complexity of integrated circuits continue to grow with advances in semiconductor technology. Faced with designs of this scale, traditional manual design methods can no longer meet growing design demands, and electronic design automation (EDA) [2] technology has become indispensable. Within the VLSI design flow, physical design [3] plays a crucial role and lies at the core of EDA technology. The floorplanning phase [4], as a critical part of the physical design flow, not only determines the area and overall layout of the chip and influences the subsequent routing, but also largely dictates the final performance of the entire circuit. The VLSI floorplanning problem is a classic NP-hard problem with a significant impact on performance metrics such as circuit delay, power consumption, congestion, and reliability [5]. Despite being a classical problem [6] addressed by many previous algorithms, block placement continues to pose significant challenges [7].
As the first stage of the physical design flow, the quality of floorplanning significantly impacts the subsequent placement and routing. Generally, research on floorplanning can be divided into two categories. One category is based on planar graph representations. For floorplans with slicing structures [8], binary trees are widely used, where leaves correspond to blocks and internal nodes define the vertical or horizontal merge operations of their respective descendants. For more general non-slicing floorplans, several effective representations have been developed, including sequence pairs (SP) [9], the bounded slicing grid (BSG) [10], O-trees [11], transitive closure graphs with packed sequences (TCG-S) [12], and B*-trees [13]. Among these, the sequence pair representation, which uses a positive and a negative sequence to encode the geometric relationships between any two modules, has been extended in subsequent work to handle obstacles [14], soft modules, rectilinear blocks, and analog floorplans [15,16,17,18]. The decoding time complexity of the original sequence pair representation is O(N²). To reduce this complexity, Tang et al. [19] utilized the longest common subsequence algorithm to decrease the decoding complexity to O(N log N). Subsequently, Tang and Wong [20] proposed an enhanced Fast Sequence Pair (FSP) algorithm, further reducing the decoding time complexity to O(N log log N). The other category concerns floorplanning algorithms themselves. By employing suitable planar graph representations and/or efficient perturbation methods, high-quality floorplans can be obtained through linear programming [21] or metaheuristics such as simulated annealing (SA) [22,23], genetic algorithms (GA) [24,25], memetic algorithms (MA) [26], and ant colony optimization [27].
Despite decades of research on the VLSI floorplanning problem, existing studies indicate that current EDA floorplanning tools still struggle to produce near-optimal floorplans. These tools face numerous limitations, making it challenging to obtain satisfactory design outcomes: they generally require long runtimes, and experienced experts may spend weeks designing integrated circuit floorplans. Furthermore, such tools have limited scalability and often require time-consuming redesign when faced with new problems or different constraints. Reinforcement learning (RL) [28] provides a promising direction to address these challenges. Reinforcement learning possesses autonomy and generalization capabilities: through interactions with the environment, the agent automatically extracts knowledge about the space it operates in. In addition to breakthroughs in gaming [29] and robot control [30], reinforcement learning has been applied to combinatorial optimization problems. Ref. [31] proposed deep reinforcement learning (DRL) for solving the Traveling Salesman Problem (TSP), and significant progress has also been made in applying reinforcement learning to task scheduling [32], vehicle routing problems [33], graph coloring [34], and more. Recently, integrating reinforcement learning into EDA has become a trend. For example, the Google team [35] formulated macro-module placement as a reinforcement learning problem and trained an agent using reinforcement learning algorithms to place macro-modules on chips. He et al. [36] utilized the Q-learning algorithm to train an agent that selects the best neighboring solution at each search step. Cheng et al. [37] introduced cooperative learning to address floorplanning and routing problems in chip design. Agnesina et al. [38] proposed a deep reinforcement learning method for VLSI placement parameter optimization. Vashisht et al. [39] combined iterative reinforcement learning with simulated annealing to place modules. Xu et al. [40] employed graph convolutional networks and reinforcement learning for floorplanning under fixed-outline constraints.
This paper proposes a deep reinforcement learning-based floorplanning algorithm that uses the sequence pair representation and aims to optimize the area and wirelength of the floorplan. To evaluate the effectiveness of our algorithm, we conduct experiments on the internationally recognized MCNC and GSRC benchmark circuits, comparing our approach with simulated annealing and the deep Q-learning algorithm proposed by He et al. [36]. In terms of dead space on the MCNC benchmark circuits, our algorithm outperforms simulated annealing and the method of [36] by an average of 2.7% and 1.1%, respectively, and it improves wirelength by an average of 9.1% over simulated annealing. On the GSRC benchmark circuits, our algorithm improves dead space by an average of 7.0% and 3.7% compared with simulated annealing and [36], respectively, and improves wirelength by an average of 8.8% over simulated annealing. These results validate the superior performance and robustness of our algorithm in handling ultra-large-scale circuit designs.
2. Description of Floorplanning Problem
Generally, floorplanning involves determining the relative positions of modules. Let B = {bi|1 ≤ i ≤ n} be a set of rectangular modules, where each module bi has a specified width wi and height hi. N = {ni|1 ≤ i ≤ m} represents a netlist that describes the connections between modules. The goal of floorplanning is to assign a set of coordinates to each module bi, while ensuring that no two modules overlap.
Let (xi, yi) denote the coordinates of the bottom-left corner of module bi. The floorplan area A is defined as the minimum rectangular area that encompasses all modules, and it can be calculated as follows:
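A standard bounding-box form consistent with these definitions is assumed here:

A = \left(\max_{1 \le i \le n}(x_i + w_i) - \min_{1 \le i \le n} x_i\right)\left(\max_{1 \le i \le n}(y_i + h_i) - \min_{1 \le i \le n} y_i\right) \quad (1)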
We employ the widely used Half-Perimeter Wirelength (HPWL) model [41] to estimate the total wirelength, which is defined as follows:
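Taking module centers (x_i^c, y_i^c) as pin locations, which is an assumption here, the total wirelength over the netlist N is

W = \sum_{n_k \in N} \left( \max_{b_i \in n_k} x_i^{c} - \min_{b_i \in n_k} x_i^{c} + \max_{b_i \in n_k} y_i^{c} - \min_{b_i \in n_k} y_i^{c} \right) \quad (2)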
Based on the optimization objective defined by the minimum rectangle area A and the wirelength W, the formulation is as follows:
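Matching the weighted-sum description that follows, the objective is assumed to take the form

F = \alpha A + \beta W \quad (3)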
Here, F denotes the cost of a feasible floorplan, defined as the weighted sum of the total area A and the total wirelength W. The coefficients α and β are weight factors ranging from 0 to 1.
3. Sequence Pair Representation
Tamarana et al. [9] proposed a graph-encoding method called the sequence pair (SP) for encoding non-slicing floorplans. Given a non-slicing floorplan with n modules, a sequence pair consists of a positive sequence Γ+ and a negative sequence Γ−, which together contain all the information about which subsets of modules are located above, below, to the right, and to the left of a given module. Through an analysis of different representation methods, we believe that the sequence pair has unique advantages over other floorplan representations. Firstly, the sequence pair representation is concise and easy to understand, making it highly suitable for integration with reinforcement learning to jointly solve floorplanning problems. Secondly, it can represent the complete solution space and has a one-to-one correspondence with non-slicing floorplans, allowing a non-slicing floorplan to be uniquely reconstructed from it. Lastly, compared with other methods, the fast sequence pair algorithm significantly reduces the decoding time complexity.
3.1. Properties of Sequence Pair
A sequence pair describes the relative order of modules using two sequences. Both sequences consist of the same set of module names, but the order of the names differs between the positive sequence Γ+ and the negative sequence Γ−. For a given sequence pair (Γ+, Γ−), there are four possible positional relationships between any two modules, bi and bj:
- (1)
If bi is positioned before bj in Γ+, i.e., <... bi ... bj ...>, and bi is also positioned before bj in Γ−, i.e., <... bi ... bj ...>, then bi is located to the left of bj.
- (2)
If bj is positioned before bi in Γ+, i.e., <... bj ... bi ...>, and bj is also positioned before bi in Γ−, i.e., <... bj ... bi ...>, then bi is located to the right of bj.
- (3)
If bi is positioned before bj in Γ+, i.e., <... bi ... bj ...>, and bj is positioned before bi in Γ−, i.e., <... bj ... bi ...>, then bi is located above bj.
- (4)
If bj is positioned before bi in Γ+, i.e., <... bj ... bi ...>, and bi is positioned before bj in Γ−, i.e., <... bi ... bj ...>, then bi is located below bj.
As an example, Figure 1 shows an inclined grid representing the relative positions between modules for the sequence pair (Γ+, Γ−) = (<4, 3, 1, 6, 2, 5>, <6, 3, 5, 4, 1, 2>).
From the figure, it can be observed that all modules satisfy the requirements of the sequence pair. In fact, for any given sequence pair, the position of each module can be determined efficiently by calculating the weighted longest common subsequence (LCS), with a time complexity of O(n log log n).
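As a concrete illustration of the four cases above, the short Python sketch below (illustrative only; the function name, data layout, and example call are ours, not from the original) determines the relation of module bi to module bj from their indices in the two sequences:

def sp_relation(gamma_pos, gamma_neg, bi, bj):
    """Return the position of module bi relative to module bj for the
    sequence pair (gamma_pos, gamma_neg), following cases (1)-(4) above."""
    p = {m: k for k, m in enumerate(gamma_pos)}   # index of each module in the positive sequence
    n = {m: k for k, m in enumerate(gamma_neg)}   # index of each module in the negative sequence
    if p[bi] < p[bj] and n[bi] < n[bj]:
        return "left of"      # case (1)
    if p[bi] > p[bj] and n[bi] > n[bj]:
        return "right of"     # case (2)
    if p[bi] < p[bj] and n[bi] > n[bj]:
        return "above"        # case (3)
    return "below"            # case (4)

# For the sequence pair of Figure 1, the stated rules give, e.g.:
# sp_relation([4, 3, 1, 6, 2, 5], [6, 3, 5, 4, 1, 2], 4, 6)  ->  "above"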
3.2. Sequence Pair Representation Floorplan
To obtain a floorplan from a sequence pair, we first construct the geometric constraint graphs corresponding to the sequence pair. A constraint graph consists of a set of vertices (V) and a set of edges (E), where V contains a node for each module, together with an added source node S and an added receiving node T. A horizontal constraint graph (HCG) and a vertical constraint graph (VCG) can be constructed based on the positional relationships of the modules. The specific steps for construction are as follows:
- (1)
For a given module x in the sequence pair (Γ+, Γ−), the modules that appear before x in both Γ+ and Γ− are positioned to the left of x in the floorplan; the modules that appear after x in both Γ+ and Γ− are positioned to the right of x; the modules that appear after x in Γ+ and before x in Γ− are positioned below x; and the modules that appear before x in Γ+ and after x in Γ− are positioned above x.
- (2)
Next, we construct the horizontal constraint graph as a directed graph based on the left and right relationships, where a directed edge E(a, b) indicates that module a is positioned to the left of module b. We add a source node S with edges to the nodes of the horizontal constraint graph and a receiving node T with edges from those nodes. With each edge weighted by the width of the module it leaves (edges leaving S have weight zero), the longest path length from the source node S to a node gives the x coordinate of the corresponding module in the floorplan.
- (3)
By computing the longest path length from the source node S to the added receiving node T, we obtain the width of the floorplan. Similarly, we construct the vertical constraint graph based on the above and below relationships, and calculate the y coordinates of the modules and the height of the floorplan in the same manner.
Figure 2 illustrates the horizontal and vertical constraint graphs constructed for the sequence pair (Γ+, Γ−) = (<4, 3, 1, 6, 2, 5>, <6, 3, 5, 4, 1, 2>) as an example.
Thus, by constructing the horizontal and vertical constraint graphs and calculating the longest path lengths in both directions, we can determine the width and height of the minimum bounding rectangle of the floorplan, and hence the size and position of every module in the non-slicing floorplan. A sketch of this decoding step is given below.
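The following Python sketch (our own illustration under the conventions above, not the authors' implementation) evaluates both constraint graphs by a simple O(n²) longest-path relaxation; the O(n log log n) weighted-LCS method mentioned earlier is faster but more involved:

def decode_sequence_pair(gamma_pos, gamma_neg, width, height):
    """Place the modules of a sequence pair by longest-path evaluation of the
    horizontal and vertical constraint graphs (simple O(n^2) relaxation).
    width and height are dicts mapping each module id to its dimensions.
    Returns the bottom-left coordinates x, y of every module and the
    overall floorplan width W and height H."""
    p = {m: k for k, m in enumerate(gamma_pos)}    # index in the positive sequence
    n = {m: k for k, m in enumerate(gamma_neg)}    # index in the negative sequence

    # Horizontal constraint graph: "a left of b" <=> a precedes b in both
    # sequences, so gamma_pos is a topological order of the HCG.
    x = {m: 0 for m in gamma_pos}
    for j, b in enumerate(gamma_pos):
        for a in gamma_pos[:j]:
            if n[a] < n[b]:                        # a is to the left of b
                x[b] = max(x[b], x[a] + width[a])

    # Vertical constraint graph: "a below b" <=> a follows b in gamma_pos and
    # precedes b in gamma_neg, so gamma_neg is a topological order of the VCG.
    y = {m: 0 for m in gamma_neg}
    for j, b in enumerate(gamma_neg):
        for a in gamma_neg[:j]:
            if p[a] > p[b]:                        # a is below b
                y[b] = max(y[b], y[a] + height[a])

    W = max(x[m] + width[m] for m in gamma_pos)    # longest S -> T path in the HCG
    H = max(y[m] + height[m] for m in gamma_pos)   # longest S -> T path in the VCG
    return x, y, W, H

Calling decode_sequence_pair with the sequence pair of Figure 2 and the module dimensions returns the packed coordinates of every module together with the floorplan width and height used in Equation (1).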
4. Reinforcement Learning
Reinforcement learning is a machine learning approach aimed at learning how to make decisions to achieve specific goals through interaction with the environment. In reinforcement learning, an agent observes the state of the environment, selects appropriate actions, and continuously optimizes its strategy based on feedback from the environment regarding its actions. This feedback is typically provided in the form of rewards [42,43] or penalties, and the agent’s objective is to learn the optimal strategy by maximizing the long-term cumulative reward.
Almost all reinforcement learning problems can be formulated within the framework of Markov Decision Processes (MDPs). A typical MDP, as shown in Figure 3, consists of four key elements:
- (1)
States S: a finite set of environmental states.
- (2)
Actions A: a finite set of actions taken by the reinforcement learning agent.
- (3)
State transition model P(s, a, s′): representing the probability of transitioning from state s ∈ S to the next state s′ ∈ S when action a ∈ A is taken.
- (4)
Reward function R(s, a): representing the numerical reward for taking action a ∈ A in state s ∈ S. This reward can be positive, negative, or zero.
The goal of an MDP is to find a policy π that maximizes the total accumulated numerical reward. The expression for the total cumulative reward is as follows:
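A standard discounted form is assumed here:

G = \sum_{t=0}^{T} \gamma^{t} r_{t} \quad (4)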
where γ represents the reward discount factor, t denotes the time step, and rt represents the reward value at time step t. The state value function Vπ(s) of an MDP is defined as the expected reward value of state s under policy π, as defined in Equation (5).
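In its standard form, assumed here,

V_{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t} r_{t} \,\middle|\, s_{0} = s\right] \quad (5)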
In this context, Eπ represents the expected value of the reward under policy π. Similarly, the state–action value function Qπ(s, a) is the expected reward value when action a is taken in state s under policy π, defined as follows:
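Analogously to Equation (5), the standard form is assumed:

Q_{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} \gamma^{t} r_{t} \,\middle|\, s_{0} = s,\ a_{0} = a\right] \quad (6)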
4.1. The MDP Framework for Solving Floorplanning Problems
In floorplanning problems, the reinforcement learning agent interacts with the environment by selecting perturbations that iteratively generate new floorplan solutions. The objective is to minimize the total area and total wirelength, and the corresponding cost reduction serves as the reward that encourages the agent to learn better strategies and ultimately find an optimal floorplan solution. To explore better floorplan solutions, the following MDP is defined:
- (1)
State space S: for the floorplanning problem, a state s ∈ S represents a floorplan solution, consisting of a complete sequence pair (Γ+, Γ−) and the orientation of each module.
- (2)
Action space A: A neighboring solution of a floorplan is generated by predefined perturbations in the action space. The following five perturbations are defined:
- (a)
Swap any two modules in the Γ+ sequence.
- (b)
Swap any two modules in the Γ− sequence.
- (c)
Swap one module from the Γ+ sequence with one module from the Γ− sequence.
- (d)
Randomly move a module to a new position in both Γ+ and Γ−.
- (e)
Rotate any module in the sequence pair by 90°.
- (3)
State transition P: given a state, applying any of the above perturbations deterministically moves the agent to another state, which simplifies the probabilistic setting of the MDP.
- (4)
Reward R: Allocating rewards for actions taken in a state is crucial in reinforcement learning. In this floorplanning problem, the objective is to minimize the area and wirelength. Thus, the reward is assigned as the reduction in the objective cost. A positive reward is assigned whenever the agent discovers a better solution, while no reward is assigned otherwise. The reward function is defined as follows:
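Based on this description, the reward is assumed to be the cost reduction of Equation (3) when the new solution is better, and zero otherwise:

r(s, a) = \begin{cases} F(s) - F(s'), & \text{if } F(s') < F(s) \\ 0, & \text{otherwise} \end{cases} \quad (7)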
The reward r in this context refers to a local reward, representing the reward value obtained when the current floorplan transitions from state s to state s′ through a perturbation. Here, F represents the optimization objective function defined in Equation (3).
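To make the transition and reward concrete, the Python sketch below (our own illustration; the state layout (Γ+, Γ−, rotated-module set), the helper names, and the simplified handling of perturbations (c) and (d) are assumptions, not the authors' code) applies one of the five perturbations and computes the reward of Equation (7):

import copy
import random

def perturb(state, action):
    """Apply one of the five perturbations (a)-(e) to a floorplan state.
    Illustrative sketch: the state is assumed to be (gamma_pos, gamma_neg, rotated),
    where rotated is the set of modules turned by 90 degrees; perturbations (c)
    and (d) are implemented in a simplified, permutation-preserving form."""
    gamma_pos, gamma_neg, rotated = copy.deepcopy(state)
    i, j = random.sample(range(len(gamma_pos)), 2)
    if action == 0:                        # (a) swap two modules in gamma_pos
        gamma_pos[i], gamma_pos[j] = gamma_pos[j], gamma_pos[i]
    elif action == 1:                      # (b) swap two modules in gamma_neg
        gamma_neg[i], gamma_neg[j] = gamma_neg[j], gamma_neg[i]
    elif action == 2:                      # (c) swap a module picked from gamma_pos with one picked
        a, b = gamma_pos[i], gamma_neg[j]  #     from gamma_neg, exchanging them in both sequences
        if a != b:
            for seq in (gamma_pos, gamma_neg):
                ia, ib = seq.index(a), seq.index(b)
                seq[ia], seq[ib] = seq[ib], seq[ia]
    elif action == 3:                      # (d) move one module to a new position in both sequences
        m = gamma_pos.pop(i)
        gamma_pos.insert(j, m)
        gamma_neg.remove(m)
        gamma_neg.insert(random.randrange(len(gamma_neg) + 1), m)
    else:                                  # (e) rotate one module by 90 degrees
        rotated ^= {gamma_pos[i]}          #     (toggle its membership in the rotated set)
    return gamma_pos, gamma_neg, rotated

def reward(cost_old, cost_new):
    """Reward of Equation (7): the cost reduction when the neighbour is better, else 0."""
    return max(cost_old - cost_new, 0.0)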
4.2. Deep Reinforcement Learning Algorithm
After defining the MDP framework for the floorplanning problem, we train the agent with a deep reinforcement learning algorithm. In this paper, we employ a policy gradient (PG) algorithm based on the Actor–Critic (AC) architecture, which is a model-free, policy-based algorithm. Compared with value-based methods, policy-based methods tend to converge faster. The policy gradient method uses gradient-based updates to optimize the policy π, which is parameterized by θ. We define an objective function J(θ), shown in Equation (8), representing the expected discounted total reward obtained by following the parameterized policy πθ. Hence, our objective is to learn a θ that maximizes J(θ).
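Equation (8) is assumed to take the standard expected-return form:

J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[ R(\tau) \right] \quad (8)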
Policy gradient algorithms are a type of Monte Carlo method: they learn from episodes, i.e., sequences of steps taken by the agent during its exploration, without prior knowledge of the transition function of the MDP. The parameter vector θ is learned through a deep neural network, referred to as the policy network, whose weights represent the θ parameters. The policy network, or agent, is a deep neural network with a set of hidden layers; the activation function of its last layer is a softmax, since the network’s output is a probability distribution over the actions of the environment. The policy gradient algorithm updates the parameters θ in the direction of actions with the highest rewards, and the weights are updated using the computed gradient as follows:
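A standard gradient-ascent update is assumed here:

\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta) \quad (9)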
where α represents the learning rate. Taking the derivative of Equation (8) yields Equation (10), which is used to update the values of the θ parameters and is defined as follows:
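The standard REINFORCE form of this gradient is assumed here:

\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\!\left[ R(\tau) \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_{t} \mid s_{t}) \right] \quad (10)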
The expectation is taken over trajectories τ sampled from the policy πθ, and R(τ) represents the reward accumulated over a single episode.
To train the policy network, the deep reinforcement learning algorithm shown in Algorithm 1 is employed. At each step of an episode, the policy network predicts a probability distribution over the actions available in the environment, given the state description. The state, action, reward, and next state at each step of the episode are recorded, and the discounted rewards are used to calculate the gradient and update the weights of the policy network.
Algorithm 1: Deep Reinforcement Learning Algorithm
Input: number of episodes, number of steps
Output: policy π
1: Initialize θ (policy network weights) randomly
2: for e in episodes do
3:     for s in steps do
4:         Perform an action as predicted by the policy network
5:         Record s, a, r, s′
6:         Calculate the gradient as per Equation (10)
7:     end for
8:     Update θ as per Equation (9)
9: end for
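A minimal PyTorch sketch of this training loop is given below; it is our own illustration rather than the authors' implementation, the environment interface env.reset()/env.step() and the numeric state encoding are assumptions, and it shows the plain policy-gradient update of Equations (9) and (10) without a critic component.

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Policy network: hidden layers followed by a softmax over the five perturbation actions."""
    def __init__(self, state_dim, n_actions=5, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions), nn.Softmax(dim=-1),
        )

    def forward(self, state_vec):
        return self.net(state_vec)

def train(env, state_dim, episodes=500, steps=100, gamma=0.99, lr=1e-3):
    """env is an assumed interface: reset() returns a numeric state vector and
    step(action) returns (next state vector, reward)."""
    policy = PolicyNet(state_dim)
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(episodes):
        state, log_probs, rewards = env.reset(), [], []
        for _ in range(steps):
            probs = policy(torch.as_tensor(state, dtype=torch.float32))
            dist = torch.distributions.Categorical(probs)
            action = dist.sample()                      # sample a perturbation (line 4 of Algorithm 1)
            state, r = env.step(action.item())          # record s, a, r, s' (line 5)
            log_probs.append(dist.log_prob(action))
            rewards.append(r)
        returns, G = [], 0.0                            # discounted returns G_t = sum_k gamma^k r_{t+k}
        for r in reversed(rewards):
            G = r + gamma * G
            returns.insert(0, G)
        returns = torch.tensor(returns)
        loss = -(torch.stack(log_probs) * returns).sum()    # negative of the Equation (10) estimate (line 6)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                                # update theta as in Equation (9) (line 8)
    return policy

The agent samples one of the five perturbation actions at each step, and the discounted returns weight the log-probabilities in the gradient, mirroring lines 4-6 and 8 of Algorithm 1.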
After numerous experiments for parameter tuning and optimization, the hyperparameter settings of the algorithm in this paper are presented in Table 1.