
Large Language Model for Verilog Generation with Golden Code Feedback

Ning Wang1∗, Bingkun Yao1, Jie Zhou2, Xi Wang2, Zhe Jiang2†, Nan Guan1†
1City University of Hong Kong   2Southeast University
Abstract

Recent advancements in large language models (LLMs) have catalyzed significant interest in the automatic generation of Register-Transfer Level (RTL) code, particularly Verilog, from natural language instructions. While commercial LLMs like ChatGPT have dominated this domain, open-source alternatives have lagged considerably in performance, limiting the flexibility and data privacy of this emerging technology. This study introduces a novel approach utilizing reinforcement learning with golden code feedback to enhance the performance of pre-trained models. Leveraging open-source data and base models, we have achieved state-of-the-art (SOTA) results by a substantial margin. Notably, our 6.7B parameter model VeriSeek demonstrates superior performance compared to current best-in-class 13B and 16B models. Furthermore, through a comprehensive analysis of the limitations in direct fine-tuning and the training dynamics of reinforcement learning, we posit that the development of comprehensive supervisory signals that are aligned with the inherent parallel semantics of Verilog code is critical to effective generation. The code and data associated with this research are publicly available at https://github.com/CatIIIIIIII/veriseek. The model weights can be accessed at https://huggingface.co/WANGNingroci/VeriSeek.

∗ Email: nwang227-c@my.cityu.edu.hk
† Corresponding authors.

1 Introduction

In recent years, the field of natural language processing (NLP) has witnessed a paradigm shift with the advent of large language models (LLMs), exemplified by GPT [2]. These models have demonstrated unprecedented capabilities in various linguistic tasks. Inspired by this remarkable progress, researchers in the domain of hardware design have begun to explore the potential of LLMs in revolutionizing hardware development methodologies.

Among diverse applications, the automatic generation of RTL designs based on natural language specifications has emerged as a particularly promising and widely studied direction. This task aims to transform high-level functional descriptions in natural language into fully-fledged Hardware Description Language (HDL) code, such as Verilog or VHDL, ab initio. In contrast to the well-established predictive machine learning (ML) methodologies in EDA, these generative techniques offer a more direct and potentially transformative impact on the hardware design and optimization process.

A subset of the current literature has primarily concentrated on the development and refinement of prompt engineering methodologies, leveraging commercial Large Language Models (LLMs) such as GPT. The utilization of commercial LLMs, while offering immediate access to powerful language processing capabilities, inherently constrains the degree of customization and adaptation possible for domain-specific tasks such as RTL generation. This approach, while expedient, may not fully address the unique challenges and nuances inherent in hardware description languages and digital design paradigms. Furthermore, the black-box nature of these commercial models precludes comprehensive analysis of their internal mechanisms, limiting the potential for targeted improvements in the realm of hardware design automation.

The adoption of open-source Large Language Models (LLMs) for hardware design automation presents compelling advantages over closed-source commercial solutions, such as GPT, in both research and practical applications. From a research perspective, open-source LLMs facilitate unrestricted scientific inquiry, enabling in-depth studies and customizations of this emerging technique. In practical applications, open-source solutions address critical data privacy concerns.

The development of high-performance open-source RTL generation models is currently hindered by the scarcity of high-quality circuit design data for training. This challenge stems from the proprietary nature of organized design data held by semiconductor companies and the inadequacy of publicly available data, which is often unstructured and of poor quality. Furthermore, existing open-source models rely on simplistic pre-training and fine-tuning approaches that are highly dependent on dataset quality. To address these limitations, there is an urgent need to explore more sophisticated methodologies capable of effectively leveraging limited data resources, thereby advancing the field of open-source RTL generation.

Our work presents a novel approach for training language models to generate Verilog code using Proximal Policy Optimization (PPO). The method introduces a reward function based on Abstract Syntax Tree (AST) similarity between generated and golden code. We use Pyverilog to parse code into ASTs and design an algorithm to compute AST similarity scores. The reward function penalizes invalid code and rewards syntactically correct code based on its AST similarity to the golden standard. This approach allows for a more semantically meaningful evaluation of generated Verilog code, focusing on structural similarities rather than token-level matches, potentially improving the quality and correctness of the generated code. The contributions of our work can be summarized as follows:

  • Leveraging entirely open-source models and data, VeriSeek achieves state-of-the-art results with a significant margin. Notably, our 6.7B model outperforms the current best 13B and 16B models.

  • To the best of our knowledge, VeriSeek is the first to employ reinforcement learning to enhance model performance and validate its effectiveness. It also demonstrates the efficacy of using Abstract Syntax Trees as a supervisory signal.

  • Through an analysis of the training dynamics of reinforcement learning, we posit that for Verilog code generation tasks, designing document-level training objectives can better align the model with the parallel logic of hardware description languages, thereby achieving superior results.

2 Method

2.1 Continual Pre-training Utilizing C-Language Integrated Dataset

In the process of continual pre-training, we utilized the publicly available VGen dataset, as developed by [15]. VGen aggregates Verilog repositories from GitHub, systematically filters out duplicates and excessively large files, and retains only those files containing module and endmodule statements. Additionally, it extracts text from 70 Verilog textbooks using pymuPDF, subsequently cleans the data, and retains only the verified Verilog syntax. Following this, we further refine the dataset to include only Verilog code, resulting in a final cleaned VGen dataset of approximately 200MB in size.
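The filtering steps described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the VGen pipeline itself: the function name `filter_verilog_files`, exact-match deduplication by content hash, and the `max_bytes` threshold are all hypothetical, since the text does not say how duplicates are detected or what counts as excessively large.

```python
import hashlib
import re

# Whole-word matches, so "endmodule" alone does not count as "module".
MODULE_RE = re.compile(r"\bmodule\b")
ENDMODULE_RE = re.compile(r"\bendmodule\b")


def filter_verilog_files(files, max_bytes=100_000):
    """Keep unique, reasonably sized sources that contain a module definition.

    `files` maps a path to the file's text. `max_bytes` is a hypothetical
    cut-off; the source text does not give the actual size limit.
    """
    seen_digests = set()
    kept = {}
    for path, text in files.items():
        data = text.encode("utf-8")
        if len(data) > max_bytes:
            continue  # excessively large file
        if not (MODULE_RE.search(text) and ENDMODULE_RE.search(text)):
            continue  # no module/endmodule statements
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen_digests:
            continue  # exact duplicate of a file already kept
        seen_digests.add(digest)
        kept[path] = text
    return kept
```

Hashing the content catches only byte-identical copies; near-duplicate detection would need a different technique.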

During implementation, we discovered that C programs could assist the model in understanding and generating Verilog code. This intriguing phenomenon was also observed and utilized by [10], who treated the corresponding C code as a functional description. Consequently, we incorporated the CodeSearchNet dataset [5], which contains approximately 40MB of function code with accompanying documentation.

2.2 Reinforcement Learning with Golden Code Feedback

2.2.1 Dataset

We gathered high-quality specification-code pairs from Opencores [3], a community aimed at developing digital open-source hardware using electronic design automation (EDA). We then filtered out data instances exceeding 4096 characters in length and those that could not be parsed into Abstract Syntax Trees (AST). The final dataset comprises approximately 800 data instances.
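A minimal sketch of this filtering step is given below. It assumes that the 4096-character limit applies to the code of each pair (the text does not say whether the specification is counted), and it takes the parser as a caller-supplied predicate `can_parse`, a hypothetical stand-in for an attempt to build a Pyverilog AST.

```python
def filter_pairs(pairs, can_parse, max_chars=4096):
    """Keep (specification, code) pairs that are short enough and parsable.

    `can_parse` is a hypothetical callable returning True when the code can
    be turned into an AST. Applying `max_chars` to the code only is an
    assumption made for this sketch.
    """
    kept = []
    for spec, code in pairs:
        if len(code) > max_chars:
            continue  # instance exceeds the length limit
        if not can_parse(code):
            continue  # instance cannot be parsed into an AST
        kept.append((spec, code))
    return kept
```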

2.2.2 Proximal Policy Optimization

We formulate Verilog program generation as a reinforcement learning (RL) problem, aiming to further align the pre-trained model $\pi_{\theta}$ with golden code preferences. Following the Reinforcement Learning from Human Feedback (RLHF) procedure proposed in [9; 18], we fine-tune the pre-trained model on our environment using Proximal Policy Optimization (PPO, [12]). The environment is a bandit environment that presents a designer specification and expects a response to it. Given the generated code and the golden code, it produces a reward determined by the reward function and ends the episode. In addition, we add a per-token KL penalty from the pre-trained model at each token to mitigate over-optimization.

Consider a large language model (LLM) as a policy $\pi_{\theta}(\mathbf{\hat{y}} \mid \mathbf{x})$ parameterized by $\theta$. The policy $\pi_{\theta}$ receives a user instruction $\mathbf{x} \in \mathcal{X}$ and generates a text response $\mathbf{\hat{y}} \in \mathcal{Y}$. Here we only consider single-turn conversations to simplify notation and modelling. Given a specification $\mathbf{x}$, the LLM $\pi_{\theta}$ generates a response $\mathbf{\hat{y}}$ in an auto-regressive manner:

$$\pi_{\theta}(\mathbf{\hat{y}} \mid \mathbf{x}) = \prod_{t} \pi_{\theta}(\hat{y}_{t} \mid \mathbf{x}, \mathbf{\hat{y}}_{<t}), \tag{1}$$

where $\hat{y}_{t}$ is the $t$-th token in the response and $\mathbf{\hat{y}}_{<t}$ represents the tokens in the response before $\hat{y}_{t}$. The objective function between the generated code and the golden code is:

$$J_{r}(\pi_{\theta}) = \mathbb{E}_{\mathbf{x} \sim p_{\text{data}},\, \mathbf{\hat{y}} \sim \pi_{\theta}} \left[ r(\mathbf{\hat{y}}, \mathbf{y}) - \beta \log \frac{\pi_{\theta}(\mathbf{\hat{y}} \mid \mathbf{x})}{\pi_{\text{ref}}(\mathbf{\hat{y}} \mid \mathbf{x})} \right], \tag{2}$$

where $r$ represents the reward function that reflects golden code preferences. It takes a response $\mathbf{\hat{y}}$ and the corresponding golden code $\mathbf{y}$ as input, producing a scalar value as output. The model $\pi_{\text{ref}}$ serves as the reference model used to regularize $\pi_{\theta}$ through Kullback–Leibler (KL) divergence. The constant $\beta$ controls the degree of regularization.
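To make the bracketed term of the objective concrete, the sketch below spreads it over the tokens of a single sampled response: each token receives the KL penalty, and the scalar reward is added at the final token, as is common in RLHF-style PPO. The function name `kl_shaped_rewards` and the choice to place the scalar reward on the last token are assumptions of this sketch rather than details confirmed by the text.

```python
def kl_shaped_rewards(policy_logprobs, ref_logprobs, final_reward, beta):
    """Per-token rewards for one sampled response.

    policy_logprobs[t] and ref_logprobs[t] are the log-probabilities that the
    current policy and the reference model assign to the t-th generated token.
    The per-token rewards sum to
        final_reward - beta * log(pi_theta(y|x) / pi_ref(y|x)),
    which is the quantity inside the expectation of the objective.
    """
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-probability sequences must have equal length")
    if not policy_logprobs:
        return []
    # KL penalty applied at every token.
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    # Bandit setting: the scalar reward arrives once, when the episode ends.
    rewards[-1] += final_reward
    return rewards
```

A larger `beta` keeps the policy closer to the reference model, at the cost of slower reward improvement.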

2.2.3 Defining Reward by Golden Code AST

Parsing code to AST.  Unlike traditional language models that treat code as simple sequences of subword tokens, we leverage the Abstract Syntax Tree (AST) to gain deeper semantic insights. For the purpose of parsing, we assume the provided code is syntactically valid, a reasonable assumption for code understanding. We employ Pyverilog [14], a popular open-source toolkit for RTL design, to construct ASTs. In this structure, each subtree represents a consecutive span of subword tokens, while every leaf node corresponds to an individual token.

Reward definition.  For each generated Verilog code segment sequence $\mathbf{\hat{y}}$, the reward $r$ is defined by calculating the similarity between the AST of the response and the AST of the label, provided the response is valid. The reward $r$ is then given by:

$$r(\mathbf{\hat{y}}, \mathbf{y}) = \begin{cases} -10.0, & \text{if } \mathbf{\hat{y}} \text{ contains no valid Verilog code segment} \\ -5.0, & \text{if } \mathbf{\hat{y}} \text{ cannot be parsed into an AST (i.e., no Verilog semantics)} \\ 10 \cdot \mathrm{sim}_{\mathrm{AST}}(\mathbf{\hat{y}}, \mathbf{y}), & \text{if } \mathbf{\hat{y}} \text{ can be parsed into an AST successfully} \end{cases} \tag{3}$$
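The three cases of the reward can be sketched as below. Several details are assumptions of this illustration: a "valid Verilog code segment" is taken to mean a `module ... endmodule` span found by a regular expression, the parser and the similarity function are passed in as the hypothetical callables `parse_ast` and `sim_ast`, and a parse failure is signalled by an exception.

```python
import re

# Assumption: a code segment is the first module ... endmodule span.
MODULE_RE = re.compile(r"\bmodule\b.*?\bendmodule\b", re.DOTALL)


def golden_code_reward(response, golden_ast, parse_ast, sim_ast):
    """Piecewise reward for one response.

    parse_ast(code) -> AST, raising an exception when the code is not parsable.
    sim_ast(ast_a, ast_b) -> similarity in [0.0, 1.0].
    Both callables are hypothetical stand-ins supplied by the caller.
    """
    match = MODULE_RE.search(response)
    if match is None:
        return -10.0  # no Verilog code segment in the response
    try:
        response_ast = parse_ast(match.group(0))
    except Exception:
        return -5.0  # code segment present but not parsable
    return 10.0 * sim_ast(response_ast, golden_ast)
```

Under this scheme any parsable response scores at least 0, so it is always preferred to an unparsable one, which in turn is preferred to a response with no code at all.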

To get the similarity score between abstract syntax trees (ASTs), we designed a simple yet effective algorithm. As shown in Algorithm 1, the comparison algorithm, $\mathrm{sim}_{\mathrm{AST}}$, computes a similarity score between two normalized ASTs. If the trees are identical, the score is 1.0. If both trees are tuples and their root node types match, it recursively compares their children. The similarity score is the average of the scores of the children. If the number of children differs, the score is penalized by dividing by the maximum number of children. The score ranges from 0.0 (completely different) to 1.0 (identical).

Algorithm 1 $\mathrm{sim}_{\mathrm{AST}}$: Compare Two Verilog Code Segments
Input: Two AST-parsable Verilog code segments $\mathbf{y}_1$ and $\mathbf{y}_2$
Output: A similarity score between 0.0 and 1.0
Function simAST(tree1, tree2):
    if tree1 == tree2 then
        return 1.0   {Trees are identical}
    end if
    if tree1 is a tuple and tree2 is a tuple then
        if tree1.root == tree2.root then
            children1 ← tree1.children
            children2 ← tree2.children
            children_score ← Σ simAST(c1, c2) for c1, c2 in zip(children1, children2)
            if len(children1) == len(children2) then
                return children_score / len(children1)
            else
                return children_score / max(len(children1), len(children2))
            end if
        end if
    end if
    return 0.0

Parse $\mathbf{y}_1$ and $\mathbf{y}_2$ into normalized Verilog ASTs tree1 and tree2
return simAST(tree1, tree2)

This approach provides a flexible and structure-focused comparison of ASTs, abstracting away specific details and concentrating on the overall shape and structure of the trees. By focusing on the structural similarities, we can effectively measure the quality of generated Verilog code segments.
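The recursion of Algorithm 1 translates almost directly into Python. The sketch below assumes one particular encoding of a normalized AST, which the text does not specify: a node is a tuple whose first element is the node type and whose remaining elements are its children, and a leaf is any non-tuple value. Since the maximum of two equal lengths is that length, the two return branches of the algorithm collapse into one.

```python
def sim_ast(tree1, tree2):
    """Structural similarity of two normalized ASTs, in [0.0, 1.0].

    Assumed encoding: a node is a tuple (node_type, child_1, ..., child_n);
    anything that is not a tuple is treated as a leaf.
    """
    if tree1 == tree2:
        return 1.0  # identical trees
    if isinstance(tree1, tuple) and isinstance(tree2, tuple):
        if tree1 and tree2 and tree1[0] == tree2[0]:
            children1, children2 = tree1[1:], tree2[1:]
            # zip stops at the shorter child list; unmatched children score 0.
            children_score = sum(
                sim_ast(c1, c2) for c1, c2 in zip(children1, children2)
            )
            # At least one side has children here, otherwise the trees
            # would have been equal and handled above.
            return children_score / max(len(children1), len(children2))
    return 0.0
```

For example, two `Assign` nodes that share the left-hand side but differ in the right-hand-side leaf score 0.5: one child matches fully, the other not at all.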

3 Experimental Setup and Performance Evaluation

3.1 Training Detail

We continually pre-train VeriSeek (6.7B) based on deepseek-coder-6.7b-base [4]. Our experiments are conducted on a server equipped with 8 A800-80G GPUs. All experiments utilize a cosine learning rate scheduler with a warmup phase comprising 10% of the total training steps, and an AdamW optimizer [7] with a weight decay of 0.05. Additionally, we employ deepspeed ZeRO-3 offload [11] for acceleration.

During continual pre-training, we use a peak learning rate (LR) of 1e-4, a batch size of 32, and train for 1 epoch. For reinforcement learning, we adopt a peak learning rate (LR) of 1e-5, a batch size of 8, and train for 15 epochs. The duration of continual pre-training is approximately 1 hour, whereas the reinforcement learning task requires about 1 day to complete training.
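As an illustration of the schedule described above, the sketch below computes the learning rate at a given step for a cosine schedule with a 10% warm-up. The linear shape of the warm-up and the decay to zero at the final step are assumptions; the text specifies only the scheduler type, the warm-up fraction, and the peak learning rates.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_ratio=0.1):
    """Learning rate at `step` for cosine decay with linear warm-up.

    Assumptions of this sketch: warm-up rises linearly to `peak_lr`, and the
    cosine phase decays to zero at `total_steps`.
    """
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

With `peak_lr=1e-4` this corresponds to the continual pre-training setting, and with `peak_lr=1e-5` to the reinforcement learning setting.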

3.2 Metric and Benchmark

Metric.  We follow [16] and evaluate the models using the widely-adopted pass@$k$ metric for code generation, which is the percentage of problems solved by using $k$ generated programs per problem:

$$\text{pass@}k := \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] \tag{4}$$

where $n \geq k$ samples are generated per problem, $c$ is the number of those samples that pass the testbench, and a problem is solved if any of the $k$ samples passes the testbench. In our experiments, we sample $n = 20$ code completions per problem and measure pass@1 and pass@5.

Moreover, to facilitate a more accurate comparison with the baseline model RTLCoder, we adopt the ’pass@5’ metric as described in [6]. This metric considers an experiment successful if any test in five trials passes the testbench. To avoid confusion with the notation ’pass@$k$, $k = 5$’, we have renamed this metric to hit@5.
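Both metrics are straightforward to compute. The sketch below evaluates the pass@$k$ estimator for a single problem from the number of samples $n$ and the number of passing samples $c$, averages it over problems, and computes hit@$k$ as the fraction of problems with at least one passing trial among the first $k$. The function names are illustrative, and the reading of hit@5 as a per-problem any-of-five check is an interpretation of the description above.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate of pass@k for one problem: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given each problem's passing count."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


def hit_at_k(trial_results, k=5):
    """Fraction of problems where any of the first k trials passes.

    trial_results holds one list of booleans per problem.
    """
    return sum(any(trials[:k]) for trials in trial_results) / len(trial_results)
```

With $n = 20$ and a single passing sample, `pass_at_k(20, 1, 1)` is 0.05 and `pass_at_k(20, 1, 5)` is 0.25, which shows how strongly the metric rewards drawing more candidates.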

Benchmark.  For a fair comparison of large language models, we evaluate their performance on the RTLLM V1.1 benchmark [8]. This updated version of the benchmark includes 29 RTL design tasks at a larger design scale and addresses several issues present in the original RTLLM V1.0. While we primarily adhere to the testing methods outlined in the original paper, we evaluate both syntax and functional pass rates using ModelSim [13]. Syntax correctness only requires the design to comply with Verilog syntax rules, while functional correctness also mandates that the design interface corresponds to the testbench, ensuring the circuit can be simulated.

3.3 Performance Evaluation

In the evaluation section, we provide a comprehensive comparison of the performance of various large language models for Verilog code generation, as detailed in Table 1. This table includes both closed-source and open-source models, offering a broad perspective on the current state of the field. Among these models, our designed model stands out, particularly the version that has been pre-trained with C programs (PTwC) and further enhanced through reinforcement learning (PTwC+RL). This combination of continual pre-training and reinforcement learning has enabled our model to achieve state-of-the-art (SOTA) functional performance among open-source models, setting a new baseline.

                                     Syntax                   Function
Type           Model                 pass@1  pass@5  hit@5    pass@1  pass@5  hit@5
Closed-Source  GPT-3.5               53.3    84.6    89.7     31.7    51.7    37.9
Closed-Source  GPT-4                 73.4    89.2    100.0    32.1    54.1    65.0
Open-Source    Thakur-16B            77.5    89.0    86.2     13.3    21.0    24.1
Open-Source    ChipGPT               32.3    40.4    38.9     17.8    29.0    34.3
Open-Source    ChipGPT-13B           57.4    68.2    65.5     25.4    39.6    45.7
Open-Source    RTLCoder              66.9    88.4    96.6     25.3    40.5    48.3
Base           Deepseek-coder-base   65.3    88.2    89.3     21.4    31.6    39.3
Ours           VeriSeek (PT)         73.3    93.5    86.2     27.6    44.0    51.7
Ours           VeriSeek (PTwC)       83.6    95.0    92.2     31.5    46.3    51.7
Ours           VeriSeek (PTwC+RL)    84.1    95.2    93.1     31.7    48.5    55.2

  • * All models in this table have approximately 7 billion parameters, except for ChipGPT-13B, Thakur-16B, GPT-3.5, and GPT-4.

  • + Bold font represents the best metric, excluding GPT-4.

Table 1: Performance comparison of large language models for Verilog code generation. After pre-training and reinforcement learning, our model outperforms the other open-source models by a considerable margin, especially on the functional metrics.

Specifically, our PTwC+RL model demonstrates strong capabilities in both syntax and functional metrics. It achieves a syntax hit@5 of 93.1%, which indicates its robust ability to generate syntactically correct Verilog code, although this is slightly lower than RTLCoder’s 96.6%. However, our model excels in functional performance, achieving a functional hit@5 of 55.2%, underscoring its superior performance in generating functionally accurate code, which is a critical aspect of practical code generation tasks. Its functional pass@1, pass@5, and hit@5 are the highest among all open-source models listed in the table, highlighting the effectiveness of our approach.

4 Discussion

4.1 Failure of Instruction Fine-tuning

Instruction fine-tuning is a process where large language models are further trained on datasets comprising instructions and corresponding responses [17]. This enhances their ability to follow human instructions accurately and generate relevant outputs. It is widely adopted due to its efficacy in improving model performance on diverse, real-world tasks and user interactions.

Decoder-only LLMs generate a sequence by continuously predicting the next token based on the previously generated ones. We denote the probability of generating the $t$-th token $\hat{y}_{t}$ as $P_{\pi}(\hat{y}_{t} \mid \mathbf{x}, \mathbf{\hat{y}}_{<t})$, so the log probability of generating the whole sequence can be written as $\sum_{t=1}^{T} \log P_{\pi}(\hat{y}_{t} \mid \mathbf{x}, \mathbf{\hat{y}}_{<t})$. Instruction fine-tuning employs Maximum Likelihood Estimation (MLE) to find the best parameters. The MLE loss is usually defined as below:

$$\mathcal{L}_{\text{mle}} = -\sum_{t=1}^{T} \log P_{\pi}(\hat{y}_{t} \mid \mathbf{x}, \mathbf{\hat{y}}_{<t}) \tag{5}$$
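The loss in Equation (5) is simply the negative sum of per-token log-probabilities. A minimal sketch follows, where `token_probs` is a hypothetical list holding the probability the model assigns to each target token given the specification and the preceding tokens:

```python
import math


def mle_loss(token_probs):
    """Negative log-likelihood of one sequence.

    token_probs[t] is the model's probability for the t-th target token,
    conditioned on the specification and all earlier tokens.
    """
    if any(not 0.0 < p <= 1.0 for p in token_probs):
        raise ValueError("probabilities must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)
```

Because the loss decomposes token by token, each term is conditioned only on the prefix, which is the sequential character of the training signal discussed in this section.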

As mentioned in [6], there exists exposure bias [1] in auto-regressive sequence generation, where the model predicts the next token based on its own generated previous tokens rather than the reference tokens, leading to potential deviations from the reference code during generation. We confirm this problem by fine-tuning the pre-trained model directly on the Opencores dataset.

                      Syntax                   Function
Model                 pass@1  pass@5  hit@5    pass@1  pass@5  hit@5
VeriSeek (PT)         73.3    93.5    86.2     27.6    44.0    51.7
VeriSeek (PTwC)       83.6    95.0    92.2     31.5    46.3    51.7
VeriSeek (PTwC+FT)    79.3    94.6    89.1     31.9    46.1    48.3
VeriSeek (PTwC+RL)    84.1    95.2    93.1     31.7    48.5    55.2

  • * Bold font represents the best metric.

Table 2: A comparative analysis of the performance between the directly fine-tuned model and other model variants. The fine-tuned model demonstrates a slight improvement in the functional pass@1 metric, but exhibits suboptimal performance across other evaluation metrics.

As shown in Table 2, the directly fine-tuned model (PTwC+FT) improves functional pass@1 only slightly over PTwC (31.9 vs. 31.5) and falls below PTwC on every other metric, most visibly on syntax pass@1 (79.3 vs. 83.6) and functional hit@5 (48.3 vs. 51.7).

In addition to exposure bias, there is another potential explanation for the inadequacies observed in direct fine-tuning. The auto-regressive nature of current generative large language models (LLMs) restricts new tokens to attend only to their prefixes, which is not aligned with the inherently parallel semantics of Verilog. While direct fine-tuning can enhance the model’s response to instructions, it may adversely affect the model’s creative capabilities. This phenomenon will be further examined in the subsequent subsection. We recommend that the research community consider adopting parallel and global supervisory signals instead of sequential ones for Verilog code generation.

4.2 Training Dynamics of Reinforcement Learning

To gain a deeper understanding of the impact of reinforcement learning on model performance, we recorded the reward at each training step as well as the functional pass@1 and pass@5 every five training steps.

As illustrated in Fig. 1, it is noteworthy that the optimal model solution emerges at the initial stages of the training process rather than at the point of convergence. Moreover, with prolonged training, the model converges towards the fine-tuned model. The training process can be divided into four distinct phases:

Figure 1: Reward and functional pass rate with reinforcement learning. This figure illustrates the evolution of model performance during reinforcement learning training. The graph depicts reward values and functional pass rates (pass@1 and pass@5) over training steps. The training process exhibits four distinct phases: warm-up (0-20 steps), learning (20-50 steps), deviation (50-100 steps), and convergence (100-150 steps). Notably, the optimal model solution emerges early in the training process rather than at convergence.
  • Warm-up (0-20 steps): During this initial stage, the model tries to escape from local optima. This phase is crucial for ensuring that the model does not become prematurely trapped in suboptimal solutions, thereby setting a foundation for more effective learning.

  • Learning (20-50 steps): In the subsequent phase, the model actively explores its representational capacity by maximizing the rewards. This exploration is aimed at enhancing the model’s ability to generalize and adapt to diverse scenarios, thereby improving its overall performance.

  • Deviation (50-100 steps): During this phase, the model intentionally deviates from the optimal region. This deviation arises from the misalignment between the reward signal and our evaluation expectations: although AST serves as a global supervisory signal, it still does not align completely with the anticipated evaluation metric. Consequently, the model continues to explore a broader solution space, thereby moving away from the optimal region.

  • Convergence (100-150 steps): Finally, the model converges towards a fine-tuned state. In this phase, the model refines its parameters to achieve an optimal balance between exploration and exploitation, culminating in a well-tuned and stable policy.
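Because the best model appears early rather than at convergence, a practical consequence is to select the checkpoint by its evaluation score instead of taking the final one. The sketch below is one way to do this from the metrics recorded every five steps. The names `phase_of` and `best_checkpoint`, the tie-breaking rule (pass@5 first, then pass@1), and the choice to assign a boundary step to the later phase are assumptions of this illustration.

```python
# Upper step bound and name of each phase, taken from the list above.
PHASE_ENDS = [
    (20, "warm-up"),
    (50, "learning"),
    (100, "deviation"),
    (150, "convergence"),
]


def phase_of(step):
    """Name of the training phase containing `step` (0 to 150)."""
    if not 0 <= step <= 150:
        raise ValueError("step outside the recorded training range")
    for end, name in PHASE_ENDS:
        if step < end:  # a boundary step belongs to the later phase
            return name
    return PHASE_ENDS[-1][1]  # step == 150


def best_checkpoint(history):
    """Pick the record with the highest pass@5, breaking ties by pass@1.

    history is a list of (step, pass_at_1, pass_at_5) tuples.
    """
    return max(history, key=lambda record: (record[2], record[1]))
```

Calling `phase_of` on the step of the selected record then reports in which of the four phases the best checkpoint was found.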

Figure 2: Learning Dynamics in the Optimization Landscape. This figure illustrates the optimization trajectory of the model through various stages of the proposed methodology. The contour plot represents the optimization landscape, with the base model (black dot) positioned at the periphery, indicating an initial low-accuracy state. The pre-trained model’s path (yellow) diverges from the base model, suggesting initial performance improvement. The fine-tuned model’s trajectory (purple) continues from the pre-trained state. The reinforcement learning models’ path (red) demonstrates a complex optimization process with significant fluctuations, ultimately converging towards the fine-tuned state.

This training process can be visualized through the learning dynamics illustrated in Fig. 2, which provides a comprehensive representation of the optimization landscape across various stages of our proposed methodology. The base model, denoted by a black dot, is positioned at the periphery of the contour plot, indicative of an initial low-accuracy state. The pre-trained model’s trajectory, depicted in yellow, diverges from the base model, suggesting a preliminary enhancement in model performance. The fine-tuned model, represented in purple, continues this trajectory from the pre-trained state. The best and converged Proximal Policy Optimization (PPO) models are delineated in red, with their convoluted path demonstrating the PPO algorithm’s optimization process, characterized by significant fluctuations and explorations within the loss landscape. The terminal point represents the ultimate convergence of the PPO model towards the fine-tuned state.

It is noteworthy that while fine-tuning can propel the model out of a sub-optimal region, it tends to direct the model towards another sub-optimal area. This is attributable to the nature of simple auto-regressive training, which is not inherently aligned with the parallel semantics of Verilog code. During the reinforcement learning phase, the model demonstrates enhanced capability in exploring the landscape to attain an optimal solution. Another critical factor contributing to this success is the utilization of the Abstract Syntax Tree (AST), which serves as a globally defined reward. Although the AST, representing the program’s structure, does not fully encapsulate the parallel nature of the code, it nonetheless provides robust guidance for the learning process. It is important to acknowledge that in the long term, reinforcement learning is expected to converge the model towards the fine-tuned state, as the reward is directly generated from the golden reference code.

In summary, our findings underscore the critical importance of a well-defined global supervisory signal in enabling the model to acquire the capability for Verilog code generation. Furthermore, where such a well-defined signal is not readily available, algorithms capable of exploring the optimization landscape should be employed to enhance the model's performance.

5 Conclusion

In conclusion, this study has advanced the automatic generation of Verilog code from natural language instructions using reinforcement learning with golden code feedback. Our approach, implemented in a 6.7B parameter model, achieves state-of-the-art results, outperforming larger models. The research highlights the crucial role of comprehensive supervisory signals aligned with Verilog’s parallel semantics. By offering an open-source alternative to commercial LLMs, this work enhances flexibility and data privacy in hardware design. Our findings underscore the importance of well-defined global supervisory signals and exploration-capable algorithms in optimizing model performance, paving the way for more efficient hardware development methodologies.

References

  • [1] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. Advances in neural information processing systems, 28, 2015.
  • [2] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [3] E. Greenbaum. Open source semiconductor core licensing. Harv. JL & Tech., 25:131, 2011.
  • [4] D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li, et al. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
  • [5] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019.
  • [6] S. Liu, W. Fang, Y. Lu, Q. Zhang, H. Zhang, and Z. Xie. RTLCoder: Outperforming GPT-3.5 in design RTL generation with our open-source dataset and lightweight solution. arXiv preprint arXiv:2312.08617, 2023.
  • [7] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • [8] Y. Lu, S. Liu, Q. Zhang, and Z. Xie. RTLLM: An open-source benchmark for design RTL generation with large language model. In 2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC), pages 722–727. IEEE, 2024.
  • [9] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
  • [10] Z. Pei, H.-L. Zhen, M. Yuan, Y. Huang, and B. Yu. BetterV: Controlled Verilog generation with discriminative guidance. arXiv preprint arXiv:2402.03375, 2024.
  • [11] J. Rasley, S. Rajbhandari, O. Ruwase, and Y. He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506, 2020.
  • [12] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • [13] Siemens Software. ModelSim.
  • [14] S. Takamaeda-Yamazaki. Pyverilog: A Python-based hardware design processing toolkit for Verilog HDL. In K. Sano, D. Soudris, M. Hübner, and P. C. Diniz, editors, Applied Reconfigurable Computing, pages 451–460, Cham, 2015. Springer International Publishing.
  • [15] S. Thakur, B. Ahmad, Z. Fan, H. Pearce, B. Tan, R. Karri, B. Dolan-Gavitt, and S. Garg. Benchmarking large language models for automated Verilog RTL code generation, 2022.
  • [16] S. Thakur, B. Ahmad, Z. Fan, H. Pearce, B. Tan, R. Karri, B. Dolan-Gavitt, and S. Garg. Benchmarking large language models for automated Verilog RTL code generation. In 2023 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1–6, 2023.
  • [17] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  • [18] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.