Python-based DSL for generating Verilog model of Synchronous Digital Circuits

Mandar Datar, Dhruva S. Hegde, Vendra Durga Prasad, Manish Prajapati, Neralla Manikanta,
Devansh Gupta, Janampalli Pavanija, Pratyush Pare, Akash, Shivam Gupta, and Sachin B. Patkar
Department of Electrical Engineering
Indian Institute of Technology Bombay, India
Email:{mandardatar, patkar}@ee.iitb.ac.in
Abstract

We have designed a Python-based Domain Specific Language (DSL) for modeling synchronous digital circuits. In this DSL, hardware is modeled as a collection of transactions – running in series, parallel, and loops. When the model is executed by a Python interpreter, synthesizable and behavioural Verilog is generated as output, which can be integrated with other RTL designs or directly used for FPGA and ASIC flows. In this paper, we describe - 1) the language (DSL), which allows users to express computation in series/parallel/loop constructs, with explicit cycle boundaries, 2) the internals of a simple Python implementation to produce synthesizable Verilog, and 3) several design examples and case studies for applications in post-quantum cryptography, stereo-vision, digital signal processing and optimization techniques. In the end, we list ideas to extend this framework.

Index Terms:
Python, Verilog, DSL, RTL, FPGA, ASIC

I Introduction

Current high-level synthesis (HLS) tools such as Xilinx Vivado HLS and Intel HLS convert a high-level, un-timed C/C++ model into synthesizable Verilog code [1]. A tool like Bluespec SystemVerilog [2] expects the user to break the computation into ‘atomic actions’, and the compiler emits synthesizable Verilog after scheduling the actions correctly and optimally. If a project is developed entirely with such tools, one can debug (edit-compile-debug) at the level of the C/C++/BSV source itself. However, if the design has to be interfaced with existing RTL modules (e.g. third-party ones), RTL/Verilog simulation needs to be carried out. Signal-level debugging is not easy when the RTL itself is machine-generated, since most of the signal names in the RTL cannot be correlated with variable names in the high-level model. With HLS tools, the user can influence the RTL generation using pragmas (e.g. to unroll or pipeline loops); however, the final scheduling is done by the tool. There are also tools like PyMTL [3] and MyHDL [4], which convert a logic design expressed in a high-level language (Python) to RTL. The DSL described in this paper is not an HLS tool. It sits at a lower level than HLS, but at a higher level than RTL. It is extremely lightweight, easy to use, and generates readable, synthesizable Verilog.
In the Python-based DSL presented in this paper:

  • Actual computation (comparisons, addition, multiplication, etc.) is written as Python expressions.

  • Python statements are grouped into ‘leaf sections’, where the entire computation under one leaf happens in a single clock cycle.

  • Series, parallel and loop sections group other such sections forming a tree.

  • Generated Verilog is behavioural and preserves register/wire names and expression structure from the user’s Python code.

  • Full Python facilities are available for static elaboration.

Figure 1 provides a high-level picture of the Python DSL.

Figure 1: Python to Verilog

The flow of this paper is as follows. Section 2 introduces the proposed DSL framework and its constructs. Section 3 elaborates on how Python language features are used to build the DSL. This is followed by Section 4, which presents several complex examples of hardware designed using the DSL and the results obtained. Section 5 outlines the future directions of this work. Finally, Section 6 concludes the paper.

II Python DSL Constructs

In the Python DSL model, the computation is broken into chunks called LeafSections. The entire body of one LeafSection ‘executes’ in one clock cycle. For each module, the clock and reset signals, along with enable/ready handshake signals for the inputs and outputs, are implicitly present in the generated Verilog.

The statements that are written under SerialSections are executed in sequence. This construct can be used for sequential modelling.

with SerialSections ("S1"):
  with LeafSection ("a"):
    display ("running 'a'")
  with LeafSection ("b"):
    display ("running 'b' after 'a'")
  with LeafSection ("c"):
    display ("running 'c' after 'b'")

Similarly, the statements that are written under ParallelSections are executed in parallel. This construct can be used for concurrent modelling.

with ParallelSections ("P1"):
  with LeafSection ("a"):
    display ("running 'a'")
  with LeafSection ("b"):
    display ("running 'b' with 'a'")

The statements under WhileLoopSections and ForLoopSections are executed iteratively.

with ForLoopSection ("F1", "i", 0, 3):
  with LeafSection ("a"):
    display ("i=%0d, running ’a’", i)
  with LeafSection ("b"):
    display ("i=%0d, running ’b’", i)

All the constructs described can be combined (i.e. LeafSections can be arranged to run serially, in parallel, in a nested manner, and in loops) to model complex algorithms.
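
As a rough mental model of how these constructs compose, the sketch below estimates the cycle count of a section tree. It assumes that a leaf costs one cycle, a serial section sums its children, a parallel section lasts as long as its slowest child, and a loop repeats its body. The tuple encoding and the `cycles` function are illustrative and not part of the DSL; the generated hardware may spend additional cycles on handshaking.

```python
# Hypothetical cycle estimate for a tree of sections (not the DSL's API).
# A node is ("leaf",), ("serial", [children]), ("parallel", [children])
# or ("loop", iterations, [children]); a loop body runs its children in series.

def cycles(node):
    kind = node[0]
    if kind == "leaf":
        return 1                                  # one clock cycle per leaf
    if kind == "serial":
        return sum(cycles(c) for c in node[1])    # children run one after another
    if kind == "parallel":
        return max(cycles(c) for c in node[1])    # children start together
    if kind == "loop":
        return node[1] * sum(cycles(c) for c in node[2])
    raise ValueError("unknown section kind: " + kind)

leaf = ("leaf",)
s1 = ("serial", [leaf, leaf, leaf])     # like S1 above: a, then b, then c
p1 = ("parallel", [leaf, leaf])         # like P1 above: a together with b
f1 = ("loop", 3, [leaf, leaf])          # assuming three iterations of a, b

print(cycles(s1), cycles(p1), cycles(f1))   # 3 1 6
```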

The Python DSL has mainly two types of program variables - Reg and Var. They both can be of arbitrary bit-width.

  • Reg variables are used to synthesize registers. The values assigned to variables of type Reg are updated in the next clock cycle.

  • Var variables are used to synthesize pure combinational logic. The values assigned to variables of type Var are updated in the same clock cycle.

RegArray is also supported, which synthesizes an array of registers interfaced with decoders at the inputs and multiplexers at the outputs.
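
The difference between the two update rules can be illustrated with a small software model. This is only a sketch of the semantics described above; the classes and the `clock_edge` method are hypothetical and are not the DSL's own Reg and Var implementations.

```python
# Illustrative model of the two update rules (hypothetical, not the DSL's classes).

class Var:
    """Combinational value: an assignment is visible immediately."""
    def __init__(self, value=0):
        self.value = value

    def assign(self, value):
        self.value = value


class Reg:
    """Registered value: an assignment becomes visible at the next clock edge."""
    def __init__(self, value=0):
        self.value = value
        self.next_value = value

    def assign(self, value):
        self.next_value = value      # held until the clock edge

    def clock_edge(self):
        self.value = self.next_value


r, v = Reg(5), Var(5)
r.assign(r.value + 1)
v.assign(v.value + 1)
same_cycle = (r.value, v.value)   # (5, 6): the Reg still holds its old value
r.clock_edge()
next_cycle = (r.value, v.value)   # (6, 6): the Reg has now been updated
print(same_cycle, next_cycle)
```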

III Construction of the DSL

This section explains the construction of the Python DSL.

III-A Symbolic expressions and context blocks

This section shows the Python language features used in the proposed DSL framework. These features keep the statement syntax simple and build the section tree when the Python model is executed. Operator overloading is used to build symbolic expressions from user-written Python expressions [5, 6].

A ‘symbol’ base class is created to represent register, port and wire objects. As shown below, overloaded operators build ‘BinaryExpression’ objects.

 1 class BinaryExpression:
 2     def __init__ (self, op, a, b):
 3       ...
 4     def to_string (self):
 5       return ('(' + self.a.to_string ()
 6                  + ' ' + self.op + ' '
 7                  + self.b.to_string () + ')')
 8
 9 class Symbol:
10     def __init__ (self, name, width):
11       ...
12     def to_string (self):
13       return self.name
14     def __add__ (self, other):
15       return BinaryExpression ('+', self, other)
16     ...

On line 14, the + operator is overloaded. It prepares a BinaryExpression object, referring to the operands self and other. The following code indicates how a symbolic expression is built when Python code is executed.

a = Symbol ("a", 32)
b = Symbol ("b", 32)
c = Symbol ("c", 32)

result = a + b + c

Here, ‘result’ is a label pointing to a symbolic expression object.

>>> print (result.to_string ())
((a + b) + c)

Python has a construct called the ‘with’ statement, which is commonly used for opening a file, processing it with a block of statements, and closing the file automatically. The DSL uses this construct to demarcate the boundaries of the Sections.

class ContextClass:
  def __enter__ (self):
    print ("Entering.")
  def __exit__ (self, t, value, traceback):
    print ("Exiting.")

c = ContextClass ()

print ("statement 1")
with c:
    print ("statement 2")
    print ("statement 3")
print ("statement 4")

The above block of code produces:

statement 1
Entering.
statement 2
statement 3
Exiting.
statement 4

Inside the __enter__ method, we add a new Section object as root, or add it to an already open section object. As the body of the section executes, we add child sections/statements to this new section object. In the __exit__ method, we close the section object.

III-B Building a tree of section objects

With this background (symbolic expressions and context objects), we proceed to explain the construction of module objects, which will contain a tree of sections (series/parallel/loop etc.), and each LeafSection will hold a block of statements.

 1  class example1(HWModule):
 2    def __init__ (self, instancename, ..):
 3      HWModule.__init__ (self, instancename)  # [
 4      c = Reg ("c", 32)
 5      with LeafSection ("L1"):
 6        Assignment (c, c + 1)
 7      self.endModule ()                       # ]
 8
 9  e = example1 ("e", ..)
10  e.emitVerilog ()

On creation of the object ‘e’ at line 9, the __init__ method gets called automatically. Inside example1.__init__, the __init__ method of the base class is called (line 3), which records the start of the definition of a new hardware module. The statement on line 4 creates a symbol object, and its constructor records it as a member of the current hardware module. On entering the LeafSection ‘L1’, a section object is added to the tree of sections of the current hardware module. When the statements within the LeafSection ‘L1’ are executed, e.g. the assignment statement on line 6, the LHS and RHS expressions (which will be of type Symbol, BinaryExpression, etc.) are recorded in the code block associated with the currently open LeafSection object. On exiting from the LeafSection ‘L1’, the object in the tree is closed, so that any subsequent ‘with LeafSection..’ adds a new child to the tree of sections. On line 7, the endModule () statement marks the end of the definition of the current hardware module.
Thus, the object e contains a Register member ‘c’ and a tree of sections containing one LeafSection ‘L1’, with a single assignment statement in it. On line 10, the base class method HWModule.emitVerilog is called. It emits Verilog code defining the module example1.

III-C Conversion of Python to Verilog

A simple example having two LeafSections in series is provided for illustration, with the intended state diagram, and parts of the generated Verilog.

@hardware
def add_sub (a, b, c):
    a = RegIn  ("a", 32)
    b = RegIn  ("b", 32)
    c = RegIn  ("c", 32)
    d = RegOut ("d", 32)
    tmp = Reg  ("tmp", 32)
    with LeafSection ("add"):
      tmp = a + b
      display ("add: a=%0d, b=%0d", a, b)
    with LeafSection ("sub"):
      d = tmp - c
      display ("result: %0d", tmp - c)

Here, ‘@hardware’ is a decorator that replaces the definition of ‘add_sub’ with a class similar to ‘example1’ above. In particular, the original contents of ‘add_sub’ become part of the ‘__init__’ method of the class.

In Python, the AST (abstract syntax tree) of a program can be easily modified. The AST for the code inside ‘with Section*’ blocks is rewritten, and the assignment statements are updated.

    tmp = a + b

The above code is translated to

    Assignment (tmp, a + b)
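
A rewrite of this kind can be sketched with Python's standard `ast` module. The class name `AssignRewriter` and the restriction to single, simple-name targets are assumptions made for illustration; the framework's actual rewriting rules may differ. `ast.unparse` requires Python 3.9 or newer.

```python
import ast

class AssignRewriter(ast.NodeTransformer):
    """Rewrite `x = expr` inside `with` blocks into `Assignment(x, expr)`."""

    def __init__(self):
        self.depth = 0            # number of enclosing `with` blocks

    def visit_With(self, node):
        self.depth += 1
        self.generic_visit(node)  # rewrites statements in the block body
        self.depth -= 1
        return node

    def visit_Assign(self, node):
        if self.depth == 0 or len(node.targets) != 1:
            return node           # leave code outside sections untouched
        target = node.targets[0]
        if not isinstance(target, ast.Name):
            return node           # only simple names are handled in this sketch
        call = ast.Call(
            func=ast.Name(id="Assignment", ctx=ast.Load()),
            args=[ast.Name(id=target.id, ctx=ast.Load()), node.value],
            keywords=[],
        )
        return ast.copy_location(ast.Expr(value=call), node)

source = (
    "tmp = Reg('tmp', 32)\n"
    "with LeafSection('add'):\n"
    "    tmp = a + b\n"
)
tree = ast.fix_missing_locations(AssignRewriter().visit(ast.parse(source)))
rewritten = ast.unparse(tree)
print(rewritten)
```

The declaration outside the `with` block is left as an ordinary Python assignment, while the one inside the block becomes a recorded `Assignment` call.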
Figure 2: A state-diagram for ‘LeafSections’ in ‘add_sub’ module

Parts of the behavioural Verilog code emitted by the framework are listed below.

module add_sub (
  CLK, RST,
  START, Done, get_done, Ready,
  a, b, c, d
  );
  ...
  reg [1:0] state_add = 2'd0;
  reg [1:0] state_add_WIRE;
  reg [1:0] state_sub = 2'd0;
  reg [1:0] state_sub_WIRE;
  ...
  always @(*) begin
  ...
    if (state_add == 1) begin
      tmp_WIRE = (a_inreg + b_inreg);
      state_add_WIRE = 2;
      state_sub_WIRE = 1;
    end
    if (state_sub == 1) begin
      d_outreg_WIRE = (tmp - c_inreg);
      state_sub_WIRE = 2;
      state_st_WIRE = 2;
      ...
    end
  ...
  end
  ...
endmodule

A testbench can be written in Python itself as:

 1 @hardware
 2 def my_tb():
 3  m1 = add_sub ("m1", None, None, None)
 4  with SerialSections ("S"):
 5    with LeafSection ("S10"):
 6      m1.start (a=Const (32,21), b=Const (32,34),
 7                c=Const (32,5))
 8    with LeafSection ("S11"):
 9      display ("Result = [%0d]", m1.isDone ()[0])

Here, on line 3, the ‘add_sub’ module is instantiated. In this way, a hierarchy of modules can be created. Note that the hierarchy is not flattened: each module is emitted separately, i.e. the hierarchy is preserved in the generated Verilog. So, in this example, the my_tb module declares an instance of add_sub named m1, and the definition of add_sub is emitted as a separate module.

III-D Simulation and Synthesis

As shown in Figure 3, user-written Python code is passed through the preprocessing and Verilog generation stages. The generated Verilog is simulated using an RTL simulator (e.g. Verilator). A few more Verilog files, such as the BRAM and FIFO modules, are also provided to the simulator. The simulator produces a VCD file and prints the results of the display statements specified by the user in Python. The generated Verilog is then given to the Yosys tool for synthesis. The BRAM module is written in Verilog, and Yosys infers an FPGA BRAM instance when the ‘memory -bram’ pass is run.

Figure 3: Python DSL : Overall Flow

Even though the basic modules are available as Verilog modules, while building the module hierarchy in Python, a Python model/interface is needed to represent them. The Python interface model for the basic FIFO module looks like this:

 1 class Fifo (HWModule):
 2   def __init__ (self, instancename, width):
 3     ...
 4     self.addParameter ("width", width)
 5     ...
 6     self.addOutPort ("read_data", width)
 7     self.addInPort  ("read_enable",  1)
 8     ...
 9   def read (self):
10     addCondition (self.getPortWire ("read_ready")
                                     == Const (1,1))
11     Assignment (self.getPortWire ("read_enable"),
                                        Const (1,1))
12     return self.getPortWire ("read_data")
13   def write (self, data):
14       ...

On line 4, a parameter ‘width’ is defined. The symbolic variable for the ‘read_data’ wire will get that width. On line 10, the condition "read_ready == 1'b1" is added. Whenever a ‘read’ function on a FIFO instance is called inside a leaf section, the body of the leaf section will be wrapped inside this ‘if condition’, i.e. that LeafSection will ‘execute’, and trigger the execution of subsequent LeafSections, only when this condition is satisfied.
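
The guarding idea can be sketched as a small text generator. Everything in this sketch is hypothetical: the `Leaf` class, its methods, the signal names and the exact shape of the emitted text are illustrative assumptions, not the framework's actual output.

```python
# Illustrative only: how recorded conditions could guard a leaf body.
# `Leaf`, `add_condition`, `emit` and all signal names are hypothetical.

class Leaf:
    def __init__(self, name):
        self.name = name
        self.conditions = []   # e.g. added when a FIFO read is called in the leaf
        self.statements = []   # assignments recorded in the leaf body

    def add_condition(self, text):
        self.conditions.append(text)

    def emit(self):
        # The leaf fires only when it is active AND every condition holds.
        guard = " && ".join(["(state_%s == 1)" % self.name]
                            + ["(%s)" % c for c in self.conditions])
        lines = ["if (%s) begin" % guard]
        lines += ["  " + s for s in self.statements]
        lines.append("end")
        return "\n".join(lines)

leaf = Leaf("recv")
leaf.add_condition("f_read_ready == 1'b1")       # what a FIFO read would add
leaf.statements.append("f_read_enable = 1'b1;")
leaf.statements.append("x_WIRE = f_read_data;")
print(leaf.emit())
```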

Python facilities can be used for static elaboration.

 1     a = Reg ("a", 32)
 2     with LeafSection ("loop1"):
 3       tmp = Var ("tmp", 5)
 4       for i in range (32):
 5         tmp = tmp + a[i]

The pre-processing stage preserves the for-loop on line 4. Note that this is Python’s built-in ‘for’ loop, and not ‘with ForLoopSection’ defined by the DSL framework. When this code is run for Verilog generation, the loop gets expanded by Python, making it equivalent to:

         tmp = tmp + a[0]
         tmp = tmp + a[1]
         ...
         tmp = tmp + a[31]

Similarly, all facilities available in Python are available to the user for static elaboration.
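
The unrolling effect can be reproduced with a tiny symbolic class. `Sym` is a stand-in written for this illustration and is not the framework's Symbol class; the loop is shortened to four iterations to keep the output readable.

```python
# Sketch of static elaboration with a tiny symbolic class (hypothetical names).

class Sym:
    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        return Sym("(" + self.text + " + " + other.text + ")")

    def __getitem__(self, index):
        return Sym(self.text + "[" + str(index) + "]")   # bit select

a = Sym("a")
tmp = Sym("0")
for i in range(4):          # ordinary Python loop, unrolled while the model runs
    tmp = tmp + a[i]

print(tmp.text)   # ((((0 + a[0]) + a[1]) + a[2]) + a[3])
```

Only the final expression is recorded; the loop itself leaves no trace in the symbolic result.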

III-E Interface Generation

Apart from the standard FIFO interface, the Python DSL may be ported to generate Verilog code for IPs with AXI4 interfaces. Such an IP can be used to create peripherals in Vivado, a popular tool for designing and implementing digital circuits. If the port names of the generated AXI4 IP are compatible with the standard port names of AXI4 peripherals in Vivado, the connection process in the Vivado block design can be automated, so that the IP is connected to the MicroBlaze/Zynq processor through the AXI Interconnect automatically.

IV Example Case Studies

This section offers a diverse collection of case studies comparing Python DSL implementations with hand-crafted RTL and HLS-based implementations.

IV-A Post Quantum Cryptography

This case study illustrates the usage of Python DSL to implement primitives for Post Quantum Cryptography (PQC). PQC refers to new cryptography schemes that are resistant to attacks from quantum computers. CRYSTALS-Kyber [7] is a Lattice-based PQC scheme, which depends on the hardness of the Module-Learning-with-Errors (M-LWE) problem. Kyber is one of the winners in the NIST PQC Standardization competition and is being integrated into libraries and systems by the industry.

The mathematical objects in CRYSTALS-Kyber are polynomials over $R_q = \mathcal{Z}_q[x]/(x^n + 1)$, where $q$ is 3329, $n$ is 256 and $k$ (the number of polynomials used) is 2, 3 or 4. Hence, computations in Kyber involve polynomial matrix-matrix products and matrix-vector products. Kyber also requires pseudo-random number generators.
The key building blocks for Kyber computations are the Number Theoretic Transform (NTT) and the Keccak-f[1600] core. Polynomial products are compute-intensive (time complexity is $O(n^2)$), but if performed in the NTT domain they are far more efficient (time complexity reduces to $O(n \log n)$). The Keccak core is used in pseudo-random number generation.
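
For reference, the quadratic-time baseline that the NTT replaces is schoolbook multiplication in the ring, where the reduction modulo the polynomial x^n + 1 turns every wrapped-around term into a subtraction. The function name below is illustrative; only the parameter q = 3329 comes from the text.

```python
# Schoolbook multiplication in Z_q[x]/(x^n + 1): n*n coefficient products.
# The function name is illustrative; q = 3329 is Kyber's modulus.

Q = 3329

def negacyclic_mul(f, g, q=Q):
    n = len(f)
    assert len(g) == n
    h = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                h[k] = (h[k] + f[i] * g[j]) % q
            else:
                # x^n = -1 in this ring, so the wrapped term changes sign
                h[k - n] = (h[k - n] - f[i] * g[j]) % q
    return h

# Small check with n = 4: x * x^3 = x^4 = -1 (mod x^4 + 1), and -1 = 3328 mod q
print(negacyclic_mul([0, 1, 0, 0], [0, 0, 0, 1]))   # [3328, 0, 0, 0]
```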

Various methods exist for implementing an NTT unit [8]. For Kyber, 256-point NTT is to be computed by using two independent 128-point NTTs. Here, FIFOs are used to receive inputs and send outputs. The data is stored in registers (instead of BRAMs) for parallel memory access. Pipelining is done by running multiple ForLoopSections under a ParallelSection as shown. A single LeafSection is present under each ForLoopSection (describing a single pipeline stage).

with ParallelSections ("PS_1"):
    with ForLoopSection ("FLS_1", "i", 0, N):
        with LeafSection ("LS_1"):
            ...stage 1 computation...
    with ForLoopSection ("FLS_2", "j", 1, N+1):
        with LeafSection ("LS_2"):
            ...stage 2 computation...
    with ForLoopSection ("FLS_3", "k", 2, N+2):
        with LeafSection ("LS_3"):
            ...stage 3 computation...

For Vivado HLS implementation, pipeline pragma is used. Inverse NTT is described similarly, followed by an extra pipeline in the end for multiplying with $n^{-1}$.

The Keccak-f[1600] function (also called block transformation) involves 5 steps ($\theta$, $\rho$, $\pi$, $\chi$ and $\iota$) [9]. The computation takes 24 iterations (described using ForLoopSection) and each iteration takes 9 clock cycles (i.e. 9 LeafSections). Static elaboration of Python for-loop is used under each LeafSection to perform operations in parallel. For Vivado HLS implementation, unroll pragma is used.

Table 1 compares Vivado HLS and Python DSL implementations of the mentioned hardware units. All the units generated by both tools can run at a clock frequency of 125 MHz on PYNQ-Z2 FPGA.

Unit             Vivado HLS   Python DSL (no pipeline)   Python DSL (pipeline)
NTT-256          3141         3136                       454
INTT-256         3272         3904                       587
Keccak-f[1600]   673          216                        -
TABLE 1: PQC blocks using Vivado HLS and Python DSL (clock cycles)

IV-B Matrix Multiplication

The next example presents an integer matrix multiplier design utilizing SimpleFIFOs and SimpleBRAMs, along with a single pipelined multiply-add (MAC) unit. It serves as a test-bed to compare the performance of Python DSL-generated Verilog code against manually written Verilog code for the same matrix multiplication algorithm across square matrices of various dimensions. Both implementations maintain identical interfaces and follow the same computational steps. The comparison aims to understand the trade-offs between these coding methods using various metrics. The multiplication involves two matrices, $A_{N \times Q}$ and $B_{Q \times M}$, resulting in a product matrix $C_{N \times M}$. The study focuses on square matrices where $N = Q = M$.

Initialization of SimpleBRAMs

Matrices $A_{N \times Q}$ and $B_{Q \times M}$ are fed to SimpleBRAMs through InputFIFOs, and SimpleBRAM C is initialized with zeros in parallel.

Three ForLoopSections (R_A, R_B and R_C) are run in parallel using the ‘with ParallelSections ("par_A_B_C")’ construct to exploit the available parallelism when initializing the SimpleBRAMs.

with ParallelSections ("par_A_B_C"):
  with ForLoopSection ("R_A", "p", 0, N * Q):
    with LeafSection ("recv_A"):
      A.writeData (p, fA.read ())
  with ForLoopSection ("R_B", "q", 0, Q * M):
    with LeafSection ("recv_B"):
      B.writeData (q, fB.read ())
  with ForLoopSection ("R_C", "r", 0, N * M):
    with LeafSection ("Initialize_C"):
      C.writeData (r, Const (32, 0))

Matrix Multiplication

After initialization of SimpleBRAMs, by accessing specific addresses, the data from SimpleBRAMs will be enqueued into MAC and the result of MAC will be stored in SimpleBRAM C. The following algorithm will be followed.

Algorithm 1 Matrix Multiplication
1: Initialized $A$, $B$, and $C$ SimpleBRAMs
2: for $k = 0$ to $Q - 1$ do
3:     for $i = 0$ to $N - 1$ do
4:         for $j = 0$ to $M - 1$ do
5:             $C[i*M+j] \leftarrow A[i*Q+k] \cdot B[k*M+j] + C[i*M+j]$
6:         end for
7:     end for
8: end for
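
A software reference model of Algorithm 1 is useful for checking the hardware results. The sketch below keeps the same loop order and flat, row-major addressing; plain Python lists stand in for the SimpleBRAMs, and the function name is illustrative.

```python
# Software reference for Algorithm 1 on flat, row-major arrays.
# Plain Python lists stand in for the SimpleBRAMs; the name is illustrative.

def matmul_flat(A, B, N, Q, M):
    C = [0] * (N * M)                       # C starts at zero, as in the design
    for k in range(Q):
        for i in range(N):
            for j in range(M):
                C[i * M + j] += A[i * Q + k] * B[k * M + j]   # one MAC operation
    return C

A = [1, 2,
     3, 4]
B = [5, 6,
     7, 8]
print(matmul_flat(A, B, 2, 2, 2))   # [19, 22, 43, 50]
```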

Refer to Figure 4 for the architecture of the matrix multiplier.

Figure 4: Matrix Multiplication Hardware

Code Efficiency

In terms of the number of lines of code, the DSL-generated Verilog code is approximately 43.94% larger than the manually written Verilog code.

Code                            Lines of Code
DSL generated Verilog code      416
Manually written Verilog code   289
TABLE 2: Code Efficiency

Performance

To measure the execution time in RTL simulation, both codes were tested using the same Verilog testbench. The number of clock cycles between the assertion of the start signal and the assertion of the done signal was compared for matrix multiplication of square matrices of various dimensions. The number of clock cycles is plotted against the matrix dimensions in Figure 5.

Figure 5: Clock Cycles taken for RTL Simulation

Resource Utilization

The hardware resource utilization for matrix multiplication on the EP4CE22F17C6 device (De0-Nano FPGA Board) was analyzed and compared for Python DSL-generated and manually written Verilog codes. Compilation reports from Quartus Prime Lite 18.1 were used for this analysis. The utilization of resources such as logic elements, flip-flops, and BRAMs was plotted for matrix multiplication of square matrices of various dimensions. Figures 6, 7, and 8 depict the comparisons.

Figure 6: Comparison of Memory Bits
Figure 7: Comparison of Number of Registers
Figure 8: Comparison of Number of Logic Elements

IV-C Stereo-vision

Stereo-vision is an imaging technique used to obtain 3D measurements of an arbitrary scene from 2D images of the same scene captured from different viewpoints with a stereo camera. It relies on the principle of triangulation. Real-time stereo-vision processing of the scene involves calibrating the captured images, finding the correspondence (disparity) between them, and calculating depth from the disparity values and the geometric parameters of the imaging configuration.
Semi-global matching (SGM) [10, 11] is used in stereo vision for dense disparity map computation; it estimates disparities (depth information) between corresponding points in stereo images more accurately than local block-matching techniques. SGM uses a dynamic programming approach to optimize the energy cost function of a pixel. This work presents a variant of SGM called MGM-4 (More Global Matching), wherein neighbouring pixel disparity variations along four directions (top, top-left, top-right, left) are considered. For stereo-image inputs (image height $H$, image width $W$) and search range $D$, Table 3 lists the steps of SGM and how many times each is executed.

S.No   Steps used in SGM                               Execution count
1      Input pixel reading                             $H \times W$
2      Matching cost computation                       $H \times W \times D$
3      Path cost computation (along different paths)   $4 \times H \times W \times D$
4      Average sum cost computation                    $H \times W \times D$
5      Disparity assignment                            $H \times W$
TABLE 3: SGM Optimization

Step 3 is executed the largest number of times during the disparity computation of one image pair, and the computational cost grows as the input image size scales up. Cost optimisation along the 4 paths can therefore be computed in parallel, reducing the time complexity of this step from $4 \times H \times W \times D$ to $H \times W \times D$. Every step depends on its previous step, except for step 2 and step 3; hence, step 2 and step 3 can also be parallelized.
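
The execution counts in Table 3 can be evaluated for concrete sizes with a few lines of Python. The function and key names are illustrative, and the numbers are operation counts rather than measured clock cycles.

```python
# Execution counts from Table 3, with and without parallel path-cost units.
# Function and key names are illustrative; these are counts, not clock cycles.

def sgm_step_counts(H, W, D, paths=4, parallel_paths=False):
    path_factor = 1 if parallel_paths else paths
    return {
        "input pixel reading": H * W,
        "matching cost": H * W * D,
        "path cost": path_factor * H * W * D,
        "average sum cost": H * W * D,
        "disparity assignment": H * W,
    }

# Example: an 8 x 8 image with a search range of 5
sequential = sgm_step_counts(8, 8, 5)
parallel = sgm_step_counts(8, 8, 5, parallel_paths=True)
print(sequential["path cost"], parallel["path cost"])   # 1280 320
```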

Python DSL implementation of SGM is elaborated below.

  • Two input image pixel values are sent using FIFOs and are stored in registers for internal computations.

  • Internal buffers and arrays are initialized in parallel.

    with ParallelSections ("P1"):
        with ForLoopSections ("inp","v0",0,size):
            with LeafSection ("LS_1"):
                ...read pixels from input FIFOs...
        with ForLoopSections ("bf","v1",0,size):
            with LeafSection ("LS_2"):
                ...initialize buffer and array...
    
  • Correspondence values are computed pixel by pixel using SGM, by calculating disparity volume and disparity decision.

    with ForLoopSections ("r",v1,0,height):
        with ForLoopSections ("c",v2,0,width):
            ...update buffers and arrays...
            with ForLoopSections ("d",v3,0,range):
                ...matching cost calculation...
                with LeafSection ("path_cost_cal"):
                    for var4 in range(n_dir):
                        ..compute path_cost...
                ...compute sum_cost...
                ...compute avg(MGM-4)...
                ..disparity decision...
        ...assign the final disparity...
    
  • Output disparity is written into FIFO and is read from it for depth calculation as a final step.

Table 4 compares Python DSL implementation against Vivado HLS and PandA-bambu HLS [12] implementations of SGM (input images of size $8 \times 8$, filter size $3 \times 3$ and search range $5$). All units are synthesized on PYNQ-Z2 FPGA and can run at a clock frequency of 100 MHz.

Tool         Design Model                         Clock Cycles
Vivado HLS   Without pragmas                      11833
Vivado HLS   With parallel cost computations      2021
bambu HLS    Without any optimizations            9080
bambu HLS    With level 3 compiler optimization   1994
Python DSL   With parallel cost computation       1247
TABLE 4: SGM using Vivado HLS, bambu HLS and Python DSL

IV-D Fast Fourier Transform

FFT is arguably the most widely used algorithm in signal processing. It is used to obtain the frequency-domain spectrum of a time-domain signal [13]. In this case study, a 256-point FFT is designed using the Python DSL with a FIFO interface. Inputs are taken through input FIFOs and stored in BRAMs. This FFT implementation uses the Cooley-Tukey algorithm. A 256-point FFT requires 8 stages. After each computation stage, outputs are stored in BRAMs and the next stage takes its inputs from these BRAMs. The final outputs are taken out using an output FIFO. For illustration, Figure 9 shows the hardware architecture of an 8-point FFT implemented using the Python DSL. The same architecture has been extended to compute the 256-point FFT.

Figure 9: FFT Hardware Architecture
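
As a software reference for the staged computation, the sketch below implements an iterative radix-2 decimation-in-time Cooley-Tukey FFT, in which each pass of the outer loop corresponds to one stage. It is a plain floating-point model written for illustration and makes no claim about the number format or memory layout used in the hardware design.

```python
import cmath

def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r

def fft(x):
    """Iterative radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    stages = n.bit_length() - 1
    assert n == 1 << stages, "length must be a power of two"
    out = [x[bit_reverse(i, stages)] for i in range(n)]   # reorder the inputs
    size = 2
    while size <= n:                       # one pass per stage
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                a = out[start + k]
                b = out[start + k + half] * w
                out[start + k] = a + b             # butterfly
                out[start + k + half] = a - b
                w *= w_step
        size *= 2
    return out

# A unit impulse has a flat spectrum
print([round(abs(v), 6) for v in fft([1, 0, 0, 0])])   # [1.0, 1.0, 1.0, 1.0]
# Number of stages for 256 points
print((256).bit_length() - 1)                           # 8
```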

The Python DSL implementation takes 12606 clock cycles to perform 256-point FFT. The same design using Vivado HLS without exploiting parallelism and pipelining takes 12897 clock cycles for the computation. Both are synthesized on PYNQ-Z2 FPGA with a target clock frequency of 100 MHz.

IV-E Discrete Wavelet Transform

DWT is an important algorithm used in signal and image processing applications. Image compression is one of the prominent applications that use DWT. DWT can be computed using various kernel functions, the simplest one being the Haar wavelet [14]. This wavelet involves averaging and differencing operations to compress the image.

The architecture of the image compression consists of a row-processing module and a column-processing module. These modules perform averaging and differencing operations row-wise and column-wise respectively on the input pixels. The image is divided into processing blocks of 16 pixels each which are enqueued into input FIFOs and after the processing dequeued from output FIFOs. The intermediate pixel values are stored in BRAMs after the row-processing module. Figures 10 and 11 show the architectures of the row and column-processing modules used in the Python DSL implementation of DWT.
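
The averaging and differencing steps can be sketched in software as follows. The sketch assumes the un-normalized Haar form (pairwise sum and difference, each divided by two) and floating-point values; the scaling and number format used in the hardware design are not stated in the text, and the helper names are illustrative.

```python
# One level of Haar averaging and differencing (illustrative helper names).
# Assumes the un-normalized form: (a + b) / 2 and (a - b) / 2.

def haar_1d(values):
    """Return pairwise averages followed by pairwise half-differences."""
    pairs = list(zip(values[0::2], values[1::2]))
    averages = [(a + b) / 2 for a, b in pairs]
    differences = [(a - b) / 2 for a, b in pairs]
    return averages + differences

def haar_2d(image):
    """Apply haar_1d to every row, then to every column of the result."""
    rows = [haar_1d(row) for row in image]
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]   # transpose back to row-major

print(haar_1d([8, 4, 6, 2]))   # [6.0, 4.0, 2.0, 2.0]
```

For a constant image, all the energy ends up in the top-left (average) coefficient, which is what makes the transform useful for compression.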

Figure 10: DWT Row Processing Module
Figure 11: DWT Column Processing Module

The Python DSL implementation takes approximately 10560 clock cycles to compress an image of $32 \times 32$ pixels.

IV-F Digital Correlator

A digital correlator is widely used in signal processing applications, such as detecting characteristics of an input signal with respect to reference signals. One application is to detect the signal amplitude at particular frequencies in an input signal comprising different frequencies, by calculating the maximum value of its correlation with the reference signal.
In the Python DSL, this correlation application is modelled using the linear buffering algorithm. The flipped version of the reference signal is stored in BRAM using an input FIFO. Given the reference frequency and the sampling frequency, 37 samples covering one period of the reference signal are stored as unsigned integers in a 32-bit fixed-point representation. In the linear buffering algorithm, input data is received sequentially using an input FIFO. The linear buffer, which has a size of 512, is shifted, storing the latest sample in the initial position. Accumulation is then performed with the reference signal. The correlator hardware takes 3586 clock cycles. For hardware running at 100 MHz, the design can therefore handle input signals with a sampling frequency of at most 27.8 kHz.
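
The shift-and-accumulate behaviour can be sketched in software as follows. The function name and the tiny buffer and reference sizes are chosen for illustration only, and the sketch assumes that the reference is already stored flipped, as described above. The last two lines work through the sampling-rate bound, assuming the 3586 cycles are spent per input sample.

```python
# Sketch of a linear-buffer correlator (names and tiny sizes are illustrative).

def correlate_stream(samples, reference, buffer_size):
    """Shift each new sample into a linear buffer and accumulate against `reference`.

    `reference` is assumed to be stored already flipped, so the newest sample
    (buffer position 0) is multiplied by reference[0].
    """
    buffer = [0] * buffer_size
    outputs = []
    for sample in samples:
        buffer = [sample] + buffer[:-1]          # shift; newest sample at position 0
        acc = sum(b * r for b, r in zip(buffer, reference))
        outputs.append(acc)
    return outputs

out = correlate_stream([1, 2, 3, 0, 0], reference=[3, 2, 1], buffer_size=3)
print(out, max(out))   # [3, 8, 14, 8, 3] 14

# Sampling-rate bound: one result every 3586 cycles at 100 MHz (assumed per sample)
max_sample_rate_hz = 100e6 / 3586
print(round(max_sample_rate_hz))   # 27886
```

The peak of the output sequence occurs when the buffered samples line up with the reference, which is the maximum-correlation value the application looks for.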

Figure 12: Folding correlator architecture

The Python DSL tool can be used to parallelize such designs by implementing folded architectures. Figure 12 shows a folded-by-4 architecture for a sample size of 16 as a test example which can be scaled to large sizes. The architecture uses a chain of FIFOs (BRAMs as FIFOs) to schedule the inputs and reference signals.

IV-G Butterfly Mating Optimization Algorithm

Inspired by the mating behavior of butterflies, the BMO algorithm [15] tackles complex optimization problems with multiple possible solutions. In place of the traditional “natural butterfly” concept, the algorithm introduces “Bot” entities that explore the search space. It starts with a random initialization of the Bots ($x, y$ coordinates) in the search space, with the aim of reaching the target position. In the UV updating phase, each Bot's own UV value is updated based on the positions of the Bots. In the UV distribution phase, each Bot then distributes its updated UV to the remaining Bots such that the nearest Bot receives more than the farthest one. Once a Bot has received UVs from the other Bots, it selects the Bot that distributed the maximum UV value to it as its local mate. In the movement phase, the Bot moves towards its local mate by updating its position according to the Bot's step size. Figure 13 illustrates the core aspects of the BMO algorithm and gives an overview of certain optimizations. The following pseudo-code (Algorithm 2) outlines the key steps.

Input: Number of Bots (Bot-count), pre-initialized Bots & source locations
Set all Bots' $UV_i = 0$
Number of iterations $\leftarrow$ iteration-count
$it \leftarrow 1$
while $it \leq$ iteration-count do
     $B \leftarrow 1$
     while $B \leq$ Bot-count do
          Update UV
          Distribute UV
          L-mate selection
          Position Update
          $B \leftarrow B + 1$
     end while
     $it \leftarrow it + 1$
end while
Algorithm 2 BMO algorithm
Figure 13: Flowchart of BMO Algorithm
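The four phases of Algorithm 2 can be sketched as a small floating-point reference model. This is a minimal sketch and not the hardware description: the fitness used for the UV update (larger when closer to the source), the inverse-distance weighting in the distribution phase, and the names `distribute_uv` and `bmo_iteration` are all assumptions made for illustration. For brevity the sketch runs each phase over all Bots in turn instead of completing all four phases per Bot.

```python
import math


def distribute_uv(uv, positions, i):
    """Split Bot i's UV among the other Bots so that nearer Bots get more
    (inverse-distance weighting is an assumption)."""
    weights = {}
    for j, pos in enumerate(positions):
        if j != i:
            weights[j] = 1.0 / max(math.dist(positions[i], pos), 1e-9)
    total = sum(weights.values())
    return {j: uv * w / total for j, w in weights.items()}


def bmo_iteration(positions, source, step):
    """One iteration: update UV, distribute UV, select l-mate, move."""
    # UV update: hypothetical fitness based on distance to the source.
    uvs = [1.0 / (1.0 + math.dist(p, source)) for p in positions]

    # UV distribution: received[j][i] is the share Bot j got from Bot i.
    received = [dict() for _ in positions]
    for i in range(len(positions)):
        for j, share in distribute_uv(uvs[i], positions, i).items():
            received[j][i] = share

    new_positions = []
    for i, (x, y) in enumerate(positions):
        # L-mate selection: the Bot that sent the largest share to Bot i.
        mate = max(received[i], key=received[i].get)
        mx, my = positions[mate]
        d = math.dist((x, y), (mx, my))
        if d > 0:
            move = min(step, d)  # do not overshoot the l-mate
            x += move * (mx - x) / d
            y += move * (my - y) / d
        new_positions.append((x, y))
    return new_positions


shares = distribute_uv(1.0, [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)], 0)
moved = bmo_iteration([(0.0, 0.0), (4.0, 0.0)], source=(4.0, 0.0), step=1.0)
print(shares, moved)
```

A model of this kind can serve as a golden reference when checking the positions computed by the generated hardware.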

The computed positions of 4 Bots after 40 iterations achieve an accuracy of 98.2% (see Table 5). Increasing the iteration count beyond 40 does not improve the accuracy much for a given step size. Table 6 shows the synthesis reports and cycle counts of the BMO algorithm implemented for 4 Bots using the Python DSL and Vivado HLS (without the pipeline pragma), targeting the ZCU104 board.

| Number of Bots | Iterations | Clock Cycles | Error |
|---|---|---|---|
| 2 | 20 | 852 | 1.2% |
| 4 | 20 | 852 | 12.2% |
| 4 | 40 | 1692 | 1.8% |

TABLE 5: Clock Cycle Count for Different Number of Bots

| Parameter | Python DSL | Vivado HLS |
|---|---|---|
| Clock Cycles | 2438 | 321553 |
| Frequency of Operation | 100 MHz | 100 MHz |
| Co-Simulation Latency | 24 μs | 3.21 ms |
| FF | 1% | 2% |
| LUT | 2% | 4% |
| DSP | 0% | 2% |
| BRAM | 0% | 1% |

TABLE 6: BMO algorithm using Vivado HLS and Python DSL

V Future Work

Some interesting extensions and potential additional features are outlined in this section. These can be developed in the existing DSL framework itself.

V-A Detecting conflicting updates

Static code analysis frameworks such as Frama-C can estimate the set of values a variable may take when execution reaches a given line of source code. Such a tool could be used to detect whether two LeafSection blocks that update the same state element can be active in the same clock cycle. Further, one may think of automatically breaking a LeafSection into multiple LeafSections.
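The check itself is simple once the analysis results are available. The sketch below assumes, purely for illustration, that a static analysis has already produced for each LeafSection the set of state elements it writes and the set of controller states in which it may be active; the data layout and the name `find_conflicts` are hypothetical and are not part of the DSL.

```python
from itertools import combinations


def find_conflicts(sections):
    """Report pairs of sections that write a common state element and
    may be active in a common controller state.

    sections: dict mapping a section name to a tuple
              (set of written state elements, set of active states).
    """
    conflicts = []
    for (a, (writes_a, states_a)), (b, (writes_b, states_b)) in combinations(
        sections.items(), 2
    ):
        shared_elements = writes_a & writes_b
        shared_states = states_a & states_b
        if shared_elements and shared_states:
            conflicts.append((a, b, sorted(shared_elements)))
    return conflicts


# Hypothetical analysis result for three LeafSections.
sections = {
    "load": ({"acc"}, {0, 1}),
    "mac": ({"acc", "idx"}, {1, 2}),
    "done": ({"idx"}, {3}),
}
print(find_conflicts(sections))
```

Here `load` and `mac` are flagged because both write `acc` and can both be active in state 1, while `mac` and `done` share `idx` but are never active together.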

V-B Breakpoints in Hardware

Debuggers such as gdb are immensely useful to software developers. A debugger allows tracing the execution of a program, often without even recompiling the code. As HLS tools allow users to express the model in a higher-level language such as C/C++/Python, it would be interesting to also let the user set a breakpoint at a line of the input source code. When the update operation specified by that line becomes active in hardware, the hardware could be paused, allowing the user to inspect the values of all other state elements. This feature would be useful when the entire HDL model cannot be simulated to debug interfacing problems with I/O, sensors, etc.

V-C Dynamic memory allocation

HLS tools are typically used to design large parts of an entire application (as opposed to RTL design, where low-level design is done module by module). These applications are run on a reconfigurable SoC, e.g. ARM + FPGA (such as Xilinx Zynq, Intel Stratix), or with Microblaze (Xilinx) / Nios (Intel) CPUs implemented in the FPGA itself. Interconnects such as AXI/Avalon and compatible DDR controller modules make it easy for hardware IPs to access DDR DRAM. Applications may need dynamic memory allocation. It is better to leave the DDR memory management to the CPU, since it typically involves maintaining boundary tags, various flags, free lists for various-sized buffers, etc. Hardware modules could be made to interrupt the CPU with an allocation/de-allocation request, and the CPU could then manage the memory that the hardware uses as it continues. This feature can be prototyped in this DSL easily.

VI Conclusion

In this paper, a Python DSL for generating Verilog models of synchronous digital circuits is introduced. Details on how the DSL has been constructed and how to use it have been provided. Various case studies have been presented, illustrating how the Python DSL can be used for rapid prototyping of complicated hardware. Finally, ideas for extending the DSL have been provided. The source code for all the case studies presented in this paper is available in the GitHub repository: https://github.com/HPC-Lab-IITB/Python-DSL. The source code for the Python DSL will be released upon publication of the paper.

References

  • [1] J. M. P. Cardoso and P. C. Diniz, Compilation Techniques for Reconfigurable Architectures.   Springer Publishing Company, Incorporated, 2010.
  • [2] R. S. Nikhil, “Bluespec System Verilog: efficient, correct RTL from high level specifications,” in MEMOCODE.   IEEE Computer Society, 2004, pp. 69–70. [Online]. Available: http://dblp.uni-trier.de/db/conf/memocode/memocode2004.html#Nikhil04
  • [3] D. Lockhart, G. Zibrat, and C. Batten, “PyMTL: A unified framework for vertically integrated computer architecture research,” in 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014, pp. 280–292.
  • [4] J. Decaluwe, “MyHDL: a Python-based hardware description language,” https://www.linuxjournal.com/article/7542, 2004.
  • [5] A. Meurer, C. P. Smith, M. Paprocki, O. Čertík, S. B. Kirpichev, M. Rocklin, A. Kumar, S. Ivanov, J. K. Moore, S. Singh, T. Rathnayake, S. Vig, B. E. Granger, R. P. Muller, F. Bonazzi, H. Gupta, S. Vats, F. Johansson, F. Pedregosa, M. J. Curry, A. R. Terrel, Š. Roučka, A. Saboo, I. Fernando, S. Kulal, R. Cimrman, and A. Scopatz, “SymPy: symbolic computing in Python,” PeerJ Computer Science, vol. 3, p. e103, Jan. 2017. [Online]. Available: https://doi.org/10.7717/peerj-cs.103
  • [6] C. Drake, “PyEDA: a Python library for electronic design automation,” https://github.com/cjdrake/pyeda, 2016.
  • [7] Avanzi, Bos, Ducas, Kiltz, Lepoint, Lyubashevsky, Schanck, Schwabe, Seiler, and Stehlé, “CRYSTALS-Kyber algorithm specifications and supporting documentation,” https://pq-crystals.org/kyber/data/kyber-specification-round3-20210804.pdf, 2021.
  • [8] A. C. Mert, E. Karabulut, E. Öztürk, E. Savaş, and A. Aysu, “An extensive study of flexible design methods for the number theoretic transform,” IEEE Transactions on Computers, vol. 71, no. 11, pp. 2829–2843, 2022.
  • [9] A. Dolmeta, M. Martina, and G. Masera, “Hardware architecture for CRYSTALS-Kyber post-quantum cryptographic SHA-3 primitives,” in 2023 18th Conference on Ph.D Research in Microelectronics and Electronics (PRIME), 2023, pp. 209–212.
  • [10] R. Spangenberg, T. Langner, S. Adfeldt, and R. Rojas, “Large scale semi-global matching on the CPU,” in 2014 IEEE Intelligent Vehicles Symposium Proceedings, 2014, pp. 195–201.
  • [11] P. Sawant, Y. Temburu, M. Datar, I. Ahmed, V. Shriniwas, and S. B. Patkar, “Single storage semi-global matching for real time depth processing,” ArXiv, vol. abs/2007.03269, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:220381365
  • [12] F. Ferrandi, V. G. Castellana, S. Curzel, P. Fezzardi, M. Fiorito, M. Lattuada, M. Minutoli, C. Pilato, and A. Tumeo, “Invited: Bambu: an open-source research framework for the high-level synthesis of complex applications,” in 2021 58th ACM/IEEE Design Automation Conference (DAC), 2021, pp. 1327–1330.
  • [13] W. Cochran, J. Cooley, D. Favin, H. Helms, R. Kaenel, W. Lang, G. Maling, D. Nelson, C. Rader, and P. Welch, “What is the fast Fourier transform?” IEEE Transactions on Audio and Electroacoustics, vol. 15, no. 2, pp. 45–55, 1967.
  • [14] H. Kanagaraj and V. Muneeswaran, “Image compression using Haar discrete wavelet transform,” in 2020 5th International Conference on Devices, Circuits and Systems (ICDCS), 2020, pp. 271–274.
  • [15] C. Jada, L. Chintala, A. Urlana, S. G. Basha, and P. Baswani, “Bflybot: Mobile robotic platform for implementing butterfly mating phenomenon,” IFAC-PapersOnLine, vol. 51, no. 1, pp. 512–517, 2018, 5th IFAC Conference on Advances in Control and Optimization of Dynamical Systems ACODS 2018. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2405896318302507