CIS: Composable Instruction Set for Streaming Applications: Design, Modeling, and Scheduling

Yu Yang (yuyang2@kth.se, ORCID 0000-0003-2396-3590), KTH Royal Institute of Technology, Stockholm, Sweden; Jordi Altayó González (jordiag@kth.se, ORCID 0000-0002-7693-6994), KTH Royal Institute of Technology, Stockholm, Sweden; and Ahmed Hemani (hemani@kth.se, ORCID 0000-0003-0565-9376), KTH Royal Institute of Technology, Stockholm, Sweden

Abstract.

The efficiency improvement of hardware accelerators such as single-instruction-multiple-data (SIMD) and coarse-grained reconfigurable architecture (CGRA) empowers the rapid advancement of AI and machine learning applications. These streaming applications consist of numerous vector operations that can be naturally parallelized. Despite the outstanding achievements of today’s hardware accelerators, their potential is limited by their instruction set design. Traditional instruction sets, designed for microprocessors and accelerators, focus on computation and pay little attention to instruction composability and instruction-level cooperation. This leads to rigid instruction sets that are difficult to extend and to significant control overhead in hardware. This paper presents an instruction set that is composable in both the spatial and the temporal sense and is suitable for streaming applications. The proposed instruction set contains significantly fewer instruction types but can still efficiently implement complex multi-level loop structures, which is essential for accelerating streaming applications. It is also a resource-centric instruction set that can be conveniently extended by adding new hardware resources, thus creating a custom heterogeneous computation machine. Besides presenting the composable instruction set, we propose a simple yet efficient instruction scheduling algorithm. We analyze the scalability of the scheduling algorithm and compare the efficiency of our compiled programs against RISC-V programs. The results indicate that our scheduling algorithm scales linearly and that our instruction set leads to near-optimal execution latency. Applications mapped onto CIS are nearly 10 times faster than their RISC-V versions.

ISA, streaming application, CGRA, instruction scheduling, constraint programming

CCS Concepts: Hardware → Methodologies for EDA; Computer systems organization → Parallel architectures

1. Introduction

Streaming applications, dominated by data-parallel vector operations, are widely used in AI and machine learning, signal processing, multimedia, and many other application domains. Accelerating these streaming applications and reducing their energy consumption is essential. The key to achieving this goal lies in improving the execution of the static loops that dominate the operations of streaming applications. Next, we show a motivational example to introduce the concept of resource-centric instructions and their benefits compared to traditional computation-centric instructions.

for(i=0; i<64; i++){
    A[i] = A[i] + 1;
}

Listing 1: A static C program

Consider the simple C program in Listing 1, where we perform an “ADD-1” operation 64 times. A conventional processor compiler will translate the C program to assembly code similar to Listing 2. The instructions for traditional processors and the majority of configurable accelerators are computation-centric. Each instruction mobilizes all necessary resources to implement the complete instruction functionality. For example, the key instruction in Listing 2, “add A[i], A[i], 1”, modifies the storage field $A[i]$ by adding 1 to it. It commands at least four individual resources: the datapath multiplexers that build the correct data transfer path, the storage read port that reads $A[i]$ , the storage write port that writes $A[i]$ , and the arithmetic unit that performs the “ADD-1” operation. Computation-centric instructions are inherently sequential in the scope of the controller that issues them. Multiple instructions cannot be executed simultaneously because they conflict over resources. Even if we add more instruction issue slots, as in VLIW processors, the degree of instruction-level parallelism (ILP) cannot be improved dramatically. Fetching, decoding, and executing a computation-centric instruction is also costly in terms of power consumption because the controller is usually far away from the resources it controls; wires dominate the energy and latency cost as the technology scales. Increasing the degree of parallelism does not reduce the cost of instructions, since the total number of instruction issues is unaffected by parallelism; each instruction still has to be fetched, decoded, and executed (Qadeer et al., 2013).

    load i, 0
LOOP:
    compare i, 64
    branch >=, END
    add A[i], A[i], 1
    add i, i, 1
    jump LOOP
END:
    nop

Listing 2: Computation-centric pseudo assembly code

In this paper, we introduce resource-centric instructions. In contrast to computation-centric instructions, resource-centric instructions only configure and control a single resource. By limiting the scope of the local controller to a single resource, we can place the controller near the resource it controls and reduce the complexity of its decoding logic and control state machine, thus reducing the execution cost of each instruction. Resource-centric instructions are inherently parallel. They cooperate with other instructions in a cycle-accurate manner. The operation emerges from the cooperative behavior of these instructions. If we compile the static C program in Listing 1 to resource-centric instructions, we will get the pseudo assembly code similar to Listing 3. We can see that the 64-element vector addition is achieved by issuing only four instructions. It is a considerable reduction in control overhead compared to processor ISA, which needs to repeatedly issue the add instruction and many other auxiliary instructions in a loop.

Interconnect : forever { make connection }
Storage Read Port : repeat (i=0:64) { read A[i] }
Storage Write Port : repeat (i=0:64) { write A[i] }
Computation : forever { perform ADD-1 }

Listing 3: Resource-centric pseudo assembly code

Resource-centric instructions have spatial composability. Vector operations such as “adding 1 to every element in a vector of 64 numbers” can be constructed not by repeating the same “add” instruction 64 times in a loop but by the collaboration of concurrent and independent micro-threads running on different resources: one creates the necessary datapath connection; one reads out the data from the vector; one writes back the computed results to the original vector; and the last one performs the “ADD-1” arithmetic operation. The complex vector operation emerges from the collaboration of these micro-threads, each of which is bound to a single resource.
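The cooperation of the four micro-threads can be illustrated with a small cycle-level model. The following Python sketch is only an illustration under stated assumptions: the two registers `read_reg` and `result_reg` stand in for the interconnect paths, and the one-cycle latency of each stage (so that a write happens two cycles after the matching read) is assumed for the example rather than taken from a hardware specification.

```python
# Hypothetical cycle-level model of the four micro-threads in Listing 3.
# Assumption: each stage (read, compute, write) takes exactly one cycle.
N = 64
A = list(range(N))                 # storage contents before the operation
expected = [a + 1 for a in A]      # what the C loop in Listing 1 would produce

read_reg = None     # models the path storage -> computation unit
result_reg = None   # models the path computation unit -> storage

for cycle in range(N + 2):
    # Storage write port: repeat (i=0:64) { write A[i] }, started at cycle 2.
    w = cycle - 2
    if 0 <= w < N:
        A[w] = result_reg
    # Computation: forever { perform ADD-1 } on whatever value arrives.
    result_reg = read_reg + 1 if read_reg is not None else None
    # Storage read port: repeat (i=0:64) { read A[i] }, started at cycle 0.
    read_reg = A[cycle] if cycle < N else None
```

No thread knows about the loop as a whole; the vector addition appears only because the read and write address sequences are offset by the pipeline depth.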

When a vector operation to be mapped is complex, e.g., contains many levels of loops, it becomes impossible to configure a resource for such a complex action pattern with a single instruction because of the instruction's limited bitwidth. Therefore, giving the instructions temporal composability becomes necessary. Temporal composability allows the description of a complex operation to be constructed by combining simple instructions. For example, the complex operation “read elements from a vector following a 2D affine address pattern, starting from address 0, with the inner loop repeating 3 times and the outer loop repeating 5 times” can be constructed from three simple instructions: 1) read a number; 2) repeat 3 times with step=1; and 3) repeat 5 times with step=1. Note that conventional processor instructions are not temporally composable because each instruction exists independently. In our example, the “repeat” operation cannot exist independently of the instruction that “reads a number”. Temporally composable instructions must collaborate.
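As a rough illustration of this composition, the Python sketch below builds the 2D address pattern from one basic read and two stacked repeats. The helper names `read` and `repeat` are hypothetical, and the rule that each repeat adds `step` times its iteration index to every address of the block it wraps is an assumption made for this example.

```python
def read(address):
    """Basic instruction: a single read at a fixed address."""
    return [address]

def repeat(block, iterations, step):
    """Hypothetical repeat operator: replay `block`, shifting every
    address by `step` for each further iteration (assumed semantics)."""
    return [addr + i * step for i in range(iterations) for addr in block]

# 1) read a number; 2) repeat 3 times with step=1; 3) repeat 5 times with step=1
inner = repeat(read(0), iterations=3, step=1)
pattern = repeat(inner, iterations=5, step=1)
```

Neither `repeat` call describes a complete operation on its own; the address stream only exists once the repeats wrap a basic read, which is the sense in which these instructions must collaborate.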

We can see from the motivational example above that the composable instruction set (CIS), in both the spatial and the temporal sense, can naturally express loop structures in a distributed and concurrent manner. This makes it ideal for accelerating streaming applications dominated by static loops. Therefore, in this paper, we conduct a detailed analysis of the CIS, focusing on instruction design, modeling, and scheduling. The main contributions of this paper are:

• Introducing CIS for streaming applications.
• Creating a timing model and instruction scheduling algorithm that works with CIS.
• Demonstrating the instruction scheduling algorithm’s scalability and the CIS’s accelerating effects on typical streaming applications.

The rest of the paper is organized as follows: Section 2 reviews the state-of-the-art. Section 3 explains the composable instruction set (CIS) with a toy example for demonstration purposes. Section 4 focuses on solving the instruction scheduling problem. Section 5 presents the experiment results. Finally, in Section 6, we conclude this paper and outline future research directions.

2. State-of-the-Art

Processor instruction sets can be categorized into two families: Complex Instruction Set Computer (CISC) and Reduced Instruction Set Computer (RISC). It is well acknowledged by the community that the shift from CISC (e.g., x86 ISA (Dandamudi, 2013)) to RISC (e.g., MIPS (Britton, 2002), ARM (Knaggs and Welsh, 2004), RISC-V (Waterman et al., 2014) ISA) generally improves processor efficiency because RISC reduces the complexity of controller design by reducing the number of instruction types (Patterson and Ditzel, 1980). However, the processor ISA focuses on computation instead of individual resources. Computation-centric instructions are inherently sequential and cannot implement loop structures efficiently. The centralized controller also increases the cost of instruction issue because the control points in the datapath are far away from the centralized controller.

The micro-programmable instructions (Rauscher, 1980) look similar to temporally composable instructions, but they are, in fact, very different. The controller that supports micro-programmable instructions translates macro instructions to a micro-instruction sequence. Each micro-instruction is an independent instruction. It does not cooperate with other micro-instructions to create complex control structures like loops. A program of micro-programmable instructions is no different from a subroutine.

Architectures like the Graphics Processing Unit (GPU) use SIMD-style instructions (Franchetti et al., 2005). They are very similar to standard processor instructions, except that a single instruction performs a vector operation distributed spatially. SIMD has its limitations for vector acceleration because it can only tackle simple shallow loops. It is also very rigid because it has a fixed vector size.

Spatially composable ISAs have been proposed in the literature. The persistent and fully cooperative instructions proposed in (Yang et al., 2021, 2022) are resource-centric. However, the authors did not address the need for temporal composability to accommodate the configuration of complex operations. Instead, the authors use complex CISC-like long instructions, significantly complicating the controller decoding logic. In (Catthoor et al., 2010), the authors proposed a spatially composable ISA by introducing the concept of loop buffers for each resource. The programmable loop buffer is similar to a processor controller, which makes it very generic and flexible. However, it fails to address the problem of configuring complex operations. It also needs extra instructions for secondary tasks like address computation inside the loop structure.

Instruction sets with data-driven control logic have also been proposed in the literature, such as (Parashar et al., 2013). These instruction sets have too much control overhead and are not well suited to accelerating static streaming applications.

For instruction scheduling, as-soon-as-possible (ASAP) and as-late-as-possible (ALAP) scheduling algorithms (Micheli, 1994) are suitable for scheduling problems that do not consider resources. LIST scheduling (Graham, 1966) and force-directed scheduling (Paulin and Knight, 1989) are widely used for scheduling problems with resource management. However, they are all order-based scheduling algorithms, meaning they are not good at dealing with tight timing constraints. For scheduling problems that have tight timing constraints, integer linear programming (ILP) (Nowatzki et al., 2013; Chin and Anderson, 2018) and constraint programming (CP) (Perron and Furnon, 2024) are good tools. The challenge then usually becomes formulating the scheduling problem properly so that the ILP or CP solver can solve it quickly. Our work uses a CP solver to solve the scheduling problem.

3. Composable Instruction-Set Architecture

This section describes the characteristics of a composable instruction set architecture (CIS). To lay the foundation, we first explain the hardware architecture and then move to the concept of spatial and temporal composability in the context of ISA design. After that, the CIS is introduced through a toy example. We only include the essential components for simplicity. Finally, we discuss the hardware and compiler design implications of CIS. The explanation of the impact is kept at a high level of abstraction, as neither hardware architecture design nor compiler design is the focus of this paper.

3.1. Hardware Architecture

Before defining the CIS, we have to explain the conceptual hardware architecture that CIS will work on. The hardware architecture we introduce here is a high-level abstraction. We only expose the minimal hardware implementation details required to understand the CIS. As shown in Fig. 1, the hardware architecture template has a sequencer as controller and several resource slots. Each resource slot has two local FSMs and two ports, one for input and one for output. It can use part or all of the FSMs and ports, depending on the resource. The minimal hardware architecture instance that will be used for later demonstration includes three resources: a computation unit (C), an interconnection unit (I), and a storage unit (S). A colored FSM or port means that the resource uses it.

Figure 1. The hardware architecture is a template that consists of a sequencer and multiple resource slots. The hardware architecture instance will be used to demonstrate later examples.

The hardware template shown in Fig. 1 naturally forms a single computation or storage tile. We can construct a CGRA-style fabric by connecting those tiles into a 2-D grid. However, in this paper, we will only focus on a single tile because it is much simpler to analyze, and our focus is on the instruction set, not the hardware architecture design.

3.2. Spatial Composability

Spatial composability refers to the ability to construct a complex operation from many simple atomic operations distributed on different hardware resources. These simple atomic operations are self-contained and autonomous.

An instruction set with spatial composability is resource-centric. This means that each instruction in the instruction set targets a specific hardware resource. For example, one instruction could configure the interconnection switchboxes to build the path for the operands and return values of a function to be mapped; another instruction could perform a read operation to deliver one of the operands; another instruction could configure the arithmetic-logic unit (ALU) to perform the desired transformation function such as addition or multiplication. This way, the complete operation is decomposed into many small tasks, such as building data transfer paths, reading operands, performing arithmetic functions, etc. Each small task is carried out by a specific instruction that only interacts with a particular set of hardware resources. A more concrete example will be shown after introducing the instruction table for CIS.

3.3. Temporal Composability

Temporal composability refers to the ability to construct a complex operation from one or more atomic basic operations and some well-defined temporal transformation operators. The basic operations are stand-alone and can configure certain resources to a specific state. They are usually single-cycle operations. The temporal transformation operators can transform the timing properties of the basic operation and must be used with one or more basic operations.

The temporal property of an operation can usually be represented by a finite state machine (FSM). A basic operation can be treated as an event that forms a state in the FSM, and the temporal transformation operators create the transition edges in the FSM. Arbitrary FSMs are challenging to construct using a minimal set of transition patterns. Since we specifically target streaming applications in this paper, we only need two temporal transformation operators to implement most FSMs for streaming applications. The operators are “REPEAT” and “TRANSITION”. In the later sections, we will call them the $\mathbf{R}$ and $\mathbf{T}$ operators.

The “REPEAT” operator or $\mathbf{R}$ operator accepts only one instruction block as an argument. It makes its inner instruction block repeat multiple times. It corresponds to the FSM that implements the FOR-LOOP structure. A single $\mathbf{R}$ operator mimics a single layer of FOR-LOOP. We can implement a multi-level FOR-LOOP structure by stacking multiple $\mathbf{R}$ operators.

The “TRANSITION” operator or $\mathbf{T}$ operator accepts two instruction blocks as arguments. After a specific delay, it forces the transition from the first inner block to the second inner block. This operator corresponds to the FSM that implements a two-number counter. We can implement N-number counter FSM by stacking multiple $\mathbf{T}$ operators.

In section 4, we will introduce a formal way to express these temporal transformation operators.

When a streaming application requires more complex FSMs (e.g., WHILE-LOOP or IF-THEN-ELSE) that cannot be directly decomposed into a combination of $\mathbf{R}$ and $\mathbf{T}$ operators, we use traditional control instructions like COMPARE, BRANCH, or JUMP to implement the outer control flow. Then, the simpler inner control flow can be expressed by a combination of $\mathbf{R}$ and $\mathbf{T}$ operators. We remind readers that, for a typical streaming application, the inner control flow that the CIS can handle can reach 4-5 loop levels. The CIS does not just accelerate the innermost loop, as mainstream GPUs and CGRAs do.

3.4. Toy Example

We define a toy example of a CIS instance. It consists of seven instructions, as shown in Table 1. There are three types of instructions: resource, transform, and control.

Table 1. Example of a toy ISA consisting of only seven instructions.

Type	Name	Format and Description
Resource	`@C`	`@C [slot:FSM] [option] [function]`
		Configure a computation resource.
	`@I`	`@I [slot:FSM] [option] [path]`
		Create an interconnection between ports.
	`@S`	`@S [slot:FSM] [address]`
		Create a read/write for a storage unit.
Transform	`@R`	`@R [slot:FSM] [iter] [step] [delay]`
		Implementation of the $\mathbf{R}$ operator.
	`@T`	`@T [slot:FSM] [delay]`
		Implementation of the $\mathbf{T}$ operator.
Control	`@W`	`@W [delay]`
		The controller waits for specific cycles.
	`@A`	`@A [slot:FSM list]`
		Activate FSMs in the list.

Now, we can use CIS to map the simple vector addition shown in Listing 1. We must configure four FSMs to accomplish the vector addition: interconnection, storage read, storage write, and computation. The big FOR-LOOP is naturally decomposed into four concurrent operations shown in Fig. 2. Some operations have a loop structure; some do not. For example, the first FSM in slot-0 builds a connection from slot-1 to slot-2 and from slot-2 to slot-1. This operation does not require repetition because once a connection is established, it remains in place until it is reconfigured. On the contrary, the event on the first FSM in slot-1 must repeat 64 times because it must repeatedly write the computation result back to the storage unit. We remind readers that even though many local FSMs are running in parallel, the hardware architecture is not similar to VLIW, whose controllers are multi-issue. The “sequencer” in our architecture is a very simple single-issue controller.

To make the example more concrete, we can write down the complete assembly program that implements the vector addition as shown in Listing 4.

# Configure the interconnection path; both instructions configure option0 because both paths should exist simultaneously as configuration option 0.
@I slot0:FSM0 option0 slot1->slot2
@I slot0:FSM0 option0 slot2->slot1
# Set the computation unit to perform +1
@C slot2:FSM0 option0 ADD-1
# Configure the output port for the storage unit to generate addresses from 0 to 63
@S slot1:FSM1 address=0
@R slot1:FSM1 iter=64 step=1 delay=0
# Configure the input port for the storage unit to generate addresses from 0 to 63
@S slot1:FSM0 address=0
@R slot1:FSM0 iter=64 step=1 delay=0
# Activate the interconnection and computation unit
@A [slot0:FSM0, slot2:FSM0]
# Activate the output port of the storage unit
@A [slot1:FSM1]
# Wait for a cycle to allow the result data to be computed
@W delay=1
# Activate the input port of the storage unit
@A [slot1:FSM0]
# Wait until all operations are finished
@W delay=63

Listing 4: CIS assembly code
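To see how the single-issue sequencer turns Listing 4 into a timeline, the Python sketch below replays the program and records the cycle at which each FSM is activated. It assumes that every instruction occupies one issue cycle and that `@W delay=n` occupies `n` cycles; the tuple encoding of the program and the function name `replay` are hypothetical conveniences, not part of the CIS toolchain.

```python
# Listing 4 as (mnemonic, argument) pairs. Only the arguments that matter
# for timing are kept: the FSM list of @A and the cycle count of @W.
program = [
    ("@I", "slot0:FSM0"), ("@I", "slot0:FSM0"),
    ("@C", "slot2:FSM0"),
    ("@S", "slot1:FSM1"), ("@R", "slot1:FSM1"),
    ("@S", "slot1:FSM0"), ("@R", "slot1:FSM0"),
    ("@A", ["slot0:FSM0", "slot2:FSM0"]),
    ("@A", ["slot1:FSM1"]),
    ("@W", 1),
    ("@A", ["slot1:FSM0"]),
    ("@W", 63),
]

def replay(program):
    """Return ({fsm: activation cycle}, total cycles).
    Assumption: one issue per cycle, and @W n stalls the sequencer n cycles."""
    cycle, activated = 0, {}
    for mnemonic, arg in program:
        if mnemonic == "@A":
            for fsm in arg:
                activated[fsm] = cycle
        cycle += arg if mnemonic == "@W" else 1
    return activated, cycle

activated, total_cycles = replay(program)
```

Under these assumptions the storage write port (`slot1:FSM0`) is activated two cycles after the storage read port (`slot1:FSM1`), which leaves room for the read and the ADD-1 computation before the first write.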

3.5. Impact

CIS for streaming applications has both positive and negative impacts on hardware architecture and compiler design. The ISA is naturally designed for heterogeneous computing. Each resource slot can host any resource necessary for the specific application. Each resource only accepts a small subset of the ISA (typically 2-3 instruction types). The reduced number of instruction types drastically simplifies the hardware decoding logic. Placing the control FSM near the resource reduces the control wire length, thus reducing power consumption. The drawback of CIS on hardware design is that it sometimes requires more instructions to implement the same functionality since operations are decomposed into fine-grained atomic instructions. For example, the $\mathbf{R}$ operator that represents the same shared loop structure has to be repeated on each resource since their FSMs are not shared.

CIS also impacts its compiler design, mainly negatively. Since the hardware that supports CIS is massively parallel due to all those distributed FSMs on each resource, the compiler must schedule the instructions in a cycle-accurate manner to guarantee the correct instruction cooperation and produce the right computation results. It imposes high requirements on the compiler instruction scheduler. In section 4, we will tackle the instruction scheduling challenge by proposing a timing model and an instruction scheduling algorithm.

4. Timing Model and Scheduling Algorithm

4.1. Premises

We first define the following concepts: operation, event, transformation operator, anchor, and constraint.

An operation is a procedure that a single FSM can fully implement on a specific resource. For example, “generate a sequence of addresses from 0 to 63” would be an operation.

Each operation consists of one or more core events and some transformation operators. For example, the operation “create a path from slot-1 to slot-3, wait for 5 cycles, and switch to another path from slot-2 to slot-3, then repeat 3 times” consists of two events (“create a path from slot-1 to slot-3” and “create a path from slot-2 to slot-3”), a $\mathbf{T}$ operator (transition from event-1 to event-2 after 5 cycles), and an $\mathbf{R}$ operator (“repeat 3 times”). We name the events of the same operation numerically: $e0$ , $e1$ , etc. So, the timing behavior of an operation can be defined by its events and a stack of transformation operators. For example, the above operation can be expressed as $\mathbf{R}<3,t1>(\mathbf{T}<5>(e0,e1))$ . Note that the $\mathbf{R}$ operator has two attributes: “ $3$ ” and “ $t1$ ”. They indicate that it repeats 3 times and that between consecutive iterations there is a delay of $t1$ , which is an unknown variable. The $\mathbf{T}$ operator has one attribute, “ $5$ ”, meaning that it waits for 5 cycles when transitioning from $e0$ to $e1$ .
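The timing behavior encoded by such an expression can be made concrete by expanding it into a list of event instances. The Python sketch below does this for $\mathbf{R}<3,t1>(\mathbf{T}<5>(e0,e1))$ with the unknown $t1$ fixed to 2, an arbitrary value chosen only for illustration. The tuple encoding and the helper names `E`, `T`, `R`, and `expand` are hypothetical, and each event is assumed to last one cycle.

```python
def E(name):
    return ("E", name)

def T(delay, first, second):
    return ("T", delay, first, second)

def R(iterations, delay, body):
    return ("R", iterations, delay, body)

def expand(node, start=0):
    """Return (events, duration); events is a list of (cycle, event name).
    Assumption: every event occupies exactly one cycle."""
    kind = node[0]
    if kind == "E":
        return [(start, node[1])], 1
    if kind == "T":
        _, delay, first, second = node
        ev1, d1 = expand(first, start)
        ev2, d2 = expand(second, start + d1 + delay)
        return ev1 + ev2, d1 + delay + d2
    # R operator: replay the body, inserting `delay` between iterations.
    _, iterations, delay, body = node
    events, t = [], start
    for _ in range(iterations):
        ev, d = expand(body, t)
        events += ev
        t += d + delay
    return events, t - delay - start   # no delay after the last iteration

op = R(3, 2, T(5, E("e0"), E("e1")))   # t1 = 2 is an arbitrary choice
events, duration = expand(op)
```

With these values, one pass through the $\mathbf{T}$ block takes 7 cycles, and the three passes separated by two delays of 2 cycles take 25 cycles in total.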

An anchor is a specific instance of an event during the repetition process. For example, if we define operation $a$ as $\mathbf{R}<4,t1>(\mathbf{R}<8,t2>(e0))$ , the anchor $a.e0[2][3]$ points to the instance of event $e0$ located at outer iteration 2 and inner iteration 3.

Finally, a constraint is an equality or inequality among anchors. For example, the equation $a.e0[0]==b.e0[0]+1$ forces the anchor $a.e0[0]$ to be scheduled exactly 1 cycle after the anchor $b.e0[0]$ .

The input file for instruction scheduling includes two parts: operation definition and constraint definition. The input file for the toy example is shown in Listing 5. In the input file, we define four operations: $I$ , $C$ , $S\_RD$ , and $S\_WR$ . The $I$ and $C$ operations are simple single-event operations. The other two operations contain the $\mathbf{R}$ operator. Custom constraints are defined in the input file. The equality and inequality format is very straightforward.

# Operation definition
operation I e0 # interconnection
operation C e0 # computation
operation S_RD R<64, t1>(e0) # read from storage
operation S_WR R<64, t2>(e0) # write to storage
# Custom constraints
constraint I.e0 < S_RD.e0[0]
constraint C.e0 <= S_RD.e0[0]+1
constraint S_WR.e0[0] == S_RD.e0[0]+2
constraint S_RD.e0[1]-S_RD.e0[0] == \
    S_WR.e0[1]-S_WR.e0[0]

Listing 5: Input file for scheduler

4.2. Timing Model

Now, we define the timing model and describe the method for finding the timing expression of any anchor based on the operation expression. Determining the timing expression for any anchor is essential because the constraints usually specify the timing relation among operation anchors. For demonstration purposes, we will use an example operation defined in Listing 6:

operation a R<2,t1>(T<t2>(R<3,t3>(T<t4>(e0,e1)),e2))

Listing 6: Definition of operation a

The above operation $a$ can be expressed as a binary tree, as shown below. In the tree, we also give the duration and start time of each node. The duration and start time are computed by traversing the binary tree in a specific order. The algorithms that calculate each node’s duration and start time are shown in Algorithms 1 and 2.

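$\mathbf{R}<2,t1>$ : duration = 6*t4+4*t3+2*t2+t1+14, start time = 0
    $\mathbf{T}<t2>$ : duration = 3*t4+2*t3+t2+7, start time = 0
        $\mathbf{R}<3,t3>$ : duration = 3*t4+2*t3+6, start time = 0
            $\mathbf{T}<t4>$ : duration = t4+2, start time = 0
                $e0$ : duration = 1, start time = 0
                $e1$ : duration = 1, start time = t4+1
        $e2$ : duration = 1, start time = 3*t4+2*t3+t2+6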

Function Traversal_LRC( $tree$ , $node$ ):
    duration ← 0;
    if $tree[node].type$ is Event then
        duration ← 1;
    else if $tree[node].type$ is $\mathbf{T}$ operator then
        Traversal_LRC( $tree$ , $tree[node].left$ );
        Traversal_LRC( $tree$ , $tree[node].right$ );
        duration ← tree[node].left.duration + tree[node].right.duration + tree[node].delay;
    else if $tree[node].type$ is $\mathbf{R}$ operator then
        Traversal_LRC( $tree$ , $tree[node].left$ );
        duration ← tree[node].left.duration × tree[node].iter + tree[node].delay × (tree[node].iter − 1);
    end if
    tree[node].duration ← duration;
    return;

Function ComputeDuration( $tree$ ):
    Traversal_LRC( $tree$ , $tree.root$ );
    return;

Algorithm 1 Compute the duration using Left-Right-Center traversal order.

Function Traversal_CLR( $tree$ , $node$ ):
    if $tree[node].left$ exists then
        tree[node].left.start_time ← tree[node].start_time;
        Traversal_CLR( $tree$ , $tree[node].left$ );
    end if
    if $tree[node].right$ exists then
        tree[node].right.start_time ← tree[node].start_time + tree[node].left.duration + tree[node].delay;
        Traversal_CLR( $tree$ , $tree[node].right$ );
    end if
    return;

Function ComputeStartTime( $tree$ ):
    tree[tree.root].start_time ← 0;
    Traversal_CLR( $tree$ , $tree.root$ );
    return;

Algorithm 2 Compute the start time using Center-Left-Right traversal order.
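The two traversals can be checked against the example tree with a short Python sketch. This is an illustrative reimplementation rather than the authors' code: the `Node` class, the dictionary representation of linear expressions (symbol to coefficient, with the key `"1"` holding the constant term), and the helpers `add` and `scale` are all assumptions introduced for this example. The start time of the right child of a $\mathbf{T}$ node is taken to be the parent's start time plus the left child's duration plus the delay, which is what the start time of $e1$ in the example tree implies.

```python
class Node:
    """Tree node; kind is "E" (event), "T", or "R". delay is a linear expression."""
    def __init__(self, kind, delay=None, iters=1, left=None, right=None):
        self.kind, self.delay, self.iters = kind, delay, iters
        self.left, self.right = left, right
        self.duration = self.start = None

def add(a, b):
    """Add two linear expressions stored as {symbol: coefficient}."""
    out = dict(a)
    for sym, coef in b.items():
        out[sym] = out.get(sym, 0) + coef
    return out

def scale(a, k):
    """Multiply a linear expression by the integer k."""
    return {sym: coef * k for sym, coef in a.items()}

def compute_duration(node):
    """Left-Right-Center traversal: children first, then the node itself."""
    if node.kind == "E":
        node.duration = {"1": 1}
    elif node.kind == "T":
        compute_duration(node.left)
        compute_duration(node.right)
        node.duration = add(add(node.left.duration, node.right.duration), node.delay)
    else:  # "R": iter copies of the child plus (iter - 1) delays
        compute_duration(node.left)
        node.duration = add(scale(node.left.duration, node.iters),
                            scale(node.delay, node.iters - 1))

def compute_start(node, start):
    """Center-Left-Right traversal; requires durations to be known already."""
    node.start = start
    if node.left is not None:
        compute_start(node.left, start)
    if node.right is not None:
        compute_start(node.right, add(add(start, node.left.duration), node.delay))

# operation a R<2,t1>(T<t2>(R<3,t3>(T<t4>(e0,e1)),e2))
e0, e1, e2 = Node("E"), Node("E"), Node("E")
t_inner = Node("T", delay={"t4": 1}, left=e0, right=e1)
r_inner = Node("R", delay={"t3": 1}, iters=3, left=t_inner)
t_outer = Node("T", delay={"t2": 1}, left=r_inner, right=e2)
root = Node("R", delay={"t1": 1}, iters=2, left=t_outer)

compute_duration(root)
compute_start(root, {"1": 0})
```

After both passes, `root.duration` corresponds to 6*t4+4*t3+2*t2+t1+14 and `e2.start` to 3*t4+2*t3+t2+6, matching the values annotated in the tree.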

Once the duration and the start time of each node in the tree have been computed, the time of any anchor in an operation becomes known. We denote $D_{i}$ as the duration of the child node of the $i$ -th level of $\mathbf{R}$ operator and $S_{i}$ as the start time of the $i$ -th level of $\mathbf{R}$ operator. The delay of the $i$ -th level of $\mathbf{R}$ operator is $\tau_{i}$ . We also use $T_{a}$ to represent the start time of the operation $a$ and $T_{m}$ to represent the start time of the event $a.e_{m}$ in the operation $a$ . The time of the anchor $a.e_{m}[x_{0}][x_{1}]...[x_{n}]$ can be calculated by Equation 1. The timing model guarantees that the timing expression of any anchor can be found in polynomial time.

$$t_{anchor}=T_{a}+T_{m}+\sum_{i=0}^{n}(x_{i}\times(D_{i}+\tau_{i})) \qquad (1)$$
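Equation 1 is easy to evaluate once all symbols have concrete values. The Python sketch below is a direct transcription under that assumption; the function name `anchor_time` and its parameter names are hypothetical. The first two calls use the operations `S_RD` and `S_WR` of the toy example with start times 1 and 3 and zero delays, and the third uses $\mathbf{R}<4,t1>(\mathbf{R}<8,t2>(e0))$ with $t1=1$ and $t2=0$ chosen arbitrarily for illustration.

```python
def anchor_time(T_a, T_m, indices, child_durations, delays):
    """Equation 1: T_a + T_m + sum_i x_i * (D_i + tau_i).
    indices[i] is x_i, child_durations[i] is D_i, delays[i] is tau_i."""
    return T_a + T_m + sum(
        x * (D + tau) for x, D, tau in zip(indices, child_durations, delays)
    )

# S_RD = R<64,t1>(e0) starting at cycle 1 with t1 = 0: anchor S_RD.e0[1]
s_rd_e0_1 = anchor_time(T_a=1, T_m=0, indices=[1], child_durations=[1], delays=[0])

# S_WR = R<64,t2>(e0) starting at cycle 3 with t2 = 0: anchor S_WR.e0[1]
s_wr_e0_1 = anchor_time(T_a=3, T_m=0, indices=[1], child_durations=[1], delays=[0])

# a = R<4,t1>(R<8,t2>(e0)) with t1 = 1, t2 = 0 and T_a = 0: anchor a.e0[2][3].
# The child of the outer R is the inner R, whose duration is 8*1 + 7*0 = 8.
a_e0_2_3 = anchor_time(T_a=0, T_m=0, indices=[2, 3], child_durations=[8, 1], delays=[1, 0])
```

Because the sum has one term per $\mathbf{R}$ level, the cost of evaluating an anchor depends on the nesting depth and not on the iteration counts.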

4.3. Scheduling

We use a constraint programming (CP) solver to solve the instruction scheduling problem. We chose CP because it supports many types of constraints, and users have more freedom to express the timing relationship among anchors. In theory, any CP solver can be used. However, we select the CP solver OR-Tools (Perron and Furnon, 2024) because it supports state-of-the-art CP techniques like lazy clause generation (Stuckey, 2010). The central task for scheduling is formulating the problem as decision variable definitions and constraints that the CP solver can recognize. Thanks to the timing model discussed in the previous subsection, we can efficiently compute the timing of any anchor. The scheduling problem formulation becomes trivial by following the steps below:

(1) Declare a variable for each operation to represent its start time.
(2) Declare a variable for each unknown variable symbol used as delay of the $\mathbf{R}$ and $\mathbf{T}$ operators.
(3) For each anchor referenced by the constraints in the input file, declare a variable to represent the timing of that anchor.
(4) For each anchor referenced by the constraints in the input file, compute the anchor’s expression based on Equation 1. Then, post a constraint that makes the corresponding variable defined in the previous step equal to the expression.
(5) Post all the custom constraints in the input file.
(6) Declare the latency variable and force it to be the end time of the last finished operation.
(7) The objective is to minimize the latency.

Note that we do not create a variable for every possible event anchor. This is because the number of potential event anchors is directly related to the number of iterations of each $\mathbf{R}$ operator. Modeling all possible anchors would waste resources because most of them are only constrained by their predecessors in a regular pattern, which can be computed analytically by our timing model. We do not enforce any restriction on the type of custom constraints. However, due to the nature of the instruction scheduling problem, most custom constraints posted by the CP formulation are simple linear inequalities that are very easy to solve. Therefore, the scheduling algorithm can easily find the optimal solution for almost all programs in a reasonable time, see Section 5.

If we schedule the input file in Listing 5, we get the result shown in Table 2. Readers can verify that all constraints hold with the assignment in the table.

Table 2. Scheduling Result

variable	value	variable	value
I	0	I.e0	0
C	0	C.e0	0
S_RD	1	S_RD.e0[0]	1
S_RD.e0[1]	2	S_WR	3
S_WR.e0[0]	3	S_WR.e0[1]	4
t1	0	t2	0
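The formulation for the toy example is small enough to be explored exhaustively, which gives a convenient cross-check of the scheduling result. The Python sketch below replaces the CP solver with a brute-force search over small, arbitrarily chosen domains, so it illustrates the variables, constraints, and objective of the formulation but says nothing about how a real CP solver searches. Events are assumed to last one cycle, the equality constraint on `S_WR` is applied by substitution instead of a separate variable, and ties are broken by enumeration order.

```python
from itertools import product

ITER = 64  # iteration count of the R operators in S_RD and S_WR

def repeat_duration(delay):
    """Duration of R<64, delay>(e0): 64 one-cycle events plus 63 delays."""
    return ITER + (ITER - 1) * delay

best = None
# Decision variables: operation start times and the unknown delays t1, t2.
# The domain bounds (0..3 and 0..2) are assumptions that keep the search tiny.
for I, C, S_RD, t1, t2 in product(range(4), range(4), range(4), range(3), range(3)):
    S_WR = S_RD + 2                 # constraint S_WR.e0[0] == S_RD.e0[0]+2
    if not I < S_RD:                # constraint I.e0 < S_RD.e0[0]
        continue
    if not C <= S_RD + 1:           # constraint C.e0 <= S_RD.e0[0]+1
        continue
    # constraint S_RD.e0[1]-S_RD.e0[0] == S_WR.e0[1]-S_WR.e0[0];
    # by Equation 1 each side is (1 + delay) for the respective operation.
    if 1 + t1 != 1 + t2:
        continue
    # Latency is the end time of the last finished operation.
    latency = max(I + 1, C + 1,
                  S_RD + repeat_duration(t1),
                  S_WR + repeat_duration(t2))
    if best is None or latency < best[0]:
        best = (latency, {"I": I, "C": C, "S_RD": S_RD,
                          "S_WR": S_WR, "t1": t1, "t2": t2})

latency, assignment = best
```

Under these assumptions the search reaches a latency of 67 cycles with the same start times and delays as Table 2. Other assignments with a later start for `C` reach the same latency, so the optimum is not unique.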

4.4. Synchronization

Wait instructions (@W) and activation instructions (@A) must be inserted to produce a valid instruction list. The hardware model uses a single sequencer to issue the resource and transform instructions and execute the control instructions. It can only process one instruction per cycle. The activation instructions must be placed at the exact positions in the list that trigger the resources at their scheduled times. A wait instruction must be inserted whenever no other instruction is available to fill a timing gap.

We can break down the instruction synchronization process into the following steps:

(1)

Create an activation instruction and assign it to a cycle at which at least one operation is scheduled.
(2)

Starting from the operation that is activated last, place its resource and transform instructions in the empty time slot just before its activation time. If that slot is already occupied, skip it and use the slot before it. Repeat until all the associated instructions have been assigned to a time slot. A time slot index may become negative at this stage.
(3)

Repeat the previous step until all operations have been processed.
(4)

If the starting time slot has become negative, shift the whole instruction list so that the first assigned time slot is cycle 0.
(5)

Fill each run of consecutive empty time slots with a wait instruction.
(6)

If the last time slot does not reach the scheduled latency, add a wait instruction to force it to reach the scheduled latency.

The instruction list is created after instruction synchronization. Listing 4 shows the created instructions for our vector addition example.
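The six steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the input format, the instruction names (`cfg_rd`, `rep_rd`, and so on), the textual `@A`/`@W` encoding, and the choice to shift the scheduled latency together with the instruction list in step 4 are all hypothetical.

```python
def synchronize(ops, latency):
    """Build a one-instruction-per-cycle list with @A and @W inserted.

    ops:     list of (name, activation_cycle, [instruction, ...])
    latency: scheduled latency in cycles
    Returns a list of (cycle, instruction) pairs; "@W n" covers n idle cycles.
    """
    slots = {}
    # Step 1: one activation instruction per cycle in which an operation starts.
    for name, act, _ in ops:
        slots[act] = slots.get(act, "@A") + " " + name
    # Steps 2-3: process operations from the latest activation to the earliest,
    # placing each instruction in the nearest free slot before the activation.
    for name, act, instrs in sorted(ops, key=lambda op: op[1], reverse=True):
        cycle = act - 1
        for instr in reversed(instrs):  # keeps the original instruction order
            while cycle in slots:
                cycle -= 1
            slots[cycle] = instr
            cycle -= 1
    # Step 4: shift so that the first slot is cycle 0 (assumption: the
    # scheduled latency is shifted by the same amount).
    shift = -min(min(slots), 0)
    slots = {c + shift: instr for c, instr in slots.items()}
    end = latency + shift
    # Steps 5-6: fill gaps, and pad the tail up to the scheduled latency.
    program, cycle = [], 0
    for c in sorted(slots):
        if c > cycle:
            program.append((cycle, f"@W {c - cycle}"))
        program.append((c, slots[c]))
        cycle = c + 1
    if cycle < end:
        program.append((cycle, f"@W {end - cycle}"))
    return program


# Hypothetical example: two operations with two setup instructions each.
program = synchronize(
    [("rd", 1, ["cfg_rd", "rep_rd"]), ("wr", 3, ["cfg_wr", "rep_wr"])],
    latency=6,
)
```

In this example the setup instructions of `rd` end up in negative slots, so the whole list is shifted by two cycles and the distance between the two activations is preserved.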

5. Experiment

In this section, we present the experimental results. We design two sets of experiments. The first set tests the efficiency of the instruction scheduling process proposed in Section 4. The second set compares our architecture, which supports CIS, with the vanilla RISC-V architecture (Waterman et al., 2014). It aims to showcase the advantage of CIS in the context of streaming applications, focusing on program size and total execution latency.

To test the efficiency of the instruction scheduling algorithm, we use artificial applications that vary in complexity. We define three dimensions of complexity metrics:

(1)

$N$ : The number of operations.
(2)

$M$ : The number of timing transformation operators (the $\mathbf{T}$ and $\mathbf{R}$ operators) for each operation.
(3)

$C$ : The number of custom constraint statements.

By definition, the bigger the value of $N$ , $M$ , and $C$ , the more complex the application is. For each set of configurations, we randomly generate ten applications. We then schedule these applications and record their compilation time. We don’t record the scheduling quality because the CP formulation in the scheduling process requires the solver to find the optimum solution. The optimality is always guaranteed if the scheduling process finishes successfully.

From Fig. 3, we can see that the scheduler finds optimal solutions for all applications in a reasonably short time. We use three curve-fitting strategies to probe the scalability of the scheduling algorithm. In all three sub-graphs, the linear fit has the smallest mean squared error (MSE). This means that the compilation time increases linearly with the parameters $N$ , $M$ , and $C$ , indicating that the instruction scheduling algorithm scales linearly with the complexity of the problem, at least for problems within a typical range of complexity metrics. The scheduling algorithm scales this well because almost all constraints are linear equality or inequality constraints that the solver can evaluate and propagate easily. Those constraints are also conjunctive, which is favorable for the CP solver.
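The model comparison behind this kind of scalability analysis can be sketched as follows. The text does not name the three fitting strategies, so the candidate models below (linear, quadratic, and n log n) are assumptions, and the measurements are synthetic placeholders rather than data from our experiments.

```python
import math


def fit_mse(xs, ys, feature):
    """Least-squares fit of y = a * feature(x) + b; returns (a, b, mse)."""
    fs = [feature(x) for x in xs]
    n = len(xs)
    mean_f, mean_y = sum(fs) / n, sum(ys) / n
    var_f = sum((f - mean_f) ** 2 for f in fs)
    a = sum((f - mean_f) * (y - mean_y) for f, y in zip(fs, ys)) / var_f
    b = mean_y - a * mean_f
    mse = sum((a * f + b - y) ** 2 for f, y in zip(fs, ys)) / n
    return a, b, mse


# Candidate growth models (assumed; not taken from the paper).
models = {
    "linear": lambda x: x,
    "quadratic": lambda x: x * x,
    "n log n": lambda x: x * math.log(x),
}

# Synthetic placeholder measurements (NOT experimental data): time = 0.5 * N + 2.
sizes = [10, 20, 40, 80, 160]
times = [0.5 * n + 2 for n in sizes]

mses = {name: fit_mse(sizes, times, f)[2] for name, f in models.items()}
best = min(mses, key=mses.get)  # model with the smallest MSE
```

With real measurements, `sizes` and `times` would be replaced by the recorded values of $N$ (or $M$ , $C$ ) and the corresponding compilation times.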

To test the benefits of CIS compared to a conventional processor ISA, we compare the assembly code of CIS and the RISC-V ISA for a set of selected streaming applications. We chose the RISC-V ISA as the baseline because it is a very general processor ISA whose characteristics are shared by many hardware architectures. The application set for the comparison includes dot product (DOT), matrix-vector multiplication (MVM), matrix-matrix multiplication (MMM), 1D convolution (1DCONV), and 2D convolution (2DCONV). For each application, two instances that vary in size are implemented. We compare the program size (instruction count) and total execution latency (total number of cycles) for the mapping on CIS and the RISC-V ISA. The program size comparison is shown in Table 3, and the execution latency comparison is shown in Table 4.

Table 3. Program size comparison

Application	This work	RISC-V
DOT-32	52	11
DOT-512	66	11
MVM-32x32	73	19
MVM-64x64	84	19
MMM-32x32	74	28
MMM-64x64	85	28
1DCONV-32/3	59	24
1DCONV-512/3	102	24
2DCONV-32x32/3x3	95	49
2DCONV-64x64/3x3	115	49

Table 4. Execution latency comparison in cycles. The overhead, compared to the ideal case, is given in parentheses.

Application	This work	RISC-V
DOT-32	$5.60\times 10^{1}$ (75.0%)	$2.59\times 10^{2}$ (709.4%)
DOT-512	$5.42\times 10^{2}$ (5.9%)	$4.10\times 10^{3}$ (700.8%)
MVM-32x32	$1.06\times 10^{3}$ (3.9%)	$8.42\times 10^{3}$ (725.5%)
MVM-64x64	$4.14\times 10^{3}$ (1.0%)	$3.32\times 10^{4}$ (709.8%)
MMM-32x32	$3.38\times 10^{4}$ (3.0%)	$2.71\times 10^{5}$ (726.2%)
MMM-64x64	$2.65\times 10^{5}$ (1.1%)	$2.13\times 10^{6}$ (713.0%)
1DCONV-32/3	$1.17\times 10^{2}$ (30.0%)	$9.64\times 10^{2}$ (971.1%)
1DCONV-512/3	$1.59\times 10^{3}$ (3.9%)	$1.86\times 10^{4}$ (1115.7%)
2DCONV-32x32/3x3	$8.16\times 10^{3}$ (0.7%)	$7.52\times 10^{4}$ (828.4%)
2DCONV-64x64/3x3	$3.47\times 10^{4}$ (0.3%)	$3.37\times 10^{5}$ (874.0%)

In terms of program size, the RISC-V ISA has advantages. This is mainly because RISC-V instructions don’t distribute loop structures to different resources. Therefore, the instruction is not duplicated. Conversely, the CIS implements a loop structure on each resource to make it execute autonomously. The second reason for the increase in program size is that CIS uses some instructions to manage the software-controlled L1 cache, while RISC-V depends on the hardware cache system. However, the program size of the CIS program is still small (less than 128 instructions) and can be handled by the hardware architecture.

The execution latency comparison in Table 4 clearly shows the advantage of CIS. Comparing the cycle counts, the RISC-V programs are roughly 5 to 12 times slower than their CIS counterparts. Note that the experiments favor RISC-V by assuming that it never incurs a cache miss, while our hardware template contains a software-controlled L1 cache that the CIS instructions have to manage. We further analyze the control overhead by comparing both solutions to the ideal case. The ideal case is defined as a machine with a single ALU that is never idle, with no cycles wasted on control and memory operations. For example, the DOT-32 application contains 32 multiplication operations, so the latency for the ideal case is 32 cycles. Comparing our work and RISC-V with the ideal case shows how optimized each ISA is, in the spirit of the intrinsic computational efficiency introduced in (Claasen, 1999). The overhead numbers in Table 4 show that CIS is well optimized: the overhead for large applications can be lower than 1%, and even small applications, which are expected to be inefficient, have less than 100% overhead. In contrast, applications mapped on the RISC-V ISA have overheads between roughly 700% and 1100%.
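As a worked example of the overhead figures, the short sketch below recomputes the DOT rows of Table 4. The function names are hypothetical, and the ideal latency of 512 cycles for DOT-512 is assumed by analogy with the 32 cycles stated for DOT-32.

```python
def overhead_percent(latency_cycles, ideal_cycles):
    """Extra cycles relative to the ideal single-ALU machine, in percent."""
    return 100.0 * (latency_cycles - ideal_cycles) / ideal_cycles


def alu_utilization(latency_cycles, ideal_cycles):
    """Fraction of cycles in which the single ALU does useful work."""
    return ideal_cycles / latency_cycles


# (ideal, CIS, RISC-V) cycle counts; latencies are taken from Table 4.
cases = {"DOT-32": (32, 56, 259), "DOT-512": (512, 542, 4100)}
for name, (ideal, cis, riscv) in cases.items():
    print(
        name,
        f"CIS {overhead_percent(cis, ideal):.1f}%",
        f"RISC-V {overhead_percent(riscv, ideal):.1f}%",
        f"speedup {riscv / cis:.1f}x",
    )
# DOT-32 CIS 75.0% RISC-V 709.4% speedup 4.6x
# DOT-512 CIS 5.9% RISC-V 700.8% speedup 7.6x
```

The same numbers can be read as ALU utilization: for DOT-32, `alu_utilization(56, 32)` is about 0.57 for CIS, while `alu_utilization(259, 32)` is about 0.12 for RISC-V.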

The overhead analysis demonstrates that the CIS and its underlying hardware architecture improve the ALU utilization to a near-optimal level. In contrast, a computation-centric ISA such as the RISC-V ISA has a much lower ALU utilization. We emphasize that increasing the degree of parallelism by adding more computation resources (ALUs) will not increase the ALU utilization. On the contrary, it will lower the utilization, since keeping more ALUs busy all the time is much more challenging. Therefore, architectures like VLIW, SIMD, and CGRA with a higher degree of parallelism will not magically eliminate the inefficiency inherent in a computation-centric ISA.

We also want to point out that even though our comparison focuses on the ALU utilization, the conclusion also applies to applications that are traditionally considered “memory-bound”. Those applications are memory-bound only because the conventional architecture cannot provide enough memory bandwidth, leading to ALU starvation. A memory-bound application can become computation-bound once the memory bottleneck is removed. Our hardware architecture template could improve on this front as well. It can be extended to a tiled architecture with multiple I/O ports. SRAM blocks can also be treated as regular resources embedded in the tiled architecture and thus can be used as scratchpad memory wherever needed.

To conclude, we have compared our CIS with the RISC-V ISA. The CIS vastly outperforms the RISC-V ISA in terms of execution latency, but the CIS has a bigger program size. We also tested the performance and scalability of the instruction scheduling algorithm that can work on CIS. The results are very positive since all test cases finish reasonably quickly, and the compilation time scales linearly.

6. Conclusion and Future Works

This paper presents a composable instruction set (CIS) for streaming applications that is much more efficient than a conventional computation-centric ISA. We also propose an efficient instruction scheduling algorithm that works with CIS. Future research will quantitatively investigate how CIS improves the hardware design and compare our hardware platform against other major hardware platforms in terms of power consumption. The instruction scheduling algorithm should also be integrated into a complete compiler for CIS.

Acknowledgements.

This project has received funding from the ECSEL Joint Undertaking (JU) under grant agreement no. 101007321. The JU received support from the European Union’s Horizon 2020 research and innovation programme and France, Belgium, Czech Republic, Germany, Italy, Sweden, Switzerland, and Turkey.

References

Britton (2002) Robert Britton. 2002. MIPS assembly language programming.
Catthoor et al. (2010) Francky Catthoor, Praveen Raghavan, Andy Lambrechts, Murali Jayapala, Angeliki Kritikakou, and Javed Absar. 2010. Ultra-low energy domain-specific instruction-set processors. Springer Science & Business Media.
Chin and Anderson (2018) S Alexander Chin and Jason H Anderson. 2018. An architecture-agnostic integer linear programming approach to CGRA mapping. In Proceedings of the 55th Annual Design Automation Conference on - DAC ’18. ACM Press, New York, New York, USA, 1–6. https://doi.org/10.1145/3195970.3195986
Claasen (1999) Theo ACM Claasen. 1999. High speed: not the only way to exploit the intrinsic computational power of silicon. In 1999 IEEE International Solid-State Circuits Conference. Digest of Technical Papers. ISSCC. First Edition (Cat. No. 99CH36278). IEEE, 22–25.
Dandamudi (2013) Sivarama P Dandamudi. 2013. Introduction to assembly language programming: from 8086 to Pentium processors. Springer Science & Business Media.
Franchetti et al. (2005) Franz Franchetti, Stefan Kral, Juergen Lorenz, and Christoph W Ueberhuber. 2005. Efficient utilization of SIMD extensions. Proc. IEEE 93, 2 (2005), 409–425.
Graham (1966) R. L. Graham. 1966. Bounds for Certain Multiprocessing Anomalies. Bell System Technical Journal 45, 9 (11 1966), 1563–1581. https://doi.org/10.1002/j.1538-7305.1966.tb01709.x
Knaggs and Welsh (2004) Peter Knaggs and Stephen Welsh. 2004. ARM: Assembly Language Programming. Bournemouth University, School of Design, Engineering, and Computing.
Micheli (1994) Giovanni De Micheli. 1994. Synthesis and optimization of digital circuits. McGraw-Hill Higher Education.
Nowatzki et al. (2013) Tony Nowatzki, Michael Sartin-Tarm, Lorenzo De Carli, Karthikeyan Sankaralingam, Cristian Estan, and Behnam Robatmili. 2013. A general constraint-centric scheduling framework for spatial architectures. In ACM SIGPLAN Notices, Vol. 48. 495–506. https://doi.org/10.1145/2499370.2462163
Parashar et al. (2013) Angshuman Parashar, Michael Pellauer, Michael Adler, Bushra Ahsan, Neal Crago, Daniel Lustig, Vladimir Pavlov, Antonia Zhai, Mohit Gambhir, Aamer Jaleel, et al. 2013. Triggered instructions: A control paradigm for spatially-programmed architectures. ACM SIGARCH Computer Architecture News 41, 3 (2013), 142–153.
Patterson and Ditzel (1980) David A. Patterson and David R. Ditzel. 1980. The case for the reduced instruction set computer. SIGARCH Comput. Archit. News 8, 6 (oct 1980), 25–33. https://doi.org/10.1145/641914.641917
Paulin and Knight (1989) Pierre G Paulin and John P Knight. 1989. Force-directed scheduling for the behavioral synthesis of ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 8, 6 (1989), 661–679.
Perron and Furnon (2024) Laurent Perron and Vincent Furnon. 2024. Or-tools, 2024. URL https://developers.google.com/optimization (2024).
Qadeer et al. (2013) Wajahat Qadeer, Rehan Hameed, Ofer Shacham, Preethi Venkatesan, Christos Kozyrakis, and Mark A Horowitz. 2013. Convolution engine: balancing efficiency & flexibility in specialized computing. In Proceedings of the 40th Annual International Symposium on Computer Architecture. 24–35.
Rauscher (1980) Rauscher. 1980. Microprogramming: A tutorial and survey of recent developments. IEEE transactions on Computers 100, 1 (1980), 2–20.
Stuckey (2010) Peter J Stuckey. 2010. Lazy clause generation: Combining the power of SAT and CP (and MIP?) solving. In International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming. Springer, 5–9.
Waterman et al. (2014) Andrew Waterman, Yunsup Lee, David Patterson, and Krste Asanović. 2014. The RISC-V instruction set manual. Volume I: User-Level ISA, version 2 (2014), 1–79.
Yang et al. (2021) Yu Yang, Ahmed Hemani, and Kolin Paul. 2021. Scheduling persistent and fully cooperative instructions. In 2021 24th Euromicro Conference on Digital System Design (DSD). IEEE, 229–237.
Yang et al. (2022) Yu Yang, Dimitrios Stathis, and Ahmed Hemani. 2022. Reducing the configuration overhead of the distributed two-level control system. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 104–107.