In this section, we optimize the micro-architecture and the dataflow program execution model with the aim of improving the resource utilization of the dataflow architecture for multi-batch processing. First, at the inter-PE level, we design a configurable interconnect architecture that can work in multiple modes. Second, at the inner-PE level, we design a fully decoupled architecture that aims to (1) improve the utilization of the computational components by overlapping memory-access and data-transfer latency with computation as much as possible, and (2) increase the throughput of the chip through a dynamic task scheduling mechanism. Finally, we design a task-based execution model and a mapping method for our dataflow architecture.
4.2 Inter-PE Design
PEs are designed to adapt to the data sizes of different batches. First, the basic idea is to combine multiple execution engines into a cluster that performs the same computational task and processes multiple batches of data synchronously. As shown in Figure 4, the execution engines labeled ❶ and ❷ are combined into one cluster, and the execution engines labeled ❸ and ❹ are combined into another. In this way, a PE consists of two clusters, each of which can process two batches of data in parallel. In 4-batch mode, all four execution engines are combined into a single cluster that processes four batches of data in parallel, while in 1-batch mode each execution engine acts as its own cluster. Second, the mode engine is responsible for configuration generation and distribution. On the one hand, it generates configuration information for each \(\mu\)-router. The structure of the \(\mu\)-router is shown on the right side of Figure 4. Each \(\mu\)-router consists of a set of multiplexers and routing units. The structure inside each routing box is a traditional router that parses and forwards packets in four directions (North, East, South, West). The input and output ports in the X and Y directions have dedicated control signals (S1, S2, S3, S4, S5, S6, S7, S8) that control the connection between the routing units and the data transmission networks. On the other hand, the mode engine distributes command and control information (activation signals, ack signals, etc.) to each execution engine (datapath in red). Finally, the \(\mu\)-router structure dynamically changes the connections of the data links according to the batch configuration, thus ensuring efficient and synchronized transmission of multi-batch data.
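To make the clustering concrete, the sketch below shows how a batch-mode setting could be translated into cluster assignments for the four execution engines of the example PE. This is a minimal illustration under our own assumptions; the function name and grouping rule are not the architecture's actual configuration format.

```c
#include <stdio.h>

#define NUM_ENGINES 4  /* execution engines per PE in the running example */

/* Hypothetical helper: given the batch mode (1, 2, or 4), return the cluster
 * index of each execution engine. Engines in the same cluster execute the same
 * DFG node on different batches of data synchronously. */
static void assign_clusters(int batch_mode, int cluster_of[NUM_ENGINES]) {
    for (int e = 0; e < NUM_ENGINES; e++)
        cluster_of[e] = e / batch_mode;  /* 1-batch: 4 clusters, 2-batch: 2, 4-batch: 1 */
}

int main(void) {
    int cluster_of[NUM_ENGINES];
    int modes[] = {1, 2, 4};
    for (int m = 0; m < 3; m++) {
        assign_clusters(modes[m], cluster_of);
        printf("%d-batch mode: clusters =", modes[m]);
        for (int e = 0; e < NUM_ENGINES; e++) printf(" %d", cluster_of[e]);
        printf("\n");
    }
    return 0;
}
```

Under this grouping rule, 2-batch mode places engines ❶ and ❷ in one cluster and ❸ and ❹ in the other, matching the arrangement in Figure 4.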
Execution Engine. Each PE contains several execution engines. To facilitate understanding, we take four execution engines as an example in Figure 3; note that the number of execution engines in a PE is scalable. An execution engine consists of a function unit, a local buffer, and a \(\mu\)-router. The function unit performs specific operations, including LD/ST, calculation, and data transfer. To support diverse kernels, the calculation datapath supports different data types, including integer, fixed-point, floating-point, and complex-valued data. Each execution engine has a dedicated local buffer and is built with a \(\mu\)-router. The local buffer stores configurations (instructions) and data at runtime. The \(\mu\)-router is connected to the mode engine and is also embedded into a circuit-switched mesh data network. When these \(\mu\)-routers receive a mode configuration from the mode engine, they are statically configured to route to each other, forming the link paths between the execution engines. Execution engines time-multiplex these links to communicate. We discuss the execution engines in more detail in Section 4.3.
Network-on-Chip. The interconnection plays a crucial role in the multi-mode PE. It ensures that multiple pieces of data reach the execution engines of the same cluster simultaneously. The structure of the interconnection within a PE is shown in Figure 4. There are two main interconnects: a network for transferring configurations (red paths in Figure 4) and a dedicated network for data (yellow and green paths in Figure 4). The configuration network transports the configurations to each \(\mu\)-router and the instructions to each execution engine. The data network consists of several data paths to accommodate the multi-batch modes. The number of data paths in the vertical and horizontal directions equals the number of execution engines in that direction. In our example, the number of data paths is two, which is determined by the number of execution engines in a PE. The \(\mu\)-router is connected to the data network via crossbar switches and establishes different virtual circuit links under different configurations before the next configuration period.
Mode Engine. Each PE has a dedicated mode engine that dispatches control signals and instructions. In principle, the mode engine reorganizes the execution engines into different clusters to support the multi-batch modes. As shown in Figure 3, the mode engine consists of a hierarchical controller. In our example, there are two L1-controllers and one L2-controller, connected in a tree topology. Each L1-controller is connected to two execution engines through their \(\mu\)-router interfaces, and the L2-controller is also connected to the global configuration buffer. The mode engine is mainly responsible for the following functions during the configuration period. First, it parses the PE’s multi-batch configuration, then generates configurations for each \(\mu\)-router and delivers them. After the top-level controller (L2-controller) receives the task configuration information from the Global Configuration Buffer, it extracts the batch configuration field (‘B_conf’ in Figure 7). The configurations for the four directions of each \(\mu\)-router are then generated from this batch configuration according to the rules of the selected mode. Second, instructions are loaded through the mode engine and distributed to each execution engine. Since the execution engines in a PE may belong to different clusters, the controller uses a hierarchical tree-based structure, which makes control simple and easy to implement. Note that the controller becomes more complex as the number of execution engines in a PE increases: the hierarchical controller scales as \(\log_2\)(number of execution engines).
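As a rough illustration of this scaling rule, the following sketch derives the number of controller levels from the number of execution engines; it is only a restatement of the \(\log_2\) relation, not the mode engine's actual implementation.

```c
#include <stdio.h>

/* Number of controller levels in the hierarchical (tree-shaped) mode engine:
 * log2(number of execution engines). With 4 engines this gives 2 levels
 * (two L1-controllers plus one L2-controller at the root). */
static int controller_levels(int num_engines) {
    int levels = 0;
    while (num_engines > 1) {
        num_engines >>= 1;
        levels++;
    }
    return levels;
}

int main(void) {
    printf("4 engines  -> %d controller levels\n", controller_levels(4));   /* 2: L1, L2 */
    printf("16 engines -> %d controller levels\n", controller_levels(16));  /* 4: L1..L4 */
    return 0;
}
```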
Multiple Modes. As shown in Figure 5(a), each PE supports multiple modes: single-batch mode (Figure 5(b)), 2-batch mode (Figure 5(c)), and 4-batch mode (Figure 5(d)). Its function is controlled by an 8-bit configuration word (S1, S2, S3, S4, S5, S6, S7, S8), which is detailed in Figure 5. If the PE contains N execution engines, then the PE can support \(\log_2 N + 1\) modes, where N is a power of two.
Single-Batch Mode. This mode is designed for algorithms with small-scale source data and little data parallelism. The PE array can be configured in a pure MIMD-like mode, in other words, as a many-core architecture with a typical 2D topology. In this mode, each execution engine works as an independent core: it has its own instructions and data and processes a dataflow graph (DFG) node. Execution engines in both the horizontal and vertical directions need to be connected to the same data path. Therefore, the rule for the configuration word is: “S1 == S5 && S3 == S7 && S2 == S4 && S6 == S8”. Figure 5(b) shows the network connection under the “0000-0000 (S1 to S8)” configuration.
Two-Batch Mode. Two execution engines that are connected to the same L1-controller are combined into a cluster. As shown in Figure 5(c), execution engine ❶ and execution engine ❷ serve as one cluster, while execution engine ❸ and execution engine ❹ serve as another. Since the two execution engines along the Y-axis are in the same cluster, the \(\mu\)-router of execution engine ❷ should be connected to a data link different from that of engine ❶, so that the two execution engines can receive data from the Y-axis in the same cycle. Similarly, the \(\mu\)-router of execution engine ❹ should be connected to a data path different from that of execution engine ❸. In the X-axis direction, they are connected to the same data path. Since the horizontally oriented execution engines need to interact, they must share a data path, whereas the vertically oriented execution engines act as two parallel processing units and therefore need different data paths. Thus, the configuration rule of the PE in two-batch mode is: “S1 == S5 && S3 == S7 && S2 == \(\sim\)S4 && S6 == \(\sim\)S8”.
Four-Batch Mode. All execution engines in a PE form one cluster, as shown in Figure 5(d). These execution engines are controlled by the L2-controller. In both the X-axis and Y-axis, the \(\mu\)-routers of these execution engines should be connected to different data paths. In the X-axis direction, \(\mu\)-router ❶ and \(\mu\)-router ❷ should connect to data paths different from those used by \(\mu\)-router ❸ and \(\mu\)-router ❹, respectively. Similarly, in the vertical direction, \(\mu\)-router ❶ and \(\mu\)-router ❸ should be connected to data paths different from those of \(\mu\)-router ❷ and \(\mu\)-router ❹, respectively. Therefore, the rule for the configuration word is: “S1 == \(\sim\)S5 && S3 == \(\sim\)S7 && S2 == \(\sim\)S4 && S6 == \(\sim\)S8”. Figure 5(d) shows the data paths under the “0001-1011 (S1 to S8)” configuration.
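The three mode rules can be summarized as predicates over the 8-bit configuration word. The sketch below checks a candidate (S1..S8) setting against each rule; the array-based encoding is our own illustrative choice, not the hardware format.

```c
#include <stdbool.h>
#include <stdio.h>

/* s[0..7] correspond to S1..S8 (one bit each). */
static bool single_batch_ok(const int s[8]) {          /* S1==S5, S3==S7, S2==S4, S6==S8 */
    return s[0] == s[4] && s[2] == s[6] && s[1] == s[3] && s[5] == s[7];
}
static bool two_batch_ok(const int s[8]) {              /* S1==S5, S3==S7, S2==~S4, S6==~S8 */
    return s[0] == s[4] && s[2] == s[6] && s[1] != s[3] && s[5] != s[7];
}
static bool four_batch_ok(const int s[8]) {              /* S1==~S5, S3==~S7, S2==~S4, S6==~S8 */
    return s[0] != s[4] && s[2] != s[6] && s[1] != s[3] && s[5] != s[7];
}

int main(void) {
    int single[8] = {0,0,0,0, 0,0,0,0};   /* "0000-0000" from Figure 5(b) */
    int four[8]   = {0,0,0,1, 1,0,1,1};   /* "0001-1011" from Figure 5(d) */
    printf("single-batch rule on 0000-0000: %d\n", single_batch_ok(single));  /* 1 */
    printf("four-batch rule on 0001-1011:  %d\n", four_batch_ok(four));       /* 1 */
    return 0;
}
```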
The two-batch and four-batch modes are designed for scenarios with high data parallelism. The execution engines are divided into multiple clusters under the control of the mode engine. The instructions are loaded and distributed to the corresponding cluster by the mode engine. Execution engines in the same cluster perform the same operations on multiple data items synchronously. Limited by the number of execution engines, the PE can work in three different modes in our example. To show the structure more clearly, the domain division for different configurations is marked by different colors in Figure 5. Note that this design principle is scalable. As the number of execution engines in a PE increases (preferably to a power of two), the number of available modes also increases. For example, when each PE contains 16 execution engines, the mode engine becomes more complex: there are an L3-controller and an L4-controller, and the PE additionally supports an eight-batch mode and a 16-batch mode.
Memory Access. The global buffers are built with multiple SRAM banks matching the scale of the data. The address decoding logic around the scratchpad can be configured to operate in several banking modes to support various access patterns. Physical banks are cascaded and grouped into logical banks according to the configured width. In addition, the global buffers are split into two slices that work in a ping-pong fashion to hide transfer time. To support the different modes, the DMA can transfer and reshape variable-length multi-batch data with scatter and gather operations, exchanging data between the on-chip buffers and off-chip memory.
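A minimal host-side sketch of this ping-pong discipline is shown below. The DMA and compute calls (dma_load, dma_wait, compute_on) are hypothetical stand-ins, since the actual interface is not specified here; while the PE array computes on one slice of the global buffer, the DMA fills the other.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the DMA and PE-array interfaces; they only print
 * what would happen. */
static void dma_load(int slice, int tile) { printf("DMA: load tile %d into slice %d\n", tile, slice); }
static void dma_wait(int slice)           { printf("DMA: slice %d ready\n", slice); }
static void compute_on(int slice)         { printf("PE array: compute on slice %d\n", slice); }

/* Ping-pong discipline: while the PE array computes on one slice of the
 * global buffer, the DMA fills the other slice, hiding transfer time. */
static void process_tiles(int num_tiles) {
    int fill = 0, use = 1;
    dma_load(fill, 0);                         /* prefetch the first tile */
    for (int t = 0; t < num_tiles; t++) {
        int tmp = fill; fill = use; use = tmp; /* swap roles each iteration */
        dma_wait(use);
        if (t + 1 < num_tiles)
            dma_load(fill, t + 1);             /* fetch the next tile in the background */
        compute_on(use);
    }
}

int main(void) { process_tiles(3); return 0; }
```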
4.3 Inner-PE Design
We create a decoupled execution model that defines a novel scheme to schedule and trigger DFG nodes to exploit instruction block level parallelism. The code of each DFG node consists of up to four consecutive stages: Load stage, Calculating stage, Flow stage, and Store stage, which we describe below:
– Ld (Load) Stage. This stage loads data from the memory hierarchy to the in-PE local memory.
– Cal (Calculating) Stage. This stage completes calculations. A node can enter the Cal stage only when the following two conditions are met: first, its Ld stage (if it exists) has already finished; second, it has received all the necessary data from its predecessor nodes.
– Flow Stage. This stage transfers data from the current node to its successors.
– ST (Store) Stage. This stage transfers data from the in-PE operand memory to the memory hierarchy.
Similarly, the instructions in a DFG node are rearranged according to their types and divided into four blocks. The block is the basic schedule and trigger unit. Instruction-block-level dataflow is a middle ground between instruction-level dataflow and thread-level dataflow, and can be seen as a further development of the latter. In the thread-level dataflow model, each dataflow graph node is a thread and serves as the basic unit for launching and scheduling. Instruction-block-level dataflow decomposes each node of thread-level dataflow into four stages; each stage consists of a segment of instructions and serves as the basic unit for launching and scheduling. Unlike traditional out-of-order execution, the decoupled execution model exploits more instruction-block-level parallelism without complex control logic such as a reorder buffer.
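The sketch below restates the four-stage decomposition and the Cal trigger condition in code form; the data structure is an illustration of the model under our own assumptions, not the hardware's actual encoding.

```c
#include <stdbool.h>
#include <stdio.h>

/* The four instruction blocks of a DFG node; each block is the basic unit
 * that is launched and scheduled. */
typedef enum { STAGE_LD, STAGE_CAL, STAGE_FLOW, STAGE_ST } stage_t;

typedef struct {
    bool has_ld;        /* some nodes have no Load stage */
    bool ld_done;       /* the Ld block has finished */
    int  pending_preds; /* predecessor activations not yet received */
} node_state_t;

/* A node may enter the Cal stage only when its Ld stage (if any) has finished
 * and all data from its predecessor nodes has arrived. */
static bool can_enter_cal(const node_state_t *n) {
    return (!n->has_ld || n->ld_done) && n->pending_preds == 0;
}

int main(void) {
    node_state_t n = { .has_ld = true, .ld_done = true, .pending_preds = 0 };
    printf("ready for Cal: %d\n", can_enter_cal(&n));  /* 1 */
    return 0;
}
```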
Figure
6 illustrates the top-level diagram of our dataflow architecture, which comprises a set of identical decoupled processing elements (dPE). To support the decoupled execution model, four separate stage components are designed within each PE, corresponding to the four node states. This allows a processing element to be shared by up to four different DFG nodes simultaneously, overlapping memory-access and data-transfer latency as much as possible. By decoupling the datapaths of the different stages and equipping each PE with a dedicated scheduler, the DFG nodes of different iterations can be pipelined more efficiently. The controller maintains the node states and schedules their execution. To ensure correct execution, separate operand RAM space is provided for different iterations, and a shared operand RAM space stores data with inter-iteration dependencies, which are marked by special registers in the instructions.
The dPE consists of a calculation pipeline, a load unit, a store unit, a flow unit, an instruction RAM module, an operand RAM module, a controller and a router (in the middle of Figure
6). These four separate functional components (CAL, LOAD, FLOW, STORE) and the controller are designed for the decoupled execution model, which distinguishes this structure from previous ones. The calculation pipeline is a datapath for arithmetic and logical operations: it fetches instructions from the instruction RAM module and performs computations on the source data. The load/store units transfer data from/to the on-chip data memory to/from the operand RAM module, respectively. The flow unit dispatches data to downstream dPEs. Each execution unit has a corresponding DFG node state, as described in Figure 6, and this decoupling is the key to improving utilization.
The controller plays a non-negligible role in state transitions and in triggering DFG nodes. It consists of a kernel table, a status table, a free list, a dedicated acknowledgment buffer (Ack port), and a scheduler module. The kernel table stores the configurations of the nodes mapped to the dPE, which contain the task ID (TID), node ID (NID), instance number (instance), instruction address list (inst_addr), and data addresses (LD_base & ST_base). The TID and NID identify the task and the DFG node, because the PE array can be mapped to multiple tasks at the same time and a PE can be mapped to multiple nodes. The instance is a value related to pipeline parallelism, indicating how many times the DFG node needs to be executed. Taking BFS as an example: a large graph may need to be decomposed into many subgraphs, say 100, in which case each DFG node needs to be executed 100 times. The inst_addr records the location of the four-stage instructions of the DFG node in the instruction RAM. LD_base & ST_base are the base addresses for the source and destination, which work with the offset in the status table to access the data in the operand RAM.
The status table maintains the runtime information of different instances. It uses the instance_counter to track the different instances of DFG nodes. Although different instances share the same instructions, they handle different data; therefore, the offsets (offset) of different instances differ. In addition, the status table records the activations (Up_counter) and status information. The value of Up_counter decreases as activation data arrives. When this value reaches 0, all the upstream data of the current node has arrived and the node can be triggered by the scheduler.
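To make this bookkeeping concrete, the struct sketch below mirrors the fields named above (TID, NID, instance, inst_addr, LD_base/ST_base; instance_counter, offset, Up_counter). The field widths and layout are illustrative assumptions, not the actual table format.

```c
#include <stdbool.h>
#include <stdint.h>

/* One kernel-table entry: static configuration of a DFG node mapped to this dPE. */
typedef struct {
    uint16_t tid;          /* task ID */
    uint16_t nid;          /* DFG node ID */
    uint32_t instance;     /* how many times this node must execute */
    uint32_t inst_addr[4]; /* instruction RAM addresses of the Ld/Cal/Flow/St blocks */
    uint32_t ld_base;      /* base address of source data in operand RAM */
    uint32_t st_base;      /* base address of destination data in operand RAM */
} kernel_entry_t;

/* One status-table entry: runtime state of a particular instance of a node. */
typedef struct {
    bool     valid;
    uint16_t kernel_idx;       /* which kernel-table entry it belongs to */
    uint32_t instance_counter; /* remaining instances; also used as the priority key */
    uint32_t offset;           /* per-instance offset added to ld_base/st_base */
    uint32_t up_counter;       /* outstanding activations; ready when it reaches 0 */
} status_entry_t;

/* A node instance can be triggered once all upstream data has arrived. */
static bool is_ready(const status_entry_t *s) {
    return s->valid && s->up_counter == 0;
}

int main(void) {
    status_entry_t s = { .valid = true, .up_counter = 0 };
    return is_ready(&s) ? 0 : 1;
}
```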
The scheduler uses the instance_counter to evaluate priority and schedules nodes accordingly. We also tried other scheduling policies, such as round-robin scheduling and finer-grain multithreading, but found that they did not work as well. This makes sense: the amount of application work to complete is nearly constant regardless of the scheduling strategy, so a simple scheduling mechanism is effective, and simple scheduling principles also reduce configuration overhead. The Ack port is connected to the four pipeline units to obtain the status of each stage and uses this information to dynamically update the status table for the scheduler. The free-list queue maintains the free entries of this buffer.
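A minimal sketch of the priority rule described above: among ready entries (Up_counter == 0), pick one according to its instance_counter. Whether the hardware prefers the largest or smallest counter is not stated; the choice below (most remaining work first) is an assumption, as are the type and function names.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool     valid;
    unsigned up_counter;       /* 0 means all upstream data has arrived */
    unsigned instance_counter; /* remaining instances of the node */
} sched_entry_t;

/* Returns the status-table index to launch next, or -1 if nothing is ready.
 * Assumed priority rule: among ready entries, pick the largest instance_counter. */
static int pick_next(const sched_entry_t *t, int n) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!t[i].valid || t[i].up_counter != 0) continue;  /* not ready yet */
        if (best < 0 || t[i].instance_counter > t[best].instance_counter) best = i;
    }
    return best;
}

int main(void) {
    sched_entry_t t[3] = { {true, 1, 50}, {true, 0, 20}, {true, 0, 80} };
    printf("next entry: %d\n", pick_next(t, 3));  /* 2: ready with most work left */
    return 0;
}
```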
The instruction RAM module consists of multiple single-port SRAM banks; each bank can be occupied by a single functional unit at any time. The operand RAM module consists of multiple 1-write-1-read SRAM banks. To ensure pipelined execution across instances, a separate context is allocated for each iteration. Since there may be dependent data between instances, a shared context is also established in the operand RAM; shared data are marked by special registers in the instructions.
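The sketch below illustrates one possible address mapping consistent with this description: each in-flight iteration gets its own context in the operand RAM, while operands flagged as shared resolve to a single shared region. All sizes and the flagging mechanism are assumptions.

```c
#include <stdio.h>

#define CONTEXT_SIZE 256  /* assumed words per per-iteration context */
#define NUM_CONTEXTS 4    /* assumed number of in-flight iterations */
#define SHARED_BASE  (CONTEXT_SIZE * NUM_CONTEXTS)  /* shared region after the private ones */

/* Resolve an operand address. 'shared' models the special register bit in the
 * instruction that marks data carried across iterations. */
static unsigned operand_addr(unsigned iteration, unsigned offset, int shared) {
    if (shared)
        return SHARED_BASE + offset;                  /* one copy, visible to all iterations */
    return (iteration % NUM_CONTEXTS) * CONTEXT_SIZE + offset;  /* private per-iteration copy */
}

int main(void) {
    printf("iter 5, offset 16, private: %u\n", operand_addr(5, 16, 0));  /* 272 */
    printf("iter 5, offset 16, shared:  %u\n", operand_addr(5, 16, 1));  /* 1040 */
    return 0;
}
```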
4.4 Task-based Program Execution
We propose a task-based program execution model, which augments the dataflow architecture’s ISA with primitives for runtime task management and structured access. In the task-based program execution model, a task consists of multiple sequentially executed subtasks. Each subtask is a dataflow graph consisting of multiple computation nodes and directed edges. A finite-state controller configures our processor at three levels: task level, subtask level, and node level, as shown in Figure 7. Each task contains multiple subtasks, where each subtask is a dataflow graph. The subtasks are executed sequentially, since the number of subtasks to be executed may vary. First, the task parameter words control the processing of one specific program; they indicate the execution count and the number of subtasks. Second, the subtask parameter words control the processing of a codelet, usually a loop structure; they contain the number of iterations and DFG nodes, as well as the batch configuration, the number of root nodes, the base addresses of input and output data, and so on. Third, the node parameter words control a specific DFG node; they record the storage location of the instructions within that node, the numbers and mapping locations of its upstream and downstream nodes, the coordinates of the execution cluster to which the node is mapped, its priority, etc. In this execution model, multiple levels of pipeline parallelism can be exploited: (1) pipeline parallelism between different iterations within a subtask; (2) pipeline parallelism between different iterations within a DFG node; and (3) instruction-level pipeline parallelism.
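The three configuration levels can be pictured as nested parameter records. The sketch below lists the fields mentioned above; the names, widths, and array bounds are illustrative assumptions rather than the actual configuration format shown in Figure 7.

```c
#include <stdint.h>

/* Node-level parameter word: controls one DFG node. */
typedef struct {
    uint32_t inst_addr;      /* where the node's instruction blocks are stored */
    uint8_t  num_upstream;   /* number of predecessor nodes */
    uint8_t  num_downstream; /* number of successor nodes */
    uint16_t up_map[8];      /* mapping locations of upstream nodes (assumed max 8) */
    uint16_t down_map[8];    /* mapping locations of downstream nodes */
    uint16_t cluster_xy;     /* coordinates of the execution cluster the node maps to */
    uint8_t  priority;
} node_param_t;

/* Subtask-level parameter word: controls one dataflow graph (typically a loop). */
typedef struct {
    uint32_t num_iterations;
    uint16_t num_nodes;
    uint8_t  b_conf;            /* batch configuration (1-, 2-, or 4-batch mode) */
    uint16_t num_roots;         /* number of root nodes */
    uint32_t in_base, out_base; /* base addresses of input and output data */
} subtask_param_t;

/* Task-level parameter word: controls one offloaded program. */
typedef struct {
    uint32_t exec_count;   /* how many times the task executes */
    uint16_t num_subtasks; /* subtasks are executed sequentially */
} task_param_t;

int main(void) {
    task_param_t t = { .exec_count = 1, .num_subtasks = 2 };
    return t.num_subtasks == 2 ? 0 : 1;
}
```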
Figure
7(c) shows an example of task-based program execution. This task completes the core computational process of the Fast Fourier Transform (FFT) and mainly contains two loop bodies that are offloaded to the dataflow coprocessor through hints (pragmas). First, the task contains two subtasks, subtask 1 and subtask 2, which are marked in different colors in Figure 7(b). Then, each subtask is compiled into a dataflow graph, where each dataflow graph node contains a segment of instructions whose order follows the principles of the decoupled model proposed in Section 4.3. Next, the three-level configuration words are loaded into each PE, configuring each execution engine and combining them into a cluster array. The dataflow graph is then mapped onto the execution engine array and pipelined for execution. The mapping process maps a dataflow graph onto the cluster array, and each cluster can be mapped with one or more dataflow graph nodes. Execution engines within the same cluster execute the same code segments, performing the same computation on different data in parallel. Unlike traditional mapping approaches, the size of the execution engine cluster array varies with the configuration, so the DFG may need to be extended at mapping time. Our approach is inspired by the literature [29]: the DFG is replicated to ensure that every cluster can be utilized.
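As a rough picture of how two loop bodies of an FFT kernel might be marked for offload, the fragment below uses a hypothetical pragma; the hint syntax, and the generic radix-2 FFT code around it, are our own illustration and not the task shown in Figure 7.

```c
#include <complex.h>
#include <math.h>
#include <stdio.h>

#define N 8  /* transform size (power of two) */

void fft(double complex x[N]) {
    const double PI = acos(-1.0);

    /* Loop body 1: bit-reversal permutation (candidate subtask 1). */
    #pragma dataflow offload  /* hypothetical offload hint */
    for (unsigned i = 1, j = 0; i < N; i++) {
        unsigned bit = N >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { double complex t = x[i]; x[i] = x[j]; x[j] = t; }
    }

    /* Loop body 2: butterfly stages (candidate subtask 2). */
    #pragma dataflow offload  /* hypothetical offload hint */
    for (unsigned len = 2; len <= N; len <<= 1) {
        double complex wl = cexp(-2.0 * I * PI / len);
        for (unsigned i = 0; i < N; i += len) {
            double complex w = 1.0;
            for (unsigned k = 0; k < len / 2; k++) {
                double complex u = x[i + k], v = x[i + k + len / 2] * w;
                x[i + k] = u + v;
                x[i + k + len / 2] = u - v;
                w *= wl;
            }
        }
    }
}

int main(void) {
    double complex x[N] = {1, 1, 1, 1, 0, 0, 0, 0};
    fft(x);
    for (int i = 0; i < N; i++)
        printf("%6.2f %+6.2fi\n", creal(x[i]), cimag(x[i]));
    return 0;
}
```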