
Improving Utilization of Dataflow Unit for Multi-Batch Processing

Published: 15 February 2024

Abstract

Dataflow architectures can achieve much better performance and higher efficiency than general-purpose cores, approaching the performance of a specialized design while retaining programmability. However, advanced application scenarios place higher demands on the hardware in terms of cross-domain and multi-batch processing. In this article, we propose a unified scale-vector architecture that can work in multiple modes and adapt to diverse algorithms and requirements efficiently. First, a novel reconfigurable interconnection structure is proposed, which can organize execution units into different cluster topologies to accommodate different degrees of data-level parallelism. Second, we decouple the threads within each DFG node into consecutive pipeline stages and provide architectural support. By time-multiplexing these stages, the dataflow hardware can achieve much higher utilization and performance. In addition, the task-based program model can exploit multi-level parallelism and deploy applications efficiently. Evaluated on a wide range of benchmarks, including digital signal processing algorithms, CNNs, and scientific computing algorithms, our design attains up to 11.95× energy efficiency (performance-per-watt) improvement over a GPU (V100), and 2.01× energy efficiency improvement over state-of-the-art dataflow architectures.

1 Introduction

Advances in integrated circuit technology have served as a primary approach to enhancing computing power over the past decades. However, this approach is losing effectiveness as Moore’s Law [31] and Dennard scaling [8] slow down or even come to an end. Improving the efficient utilization of hardware resources and enhancing the energy efficiency of architectures have emerged as prominent areas of research in the field of computer architecture [15, 21, 23, 53, 60].
Another emerging challenge, known as cross-domain processing, introduces a novel trade-off between energy efficiency and expressiveness, as depicted in Figure 1. On one end, we have General-Purpose Processors that enable the representation of all domains, albeit at the cost of performance and/or efficiency. On the opposite end, domain-specific accelerators that cater to a single domain and run on specialized architectures exhibit high performance. Nevertheless, creating an end-to-end application that spans multiple domains necessitates a deep understanding of various interfaces and diverse hardware accelerators. Consequently, the development of cross-domain accelerator stacks remains an ongoing challenge [19]. Ideally, we aim for architectures that approach maximal specialization in terms of efficiency, while also being programmable and capable of executing a wide array of applications. The dataflow architecture holds promise in attaining this objective.
Fig. 1. Emerging tradeoff (left) and the high-level abstraction of dataflow architectures (right).
For a given kernel, a fixed circuit is formed, which enables repeated execution, thus approaching the efficiency of an ASIC. With a reconfigurable datapath, dataflow architectures can harness multi-level parallelism, leading to a significant enhancement in their computational throughput and efficiency. As interest in dataflow architectures grows, the significance of maximizing the utilization of available on-chip cores through programming is escalating. This is particularly prominent in the context of dataflow architecture, where a single processor houses a greater number of simpler Processing Elements (PEs) compared to a typical multi-core processor. Figure 1 also illustrates the high-level abstraction of dataflow program execution. Characteristics represent abstractions of the application, encompassing factors like batch data size, regularity, and irregular patterns. The dataflow execution model defines the operational mechanism of both the hardware microarchitecture and the scheduling policy.
Abundant prior works have been proposed to improve the utilization of dataflow architectures (Section 3): pipeline parallelism [35, 39, 61], decoupled access-execute architectures [16, 36, 45, 51], and dedicated interfaces between cores or threads [5, 57]. Nevertheless, these solutions are inefficient because they: (i) lack flexibility, since they rarely consider the impact of data size on utilization, whereas we found that the hardware is limited when the data size does not match the vectorized design of the hardware; and (ii) lack fine-grained pipeline scheduling, since the scheduling of each DFG node in these works is coarse-grained, which misses opportunities to exploit more parallelism within DFG nodes to boost utilization.
To this end, we introduce a reconfigurable dataflow architecture for multi-batch data processing. The contributions that we made are as follows:
We propose a novel reconfigurable interconnection structure that can organize execution units into different cluster topologies as a way to accommodate different data sizes.
We introduce a decoupled dataflow execution model and provide architectural support for the model. By decoupling the datapath of different stages and equipping with a dedicated scheduler within each PE, the DFG nodes of different iterations can be pipelined more efficiently.
We evaluate our methods on a wide range of applications, demonstrating their applicability. Experiments show that our design attains up to 11.95× energy efficiency improvement over GPU (V100) and 2.01× energy efficiency improvement over state-of-the-art dataflow architectures.
The rest of this article is organized as follows: In Section 2, we discuss the background. In Section 3, we present related works, based on which we motivate the need for improving the utilization of the dataflow fabric for multi-batch and cross-domain processing. In Section 4, we present our methods. We discuss our experimental methodology and results in Section 5 and Section 6, respectively. We finally conclude this article in Section 7.

2 Background

In this section, we describe the characteristics of current emerging applications and the new challenges posed to the hardware, and then we introduce the dataflow architecture.

2.1 Cross-Domain Processing and Multi-Batch Processing

Emerging cross-domain technologies have significantly transformed people’s lives. Cross-domain application scenarios are becoming increasingly critical computational workloads for computing platforms, spanning various domains from delivery drones to smart speakers [32]. One such application involves a sequence of steps: (1) sensing the environment, (2) pre-processing input data, which is then fed to (3) a perception module, triggering a subsequent (4) decision-making process to determine actions. Currently, perception is primarily driven by deep learning, which has garnered substantial attention. However, applications are not exclusively reliant on deep learning. Sensory data processing leverages algorithms from Digital Signal Processing (DSP), while Control Theory and Robotics play a role in the final actions, which can also provide feedback to the perception module.
Despite these domains working in concert to realize complete applications, they are facing isolation due to the prevailing trend towards Domain-Specific Accelerators (DSAs). On one hand, traditional general-purpose computational stacks struggle to meet the computational demands of emerging applications [17]. On the other hand, these DSAs [2, 3, 4] sacrifice generality for performance and energy efficiency, limiting programmability to a single domain. While DSAs address the performance gap of General-Purpose Processors (GPPs), they introduce the challenge of dealing with isolated programming interfaces, complicating implementation. Consequently, the scope of expressiveness is curtailed, making the composition of cross-domain applications a significant hurdle when executed on accelerators. While recent advancements are pushing the boundaries of DSAs for improved performance and energy efficiency, a recent study on chip specialization has predicted an eventual ‘accelerator wall’ [12]. Specifically, due to limitations in mapping computational problems onto hardware platforms with fixed resources, the optimization space for chip specialization is bounded by a theoretical limit.
Cross-domain application scenarios have also become more complex in terms of batch size as well as data parallelism, making it more difficult to improve the efficient use of computing resources. On one hand, the number of users at the edge side is time-sensitive [46]. Indeed, over time, ranging from low-load periods (e.g., late at night) to peak periods (peak hours), the quantity of matrix batches for uplink/downlink algorithms varies from \(2\times 2\) to \(N\times N\) , where N ranges from tens to hundreds. This real-time fluctuation in the count of active antennas and receivers exerts diverse throughput demands on the hardware. On the other hand, the number of input data batches (e.g., the number of channels in the activations) varies significantly across network layers as the depth of the deep neural networks used for inference on the server side increases. For instance, the count of channel batches in Alexnet [20] fluctuates from 3 to 384, contingent upon the number of convolutional kernels. Consequently, in cross-domain application scenarios, there are instances involving both small and large input data batches concurrently, implying a wide variation in the extent of data parallelism. The optimal hardware would ideally support large-scale processing of multiple data batches while efficiently managing discrete small data batches. Vector processing techniques, like SIMD (Single Instruction Multiple Data), are widely employed methods for performing batch processing by exploiting data-level parallelism. However, this technique lacks the required flexibility for accommodating various batch sizes. As data parallelism intensifies, the architecture’s efficiency scales up in tandem with batch size augmentation. Nonetheless, this correlation is not boundless in its growth. Beyond a certain threshold, wherein the architecture’s vectorized capacity surpasses the inherent data parallelism of the application, surplus underutilized lanes emerge. This surplus leads to a subsequent diminution in the architecture’s overall efficiency.
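To make the lane-utilization argument concrete, the short sketch below computes the fraction of vector lanes doing useful work for a given batch size. It is illustrative only; the 16-lane width is an assumed vector width, and the channel counts 3 and 384 are taken from the AlexNet example above.

```c
#include <stdio.h>

/* Fraction of SIMD lanes doing useful work when `batch` independent
 * items are processed on hardware with `lanes` vector lanes. */
static double lane_utilization(int batch, int lanes) {
    int vector_ops = (batch + lanes - 1) / lanes;   /* ceil(batch / lanes) */
    return (double)batch / (vector_ops * lanes);
}

int main(void) {
    /* 3 input channels on an assumed 16-wide design use ~19% of the lanes,
     * while 384 channels keep every lane busy. */
    printf("batch=3,   lanes=16 -> %.2f\n", lane_utilization(3, 16));
    printf("batch=384, lanes=16 -> %.2f\n", lane_utilization(384, 16));
    return 0;
}
```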
The demand for hardware capable of cross-domain and multi-batch processing continues to grow relentlessly. Existing programmable and ‘general-purpose’ solutions (e.g., CPUs, GPGPUs) are inadequate, as evidenced by the significant improvements and industry adoption of application and domain-specific accelerators in critical domains like machine learning [26], computer vision [33], and big data [10]. In the realm of FPGAs [58, 59], these customized datapaths are configurable at the bit level, allowing users to prototype diverse digital logic and leverage architectural support for precision computation. However, this flexibility comes with architectural inefficiencies. Bit-level reconfigurability in computation and interconnect resources incurs substantial area and power overheads. For instance, more than 60% of the chip area and power in an FPGA are dedicated to the programmable interconnect. Long combinational paths traversing multiple logic elements limit the maximum clock frequency at which an accelerator design can function. These inefficiencies have driven the development of dataflow architectures featuring word-level functional units that align with the computational demands of many accelerated applications. Dataflow architectures offer dense computing resources, power efficiency, and clock frequencies up to an order of magnitude higher than FPGAs.

2.2 Dataflow Architecture

With the growing interest in many-core architectures, driven in part by ongoing transistor scaling and the consequent anticipated exponential rise in the number of on-chip cores, the significance of optimizing the utilization of available on-chip cores through programming is on the rise. In this context, dataflow program execution models are gaining increasing attention. The dataflow model was initially proposed by Dennis [9] to harness instruction-level parallelism. The dataflow model introduces an alternative order of code execution compared to the traditional control flow model, emphasizing the pivotal role of data. A dataflow program is delineated by a dataflow graph (DFG), composed of nodes and directed edges that connect these nodes. Nodes signify computations, while edges signify data dependencies between nodes. The fundamental principle of the dataflow execution model is that any DFG node can be executed as soon as all the operands it requires are available (the dataflow principle [9]).
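As a minimal illustration of the dataflow principle, the sketch below marks a DFG node as ready to fire once its last outstanding operand arrives. The types and names are hypothetical and are not part of any real implementation.

```c
#include <stdbool.h>

/* Minimal sketch of the dataflow firing rule: a DFG node may execute
 * as soon as every one of its input operands has arrived. */
typedef struct {
    int  num_inputs;   /* in-degree of the node     */
    int  arrived;      /* operands received so far  */
    bool fired;        /* node has already executed */
} dfg_node;

/* Called when an upstream node delivers a token to node `n`;
 * returns true when `n` becomes ready to fire. */
static bool deliver_operand(dfg_node *n) {
    n->arrived++;
    return !n->fired && n->arrived == n->num_inputs;
}
```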
Figure 2 depicts the core process of a contemporary dataflow program. Initially, the compiler analyzes the computational kernels necessitating offloading to the dataflow hardware based on the program’s hints, generating corresponding DFGs by considering the data dependencies in the code. Subsequently, the assembly process converts the high-level language within each DFG node into assembly instructions. Finally, DFG nodes are mapped to Processing Elements (PEs) for scheduling and execution through a DFG mapping algorithm. Figure 2 presents a representative dataflow architecture, in which each PE is notably simpler and there is a more significant number of them within a single processor compared to a typical multi-core processor. The dataflow architecture primarily comprises a PE array, a Micro-Controller (MicC), a configuration buffer, and a data buffer. The PE array is composed of numerous PEs interconnected by an on-chip network. Within each PE, multiple pipeline functional units, a local instruction RAM, data caching register files, and a router are present. The functional units perform data processing based on the instructions stored in the instruction RAM. The router’s role is to parse and forward packets, facilitating data exchange between PEs. To efficiently handle multi-batch processing, vector-oriented designs like SIMD (Single Instruction Multiple Data) are frequently employed within PEs.
Fig. 2. Illustration of the execution process of a dataflow program (a) and a typical dataflow architecture (b). ('TRANS': a customized dataflow instruction whose function is to transfer data to the PE where the downstream nodes are located.)
The dataflow processor operates as a co-processor or accelerator alongside the host processor, collaborating to execute program computations. In essence, the dataflow processor necessitates configuration from the host. The micro-controller furnishes the interface for host-side configuration and manages the execution of the PE array. The configuration buffer stores configuration details received from the host, encompassing kernel parameters, mapping information, and more. Both configuration and input data can be preloaded into the on-chip data buffer through Direct Memory Access (DMA) mechanisms. The computational resources of dataflow architectures are numerous and spatially distributed. Maximizing the use of these resources becomes critical to improving the performance and energy efficiency of the dataflow processor. Therefore, in recent years, many studies have been proposed to improve the utilization of dataflow processors. We will discuss them in the next section.

3 Related Works

Software Parallelism. Dataflow architectures are amenable to creating static spatial pipelines, in which an application is split into DFG nodes and mapped to functional units across the fabric [39, 52, 56, 61]. To perform a particular computation, operands are passed from one functional unit to the next in this fixed pipeline. Pipette [35] structures applications as a pipeline of DFG nodes connected by queues. The queues hide latency by allowing producer nodes to run far ahead of consumers, but Pipette exploits this property on general-purpose cores rather than in a specialized architecture. These efforts may be inefficient for irregular workloads due to load imbalance among DFG nodes. SARA [61], a compiler for reconfigurable dataflow architectures, employs a novel mapping strategy to efficiently utilize large-scale accelerators. It decomposes the application DFG across the distributed resources to hide low-level reconfigurable dataflow architecture constraints and exploits dataflow parallelism within and across hyper-blocks to saturate the computation throughput. Atomic dataflow [62] schedules DFGs at atom granularity to ensure PE-array utilization and supports flexible atom mapping and computing to optimize data reuse in the architecture. GoSPA [7], leveraging the idea of on-the-fly intersection and specialized computation reordering, recodes the sparsity information to deliver necessary values to the compute units and reorders the computation to reduce the fetch time. Based on these two ideas, GoSPA optimizes the sparse convolutional neural network accelerator globally and achieves high performance and energy efficiency. ANNA [22] introduces a memory traffic optimization technique to accelerate ANNS algorithms, which reduces memory traffic and improves performance by reusing data efficiently. Monsoon [37] is a dynamic dataflow architecture that employs tokens to designate various thread contexts. A dataflow program can be called by different threads, and the token serves to mark these threads. When a matching token is identified, it is extracted, enabling the corresponding instruction for execution. If no matching token is found, the incoming token is stored for future use. In the TRIPS [40] dataflow architecture, an instruction serves as the basic unit for launching and scheduling, and an instruction whose operands are ready is dispatched to the compute unit for execution, which represents an instruction-level dataflow model. Groq [6] introduces a new, simpler processing architecture designed specifically for the performance requirements of machine learning applications. Groq’s overall product architecture provides an innovative and unique approach to accelerated computation, offering a new paradigm for achieving both flexibility and massive parallelism without the limitations and communication overheads of traditional GPU and CPU architectures. The Groq compiler orchestrates everything: data flows into the chip and is consumed at the right time and the right place so that calculations occur immediately, with no stalls. Maxeler [34, 48, 49] provides various dataflow engines such as MPC-X, MPC-C, and MPC-N. In Maxeler’s dataflow approach, program source code is transformed into dataflow engine configuration files, which describe the operation, layout, and connections of the dataflow engine. The hardware (FPGAs) then generates specific circuits based on these profiles, similar to Xilinx HLS (High-Level Synthesis) [24].
SW/HW Custom Interface. To improve the utilization of the dataflow fabric, recent works have focused on software and hardware co-designed architectures. Aurochs [47] introduces a threading model for a reconfigurable dataflow accelerator and uses lightweight thread contexts to extract enormous parallelism from irregular data structures. CANDLES [14] proposes a novel microarchitecture and dataflow by adopting a pixel-first compression and channel-first dataflow, which can significantly improve the performance of deep neural network accelerators with low energy overhead. ESCALATE [23] utilizes an algorithm-hardware co-design approach to achieve a high data compression ratio and energy efficiency in convolutional neural network accelerators. The decomposed and reorganized computation stages in ESCALATE obtain maximal benefits in its basis-first dataflow and corresponding microarchitecture design. NASA [27] provides a suitable architecture for the target machine learning workload. NASA is able to partition and reschedule the candidate architecture at fine granularity to maximize data reuse. In addition, it can remove the redundant computation in the mapping stage by a special fusion unit equipped on the on-chip network, which further improves the utilization of the accelerator arrays. Sanger [25] processes the sparse attention mechanism through the coordination of a reconfigurable architecture and software-side pruning, which leads to high hardware efficiency and computing utilization. NASGuard [50] leverages a topology-aware performance prediction model and a multi-branch mapping model to prefetch data and obtain high efficiency of the underlying computing resources. Cambricon-P [18] adopts a carry-parallel computing mechanism that transforms the original multiplication into inner products to exploit computation parallelism. It also employs a bit-indexed inner-product processing scheme that eliminates bit-level redundancy in the inner-product computing unit, which further improves the computing efficiency of the architecture. DRIPS [43] manages the partial dynamic reconfiguration of coarse-grained reconfigurable arrays with the help of special software and hardware components. Based on the execution status, it can dynamically rebalance the pipeline of data-dependent streaming applications to achieve the maximum throughput.
Decoupled Hardware. DAE [41] separates the computer architecture into access processors and execution processors. The two processors execute separate programs with similar structure that perform two different functions. Fifer [36] decouples the memory access datapath from the computing pipeline. Each DFG node is divided into two stages: access and execution. Equipped with a dedicated scheduler, at most two DFG nodes can be executed on the same PE at the same time. In this way, memory access latency can be overlapped and utilization can be further improved. DESC [16] proposes a framework inspired by decoupled access and execution that can also be updated and extended for modern heterogeneous processors. REVEL [51] extends the traditional dataflow model with primitives for inductive data dependences and memory access patterns, and develops a hybrid spatial architecture combining systolic and dataflow execution. RAW [45] introduces hardware support for decoupled communication between cores, which can stream values over the network. TaskStream [5] introduces a task execution model which annotates task dependences with information sufficient to recover inter-task structure. It enables work-aware load balancing, recovery of pipelined inter-task dependences, and recovery of inter-task read sharing through multicasting. Chen et al. [57] propose subgraph decoupling and rescheduling to accelerate irregular applications, which decouples the inconsistent regions into control-independent subgraphs. Each subgraph can be rescheduled with zero-cost context switching and parallelized to fully utilize the PE resources. Saambhavi et al. [1] propose an offload interface with minimal limitations for both distributed-computation and distributed-access architecture models; it is designed for offloading arbitrary units to heterogeneous accelerator resources and offers energy-efficient orchestration of control and data with flexible communication mechanisms. NN-Baton [44], a hierarchical and analytical framework, provides an architecture consisting of three parallel hierarchies (package, chiplet, and core), which enables efficient application mapping and design exploration.
In the Codelet dataflow model, each node within the dataflow graph functions as a thread, essentially acting as the fundamental entity for initiation and execution [42]. Once all the inputs of a thread are prepared, it becomes launch-ready, embodying a thread-level dataflow model. This category also encompasses dataflow-threads [13] and data-driven multithreading [30]. These dataflow models share a common vision for dataflow execution: they aim to maximize parallelism and provide architectural support for data-driven execution, which is also consistent with our vision. Dataflow-thread [13] and our dataflow model share a common characteristic: each node in the dataflow graph is a thread containing a piece of instructions or code, and threads communicate and activate each other using the dataflow principle. However, Dataflow-thread differs from our dataflow model in the way dataflow graph nodes communicate with each other. Dataflow-thread does not define directives or interfaces for direct communication between dataflow threads; communication is achieved by reading and writing memory shared between different dataflow threads. In our work, we define dataflow directives between dataflow threads, so data from upstream nodes can be transferred directly to the computational component where the downstream nodes are located. Additionally, DF-Thread’s API interfaces follow C-like semantics, which require support from the operating system and system calls. In data-driven multithreading [30], the basic unit of scheduling is the thread, which corresponds to a dataflow graph node; it only needs to record and maintain the upstream and downstream threads for each node. In our dataflow architecture, each processing element contains four different types of functional components to support our proposed decoupled execution model. In addition to the entries used in data-driven multithreading, our scheduling table maintains the states of the four different types of components as well as the states of different threads, because in our decoupled model the computational resources can be occupied by four threads or iterations simultaneously.
Although the categories of work mentioned above (software parallelism, SW/HW custom interfaces, and decoupled hardware design) have made significant contributions to improving the utilization of dataflow units, they face new challenges when processing the advanced applications introduced in Section 2. As shown by the representative examples in Table 1, some works [6, 38, 51, 52, 54, 56] focus only on a single application field and do not consider cross-domain processing, while others [14, 36, 38, 39, 52, 56] ignore different data scales or consider only one fixed scale. Consequently, they all have limitations when processing cross-domain and multi-batch applications. DFU [11] introduces a software and hardware co-design method to enhance the hardware utilization of dataflow architectures. It introduces a decoupled execution model and provides architectural support for it. Unfortunately, DFU does not perform well in multi-batch processing scenarios. Therefore, in the face of challenges related to diverse data sizes and data parallelism in cross-domain processing, this article devises a unified scale-vector architecture that leverages the benefits of SISD and SIMD technology simultaneously. Furthermore, this article presents the task-based program execution model, which augments a dataflow architecture’s ISA with primitives for runtime task management and structured access. This article comprehensively considers both the inter-PE and intra-PE aspects, and optimizes cross-domain and multi-batch processing through co-design of the execution model and the hardware.
Design | Characteristics | Cross-Domain | Multi-batch
TRIPS [40] | Instruction-level dataflow model | - | -
Monsoon [37] | Dynamic dataflow model | - | -
Codelet [42] | Thread-level dataflow model | - | -
RABP [38] | A large-scale PE array with flexible scheduler | No | No
Groq [6] | A reconfigurable dataflow NN accelerator | No | No
LRPPU [52] | Pipeline parallelism | No | No
Fifer [36] | Decoupling execution and memory access | Yes (GP+MM) | No
Plasticine [39] | Decoupling pattern units and memory units | Yes (MM+GP) | No
CANDLES [14] | Channel-aware dataflow and hardware co-design | Yes (MM+NN) | No
DFU [11] | Decoupled execution model | Yes (NN+DSP+GP) | No
REVEL [51] | A systolic-dataflow heterogeneous platform | No | Yes (SIMD1+SIMD8)
GANAX [54] | A unified SIMD and MIMD design for GAN | No | Yes (SIMD1+SIMD4)
This article | Execution model and hardware co-design | Yes (NN+DSP+GP) | Yes (1, 2, 4, 8, 16)
Table 1. Comparisons between Representative Dataflow Architectures
‘GP’- Graph processing, ‘MM’- Matrix multiplication, ‘NN’- Neural networks, ‘DSP’-Digital signal processing.

4 Our Methods

In this section, we optimize the micro-architecture and dataflow program execution model with the aim of improving the resource utilization of the dataflow architecture for multi-batch processing. First, at the inter-PE level, we designed a configurable interconnect architecture that is able to work in multiple modes. Second, at the inner-PE level, we designed a fully decoupled architecture with the aim of (1) improving the utilization of computational components by overlapping the latency caused by memory access and data transfer as much as possible, and (2) increasing the throughput of the chip through a dynamic task scheduling mechanism. Finally, we designed a task-based execution model and mapping method for our dataflow architecture.

4.1 Overview

In order to mitigate resource under-utilization, we devise a unified scale-vector architecture that reaps the benefits of the single-instruction-single-data and single-instruction-multiple-data execution models at the same time. That is, while our architecture executes operations with distinct computation patterns in a single execution unit, it performs operations with the same computation pattern in a cluster unit. Figure 3 illustrates the high-level diagram of our proposed architecture, which is comprised of a set of identical multiple-mode PEs. The PEs are arranged in a 2-D array and connected through a dedicated network. Each PE consists of two engines, namely the mode engine and the execution engine. The execution engine merely performs operations, whereas the mode engine controls these execution engines to work in multiple modes. A novel decoupled architecture is designed within each execution engine, differing from traditional out-of-order cores or sequential execution cores. In addition, there are two on-chip networks, one for the transmission of configuration information and control signals, and the other for custom data transmission. There are several main considerations for such a design: (1) the bandwidth requirements of configuration information and data are different; (2) with multiple sets of networks, the control logic for routing and forwarding becomes simpler; and (3) separating the networks reduces conflicts between data packets and configuration packets, which lowers on-chip network pressure and transmission delay. The memory hierarchy is composed of an off-chip memory, on-chip global buffers and local buffers in each PE. The global on-chip buffers are shared across all PEs.
Fig. 3. The diagram of overall architecture.
In the task-based dataflow execution model, three levels of pipeline parallelism are utilized: subtask-level (dataflow graph, DFG) pipeline parallelism, DFG node-level pipeline parallelism, and instruction-level pipeline parallelism (Instruction pipelining technology is used). The subtask-level pipeline parallelism refers to the execution of each dataflow graph in a pipeline manner. The dataflow graph node-level pipeline parallelism refers to the decoupled dataflow execution within each dataflow graph node. Instruction-level pipeline parallelism is the traditional instruction pipeline. Tasks could be annotated with information that describes the operations they perform, and the hardware could take advantage of structured patterns. Performing this analysis in software may not be that profitable, especially in an accelerator system where tasks are short. Our solution is to expose task-management and operation types as first-class primitives of the hardware’s execution model. Furthermore, traditional dataflow graphs do not have the semantics of batch processing. Dataflow graphs often correspond to internal loops, while batch processing information is expressed as the number of iterations of the internal loop. When the hardware is highly reconfigurable, especially when the topology of the execution units is variable, a more flexible approach to dataflow program mapping is proposed.

4.2 Inter-PE Design

PEs are designed to be adaptive to the data sizes of different batches. First, the basic idea is to combine multiple execution engines into a cluster that performs the same computational tasks and processes multiple batches of data synchronously. As shown in Figure 4, the execution engines labeled ❶ and ❷ are combined into a cluster, and the execution engines labeled ❸ and ❹ are combined into a cluster. In this way, a PE consists of two clusters, each of which can process two batches of data in parallel. For the 4-batch mode, the four execution engines are combined into a cluster, processing four batches of data in parallel. While in 1-batch mode, each execution engine acts as a cluster. Second, the mode engine plays the role of configuration generation and distribution. On the one hand, it generates configuration information for each \(\mu\) -router. The structure of \(\mu\) -router is displayed in the right side of Figure 4. Each \(\mu\) -router consists of a set of multiplexers and routing units. The structure inside each routing box is a traditional router structure that parses and forwards packets in four directions (North, East, South, West). The input and output ports in X and Y directions have dedicated control signals (S1, S2, S3, S4, S5, S6, S7, S8) that control the connection of the routing units and the data transmission networks. On the other hand, the mode engine distributes command and control information (activation signals, ack signals, etc.) to each execution engine (datapath in red). Finally, the \(\mu\) -router structure dynamically changes the connections of the data links according to the different batch configurations, thus ensuring efficient and synchronized transmission of multi-batch data.
Fig. 4. The custom interconnect design for multi-batch processing.
Execution Engine. Each PE contains several execution engines. To facilitate understanding, we take four execution engines as an example in Figure 3. It is important to note that the number of execution engines in a PE is scalable. The execution engine consists of a function unit, a local buffer and a \(\mu\)-router. The function unit performs specific operations and supports different operations, including LD/ST, calculation and data transfer. To support diverse kernels, the calculation datapath is designed to support different data types, including integer, fixed-point, floating-point, and complex-valued. Each execution engine has a dedicated local buffer and is built with a \(\mu\)-router. The local buffer stores configurations (instructions) and data during runtime. The \(\mu\)-router is connected with the mode engine and also embedded into a circuit-switched mesh data network. When these \(\mu\)-routers receive a mode configuration from the mode engine, they are statically configured to route to each other, forming the link paths between these execution engines. Execution engines time-multiplex these links to communicate. We discuss the internal structure of the execution engines in more detail in Section 4.3.
Network-on-Chip. The interconnection plays a crucial role in the multiple-mode PE. It ensures that multiple pieces of data can reach the execution engines in the same cluster simultaneously. The structure of the interconnection in a PE can be found in Figure 4. There are two main interconnections: a network for transferring configurations (red paths in Figure 4) and a dedicated network for data (yellow and green paths in Figure 4). The configuration network transports the configurations to each \(\mu\)-router and the instructions to each execution engine. The data network consists of several data paths to accommodate the multiple-batch modes. The number of data paths in the vertical and horizontal directions is equal to the number of execution engines in that direction. In our example, the number of data paths is two, which is determined by the number of execution engines in a PE. The \(\mu\)-router is connected with the data network via crossbar switches and establishes different virtual circuit links under different configurations before the next configuration period.
Mode Engine. Each PE has a dedicated mode engine to dispatch control signals and instructions. In principle, the mode engine reconfigures the execution engines into different clusters to support multi-batch modes. As shown in Figure 3, the mode engine consists of a hierarchical controller. In our example, there are two L1-controllers and one L2-controller, connected in a tree topology. Each L1-controller is connected with two execution engines through their \(\mu\)-router interfaces, and the L2-controller is also connected with the global configuration buffer. The mode engine is mainly responsible for the following functions during the configuration period. First, it parses the PE’s multi-batch configuration, then generates configurations for each \(\mu\)-router and delivers them to each \(\mu\)-router. After the top-level controller (L2-controller) receives the task configuration information from the Global Configuration Buffer, it extracts the batch configuration field (‘B_conf’ in Figure 7) from it. Configurations for the four directions of each \(\mu\)-router are then generated according to the mode-specific rules implied by this batch configuration. Second, instructions are loaded through the mode engine and distributed to each execution engine. Since the execution engines in a PE may belong to different clusters, the controller uses a hierarchical tree-based structure, which makes control simple and easy to implement. It should be noted that the controller becomes more complicated as the number of execution engines in a PE increases: the hierarchical controller scales according to \(log_2\)(number of execution engines).
Fig. 5. The reconfigurable data path (Enabled in red). (a) Configure parameters for different modes. (b) Single-batch datapath. (c) Two-batch datapath. (d) Four-batch datapath.
Fig. 6. The architecture of each execution engine.
Fig. 7. Task-based program execution. (a) Three-level configuration hierarchy: task parameters, subtask parameters and DFG node configurations. (b) The parallel hierarchy. (c) An example of the program model.
Multiple Modes. As shown in Figure 5(a), each PE supports multiple modes: single-batch mode (Figure 5(b)), 2-batch mode (Figure 5(c)), and 4-batch mode (Figure 5(d)). Its function is controlled by an 8-bit configuration word (S1, S2, S3, S4, S5, S6, S7, S8) that is detailed in Figure 5(a). If the PE contains N execution engines, then the PE can support \(log_2N + 1\) modes, where N is a power of 2.
Single-Batch Mode. This mode is designed for algorithms with small-scale source data and little data parallelism. The PE array can be configured as a pure MIMD-like mode, in other words, a many-core architecture with a typical 2D topology. In this mode, each execution engine works as an independent core. It has its own instructions and data, processing a dataflow graph (DFG) node. Horizontal and vertical execution engines need to be connected to the same data path. Therefore, the rule for the configuration word is: “S1 == S5 && S3 == S7 && S2 == S4 && S6 == S8”. Figure 5(b) shows the network connection under the “0000-0000 (S1 to S8)” configuration.
Two-Batch Mode. Two execution engines that are connected to the same L1-controller are combined into a cluster. As shown in Figure 5(c), execution engine ❶ and execution engine ❷ serve as one cluster, while execution engine ❸ and execution engine ❹ serve as another cluster. Since the two execution engines in the Y-axis are in the same cluster, the \(\mu\)-router of execution engine ❷ should be connected to a data link different from that of engine ❶ to guarantee that the two execution engines can receive data from the Y-axis in the same cycle. Similarly, the \(\mu\)-router of execution engine ❹ should be connected to a data path different from that of execution engine ❸. In the X-axis direction, they are connected to the same data path. Since the horizontally oriented execution engines need to interact, they must be connected to the same data path, whereas the vertically oriented execution engines act as two parallel processing units and therefore need to be connected to different data paths. Thus, the configuration rule for a PE in two-batch mode is: “S1 == S5 && S3 == S7 && S2 == \(\sim\)S4 && S6 == \(\sim\)S8”.
Four-Batch Mode. All execution engines in a PE form one cluster, as shown in Figure 5(d). These execution engines are controlled by the L2-controller. In both the X-axis and Y-axis, the \(\mu\)-routers of these execution engines should be connected to different data paths. In the X-axis direction, \(\mu\)-router ❶ and \(\mu\)-router ❷ should connect to data paths different from those of \(\mu\)-router ❸ and \(\mu\)-router ❹, respectively. Similarly, in the vertical direction, \(\mu\)-router ❶ and \(\mu\)-router ❸ should be connected to data paths different from those of \(\mu\)-router ❷ and \(\mu\)-router ❹, respectively. Therefore, the rule for the configuration word is: “S1 == \(\sim\)S5 && S3 == \(\sim\)S7 && S2 == \(\sim\)S4 && S6 == \(\sim\)S8”. Figure 5(d) shows the data path under the “0001-1011 (S1 to S8)” configuration.
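The three configuration-word rules above can be summarized by the small predicate sketch below. It assumes, purely for illustration, that S1 to S8 are stored as individual bits s[0] to s[7]; for example, “0000-0000” satisfies the single-batch rule and “0001-1011” satisfies the four-batch rule.

```c
#include <stdbool.h>

/* Illustrative check of the batch-mode rules on the 8-bit configuration
 * word. Bit numbering is an assumption of this sketch: s[0] = S1, ...,
 * s[7] = S8, each 0 or 1. */
static bool valid_single_batch(const int s[8]) {
    /* S1 == S5 && S3 == S7 && S2 == S4 && S6 == S8 */
    return s[0] == s[4] && s[2] == s[6] && s[1] == s[3] && s[5] == s[7];
}

static bool valid_two_batch(const int s[8]) {
    /* S1 == S5 && S3 == S7 && S2 == ~S4 && S6 == ~S8 */
    return s[0] == s[4] && s[2] == s[6] && s[1] != s[3] && s[5] != s[7];
}

static bool valid_four_batch(const int s[8]) {
    /* S1 == ~S5 && S3 == ~S7 && S2 == ~S4 && S6 == ~S8 */
    return s[0] != s[4] && s[2] != s[6] && s[1] != s[3] && s[5] != s[7];
}
```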
The two-batch mode and four-batch mode are designed for scenarios with high data parallelism. Execution engines are divided into multiple clusters under the control of the mode engine. The instructions are loaded and distributed to the corresponding clusters by the mode engine. Execution engines in the same cluster perform the same operations on multiple data synchronously. Limited by the number of execution engines, the PE can work in three different modes in our example. To explain the structure more clearly, we also show the domain division for the different configurations in different colors in Figure 5. Note that this design principle is scalable. As the number of execution engines in a PE increases (preferably by powers of 2), the number of available modes also increases. For example, when each PE contains 16 execution engines, the structure of the mode engine becomes more complex: there are L3- and L4-controllers, and the PE additionally supports an eight-batch mode and a 16-batch mode.
Memory Access. Global buffers are built with multiple SRAM banks matching the scale of data. Address decoding logic around the scratchpad can be configured to operate in several banking modes to support various access patterns. Physical banks cascade and are grouped into logic banks according to the width of configuration. Besides, the global buffers are sliced into two lines, which work in a Ping-Pong way to cover transmission time. To support diverse modes, DMA can transmit and reshape variable length of multi-batch data with scatter and gather operations, exchanging data between on-chip buffers and off-chip memory.
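The Ping-Pong operation of the two global-buffer lines can be pictured as in the following sketch, where dma_load(), dma_wait(), and run_subtask() are placeholder hooks rather than the real hardware interface: while the PE array computes on one line, the DMA prefetches the next tile of data into the other line.

```c
#include <stdio.h>

/* Placeholder hooks standing in for the DMA engine and the PE array;
 * they are not the real hardware interface. */
static void dma_load(int line, int tile)   { printf("DMA: tile %d -> line %d\n", tile, line); }
static void dma_wait(int line)             { (void)line; /* block until the transfer into `line` is done */ }
static void run_subtask(int line, int tile){ printf("PEs: compute tile %d from line %d\n", tile, line); }

/* Double-buffered (Ping-Pong) processing: computation on one buffer
 * line overlaps with the DMA filling the other. */
static void process_tiles(int num_tiles) {
    int fill = 0, compute = 1;               /* the two global-buffer lines */
    dma_load(fill, 0);                       /* preload the first tile      */
    for (int t = 0; t < num_tiles; t++) {
        int tmp = fill; fill = compute; compute = tmp;  /* swap roles       */
        dma_wait(compute);                   /* data for tile t is ready    */
        if (t + 1 < num_tiles)
            dma_load(fill, t + 1);           /* prefetch the next tile      */
        run_subtask(compute, t);             /* overlaps with the DMA       */
    }
}

int main(void) { process_tiles(4); return 0; }
```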

4.3 Inner-PE Design

We create a decoupled execution model that defines a novel scheme to schedule and trigger DFG nodes to exploit instruction block level parallelism. The code of each DFG node consists of up to four consecutive stages: Load stage, Calculating stage, Flow stage, and Store stage, which we describe below:
Ld (Load) Stage. This stage loads data from the memory hierarchy to the in-PE local memory.
Cal (Calculating) Stage. This stage completes calculations. A node can enter the Cal stage only when the following two conditions are met: first, its Ld stage (if it exists) has already finished; second, it has received all the necessary data from its predecessor nodes.
Flow Stage. This stage transfers data from the current node to its successors.
ST (Store) Stage. This stage transfers data from the in-PE operand memory to the memory hierarchy.
Similarly, instructions in a DFG node will be rearranged according to their types and divided into four different blocks. The block is a basic schedule and trigger unit. Instruction-block-level dataflow is the middle ground between instruction-level dataflow and thread-level dataflow. It can be seen as a further development of thread-level dataflow. In the thread-level dataflow model, each dataflow graph node is a thread and serves as the basic unit for launching and scheduling. Instruction-block-level dataflow decomposes each node of thread-level dataflow into four stages. Each phase consists of a segment of instructions and serves as the basic unit for launching and scheduling. Unlike the traditional out-of-order execution, the decoupled execution model exploits more instruction-block level parallelism without complex control logic, such as reorder buffer.
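The following sketch captures the stage decomposition and the firing condition of the Cal block described above; the state encoding is illustrative only and is not the actual hardware encoding.

```c
/* Sketch of the four-stage decoupling of a DFG node. Each stage is an
 * independently scheduled instruction block; a node simply skips the
 * stages it does not contain. */
typedef enum { STAGE_LD, STAGE_CAL, STAGE_FLOW, STAGE_ST, STAGE_DONE } node_stage;

typedef struct {
    node_stage next;     /* next instruction block to issue        */
    int up_counter;      /* upstream activations still outstanding */
} node_iter;

/* Firing rule for the Cal block: the Ld block (if present) has already
 * retired, so `next` points at Cal, and every upstream operand arrived. */
static int cal_ready(const node_iter *n) {
    return n->next == STAGE_CAL && n->up_counter == 0;
}
```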
Figure 6 illustrates the top-level diagram of our dataflow architecture, which is comprised of a set of identical decoupled processing elements (dPEs). To support the decoupled execution model, separate four-stage components are designed within each PE, corresponding to the four different states of the nodes. This approach allows a processing element to be shared by up to four different DFG nodes simultaneously, enabling the overlap of memory access and data transfer latency as much as possible. By decoupling the datapaths of different stages and equipping each PE with a dedicated scheduler, the DFG nodes of different iterations can be pipelined more efficiently. The controller maintains and schedules the execution of the different node states. To ensure the correctness of the execution, separate operand RAM space is provided for different iterations. A shared operand RAM space is set up to store the data that have dependencies between iterations, which are marked by special registers in the instructions.
The dPE consists of a calculation pipeline, a load unit, a store unit, a flow unit, an instruction RAM module, an operand RAM module, a controller and a router (in the middle of Figure 6). These four separate functional components (CAL, LOAD, FLOW, STORE) and the controller are designed for the decoupled execution model, which are different from previous structures. The calculation pipeline is a data path for arithmetic operations and logical operations. It fetches instructions from the instruction RAM module and performs computations on the source data. The load/store unit transfers data from/to on-chip data memory to/from operand RAM module, respectively. And the flow unit dispatches data to downstream dPEs. Each execution unit has a corresponding DFG node state, as described in Figure 6, and such a decoupling method is the key to improving the utilization.
The controller plays a non-negligible role in state transitions and DFG node triggering. It consists of a kernel table, a status table, a free list, a dedicated acknowledgment buffer (Ack port), and a scheduler module. The kernel table stores the configurations of the nodes mapped to the dPE, which contain the Task ID (TID), node ID (NID), instance number (instance), instruction address list (inst_addr) and data addresses (LD_base&ST_base). The TID and NID are used to identify the task and the DFG node, because the PE array can be mapped to multiple tasks at the same time, and a PE can be mapped to multiple nodes. The instance is a value related to pipeline parallelism, which indicates how many times the DFG node needs to be executed. Taking BFS as an example, a large graph may need to be decomposed into many subgraphs, say 100, in which case each DFG node needs to be executed 100 times. The inst_addr records the location of the four-stage instructions of the DFG node in the instruction RAM. The LD_base&ST_base are the base addresses for the source and destination, which work with the offset in the status table to access the data in the operand RAM.
The status table maintains the runtime information for different instances. It uses the instance_counter to record different instances of DFG nodes. Although different instances share the same instructions, they handle different data; therefore, the offsets (offset) of different instances are different. In addition, the status table records the activations (Up_counter) and status information. The value of Up_counter decreases with the arrival of activation data. When this value reaches 0, all the upstream data of the current node has arrived and the node can be triggered by the scheduler.
The scheduler uses the instance_counter to evaluate priority and schedules nodes accordingly. We also tried other scheduling policies, such as a round-robin scheduler or finer-grain multithreading, but found that these did not work as well. This makes sense: the completed application work is nearly constant regardless of the scheduling strategy, so a simple scheduling mechanism is effective. Also, simple scheduling principles reduce configuration overhead. The Ack port is connected to the four pipeline units in order to obtain the status of each stage, and it uses this information to dynamically update the contents of the status table for the scheduler. The free-list queue maintains the free entries of this buffer.
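A simplified view of the controller tables and the trigger/scheduling logic is sketched below. The field names follow the text, while the field widths, exact layout, and the oldest-instance-first priority are assumptions made for illustration.

```c
/* Sketch of the per-dPE controller tables. */
typedef struct {
    int tid, nid;          /* task ID and DFG-node ID                   */
    int instance;          /* how many times the node must execute      */
    int inst_addr[4];      /* Ld/Cal/Flow/St block offsets in inst. RAM */
    int ld_base, st_base;  /* base addresses in the operand RAM         */
} kernel_entry;

typedef struct {
    int instance_counter;  /* which iteration this entry tracks         */
    int offset;            /* per-instance operand-RAM offset           */
    int up_counter;        /* decremented as activation data arrives    */
    int stage;             /* current decoupled stage of the instance   */
} status_entry;

/* A node instance becomes triggerable once all upstream data arrived. */
static int triggerable(const status_entry *s) { return s->up_counter == 0; }

/* Assumed priority: favour the oldest outstanding instance (smallest
 * instance_counter) among the triggerable entries. */
static int pick_next(const status_entry *tbl, int n) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (triggerable(&tbl[i]) &&
            (best < 0 || tbl[i].instance_counter < tbl[best].instance_counter))
            best = i;
    return best;   /* -1 if nothing is ready */
}
```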
The instruction RAM module consists of multiple single-port SRAM banks. Each bank can be occupied by a single functional unit at any time. The operand RAM module consists of multiple 1-write-1-read SRAM banks. To ensure the pipeline execution between instances, a separate context is allocated for each iteration. Considering that there may be dependent data between instances, a shared context is established in the operand RAM. Shared data are marked by special registers in the instructions.

4.4 Task-Based Program Execution

We propose the task-based program execution model, which augments a dataflow architecture’s ISA with primitives for runtime task management and structured access. In the task-based program execution model, a task consists of multiple sequentially executed subtasks. Each subtask is a dataflow graph which consists of multiple computation nodes and directed edges. A finite-state controller is used to configure our processor at three levels: task level, subtask level, and node level, as shown in Figure 7. Each task contains multiple subtasks, where each subtask is a dataflow graph. The subtasks are executed sequentially, since the number of subtasks to be executed may differ between tasks. First, the task parameter words are used to control the processing of one specific program, indicating the execution number and the number of subtasks. Second, the subtask parameter words are used to control the processing of a codelet, usually a loop structure. They contain the number of iterations and DFG nodes, as well as the batch configuration, the number of root nodes, the base addresses of input and output data, and so on. Third, the node parameter words are used to control a specific DFG node; they record the storage location of the instructions within that node, the number of upstream and downstream nodes, the mapping locations of the upstream and downstream nodes, the coordinates of the execution cluster to which the node is mapped, the priority, and so on. In this execution model, multiple levels of pipeline parallelism can be exploited: (1) pipeline parallelism between different iterations within a subtask; (2) pipeline parallelism between different iterations within a DFG node; and (3) instruction-level pipeline parallelism.
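The three levels of parameter words can be pictured as the C structs below. The field names follow the description above, while the exact packing and field widths are assumptions made for illustration.

```c
/* Sketch of the three-level configuration hierarchy (Figure 7(a)). */
typedef struct {
    int exec_count;           /* how many times the task runs            */
    int num_subtasks;         /* subtasks executed sequentially          */
} task_param;

typedef struct {
    int iterations;           /* loop trip count of the codelet          */
    int num_nodes;            /* DFG nodes in this subtask               */
    int batch_conf;           /* 'B_conf': 1-, 2- or 4-batch mode        */
    int num_roots;            /* root nodes of the DFG                   */
    int in_base, out_base;    /* base addresses of input/output data     */
} subtask_param;

typedef struct {
    int inst_addr;            /* where the node's instructions reside    */
    int num_up, num_down;     /* upstream / downstream node counts       */
    int cluster_x, cluster_y; /* cluster the node is mapped to           */
    int priority;
} node_param;
```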
Figure 7(c) shows an example of task-based program execution. This task completes the core computational process of the Fast Fourier Transform (FFT) and contains mainly two loop bodies that are offloaded to the dataflow coprocessor through hints (pragmas). First, this task contains two subtasks, subtask 1 and subtask 2, which are marked with different colors in Figure 7(b). Then, each subtask is compiled into a dataflow graph, where each dataflow graph node contains a segment of instructions, and the order of instructions follows the principles of the decoupled model we proposed in Section 4.3. Next, the three-level configuration words are loaded into each PE, configuring each execution engine and combining them into a clustered array. The dataflow graph is then mapped to the execution engine array and pipelined for execution. Execution engines within the same cluster execute the same code segments. The mapping process maps a dataflow graph onto a cluster array. Each cluster can be mapped with one or more dataflow graph nodes. Execution engines within the same cluster perform the same computational process and process different data in parallel. Unlike traditional mapping approaches, the size of the execution engine cluster array is variable under different configurations. As a result, the DFG may need to be extended at mapping time. Our approach is inspired by the literature [29]: the DFG is replicated to ensure that each cluster can be utilized.
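A hedged sketch of what the offloaded FFT source might look like is shown below: two loop nests, each annotated with a hint and compiled into one subtask. The pragma spelling and the split into a bit-reversal pass and a butterfly pass are assumptions for illustration, not the actual compiler directive or kernel code.

```c
/* Placeholder kernels standing in for the two FFT phases; their bodies
 * are omitted because only the offload structure matters here. */
void bit_reverse(float *re, float *im, int n);
void butterflies(float *re, float *im, int n);

/* Hypothetical shape of the task in Figure 7(c): two loop nests, each
 * marked for offload with a hint and compiled into one subtask (one
 * dataflow graph). The pragma name is illustrative only. */
void fft_task(float *re, float *im, int n, int batch) {
    #pragma dataflow offload                  /* becomes subtask 1 */
    for (int b = 0; b < batch; b++)
        bit_reverse(re + b * n, im + b * n, n);

    #pragma dataflow offload                  /* becomes subtask 2 */
    for (int b = 0; b < batch; b++)
        butterflies(re + b * n, im + b * n, n);
}
```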

5 Experimental Methodology

Setup. We implemented a dataflow simulator based on the SimICT parallel framework [55]. This simulator is mainly used to verify correctness and to obtain performance and computational-component utilization; it simulates the behavior of computation, memory accesses, instruction conflicts, and the like. Additionally, we implemented the modules of the dataflow architecture in Verilog and synthesized them with Synopsys tools. We use Synopsys Design Compiler and a TSMC 28 nm GP standard-VT library to obtain area, delay and energy consumption; the design meets timing at 1.25 GHz. We calibrate the latency error of the simulator to within ±7% of the Verilog environment while ensuring functional correctness. First, we verify the computational results and functional correctness of both the C simulator and the Verilog implementation. The error here is the difference in the total latency of the test program between the C simulator and the Verilog environment. Since it is difficult to keep the latencies of task switching and pipeline stalls consistent, there is always a small cycle error between the two platforms.
Table 2 shows the hardware parameters. Each PE is equipped with 16 execution engines connected via 4-level controllers, enabling each PE to support more modes. Fixed-point, integer and load instructions consume one cycle; floating-point, store and dataflow instructions consume two clock cycles; floating-point division consumes nine cycles. Table 2 also shows the area and power breakdown of our architecture. It has an area footprint of 16.477 \(mm^2\) in a 28 nm process and consumes a maximum power of 2.038 W at a 1.25 GHz clock. The PE array occupies the largest proportion of area and power consumption, accounting for 57.03% of the area and 53.09% of the power, respectively. Within each PE, the execution engines (including function units, controller and instruction & data RAM) account for the largest proportion.
Component | Parameter | Area (\(mm^2\)) | Power (mW)
PE: Func. Unit | INT & FP32, #16 | 0.165 (44.00%) | 21.43 (31.73%)
PE: Controller | - | 0.044 (11.73%) | 3.59 (5.32%)
PE: Inst. RAM | 4 KB | 0.020 (5.33%) | 2.3 (3.41%)
PE: Data RAM | 16 KB | 0.072 (19.20%) | 28.78 (42.61%)
PE: Mode Engine | L1, L2, L3, L4 | 0.018 (4.80%) | 2.40 (3.55%)
PE: \(\mu\)-routers | #16 | 0.056 (14.93%) | 9.04 (13.38%)
PE: Total | | 0.375 | 67.54
PE Array | 4 × 4, 1.25 GHz | 6.00 (58.36%) | 1080 (52.96%)
Network-on-chip | 1 cycle/hop, X-Y routing | 1.50 (14.59%) | 259 (12.69%)
Global Data Buffer | 512 KB SPM, double-buffer | 1.47 (14.26%) | 534 (26.17%)
Global Config Buffer | 128 KB, double-buffer | 0.21 (1.99%) | 109 (5.34%)
DMA | ping-pong | 0.36 (3.50%) | 58 (2.84%)
Total | | 10.28 | 2040
Table 2. Hardware Parameters
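For reference, the per-instruction latencies stated above can be collected into a small lookup of the kind a cycle-level simulator might use; the enum names are illustrative and are not taken from our toolchain.

```c
/* Latency table from the text: fixed-point, integer and load take one
 * cycle; floating-point, store and dataflow (e.g., TRANS) instructions
 * take two; floating-point division takes nine. */
typedef enum { OP_INT, OP_FIXED, OP_LOAD, OP_FP, OP_STORE, OP_TRANS, OP_FP_DIV } op_class;

static int op_latency(op_class op) {
    switch (op) {
    case OP_INT: case OP_FIXED: case OP_LOAD:  return 1;
    case OP_FP:  case OP_STORE: case OP_TRANS: return 2;
    case OP_FP_DIV:                            return 9;
    }
    return 1;   /* unreachable for valid inputs */
}
```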
Benchmarks. To evaluate our methods, we select several real-world applications from Plasticine [39] and REVEL [51]. These workloads cover digital signal processing algorithms, CNNs and scientific computing, and contain different parameters. Table 3 lists the selected workloads. We use Synopsys PrimeTime PX for accurate power analysis. These kernels are mapped to PE arrays by the compiler introduced in [28], a compilation framework based on the LLVM framework. The host compiles and assembles the high-level language and configures the dataflow processor. In our actual system, we use an ARM CPU as the host, and the dataflow accelerator is controlled through PCIe interrupts. Figure 8 shows the process of transforming partitioned serial code into configurations for our dataflow architecture. We generate LLVM intermediate representation (IR) for each stage, which represents low-level operations on data and their dependences. An automated tool examines the LLVM IR and produces a DFG using the actual operations that can be performed by a PE’s functional units. To ensure load balance, we use the DFG balancing algorithm [11], a heuristic algorithm that achieves load balancing through instruction scheduling among DFG nodes. The code in each dataflow graph node is then converted into the decoupled execution model, which then generates bitstreams that can be executed on the decoupled hardware.
Application | Domain | Parameter Scales | Batch
FFT (Fast Fourier Transform) | Digital Signal Processing (DSP) | 16 \(\sim\) 512 | 1 \(\sim\) 32
FIR (Finite Impulse Response) | DSP | 1G | 1 \(\sim\) 32
SVD (Singular Value Decomposition) | DSP and AI | 16 \(\sim\) 256 | 1 \(\sim\) 32
Cholesky | DSP | 16 \(\sim\) 256 | 1 \(\sim\) 32
Alexnet | Artificial Intelligence (AI) | Conv layers | 3 \(\sim\) 384 (channel)
VGG16 | AI | Conv layers | 3 \(\sim\) 512 (channel)
Resnet50 | AI | Conv layers | 3 \(\sim\) 512 (channel)
Stencil3d7p | Scientific computing | 16*32*25K | 1 \(\sim\) 32
SHA256 | Scientific computing | 1 M | 1 \(\sim\) 32
MM (Matrix Multiplication) | Scientific computing, AI, and DSP | 16 \(\sim\) 512 | 1 \(\sim\) 32
BFS (Breadth First Search) | Graph processing | web-Google, |V| = 9K, |E| = 5M | -
Table 3. Benchmark Specifications
Fig. 8. The process of mapping a program to hardware.
The mapping of a DFG to our architecture involves two stages: one during compilation and one during hardware configuration. Each processing unit is identified by an ID indicating its position in the two-dimensional array. The mapping algorithm traverses each node in the DFG and determines its position in the PE array. At the end of compilation, all the information, including the DFG topology, mapping configuration, and instructions for DFG nodes, is written to a file that is sent to the hardware's global configuration buffer. The buffer parses this file into packets, each carrying the packet type, the destination processing unit, and other execution-related information. The destination field specifies the processing unit on which the instruction packet is to be executed, so the packet can be delivered to that unit through the network-on-chip.
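As a concrete illustration, a configuration packet delivered through the network-on-chip could be represented as follows. This is a hedged sketch assuming a 4 × 4 array addressed by row/column IDs; the field names and widths are ours, not the exact hardware format.

#include <stdint.h>

/* Kinds of information carried in the configuration file (per the text above). */
typedef enum { PKT_DFG_TOPOLOGY, PKT_MAPPING, PKT_INSTRUCTION } pkt_type_t;

/* Illustrative packet layout parsed by the global configuration buffer. */
typedef struct {
    pkt_type_t type;      /* type of packet                                  */
    uint8_t    dest_row;  /* destination PE position in the 4 x 4 array      */
    uint8_t    dest_col;
    uint16_t   length;    /* payload length in words                         */
    uint32_t   payload[]; /* instructions / mapping data for the target node */
} config_packet_t;

/* With X-Y routing (Table 2), a packet first travels along the row and then
   along the column until it reaches PE (dest_row, dest_col). */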
Comparisons. To quantify the performance of our processor, we first compare it with a CPU, a DSP, and a GPU. For fairness, these platforms have similar ideal peak FLOPs (except the GPU, which has more). DSP: TI C6678, an 8-core DSP with 16 FP adders/multipliers per core, using DSPLIB C66x 3.4.0.0. CPU: Intel Xeon 4116, a conventional out-of-order processor using the highly optimized Intel MKL library (8 cores used). GPU: NVIDIA Tesla V100 (NVLink version), using the cuSOLVER, cuFFT, and cuBLAS NVIDIA CUDA libraries. We employed the “nvidia-smi” tool to measure the dynamic power consumption of the GPU rather than its maximum power. To measure CPU power consumption while executing the programs, we utilized a resource manager.1 Furthermore, we utilized a power instrument2 to measure the power consumption of the DSP during program execution.
In addition, we compare our design with three state-of-the-art dataflow designs: RABP [38], REVEL [51], and Plasticine [39]; their configurations are listed in Table 4. RABP is characterized by high peak performance obtained with a large-scale PE array using scalar computation, and the PE array achieves high utilization by executing multiple DFGs simultaneously. REVEL and Plasticine feature vector SIMD architectures for batch processing to improve performance and energy efficiency. REVEL uses SIMD8 in a heterogeneous dataflow architecture whose PE array consists of simple PEs in systolic form and dataflow PEs that can perform complex computations. Plasticine is a homogeneous dataflow array that uses SIMD16 and obtains high utilization by decoupling dedicated pattern compute units and memory units. We leverage their open-source implementations and develop a simulator for each design for performance and utilization evaluation. We also implement their main components in Verilog to obtain area and power consumption. For fairness, these designs are extended to have similar peak performance and the same process; their clock frequencies follow the configurations in their original articles.
Architecture | PE array | PE | Vector size | On-chip Memory
RABP | 20 × 20 | INT & FP32 | SIMD 1 | 512 KB DBUF, 128 KB CBUF
REVEL | 4 × 8 | add, sqrt/div, mult | SIMD 8 | 128 KB
Plasticine | 9 PCUs, 6 stages (PCU) | INT & FP32 | SIMD 16 | 9 PMUs, 288 KB
OURS | 4 × 4 | INT & FP32 | 1, 2, 4, 8, 16 | 512 KB DBUF, 128 KB CBUF
Table 4. Hardware Configurations

6 Evaluation Results

6.1 Utilization and Performance

First, we validate the processing capability of our proposed architecture for multiple batch sizes. Figure 9 shows the computational resource utilization for different data sizes. In all cases, it achieves an average utilization of over 60% and a maximum of 92.2%. The experimental data show that applications with larger parameter sizes in a single batch, such as FFT-512 and MM-512, achieve higher utilization in the multi-batch case. This is because every execution unit is assigned computational tasks and iterates while remaining busy. The small variations in utilization are due to different configuration overheads and the dynamic behavior at runtime. On the other hand, the benefits of our proposed architecture are also reflected in the processing of multiple batches of small-scale data, such as FFT-16 and MM-16. When the parameter size is so small that many execution units receive no computational load, the processor's performance cannot be realized. Our design dynamically and adaptively maps clusters to the batch size so that all execution units receive computational tasks, thereby improving utilization.
Fig. 9. Utilization of the computing components under different batch sizes.
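Throughout this section, utilization denotes the fraction of functional-unit cycles that perform useful work. A standard formulation, which we assume here for exposition, is \(\text{Utilization} = \sum_i \text{busy\_cycles}_i \, / \, (\text{total\_cycles} \times N_{\text{FU}})\), where \(N_{\text{FU}}\) is the number of functional units in the array.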
Second, we evaluate the computational resource utilization for a single batch under different modes, as shown in Figure 10. We use FFT and MM as representatives of digital signal processing algorithms. The computational resources are well utilized when the parameter scale is large, whereas utilization drops significantly when the parameter scale is small. For example, the utilization of the 16-batch mode is only 6% for MM-32. For applications with small parameter sizes such as FFT-16 and MM-16, the utilization of wide-batch modes is extremely low because only a small amount of computational resources is required; most execution engines sit idle, and the 1-batch mode is the most suitable. For AI algorithms, we select different convolutional layers from Alexnet (abbreviated as ‘A’), VGG16 (abbreviated as ‘V’), and Resnet (abbreviated as ‘R’), and process different channels of the feature map in parallel along the batch dimension. We find that the utilization of most convolutional layers decreases as the batch width grows. For A_conv1, for example, the feature map has 3 channels; when the configured width exceeds 4, some execution engines cannot be fully utilized, so utilization decreases.
Fig. 10. Utilization of computational components in different modes at single batch size.
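The behavior above suggests a simple way to reason about mode selection: choose the narrowest supported batch width (1, 2, 4, 8, or 16 lanes, per Table 4) that covers the available data-level parallelism, since anything wider only adds idle engines. The sketch below is an illustrative heuristic under this assumption, not the policy actually implemented in our toolchain.

static int select_batch_mode(int parallelism) {
    /* Supported batch widths, narrowest first (Table 4). */
    static const int modes[] = {1, 2, 4, 8, 16};
    for (int i = 0; i < 5; i++) {
        /* Pick the narrowest width that covers the available parallelism. */
        if (modes[i] >= parallelism)
            return modes[i];
    }
    return 16; /* cap at the widest supported mode */
}

For example, select_batch_mode(3) returns 4 for a 3-channel layer such as A_conv1, while MM-512 can fill the full 16-wide mode.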
Third, we evaluate the benefits of the decoupled execution model. Figure 11(a) shows the utilization of serial and decoupled execution within the dataflow-graph nodes; the decoupled execution obtains a significant improvement in utilization. Figure 11(b) shows the performance of the two execution methods across different cases. The decoupled execution obtains an average performance improvement of 1.92× over sequential execution. Thus, the proposed decoupled execution model plays an important role in improving the performance and utilization of the dataflow fabric.
Fig. 11. Benefits from decoupling methods within execution engines. (a) Comparison of utilization between serial and decoupled execution. (b) Speedup of decoupled datapath (normalised to serial execution).
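To make the comparison in Figure 11 concrete, the following sketch contrasts serial execution of a DFG node's load/compute/store phases with a decoupled schedule in which the phases of consecutive iterations overlap in a pipeline. It is a simplified cycle model under assumed stage latencies, not the actual microarchitecture.

#include <stdio.h>

int main(void) {
    const int n = 32;                                 /* iterations of the DFG node */
    const int t_load = 4, t_compute = 6, t_store = 4; /* assumed stage latencies    */

    /* Serial: each iteration runs load, compute, store back to back. */
    int serial = n * (t_load + t_compute + t_store);

    /* Decoupled: the three phases run as pipeline stages, so after the pipeline
       fills, throughput is limited by the slowest stage. */
    int slowest = t_compute;                          /* max of the three latencies */
    int decoupled = (t_load + t_compute + t_store) + (n - 1) * slowest;

    printf("serial = %d cycles, decoupled = %d cycles, speedup = %.2fx\n",
           serial, decoupled, (double)serial / decoupled);
    return 0;
}

Under these placeholder latencies the model yields roughly a 2.2× reduction in cycles; the measured average gain of 1.92× in Figure 11(b) is of the same order once real stalls and configuration effects are included.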
Fourth, Figure 12 illustrates the speedup normalized to the DSP. We select the best-performing results among the multiple configurations of our architecture for comparison with the other platforms. Our design attains up to a 25.7× speedup over the DSP, with geomeans of 6.40× and 9.47× for small and large parameter scales, respectively. The DSP and CPU have similar mean performance. For large parameter scales, the GPU obtains performance similar to ours, and for MM, SHA256, and SVD it performs better: the GPU has abundant computational resources, is well suited to large-scale concurrent data, and benefits from the SIMT execution model. Its drawback, however, is that it struggles to handle small and discrete data efficiently; for small parameter scales, the CPU outperforms the GPU in most cases. Across large and small parameter scales, our architecture provides an average speedup of 8.45× over the DSP and 1.76× over the GPU.
Fig. 12. Performance comparison normalized to DSP.
Figure 13 illustrates the performance comparison between our architecture and the three dataflow architectures. Compared to RABP, REVEL gains an average 1.83× performance improvement and Plasticine gains an average 2.41× improvement, while our architecture gains an average 3.34× improvement. RABP uses a large-scale PE array rather than vectorization, so configuration latency and long data-communication delays between PEs degrade performance. REVEL employs SIMD, so its PE array is smaller for the same peak performance and the data-communication latency between PEs is reduced, yielding a performance improvement. Plasticine obtains a further gain because its design separates computation from memory access, which allows part of the access latency to be overlapped and thus reduces total time. Although our maximum vector width is the same as Plasticine's, our architecture still achieves higher performance because of our fully decoupled design: not only can access latency be overlapped, but the data-transmission distance between PEs is also overlapped. The improvement brought by the decoupled design is demonstrated in Figure 11. For irregular applications such as BFS, the vectorized design brings essentially no performance improvement; only the decoupled design does.
Fig. 13. Performance comparison normalized to RABP.

6.2 Energy Efficiency

Figure 14 shows the energy efficiency comparison with the GPU in different modes, using the performance-per-watt metric. For digital signal processing algorithms with highly concurrent, large-scale parameters, a wide configuration achieves better energy efficiency, whereas for small-scale parameters a narrow mode does. For example, MM-512 and FFT-512 achieve optimal efficiency in the 16-batch configuration, while the highest efficiency is achieved in the 1-batch and 2-batch modes for FFT-32 and MM-32, respectively; wide configurations are inferior in these cases. The reason is that, for small-scale parameters, low data parallelism leaves execution engines underutilized, resulting in lower utilization (Figure 10) and thus lower efficiency. For most CNN workloads, our design achieves different energy efficiency in different modes. The best efficiency is obtained in the 8-batch configuration in most cases, except for Alexnet_conv1: since Alexnet_conv1 has only 3 channels, the execution engines in wide configurations are not fully utilized, and the 4-batch mode achieves the best efficiency. Compared with the GPU, our design attains up to an 11.95× efficiency improvement, with geomeans of 10.23× and 4.89× for digital signal processing and CNN algorithms, respectively. It therefore achieves significant efficiency gains over the GPU across a wide range of applications.
Fig. 14. Energy efficiency (performance-per-watt) Comparison with GPU.
Figure 15 shows the energy efficiency comparison normalized to the GPU. For each algorithm mapped to our hardware, we select the best energy efficiency for each parameter scale among the different modes, and use the average over all parameter scales for comparison. On average, our design achieves a 7.49× efficiency improvement over the GPU, 2.01× over RABP, 1.34× over REVEL, and 1.19× over Plasticine. RABP is a SIMD-free solution in which data parallelism is not exploited, resulting in inefficiency. REVEL achieves the highest efficiency for matrix multiplication because its heterogeneous design combines systolic and dataflow arrays: matrix multiplication executes on the systolic array with high utilization and low control overhead. However, for algorithms with poor data parallelism, such as sorting and SHA256, REVEL performs poorly in both utilization and energy efficiency. For most algorithms, Plasticine achieves high energy efficiency thanks to its wide SIMD design and high peak performance, but it suffers from low utilization on small-scale parameters, which limits its energy efficiency.
Fig. 15. Comparison with state-of-the-art dataflow designs.
Table 5 shows the hardware comparison of our architecture with the other architectures. These architectures are extended to have similar peak performance and the same process, and each runs at the clock frequency reported in its original paper. RABP achieves energy efficiencies in the range of 32.67 \(\sim\) 89.52 GFLOPS/W. REVEL obtains an efficiency of up to 137.29 GFLOPS/W, but only 27.85 GFLOPS/W in the worst case. Plasticine has a high peak performance and attains energy efficiencies in the range of 63.50 \(\sim\) 150.86 GFLOPS/W. Our final architecture has a peak floating-point performance of 320 single-precision GFLOPS and achieves energy efficiencies in the range of 94.38 \(\sim\) 154.25 GFLOPS/W. However, our design has a large area overhead caused by the execution engines and their interconnections. Compared with the other reconfigurable architectures, our design exhibits more stable efficiency across a variety of situations.
Arch | GPU | RABP | REVEL | Plasticine | OURS
Tech (nm) | 12 | 28 | 28 | 28 | 28
Area (\(mm^2\)) | - | 11.80 | 22.70 | 93.76 | 10.28
Max Power (W) | 300 | 3.488 | 2.282 | 1.933 | 2.040
Freq (GHz) | - | 0.8 | 1.25 | 1 | 1.25
Peak Perf (GFLOPS) | 15700 | 320 | 320 | 410 | 320
Efficiency (GFLOPS/W) | 11.51 \(\sim\) 21.68 | 32.67 \(\sim\) 89.52 | 27.85 \(\sim\) 137.29 | 63.50 \(\sim\) 150.86 | 94.38 \(\sim\) 154.25
Table 5. Hardware Comparisons
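As a simple reference point for Table 5, dividing our peak throughput by our maximum power gives \(320\ \text{GFLOPS} / 2.040\ \text{W} \approx 156.9\ \text{GFLOPS/W}\); the best measured efficiency of 154.25 GFLOPS/W comes close to this figure.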

6.3 Cost and Discussions

The overhead of our proposed method falls mainly into the following areas. First is the overhead added to the configuration context, where configuration information for batch processing is added to the PE array. On the one hand, the final design supports five different modes, so three bits are required to encode the mode. On the other hand, with 16 execution engines in each PE, the mode engines need to generate a 32-bit routing configuration message. Next is the time overhead of the reconfiguration process: for a specific computational task, the array only needs to be configured once. Figure 16 shows the proportion of time spent on microarchitecture configuration versus algorithm execution; configuration accounts for an average of 6.91% of the total time. Figure 16 also illustrates the latency of the hardware components for different applications. Finally, our architecture has an area overhead similar to RABP's, yet larger than REVEL's and Plasticine's. Compared with REVEL, each of our PEs contains both INT and FP computational units, while REVEL's PEs contain only part of them, and our on-chip storage capacity is larger. Compared with Plasticine, the storage hierarchy differs: our on-chip storage contains both the global buffer shared by the PEs and the private buffer within each PE, while Plasticine has only the on-chip pattern memory units, whose total storage capacity is smaller than our design's.
Fig. 16. The ratio of configuration time to execution time.
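The configuration-context overhead described above amounts to a small per-PE bit budget. The sketch below is a hedged illustration; only the 3-bit mode field and the 32-bit routing message follow directly from the text, while the struct layout and the per-engine split are our assumptions.

#include <stdint.h>

/* Illustrative per-PE batch-mode configuration context. */
typedef struct {
    uint8_t  mode;        /* 3 bits used: one of the 5 supported modes      */
    uint32_t routing_cfg; /* 32-bit message for the 16 execution engines    */
                          /* (e.g., 2 bits of routing state per engine)     */
} pe_batch_config_t;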

7 Conclusion

In this article, we describe a novel reconfigurable architecture with flexible multi-batch modes. We devise a unified scale-vector architecture that reaps the benefits of the single-instruction-single-data and single-instruction-multiple-data execution models at the same time: it executes operations with distinct computation patterns on a single execution unit, while performing operations with the same computation pattern on a cluster of units. With a more fine-grained DFG node scheduling and trigger mechanism, our design achieves significant utilization and performance improvements on key application domains, as well as significant performance and energy-efficiency gains over state-of-the-art designs.

Footnotes

References

[1]
Saambhavi Baskaran, Mahmut Taylan Kandemir, and Jack Sampson. 2022. An architecture interface and offload model for low-overhead, near-data, distributed accelerators. In 55th IEEE/ACM International Symposium on Microarchitecture (MICRO 2022) (Chicago, IL, October 1-5, 2022). IEEE, 1160–1177.
[2]
Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Architectural Support for Programming Languages and Operating Systems (ASPLOS 2014) (Salt Lake City, UT, March 1-5, 2014), Rajeev Balasubramonian, Al Davis, and Sarita V. Adve (Eds.). ACM, 269–284.
[3]
Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. DianNao family: Energy-efficient hardware accelerators for machine learning. Commun. ACM 59, 11 (2016), 105–112.
[4]
Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A machine-learning supercomputer. In 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2014) (Cambridge, United Kingdom, December 13-17, 2014). IEEE Computer Society, 609–622.
[5]
Vidushi Dadu and Tony Nowatzki. 2022. TaskStream: Accelerating task-parallel workloads by recovering program structure. In ASPLOS. 1–13.
[6]
Groq Dale Southard, Ecosystem Solutions Distinguished Architect. 2019. Tensor streaming architecture delivers unmatched Performance for compute-intensive workloads. https://groq.com/wp-content/uploads/2019/10/Groq_Whitepaper_2019Oct.pdf
[7]
Chunhua Deng, Yang Sui, Siyu Liao, Xuehai Qian, and Bo Yuan. 2021. GoSPA: An energy-efficient high-performance globally optimized sparse convolutional neural network accelerator. In 48th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA 2021) (Valencia, Spain, June 14-18, 2021). IEEE, 1110–1123.
[8]
Robert H. Dennard, Fritz H. Gaensslen, Hwa-Nien Yu, V. Leo Rideout, Ernest Bassous, and Andre R. Leblanc. 1974. Design of ion-implanted MOSFET’s with very small physical dimensions. Proc. IEEE 87, 4 (1974), 668–678.
[9]
Jack B. Dennis. 1974. First version of a data flow procedure language. In Programming Symposium, Proceedings Colloque sur la Programmation (Paris, France, April 9-11, 1974)(Lecture Notes in Computer Science, Vol. 19), Bernard J. Robinet (Ed.). Springer, 362–376.
[10]
Dongrui Fan, Wenming Li, Xiaochun Ye, Da Wang, Hao Zhang, Zhimin Tang, and Ninghui Sun. 2018. SmarCo: An efficient many-core processor for high-throughput applications in datacenters. In IEEE International Symposium on High Performance Computer Architecture (HPCA 2018) (Vienna, Austria, February 24-28, 2018). IEEE Computer Society, 596–607.
[11]
Zhihua Fan and Wenming Li. 2023. Improving utilization of dataflow architectures through software and hardware co-design. In EuroPar. 1–14.
[12]
Adi Fuchs and David Wentzlaff. 2019. The accelerator wall: Limits of chip specialization. In 25th IEEE International Symposium on High Performance Computer Architecture (HPCA 2019) (Washington, DC, February 16-20, 2019). IEEE, 1–14.
[13]
Roberto Giorgi and Paolo Faraboschi. 2014. An introduction to DF-threads and their execution model. In 2014 International Symposium on Computer Architecture and High Performance Computing Workshop. 60–65.
[14]
Sumanth Gudaparthi, Sarabjeet Singh, Surya Narayanan, Rajeev Balasubramonian, and Visvesh Sathe. 2022. CANDLES: Channel-aware novel dataflow-microarchitecture co-design for low energy sparse neural network acceleration. In IEEE International Symposium on High-Performance Computer Architecture (HPCA 2022) (Seoul, South Korea, April 2-6, 2022). IEEE, 876–891.
[15]
Jawad Haj-Yahya, Haris Volos, Davide B. Bartolini, Georgia Antoniou, Jeremie S. Kim, Zhe Wang, Kleovoulos Kalaitzidis, Tom Rollet, Zhirui Chen, Ye Geng, Onur Mutlu, and Yiannakis Sazeides. 2022. AgileWatts: An energy-efficient CPU core idle-state architecture for latency-sensitive server applications. In 55th IEEE/ACM International Symposium on Microarchitecture (MICRO 2022) (Chicago, IL, October 1-5, 2022). IEEE, 835–850.
[16]
Tae Jun Ham, Juan L. Aragón, and Margaret Martonosi. 2015. DeSC: Decoupled supply-compute communication management for heterogeneous architectures. In MICRO. 191–203.
[17]
Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In 37th International Symposium on Computer Architecture (ISCA 2010), (Saint-Malo, France, June 19-23, 2010), André Seznec, Uri C. Weiser, and Ronny Ronen (Eds.). ACM, 37–47.
[18]
Yifan Hao, Yongwei Zhao, Chenxiao Liu, Zidong Du, Shuyao Cheng, Xiaqing Li, Xing Hu, Qi Guo, Zhiwei Xu, and Tianshi Chen. 2022. Cambricon-P: A bitflow architecture for arbitrary precision computing. In 55th IEEE/ACM International Symposium on Microarchitecture (MICRO 2022) (Chicago, IL, October 1-5, 2022). IEEE, 57–72.
[19]
Sean Kinzer, Joon Kyung Kim, Soroush Ghodrati, Brahmendra Reddy Yatham, Alric Althoff, Divya Mahajan, Sorin Lerner, and Hadi Esmaeilzadeh. 2021. A computational stack for cross-domain acceleration. In IEEE International Symposium on High-Performance Computer Architecture (HPCA 2021) (Seoul, South Korea, February 27 - March 3, 2021). IEEE, 54–70.
[20]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (2017), 84–90.
[21]
Hunjun Lee, Minseop Kim, Dongmoon Min, Joonsung Kim, Jongwon Back, Honam Yoo, Jong-Ho Lee, and Jangwoo Kim. 2022. 3D-FPIM: An extreme energy-efficient DNN acceleration system using 3D NAND flash-based in-situ PIM unit. In 55th IEEE/ACM International Symposium on Microarchitecture (MICRO 2022) (Chicago, IL, October 1-5, 2022). IEEE, 1359–1376.
[22]
Yejin Lee, Hyunji Choi, Sunhong Min, Hyunseung Lee, Sangwon Beak, Dawoon Jeong, Jae W. Lee, and Tae Jun Ham. 2022. ANNA: Specialized architecture for approximate nearest neighbor search. In IEEE International Symposium on High-Performance Computer Architecture (HPCA 2022) (Seoul, South Korea, April 2-6, 2022). IEEE, 169–183.
[23]
Shiyu Li, Edward Hanson, Xuehai Qian, Hai (Helen) Li, and Yiran Chen. 2021. ESCALATE: Boosting the efficiency of sparse CNN accelerator with kernel decomposition. In 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21) (Virtual Event, Greece, October 18-22, 2021). ACM, 992–1004.
[24]
Declan Loughlin, Aedan Coffey, Frank Callaly, Darren Lyons, and Fearghal Morgan. 2014. Xilinx vivado high level synthesis: Case studies. In 25th IET Irish Signals and Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies. 352–356.
[25]
Liqiang Lu, Yicheng Jin, Hangrui Bi, Zizhang Luo, Peng Li, Tao Wang, and Yun Liang. 2021. Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture. In 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21) (Virtual Event, Greece, October 18-22, 2021). ACM, 977–991.
[26]
Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. 2017. FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA 2017) (Austin, TX, February 4-8, 2017). IEEE Computer Society, 553–564.
[27]
Xiaohan Ma, Chang Si, Ying Wang, Cheng Liu, and Lei Zhang. 2021. NASA: Accelerating neural network design with a NAS processor. In 48th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA 2021) (Valencia, Spain, June 14-18, 2021). IEEE, 790–803.
[28]
Xingchen Man, Leibo Liu, Jianfeng Zhu, and Shaojun Wei. 2019. A general pattern-based dynamic compilation framework for coarse-grained reconfigurable architectures. In 56th Annual Design Automation Conference 2019 (DAC 2019) (Las Vegas, NV, June 02-06, 2019). ACM, 195.
[29]
Xingchen Man, Jianfeng Zhu, Guihuan Song, Shouyi Yin, Shaojun Wei, and Leibo Liu. 2022. CaSMap: Agile mapper for reconfigurable spatial architectures by automatically clustering intermediate representations and scattering mapping process. In 49th Annual International Symposium on Computer Architecture (ISCA’22) (New York, June 18-22, 2022), Valentina Salapura, Mohamed Zahran, Fred Chong, and Lingjia Tang (Eds.). ACM, 259–273.
[30]
George Matheou and Paraskevas Evripidou. 2015. Architectural support for data-driven execution. ACM Trans. Archit. Code Optim. 11, 4, Article 52 (Jan 2015), 25 pages.
[31]
Gordon E. Moore. 1998. Cramming more components onto integrated circuits. Proc. IEEE 86, 1 (1998), 82–85.
[32]
Sean Murray, William Floyd-Jones, Ying Qi, George Dimitri Konidaris, and Daniel J. Sorin. 2016. The microarchitecture of a real-time robot motion planning accelerator. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2016) (Taipei, Taiwan, October 15-19, 2016). IEEE Computer Society, 45:1–45:12.
[33]
Ponnanna Kelettira Muthappa, Florian Neugebauer, Ilia Polian, and John P. Hayes. 2020. Hardware-based fast real-time image classification with stochastic computing. In 38th IEEE International Conference on Computer Design (ICCD 2020) (Hartford, CT, October 18-21, 2020). IEEE, 340–347.
[34]
Anna Maria Nestorov, Enrico Reggiani, Hristina Palikareva, Pavel Burovskiy, Tobias Becker, and Marco D. Santambrogio. 2017. A scalable dataflow implementation of Curran’s approximation algorithm. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops. 150–157.
[35]
Quan M. Nguyen and Daniel Sánchez. 2020. Pipette: Improving core utilization on irregular applications through intra-core pipeline parallelism. In 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2020), (Athens, Greece, October 17-21, 2020). IEEE, 596–608.
[36]
Quan M. Nguyen and Daniel Sanchez. 2021. Fifer: Practical acceleration of irregular applications on reconfigurable architectures. In 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21) (Virtual Event, Greece, October 18-22, 2021). ACM, 1064–1077.
[37]
Gregory M. Papadopoulos and David E. Culler. 1990. Monsoon: An explicit token-store architecture. SIGARCH Comput. Archit. News 18, 2SI (May 1990), 82–91.
[38]
Guiqiang Peng, Leibo Liu, Sheng Zhou, Shouyi Yin, and Shaojun Wei. 2020. A 2.92-Gb/s/W and 0.43-Gb/s/MG flexible and scalable CGRA-based baseband processor for massive MIMO detection. IEEE Journal of Solid-State Circuits 55, 2 (2020), 505–519.
[39]
Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. 2017. Plasticine: A reconfigurable architecture for parallel patterns. In ISCA. ACM, 389–402.
[40]
A. Smith, J. Burrill, J. Gibson, B. Maher, N. Nethercote, B. Yoder, D. Burger, and K. S. McKinley. 2006. Compiling for EDGE architectures. In International Symposium on Code Generation and Optimization (CGO’06). 185–195.
[41]
James E. Smith. 1982. Decoupled access/execute computer architectures. SIGARCH Comput. Archit. News 10, 3 (Apr. 1982), 112–119.
[42]
Joshua Suettlerlein, Stéphane Zuckerman, and Guang R. Gao. 2013. An implementation of the codelet model. In Euro-Par 2013 Parallel Processing. 1–14.
[43]
Cheng Tan, Nicolas Bohm Agostini, Tong Geng, Chenhao Xie, Jiajia Li, Ang Li, Kevin J. Barker, and Antonino Tumeo. 2022. DRIPS: Dynamic rebalancing of pipelined streaming applications on CGRAs. In IEEE International Symposium on High-Performance Computer Architecture(HPCA 2022) (Seoul, South Korea, April 2-6, 2022). IEEE, 304–316.
[44]
Zhanhong Tan, Hongyu Cai, Runpei Dong, and Kaisheng Ma. 2021. NN-baton: DNN workload orchestration and chiplet granularity exploration for multichip accelerators. In 48th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA 2021) (Valencia, Spain, June 14-18, 2021). IEEE, 1013–1026.
[45]
Michael B. Taylor, Jason Sungtae Kim, Jason E. Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffmann, Paul R. Johnson, Jae W. Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matthew I. Frank, Saman P. Amarasinghe, and Anant Agarwal. 2002. The raw microprocessor: A computational fabric for software circuits and general-purpose programs. IEEE Micro 22, 2 (2002), 25–35.
[46]
David Tse and Pramod Viswanath. 2005. Fundamentals of Wireless Communication. Cambridge University Press.
[47]
Matthew Vilim, Alexander Rucker, and Kunle Olukotun. 2021. Aurochs: An architecture for dataflow threads. In 48th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA 2021) (Valencia, Spain, June 14-18, 2021). IEEE, 402–415.
[48]
Nils Voss, Marco Bacis, Oskar Mencer, Georgi Gaydadjiev, and Wayne Luk. 2017. Convolutional neural networks on dataflow engines. In 2017 IEEE International Conference on Computer Design (ICCD). 435–438.
[49]
Nils Voss, Pablo Quintana, Oskar Mencer, Wayne Luk, and Georgi Gaydadjiev. 2019. Memory mapping for multi-die FPGAs. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 78–86.
[50]
Xingbin Wang, Boyan Zhao, Rui Hou, Amro Awad, Zhihong Tian, and Dan Meng. 2021. NASGuard: A novel accelerator architecture for robust neural architecture search (NAS) networks. In 48th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA 2021) (Valencia, Spain, June 14-18, 2021). IEEE, 776–789.
[51]
Jian Weng, Sihao Liu, Zhengrong Wang, Vidushi Dadu, and Tony Nowatzki. 2020. A hybrid systolic-dataflow architecture for inductive matrix algorithms. In HPCA. 703–716.
[52]
Xinxin Wu, Zhihua Fan, Tianyu Liu, Wenming Li, Xiaochun Ye, and Dongrui Fan. 2022. LRP: Predictive output activation based on SVD approach for CNNs acceleration. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE 2022) (Antwerp, Belgium, March 14-23, 2022), Cristiana Bolchini, Ingrid Verbauwhede, and Ioana Vatajelu (Eds.). IEEE, 831–836.
[53]
Jianguo Yao, Hao Zhou, Yalin Zhang, Ying Li, Chuang Feng, Shi Chen, Jiaoyan Chen, Yongdong Wang, and Qiaojuan Hu. 2023. High performance and power efficient accelerator for cloud inference. In IEEE International Symposium on High-Performance Computer Architecture (HPCA 2023) (Montreal, QC, Canada, February 25 - March 1, 2023). IEEE, 1003–1016.
[54]
Amir Yazdanbakhsh, Kambiz Samadi, Nam Sung Kim, Hadi Esmaeilzadeh, Hajar Falahati, and Philip J. Wolfe. 2018. GANAX: A unified MIMD-SIMD acceleration for generative adversarial networks. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture. 650–661.
[55]
Xiaochun Ye, Dongrui Fan, Ninghui Sun, Shibin Tang, Mingzhe Zhang, and Hao Zhang. 2013. SimICT: A fast and flexible framework for performance and power evaluation of large-scale architecture. In International Symposium on Low Power Electronics and Design (ISLPED) (Beijing, China, September 4-6, 2013), Pai H. Chou, Ru Huang, Yuan Xie, and Tanay Karnik (Eds.). IEEE, 273–278.
[56]
Xiaochun Ye, Xu Tan, Meng Wu, Yujing Feng, Da Wang, Hao Zhang, Songwen Pei, and Dongrui Fan. 2020. An efficient dataflow accelerator for scientific applications. Future Gener. Comput. Syst. 112 (2020), 580–588.
[57]
Chen Yin and Qin Wang. 2021. Subgraph decoupling and rescheduling for increased utilization in CGRA architecture. In DATE. 1394–1399.
[58]
Shouyi Yin, Shibin Tang, Xinhan Lin, Peng Ouyang, Fengbin Tu, Leibo Liu, and Shaojun Wei. 2019. A high throughput acceleration for hybrid neural networks with efficient resource management on FPGA. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 38, 4 (2019), 678–691.
[59]
Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’15) (Monterey, California) . ACM, New York, 161–170.
[60]
Yunan Zhang, Po-An Tsai, and Hung-Wei Tseng. 2022. SIMD\(^2\): A generalized matrix instruction set for accelerating tensor computation beyond GEMM. In The 49th Annual International Symposium on Computer Architecture (ISCA’22) (New York, June 18-22, 2022), Valentina Salapura, Mohamed Zahran, Fred Chong, and Lingjia Tang (Eds.). ACM, 552–566.
[61]
Yaqi Zhang, Nathan Zhang, Tian Zhao, Matt Vilim, Muhammad Shahbaz, and Kunle Olukotun. 2021. SARA: Scaling a reconfigurable dataflow accelerator. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). 1041–1054.
[62]
Shixuan Zheng, Xianjue Zhang, Leibo Liu, Shaojun Wei, and Shouyi Yin. 2022. Atomic dataflow based graph-level workload orchestration for scalable DNN accelerators. In IEEE International Symposium on High-Performance Computer Architecture (HPCA 2022) (Seoul, South Korea, April 2-6, 2022). IEEE, 475–489.

    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 21, Issue 1, March 2024, 500 pages
    EISSN: 1544-3973
    DOI: 10.1145/3613496
    Editor: David Kaeli
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 15 February 2024
    Online AM: 18 December 2023
    Accepted: 11 December 2023
    Revised: 10 November 2023
    Received: 04 May 2023
    Published in TACO Volume 21, Issue 1

    Author Tags

    1. Utilization
    2. network-on-chip
    3. decoupled architecture
    4. batch processing

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program of China
    • Beijing Nova Program
    • CAS Project for Young Scientists in Basic Research
    • CAS Project for Youth Innovation Promotion Association and Open Research Projects of Zhejiang Lab
