
Improving Utilization of Dataflow Unit for Multi-Batch Processing

Published: 15 February 2024

Abstract

Dataflow architectures can achieve much better performance and higher efficiency than general-purpose cores, approaching the performance of a specialized design while retaining programmability. However, advanced application scenarios place higher demands on the hardware in terms of cross-domain and multi-batch processing. In this article, we propose a unified scale-vector architecture that can work in multiple modes and adapt to diverse algorithms and requirements efficiently. First, a novel reconfigurable interconnection structure is proposed, which can organize execution units into different cluster topologies to accommodate different degrees of data-level parallelism. Second, we decouple the threads within each DFG node into consecutive pipeline stages and provide architectural support. By time-multiplexing these stages, the dataflow hardware can achieve much higher utilization and performance. In addition, the task-based program model can exploit multi-level parallelism and deploy applications efficiently. Evaluated on a wide range of benchmarks, including digital signal processing algorithms, CNNs, and scientific computing algorithms, our design attains up to 11.95× energy efficiency (performance-per-watt) improvement over a GPU (V100), and 2.01× energy efficiency improvement over state-of-the-art dataflow architectures.

1 Introduction

Advances in integrated circuit technology have served as a primary approach to enhancing computing power over the past decades. However, this approach is losing effectiveness as Moore’s Law [31] and Dennard scaling [8] slow down or even come to an end. Improving the efficient utilization of hardware resources and enhancing the energy efficiency of architectures have emerged as prominent areas of research in the field of computer architecture [15, 21, 23, 53, 60].
Another emerging challenge, known as cross-domain processing, introduces a novel trade-off between energy efficiency and expressiveness, as depicted in Figure 1. On one end, we have General-Purpose Processors that enable the representation of all domains, albeit at the cost of performance and/or efficiency. On the opposite end, domain-specific accelerators that cater to a single domain and run on specialized architectures exhibit high performance. Nevertheless, creating an end-to-end application that spans multiple domains necessitates a deep understanding of various interfaces and diverse hardware accelerators. Consequently, the development of cross-domain accelerator stacks remains an ongoing challenge [19]. Ideally, we aim for architectures that approach maximal specialization in terms of efficiency, while also being programmable and capable of executing a wide array of applications. The dataflow architecture holds promise in attaining this objective.
Fig. 1. Emerging tradeoff (left) and the high-level abstraction of dataflow architectures (right).
For a given kernel, a fixed circuit is formed, which enables repeated execution, thus approaching the efficiency of an ASIC. With a reconfigurable datapath, dataflow architectures can harness multi-level parallelism, leading to a significant enhancement in their computational throughput and efficiency. As interest in dataflow architectures grows, the significance of maximizing the utilization of available on-chip cores through programming is escalating. This is particularly prominent in the context of dataflow architecture, where a single processor houses a greater number of simpler Processing Elements (PEs) compared to a typical multi-core processor. Figure 1 also illustrates the high-level abstraction of dataflow program execution. Characteristics represent abstractions of the application, encompassing factors like batch data size, regularity, and irregular patterns. The dataflow execution model defines the operational mechanism of both the hardware microarchitecture and the scheduling policy.
Abundant prior works have been proposed to improve the utilization of dataflow architectures (Section 3): pipeline parallelism [35, 39, 61], decoupled access-execute architectures [16, 36, 45, 51], and dedicated interfaces between cores or threads [5, 57]. Nevertheless, these solutions are inefficient because they: (i) lack flexibility, since they rarely consider the impact of data size on utilization, whereas we found that the hardware is limited when the data size does not match the vectorized design of the hardware; and (ii) lack fine-grained pipeline scheduling, since the scheduling of each DFG node in these works is coarse-grained, which misses opportunities to exploit more parallelism within DFG nodes to boost utilization.
To this end, we introduce a reconfigurable dataflow architecture for multi-batch data processing. The contributions that we made are as follows:
We propose a novel reconfigurable interconnection structure that can organize execution units into different cluster topologies as a way to accommodate different data sizes.
We introduce a decoupled dataflow execution model and provide architectural support for the model. By decoupling the datapath of different stages and equipping with a dedicated scheduler within each PE, the DFG nodes of different iterations can be pipelined more efficiently.
We evaluate our methods on a wide range of applications, demonstrating their applicability. Experiments show that our design attains up to 11.95× energy efficiency improvement over GPU (V100) and 2.01× energy efficiency improvement over state-of-the-art dataflow architectures.
The rest of this article is organized as follows: In Section 2, we discuss the background. In Section 3, we present related works, based on which we motivate the need for improving the utilization of the dataflow fabric for multi-batch and cross-domain processing. In Section 4, we present our methods. We discuss our experimental methodology and results in Section 5 and Section 6, respectively. We finally conclude this article in Section 7.

2 Background

In this section, we describe the characteristics of current emerging applications and the new challenges posed to the hardware, and then we introduce the dataflow architecture.

2.1 Cross-Domain Processing and Multi-Batch Processing

Emerging cross-domain technologies have significantly transformed people’s lives. Cross-domain application scenarios are becoming increasingly critical computational workloads for computing platforms, spanning various domains from delivery drones to smart speakers [32]. One such application involves a sequence of steps: (1) sensing the environment, (2) pre-processing input data, which is then fed to (3) a perception module, triggering a subsequent (4) decision-making process to determine actions. Currently, perception is primarily driven by deep learning, which has garnered substantial attention. However, applications are not exclusively reliant on deep learning. Sensory data processing leverages algorithms from Digital Signal Processing (DSP), while Control Theory and Robotics play a role in the final actions, which can also provide feedback to the perception module.
Despite these domains working in concert to realize complete applications, they are facing isolation due to the prevailing trend towards Domain-Specific Accelerators (DSAs). On one hand, traditional general-purpose computational stacks struggle to meet the computational demands of emerging applications [17]. On the other hand, these DSAs [2, 3, 4] sacrifice generality for performance and energy efficiency, limiting programmability to a single domain. While DSAs address the performance gap of General-Purpose Processors (GPPs), they introduce the challenge of dealing with isolated programming interfaces, complicating implementation. Consequently, the scope of expressiveness is curtailed, making the composition of cross-domain applications a significant hurdle when executed on accelerators. While recent advancements are pushing the boundaries of DSAs for improved performance and energy efficiency, a recent study on chip specialization has predicted an eventual ‘accelerator wall’ [12]. Specifically, due to limitations in mapping computational problems onto hardware platforms with fixed resources, the optimization space for chip specialization is bounded by a theoretical limit.
Cross-domain application scenarios have also become more complex in terms of batch size as well as data parallelism, making it more difficult to improve the efficient use of computing resources. On one hand, the number of users at the edge side is time-sensitive [46]. Indeed, over time, ranging from low-load periods (e.g., late at night) to peak periods (peak hours), the quantity of matrix batches for uplink/downlink algorithms varies from \(2\times 2\) to \(N\times N\) , where N ranges from tens to hundreds. This real-time fluctuation in the count of active antennas and receivers exerts diverse throughput demands on the hardware. On the other hand, the number of input data batches (e.g., the number of channels in the activations) varies significantly across network layers as the depth of the deep neural networks used for inference on the server side increases. For instance, the count of channel batches in Alexnet [20] fluctuates from 3 to 384, contingent upon the number of convolutional kernels. Consequently, in cross-domain application scenarios, there are instances involving both small and large input data batches concurrently, implying a wide variation in the extent of data parallelism. The optimal hardware would ideally support large-scale processing of multiple data batches while efficiently managing discrete small data batches. Vector processing techniques, like SIMD (Single Instruction Multiple Data), are widely employed methods for performing batch processing by exploiting data-level parallelism. However, this technique lacks the required flexibility for accommodating various batch sizes. As data parallelism intensifies, the architecture’s efficiency scales up in tandem with batch size augmentation. Nonetheless, this correlation is not boundless in its growth. Beyond a certain threshold, wherein the architecture’s vectorized capacity surpasses the inherent data parallelism of the application, surplus underutilized lanes emerge. This surplus leads to a subsequent diminution in the architecture’s overall efficiency.
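To make the lane-utilization argument concrete, the short sketch below computes the fraction of vector lanes doing useful work for a given batch size. It is illustrative only; the 16-lane width is an assumed vector width, and the channel counts 3 and 384 are taken from the AlexNet example above.

```c
#include <stdio.h>

/* Fraction of SIMD lanes doing useful work when `batch` independent
 * items are processed on hardware with `lanes` vector lanes. */
static double lane_utilization(int batch, int lanes) {
    int vector_ops = (batch + lanes - 1) / lanes;   /* ceil(batch / lanes) */
    return (double)batch / (vector_ops * lanes);
}

int main(void) {
    /* 3 input channels on an assumed 16-wide design use ~19% of the lanes,
     * while 384 channels keep every lane busy. */
    printf("batch=3,   lanes=16 -> %.2f\n", lane_utilization(3, 16));
    printf("batch=384, lanes=16 -> %.2f\n", lane_utilization(384, 16));
    return 0;
}
```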
The demand for hardware capable of cross-domain and multi-batch processing continues to grow relentlessly. Existing programmable and ‘general-purpose’ solutions (e.g., CPUs, GPGPUs) are inadequate, as evidenced by the significant improvements and industry adoption of application and domain-specific accelerators in critical domains like machine learning [26], computer vision [33], and big data [10]. In the realm of FPGAs [58, 59], these customized datapaths are configurable at the bit level, allowing users to prototype diverse digital logic and leverage architectural support for precision computation. However, this flexibility comes with architectural inefficiencies. Bit-level reconfigurability in computation and interconnect resources incurs substantial area and power overheads. For instance, more than 60% of the chip area and power in an FPGA are dedicated to the programmable interconnect. Long combinational paths traversing multiple logic elements limit the maximum clock frequency at which an accelerator design can function. These inefficiencies have driven the development of dataflow architectures featuring word-level functional units that align with the computational demands of many accelerated applications. Dataflow architectures offer dense computing resources, power efficiency, and clock frequencies up to an order of magnitude higher than FPGAs.

2.2 Dataflow Architecture

With the growing interest in many-core architectures, driven in part by ongoing transistor scaling and the consequent anticipated exponential rise in the number of on-chip cores, the significance of optimizing the utilization of available on-chip cores through programming is on the rise. In this context, dataflow program execution models are gaining increasing attention. The dataflow model was initially proposed by Dennis [9] to harness instruction-level parallelism. The dataflow model introduces an alternative order of code execution compared to the traditional control flow model, emphasizing the pivotal role of data. A dataflow program is delineated by a dataflow graph (DFG), composed of nodes and directed edges that connect these nodes. Nodes signify computations, while edges signify data dependencies between nodes. The fundamental principle of the dataflow execution model is that any DFG node can be executed as soon as all the operands it requires are available (the dataflow principle [9]).
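As a minimal illustration of the dataflow principle, the sketch below marks a DFG node as ready to fire once its last outstanding operand arrives. The types and names are hypothetical and are not part of any real implementation.

```c
#include <stdbool.h>

/* Minimal sketch of the dataflow firing rule: a DFG node may execute
 * as soon as every one of its input operands has arrived. */
typedef struct {
    int  num_inputs;   /* in-degree of the node     */
    int  arrived;      /* operands received so far  */
    bool fired;        /* node has already executed */
} dfg_node;

/* Called when an upstream node delivers a token to node `n`;
 * returns true when `n` becomes ready to fire. */
static bool deliver_operand(dfg_node *n) {
    n->arrived++;
    return !n->fired && n->arrived == n->num_inputs;
}
```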
Figure 2 depicts the core process of a contemporary dataflow program. Initially, the compiler analyzes the computational kernels necessitating offloading to the dataflow hardware based on the program’s hints, generating corresponding DFGs by considering the data dependencies in the code. Subsequently, the assembly process converts the high-level language within each DFG node into assembly instructions. Finally, DFG nodes are mapped to Processing Elements (PEs) for scheduling and execution through a DFG mapping algorithm. Figure 2 presents a representative dataflow architecture, in which each PE is notably simpler and there is a more significant number of them within a single processor compared to a typical multi-core processor. The dataflow architecture primarily comprises a PE array, a Micro-Controller (MicC), a configuration buffer, and a data buffer. The PE array is composed of numerous PEs interconnected by an on-chip network. Within each PE, multiple pipeline functional units, a local instruction RAM, data caching register files, and a router are present. The functional units perform data processing based on the instructions stored in the instruction RAM. The router’s role is to parse and forward packets, facilitating data exchange between PEs. To efficiently handle multi-batch processing, vector-oriented designs like SIMD (Single Instruction Multiple Data) are frequently employed within PEs.
Fig. 2. Illustration of the execution process of a dataflow program (a) and a typical dataflow architecture (b). ('TRANS': a customized dataflow instruction whose function is to transfer data to the PE where the downstream nodes are located.)
The dataflow processor operates as a co-processor or accelerator alongside the host processor, collaborating to execute program computations. In essence, the dataflow processor necessitates configuration from the host. The micro-controller furnishes the interface for host-side configuration and manages the execution of the PE array. The configuration buffer stores configuration details received from the host, encompassing kernel parameters, mapping information, and more. Both configuration and input data can be preloaded into the on-chip data buffer through Direct Memory Access (DMA) mechanisms. The computational resources of dataflow architectures are numerous and spatially distributed. Maximizing the use of these resources becomes critical to improving the performance and energy efficiency of the dataflow processor. Therefore, in recent years, many studies have been proposed to improve the utilization of dataflow processors. We will discuss them in the next section.

3 Related Works

Software Parallelism. Dataflow architectures are amenable to creating static spatial pipelines, in which an application is split into DFG nodes and mapped to functional units across the fabric [39, 52, 56, 61]. To perform a particular computation, operands are passed from one functional unit to the next in this fixed pipeline. Pipette [35] structures applications as a pipeline of DFG nodes connected by queues. The queues hide latency by allowing producer nodes to run far ahead of consumers, but Pipette exploits this property on general-purpose cores rather than in a specialized architecture. These efforts may be inefficient for irregular workloads due to load imbalance among DFG nodes. SARA [61], a compiler for reconfigurable dataflow architectures, employs a novel mapping strategy to efficiently utilize large-scale accelerators. It decomposes the application DFG across the distributed resources to hide low-level reconfigurable dataflow architecture constraints and exploits dataflow parallelism within and across hyper-blocks to saturate the computation throughput. Atomic dataflow [62] schedules DFGs at atom granularity to ensure PE-array utilization and supports flexible atom mapping and computing to optimize data reuse in the architecture. GoSPA [7], leveraging the idea of on-the-fly intersection and specialized computation reordering, recodes the sparsity information to deliver necessary values to the compute units and reorders the computation to reduce the fetch time. Based on these two ideas, GoSPA optimizes the sparse convolutional neural network accelerator globally and achieves high performance and energy efficiency. ANNA [22] introduces a memory traffic optimization technique to accelerate ANNS algorithms, which reduces memory traffic and improves performance by reusing data efficiently. Monsoon [37] is a dynamic dataflow architecture that employs tokens to designate various thread contexts. A dataflow program can be called by different threads, and the token serves to mark these threads. When a matching token is identified, it is extracted, enabling the corresponding instruction for execution. If no matching token is found, the incoming token is stored for future use. In the TRIPS [40] dataflow architecture, an instruction serves as the basic unit for launching and scheduling, and an instruction whose operands are ready is dispatched to the compute unit for execution, which represents an instruction-level dataflow model. Groq [6] introduces a new, simpler processing architecture designed specifically for the performance requirements of machine learning applications. Groq’s overall product architecture provides an innovative and unique approach to accelerated computation, offering a new paradigm for achieving both flexibility and massive parallelism without the limitations and communication overheads of traditional GPU and CPU architectures. The Groq compiler orchestrates everything: data flows into the chip and is consumed at the right time and the right place so that calculations occur immediately, with no stalls. Maxeler [34, 48, 49] provides various dataflow engines such as MPC-X, MPC-C, and MPC-N. In Maxeler’s dataflow approach, program source code is transformed into dataflow engine configuration files, which describe the operation, layout, and connections of the dataflow engine. The hardware (FPGAs) then generates specific circuits based on these profiles, similar to Xilinx HLS (High-Level Synthesis) [24].
SW/HW Custom Interface. To improve the utilization of the dataflow fabric, recent works have focused on software and hardware co-designed architectures. Aurochs [47] introduces a threading model for a reconfigurable dataflow accelerator and uses lightweight thread contexts to extract enormous parallelism from irregular data structures. CANDLES [14] proposes a novel microarchitecture and dataflow by adopting a pixel-first compression and channel-first dataflow, which can significantly improve the performance of deep neural network accelerators with low energy overhead. ESCALATE [23] utilizes an algorithm-hardware co-design approach to achieve a high data compression ratio and energy efficiency in convolutional neural network accelerators. The decomposed and reorganized computation stages in ESCALATE obtain maximal benefits in its basis-first dataflow and corresponding microarchitecture design. NASA [27] provides a suitable architecture for the target machine learning workload. NASA is able to partition and reschedule the candidate architecture at fine granularity to maximize data reuse. In addition, it can remove the redundant computation in the mapping stage by a special fusion unit equipped on the on-chip network, which further improves the utilization of the accelerator arrays. Sanger [25] processes the sparse attention mechanism through the coordination of a reconfigurable architecture and software-side pruning, which leads to high hardware efficiency and computing utilization. NASGuard [50] leverages a topology-aware performance prediction model and a multi-branch mapping model to prefetch data and obtain high efficiency of the underlying computing resources. Cambricon-P [18] adopts a carry-parallel computing mechanism that transforms the original multiplication into inner products to exploit computation parallelism. It also employs a bit-indexed inner-product processing scheme that eliminates bit-level redundancy in the inner-product computing unit, which further improves the computing efficiency of the architecture. DRIPS [43] manages the partial dynamic reconfiguration of coarse-grained reconfigurable arrays with the help of special software and hardware components. Based on the execution status, it can dynamically rebalance the pipeline of data-dependent streaming applications to achieve the maximum throughput.
Decoupled Hardware. DAE [41] separates the computer architecture into access processors and execution processors. The two processors execute separate programs with similar structure that perform two different functions. Fifer [36] decouples the memory access datapath from the computing pipeline. Each DFG node is divided into two stages: access and execution. Equipped with a dedicated scheduler, at most two DFG nodes can be executed on the same PE at the same time. In this way, memory access latency can be overlapped and utilization can be further improved. DESC [16] proposes a framework inspired by decoupled access and execution that can also be updated and extended for modern heterogeneous processors. REVEL [51] extends the traditional dataflow model with primitives for inductive data dependences and memory access patterns, and develops a hybrid spatial architecture combining systolic and dataflow execution. RAW [45] introduces hardware support for decoupled communication between cores, which can stream values over the network. TaskStream [5] introduces a task execution model which annotates task dependences with information sufficient to recover inter-task structure. It enables work-aware load balancing, recovery of pipelined inter-task dependences, and recovery of inter-task read sharing through multicasting. Chen et al. [57] propose subgraph decoupling and rescheduling to accelerate irregular applications, which decouples the inconsistent regions into control-independent subgraphs. Each subgraph can be rescheduled with zero-cost context switching and parallelized to fully utilize the PE resources. Saambhavi et al. [1] propose an offload interface with minimal limitations for both distributed-computation and distributed-access architecture models; it is designed for offloading arbitrary units to heterogeneous accelerator resources and offers energy-efficient orchestration of control and data with flexible communication mechanisms. NN-Baton [44], a hierarchical and analytical framework, provides an architecture consisting of three parallel hierarchies (package, chiplet, and core), which enables efficient application mapping and design exploration.
In the Codelet dataflow model, each node within the dataflow graph functions as a thread, essentially acting as the fundamental entity for initiation and execution [42]. Once all the inputs of a thread are prepared, it becomes launch-ready, embodying a thread-level dataflow model. This category also encompasses dataflow-threads [13] and data-driven multithreading [30]. These dataflow models share a common vision for dataflow execution: they aim to maximize parallelism and provide architectural support for data-driven execution, which is also consistent with our vision. Dataflow-thread [13] and our dataflow model share a common characteristic: each node in the dataflow graph is a thread containing a piece of instructions or code, and threads communicate and activate each other using the dataflow principle. However, Dataflow-thread differs from our dataflow model in the way dataflow graph nodes communicate with each other. Dataflow-thread does not define directives or interfaces for direct communication between dataflow threads; communication is achieved by reading and writing memory shared between different dataflow threads. In our work, we define dataflow directives between dataflow threads, so data from upstream nodes can be transferred directly to the computational component where the downstream nodes are located. Additionally, DF-Thread’s API interfaces follow C-like semantics, which require support from the operating system and system calls. In data-driven multithreading [30], the basic unit of scheduling is the thread, which corresponds to a dataflow graph node; it only needs to record and maintain the upstream and downstream threads for each node. In our dataflow architecture, each processing element contains four different types of functional components to support our proposed decoupled execution model. In addition to the entries used in data-driven multithreading, our scheduling table maintains the states of the four different types of components as well as the states of different threads, because in our decoupled model the computational resources can be occupied by four threads or iterations simultaneously.
Although the categories of work mentioned above (software parallelism, SW/HW custom interfaces, and decoupled hardware design) have made significant contributions to improving the utilization of dataflow units, they face new challenges when processing the advanced applications introduced in Section 2. As shown by the representative examples in Table 1, some works [6, 38, 51, 52, 54, 56] focus only on a single application field and do not consider cross-domain processing, while others [14, 36, 38, 39, 52, 56] ignore different data scales or consider only one fixed scale. Consequently, they all have limitations when processing cross-domain and multi-batch applications. DFU [11] introduces a software and hardware co-design method to enhance the hardware utilization of dataflow architectures. It introduces a decoupled execution model and provides architectural support for it. Unfortunately, DFU does not perform well in multi-batch processing scenarios. Therefore, in the face of challenges related to diverse data sizes and data parallelism in cross-domain processing, this article devises a unified scale-vector architecture that leverages the benefits of SISD and SIMD technology simultaneously. Furthermore, this article presents the task-based program execution model, which augments a dataflow architecture’s ISA with primitives for runtime task management and structured access. This article comprehensively considers both the inter-PE and intra-PE aspects, and optimizes cross-domain and multi-batch processing through co-design of the execution model and the hardware.
Design | Characteristics | Cross-Domain | Multi-batch
TRIPS [40] | Instruction-level dataflow model | - | -
Monsoon [37] | Dynamic dataflow model | - | -
Codelet [42] | Thread-level dataflow model | - | -
RABP [38] | A large-scale PE array with flexible scheduler | No | No
Groq [6] | A reconfigurable dataflow NN accelerator | No | No
LRPPU [52] | Pipeline parallelism | No | No
Fifer [36] | Decoupling execution and memory access | Yes (GP+MM) | No
Plasticine [39] | Decoupling pattern units and memory units | Yes (MM+GP) | No
CANDLES [14] | Channel-aware dataflow and hardware co-design | Yes (MM+NN) | No
DFU [11] | Decoupled execution model | Yes (NN+DSP+GP) | No
REVEL [51] | A systolic-dataflow heterogeneous platform | No | Yes (SIMD1+SIMD8)
GANAX [54] | A unified SIMD and MIMD design for GAN | No | Yes (SIMD1+SIMD4)
This article | Execution model and hardware co-design | Yes (NN+DSP+GP) | Yes (1, 2, 4, 8, 16)
Table 1. Comparisons between Representative Dataflow Architectures
‘GP’- Graph processing, ‘MM’- Matrix multiplication, ‘NN’- Neural networks, ‘DSP’-Digital signal processing.

4 Our Methods

In this section, we optimize the micro-architecture and dataflow program execution model with the aim of improving the resource utilization of the dataflow architecture for multi-batch processing. First, at the inter-PE level, we designed a configurable interconnect architecture that is able to work in multiple modes. Second, at the inner-PE level, we designed a fully decoupled architecture with the aim of (1) improving the utilization of computational components by overlapping the latency caused by memory access and data transfer as much as possible, and (2) increasing the throughput of the chip through a dynamic task scheduling mechanism. Finally, we designed a task-based execution model and mapping method for our dataflow architecture.

4.1 Overview

In order to mitigate resource under-utilization, we devise a unified scale-vector architecture that reaps the benefits of the single-instruction-single-data and single-instruction-multiple-data execution models at the same time. That is, while our architecture executes operations with distinct computation patterns in a single execution unit, it performs operations with the same computation pattern in a cluster unit. Figure 3 illustrates the high-level diagram of our proposed architecture, which is comprised of a set of identical multiple-mode PEs. The PEs are arranged in a 2-D array and connected through a dedicated network. Each PE consists of two engines, namely the mode engine and the execution engine. The execution engine merely performs operations, whereas the mode engine controls these execution engines to work in multiple modes. A novel decoupled architecture is designed within each execution engine, differing from traditional out-of-order cores or sequential execution cores. In addition, there are two on-chip networks, one for the transmission of configuration information and control signals, and the other for custom data transmission. There are several main considerations for such a design: (1) the bandwidth requirements of configuration information and data are different; (2) with multiple sets of networks, the control logic for routing and forwarding becomes simpler; and (3) separating the networks reduces conflicts between data packets and configuration packets, which lowers on-chip network pressure and transmission delay. The memory hierarchy is composed of an off-chip memory, on-chip global buffers and local buffers in each PE. The global on-chip buffers are shared across all PEs.
Fig. 3. The diagram of overall architecture.
In the task-based dataflow execution model, three levels of pipeline parallelism are utilized: subtask-level (dataflow graph, DFG) pipeline parallelism, DFG node-level pipeline parallelism, and instruction-level pipeline parallelism (Instruction pipelining technology is used). The subtask-level pipeline parallelism refers to the execution of each dataflow graph in a pipeline manner. The dataflow graph node-level pipeline parallelism refers to the decoupled dataflow execution within each dataflow graph node. Instruction-level pipeline parallelism is the traditional instruction pipeline. Tasks could be annotated with information that describes the operations they perform, and the hardware could take advantage of structured patterns. Performing this analysis in software may not be that profitable, especially in an accelerator system where tasks are short. Our solution is to expose task-management and operation types as first-class primitives of the hardware’s execution model. Furthermore, traditional dataflow graphs do not have the semantics of batch processing. Dataflow graphs often correspond to internal loops, while batch processing information is expressed as the number of iterations of the internal loop. When the hardware is highly reconfigurable, especially when the topology of the execution units is variable, a more flexible approach to dataflow program mapping is proposed.

4.2 Inter-PE Design

PEs are designed to be adaptive to the data sizes of different batches. First, the basic idea is to combine multiple execution engines into a cluster that performs the same computational tasks and processes multiple batches of data synchronously. As shown in Figure 4, the execution engines labeled ❶ and ❷ are combined into a cluster, and the execution engines labeled ❸ and ❹ are combined into a cluster. In this way, a PE consists of two clusters, each of which can process two batches of data in parallel. For the 4-batch mode, the four execution engines are combined into a cluster, processing four batches of data in parallel. While in 1-batch mode, each execution engine acts as a cluster. Second, the mode engine plays the role of configuration generation and distribution. On the one hand, it generates configuration information for each \(\mu\) -router. The structure of \(\mu\) -router is displayed in the right side of Figure 4. Each \(\mu\) -router consists of a set of multiplexers and routing units. The structure inside each routing box is a traditional router structure that parses and forwards packets in four directions (North, East, South, West). The input and output ports in X and Y directions have dedicated control signals (S1, S2, S3, S4, S5, S6, S7, S8) that control the connection of the routing units and the data transmission networks. On the other hand, the mode engine distributes command and control information (activation signals, ack signals, etc.) to each execution engine (datapath in red). Finally, the \(\mu\) -router structure dynamically changes the connections of the data links according to the different batch configurations, thus ensuring efficient and synchronized transmission of multi-batch data.
Fig. 4. The custom interconnect design for multi-batch processing.
Execution Engine. Each PE contains several execution engines. To facilitate understanding, we take four execution engines as an example in Figure 3. It is important to note that the number of execution engines in a PE is scalable. The execution engine consists of a function unit, a local buffer and a \(\mu\)-router. The function unit performs specific operations and supports different operations, including LD/ST, calculation and data transfer. To support diverse kernels, the calculation datapath is designed to support different data types, including integer, fixed-point, floating-point, and complex-valued. Each execution engine has a dedicated local buffer and is built with a \(\mu\)-router. The local buffer stores configurations (instructions) and data during runtime. The \(\mu\)-router is connected with the mode engine and also embedded into a circuit-switched mesh data network. When these \(\mu\)-routers receive a mode configuration from the mode engine, they are statically configured to route to each other, forming the link paths between these execution engines. Execution engines time-multiplex these links to communicate. We discuss the internal structure of the execution engines in more detail in Section 4.3.
Network-on-Chip. The interconnection plays a crucial role in the multiple-mode PE. It ensures that multiple pieces of data can reach the execution engines in the same cluster simultaneously. The structure of the interconnection in a PE can be found in Figure 4. There are two main interconnections: a network for transferring configurations (red paths in Figure 4) and a dedicated network for data (yellow and green paths in Figure 4). The configuration network transports the configurations to each \(\mu\)-router and the instructions to each execution engine. The data network consists of several data paths to accommodate the multiple-batch modes. The number of data paths in the vertical and horizontal directions is equal to the number of execution engines in that direction. In our example, the number of data paths is two, which is determined by the number of execution engines in a PE. The \(\mu\)-router is connected with the data network via crossbar switches and establishes different virtual circuit links under different configurations before the next configuration period.
Mode Engine. Each PE has a dedicated mode engine to dispatch control signals and instructions. In principle, the mode engine reconfigures the execution engines into different clusters to support multi-batch modes. As shown in Figure 3, the mode engine consists of a hierarchical controller. In our example, there are two L1-controllers and one L2-controller, connected in a tree topology. Each L1-controller is connected with two execution engines through their \(\mu\)-router interfaces, and the L2-controller is also connected with the global configuration buffer. The mode engine is mainly responsible for the following functions during the configuration period. First, it parses the PE’s multi-batch configuration, then generates configurations for each \(\mu\)-router and delivers them to each \(\mu\)-router. After the top-level controller (L2-controller) receives the task configuration information from the Global Configuration Buffer, it extracts the batch configuration field (‘B_conf’ in Figure 7) from it. Configurations for the four directions of each \(\mu\)-router are then generated according to the mode-specific rules implied by this batch configuration. Second, instructions are loaded through the mode engine and distributed to each execution engine. Since the execution engines in a PE may belong to different clusters, the controller uses a hierarchical tree-based structure, which makes control simple and easy to implement. It should be noted that the controller becomes more complicated as the number of execution engines in a PE increases: the hierarchical controller scales according to \(log_2\)(number of execution engines).
Fig. 5. The reconfigurable data path (Enabled in red). (a) Configure parameters for different modes. (b) Single-batch datapath. (c) Two-batch datapath. (d) Four-batch datapath.
Fig. 6. The architecture of each execution engine.
Fig. 7. Task-based program execution. (a) Three-level configuration hierarchy: task parameters, subtask parameters and DFG node configurations. (b) The parallel hierarchy. (c) An example of the program model.
Multiple Modes. As shown in Figure 5(a), each PE supports multiple modes: single-batch mode (Figure 5(b)), 2-batch mode (Figure 5(c)), and 4-batch mode (Figure 5(d)). Its function is controlled by an 8-bit configuration word (S1, S2, S3, S4, S5, S6, S7, S8) that is detailed in Figure 5(a). If the PE contains N execution engines, then the PE can support \(log_2N + 1\) modes, where N is a power of 2.
Single-Batch Mode. This mode is designed for algorithms with small-scale source data and little data parallelism. The PE array can be configured as a pure MIMD-like mode, in other words, a many-core architecture with a typical 2D topology. In this mode, each execution engine works as an independent core. It has its own instructions and data, processing a dataflow graph (DFG) node. Horizontal and vertical execution engines need to be connected to the same data path. Therefore, the rule for the configuration word is: “S1 == S5 && S3 == S7 && S2 == S4 && S6 == S8”. Figure 5(b) shows the network connection under the “0000-0000 (S1 to S8)” configuration.
Two-Batch Mode. Two execution engines that are connected to the same L1-controller are combined into a cluster. As shown in Figure 5(c), execution engine ❶ and execution engine ❷ serve as one cluster, while execution engine ❸ and execution engine ❹ serve as another cluster. Since the two execution engines in the Y-axis are in the same cluster, the \(\mu\)-router of execution engine ❷ should be connected to a data link different from that of engine ❶ to guarantee that the two execution engines can receive data from the Y-axis in the same cycle. Similarly, the \(\mu\)-router of execution engine ❹ should be connected to a data path different from that of execution engine ❸. In the X-axis direction, they are connected to the same data path. Since the horizontally oriented execution engines need to interact, they must be connected to the same data path, whereas the vertically oriented execution engines act as two parallel processing units and therefore need to be connected to different data paths. Thus, the configuration rule for a PE in two-batch mode is: “S1 == S5 && S3 == S7 && S2 == \(\sim\)S4 && S6 == \(\sim\)S8”.
Four-Batch Mode. All execution engines in a PE form one cluster, as shown in Figure 5(d). These execution engines are controlled by the L2-controller. In both the X-axis and Y-axis, the \(\mu\)-routers of these execution engines should be connected to different data paths. In the X-axis direction, \(\mu\)-router ❶ and \(\mu\)-router ❷ should connect to data paths different from those of \(\mu\)-router ❸ and \(\mu\)-router ❹, respectively. Similarly, in the vertical direction, \(\mu\)-router ❶ and \(\mu\)-router ❸ should be connected to data paths different from those of \(\mu\)-router ❷ and \(\mu\)-router ❹, respectively. Therefore, the rule for the configuration word is: “S1 == \(\sim\)S5 && S3 == \(\sim\)S7 && S2 == \(\sim\)S4 && S6 == \(\sim\)S8”. Figure 5(d) shows the data path under the “0001-1011 (S1 to S8)” configuration.
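The three configuration-word rules above can be summarized by the small predicate sketch below. It assumes, purely for illustration, that S1 to S8 are stored as individual bits s[0] to s[7]; for example, “0000-0000” satisfies the single-batch rule and “0001-1011” satisfies the four-batch rule.

```c
#include <stdbool.h>

/* Illustrative check of the batch-mode rules on the 8-bit configuration
 * word. Bit numbering is an assumption of this sketch: s[0] = S1, ...,
 * s[7] = S8, each 0 or 1. */
static bool valid_single_batch(const int s[8]) {
    /* S1 == S5 && S3 == S7 && S2 == S4 && S6 == S8 */
    return s[0] == s[4] && s[2] == s[6] && s[1] == s[3] && s[5] == s[7];
}

static bool valid_two_batch(const int s[8]) {
    /* S1 == S5 && S3 == S7 && S2 == ~S4 && S6 == ~S8 */
    return s[0] == s[4] && s[2] == s[6] && s[1] != s[3] && s[5] != s[7];
}

static bool valid_four_batch(const int s[8]) {
    /* S1 == ~S5 && S3 == ~S7 && S2 == ~S4 && S6 == ~S8 */
    return s[0] != s[4] && s[2] != s[6] && s[1] != s[3] && s[5] != s[7];
}
```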
The two-batch mode and four-batch mode are designed for scenarios with high data parallelism. Execution engines are divided into multiple clusters under the control of the mode engine. The instructions are loaded and distributed to the corresponding clusters by the mode engine. Execution engines in the same cluster perform the same operations on multiple data synchronously. Limited by the number of execution engines, the PE can work in three different modes in our example. To explain the structure more clearly, we also show the domain division for the different configurations in different colors in Figure 5. Note that this design principle is scalable. As the number of execution engines in a PE increases (preferably by powers of 2), the number of available modes also increases. For example, when each PE contains 16 execution engines, the structure of the mode engine becomes more complex: there are L3- and L4-controllers, and the PE additionally supports an eight-batch mode and a 16-batch mode.
Memory Access. Global buffers are built with multiple SRAM banks matching the scale of data. Address decoding logic around the scratchpad can be configured to operate in several banking modes to support various access patterns. Physical banks cascade and are grouped into logic banks according to the width of configuration. Besides, the global buffers are sliced into two lines, which work in a Ping-Pong way to cover transmission time. To support diverse modes, DMA can transmit and reshape variable length of multi-batch data with scatter and gather operations, exchanging data between on-chip buffers and off-chip memory.
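The Ping-Pong operation of the two global-buffer lines can be pictured as in the following sketch, where dma_load(), dma_wait(), and run_subtask() are placeholder hooks rather than the real hardware interface: while the PE array computes on one line, the DMA prefetches the next tile of data into the other line.

```c
#include <stdio.h>

/* Placeholder hooks standing in for the DMA engine and the PE array;
 * they are not the real hardware interface. */
static void dma_load(int line, int tile)   { printf("DMA: tile %d -> line %d\n", tile, line); }
static void dma_wait(int line)             { (void)line; /* block until the transfer into `line` is done */ }
static void run_subtask(int line, int tile){ printf("PEs: compute tile %d from line %d\n", tile, line); }

/* Double-buffered (Ping-Pong) processing: computation on one buffer
 * line overlaps with the DMA filling the other. */
static void process_tiles(int num_tiles) {
    int fill = 0, compute = 1;               /* the two global-buffer lines */
    dma_load(fill, 0);                       /* preload the first tile      */
    for (int t = 0; t < num_tiles; t++) {
        int tmp = fill; fill = compute; compute = tmp;  /* swap roles       */
        dma_wait(compute);                   /* data for tile t is ready    */
        if (t + 1 < num_tiles)
            dma_load(fill, t + 1);           /* prefetch the next tile      */
        run_subtask(compute, t);             /* overlaps with the DMA       */
    }
}

int main(void) { process_tiles(4); return 0; }
```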

4.3 Inner-PE Design

We create a decoupled execution model that defines a novel scheme to schedule and trigger DFG nodes to exploit instruction block level parallelism. The code of each DFG node consists of up to four consecutive stages: Load stage, Calculating stage, Flow stage, and Store stage, which we describe below:
Ld (Load) Stage. This stage loads data from the memory hierarchy to the in-PE local memory.
Cal (Calculating) Stage. This stage completes calculations. A node can enter the Cal stage only when the following two conditions are met: first, its Ld stage (if it exists) has already finished; second, it has received all the necessary data from its predecessor nodes.
Flow Stage. This stage transfers data from the current node to its successors.
ST (Store) Stage. This stage transfers data from the in-PE operand memory to the memory hierarchy.
Similarly, instructions in a DFG node will be rearranged according to their types and divided into four different blocks. The block is a basic schedule and trigger unit. Instruction-block-level dataflow is the middle ground between instruction-level dataflow and thread-level dataflow. It can be seen as a further development of thread-level dataflow. In the thread-level dataflow model, each dataflow graph node is a thread and serves as the basic unit for launching and scheduling. Instruction-block-level dataflow decomposes each node of thread-level dataflow into four stages. Each phase consists of a segment of instructions and serves as the basic unit for launching and scheduling. Unlike the traditional out-of-order execution, the decoupled execution model exploits more instruction-block level parallelism without complex control logic, such as reorder buffer.
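The following sketch captures the stage decomposition and the firing condition of the Cal block described above; the state encoding is illustrative only and is not the actual hardware encoding.

```c
/* Sketch of the four-stage decoupling of a DFG node. Each stage is an
 * independently scheduled instruction block; a node simply skips the
 * stages it does not contain. */
typedef enum { STAGE_LD, STAGE_CAL, STAGE_FLOW, STAGE_ST, STAGE_DONE } node_stage;

typedef struct {
    node_stage next;     /* next instruction block to issue        */
    int up_counter;      /* upstream activations still outstanding */
} node_iter;

/* Firing rule for the Cal block: the Ld block (if present) has already
 * retired, so `next` points at Cal, and every upstream operand arrived. */
static int cal_ready(const node_iter *n) {
    return n->next == STAGE_CAL && n->up_counter == 0;
}
```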
Figure 6 illustrates the top-level diagram of our dataflow architecture, which is comprised of a set of identical decoupled processing elements (dPEs). To support the decoupled execution model, separate four-stage components are designed within each PE, corresponding to the four different states of the nodes. This approach allows a processing element to be shared by up to four different DFG nodes simultaneously, enabling the overlap of memory access and data transfer latency as much as possible. By decoupling the datapaths of different stages and equipping each PE with a dedicated scheduler, the DFG nodes of different iterations can be pipelined more efficiently. The controller maintains and schedules the execution of the different node states. To ensure the correctness of the execution, separate operand RAM space is provided for different iterations. A shared operand RAM space is set up to store the data that have dependencies between iterations, which are marked by special registers in the instructions.
The dPE consists of a calculation pipeline, a load unit, a store unit, a flow unit, an instruction RAM module, an operand RAM module, a controller and a router (in the middle of Figure 6). These four separate functional components (CAL, LOAD, FLOW, STORE) and the controller are designed for the decoupled execution model, which are different from previous structures. The calculation pipeline is a data path for arithmetic operations and logical operations. It fetches instructions from the instruction RAM module and performs computations on the source data. The load/store unit transfers data from/to on-chip data memory to/from operand RAM module, respectively. And the flow unit dispatches data to downstream dPEs. Each execution unit has a corresponding DFG node state, as described in Figure 6, and such a decoupling method is the key to improving the utilization.
The controller plays a non-negligible role in state transitions and DFG node triggering. It consists of a kernel table, a status table, a free list, a dedicated acknowledgment buffer (Ack port), and a scheduler module. The kernel table stores the configurations of the nodes mapped to the dPE, which contain the Task ID (TID), node ID (NID), instance number (instance), instruction address list (inst_addr) and data addresses (LD_base&ST_base). The TID and NID are used to identify the task and the DFG node, because the PE array can be mapped to multiple tasks at the same time, and a PE can be mapped to multiple nodes. The instance is a value related to pipeline parallelism, which indicates how many times the DFG node needs to be executed. Taking BFS as an example, a large graph may need to be decomposed into many subgraphs, say 100, in which case each DFG node needs to be executed 100 times. The inst_addr records the location of the four-stage instructions of the DFG node in the instruction RAM. The LD_base&ST_base are the base addresses for the source and destination, which work with the offset in the status table to access the data in the operand RAM.
The status table maintains the runtime information for different instances. It uses the instance_counter to record different instances of DFG nodes. Although different instances share the same instructions, they handle different data; therefore, the offsets (offset) of different instances are different. In addition, the status table records the activations (Up_counter) and status information. The value of Up_counter decreases with the arrival of activation data. When this value reaches 0, all the upstream data of the current node has arrived and the node can be triggered by the scheduler.
The scheduler uses the instance_counter to evaluate priority and schedules nodes accordingly. We also tried other scheduling policies, such as a round-robin scheduler or finer-grain multithreading, but found that these did not work as well. This makes sense: the completed application work is nearly constant regardless of the scheduling strategy, so a simple scheduling mechanism is effective. Also, simple scheduling principles reduce configuration overhead. The Ack port is connected to the four pipeline units in order to obtain the status of each stage, and it uses this information to dynamically update the contents of the status table for the scheduler. The free-list queue maintains the free entries of this buffer.
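A simplified view of the controller tables and the trigger/scheduling logic is sketched below. The field names follow the text, while the field widths, exact layout, and the oldest-instance-first priority are assumptions made for illustration.

```c
/* Sketch of the per-dPE controller tables. */
typedef struct {
    int tid, nid;          /* task ID and DFG-node ID                   */
    int instance;          /* how many times the node must execute      */
    int inst_addr[4];      /* Ld/Cal/Flow/St block offsets in inst. RAM */
    int ld_base, st_base;  /* base addresses in the operand RAM         */
} kernel_entry;

typedef struct {
    int instance_counter;  /* which iteration this entry tracks         */
    int offset;            /* per-instance operand-RAM offset           */
    int up_counter;        /* decremented as activation data arrives    */
    int stage;             /* current decoupled stage of the instance   */
} status_entry;

/* A node instance becomes triggerable once all upstream data arrived. */
static int triggerable(const status_entry *s) { return s->up_counter == 0; }

/* Assumed priority: favour the oldest outstanding instance (smallest
 * instance_counter) among the triggerable entries. */
static int pick_next(const status_entry *tbl, int n) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (triggerable(&tbl[i]) &&
            (best < 0 || tbl[i].instance_counter < tbl[best].instance_counter))
            best = i;
    return best;   /* -1 if nothing is ready */
}
```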
The instruction RAM module consists of multiple single-port SRAM banks. Each bank can be occupied by a single functional unit at any time. The operand RAM module consists of multiple 1-write-1-read SRAM banks. To ensure the pipeline execution between instances, a separate context is allocated for each iteration. Considering that there may be dependent data between instances, a shared context is established in the operand RAM. Shared data are marked by special registers in the instructions.

4.4 Task-Based Program Execution

We propose the task-based program execution model, which augments a dataflow architecture’s ISA with primitives for runtime task management and structured access. In the task-based program execution model, a task consists of multiple sequentially executed subtasks. Each subtask is a dataflow graph which consists of multiple computation nodes and directed edges. A finite-state controller is used to configure our processor at three levels: task level, subtask level, and node level, as shown in Figure 7. Each task contains multiple subtasks, where each subtask is a dataflow graph. The subtasks are executed sequentially, since the number of subtasks to be executed may differ between tasks. First, the task parameter words are used to control the processing of one specific program, indicating the execution number and the number of subtasks. Second, the subtask parameter words are used to control the processing of a codelet, usually a loop structure. They contain the number of iterations and DFG nodes, as well as the batch configuration, the number of root nodes, the base addresses of input and output data, and so on. Third, the node parameter words are used to control a specific DFG node; they record the storage location of the instructions within that node, the number of upstream and downstream nodes, the mapping locations of the upstream and downstream nodes, the coordinates of the execution cluster to which the node is mapped, the priority, and so on. In this execution model, multiple levels of pipeline parallelism can be exploited: (1) pipeline parallelism between different iterations within a subtask; (2) pipeline parallelism between different iterations within a DFG node; and (3) instruction-level pipeline parallelism.
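The three levels of parameter words can be pictured as the C structs below. The field names follow the description above, while the exact packing and field widths are assumptions made for illustration.

```c
/* Sketch of the three-level configuration hierarchy (Figure 7(a)). */
typedef struct {
    int exec_count;           /* how many times the task runs            */
    int num_subtasks;         /* subtasks executed sequentially          */
} task_param;

typedef struct {
    int iterations;           /* loop trip count of the codelet          */
    int num_nodes;            /* DFG nodes in this subtask               */
    int batch_conf;           /* 'B_conf': 1-, 2- or 4-batch mode        */
    int num_roots;            /* root nodes of the DFG                   */
    int in_base, out_base;    /* base addresses of input/output data     */
} subtask_param;

typedef struct {
    int inst_addr;            /* where the node's instructions reside    */
    int num_up, num_down;     /* upstream / downstream node counts       */
    int cluster_x, cluster_y; /* cluster the node is mapped to           */
    int priority;
} node_param;
```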
Figure 7(c) shows an example of task-based program execution. This task completes the core computational process of the Fast Fourier Transform (FFT) and contains mainly two loop bodies that are offloaded to the dataflow coprocessor through hints (pragmas). First, this task contains two subtasks, subtask 1 and subtask 2, which are marked with different colors in Figure 7(b). Then, each subtask is compiled into a dataflow graph, where each dataflow graph node contains a segment of instructions, and the order of instructions follows the principles of the decoupled model we proposed in Section 4.3. Next, the three-level configuration words are loaded into each PE, configuring each execution engine and combining them into a clustered array. The dataflow graph is then mapped to the execution engine array and pipelined for execution. Execution engines within the same cluster execute the same code segments. The mapping process maps a dataflow graph onto a cluster array. Each cluster can be mapped with one or more dataflow graph nodes. Execution engines within the same cluster perform the same computational process and process different data in parallel. Unlike traditional mapping approaches, the size of the execution engine cluster array is variable under different configurations. As a result, the DFG may need to be extended at mapping time. Our approach is inspired by the literature [29]: the DFG is replicated to ensure that each cluster can be utilized.
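A hedged sketch of what the offloaded FFT source might look like is shown below: two loop nests, each annotated with a hint and compiled into one subtask. The pragma spelling and the split into a bit-reversal pass and a butterfly pass are assumptions for illustration, not the actual compiler directive or kernel code.

```c
/* Placeholder kernels standing in for the two FFT phases; their bodies
 * are omitted because only the offload structure matters here. */
void bit_reverse(float *re, float *im, int n);
void butterflies(float *re, float *im, int n);

/* Hypothetical shape of the task in Figure 7(c): two loop nests, each
 * marked for offload with a hint and compiled into one subtask (one
 * dataflow graph). The pragma name is illustrative only. */
void fft_task(float *re, float *im, int n, int batch) {
    #pragma dataflow offload                  /* becomes subtask 1 */
    for (int b = 0; b < batch; b++)
        bit_reverse(re + b * n, im + b * n, n);

    #pragma dataflow offload                  /* becomes subtask 2 */
    for (int b = 0; b < batch; b++)
        butterflies(re + b * n, im + b * n, n);
}
```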

5 Experimental Methodology

Setup. We implemented a dataflow simulator based on the SimICT parallel framework [55]. This simulator is mainly used to verify correctness and to obtain performance and computational-component utilization; it simulates the behavior of computation, memory accesses, instruction conflicts, and the like. Additionally, we implemented the modules of the dataflow architecture in Verilog and synthesized them with Synopsys tools. We use Synopsys Design Compiler and a TSMC 28 nm GP standard-VT library to obtain area, delay and energy consumption; the design meets timing at 1.25 GHz. We calibrate the latency error of the simulator to within ±7% of the Verilog environment while ensuring functional correctness. First, we verify the computational results and functional correctness of both the C simulator and the Verilog implementation. The error here is the difference in the total latency of the test program between the C simulator and the Verilog environment. Since it is difficult to keep the latencies of task switching and pipeline stalls consistent, there is always a small cycle error between the two platforms.
Table 2 shows the hardware parameters. Each PE is equipped with 16 execution engines connected via 4-level controllers, enabling each PE to support more modes. Fixed-point, integer and load instructions consume one cycle; floating-point, store and dataflow instructions consume two clock cycles; floating-point division consumes nine cycles. Table 2 also shows the area and power breakdown of our architecture. It has an area footprint of 16.477 \(mm^2\) in a 28 nm process and consumes a maximum power of 2.038 W at a 1.25 GHz clock. The PE array occupies the largest proportion of area and power consumption, accounting for 57.03% of the area and 53.09% of the power, respectively. Within each PE, the execution engines (including function units, controller and instruction & data RAM) account for the largest proportion.
Component | Parameter | Area (\(mm^2\)) | Power (mW)
PE: Func. Unit | INT & FP32, #16 | 0.165 (44.00%) | 21.43 (31.73%)
PE: Controller | - | 0.044 (11.73%) | 3.59 (5.32%)
PE: Inst. RAM | 4 KB | 0.020 (5.33%) | 2.3 (3.41%)
PE: Data RAM | 16 KB | 0.072 (19.20%) | 28.78 (42.61%)
PE: Mode Engine | L1, L2, L3, L4 | 0.018 (4.80%) | 2.40 (3.55%)
PE: \(\mu\)-routers | #16 | 0.056 (14.93%) | 9.04 (13.38%)
PE: Total | | 0.375 | 67.54
PE Array | 4 × 4, 1.25 GHz | 6.00 (58.36%) | 1080 (52.96%)
Network-on-chip | 1 cycle/hop, X-Y routing | 1.50 (14.59%) | 259 (12.69%)
Global Data Buffer | 512 KB SPM, double-buffer | 1.47 (14.26%) | 534 (26.17%)
Global Config Buffer | 128 KB, double-buffer | 0.21 (1.99%) | 109 (5.34%)
DMA | ping-pong | 0.36 (3.50%) | 58 (2.84%)
Total | | 10.28 | 2040
Table 2. Hardware Parameters
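For reference, the per-instruction latencies stated above can be collected into a small lookup of the kind a cycle-level simulator might use; the enum names are illustrative and are not taken from our toolchain.

```c
/* Latency table from the text: fixed-point, integer and load take one
 * cycle; floating-point, store and dataflow (e.g., TRANS) instructions
 * take two; floating-point division takes nine. */
typedef enum { OP_INT, OP_FIXED, OP_LOAD, OP_FP, OP_STORE, OP_TRANS, OP_FP_DIV } op_class;

static int op_latency(op_class op) {
    switch (op) {
    case OP_INT: case OP_FIXED: case OP_LOAD:  return 1;
    case OP_FP:  case OP_STORE: case OP_TRANS: return 2;
    case OP_FP_DIV:                            return 9;
    }
    return 1;   /* unreachable for valid inputs */
}
```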
Benchmarks. To evaluate our methods, we select several real-world applications from Plasticine [39] and REVEL [51]. These workloads cover digital signal processing algorithms, CNNs and scientific computing, and contain different parameters. Table 3 lists the selected workloads. We use Synopsys PrimeTime PX for accurate power analysis. These kernels are mapped to PE arrays by the compiler introduced in [28], a compilation framework based on the LLVM framework. The host compiles and assembles the high-level language and configures the dataflow processor. In our actual system, we use an ARM CPU as the host, and the dataflow accelerator is controlled through PCIe interrupts. Figure 8 shows the process of transforming partitioned serial code into configurations for our dataflow architecture. We generate LLVM intermediate representation (IR) for each stage, which represents low-level operations on data and their dependences. An automated tool examines the LLVM IR and produces a DFG using the actual operations that can be performed by a PE’s functional units. To ensure load balance, we use the DFG balancing algorithm [11], a heuristic algorithm that achieves load balancing through instruction scheduling among DFG nodes. The code in each dataflow graph node is then converted into the decoupled execution model, which then generates bitstreams that can be executed on the decoupled hardware.
Application | Domain | Parameter Scales | Batch
FFT (Fast Fourier Transform) | Digital Signal Processing (DSP) | 16 \(\sim\) 512 | 1 \(\sim\) 32
FIR (Finite Impulse Response) | DSP | 1G | 1 \(\sim\) 32
SVD (Singular Value Decomposition) | DSP and AI | 16 \(\sim\) 256 | 1 \(\sim\) 32
Cholesky | DSP | 16 \(\sim\) 256 | 1 \(\sim\) 32
Alexnet | Artificial Intelligence (AI) | Conv layers | 3 \(\sim\) 384 (channel)
VGG16 | AI | Conv layers | 3 \(\sim\) 512 (channel)
Resnet50 | AI | Conv layers | 3 \(\sim\) 512 (channel)
Stencil3d7p | Scientific computing | 16*32*25K | 1 \(\sim\) 32
SHA256 | Scientific computing | 1 M | 1 \(\sim\) 32
MM (Matrix Multiplication) | Scientific computing, AI, and DSP | 16 \(\sim\) 512 | 1 \(\sim\) 32
BFS (Breadth First Search) | Graph processing | web-Google, |V| = 9K, |E| = 5M | -
Table 3. Benchmark Specifications
Fig. 8. The process of mapping a program to hardware.
The mapping of a DFG to our architecture involves two stages: one during compilation and one during hardware configuration. Each processing unit is identified by an ID indicating its position in the two-dimensional array. The mapping algorithm traverses each node in the DFG and determines its position in the PE array. At the end of compilation, all the information, including the DFG topology, mapping configuration, and instructions for DFG nodes, is written to a file that is sent to the hardware's global configuration buffer. The buffer parses this file into packets, each carrying the packet type, the destination processing unit, and other execution-related information. The destination field specifies the processing unit on which the instruction packet is to be executed, so the packet can be delivered to that unit through the network-on-chip.
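As a concrete illustration, a configuration packet delivered through the network-on-chip could be represented as follows. This is a hedged sketch assuming a 4 × 4 array addressed by row/column IDs; the field names and widths are ours, not the exact hardware format.

#include <stdint.h>

/* Kinds of information carried in the configuration file (per the text above). */
typedef enum { PKT_DFG_TOPOLOGY, PKT_MAPPING, PKT_INSTRUCTION } pkt_type_t;

/* Illustrative packet layout parsed by the global configuration buffer. */
typedef struct {
    pkt_type_t type;      /* type of packet                                  */
    uint8_t    dest_row;  /* destination PE position in the 4 x 4 array      */
    uint8_t    dest_col;
    uint16_t   length;    /* payload length in words                         */
    uint32_t   payload[]; /* instructions / mapping data for the target node */
} config_packet_t;

/* With X-Y routing (Table 2), a packet first travels along the row and then
   along the column until it reaches PE (dest_row, dest_col). */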
Comparisons. To quantify the performance of our processor, we first compare it with a CPU, a DSP, and a GPU. For fairness, these platforms have similar ideal peak FLOPs (except the GPU, which has more). DSP: TI C6678, an 8-core DSP with 16 FP adders/multipliers per core, using DSPLIB C66x 3.4.0.0. CPU: Intel Xeon 4116, a conventional out-of-order processor using the highly optimized Intel MKL library (8 cores used). GPU: NVIDIA Tesla V100 (NVLink version), using the cuSOLVER, cuFFT, and cuBLAS NVIDIA CUDA libraries. We employed the “nvidia-smi” tool to measure the dynamic power consumption of the GPU rather than its maximum power. To measure CPU power consumption while executing the programs, we utilized a resource manager.1 Furthermore, we utilized a power instrument2 to measure the power consumption of the DSP during program execution.
In addition, we compare our design with three state-of-the-art dataflow designs: RABP [38], REVEL [51], and Plasticine [39]; their configurations are listed in Table 4. RABP is characterized by high peak performance obtained with a large-scale PE array using scalar computation, and the PE array achieves high utilization by executing multiple DFGs simultaneously. REVEL and Plasticine feature vector SIMD architectures for batch processing to improve performance and energy efficiency. REVEL uses SIMD8 in a heterogeneous dataflow architecture whose PE array consists of simple PEs in systolic form and dataflow PEs that can perform complex computations. Plasticine is a homogeneous dataflow array that uses SIMD16 and obtains high utilization by decoupling dedicated pattern compute units and memory units. We leverage their open-source implementations and develop a simulator for each design for performance and utilization evaluation. We also implement their main components in Verilog to obtain area and power consumption. For fairness, these designs are extended to have similar peak performance and the same process; their clock frequencies follow the configurations in their original articles.
Architecture | PE array | PE | Vector size | On-chip Memory
RABP | 20 × 20 | INT & FP32 | SIMD 1 | 512 KB DBUF, 128 KB CBUF
REVEL | 4 × 8 | add, sqrt/div, mult | SIMD 8 | 128 KB
Plasticine | 9 PCUs, 6 stages (PCU) | INT & FP32 | SIMD 16 | 9 PMUs, 288 KB
OURS | 4 × 4 | INT & FP32 | 1, 2, 4, 8, 16 | 512 KB DBUF, 128 KB CBUF
Table 4. Hardware Configurations

6 Evaluation Results

6.1 Utilization and Performance

First, we validate the processing capability of our proposed architecture for multiple batch sizes. Figure 9 shows the computational resource utilization for different data sizes. In all cases, it achieves an average utilization of over 60% and a maximum of 92.2%. The experimental data show that applications with larger parameter sizes in a single batch, such as FFT-512 and MM-512, achieve higher utilization in the multi-batch case. This is because every execution unit is assigned computational tasks and iterates while remaining busy. The small variations in utilization are due to different configuration overheads and the dynamic behavior at runtime. On the other hand, the benefits of our proposed architecture are also reflected in the processing of multiple batches of small-scale data, such as FFT-16 and MM-16. When the parameter size is so small that many execution units receive no computational load, the processor's performance cannot be realized. Our design dynamically and adaptively maps clusters to the batch size so that all execution units receive computational tasks, thereby improving utilization.
Fig. 9. Utilization of the computing components under different batch sizes.
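Throughout this section, utilization denotes the fraction of functional-unit cycles that perform useful work. A standard formulation, which we assume here for exposition, is \(\text{Utilization} = \sum_i \text{busy\_cycles}_i \, / \, (\text{total\_cycles} \times N_{\text{FU}})\), where \(N_{\text{FU}}\) is the number of functional units in the array.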
Second, we evaluate the computational resource utilization for a single batch under different modes, as shown in Figure 10. We use FFT and MM as representatives of digital signal processing algorithms. The computational resources are well utilized when the parameter scale is large, whereas utilization drops significantly when the parameter scale is small. For example, the utilization of the 16-batch mode is only 6% for MM-32. For applications with small parameter sizes such as FFT-16 and MM-16, the utilization of wide-batch modes is extremely low because only a small amount of computational resources is required; most execution engines sit idle, and the 1-batch mode is the most suitable. For AI algorithms, we select different convolutional layers from Alexnet (abbreviated as ‘A’), VGG16 (abbreviated as ‘V’), and Resnet (abbreviated as ‘R’), and process different channels of the feature map in parallel along the batch dimension. We find that the utilization of most convolutional layers decreases as the batch width grows. For A_conv1, for example, the feature map has 3 channels; when the configured width exceeds 4, some execution engines cannot be fully utilized, so utilization decreases.
Fig. 10. Utilization of computational components in different modes at single batch size.
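The behavior above suggests a simple way to reason about mode selection: choose the narrowest supported batch width (1, 2, 4, 8, or 16 lanes, per Table 4) that covers the available data-level parallelism, since anything wider only adds idle engines. The sketch below is an illustrative heuristic under this assumption, not the policy actually implemented in our toolchain.

static int select_batch_mode(int parallelism) {
    /* Supported batch widths, narrowest first (Table 4). */
    static const int modes[] = {1, 2, 4, 8, 16};
    for (int i = 0; i < 5; i++) {
        /* Pick the narrowest width that covers the available parallelism. */
        if (modes[i] >= parallelism)
            return modes[i];
    }
    return 16; /* cap at the widest supported mode */
}

For example, select_batch_mode(3) returns 4 for a 3-channel layer such as A_conv1, while MM-512 can fill the full 16-wide mode.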
Third, we evaluate the benefits of the decoupled execution model. Figure 11(a) shows the utilization of serial and decoupled execution within the dataflow-graph nodes; the decoupled execution obtains a significant improvement in utilization. Figure 11(b) shows the performance of the two execution methods across different cases. The decoupled execution obtains an average performance improvement of 1.92× over sequential execution. Thus, the proposed decoupled execution model plays an important role in improving the performance and utilization of the dataflow fabric.
Fig. 11. Benefits from decoupling methods within execution engines. (a) Comparison of utilization between serial and decoupled execution. (b) Speedup of decoupled datapath (normalised to serial execution).
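To make the comparison in Figure 11 concrete, the following sketch contrasts serial execution of a DFG node's load/compute/store phases with a decoupled schedule in which the phases of consecutive iterations overlap in a pipeline. It is a simplified cycle model under assumed stage latencies, not the actual microarchitecture.

#include <stdio.h>

int main(void) {
    const int n = 32;                                 /* iterations of the DFG node */
    const int t_load = 4, t_compute = 6, t_store = 4; /* assumed stage latencies    */

    /* Serial: each iteration runs load, compute, store back to back. */
    int serial = n * (t_load + t_compute + t_store);

    /* Decoupled: the three phases run as pipeline stages, so after the pipeline
       fills, throughput is limited by the slowest stage. */
    int slowest = t_compute;                          /* max of the three latencies */
    int decoupled = (t_load + t_compute + t_store) + (n - 1) * slowest;

    printf("serial = %d cycles, decoupled = %d cycles, speedup = %.2fx\n",
           serial, decoupled, (double)serial / decoupled);
    return 0;
}

Under these placeholder latencies the model yields roughly a 2.2× reduction in cycles; the measured average gain of 1.92× in Figure 11(b) is of the same order once real stalls and configuration effects are included.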
Fourth, Figure 12 illustrates the speedup normalized to the DSP. We select the best-performing results among the multiple configurations of our architecture for comparison with the other platforms. Our design attains up to a 25.7× speedup over the DSP, with geomeans of 6.40× and 9.47× for small and large parameter scales, respectively. The DSP and CPU have similar mean performance. For large parameter scales, the GPU obtains performance similar to ours, and for MM, SHA256, and SVD it performs better: the GPU has abundant computational resources, is well suited to large-scale concurrent data, and benefits from the SIMT execution model. Its drawback, however, is that it struggles to handle small and discrete data efficiently; for small parameter scales, the CPU outperforms the GPU in most cases. Across large and small parameter scales, our architecture provides an average speedup of 8.45× over the DSP and 1.76× over the GPU.
Fig. 12. Performance comparison normalized to DSP.
Figure 13 illustrates the performance comparison between our architecture and the three dataflow architectures. Compared to RABP, REVEL gains an average 1.83× performance improvement and Plasticine gains an average 2.41× improvement, while our architecture gains an average 3.34× improvement. RABP uses a large-scale PE array rather than vectorization, so configuration latency and long data-communication delays between PEs degrade performance. REVEL employs SIMD, so its PE array is smaller for the same peak performance and the data-communication latency between PEs is reduced, yielding a performance improvement. Plasticine obtains a further gain because its design separates computation from memory access, which allows part of the access latency to be overlapped and thus reduces total time. Although our maximum vector width is the same as Plasticine's, our architecture still achieves higher performance because of our fully decoupled design: not only can access latency be overlapped, but the data-transmission distance between PEs is also overlapped. The improvement brought by the decoupled design is demonstrated in Figure 11. For irregular applications such as BFS, the vectorized design brings essentially no performance improvement; only the decoupled design does.
Fig. 13. Performance comparison normalized to RABP.

6.2 Energy Efficiency

Figure 14 shows the energy efficiency comparison with the GPU in different modes, using the performance-per-watt metric. For digital signal processing algorithms with highly concurrent, large-scale parameters, a wide configuration achieves better energy efficiency, whereas for small-scale parameters a narrow mode does. For example, MM-512 and FFT-512 achieve optimal efficiency in the 16-batch configuration, while the highest efficiency is achieved in the 1-batch and 2-batch modes for FFT-32 and MM-32, respectively; wide configurations are inferior in these cases. The reason is that, for small-scale parameters, low data parallelism leaves execution engines underutilized, resulting in lower utilization (Figure 10) and thus lower efficiency. For most CNN workloads, our design achieves different energy efficiency in different modes. The best efficiency is obtained in the 8-batch configuration in most cases, except for Alexnet_conv1: since Alexnet_conv1 has only 3 channels, the execution engines in wide configurations are not fully utilized, and the 4-batch mode achieves the best efficiency. Compared with the GPU, our design attains up to an 11.95× efficiency improvement, with geomeans of 10.23× and 4.89× for digital signal processing and CNN algorithms, respectively. It therefore achieves significant efficiency gains over the GPU across a wide range of applications.
Fig. 14. Energy efficiency (performance-per-watt) Comparison with GPU.
Figure 15 shows the energy efficiency comparison normalized to the GPU. For each algorithm mapped to our hardware, we select the best energy efficiency for each parameter scale among the different modes, and use the average over all parameter scales for comparison. On average, our design achieves a 7.49× efficiency improvement over the GPU, 2.01× over RABP, 1.34× over REVEL, and 1.19× over Plasticine. RABP is a SIMD-free solution in which data parallelism is not exploited, resulting in inefficiency. REVEL achieves the highest efficiency for matrix multiplication because its heterogeneous design combines systolic and dataflow arrays: matrix multiplication executes on the systolic array with high utilization and low control overhead. However, for algorithms with poor data parallelism, such as sorting and SHA256, REVEL performs poorly in both utilization and energy efficiency. For most algorithms, Plasticine achieves high energy efficiency thanks to its wide SIMD design and high peak performance, but it suffers from low utilization on small-scale parameters, which limits its energy efficiency.
Fig. 15. Comparison with state-of-the-art dataflow designs.
Table 5 shows the hardware comparison of our architecture with the other architectures. These architectures are extended to have similar peak performance and the same process, and each runs at the clock frequency reported in its original paper. RABP achieves energy efficiencies in the range of 32.67 \(\sim\) 89.52 GFLOPS/W. REVEL obtains an efficiency of up to 137.29 GFLOPS/W, but only 27.85 GFLOPS/W in the worst case. Plasticine has a high peak performance and attains energy efficiencies in the range of 63.50 \(\sim\) 150.86 GFLOPS/W. Our final architecture has a peak floating-point performance of 320 single-precision GFLOPS and achieves energy efficiencies in the range of 94.38 \(\sim\) 154.25 GFLOPS/W. However, our design has a large area overhead caused by the execution engines and their interconnections. Compared with the other reconfigurable architectures, our design exhibits more stable efficiency across a variety of situations.
Arch | GPU | RABP | REVEL | Plasticine | OURS
Tech (nm) | 12 | 28 | 28 | 28 | 28
Area (\(mm^2\)) | - | 11.80 | 22.70 | 93.76 | 10.28
Max Power (W) | 300 | 3.488 | 2.282 | 1.933 | 2.040
Freq (GHz) | - | 0.8 | 1.25 | 1 | 1.25
Peak Perf (GFLOPS) | 15700 | 320 | 320 | 410 | 320
Efficiency (GFLOPS/W) | 11.51 \(\sim\) 21.68 | 32.67 \(\sim\) 89.52 | 27.85 \(\sim\) 137.29 | 63.50 \(\sim\) 150.86 | 94.38 \(\sim\) 154.25
Table 5. Hardware Comparisons
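As a simple reference point for Table 5, dividing our peak throughput by our maximum power gives \(320\ \text{GFLOPS} / 2.040\ \text{W} \approx 156.9\ \text{GFLOPS/W}\); the best measured efficiency of 154.25 GFLOPS/W comes close to this figure.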

6.3 Cost and Discussions

The overhead of our proposed method falls mainly into the following areas. First is the overhead added to the configuration context, where configuration information for batch processing is added to the PE array. On the one hand, the final design supports five different modes, so three bits are required to encode the mode. On the other hand, with 16 execution engines in each PE, the mode engines need to generate a 32-bit routing configuration message. Next is the time overhead of the reconfiguration process: for a specific computational task, the array only needs to be configured once. Figure 16 shows the proportion of time spent on microarchitecture configuration versus algorithm execution; configuration accounts for an average of 6.91% of the total time. Figure 16 also illustrates the latency of the hardware components for different applications. Finally, our architecture has an area overhead similar to RABP's, yet larger than REVEL's and Plasticine's. Compared with REVEL, each of our PEs contains both INT and FP computational units, while REVEL's PEs contain only part of them, and our on-chip storage capacity is larger. Compared with Plasticine, the storage hierarchy differs: our on-chip storage contains both the global buffer shared by the PEs and the private buffer within each PE, while Plasticine has only the on-chip pattern memory units, whose total storage capacity is smaller than our design's.
Fig. 16. The ratio of configuration time to execution time.
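The configuration-context overhead described above amounts to a small per-PE bit budget. The sketch below is a hedged illustration; only the 3-bit mode field and the 32-bit routing message follow directly from the text, while the struct layout and the per-engine split are our assumptions.

#include <stdint.h>

/* Illustrative per-PE batch-mode configuration context. */
typedef struct {
    uint8_t  mode;        /* 3 bits used: one of the 5 supported modes      */
    uint32_t routing_cfg; /* 32-bit message for the 16 execution engines    */
                          /* (e.g., 2 bits of routing state per engine)     */
} pe_batch_config_t;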

7 Conclusion

In this article, we describe a novel reconfigurable architecture with flexible multi-batch modes. We devise a unified scale-vector architecture that reaps the benefits of the single-instruction-single-data and single-instruction-multiple-data execution models at the same time: it executes operations with distinct computation patterns on a single execution unit, while performing operations with the same computation pattern on a cluster of units. With a more fine-grained DFG node scheduling and trigger mechanism, our design achieves significant utilization and performance improvements on key application domains, as well as significant performance and energy-efficiency gains over state-of-the-art designs.

Footnotes

References

[1]
Saambhavi Baskaran, Mahmut Taylan Kandemir, and Jack Sampson. 2022. An architecture interface and offload model for low-overhead, near-data, distributed accelerators. In 55th IEEE/ACM International Symposium on Microarchitecture (MICRO 2022) (Chicago, IL, October 1-5, 2022). IEEE, 1160–1177.
[2]
Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and Olivier Temam. 2014. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Architectural Support for Programming Languages and Operating Systems (ASPLOS 2014) (Salt Lake City, UT, March 1-5, 2014), Rajeev Balasubramonian, Al Davis, and Sarita V. Adve (Eds.). ACM, 269–284.
[3]
Yunji Chen, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2016. DianNao family: Energy-efficient hardware accelerators for machine learning. Commun. ACM 59, 11 (2016), 105–112.
[4]
Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui Sun, and Olivier Temam. 2014. DaDianNao: A machine-learning supercomputer. In 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2014) (Cambridge, United Kingdom, December 13-17, 2014). IEEE Computer Society, 609–622.
[5]
Vidushi Dadu and Tony Nowatzki. 2022. TaskStream: Accelerating task-parallel workloads by recovering program structure. In ASPLOS. 1–13.
[6]
Groq Dale Southard, Ecosystem Solutions Distinguished Architect. 2019. Tensor streaming architecture delivers unmatched Performance for compute-intensive workloads. https://groq.com/wp-content/uploads/2019/10/Groq_Whitepaper_2019Oct.pdf
[7]
Chunhua Deng, Yang Sui, Siyu Liao, Xuehai Qian, and Bo Yuan. 2021. GoSPA: An energy-efficient high-performance globally optimized sparse convolutional neural network accelerator. In 48th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA 2021) (Valencia, Spain, June 14-18, 2021). IEEE, 1110–1123.
[8]
Robert H. Dennard, Fritz H. Gaensslen, Hwa-Nien Yu, V. Leo Rideout, Ernest Bassous, and Andre R. Leblanc. 1974. Design of ion-implanted MOSFET’s with very small physical dimensions. Proc. IEEE 87, 4 (1974), 668–678.
[9]
Jack B. Dennis. 1974. First version of a data flow procedure language. In Programming Symposium, Proceedings Colloque sur la Programmation (Paris, France, April 9-11, 1974)(Lecture Notes in Computer Science, Vol. 19), Bernard J. Robinet (Ed.). Springer, 362–376.
[10]
Dongrui Fan, Wenming Li, Xiaochun Ye, Da Wang, Hao Zhang, Zhimin Tang, and Ninghui Sun. 2018. SmarCo: An efficient many-core processor for high-throughput applications in datacenters. In IEEE International Symposium on High Performance Computer Architecture (HPCA 2018) (Vienna, Austria, February 24-28, 2018). IEEE Computer Society, 596–607.
[11]
Zhihua Fan and Wenming Li. 2023. Improving utilization of dataflow architectures through software and hardware co-design. In EuroPar. 1–14.
[12]
Adi Fuchs and David Wentzlaff. 2019. The accelerator wall: Limits of chip specialization. In 25th IEEE International Symposium on High Performance Computer Architecture (HPCA 2019) (Washington, DC, February 16-20, 2019). IEEE, 1–14.
[13]
Roberto Giorgi and Paolo Faraboschi. 2014. An introduction to DF-threads and their execution model. In 2014 International Symposium on Computer Architecture and High Performance Computing Workshop. 60–65.
[14]
Sumanth Gudaparthi, Sarabjeet Singh, Surya Narayanan, Rajeev Balasubramonian, and Visvesh Sathe. 2022. CANDLES: Channel-aware novel dataflow-microarchitecture co-design for low energy sparse neural network acceleration. In IEEE International Symposium on High-Performance Computer Architecture (HPCA 2022) (Seoul, South Korea, April 2-6, 2022). IEEE, 876–891.
[15]
Jawad Haj-Yahya, Haris Volos, Davide B. Bartolini, Georgia Antoniou, Jeremie S. Kim, Zhe Wang, Kleovoulos Kalaitzidis, Tom Rollet, Zhirui Chen, Ye Geng, Onur Mutlu, and Yiannakis Sazeides. 2022. AgileWatts: An energy-efficient CPU core idle-state architecture for latency-sensitive server applications. In 55th IEEE/ACM International Symposium on Microarchitecture (MICRO 2022) (Chicago, IL, October 1-5, 2022). IEEE, 835–850.
[16]
Tae Jun Ham, Juan L. Aragón, and Margaret Martonosi. 2015. DeSC: Decoupled supply-compute communication management for heterogeneous architectures. In MICRO. 191–203.
[17]
Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of inefficiency in general-purpose chips. In 37th International Symposium on Computer Architecture (ISCA 2010), (Saint-Malo, France, June 19-23, 2010), André Seznec, Uri C. Weiser, and Ronny Ronen (Eds.). ACM, 37–47.
[18]
Yifan Hao, Yongwei Zhao, Chenxiao Liu, Zidong Du, Shuyao Cheng, Xiaqing Li, Xing Hu, Qi Guo, Zhiwei Xu, and Tianshi Chen. 2022. Cambricon-P: A bitflow architecture for arbitrary precision computing. In 55th IEEE/ACM International Symposium on Microarchitecture (MICRO 2022) (Chicago, IL, October 1-5, 2022). IEEE, 57–72.
[19]
Sean Kinzer, Joon Kyung Kim, Soroush Ghodrati, Brahmendra Reddy Yatham, Alric Althoff, Divya Mahajan, Sorin Lerner, and Hadi Esmaeilzadeh. 2021. A computational stack for cross-domain acceleration. In IEEE International Symposium on High-Performance Computer Architecture (HPCA 2021) (Seoul, South Korea, February 27 - March 3, 2021). IEEE, 54–70.
[20]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Commun. ACM 60, 6 (2017), 84–90.
[21]
Hunjun Lee, Minseop Kim, Dongmoon Min, Joonsung Kim, Jongwon Back, Honam Yoo, Jong-Ho Lee, and Jangwoo Kim. 2022. 3D-FPIM: An extreme energy-efficient DNN acceleration system using 3D NAND flash-based in-situ PIM unit. In 55th IEEE/ACM International Symposium on Microarchitecture (MICRO 2022) (Chicago, IL, October 1-5, 2022). IEEE, 1359–1376.
[22]
Yejin Lee, Hyunji Choi, Sunhong Min, Hyunseung Lee, Sangwon Beak, Dawoon Jeong, Jae W. Lee, and Tae Jun Ham. 2022. ANNA: Specialized architecture for approximate nearest neighbor search. In IEEE International Symposium on High-Performance Computer Architecture (HPCA 2022) (Seoul, South Korea, April 2-6, 2022). IEEE, 169–183.
[23]
Shiyu Li, Edward Hanson, Xuehai Qian, Hai (Helen) Li, and Yiran Chen. 2021. ESCALATE: Boosting the efficiency of sparse CNN accelerator with kernel decomposition. In 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21) (Virtual Event, Greece, October 18-22, 2021). ACM, 992–1004.
[24]
Declan Loughlin, Aedan Coffey, Frank Callaly, Darren Lyons, and Fearghal Morgan. 2014. Xilinx vivado high level synthesis: Case studies. In 25th IET Irish Signals and Systems Conference 2014 and 2014 China-Ireland International Conference on Information and Communications Technologies. 352–356.
[25]
Liqiang Lu, Yicheng Jin, Hangrui Bi, Zizhang Luo, Peng Li, Tao Wang, and Yun Liang. 2021. Sanger: A co-design framework for enabling sparse attention using reconfigurable architecture. In 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21) (Virtual Event, Greece, October 18-22, 2021). ACM, 977–991.
[26]
Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xiaowei Li. 2017. FlexFlow: A flexible dataflow accelerator architecture for convolutional neural networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA 2017) (Austin, TX, February 4-8, 2017). IEEE Computer Society, 553–564.
[27]
Xiaohan Ma, Chang Si, Ying Wang, Cheng Liu, and Lei Zhang. 2021. NASA: Accelerating neural network design with a NAS processor. In 48th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA 2021) (Valencia, Spain, June 14-18, 2021). IEEE, 790–803.
[28]
Xingchen Man, Leibo Liu, Jianfeng Zhu, and Shaojun Wei. 2019. A general pattern-based dynamic compilation framework for coarse-grained reconfigurable architectures. In 56th Annual Design Automation Conference 2019 (DAC 2019) (Las Vegas, NV, June 02-06, 2019). ACM, 195.
[29]
Xingchen Man, Jianfeng Zhu, Guihuan Song, Shouyi Yin, Shaojun Wei, and Leibo Liu. 2022. CaSMap: Agile mapper for reconfigurable spatial architectures by automatically clustering intermediate representations and scattering mapping process. In 49th Annual International Symposium on Computer Architecture (ISCA’22) (New York, June 18-22, 2022), Valentina Salapura, Mohamed Zahran, Fred Chong, and Lingjia Tang (Eds.). ACM, 259–273.
[30]
George Matheou and Paraskevas Evripidou. 2015. Architectural support for data-driven execution. ACM Trans. Archit. Code Optim. 11, 4, Article 52 (Jan 2015), 25 pages.
[31]
Gordon E. Moore. 1998. Cramming more components onto integrated circuits. Proc. IEEE 86, 1 (1998), 82–85.
[32]
Sean Murray, William Floyd-Jones, Ying Qi, George Dimitri Konidaris, and Daniel J. Sorin. 2016. The microarchitecture of a real-time robot motion planning accelerator. In 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2016) (Taipei, Taiwan, October 15-19, 2016). IEEE Computer Society, 45:1–45:12.
[33]
Ponnanna Kelettira Muthappa, Florian Neugebauer, Ilia Polian, and John P. Hayes. 2020. Hardware-based fast real-time image classification with stochastic computing. In 38th IEEE International Conference on Computer Design (ICCD 2020) (Hartford, CT, October 18-21, 2020). IEEE, 340–347.
[34]
Anna Maria Nestorov, Enrico Reggiani, Hristina Palikareva, Pavel Burovskiy, Tobias Becker, and Marco D. Santambrogio. 2017. A scalable dataflow implementation of Curran’s approximation algorithm. In 2017 IEEE International Parallel and Distributed Processing Symposium Workshops. 150–157.
[35]
Quan M. Nguyen and Daniel Sánchez. 2020. Pipette: Improving core utilization on irregular applications through intra-core pipeline parallelism. In 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2020), (Athens, Greece, October 17-21, 2020). IEEE, 596–608.
[36]
Quan M. Nguyen and Daniel Sanchez. 2021. Fifer: Practical acceleration of irregular applications on reconfigurable architectures. In 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’21) (Virtual Event, Greece, October 18-22, 2021). ACM, 1064–1077.
[37]
Gregory M. Papadopoulos and David E. Culler. 1990. Monsoon: An explicit token-store architecture. SIGARCH Comput. Archit. News 18, 2SI (May 1990), 82–91.
[38]
Guiqiang Peng, Leibo Liu, Sheng Zhou, Shouyi Yin, and Shaojun Wei. 2020. A 2.92-Gb/s/W and 0.43-Gb/s/MG flexible and scalable CGRA-based baseband processor for massive MIMO detection. IEEE Journal of Solid-State Circuits 55, 2 (2020), 505–519.
[39]
Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Stefan Hadjis, Ardavan Pedram, Christos Kozyrakis, and Kunle Olukotun. 2017. Plasticine: A reconfigurable architecture for parallel patterns. In ISCA. ACM, 389–402.
[40]
A. Smith, J. Burrill, J. Gibson, B. Maher, N. Nethercote, B. Yoder, D. Burger, and K. S. McKinley. 2006. Compiling for EDGE architectures. In International Symposium on Code Generation and Optimization (CGO’06). 185–195.
[41]
James E. Smith. 1982. Decoupled access/execute computer architectures. SIGARCH Comput. Archit. News 10, 3 (Apr. 1982), 112–119.
[42]
Joshua Suettlerlein, Stéphane Zuckerman, and Guang R. Gao. 2013. An implementation of the codelet model. In Euro-Par 2013 Parallel Processing. 1–14.
[43]
Cheng Tan, Nicolas Bohm Agostini, Tong Geng, Chenhao Xie, Jiajia Li, Ang Li, Kevin J. Barker, and Antonino Tumeo. 2022. DRIPS: Dynamic rebalancing of pipelined streaming applications on CGRAs. In IEEE International Symposium on High-Performance Computer Architecture(HPCA 2022) (Seoul, South Korea, April 2-6, 2022). IEEE, 304–316.
[44]
Zhanhong Tan, Hongyu Cai, Runpei Dong, and Kaisheng Ma. 2021. NN-baton: DNN workload orchestration and chiplet granularity exploration for multichip accelerators. In 48th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA 2021) (Valencia, Spain, June 14-18, 2021). IEEE, 1013–1026.
[45]
Michael B. Taylor, Jason Sungtae Kim, Jason E. Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffmann, Paul R. Johnson, Jae W. Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matthew I. Frank, Saman P. Amarasinghe, and Anant Agarwal. 2002. The raw microprocessor: A computational fabric for software circuits and general-purpose programs. IEEE Micro 22, 2 (2002), 25–35.
[46]
David Tse and Pramod Viswanath. 2005. Fundamentals of Wireless Communication. Cambridge University Press.
[47]
Matthew Vilim, Alexander Rucker, and Kunle Olukotun. 2021. Aurochs: An architecture for dataflow threads. In 48th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA 2021) (Valencia, Spain, June 14-18, 2021). IEEE, 402–415.
[48]
Nils Voss, Marco Bacis, Oskar Mencer, Georgi Gaydadjiev, and Wayne Luk. 2017. Convolutional neural networks on dataflow engines. In 2017 IEEE International Conference on Computer Design (ICCD). 435–438.
[49]
Nils Voss, Pablo Quintana, Oskar Mencer, Wayne Luk, and Georgi Gaydadjiev. 2019. Memory mapping for multi-die FPGAs. In 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). 78–86.
[50]
Xingbin Wang, Boyan Zhao, Rui Hou, Amro Awad, Zhihong Tian, and Dan Meng. 2021. NASGuard: A novel accelerator architecture for robust neural architecture search (NAS) networks. In 48th ACM/IEEE Annual International Symposium on Computer Architecture (ISCA 2021) (Valencia, Spain, June 14-18, 2021). IEEE, 776–789.
[51]
Jian Weng, Sihao Liu, Zhengrong Wang, Vidushi Dadu, and Tony Nowatzki. 2020. A hybrid systolic-dataflow architecture for inductive matrix algorithms. In HPCA. 703–716.
[52]
Xinxin Wu, Zhihua Fan, Tianyu Liu, Wenming Li, Xiaochun Ye, and Dongrui Fan. 2022. LRP: Predictive output activation based on SVD approach for CNNs acceleration. In 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE 2022) (Antwerp, Belgium, March 14-23, 2022), Cristiana Bolchini, Ingrid Verbauwhede, and Ioana Vatajelu (Eds.). IEEE, 831–836.
[53]
Jianguo Yao, Hao Zhou, Yalin Zhang, Ying Li, Chuang Feng, Shi Chen, Jiaoyan Chen, Yongdong Wang, and Qiaojuan Hu. 2023. High performance and power efficient accelerator for cloud inference. In IEEE International Symposium on High-Performance Computer Architecture (HPCA 2023) (Montreal, QC, Canada, February 25 - March 1, 2023). IEEE, 1003–1016.
[54]
Amir Yazdanbakhsh, Kambiz Samadi, Nam Sung Kim, Hadi Esmaeilzadeh, Hajar Falahati, and Philip J. Wolfe. 2018. GANAX: A unified MIMD-SIMD acceleration for generative adversarial networks. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture. 650–661.
[55]
Xiaochun Ye, Dongrui Fan, Ninghui Sun, Shibin Tang, Mingzhe Zhang, and Hao Zhang. 2013. SimICT: A fast and flexible framework for performance and power evaluation of large-scale architecture. In International Symposium on Low Power Electronics and Design (ISLPED) (Beijing, China, September 4-6, 2013), Pai H. Chou, Ru Huang, Yuan Xie, and Tanay Karnik (Eds.). IEEE, 273–278.
[56]
Xiaochun Ye, Xu Tan, Meng Wu, Yujing Feng, Da Wang, Hao Zhang, Songwen Pei, and Dongrui Fan. 2020. An efficient dataflow accelerator for scientific applications. Future Gener. Comput. Syst. 112 (2020), 580–588.
[57]
Chen Yin and Qin Wang. 2021. Subgraph decoupling and rescheduling for increased utilization in CGRA architecture. In DATE. 1394–1399.
[58]
Shouyi Yin, Shibin Tang, Xinhan Lin, Peng Ouyang, Fengbin Tu, Leibo Liu, and Shaojun Wei. 2019. A high throughput acceleration for hybrid neural networks with efficient resource management on FPGA. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 38, 4 (2019), 678–691.
[59]
Chen Zhang, Peng Li, Guangyu Sun, Yijin Guan, Bingjun Xiao, and Jason Cong. 2015. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’15) (Monterey, California) . ACM, New York, 161–170.
[60]
Yunan Zhang, Po-An Tsai, and Hung-Wei Tseng. 2022. SIMD\(^2\): A generalized matrix instruction set for accelerating tensor computation beyond GEMM. In The 49th Annual International Symposium on Computer Architecture (ISCA’22) (New York, June 18-22, 2022), Valentina Salapura, Mohamed Zahran, Fred Chong, and Lingjia Tang (Eds.). ACM, 552–566.
[61]
Yaqi Zhang, Nathan Zhang, Tian Zhao, Matt Vilim, Muhammad Shahbaz, and Kunle Olukotun. 2021. SARA: Scaling a reconfigurable dataflow accelerator. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA). 1041–1054.
[62]
Shixuan Zheng, Xianjue Zhang, Leibo Liu, Shaojun Wei, and Shouyi Yin. 2022. Atomic dataflow based graph-level workload orchestration for scalable DNN accelerators. In IEEE International Symposium on High-Performance Computer Architecture (HPCA 2022) (Seoul, South Korea, April 2-6, 2022). IEEE, 475–489.

    Published In

    ACM Transactions on Architecture and Code Optimization, Volume 21, Issue 1, March 2024, 500 pages
    EISSN: 1544-3973
    DOI: 10.1145/3613496
    Editor: David Kaeli
    This work is licensed under a Creative Commons Attribution International 4.0 License.

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 15 February 2024
    Online AM: 18 December 2023
    Accepted: 11 December 2023
    Revised: 10 November 2023
    Received: 04 May 2023
    Published in TACO Volume 21, Issue 1

    Author Tags

    1. Utilization
    2. network-on-chip
    3. decoupled architecture
    4. batch processing

    Qualifiers

    • Research-article

    Funding Sources

    • National Key R&D Program of China
    • Beijing Nova Program
    • CAS Project for Young Scientists in Basic Research
    • CAS Project for Youth Innovation Promotion Association and Open Research Projects of Zhejiang Lab
