PALM: A Efficient Performance Simulator for Tiled Accelerators with Large-scale Model Training

Jiahao Fang1, Huizheng Wang1, Qize Yang1, Dehao Kong2, Xu Dai2, Jinyi Deng1, Yang Hu1, and Shouyi Yin1
1 School of Integrated Circuits, Tsinghua University, Beijing, China
2 Shanghai Artificial Intelligence Laboratory, Shanghai, China
wanghz22@mails.tsinghua.edu.cn
Abstract

Deep learning (DL) models are attracting great interest and scaling at an unprecedented rate. To this end, a handful of tiled accelerators have been proposed to support such large-scale training tasks. However, these accelerators often incorporate numerous cores or tiles, even extending to wafer scale, along with substantial on-chip bandwidth and distributed memory systems. This results in an exceedingly complex design space. Moreover, conducting actual training experiments to find optimal configurations is impractical due to time constraints. Hence, predicting the optimal mapping of various parallelisms to such tiled system architectures becomes crucial. In this study, leveraging an analysis of existing mainstream DL model training strategies, we introduce a performance simulator named PALM. PALM targets both the training and inference processes for tiled accelerators, aiming to inspire the design of current and future accelerators. Specifically, (i) we establish a scheduling mechanism among tiled accelerators based on an event-driven framework; (ii) we support user-configurable pipeline, tensor, and data parallelism on tiled accelerators, determining the absolute performance throughput under these parallelism strategies; (iii) we model the interaction of on-chip SRAM, NoC, and off-chip DRAM during operator execution. This work is available here: https://github.com/fangjh21/PALM.

Index Terms:
Tiled Accelerator; Wafer-Scale; Pipeline Parallelism; Event-Driven

I Introduction

Deep learning (DL) and deep neural networks (DNNs) play a crucial role in advancing artificial intelligence (AI) across diverse application domains, including image processing [1, 2, 3, 4, 5, 6], natural language processing [7, 8, 9], and autonomous driving [10, 11, 12]. As the popularity and applications of AI continue to grow, researchers are actively working to enhance the capabilities and accuracy of DNNs. This involves designing more complex networks and training them with extensive datasets, often comprising millions or even billions of samples [13, 14, 15]. However, these advancements come with the challenge of extended training times and skyrocketing memory requirements, thereby fueling the need for scalable high-performance training platforms. For example, training GPT-3 (175 B) on Nvidia Tesla V100 GPUs requires 3.1 million hours and would cost around $4.6 million [16]. Even worse, the overall size of these huge models surpasses the physical memory capacity of a single accelerator. This holds true even for contemporary GPUs equipped with substantial memory, such as the 80GB Nvidia H100 cards [17]. Therefore, numerous efforts have been devoted to expediting the training process by distributing it across multiple accelerators.

The fundamental concept behind distributed training is to allocate the independent computations of the model across multiple accelerators, facilitating parallel execution. Various parallelization strategies are available [18, 19, 20], each with its own set of advantages and drawbacks. Identifying the appropriate type and degree of parallelism to be leveraged under different constraints (such as budget, time, memory, and ease of implementation) can significantly enhance training throughput. However, it is impractical to find the optimal type and degree of parallelism by performing actual training experiments given some specific constraints due to the prohibitive expense. Although most academic projects leverage cloud frameworks like Microsoft Azure, Google Cloud Computing, or Amazon Web Services for training their proposed models, conducting these long-running experiments on cloud-hosted systems is also expensive as users are billed per hour. Therefore, an effective prediction for the training time under given workloads, parallelism configurations, and accelerator architectures becomes an indispensable part of the distributed training system design.

Recently, tiled accelerators [21, 22, 23, 24, 25, 26] have been recognized for their significant potential in distributed DL training tasks due to their higher utilization and energy efficiency [27]. These accelerators feature spatial multi-tiled architectures, with each hardware tile comprising a processing element (PE) array and a global buffer, interconnected by a network on chip (NoC). Therefore, it becomes crucial to perform simulation modeling for tiled accelerators. However, existing simulators often lack DL training support on tiled accelerators for the following reasons: (i) Current simulators adopt cycle-accurate or event-driven approaches but lack a scheduling mechanism to model a large number of tiles. (ii) These simulators lack user-configurable parallelism strategies, ignoring users’ needs to optimize performance with hybrid parallelism strategies. (iii) Tiled accelerators exhibit spatial properties that involve interaction between DRAM and NoC bandwidth, which is challenging for existing analytical models to capture, while cycle-accurate models are cumbersome.

Given these insights, PALM is introduced as a simulator tailored for DL training on tiled accelerators. PALM utilizes three internal mechanisms to tackle these issues: (i) Virtual Tile Aggregation, with which pipeline execution and layer-wise execution for the training of DL models on tens to thousands of tiles can be modeled; (ii) an Adaptive Parallelism Interface, which supports parallelism strategies and spatial mappings configured by users, providing them with a broad search space; (iii) a Detailed Bandwidth Model, which models bandwidth contention among multiple concurrent communication and memory access tasks. The main contributions of this work are summarized as follows:

  • To the best of the authors’ knowledge, PALM is the first simulator that considers the spatial property of tiled accelerators on DL training tasks with an event-driven mechanism.

  • We identify three major challenges in modeling tiled accelerators: software overhead in simulating a large number of tiles, lack of user interfaces for configuring parallelism strategies, and difficulty in modeling influence between DRAM and NoC with existing methods.

  • In response to these modeling challenges, we propose three corresponding mechanisms: Virtual Tile Aggregation, Adaptive Parallelism Interface, and Detailed Bandwidth Model.

  • Through several case studies, we demonstrate PALM’s modeling accuracy. Compared to published data, our average error remains within 17%. Additionally, we show that subtle differences in spatial mapping and parallelism within tiled accelerators result in a performance gap of more than 2×. Finally, we delve into the optimization of communication across tile groups.

II Background

Figure 1: Diverse parallelism strategies, collective communication patterns, and typical architecture of tiled accelerators.

II-A Parallelism Schemes of Distributed Training

II-A1 Data Parallelism (DP)

As shown in Fig. 1(a), DP means each worker utilizes the same model to train on distinct micro-batches of data [20]. In DP, there is no synchronization between workers during forward computation, as each worker possesses a complete copy of the model. The storage for holistic structure and parameters also leads to an expensive memory footprint. Despite the elimination of data synchronization during the forward process, gradient all-reduce becomes essential as a collective operation during the backward process.

II-A2 Tensor Parallelism (TP)

In TP, the model weights are divided (depicted by diverse colors in Fig. 1(a)), while training data is duplicated across workers [28]. Consequently, each worker observes the same data but computes only a portion of the activation. The communication of these partial results is necessary across workers in layers during both forward and backward propagation. Compared to the DP, the communication cost from TP is higher, but it can effectively relieve the memory capacity pressure [29]. This allows multiple devices to jointly serve a larger model, addressing the challenge of fitting huge models onto limited hardware resources.

II-A3 Pipeline Parallelism (PP)

This parallelism entails the division of the layers of DL model among workers [19], as illustrated by the four white boxes in Fig. 1(a). Activations from a specific set of layers, assigned to one worker, are transferred to the subsequent set of layers, assigned to another worker. These consecutive layers operate on distinct data concurrently when the input batch is segmented into micro-batches that can be sequentially fed to the pipeline workers. However, this strategy may introduce pipeline bubbles [30, 31] or periods during which an accelerator remains idle, awaiting data from the preceding accelerator in the pipeline.

II-B Collective Communication

Based on the chosen parallelization strategy, models and input batches are distributed across workers. This makes communication and synchronization of data, like forward activation or weight/input gradients, among devices inevitable [32]. This traffic is typically formulated and processed through collective communications. Four primary collective communication operations are key contributors in DNN training [33, 34]: (i) reduce-scatter, (ii) all-gather, (iii) all-reduce, (iv) all-to-all. In Fig. 1(b), reduce-scatter operation sums all initial data in workers, resulting in each worker holding a portion of globally reduced data. The all-gather operation gathers the data initially distributed across workers, ensuring each worker possesses the complete data. All-reduce can be regarded as a combination of reduce-scatter followed by an all-gather operation. In the all-to-all pattern, each node is required to send a distinct portion of data to other nodes.
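The relationship between these collectives can be illustrated with a small sketch. It models only the data semantics (what each worker holds before and after), not any particular network algorithm or PALM's own implementation; the function names are chosen here for illustration.

```python
from typing import List


def reduce_scatter(data: List[List[float]]) -> List[List[float]]:
    """Sum the workers' vectors element-wise; worker w keeps only chunk w."""
    n = len(data)
    length = len(data[0])
    assert length % n == 0, "vector length must divide evenly among workers"
    chunk = length // n
    reduced = [sum(worker[j] for worker in data) for j in range(length)]
    return [reduced[w * chunk:(w + 1) * chunk] for w in range(n)]


def all_gather(chunks: List[List[float]]) -> List[List[float]]:
    """Every worker receives the concatenation of all workers' chunks."""
    full = [x for c in chunks for x in c]
    return [list(full) for _ in chunks]


def all_reduce(data: List[List[float]]) -> List[List[float]]:
    """All-reduce expressed as reduce-scatter followed by all-gather."""
    return all_gather(reduce_scatter(data))


workers = [[1, 2, 3, 4], [10, 20, 30, 40]]
scattered = reduce_scatter(workers)   # each worker holds half of the sum
reduced = all_reduce(workers)         # each worker holds the full sum
```

With two workers, `scattered` leaves each worker with one half of the summed vector, and `reduced` gives both workers the complete sum.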

II-C Tiled Accelerator

Fig. 1(c) illustrates the architecture of a tiled accelerator, which usually consists of multiple independently operating tiles. Each tile has its own instruction queue and local memory and progresses at its own pace, which allows tiled accelerators to support flexible dataflow and mapping. Moreover, the NoC is employed for transferring data among the tiles and synchronizing tiles at different stages throughout the program execution. The NoC also connects all tiles to the off-chip communication and memory controller blocks. As a result, each tile has access to the off-chip memory or other chips. Compared to traditional monolithic chips and single-tile SIMD GPUs, such architectures usually exhibit higher execution efficiency. This improved efficiency comes from fine-grained scheduling and from optimized dataflow strategies that partition data spatially and temporally across the tiles.

II-D Modeling Method for DL training on Hardware

II-D1 Analytical Model and Prediction Model

The analytical model [35, 36, 37] examines the DL model training process, using approximate methods to derive formulas over DL model and hardware parameters that estimate latency or energy consumption. While providing a quick assessment, its reliability is moderate and it may not fully capture the dynamic features of hardware systems. The prediction model [36] gathers throughput data and hardware-related information from DL training and uses them to train models such as multilayer perceptrons (MLPs). However, its applicability is limited, relying heavily on specific datasets and training conditions.

II-D2 Simulator

Existing simulators fall into two main categories: cycle-accurate and discrete event-driven [34, 38]. The former delves into low-level hardware logic, processing operations within each clock cycle with fine granularity and high-precision modeling, suitable for scenarios with well-defined hardware architectures. However, drawbacks include a longer development cycle and extended software runtime. In contrast, discrete event-driven simulators trigger state changes through events, maintaining an event queue for each hardware component. These simulators run faster and are ideal for early-stage hardware development and architectural exploration.
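To make the event-driven idea concrete, the following is a minimal sketch of a discrete-event loop built on the Python standard library. It is not taken from PALM or any simulator cited above; the `EventQueue` class and the two example events are hypothetical, and a single global queue is used for brevity instead of one queue per hardware component.

```python
import heapq
from typing import Callable, List, Tuple


class EventQueue:
    """Minimal discrete-event loop: time jumps directly from event to event."""

    def __init__(self) -> None:
        self.now = 0.0
        self._heap: List[Tuple[float, int, Callable[[], None]]] = []
        self._seq = 0  # tie-breaker so events at equal times keep insertion order

    def schedule(self, delay: float, action: Callable[[], None]) -> None:
        heapq.heappush(self._heap, (self.now + delay, self._seq, action))
        self._seq += 1

    def run(self) -> None:
        while self._heap:
            self.now, _, action = heapq.heappop(self._heap)
            action()


log = []
sim = EventQueue()


def compute_done() -> None:
    # Finishing a computation triggers a follow-up data transfer event.
    log.append((sim.now, "compute done"))
    sim.schedule(2.0, lambda: log.append((sim.now, "transfer done")))


sim.schedule(5.0, compute_done)
sim.run()
```

No work is done for the idle time between events, which is the main source of the speed advantage over stepping through every clock cycle.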

Figure 2: An overview of PALM framework. The highlighted portions in red boxes are focal points of the work.

III Motivation

Existing simulators and analytical or prediction models primarily focus on modeling GPU clusters but lack robust support for tiled accelerators. To inspire the design of tiled accelerators for DL training, based on the property of DL models and architecture, we identify the following three essential requirements: (i) Scheduling mechanism to model a large number of tiles; (ii) User-configurable parallelism strategies; (iii) Interaction between DRAM and NoC bandwidth.

III-A Scheduling mechanism to model a large number of tiles

A sensible modeling approach is essential for simulating the training process of DL models on a substantial number of tiles, as depicted in Fig. 1(c). Real tiled accelerator systems exhibit a range of scales, from 4×4 and 10×12 [39, 40] to a wafer-scale architecture of 633×633 [41]. A straightforward but very coarse approach is to assign each tile an independent thread or event queue. However, handling a large number of tiles using such a simulation mechanism would lead to a notable increase in software overhead. Therefore, to efficiently implement a tiled accelerator simulator for DL training tasks, it is imperative to introduce a unique scheduling mechanism among tiled accelerators.

III-B User-configurable parallelism strategies

Current simulators lack interfaces that support arbitrary parallelism strategies. Typically, users need to extract computation graphs with embedded parallelism information from established DL frameworks such as PyTorch and TensorFlow. This limitation prevents the direct iteration of parallelism strategies based on simulation results. Additionally, existing simulators do not support the various types of PP, an important parallelism strategy for large language models (LLMs), nor do they discuss the differences in bubble and capacity requirements under PP. PP was proposed mainly to solve the storage problem of LLMs, but it suffers from resource utilization problems. The advantage of PP on tiled accelerators is that it suits the large number of tiles: the pipeline can be split more evenly, the number of pipeline stages can be increased, and the bubble ratio can be reduced. TP and DP are two inherent parallelism strategies. In tiled accelerators, when some tiles/cores form a tile group to execute the same operator, certain dimensions must be segmented, as illustrated in Fig. 1(a). Hence, it is crucial to offer a flexible user-visible interface that supports parallelism across various dimensions.

III-C Interaction between DRAM and NoC bandwidth

SRAM, being faster but costlier than DRAM, is utilized to temporarily store data for computation and exchange data with DRAM. Table I indicates that the SRAM capacity per computing power unit in tiled accelerators surpasses that in traditional GPUs. Specifically, WSE’s SRAM capacity per computing unit is nearly 26× that of GPU A100. Studies [31, 40] explore using SRAM to statically store frequently read data, accelerating tile computation based on dataflow. Recognizing the significant role of SRAM in computation, memory access, and communication is thus reasonable.

Efficient model training relies on DRAM with large capacity and high bandwidth. DRAM is crucial for storing extensive model parameters, intermediate activations, and optimizer states during training. Tiled accelerators, designed for high-density computing power, differ significantly from GPUs in their memory hierarchy. For example, in the WSE-2 system [41], of which the computing power is equivalent to 46 GPUs, there is no on-wafer DRAM; instead, DRAM is located off the wafer. Consequently, DRAM access in tiled accelerators becomes costly due to NoC routing, as depicted in Fig. 1(c). Therefore, modeling DRAM behavior is crucial to accurately reflect practical behaviors of tiled accelerators.

NoC acts as a physical bridge among tiles [39, 40, 42], impacting communication between pipeline stages generated by mapping and parallelism, as well as intra-stage communication. Frequent DRAM access will occupy NoC bandwidth. In Table I, various tiled accelerators exhibit different NoC hop counts to DRAM, presenting a disadvantage for on-chip access tasks. Additionally, in the same table, the Link bandwidth-to-DRAM bandwidth ratio is higher in tiled accelerators, providing an advantage for communication tasks.

In summary, it is essential to model the behavior of SRAM, DRAM, and NoC during the training process to accurately reflect the architectural characteristics of tiled accelerators.

TABLE I: GPU vs. Tiled Accelerator Hardware Parameters.
Hardware | S_Cap./Comp.¹ | D_Hops² | L_BW/D_BW³
H100 [17] | 0.050 | - | 0.179
A100 [43] | 0.128 | - | 0.310
Grayskull [40] | 1.304 | 5 | 1.900
Dojo D1 [44] | 1.215 | 25 | 2.275
WSE2 [45] | 2.666 | 316 | 1.375
  ¹ SRAM capacity-to-compute ratio (MB/TFLOPs@FP16);
  ² Maximum hop counts to DRAM;
  ³ Link bandwidth-to-DRAM bandwidth ratio.

TABLE II: Factors affecting performance considered by PALM.
Factors | Type | Direct effect
Pipe schedule | GPipe, (interleaved) 1F1B [20] | bubble & mem.
Parallelism | PP, DP, TP | latency & mem.
Tile dataflow [46] | IS, WS | access times
Optimizer [47] | SGD, Adam | mem.
ZERO [48] | ZERO | latency & mem.
Congestion | NoC, DRAM | latency

IV The Making of PALM

Fig. 2 shows the overall framework of PALM, and the main factors considered by PALM are summarized in Table II. PALM is built on SimPy [49], a discrete event-driven framework. Moreover, PALM models a two-level tiled accelerator, as shown in Fig. 1. This section introduces how PALM efficiently obtains performance throughput from DL models, hardware configurations, and other settings.

Figure 3: Left: partitioning the DL computation graph into pipeline stages; Right: two pipeline scheduling methods. S denotes the number of stages and B denotes the batch size. The overhead of setup time and drain time must be considered.
Figure 4: The details of pipeline scheduler. Adjacent Stage units share a message queue.

IV-A Virtual Tile Aggregation

We distinguish between the pipeline scheduling mechanism and pipeline parallelism. The former concerns modeling the training process effectively, while the latter involves partitioning the computation graph into stages, as discussed in the next subsection. Pipeline scheduling includes two mechanisms: pipeline execution and layer-wise execution [31]. In our modeling, layer-wise execution is treated as pipeline execution with a depth of 1. Fig. 4 illustrates the pipeline scheduling process: the computation graph is partitioned into stages (S0, S1, S2), and each stage is mapped to a tile group based on the parallelism strategy. The pipeline is divided into three processes: Forward (FD), representing the forward computation of all operators in each stage; Backward (BD), representing the backward propagation of all operators in each stage; and Gradient Update (GU), representing the gradient update process. Additionally, PALM defines Act/Grad Pass to transfer activations/gradients across stages, serving as the start signal for the next stage. In DL, a batch (mini-batch) is taken as the period for gradient updates. To reduce the pipeline bubble ratio, a batch is divided into multiple micro-batches, and FD and BD are executed once per micro-batch. Once all micro-batches are completed, GU is executed. Data_Fetch simulates the input data fetching of one micro-batch, representing the start of the first stage S0. Our scheduling mechanism supports the GPipe [19] and 1F1B [20] schedules shown in Fig. 3. PALM places one of the four types of events into the Virtual Tile Executor based on the signal selected by the Prior Selector. For example, in the 1F1B pipeline, BD is given priority over FD. The Act/Grad Pass between different stages is accomplished through communication events on the NoC. This process is primarily determined by the dependency relationships between adjacent operators in different stages, which will be discussed in the next subsection.
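The priority rule can be sketched as a small selection function. This is an illustrative assumption rather than PALM's actual selector: the function name `select_next`, the string event labels, and the rule that GPipe prefers FD over BD are all introduced here; only the 1F1B preference for BD over FD is stated in the text.

```python
from typing import List, Optional


def select_next(ready: List[str], schedule: str) -> Optional[str]:
    """Pick which ready event a stage should execute next.

    `ready` holds labels of events whose inputs have arrived, e.g. "FD", "BD".
    """
    if not ready:
        return None
    if schedule == "1F1B":
        order = ["BD", "FD"]   # backward work takes priority, as stated in the text
    elif schedule == "GPipe":
        order = ["FD", "BD"]   # assumption: finish forward passes first
    else:
        raise ValueError(f"unknown schedule: {schedule}")
    for kind in order:
        if kind in ready:
            return kind
    return ready[0]            # any other event type (e.g. "GU") runs as-is


choice_1f1b = select_next(["FD", "BD"], "1F1B")
choice_gpipe = select_next(["FD", "BD"], "GPipe")
```

Running BD as soon as it is ready lets a stage release stored activations earlier, which is why the two schedules differ in memory footprint.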

Within each stage, operators are executed in the order of their dependency relationships, such as op $B$ and $C$ of S1 in Fig. 3, as in layer-wise execution. Operators without dependencies are executed either in pre-order of the computation graph or in parallel. When tiles/cores execute the same operator, they are called a tile group. At the tile analysis level (tile analyzer in Fig. 2), PALM assumes that the tiles in each tile group have the same computation and memory access cost. Therefore, each stage furnishes only one or a few simulated tiles, denoted as virtual tiles, to represent the tiles in its tile group. We call this modeling method Virtual Tile Aggregation.

We assume that a single tile mainly consists of two entities: the tile internal logic unit and the NoC router, each of which has its own event queue. Additionally, we suppose the number of tiles is $N \times N$, and the number of stages $S$ is less than or equal to the number of layers $M$ in the computation graph. The naive modeling complexity is $\mathcal{O}(2N^{2})$ for all tiles, while PALM with virtual tile aggregation reduces it to $\mathcal{O}(N^{2}+M)$. By incorporating an analytical model for the NoC, the complexity is further reduced to $\mathcal{O}(M)$. Given that $M$ typically falls in the range of tens to hundreds, this significantly alleviates the modeling overhead.
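As a rough illustration of these bounds, the sketch below treats each complexity expression as a count of simulated entities. Reading the asymptotic terms as exact counts is an assumption made here for illustration, and the function name and the example values are hypothetical.

```python
def simulated_entity_counts(n: int, m: int) -> dict:
    """Entity counts for an n x n tile array and a graph with m layers."""
    return {
        # one logic-unit queue and one router queue per physical tile
        "naive": 2 * n * n,
        # at most m virtual tiles, but every NoC router is still simulated
        "virtual_tiles": n * n + m,
        # routers replaced by an analytical NoC model
        "virtual_tiles_analytic_noc": m,
    }


# Example: the 633 x 633 wafer-scale array mentioned earlier, with 100 layers.
counts = simulated_entity_counts(633, 100)
```

For this example the count drops from 801,378 entities to 100, and the remaining number no longer depends on the array size.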

Figure 5: The events in PALM backward process.

In PALM, each operator also generates three types of events: forward, backward, and gradient update. Each type of event is further divided into computation, communication, and memory access tasks. Fig. 5 describes the main events during backward execution. For each operator, the backward process includes loss computation, activation re-computation, and gradient computation. Activation re-computation occurs only when there is insufficient memory capacity. Each sub-process requires accessing data from memory for computation, with non-negligible communication overhead. The next sub-process begins only after the completion of the current sub-process. For example, in Recompute sub-process, we wait for the completion of the Loss computation event before entering Gradient sub-process. During the three sub-processes, DP communication from the previous operator can overlap with the current operator’s execution. The forward process is similar to the re-computation in the backward process and is not separately listed here. The main events in the gradient update process only include full-precision weights load from DRAM and store back to DRAM, and we have omitted the accumulation computation in the gradient update process.

IV-B Adaptive Parallelism Interface

PP. PP partitions operators of the computation graph into different stages to minimize the pipeline bubble. The ideal execution time in the pipeline training scenario can be evaluated using Eq. (1).

$$
\begin{aligned}
ET_{\text{total}} &= \left(\frac{B}{b}-1\right)\max_{\text{stages}}\left(ET_{FD}+ET_{BD}\right) \\
&\quad + \sum^{\text{stages}}\left(ET_{FD}+ET_{BD}\right) + ET_{GU},
\end{aligned}
\tag{1}
$$

where $ET$ is the execution time, $B$ is the batch size, and $b$ is the micro-batch size. In practice, on a tiled accelerator, the execution time is also influenced by the spatial position of the physical tiles corresponding to the stages. We further discuss this phenomenon with experiments in Section V-B2. PALM takes into account that PP results in differences in memory capacity requirements, as discussed in [41]. Considering a training pipeline with $S$ stages, activations from each stage are stored in the FD process until they are consumed for GU in the BD process. For example, the first stage should store $S$ times the activation in 1F1B, and $B$ times the activation in GPipe, as illustrated in Fig. 3. Incorporating the aforementioned considerations into PP modeling, PALM lets users bind stages to tile IDs and op IDs through the Adaptive Parallelism Interface in Fig. 2, and provides a default scheme that allocates stages of DL models based on computing power requirements.
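Eq. (1) can be evaluated directly once per-stage times are known. The following is a minimal sketch under the assumptions that per-stage FD and BD times are given per micro-batch and that $B$ is divisible by $b$; the function name and the example numbers are hypothetical.

```python
from typing import Sequence


def ideal_pipeline_time(
    fd: Sequence[float],   # per-stage forward time of one micro-batch
    bd: Sequence[float],   # per-stage backward time of one micro-batch
    gu: float,             # gradient update time
    batch: int,            # B
    micro_batch: int,      # b
) -> float:
    """Ideal execution time of one training batch following Eq. (1)."""
    assert len(fd) == len(bd) and len(fd) > 0
    assert batch % micro_batch == 0, "B must be a multiple of b"
    n_micro = batch // micro_batch
    per_stage = [f + b for f, b in zip(fd, bd)]
    steady = (n_micro - 1) * max(per_stage)   # bounded by the slowest stage
    fill_and_drain = sum(per_stage)           # one micro-batch through all stages
    return steady + fill_and_drain + gu


# Three stages, B = 8, b = 2 (four micro-batches).
total = ideal_pipeline_time(fd=[1, 2, 1], bd=[2, 4, 2], gu=0.5, batch=8, micro_batch=2)
```

In this example the slowest stage (6 time units per micro-batch) dominates: three further micro-batches cost 18 units on top of the 12 units needed for one micro-batch to traverse all stages, giving 30.5 in total.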

TP and DP. We analyzed the communication size of all-reduce generated by TP and DP strategies in common operators, as shown in Table III. PALM partitions mapped physical tile groups into communication groups, automatically inserting collective communication events into the tile group event queue. Taking the simple linear operator as an example: the linear operator $Y = WX^{T}$ has four dimensions $(B, M, N, K)$, where $B$ represents the batch size, $K$ represents the reduce dimension, $M, N$ represent the output dimensions, $X^{(N \times K)}$ represents the input, $W^{(M \times K)}$ represents the weights, and $Y^{(M \times N)}$ represents the output. The dimensions $(b, m, n, k)$ represent the parallelism degree for each corresponding dimension. If we map the operator onto 16 tiles from 0 to 15, it is essential to ensure that $b \times m \times n \times k = 16$. The parallelism strategy can be configured by the user as $(2, 2, 2, 2)$ or $(4, 4, 1, 1)$, and so on. The corresponding communication groups are then generated automatically. During the FD, BD, and GU processes, collective communication is needed in the corresponding tile groups. The parallelism of other operators such as Conv2 and Pool is handled in the same way.
For simplicity, we assume that the input shape of Conv2 or Pool is $(B, C, I, I)$, the shape of the weight is $(W, W, K)$, and the shape of the output is $(B, K, O, O)$. Specifically, $K$ is equal to 1 in the Pool operator. The communication size of all-reduce is also given in Table III. The transformer operator is a combination of a series of linear operators, and the shapes of its input and output are $(B, S, H)$. We support both DP ($N_{d}$) and TP ($N_{m}$) as described by Megatron [20]. The communication size generated by splitting these linear operators is accumulated. Parallelism dimensions such as $(b, m, n, k)$ and $(N_{d}, N_{m})$ can also be configured by the user with the interface in Fig. 2.
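One way such communication groups could be derived from a parallelism tuple is sketched below. The layout is an assumption made for illustration: tile IDs are laid out row-major over the $(b, m, n, k)$ grid with $k$ varying fastest, and tiles that differ only in one coordinate form that dimension's group. The function name is hypothetical and PALM's actual mapping may differ.

```python
from itertools import product
from math import prod
from typing import Dict, List, Sequence, Tuple


def communication_groups(
    tile_ids: Sequence[int], degrees: Tuple[int, ...]
) -> Dict[int, List[List[int]]]:
    """For each parallel dimension, group tiles that differ only in that dimension."""
    assert prod(degrees) == len(tile_ids), "degrees must multiply to the tile count"
    # Row-major coordinates: the last dimension varies fastest.
    coords = list(product(*(range(d) for d in degrees)))
    groups: Dict[int, List[List[int]]] = {}
    for dim in range(len(degrees)):
        buckets: Dict[Tuple[int, ...], List[int]] = {}
        for tile, coord in zip(tile_ids, coords):
            key = coord[:dim] + coord[dim + 1:]   # all coordinates except `dim`
            buckets.setdefault(key, []).append(tile)
        groups[dim] = list(buckets.values())
    return groups


tiles = list(range(16))
g2222 = communication_groups(tiles, (2, 2, 2, 2))
g4411 = communication_groups(tiles, (4, 4, 1, 1))
```

Under this layout, `(2, 2, 2, 2)` yields eight groups of two tiles per dimension, while `(4, 4, 1, 1)` yields four groups of four tiles for the first two dimensions and only single-tile groups (no communication) for the last two.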

TABLE III: Parallelism analysis of common operators.
Operator
Type
Dimension
Symbol
Parallelism
Symbol
Computation Count
(FLOPs)
FD;BD;GU
(comm. size, comm. dim.)
Linear𝐿𝑖𝑛𝑒𝑎𝑟Linearitalic_L italic_i italic_n italic_e italic_a italic_r [B,M,N,K] (b,m,n,k) 2BMNKbmnk2𝐵𝑀𝑁𝐾𝑏𝑚𝑛𝑘\frac{2BMNK}{bmnk}divide start_ARG 2 italic_B italic_M italic_N italic_K end_ARG start_ARG italic_b italic_m italic_n italic_k end_ARG (BMNbmn,k);(BMKbmk,n);(NKnk,b),(NKnk,m)𝐵𝑀𝑁𝑏𝑚𝑛𝑘𝐵𝑀𝐾𝑏𝑚𝑘𝑛𝑁𝐾𝑛𝑘𝑏𝑁𝐾𝑛𝑘𝑚(\frac{BMN}{bmn},k);(\frac{BMK}{bmk},n);(\frac{NK}{nk},b),(\frac{NK}{nk},m)( divide start_ARG italic_B italic_M italic_N end_ARG start_ARG italic_b italic_m italic_n end_ARG , italic_k ) ; ( divide start_ARG italic_B italic_M italic_K end_ARG start_ARG italic_b italic_m italic_k end_ARG , italic_n ) ; ( divide start_ARG italic_N italic_K end_ARG start_ARG italic_n italic_k end_ARG , italic_b ) , ( divide start_ARG italic_N italic_K end_ARG start_ARG italic_n italic_k end_ARG , italic_m )
Conv2𝐶𝑜𝑛𝑣2Conv2italic_C italic_o italic_n italic_v 2 [B,H,W,C,R,S,K] (b,c,i,k) 2BHWRSCKbick2𝐵𝐻𝑊𝑅𝑆𝐶𝐾𝑏𝑖𝑐𝑘\frac{2BHWRSCK}{bick}divide start_ARG 2 italic_B italic_H italic_W italic_R italic_S italic_C italic_K end_ARG start_ARG italic_b italic_i italic_c italic_k end_ARG (BHOWOKbik,c);(BHWCbic,k);(RSCKck,b),(RSCKck,i)𝐵subscript𝐻𝑂subscript𝑊𝑂𝐾𝑏𝑖𝑘𝑐𝐵𝐻𝑊𝐶𝑏𝑖𝑐𝑘𝑅𝑆𝐶𝐾𝑐𝑘𝑏𝑅𝑆𝐶𝐾𝑐𝑘𝑖(\frac{BH_{O}W_{O}K}{bik},c);(\frac{BHWC}{bic},k);(\frac{RSCK}{ck},b),(\frac{% RSCK}{ck},i)( divide start_ARG italic_B italic_H start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT italic_W start_POSTSUBSCRIPT italic_O end_POSTSUBSCRIPT italic_K end_ARG start_ARG italic_b italic_i italic_k end_ARG , italic_c ) ; ( divide start_ARG italic_B italic_H italic_W italic_C end_ARG start_ARG italic_b italic_i italic_c end_ARG , italic_k ) ; ( divide start_ARG italic_R italic_S italic_C italic_K end_ARG start_ARG italic_c italic_k end_ARG , italic_b ) , ( divide start_ARG italic_R italic_S italic_C italic_K end_ARG start_ARG italic_c italic_k end_ARG , italic_i )
Pool𝑃𝑜𝑜𝑙Poolitalic_P italic_o italic_o italic_l [B,H,W,C,R,S] (b,c,i,1) 2BHWRSCbci2𝐵𝐻𝑊𝑅𝑆𝐶𝑏𝑐𝑖\frac{2BHWRSC}{bci}divide start_ARG 2 italic_B italic_H italic_W italic_R italic_S italic_C end_ARG start_ARG italic_b italic_c italic_i end_ARG N/A
Transformer𝑇𝑟𝑎𝑛𝑠𝑓𝑜𝑟𝑚𝑒𝑟Transformeritalic_T italic_r italic_a italic_n italic_s italic_f italic_o italic_r italic_m italic_e italic_r [B,H,S,A] (Nd,Nmsubscript𝑁𝑑subscript𝑁𝑚N_{d},N_{m}italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT , italic_N start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT,1,1)1 24BSH2+4BS2HNdNm24𝐵𝑆superscript𝐻24𝐵superscript𝑆2𝐻subscript𝑁𝑑subscript𝑁𝑚\frac{24BSH^{2}+4BS^{2}H}{N_{d}N_{m}}divide start_ARG 24 italic_B italic_S italic_H start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + 4 italic_B italic_S start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_H end_ARG start_ARG italic_N start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT italic_N start_POSTSUBSCRIPT italic_m end_POSTSUBSCRIPT end_ARG (2BSHNd,Nm);(2BSHNd,Nm);(12H2Nm,Nd)2𝐵𝑆𝐻𝑁𝑑𝑁𝑚2𝐵𝑆𝐻𝑁𝑑𝑁𝑚12superscript𝐻2𝑁𝑚𝑁𝑑(\frac{2BSH}{Nd},Nm);(\frac{2BSH}{Nd},Nm);(\frac{12H^{2}}{Nm},Nd)( divide start_ARG 2 italic_B italic_S italic_H end_ARG start_ARG italic_N italic_d end_ARG , italic_N italic_m ) ; ( divide start_ARG 2 italic_B italic_S italic_H end_ARG start_ARG italic_N italic_d end_ARG , italic_N italic_m ) ; ( divide start_ARG 12 italic_H start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_N italic_m end_ARG , italic_N italic_d )
  • ¹ Megatron [20]

Input: allocated_tiles, split_ops_by_parallelism
Wt ← 0, WSG ← 0, ACT ← 0
for (i, Op) in split_ops_by_parallelism do
    Wt ← Wt + Op.Wt
    WSG ← WSG + Op.Wt + Op.Optimizer_State + Op.Gradient
    ACT ← ACT + Op.I
end for
S_Cap. ← SRAM_Capacity_Size
Op_Fd_Access_Size ← [0, ..., 0]
for (i, Op) in split_ops_by_parallelism do
    if Wt ≤ S_Cap. then
        strategy ← activation_stream
        Op_Fd_Access_Size[i] ← Op.I + Op.O
    else if WSG ≤ S_Cap. then
        strategy ← weight_stream
        Op_Fd_Access_Size[i] ← Op.Wt
    else
        Φ1 = ⌈Op.W / S_Cap.⌉ × Op.I
        Φ2 = ⌈Op.I / S_Cap.⌉ × Op.Wt
        if Φ1 < Φ2 then
            strategy ← weight_stationary
            Op_Fd_Access_Size[i] = Φ1 + Op.O
        else
            strategy ← input_stationary
            Op_Fd_Access_Size[i] = Φ2 + Op.O
        end if
    end if
end for
Output: strategy, Op_Fd_Access_Size
Algorithm 1 SRAM allocation method.

IV-C Detailed Bandwidth Model

SRAM allocation. PALM assumes that SRAM capacity primarily affects the volume of DRAM accesses. Alg. 1 summarizes the modeling idea: the operators, split according to the parallelism strategy, and their allocated tiles are taken as input, and the algorithm returns the SRAM strategy and the DRAM access size of the forward pass. The strategies $S_{WSG\_ACT}$, $S_{WSG}$, $S_{ACT}$, and $S_{PTY}$ describe where the weights, optimizer states, and weight gradients (WSG) and the input/output activations ($ACT_{IN}$, $ACT_{OUT}$) reside: both statically stored in on-chip SRAM ($S_{WSG\_ACT}$), only one of them stored in on-chip SRAM ($S_{WSG}$ or $S_{ACT}$), or neither stored on-chip ($S_{PTY}$). When WSG and ACT cannot be retained in SRAM for long, PALM adopts the penalty strategy $S_{PTY}$, which models extra DRAM accesses for WSG and ACT. When $ACT \geq W$, we use input stationary (IS); otherwise, we use weight stationary (WS). PALM also accounts for the storage differences introduced by the optimizer. Adam requires storage for the first-order and second-order moments of the weights, as well as for the gradients of backward activations, which significantly increases storage requirements. With SGD, there is no overhead for optimizer states. During inference, there is no storage overhead for gradients. Alg. 1 lists only the DRAM access size of the forward pass; the backward and gradient-update passes are analyzed with the same methodology and are therefore omitted here.
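The decision logic of Alg. 1 can be sketched in Python as follows. This is a minimal illustration, not PALM's actual implementation: the `Op` container and the function name `sram_allocation` are hypothetical, all sizes are assumed to be in bytes per tile, `Op.W` in the Φ1 expression is read as `Op.Wt`, and the `ACT` accumulator is omitted because the listing never reads it. The branch order follows the listing as printed.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Op:
    """Per-tile footprint of one split operator, in bytes (hypothetical container)."""
    wt: int               # weights (Op.Wt)
    optimizer_state: int  # optimizer states (Op.Optimizer_State)
    gradient: int         # weight gradients (Op.Gradient)
    i: int                # input activations (Op.I)
    o: int                # output activations (Op.O)


def sram_allocation(ops: List[Op], sram_capacity: int) -> List[Tuple[str, int]]:
    """Return (strategy, forward DRAM access size) for every operator."""
    # First loop of Alg. 1: accumulate the footprints over all operators.
    wt = sum(op.wt for op in ops)
    wsg = sum(op.wt + op.optimizer_state + op.gradient for op in ops)

    results = []
    for op in ops:
        if wt <= sram_capacity:
            # Weights fit on-chip: only activations are streamed from DRAM.
            results.append(("activation_stream", op.i + op.o))
        elif wsg <= sram_capacity:
            results.append(("weight_stream", op.wt))
        else:
            # Neither fits: pick the stationary dataflow with fewer re-reads.
            phi1 = math.ceil(op.wt / sram_capacity) * op.i
            phi2 = math.ceil(op.i / sram_capacity) * op.wt
            if phi1 < phi2:
                results.append(("weight_stationary", phi1 + op.o))
            else:
                results.append(("input_stationary", phi2 + op.o))
    return results
```

Calling `sram_allocation([Op(10, 20, 10, 4, 6)], 16)` selects `activation_stream` because the total weight footprint fits in the assumed 16-byte SRAM; larger operators fall through to the Φ1/Φ2 comparison.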

Detailed NoC model. The ideal communication latency of the NoC is given by Eq. (2), where $Link\_Time$ is the single-hop link delay and $Hops$ is the total number of hops on the communication path. However, the analytical model [38] does not track whether every link on a transmission path is idle at a given moment, so the $Contention\_delay$ term cannot be obtained analytically. In the presence of congestion, the communication time in the analytical model may degrade to Eq. (3), which describes hop-by-hop data transmission without pipelining along the path. This is equivalent to reducing the NoC bandwidth by a factor of $Hops$. Even Eq. (3) cannot guarantee that a single-hop link is not occupied by other tasks.

$$Comm\_Time = Link\_Time \times Hops + \frac{Comm\_Size}{BW_{Link}} + \mathbf{Contention\_delay}, \tag{2}$$
$$Comm\_Time = \left(Link\_Time + \frac{Comm\_Size}{BW_{Link}}\right) \times Hops. \tag{3}$$

PALM models NoC congestion by treating each link as an exclusive resource during execution. While a link is occupied by the current task, the execution time of that task is obtained from Eq. (2). A communication task can start only when all the links it needs are unoccupied; otherwise, it waits for those resources to be released.
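The exclusive-link idea can be illustrated with a small Python sketch. It is an assumption-laden toy, not PALM's event-driven engine: the class `LinkOccupancyModel` and both helper functions are hypothetical names, a task is assumed to hold every link on its path for its whole duration, and the waiting time before the start plays the role of the contention delay in Eq. (2).

```python
from typing import Dict, Hashable, Sequence, Tuple


def pipelined_comm_time(link_time: float, hops: int,
                        comm_size: float, bw_link: float) -> float:
    """Eq. (2) without the contention term: pipelined transfer along the path."""
    return link_time * hops + comm_size / bw_link


def hop_by_hop_comm_time(link_time: float, hops: int,
                         comm_size: float, bw_link: float) -> float:
    """Eq. (3): store-and-forward transfer, one full transmission per hop."""
    return (link_time + comm_size / bw_link) * hops


class LinkOccupancyModel:
    """Treats every link as an exclusive resource (hypothetical helper)."""

    def __init__(self) -> None:
        # Maps a link identifier to the time at which it is released.
        self.free_at: Dict[Hashable, float] = {}

    def schedule(self, ready_time: float, links: Sequence[Hashable],
                 link_time: float, comm_size: float,
                 bw_link: float) -> Tuple[float, float]:
        """Return (start, finish) of a task that needs all `links` at once."""
        # The task waits until every link on its path has been released.
        start = max([ready_time] + [self.free_at.get(link, 0.0) for link in links])
        finish = start + pipelined_comm_time(link_time, len(links), comm_size, bw_link)
        for link in links:
            self.free_at[link] = finish
        return start, finish
```

For instance, if one task occupies links `a` and `b` until time 4.0, a second task that needs `b` and becomes ready at time 1.0 starts only at 4.0, whereas a task on an unrelated link starts immediately.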

Detailed DRAM model. The SRAM analysis determines the DRAM access size, and ideally the memory access latency $Access\_Time$ is given by Eq. (4). In a tiled accelerator, however, the DRAM is shared among tiles. Because tiles lie at different distances from the DRAM and issue memory access requests at different times, it is not known from the analytical expression whether the bandwidth ($BW_{DRAM}$) is occupied at a particular moment. Eq. (4) alone therefore cannot accurately represent the memory access latency.

$$Access\_Time = Response\_Time + \frac{Access\_Size}{\mathbf{BW_{DRAM}}}. \tag{4}$$
$$DRAM\_Time = Access\_Time + NoC\_Time. \tag{5}$$

Based on the above equations, PALM constructs a memory access model for edge-shared DRAM in tiled accelerators. As in the NoC model, PALM treats DRAM bandwidth as a resource that is occupied during execution. The model also accounts for the data transmission time through the NoC, denoted $NoC\_Time$. The total DRAM access time $DRAM\_Time$ of a tile is therefore given by Eq. (5).
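A minimal Python sketch of this shared-DRAM model is given below. It rests on several assumptions that are not stated in the text: the class name `SharedDram` is hypothetical, the DRAM is modeled as a single channel that serves one request at a time, and the waiting time is reported separately from the $DRAM\_Time$ of Eq. (5).

```python
from typing import Tuple


def access_time(response_time: float, access_size: float, bw_dram: float) -> float:
    """Eq. (4): ideal latency of one DRAM access."""
    return response_time + access_size / bw_dram


class SharedDram:
    """Edge-shared DRAM whose bandwidth is an exclusive resource (hypothetical)."""

    def __init__(self, response_time: float, bw_dram: float) -> None:
        self.response_time = response_time
        self.bw_dram = bw_dram
        self.free_at = 0.0  # time at which the DRAM bandwidth is released

    def request(self, issue_time: float, access_size: float,
                noc_time: float) -> Tuple[float, float]:
        """Return (waiting time, DRAM_Time) for a request from one tile."""
        # A request is served only once earlier requests have released the DRAM.
        start = max(issue_time, self.free_at)
        t_access = access_time(self.response_time, access_size, self.bw_dram)
        self.free_at = start + t_access
        # Eq. (5): DRAM_Time = Access_Time + NoC_Time.
        return start - issue_time, t_access + noc_time
```

Two tiles issuing requests close together illustrate the effect: the second request sees a non-zero waiting time even though its own `DRAM_Time` from Eq. (5) is unchanged.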

V Case Study

V-A Verification of Simulation Accuracy

V-A1 Verification of NoC model and DRAM model

Figure 6: Performance comparison of the PALM simulator with a GPU system for the all-reduce task under ring topology. (a) 4 devices in ring topology. (b) 16 devices in ring topology.
Figure 7: Error of multi-task stacking on a tiled accelerator, PALM vs. the analytical model.

To validate the NoC model, we run a basic ring all-reduce task on PALM. As depicted in Fig. 6, the error on 4 and 16 tiles is within 5% compared with the results from a real GPU system with ring topology in [38].

To validate the congestion model, we run the experiments in Fig. 7, which overlap all-reduce, all-to-all, and DRAM read and write tasks in combinations with different numbers of tasks. The results show that the execution time of the analytical model is up to 50% less than that of the congestion model. When the number of tasks is 5 and the per-task communication/access size is 8 MB, the execution time of the analytical model is 30% less, and the gap stabilizes at this value as the communication/access size increases. Following the earlier analysis, these differences reflect the modeling error of the analytical model, which indicates that PALM's congestion modeling is necessary in such scenarios.

V-A2 Verification of Scheduling and Parallelism

Because little LLM data has been published for tiled architectures, we collect published LLM data from a GPU cluster to validate the scheduling and parallelism analysis. We replace the underlying 2D topology of PALM with the GPU topology. The results in Table IV indicate that the average total error of PALM's scheduling and parallelism analysis is less than 15%.

V-A3 Verification on tiled accelerator

We use PALM to simulate ResNet50 and BERT-base inference on the Tenstorrent Grayskull [40] architecture. By adjusting the mapping strategy, our simulated throughput has an error of less than 13% compared with the published throughput, as shown in Table V. In pipelined inference, data is input continuously and there is no backward pass. Therefore, the throughput we report ignores the pipeline drain time and setup time illustrated in Fig. 3.

TABLE IV: Performance comparison of PALM and Megatron published data.
Model | TP, DP, PP | PALM seq/s | Published seq/s¹ | Error %
T-18B | 8, 32, 1 | 114.294 | 116.415 | 1.82
T-39B | 8, 32, 2 | 100.230 | 111.565 | 10.16
T-76B | 8, 32, 4 | 96.601 | 115.898 | 15.65
T-145B | 8, 24, 8 | 83.888 | 95.720 | 12.36
T-310B | 8, 15, 16 | 51.140 | 58.738 | 12.94
T-530B | 8, 9, 35 | 40.007 | 47.440 | 15.60
  • ¹ Performance with mixed precision training.

TABLE V: Performance comparison of PALM and Grayskull published data.
Model name | PALM sample/s | Published sample/s | Error %
ResNet50 | 23033.46 | 22431¹ [50] | 2.68
BERT-base | 3190.12 | 2830 [40] | 12.72
  • ¹ Performance with INT8 computing power.
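The error column of Table V can be reproduced with a short calculation. The sketch below assumes that the error is the absolute difference relative to the published throughput, and it reads the ResNet50 published entry as 22431 samples/s followed by footnote marker 1; the function name `relative_error_pct` is hypothetical.

```python
def relative_error_pct(simulated: float, published: float) -> float:
    """Absolute error of the simulated value relative to the published one, in %."""
    return abs(simulated - published) / published * 100.0


# Rows of Table V: (model, PALM sample/s, published sample/s).
rows = [
    ("ResNet50", 23033.46, 22431.0),
    ("BERT-base", 3190.12, 2830.0),
]
for name, palm, published in rows:
    print(f"{name}: {relative_error_pct(palm, published):.2f} %")
```

Under these assumptions the computed values agree with the tabulated 2.68% and 12.72% to within 0.01 percentage points.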

V-B Parallelism of LLM on Wafer-scale Architecture

We explore the influence of wafer-scale architecture on the optimal parallelism of LLMs. Using PALM, we build a wafer-scale architecture with the parameters shown in Table VI. The overall system consists of a $5 \times 4$ tile array with $4 \times 4$ cores per tile, connected by tile-to-tile and core-to-core NoCs. We select models T-18B, T-76B, and T-145B as the baseline in Table VII, with (TP=8, DP=2, PP=20). The performance of the baseline is close to the results presented in Table IV.

V-B1 Optimal parallelism analysis

Figure 8: Position mapping in inter-tile groups and communication strategies in intra-tile groups.
Figure 9: The average utilization and absolute occupancy time of the NoC on the wafer-scale architecture for the T-145B task. (a) mapping1+comm2 strategy. (b) mapping2+comm1 strategy.

For a single transformer operator, the total communication size is determined by Eq. (6), which influences the communication latency at the top level.

$$Comm\_Size = \frac{8BSHN_{m}}{N} + \frac{24H^{2}}{N_{m}}, \tag{6}$$

where $B$, $S$, and $H$ represent the model parameters, $N_m$ represents the degree of TP, and $N$ represents the degree of DP multiplied by the degree of TP. In this experiment, $N$ is set to 16. To minimize the communication size, the optimal value of $N_m$ is 1.6, which is close to 2. The optimal throughputs shown in Fig. 10a and Fig. 10b validate this conclusion.
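For reference, the minimizer of Eq. (6) can be written in closed form. The derivation below treats $N_m$ as a continuous variable, which is an assumption made here for illustration; in practice $N_m$ is a positive integer.

```latex
\frac{\partial\, Comm\_Size}{\partial N_m}
  = \frac{8BSH}{N} - \frac{24H^{2}}{N_m^{2}} = 0
\quad\Longrightarrow\quad
N_m^{*} = \sqrt{\frac{3HN}{BS}},
\qquad
\frac{\partial^{2} Comm\_Size}{\partial N_m^{2}} = \frac{48H^{2}}{N_m^{3}} > 0 .
```

The positive second derivative confirms that $N_m^{*}$ is a minimum. Since the continuous optimum is generally not an integer, it has to be rounded to a feasible TP degree, which is how a value of 1.6 leads to $N_m = 2$.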

As illustrated in Fig. 9, the minimum average NoC occupancy time on the T-145B task is consistent with the (TP=2, DP=8) setting that minimizes the communication size. However, the optimal throughput corresponds to (TP=4, DP=4), as shown in Fig. 10c. This indicates that a minimal communication size does not always yield the best absolute performance; the actual architecture must be considered as well.

V-B2 Impact of position mapping for stage

Two common mapping layouts are illustrated in Fig. 8. The line layout arranges the pipeline vertically, with data passing vertically across stages, and intra-stage communication and memory access occurring horizontally. The S-shaped layout considers the trade-off between the furthest distance between mapped tiles and the boundary length of the tile group. In our experiments, the number of layers in the baseline model is the same as the number of tiles, with the $4 \times 4$ cores in a tile forming one stage. The high bandwidth within the tile supports DP and TP effectively, while inter-tile bandwidth is lower, aligning with the low communication requirements of PP.

Fig. 10 shows the experimental results, where mapping1 denotes the line layout and mapping2 denotes the S-shaped layout. The results show that the S-shaped layout performs better.

Figure 10: Performance comparison among various combinations of mapping methods and TP communication strategies on wafer-scale architecture. (a) T-18B. (b) T-76B. (c) T-145B.

V-B3 Impact of communication group in stage

comm1 places the tiles of a TP communication group as close together as possible in the topology, while comm2 does the opposite, as shown in Fig. 8. Fig. 10 also shows that the performance with comm1 is better. As analyzed earlier, when TP $\geq$ 2, the first term in Eq. (6) increases the communication size. When allocating TP within intra-tile groups, it is therefore crucial to minimize the distance between cores along the TP communication dimension to reduce communication time.

Based on the results, we conclude that even minor optimizations of the parallelism strategy can lead to a performance gap of at least $2\times$. This improvement comprises a 40% contribution from the stage position layout and a 60% contribution from operator-level parallelism and communication optimization.

TABLE VI: Wafer-scale Configuration Parameters.
Computing power of single tile | 256 TFlops@FP16
Capacity of single tile SRAM | 60 MB
Number of intra-tiles | 4 × 4
Edge shared DRAM per tile | 256 GB/s
Number of tiles | 5 × 4
NoC bandwidth of intra-tile | 1024 GB/s
NoC bandwidth of inter-tile | 256 GB/s
Topology | 2D-mesh
TABLE VII: Performance comparison of PALM on wafer-scale with GPU published data.
Model name | PALM sample/s | Published sample/s¹ | Gap %
T-18B | 7.3457 | 7.2760 | 0.9
T-76B | 2.0652 | 1.7968 | 14.94
T-145B | 1.1238 | 0.9896 | 13.56
  • ¹ Linear equivalence based on computational power.

V-C Communication Optimization

Due to the bandwidth limitations of the GPU cluster architecture, only a single communication strategy is available to it [51]. In wafer-scale systems, the intra- and inter-tile bandwidths are close, which supports different communication strategies to minimize costs. Adapter tiles [37] are the tiles within the destination group that receive data from the source tile group.

Two communication strategies for inter-tile groups are depicted in Fig. 11. The first performs an all-reduce within the source group, transmits the data to the destination, and broadcasts within the destination. The second performs a reduce within the source group according to the adapters, performs the inter-tile transmission, and then conducts an all-reduce among the adapters and a broadcast within the destination.

The inter-tile communication time of strategy 1 is given by Eq. (7), and that of strategy 2 by Eq. (8). In these equations, SG denotes the source tile group, DG the destination tile group, AR all-reduce, R reduce, and B broadcast.

$$T = T_{SG\_AR} + T_{Inter\_Comm} + T_{DG\_B}. \tag{7}$$
$$T = T_{SG\_R} + T_{Inter\_Comm} + T_{Adapters\_AR} + T_{DG\_B}. \tag{8}$$

Using the BERT-base model, we assess the performance of the two communication strategies. The first set of experiments compares 12-tile source and destination groups under ring-shaped all-reduce, while the second adds a tile to disrupt ring formation and reassesses performance.

Figure 11: Communication strategies in inter-tile groups.
Figure 12: Performance of diverse communication strategies in inter-tile groups. (a) Ring shape. (b) Non-ring shape.

In Fig. 12a, when a ring structure is formed in the source tile group, strategy 1 outperforms strategy 2 in inter-tile communication performance. This is due to the smaller overall latency of ring all-reduce, which results in a shorter communication time than strategy 2. Moreover, as more adapters participate in inter-tile communication, the performance of strategy 1 gradually improves because the broadcast time in the destination tile group is reduced. In Fig. 12b, when a ring structure cannot be formed, strategy 2 shows better communication performance. In this case, the total time of the reduce and the all-reduce in strategy 2 is smaller than the all-reduce time in the source group of strategy 1. Additionally, the performance of strategy 2 initially improves and then declines as the number of adapters increases, due to the trade-off between the reduce cost and the all-reduce time among adapters.

According to the results, inter-tile communication in ring-shaped configurations performs better under strategy 1, with a $3.08\times$ performance gap over strategy 2. Conversely, non-ring shapes are better suited to strategy 2, with a performance increase of approximately $1.23\times$ compared with strategy 1.

VI Related Work

Multiple works have aimed at predicting the performance of deep learning training workloads. Works [31, 52] designed automatic planners that partition the workload more evenly in order to reduce pipeline bubble time. Moreover, Diksha et al. provided an analytical model to predict the training time of distributed Transformers [35]. Rashidi et al. proposed a simulator named Astra-Sim [34] for hardware-software co-design exploration of deep learning training. However, Astra-Sim mainly focused on examining the impact of varied network topologies and neglected support for arbitrary parallelism. Its improved version, Astra-Sim 2.0 [38], was proposed to further provide a mechanism to represent and study arbitrary multi-dimensional topologies at scale, with different shapes and bandwidth configurations. However, all of the works mentioned above fail to model the spatial properties of tiled accelerators. Although work [39] designed an inter-layer scheduling space and exploration framework for tiled accelerators, it focused on DNN inference and operator mapping rather than on performance evaluation for DNN training.

VII Conclusion

We propose PALM, a simulator for evaluating tiled accelerators, including wafer-scale architectures, in DL training. We consider multiple dimensions that impact training, such as pipeline scheduling, parallelism, tile dataflow, and NoC congestion. Using PALM, we evaluate the training and inference throughput of LLM and ResNet models on several tiled accelerators. Compared with the published data, our results have an error of less than 16%. We also discuss the spatial optimization of parallelism strategies and communication. We hope that this work will be further refined in the future to guide subsequent research on mapping algorithms and tiled accelerator design.

References

  • [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-End Object Detection with Transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
  • [3] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929, 2020.
  • [4] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin Transformer V2: Scaling up Capacity and Resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022.
  • [5] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • [6] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
  • [7] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-Training. 2018.
  • [8] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • [9] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
  • [10] Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386, 2020.
  • [11] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
  • [12] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4490–4499, 2018.
  • [13] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMa: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971, 2023.
  • [14] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [15] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • [16] Xiaohui Wang, Yang Wei, Ying Xiong, Guyue Huang, Xian Qian, Yufei Ding, Mingxuan Wang, and Lei Li. Lightseq2: Accelerated training for transformer-based models on gpus. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14. IEEE, 2022.
  • [17] Jack Choquette. Nvidia hopper gpu: Scaling performance. In 2022 IEEE Hot Chips 34 Symposium (HCS), pages 1–46. IEEE Computer Society, 2022.
  • [18] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704, 2020.
  • [19] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019.
  • [20] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
  • [21] Apple. Apple A15 Bionic, 2021. https://en.wikipedia.org/wiki/Apple_A15.
  • [22] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. Tetris: Scalable and efficient neural network acceleration with 3D memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 751–764, 2017.
  • [23] Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, and Christos Kozyrakis. Tangram: Optimized coarse-grained dataflow for scalable NN accelerators. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 807–820, 2019.
  • [24] Norman P Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, et al. Ten lessons from three generations shaped Google’s TPUv4i: Industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 1–14. IEEE, 2021.
  • [25] Yakun Sophia Shao, Jason Clemons, Rangharajan Venkatesan, Brian Zimmer, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, et al. Simba: Scaling deep-learning inference with multi-chip-module-based architecture. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 14–27, 2019.
  • [26] Ofri Wechsler, Michael Behar, and Bharat Daga. Spring Hill (NNP-I 1000) Intel’s data center inference chip. In 2019 IEEE Hot Chips 31 Symposium (HCS), pages 1–12. IEEE Computer Society, 2019.
  • [27] Gordon Euhyun Moon, Hyoukjun Kwon, Geonhwa Jeong, Prasanth Chatarasi, Sivasankaran Rajamanickam, and Tushar Krishna. Evaluating spatial accelerator architectures with tiled matrix-matrix multiplication. IEEE Transactions on Parallel and Distributed Systems, 33(4):1002–1014, 2021.
  • [28] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.
  • [29] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. Mesh-TensorFlow: Deep learning for supercomputers. Advances in neural information processing systems, 31, 2018.
  • [30] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.
  • [31] Atli Kosson, Vitaliy Chiley, Abhinav Venigalla, Joel Hestness, and Urs Koster. Pipelined backpropagation at scale: training large models without batches. Proceedings of Machine Learning and Systems, 3:479–501, 2021.
  • [32] Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, and Tushar Krishna. Themis: A network bandwidth-aware collective scheduling policy for distributed training of DL models. In Proceedings of the 49th Annual International Symposium on Computer Architecture, pages 581–596, 2022.
  • [33] Benjamin Klenk, Nan Jiang, Greg Thorson, and Larry Dennison. An in-network architecture for accelerating shared-memory multiprocessor collectives. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 996–1009. IEEE, 2020.
  • [34] Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. ASTRA-sim: Enabling SW/HW co-design exploration for distributed DL training platforms. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 81–92. IEEE, 2020.
  • [35] Diksha Moolchandani, Joyjit Kundu, Frederik Ruelens, Peter Vrancx, Timon Evenblij, and Manu Perumkunnil. AMPeD: An analytical model for performance in distributed training of transformers. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 306–315. IEEE, 2023.
  • [36] Geoffrey X. Yu, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko. Habitat: A runtime-based computational performance predictor for deep neural network training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 503–521, 2021.
  • [37] Michael James, Marvin Tom, Patrick Groeneveld, and Vladimir Kibardin. ISPD 2020 physical mapping of neural networks on a wafer-scale deep learning accelerator. In Proceedings of the 2020 International Symposium on Physical Design, pages 145–149, 2020.
  • [38] William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 283–294. IEEE, 2023.
  • [39] Jingwei Cai, Yuchen Wei, Zuotong Wu, Sen Peng, and Kaisheng Ma. Inter-layer scheduling space definition and exploration for tiled accelerators. In Proceedings of the 50th Annual International Symposium on Computer Architecture, pages 1–17, 2023.
  • [40] Jasmina Vasiljevic, Ljubisa Bajic, Davor Capalija, Stanislav Sokorac, Dragoljub Ignjatovic, Lejla Bajic, Milos Trajkovic, Ivan Hamer, Ivan Matosevic, Aleksandar Cejkov, et al. Compute substrate for software 2.0. IEEE Micro, 41(2):50–55, 2021.
  • [41] Stewart Hall, Rob Schreiber, and Sean Lie (Cerebras Systems, Inc.). CS weight streaming white paper. https://8968533.fs1.hubspotusercontent-na1.net/hubfs/8968533/VirtualBoothDocs/CSWeightStreamingWhitePaper.pdf, 2023.
  • [42] Drago Ignjatović, Daniel W Bailey, and Ljubisa Bajić. The Wormhole AI training processor. In 2022 IEEE International Solid-State Circuits Conference (ISSCC), volume 65, pages 356–358. IEEE, 2022.
  • [43] Nvidia. Nvidia A100 Tensor Core GPU architecture. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.pdf, 2020.
  • [44] Emil Talpes, Debjit Das Sarma, Doug Williams, Sahil Arora, Thomas Kunjan, Benjamin Floering, Ankit Jalote, Christopher Hsiong, Chandrasekhar Poorna, Vaidehi Samant, John Sicilia, Anantha Kumar Nivarti, Raghuvir Ramachandran, Tim Fischer, Ben Herzberg, Bill McGee, Ganesh Venkataramanan, and Pete Banon. The microarchitecture of Dojo, Tesla’s exa-scale computer. IEEE Micro, 43(3):31–39, 2023.
  • [45] Sean Lie. Cerebras architecture deep dive: First look inside the HW/SW co-design for deep learning: Cerebras Systems. In 2022 IEEE Hot Chips 34 Symposium (HCS), pages 1–34. IEEE Computer Society, 2022.
  • [46] Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. SCALE-Sim: Systolic CNN accelerator simulator. arXiv preprint arXiv:1811.02883, 2018.
  • [47] PyTorch. torch.optim. https://pytorch.org/docs/stable/optim.html, 2023.
  • [48] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
  • [49] Klaus G. Müller and Tony Vignaux. SimPy: Discrete event simulation for Python. https://simpy.readthedocs.io/en/latest/, 2023.
  • [50] Linley Gwennap. Tenstorrent scales AI performance: New multicore architecture leads in data-center power efficiency, 2020.
  • [51] Yonghao Zhuang, Hexu Zhao, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, and Haotong Zhang. On optimizing the communication of model parallelism. arXiv preprint arXiv:2211.05322, 2022.
  • [52] Weijie Liu, Zhiquan Lai, Shengwei Li, Yabo Duan, Keshi Ge, and Dongsheng Li. AutoPipe: A fast pipeline parallelism approach with balanced partitioning and micro-batch slicing. In 2022 IEEE International Conference on Cluster Computing (CLUSTER), pages 301–312. IEEE, 2022.