PALM: An Efficient Performance Simulator for Tiled Accelerators with Large-scale Model Training
Abstract
Deep learning (DL) models are piquing high interest and scaling at an unprecedented rate. To this end, a handful of tiled accelerators have been proposed to support such large-scale training tasks. However, these accelerators often incorporate numerous cores or tiles even extending to wafer-scale, substantial on-chip bandwidth, and distributed memory systems. This results in an exceedingly complex design space. Moreover, conducting actual training experiments to find optimal configurations is impractical due to time constraints. Hence, predicting the optimal mapping of various parallelisms to such tiled system architectures becomes crucial. In this study, leveraging an analysis of existing mainstream DL model training strategies, we introduce a performance simulator named PALM. PALM targets both the training and inference processes for tiled accelerators, aiming to inspire the design of current and future accelerators. Specifically, (i) we establish a scheduling mechanism among tiled accelerators based on an event-driven framework; (ii) we support user-configurable pipeline, tensor, and data parallelism on tiled accelerators, determining the absolute performance throughput under these parallelism strategies; (iii) we model the interaction of on-chip SRAM, NoC, and off-chip DRAM during operator execution. This work is available here: https://github.com/fangjh21/PALM.
Index Terms:
Tiled Accelerator; Wafer-Scale; Pipeline Parallelism; Event-Driven
I Introduction
Deep learning (DL) and deep neural networks (DNN) play a crucial role in advancing artificial intelligence (AI) across diverse application domains, including image processing [1, 2, 3, 4, 5, 6], natural language processing [7, 8, 9], and autonomous driving [10, 11, 12]. As the popularity and applications of AI continue to grow, researchers are actively working to enhance the capabilities and accuracy of DNNs. This involves designing more complex networks and training them with extensive datasets, often comprising millions or even billions of samples [13, 14, 15]. However, these advancements come with the challenge of extended training times and skyrocketing memory requirements, thereby fueling the need for scalable high-performance training platforms. For example, training GPT-3 (175 B) on Nvidia Tesla V100 GPUs requires 3.1 million GPU hours and would cost around $4.6 million [16]. Even worse, the overall size of these huge models surpasses the physical memory capacity of a single accelerator. This holds true even for contemporary GPUs equipped with substantial memory, such as the 80 GB Nvidia H100 cards [17]. Therefore, numerous efforts have been devoted to expediting the training process by distributing it across multiple accelerators.
The fundamental concept behind distributed training is to allocate the independent computations of the model across multiple accelerators, facilitating parallel execution. Various parallelization strategies are available [18, 19, 20], each with its own set of advantages and drawbacks. Identifying the appropriate type and degree of parallelism to be leveraged under different constraints (such as budget, time, memory, and ease of implementation) can significantly enhance training throughput. However, it is impractical to find the optimal type and degree of parallelism by performing actual training experiments given some specific constraints due to the prohibitive expense. Although most academic projects leverage cloud frameworks like Microsoft Azure, Google Cloud Computing, or Amazon Web Services for training their proposed models, conducting these long-running experiments on cloud-hosted systems is also expensive as users are billed per hour. Therefore, an effective prediction for the training time under given workloads, parallelism configurations, and accelerator architectures becomes an indispensable part of the distributed training system design.
Recently, tiled accelerators [21, 22, 23, 24, 25, 26] have been recognized for their significant potential in distributed DL training tasks due to their higher utilization and energy efficiency [27]. These accelerators feature spatial multi-tiled architectures, with each hardware tile comprising a processing element (PE) array and a global buffer, interconnected by a network on chip (NoC). Therefore, it becomes crucial to perform simulation modeling for tiled accelerators. However, existing simulators often lack DL training support on tiled accelerators for the following reasons: (i) Current simulators adopt cycle-accurate or event-driven approaches but lack a scheduling mechanism for modeling a large number of tiles. (ii) These simulators lack user-configurable parallelism strategies, ignoring users' needs to optimize performance with hybrid parallelism strategies. (iii) Tiled accelerators exhibit spatial properties that involve interaction between DRAM and NoC bandwidth, which existing analytical models struggle to capture, while cycle-accurate models are cumbersome.
Given these insights, PALM is introduced as a simulator tailored for DL training on tiled accelerators. PALM utilizes three internal mechanisms to tackle these issues: (i) Virtual Tile Aggregation, which models pipeline execution and layer-wise execution of DL model training on tens to thousands of tiles; (ii) Adaptive Parallelism Interface, which supports user-configured parallelism strategies and spatial mapping, providing users with a broad search space; (iii) Detailed Bandwidth Model, which models bandwidth contention among multiple communication and memory access tasks. The main contributions of this work are summarized as follows:
- To the best of the authors' knowledge, PALM is the first simulator that considers the spatial property of tiled accelerators on DL training tasks with an event-driven mechanism.
- We identify three major challenges in modeling tiled accelerators: software overhead in simulating a large number of tiles, lack of user interfaces for configuring parallelism strategies, and difficulty in modeling the interaction between DRAM and NoC with existing methods.
- In response to these modeling challenges, we propose three corresponding mechanisms: Virtual Tile Aggregation, Adaptive Parallelism Interface, and Detailed Bandwidth Model.
- Through several case studies, we demonstrate PALM's modeling accuracy. Compared to published data, our average error remains within 17%. Additionally, we show that subtle differences in spatial mapping and parallelism within tiled accelerators result in a 2× performance gap. Finally, we delve into the optimization of communication across tile groups.
II Background
II-A Parallelism Schemes of Distributed Training
II-A1 Data Parallelism (DP)
As shown in Fig. 1(a), DP means each worker utilizes the same model to train on distinct micro-batches of data [20]. In DP, there is no synchronization between workers during forward computation, as each worker possesses a complete copy of the model. Storing the complete model structure and parameters on every worker, however, leads to a large memory footprint. Although no data synchronization is needed during the forward process, a gradient all-reduce is required as a collective operation during the backward process.
II-A2 Tensor Parallelism (TP)
In TP, the model weights are divided (depicted by diverse colors in Fig. 1(a)), while training data is duplicated across workers [28]. Consequently, each worker observes the same data but computes only a portion of the activation. These partial results must be communicated across workers within each layer during both forward and backward propagation. Compared to DP, the communication cost of TP is higher, but it effectively relieves the memory capacity pressure [29]. This allows multiple devices to jointly serve a larger model, addressing the challenge of fitting huge models onto limited hardware resources.
II-A3 Pipeline Parallelism (PP)
This parallelism divides the layers of the DL model among workers [19], as illustrated by the four white boxes in Fig. 1(a). Activations from a specific set of layers, assigned to one worker, are transferred to the subsequent set of layers, assigned to another worker. These consecutive layers operate on distinct data concurrently when the input batch is segmented into micro-batches that are sequentially fed to the pipeline workers. However, this strategy may introduce pipeline bubbles [30, 31], that is, periods during which an accelerator remains idle while awaiting data from the preceding accelerator in the pipeline.
II-B Collective Communication
Based on the chosen parallelization strategy, models and input batches are distributed across workers. This makes communication and synchronization of data, like forward activation or weight/input gradients, among devices inevitable [32]. This traffic is typically formulated and processed through collective communications. Four primary collective communication operations are key contributors in DNN training [33, 34]: (i) reduce-scatter, (ii) all-gather, (iii) all-reduce, (iv) all-to-all. In Fig. 1(b), reduce-scatter operation sums all initial data in workers, resulting in each worker holding a portion of globally reduced data. The all-gather operation gathers the data initially distributed across workers, ensuring each worker possesses the complete data. All-reduce can be regarded as a combination of reduce-scatter followed by an all-gather operation. In the all-to-all pattern, each node is required to send a distinct portion of data to other nodes.
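To make the four collectives concrete, the following minimal sketch emulates them on plain Python lists, with one list per worker. It only illustrates the data movement semantics described above; it is not how PALM or any communication library implements them, and the function names are chosen here for illustration.

```python
from typing import List

Vector = List[float]


def reduce_scatter(data: List[Vector]) -> List[Vector]:
    """Sum all workers' vectors; worker i keeps only the i-th chunk of the sum."""
    n = len(data)
    length = len(data[0])
    assert length % n == 0, "vector length must divide evenly among workers"
    chunk = length // n
    total = [sum(vec[j] for vec in data) for j in range(length)]
    return [total[i * chunk:(i + 1) * chunk] for i in range(n)]


def all_gather(chunks: List[Vector]) -> List[Vector]:
    """Concatenate every worker's chunk; each worker receives the full result."""
    full = [x for chunk in chunks for x in chunk]
    return [list(full) for _ in chunks]


def all_reduce(data: List[Vector]) -> List[Vector]:
    """All-reduce expressed as reduce-scatter followed by all-gather."""
    return all_gather(reduce_scatter(data))


def all_to_all(blocks: List[List[Vector]]) -> List[List[Vector]]:
    """blocks[i][j] is what worker i sends to worker j; result[j][i] is what j receives from i."""
    n = len(blocks)
    return [[blocks[i][j] for i in range(n)] for j in range(n)]


workers = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(reduce_scatter(workers))  # [[11, 22], [33, 44]]
print(all_reduce(workers))      # [[11, 22, 33, 44], [11, 22, 33, 44]]
```

Expressing all-reduce as the composition of the other two mirrors the decomposition stated in the text and makes it easy to see that each worker ends up with the complete reduced vector.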
II-C Tiled Accelerator
Fig. 1(c) illustrates the architecture for a tiled accelerator, which usually consists of multiple independent operating tiles. Each tile has its unique instruction queue, local memory and progresses at its own pace, which thus allows the tiled accelerators to specialize in supporting flexible dataflow and mapping. Moreover, the NoC is employed for transferring data among the tiles and synchronizing tiles at different stages throughout the program execution. Also, the NoC establishes connections among all tiles, as well as off-chip communication and memory controller blocks. As a result, each tile has access to the off-chip memory or other chips. Compared to traditional monolithic chips and single-tile SIMD GPUs, such architectures usually exhibit higher execution efficiency. Such improved efficiency comes from employing optimized dataflow strategies to spatially/temporally partition data across the tiles and fine-grained scheduling.
II-D Modeling Method for DL training on Hardware
II-D1 Analytical Model and Prediction Model
The analytical model [35, 36, 37] examines the DL model training process, using approximations to derive formulas over DL model and hardware parameters that estimate latency or energy consumption. While it provides a quick assessment, its reliability is moderate, and it may not fully capture the dynamic features of hardware systems. The prediction model [36] gathers throughput data and hardware-related information from DL training and fits a learned model, such as a multilayer perceptron (MLP), to them. However, its applicability is limited because it relies heavily on specific datasets and training conditions.
II-D2 Simulator
Existing simulators fall into two main categories: cycle-accurate and discrete event-driven [34, 38]. The former delves into low-level hardware logic, processing operations within each clock cycle with fine granularity and high-precision modeling, which suits scenarios with well-defined hardware architectures. Its drawbacks include a longer development cycle and extended software runtime. In contrast, discrete event-driven simulators trigger state changes through events, maintaining an event queue for each hardware component. These simulators run faster and are ideal for early-stage hardware development and architectural exploration.
III Motivation
Existing simulators and analytical or prediction models primarily focus on modeling GPU clusters but lack robust support for tiled accelerators. To inspire the design of tiled accelerators for DL training, based on the property of DL models and architecture, we identify the following three essential requirements: (i) Scheduling mechanism to model a large number of tiles; (ii) User-configurable parallelism strategies; (iii) Interaction between DRAM and NoC bandwidth.
III-A Scheduling mechanism to model a large number of tiles
A sensible modeling approach is essential for simulating the training process of DL models on a substantial number of tiles, as depicted in Fig. 1(c). Real tiled accelerator systems exhibit a range of scales, from 4×4 and 10×12 [39, 40] to a wafer-scale architecture of 633×633 [41]. A straightforward but very coarse approach is to assign each tile an independent thread or event queue. However, handling a large number of tiles with such a simulation mechanism would lead to a notable increase in software overhead. Therefore, to efficiently implement a tiled accelerator simulator for DL training tasks, it is imperative to introduce a dedicated scheduling mechanism among tiled accelerators.
III-B User-configurable parallelism strategies
Current simulators lack interfaces that support arbitrary parallelism strategies. Typically, users need to extract computation graphs with embedded parallelism information from established DL frameworks such as PyTorch and TensorFlow. This limitation prevents the direct iteration of parallelism strategies based on simulation results. Additionally, existing simulators do not support the various types of PP, an important parallelism strategy for LLMs, nor do they discuss the differences in bubble ratio and memory capacity requirements under PP. PP was proposed mainly to solve the storage problem of LLMs, but it suffers from poor resource utilization. On tiled accelerators, PP benefits from the large number of tiles: the pipeline can be split more evenly, the number of pipeline stages can be increased, and the bubble ratio can be reduced. TP and DP are two inherent parallelism strategies. In tiled accelerators, when some tiles/cores form a tile group to execute the same operator, certain dimensions must be segmented as illustrated in Fig. 1(a). Hence, it is crucial to offer a flexible user-visible interface that supports parallelism across various dimensions.
III-C Interaction between DRAM and NoC bandwidth
SRAM, being faster but costlier than DRAM, is utilized to temporarily store data for computation and exchange data with DRAM. Table I indicates that the SRAM capacity per computing power unit in tiled accelerators surpasses that in traditional GPUs. Specifically, WSE’s SRAM capacity per computing unit is nearly that of GPU A100. Studies [31, 40] explore using SRAM to statically store frequently read data, accelerating tile computation based on dataflow. Recognizing the significant role of SRAM in computation, memory access, and communication is thus reasonable.
Efficient model training relies on DRAM with large capacity and high bandwidth. DRAM is crucial for storing extensive model parameters, intermediate activations, and optimizer states during training. Tiled accelerators, designed for high-density computing power, differ significantly from GPUs in their memory hierarchy. For example, in the WSE-2 system [41], of which the computing power is equivalent to 46 GPUs, there is no on-wafer DRAM; instead, DRAM is located off the wafer. Consequently, DRAM access in tiled accelerators becomes costly due to NoC routing, as depicted in Fig. 1(c). Therefore, modeling DRAM behavior is crucial to accurately reflect practical behaviors of tiled accelerators.
NoC acts as a physical bridge among tiles [39, 40, 42], impacting communication between pipeline stages generated by mapping and parallelism, as well as intra-stage communication. Frequent DRAM access will occupy NoC bandwidth. In Table I, various tiled accelerators exhibit different NoC hop counts to DRAM, presenting a disadvantage for on-chip access tasks. Additionally, in the same table, the Link bandwidth-to-DRAM bandwidth ratio is higher in tiled accelerators, providing an advantage for communication tasks.
In summary, it is essential to model the behavior of SRAM, DRAM, and NoC during the training process to accurately reflect the architectural characteristics of tiled accelerators.
Factors | Type | Direct effect |
---|---|---|
Pipe schedule | GPipe, (interleaved) 1F1B [20] | bubble & mem. |
Parallelism | PP, DP, TP | latency & mem. |
Tile dataflow [46] | IS, WS | access times |
Optimizer [47] | SGD, Adam | mem. |
ZeRO [48] | ZeRO | latency & mem. |
Congestion | NoC, DRAM | latency |
IV The Making of PALM
Fig. 2 shows the overall framework of PALM, and the main factors considered by PALM are summarized in Table II. PALM is built on the discrete event-driven framework SimPy [49]. Moreover, PALM models a two-level tiled accelerator, as shown in Fig. 1. This section introduces how PALM efficiently obtains performance throughput from DL models, hardware configurations, and other settings.
IV-A Virtual Tile Aggregation
We distinguish between the pipeline scheduling mechanism and pipeline parallelism. The former concerns modeling the training process effectively, while the latter involves partitioning the computation graph into stages, as discussed in the next subsection.
The pipeline scheduling includes two mechanisms: pipeline execution and layer-wise execution [31]. In our modeling, layer-wise execution is treated as pipeline execution with a depth of 1. Fig. 4 illustrates the pipeline scheduling process: the computation graph is partitioned into stages (S0, S1, S2), and each stage is mapped to a tile group based on the parallelism strategy. The pipeline is divided into three processes: Forward (FD), representing the forward computation of all operators in each stage; Backward (BD), representing the backward propagation of all operators in each stage; and Gradient Update (GU), representing the gradient update process. Additionally, PALM defines Act/Grad Pass to transfer activations/gradients across stages, serving as the start signal for the next stage.
In DL, a batch (mini-batch) is taken as the period for gradient updates. To reduce the pipeline bubble ratio, a batch is divided into multiple micro-batches, with each micro-batch executing FD and BD. Once all micro-batches are completed, GU is executed. Data_Fetch simulates the input data fetching of one micro-batch, representing the start of the first stage S0. Our scheduling mechanism supports the GPipe [19] and 1F1B [20] schedules shown in Fig. 3. PALM places one of the four types of events into the Virtual Tile Executor based on the signal selected by the Prior Selector. For example, in the 1F1B pipeline, BD is given priority over FD. The Act/Grad Pass between different stages is accomplished through communication events on the NoC. This process is primarily determined by the dependency relationships between adjacent operators in different stages, which will be discussed in the next subsection.
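The priority-based selection described above can be sketched as a small event-driven loop. The code below is an illustrative toy, not PALM's implementation: it uses a plain heap instead of SimPy, assumes every stage has the same FD and BD duration, ignores Act/Grad Pass latency and GU, and assumes that 1F1B additionally caps the number of in-flight micro-batches at stage s to (number of stages minus s). The names `simulate_pipeline` and `select` are hypothetical.

```python
import heapq
from collections import deque


def simulate_pipeline(num_stages, num_micro, t_fd=1.0, t_bd=2.0, schedule="1f1b"):
    """Return (makespan, peak number of stored activations per stage)."""
    P, M = num_stages, num_micro
    fd_q = [deque() for _ in range(P)]  # micro-batches ready for FD per stage
    bd_q = [deque() for _ in range(P)]  # micro-batches ready for BD per stage
    fd_q[0].extend(range(M))            # Data_Fetch: all inputs wait at stage 0
    idle = [True] * P
    in_flight = [0] * P                 # FD started but BD not yet finished
    peak = [0] * P
    events = []                         # (finish_time, seq, kind, stage, micro_batch)
    now, seq = 0.0, 0

    def select(s):
        # Toy stand-in for the priority selector: GPipe prefers FD, 1F1B prefers BD.
        cap = M if schedule == "gpipe" else min(M, P - s)  # assumed in-flight limit
        fd_ok = bool(fd_q[s]) and in_flight[s] < cap
        order = ("FD", "BD") if schedule == "gpipe" else ("BD", "FD")
        for kind in order:
            if (kind == "FD" and fd_ok) or (kind == "BD" and bd_q[s]):
                return kind
        return None

    while True:
        # Dispatch one event on every idle stage that has something to run.
        for s in range(P):
            kind = select(s) if idle[s] else None
            if kind is None:
                continue
            if kind == "FD":
                mb, dur = fd_q[s].popleft(), t_fd
                in_flight[s] += 1
                peak[s] = max(peak[s], in_flight[s])
            else:
                mb, dur = bd_q[s].popleft(), t_bd
            idle[s] = False
            heapq.heappush(events, (now + dur, seq, kind, s, mb))
            seq += 1
        if not events:
            return now, peak
        # Advance time and process every completion that happens at that instant.
        now = events[0][0]
        while events and events[0][0] == now:
            _, _, kind, s, mb = heapq.heappop(events)
            idle[s] = True
            if kind == "FD":
                # Act Pass to the next stage, or start the backward pass at the last stage.
                (fd_q[s + 1] if s + 1 < P else bd_q[s]).append(mb)
            else:
                in_flight[s] -= 1
                if s > 0:
                    bd_q[s - 1].append(mb)  # Grad Pass to the previous stage


for name in ("gpipe", "1f1b"):
    makespan, peak = simulate_pipeline(4, 8, schedule=name)
    print(name, makespan, peak)
```

With 4 stages and 8 micro-batches, both schedules finish at the same time under these assumptions, while the first stage holds 8 activations under GPipe and only 4 under 1F1B, which is the memory difference the scheduling choice is meant to expose.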
Within each stage, operators are executed in the order of their dependency relationships, such as the operators of S1 in Fig. 3, as in layer-wise execution. Operators without dependencies are executed either in their pre-order position in the computation graph or in parallel. When tiles/cores execute the same operator, they are called a tile group. At the tile analysis level (tile analyzer in Fig. 2), PALM assumes that the tiles in each tile group have the same computation and memory access cost. Therefore, each stage furnishes only one or a few simulated tiles, denoted as virtual tiles, to represent the tiles in its tile group. We call this modeling method Virtual Tile Aggregation.
We assume that a single tile mainly consists of two entities: the tile internal logic unit and NoC router which have their own event queue. Additionally, we suppose the number of tiles is , and the number of stages is less than or equal to the number of layers in the computation graph. The naive modeling complexity is for all tiles, while PALM with virtual tile aggregation reduces it to . By incorporating an analytical model for the NoC, the complexity is further reduced to . Given that typically falls in the range of tens to hundreds, this significantly alleviates the modeling overhead.
In PALM, each operator also generates three types of events: forward, backward, and gradient update. Each type of event is further divided into computation, communication, and memory access tasks. Fig. 5 describes the main events during backward execution. For each operator, the backward process includes loss computation, activation re-computation, and gradient computation. Activation re-computation occurs only when there is insufficient memory capacity. Each sub-process requires accessing data from memory for computation, with non-negligible communication overhead. The next sub-process begins only after the completion of the current sub-process. For example, in the Recompute sub-process, we wait for the completion of the Loss computation event before entering the Gradient sub-process. During the three sub-processes, DP communication from the previous operator can overlap with the current operator's execution. The forward process is similar to the re-computation in the backward process and is not separately listed here. The main events in the gradient update process only include loading full-precision weights from DRAM and storing them back to DRAM; we omit the accumulation computation in the gradient update process.
IV-B Adaptive Parallelism Interface
❶ PP. PP partitions operators of the computation graph into different stages to minimize the pipeline bubble. The ideal execution time in the pipeline training scenario can be evaluated using Eq. (1).
(1) | ||||
where is executing time, is the batch size and is the micro-batch size. In fact, on the tiled accelerator, the execution time is influenced by the spatial position of the physical tiles corresponding to the stages. We will further discuss this phenomenon with experiments in Section V-B2. PALM takes into account that PP results in differences in memory capacity requirements, as discussed in [41]. Considering a training pipeline with stages, activations from each stage are stored in the FD process, until they are consumed for GU in the BD process. For example, the first stage should store times the activation in 1F1B, and times the activation in GPipe as illustrated in Fig. 3. Incorporating the aforementioned considerations into PP modeling, PALM supports users to bind stages based on tile IDs and op IDs with Adaptive Parallelism Interface in Fig. 2, and provides a default way for DL models to allocate stages based on computing power requirements.
❷ TP and DP. We analyzed the all-reduce communication size generated by TP and DP strategies in common operators, as shown in Table III. PALM partitions the mapped physical tile groups into communication groups and automatically inserts collective communication events into the tile group event queue. Take the simple linear operator as an example. It has four dimensions [B, M, N, K] (Table III), covering the batch size, the reduce dimension, and the output dimension of its input, weight, and output tensors. The lowercase values (b, m, n, k) give the parallelism degree of each corresponding dimension. If we map the operator onto 16 tiles numbered 0 to 15, the product of the parallelism degrees must match the 16 tiles. The user configures the parallelism strategy by choosing these degrees, and the corresponding communication groups are generated automatically. During the FD, BD, and GU processes, collective communication is needed within the corresponding tile groups. The parallelism of other operators such as Conv2 and Pool is handled in the same way.
For simplicity, the dimensions and parallelism degrees assumed for the Conv2 and Pool operators are those listed in Table III; for the Pool operator, the last parallelism degree is fixed to 1. The all-reduce communication size is also given in Table III. The transformer operator is a combination of a series of linear operators. We support both DP and TP as described by Megatron [20], and the communication sizes generated by splitting these linear operators are accumulated. These parallelism dimensions can also be configured by the user through the interface in Fig. 2.
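One way to picture the automatic generation of communication groups is to lay the tile IDs out on a grid whose axes are the parallelism degrees, and to group tiles that differ only along one axis. The sketch below does this for the linear operator on 16 tiles. The row-major tile ordering, the function name `communication_groups`, and the choice of which axis is treated as the reduce dimension are assumptions made for illustration, not PALM's documented interface.

```python
from itertools import product
from math import prod


def communication_groups(tile_ids, degrees, axis):
    """Group tiles whose grid coordinates differ only along `axis`."""
    if prod(degrees) != len(tile_ids):
        raise ValueError("product of parallelism degrees must equal the tile count")
    # Assumed layout: tiles are assigned grid coordinates in row-major order.
    coords = product(*(range(d) for d in degrees))
    groups = {}
    for tile, coord in zip(tile_ids, coords):
        key = coord[:axis] + coord[axis + 1:]  # coordinates shared by the group
        groups.setdefault(key, []).append(tile)
    return list(groups.values())


tiles = list(range(16))
degrees = (4, 1, 4, 1)  # hypothetical (b, m, n, k): 4-way batch split, 4-way split of axis 2

# Tiles splitting axis 2 hold partial sums and would all-reduce their outputs.
print(communication_groups(tiles, degrees, axis=2))
# Tiles splitting the batch axis hold weight replicas and would all-reduce gradients.
print(communication_groups(tiles, degrees, axis=0))
```

With these degrees the first call yields four groups of consecutive tiles and the second yields four strided groups, so the same 16 tiles participate in two different collectives depending on the training phase.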
Operator | Dimensions | Parallelism degrees | All-reduce communication size |
---|---|---|---|
Linear | [B,M,N,K] | (b,m,n,k) | |
Conv2 | [B,H,W,C,R,S,K] | (b,c,i,k) | |
Pool | [B,H,W,C,R,S] | (b,c,i,1) | N/A |
Transformer | [B,H,S,A] | (,1,1)¹ | |

¹ Megatron [20]
IV-C Detailed Bandwidth Model
❶ SRAM allocation. PALM holds the view that SRAM primarily influences DRAM access. Alg. 1 explains the main modeling idea: Operators are split by the parallelism strategy and their corresponding tiles are taken as input to obtain the corresponding SRAM strategy and DRAM access size in forward process. Strategies , , , respectively represent weights, optimizer states, weight gradients (WSG) and input/output activation () either statically stored in on-chip SRAM, one of them stored in on-chip SRAM, or none stored on-chip. It is worth noting that when WSG and ACT cannot be retained in SRAM for a long time, PALM adopts a penalty strategy , modeling extra DRAM accesses for WSG and ACT. When , we use input stationary (IS), otherwise, we use weight stationary (WS). PALM considers storage differences brought about by the optimizer. For optimizer Adam, it requires storage for first-order and second-order moments related to weights, and gradients of backward activations, significantly increasing storage requirements. If optimizer SGD is used, there is no overhead for optimizer states. During inference, there is no storage overhead for gradients. Alg. 1 only lists the DRAM access size for the forward process. The analysis for the backward and gradient update process follows the same methodology, thus being neglected here.
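As a rough illustration of the SRAM allocation idea, the sketch below decides which of the two tensor groups (WSG and ACT) can stay resident in a tile's SRAM and how many bytes would then have to come from DRAM in the forward pass. It is not Alg. 1: the tie-breaking rule (keep the larger group when only one fits), the 2-byte weights and gradients, the 4-byte Adam moments, and all function names are assumptions introduced here.

```python
MB = 1024 * 1024


def optimizer_state_bytes(num_params, optimizer, bytes_per_value=4):
    """Extra state per parameter: none for SGD, two moments for Adam (assumed FP32)."""
    moments = {"sgd": 0, "adam": 2}[optimizer]
    return moments * num_params * bytes_per_value


def choose_sram_strategy(wsg_bytes, act_bytes, sram_bytes):
    """Return the tensor groups kept resident in SRAM: both, one, or none."""
    if wsg_bytes + act_bytes <= sram_bytes:
        return ("WSG", "ACT")
    fits = [(size, name)
            for size, name in ((wsg_bytes, "WSG"), (act_bytes, "ACT"))
            if size <= sram_bytes]
    # Assumed policy: keep the larger group that fits, to save the most DRAM traffic.
    return (max(fits)[1],) if fits else ()


def forward_dram_bytes(wsg_bytes, act_bytes, resident):
    """Bytes that must be fetched from DRAM because they are not resident in SRAM."""
    sizes = {"WSG": wsg_bytes, "ACT": act_bytes}
    return sum(size for name, size in sizes.items() if name not in resident)


params = 4 * MB                 # hypothetical parameters mapped to one tile
act = 20 * MB                   # hypothetical activation footprint
sram = 60 * MB                  # per-tile SRAM capacity used in the case study
for opt in ("sgd", "adam"):
    wsg = 2 * params + 2 * params + optimizer_state_bytes(params, opt)
    resident = choose_sram_strategy(wsg, act, sram)
    print(opt, resident, forward_dram_bytes(wsg, act, resident) // MB, "MB from DRAM")
```

Under these assumed sizes, SGD leaves room for both groups in SRAM, whereas the Adam moments push the activations out to DRAM, which is the kind of optimizer-dependent difference the allocation step has to capture.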
❷ Detailed NoC model. The ideal communication latency of the NoC can be obtained using Eq. (2), where represents single hop link delay and represents the total number of hops in the communication path. However, the analytical model [38] does not consider whether all links are idle at a given moment in a transmission path. Hence, the specific latency of contention_delay can not be obtained by the analytical method. In the presence of congestion, the communication time may degrade to Eq. (3) in the analytical model, which means a hop-by-hop data transmission, without forming a pipeline transmission along the link. But it is equivalent to reducing the bandwidth of the NoC by times. Even the modeling of the latter cannot guarantee that the single hop transmission is not occupied by other tasks.
(2) | ||||
(3) |
PALM considers NoC congestion, treating each link as an exclusive resource during execution. When a link is occupied by the current task, the execution time can be obtained by Eq. (2). A communication task can only be executed when the links it needs are not occupied; otherwise, it waits for those resources to be released.
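The exclusive-link rule can be illustrated with a few lines of Python. The sketch assumes that the uncontended latency is the serialization time plus a per-hop delay, that a task reserves every link on its path for its whole duration, and that tasks are served in list order; these simplifications and all names are introduced here for illustration and are not taken from PALM.

```python
def ideal_latency(size, hops, link_bw, hop_delay):
    """Assumed uncontended latency: pipelined transfer plus per-hop link delay."""
    return size / link_bw + hops * hop_delay


def schedule_with_contention(tasks, link_bw, hop_delay):
    """tasks: list of (ready_time, size, path), where path is a list of link ids.

    Each link is an exclusive resource: a task starts only when every link on
    its path is free, and holds all of them until it finishes.
    """
    free_at = {}   # link id -> time at which the link is released
    finish = []
    for ready, size, path in tasks:
        start = max([ready] + [free_at.get(link, 0.0) for link in path])
        end = start + ideal_latency(size, len(path), link_bw, hop_delay)
        for link in path:
            free_at[link] = end
        finish.append(end)
    return finish


tasks = [
    (0.0, 8, [("t0", "t1"), ("t1", "t2")]),  # two-hop transfer
    (0.0, 4, [("t1", "t2")]),                # shares a link with the first task
    (0.0, 4, [("t3", "t4")]),                # disjoint path, no waiting
]
print(schedule_with_contention(tasks, link_bw=1.0, hop_delay=0.5))  # [9.0, 13.5, 4.5]
```

The second task would take 4.5 time units on an idle network but finishes at 13.5 because it has to wait for the shared link, which is exactly the delay a purely analytical latency formula cannot see.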
❸ Detailed DRAM model. Through the analysis of SRAM, the size of each DRAM access has been determined, and ideally the memory access latency can be obtained using Eq. (4). However, in the tiled accelerator, the DRAM is shared among tiles. Because tiles sit at different distances from DRAM and initiate memory access requests at different times, it is not known in advance whether the DRAM bandwidth is occupied at a particular moment. Eq. (4) therefore cannot accurately represent the memory access latency.
(4) |
(5) |
Based on the above equation, PALM constructs a memory access model for edge-shared DRAM in tiled accelerators. Like the NoC model, PALM treats DRAM bandwidth as a resource that is occupied during execution. The data transmission time through the NoC is also taken into account. Therefore, the total DRAM access time of a tile can be obtained using Eq. (5).
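A minimal sketch of this shared-DRAM idea follows. It assumes a single DRAM channel that serves one request at a time in order of arrival, a NoC transit time proportional to the hop count, and no cost for the return path; the function and parameter names are hypothetical and the timing rule is a simplification rather than PALM's Eq. (5).

```python
def dram_access_times(requests, dram_bw, hop_delay):
    """requests: list of (tile, issue_time, size, hops_to_dram).

    Returns {tile: total access time}, where the total is NoC transit time,
    plus any wait for the shared DRAM, plus the transfer time size / dram_bw.
    """
    dram_free = 0.0
    totals = {}
    # Serve requests in the order in which they reach the DRAM controller.
    by_arrival = sorted(requests, key=lambda r: r[1] + r[3] * hop_delay)
    for tile, issue, size, hops in by_arrival:
        arrival = issue + hops * hop_delay
        start = max(arrival, dram_free)   # wait while the bandwidth is occupied
        dram_free = start + size / dram_bw
        totals[tile] = dram_free - issue
    return totals


requests = [
    ("near", 0.0, 8, 1),  # one hop from the edge DRAM
    ("far", 0.0, 8, 4),   # four hops away, issued at the same time
]
print(dram_access_times(requests, dram_bw=2.0, hop_delay=1.0))  # {'near': 5.0, 'far': 9.0}
```

The far tile would need 8 time units on an idle system (4 for transit, 4 for the transfer) but takes 9 because the near tile still occupies the DRAM when its request arrives, showing how distance and request timing interact.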
V Case Study
V-A Verification of Simulation Accuracy
V-A1 Verification of NoC model and DRAM model
To validate the NoC model, we run a basic ring all-reduce task on PALM. As depicted in Fig. 6, the error on 4 and 16 tiles is within 5% compared with the results from a real GPU system with ring topology in [38].
To validate the congestion phenomenon, we conduct the experiments in Fig. 7, which overlap all-reduce, all-to-all, and DRAM read and write tasks in different task combinations. The results show that the execution time of the analytical model is at most 50% less than that of the congestion model. When the number of tasks is 5 and the communication/access size of a single task is 8 MB, the execution time of the analytical model is 30% less, and it stabilizes at this value as the communication/access size increases. According to the previous analysis, these numerical differences reflect the modeling error of the analytical model. This indicates that PALM's congestion modeling is necessary in such scenarios.
V-A2 Verification of Scheduling and Parallelism
Because LLM data for tiled architectures is limited, we collect published LLM data from GPU clusters to validate the scheduling and parallelism analysis. We replace the underlying 2D topology of PALM with a GPU topology. The results in Table IV indicate that the average total error of PALM's scheduling and parallelism analysis is less than 15%.
V-A3 Verification on tiled accelerator
We use PALM to simulate the ResNet50 and BERT-base inference tasks on the Tenstorrent Grayskull [40] architecture. By adjusting the mapping strategy, our simulated throughput has an error of less than 13% compared to the published throughput, as shown in Table V. In pipeline inference, there is continuous data input without a backward process. Therefore, the throughput we report ignores the pipeline drain time and setup time illustrated in Fig. 3.
Table IV: PALM and Megatron published data.
Model | TP, DP, PP | PALM seq/s | Published seq/s1 | Error % |
---|---|---|---|---|
T-18B | 8, 32, 1 | 114.294 | 116.415 | 1.82 |
T-39B | 8, 32, 2 | 100.230 | 111.565 | 10.16 |
T-76B | 8, 32, 4 | 96.601 | 115.898 | 15.65 |
T-145B | 8, 24, 8 | 83.888 | 95.720 | 12.36 |
T-310B | 8, 15, 16 | 51.140 | 58.738 | 12.94 |
T-530B | 8, 9, 35 | 40.007 | 47.440 | 15.60 |
¹ Performance with mixed precision training.
V-B Parallelism of LLM on Wafer-scale Architecture
We explore the influence of wafer-scale architecture on the optimal parallelism of LLM. Based on PALM, we build a wafer-scale architecture with specific parameters, as shown in Table VI. The overall system consists of a tile array with core per tile, communicated with tile-to-tile and core-to-core NoC. We have selected models T-18B, T-76B, and T-145B as the baseline in Table VII, with (TP=8, DP=2, PP=20). The performance of the baseline is close to the result presented in Table IV.
V-B1 Optimal parallelism analysis
For a single transformer operator, the total communication size is determined by Eq. (6), which influences the communication latency at the top level.
(6) |
where , , and represent the model parameters. represents the degree of TP, and represents the degree of DP multiplied by TP. In this experiment, is set to 16. To minimize communication size, the optimal value for is 1.6, close to 2. The optimal throughputs shown in Fig. 10a and Fig. 10b validate this conclusion.
As illustrated in Fig. 9, the minimum average NoC occupancy time on T-145B task is consistent with (TP=2, DP=8) to minimize communication size. However, the optimal throughput corresponds to (TP=4, DP=4) as shown in Fig. 10c. This indicates that minimal communication size does not always lead to absolute performance optimization, and actual architecture needs to be considered as well.
V-B2 Impact of position mapping for stage
Two common mapping layouts are illustrated in Fig. 8. The line layout arranges the pipeline vertically, with data passing vertically across stages, and intra-stage communication and memory access occurring horizontally. The S-shaped layout considers the trade-off of the furthest distance between mapped tiles and the boundary length of the tile group. In our experiments, the number of layers in the baseline model is the same as the number of tiles, with the cores in a tile forming one stage. The high bandwidth within the tile supports DP and TP effectively, while inter-tile bandwidth is lower, aligning with the low communication requirements of PP.
Fig. 10 illustrates the experimental results, where mapping1 represents the Line layout and mapping2 represents the S-shaped layout. The results validate that the S-shaped layout exhibits better performance.
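One way to see why an S-shaped placement can help is to look at the hop distance between consecutive pipeline stages on a 2D mesh. The sketch below is a simplified illustration, not the layouts of Fig. 8: it compares a serpentine (S-shaped) ordering of stages against a plain row-major ordering, which is used here only as a generic reference and is an assumption of this example. The 4×5 mesh size is likewise assumed for illustration.

```python
def row_major_order(rows, cols):
    """Visit tiles row by row, always left to right."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def serpentine_order(rows, cols):
    """Visit tiles row by row, reversing direction on every other row."""
    order = []
    for r in range(rows):
        columns = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in columns)
    return order


def max_stage_hop(order):
    """Largest Manhattan distance between tiles of consecutive stages."""
    return max(
        abs(r1 - r0) + abs(c1 - c0)
        for (r0, c0), (r1, c1) in zip(order, order[1:])
    )


# Hypothetical 4 x 5 mesh with one pipeline stage per tile.
row_major_hop = max_stage_hop(row_major_order(4, 5))
serpentine_hop = max_stage_hop(serpentine_order(4, 5))
```

In this simplified model the serpentine ordering keeps every pair of consecutive stages on neighboring tiles, while the row-major ordering incurs a long hop at each row boundary.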
V-B3 Impact of communication group in stage
comm1 places the TP communication group as close together as possible in the topology, while comm2 does the opposite, as shown in Fig. 8. Fig. 10 also shows that the performance with comm1 is better. As analyzed earlier, when TP > 2, the first term in Eq. (6) contributes to an increasing communication size. When allocating TP within a group, it is therefore crucial to prioritize minimizing the distance between cores along the TP communication dimension to reduce communication time.
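The effect of group placement can be illustrated by counting mesh hops around a TP ring. The following sketch is a toy model under stated assumptions: the core coordinates are hypothetical, the TP group has four members, and hop count is used as a stand-in for communication time. It is not the comm1 and comm2 placement of Fig. 8.

```python
def ring_hops(coords):
    """Total Manhattan hops when each member sends to its ring successor."""
    n = len(coords)
    total = 0
    for i in range(n):
        r0, c0 = coords[i]
        r1, c1 = coords[(i + 1) % n]  # wrap around to close the ring
        total += abs(r1 - r0) + abs(c1 - c0)
    return total


# Hypothetical placements of a 4-member TP group on a core mesh.
compact_group = [(0, 0), (0, 1), (1, 1), (1, 0)]  # adjacent 2 x 2 block
spread_group = [(0, 0), (0, 2), (2, 2), (2, 0)]   # members two hops apart

compact_hops = ring_hops(compact_group)
spread_hops = ring_hops(spread_group)
```

Under this model the compact placement halves the hop count of the spread one, consistent with the observation that keeping TP members close reduces communication time.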
Based on the results, we conclude that even minor optimizations of the parallelism strategy can lead to a significant performance gap. This improvement comprises a 40% contribution from the stage position layout and a 60% contribution from operator-level parallelism and communication optimization.
Parameter | Value |
---|---|
Computing power of single tile | 256 TFlops@FP16 |
Capacity of single tile SRAM | 60 MB |
Number of intra-tiles | |
Edge shared DRAM per tile | 256 GB/s |
Number of tiles | |
NoC bandwidth of intra-tile | 1024 GB/s |
NoC bandwidth of inter-tile | 256 GB/s |
Topology | 2D-mesh |
Table VII: PALM on wafer-scale with GPU published data.
Model name | PALM sample/s | Published sample/s¹ | Gap % |
---|---|---|---|
T-18B | 7.3457 | 7.2760 | 0.9 |
T-76B | 2.0652 | 1.7968 | 14.94 |
T-145B | 1.1238 | 0.9896 | 13.56 |
¹ Linear equivalence based on computational power.
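The gap column of the wafer-scale comparison can be reproduced from the two throughput columns. The sketch below assumes the gap is the absolute difference relative to the published value, which is an assumption of this example rather than a definition stated in the text; the throughput values are copied from the table.

```python
def gap_percent(simulated, published):
    """Absolute relative difference, in percent, with respect to `published`."""
    return abs(simulated - published) / published * 100.0


# (PALM sample/s, published sample/s) taken from the wafer-scale comparison table.
wafer_scale_results = {
    "T-18B": (7.3457, 7.2760),
    "T-76B": (2.0652, 1.7968),
    "T-145B": (1.1238, 0.9896),
}

gaps = {
    model: round(gap_percent(simulated, published), 2)
    for model, (simulated, published) in wafer_scale_results.items()
}
```

Under this definition the computed gaps for T-76B and T-145B match the tabulated values to two decimals, and the T-18B gap is just under 1%.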
V-C Communication Optimization
Because of the bandwidth limitations of the GPU cluster architecture, only a single communication strategy is available for it [51]. In wafer-scale systems, the intra-tile and inter-tile bandwidths are close enough to support different communication strategies that minimize cost. Adapter tiles [37] are the tiles within the destination group that receive data from the source tile group.
Two communication strategies for inter-tile groups are depicted in Fig. 11. The first performs an all-reduce within the source group, transmits the data to the destination group, and broadcasts within the destination group. The second performs a reduce in the source group according to the adapters, transmits the data between tile groups, and then performs an all-reduce and a broadcast in the destination group.
Strategy 1’s inter-tile communication time is given by Eq. (7), and Strategy 2’s by Eq. (8). In these equations, SG denotes the source tile group, DG denotes the destination tile group, AR denotes all-reduce, R denotes reduce, and B denotes broadcast.
(7) |
(8) |
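As a rough aid to reading the two strategies, the sketch below composes each one as a sum of its sequential phases, following the verbal description of the strategies. This additive form is an assumption of the example and may differ from the exact formulas; all function and parameter names are hypothetical. The ring all-reduce helper uses the well-known bandwidth term 2(p-1)/p · S/B and ignores per-message latency.

```python
def ring_all_reduce_time(message_bytes, num_tiles, bandwidth_bytes_per_s):
    """Bandwidth term of a ring all-reduce over `num_tiles` participants."""
    if num_tiles < 2:
        return 0.0  # nothing to exchange with a single participant
    volume_factor = 2 * (num_tiles - 1) / num_tiles
    return volume_factor * message_bytes / bandwidth_bytes_per_s


def strategy1_time(t_ar_sg, t_transfer, t_b_dg):
    """All-reduce in SG, then inter-group transfer, then broadcast in DG."""
    return t_ar_sg + t_transfer + t_b_dg


def strategy2_time(t_r_sg, t_transfer, t_ar_dg, t_b_dg):
    """Reduce in SG, inter-group transfer, then all-reduce and broadcast in DG."""
    return t_r_sg + t_transfer + t_ar_dg + t_b_dg


# Toy numbers, chosen only to exercise the functions.
example_ar = ring_all_reduce_time(message_bytes=1024, num_tiles=4,
                                  bandwidth_bytes_per_s=256)
example_s1 = strategy1_time(t_ar_sg=example_ar, t_transfer=2.0, t_b_dg=3.0)
example_s2 = strategy2_time(t_r_sg=1.0, t_transfer=2.0, t_ar_dg=3.0, t_b_dg=4.0)
```

Which strategy wins depends on how the source-group all-reduce compares with the combined reduce and destination-side all-reduce, which in turn depends on whether a ring can be formed and on the number of adapters.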
Using the BERT-base model, we assess the performance of the two communication strategies. The first set of experiments compares 12-tile source and destination groups under ring-shaped all-reduce, while the second adds a tile to disrupt ring formation and reassesses performance.
In Fig. 12a, when a ring structure is formed in the source tile group, strategy 1 outperforms strategy 2 in inter-communication performance. This is due to the smaller overall latency of ring all-reduce, which results in a smaller communication time than strategy 2. Moreover, as more adapters participate in inter-communication, the performance of strategy 1 gradually improves because the broadcast time in the destination tile group shrinks. In Fig. 12b, when a ring structure cannot be formed, strategy 2 shows better communication performance. In this case, the total time of the reduce and the all-reduce in strategy 2 is smaller than the all-reduce time in the source group of strategy 1. Additionally, the performance of strategy 2 initially improves and then declines as the number of adapters increases, due to the trade-off between the reduce cost and the all-reduce time among adapters.
According to the results, inter-tile communication in ring-shaped configurations performs better under strategy 1, with a 3.08× performance gap over strategy 2. Conversely, non-ring shapes are better suited to strategy 2, with a performance increase of approximately 1.23× compared with strategy 1.
VI Related Work
Several prior works aim to predict the performance of deep learning training workloads. Works [31, 52] designed automatic planners that partition the workload more evenly in order to reduce pipeline bubble time. Moolchandani et al. provided an analytical model to predict the training time of distributed Transformers [35]. Rashidi et al. proposed a simulator named Astra-Sim [34] for hardware-software co-design exploration of deep learning training. However, Astra-Sim mainly focused on examining the impact of varied network topologies and neglected support for arbitrary parallelism. Its improved version, Astra-Sim 2.0 [38], further provides a mechanism to represent and study arbitrary multi-dimensional topologies at scale, with different shapes and bandwidth configurations. However, all the works mentioned above fail to model the spatial properties of tiled accelerators. Although work [39] designed an inter-layer scheduling space and exploration framework for tiled accelerators, it focused on DNN inference and operator mapping rather than performance evaluation for DNN training.
VII Conclusion
We propose PALM, a simulator for evaluating tiled accelerators, including wafer-scale architectures, in DL training. We consider multiple dimensions that impact training, such as pipeline scheduling, parallelism, tile dataflow, and NoC congestion. Using PALM, we evaluate the training and inference throughput of LLM and ResNet models on several tiled accelerators. Compared with the published data, our results have an error of less than 16%. We also discuss the spatial optimization problem of parallelism strategy and communication. We hope that this work will be further refined in the future to guide subsequent research on mapping algorithms and tiled accelerator design.
References
- [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- [2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-End Object Detection with Transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
- [3] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929, 2020.
- [4] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, Furu Wei, and Baining Guo. Swin Transformer V2: Scaling up Capacity and Resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022.
- [5] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
- [6] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- [7] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-Training. 2018.
- [8] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- [9] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
- [10] Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386, 2020.
- [11] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660, 2017.
- [12] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4490–4499, 2018.
- [13] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMa: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971, 2023.
- [14] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- [15] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- [16] Xiaohui Wang, Yang Wei, Ying Xiong, Guyue Huang, Xian Qian, Yufei Ding, Mingxuan Wang, and Lei Li. Lightseq2: Accelerated training for transformer-based models on gpus. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–14. IEEE, 2022.
- [17] Jack Choquette. Nvidia hopper gpu: Scaling performance. In 2022 IEEE Hot Chips 34 Symposium (HCS), pages 1–46. IEEE Computer Society, 2022.
- [18] Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, et al. Pytorch distributed: Experiences on accelerating data parallel training. arXiv preprint arXiv:2006.15704, 2020.
- [19] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems, 32, 2019.
- [20] Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
- [21] Apple. Apple A15 Bionic, 2021. https://en.wikipedia.org/wiki/Apple_A15.
- [22] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. Tetris: Scalable and efficient neural network acceleration with 3D memory. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, pages 751–764, 2017.
- [23] Mingyu Gao, Xuan Yang, Jing Pu, Mark Horowitz, and Christos Kozyrakis. Tangram: Optimized coarse-grained dataflow for scalable NN accelerators. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 807–820, 2019.
- [24] Norman P Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, et al. Ten lessons from three generations shaped google’s tpuv4i: Industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pages 1–14. IEEE, 2021.
- [25] Yakun Sophia Shao, Jason Clemons, Rangharajan Venkatesan, Brian Zimmer, Matthew Fojtik, Nan Jiang, Ben Keller, Alicia Klinefelter, Nathaniel Pinckney, Priyanka Raina, et al. Simba: Scaling deep-learning inference with multi-chip-module-based architecture. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 14–27, 2019.
- [26] Ofri Wechsler, Michael Behar, and Bharat Daga. Spring hill (nnp-i 1000) intel’s data center inference chip. In 2019 IEEE Hot Chips 31 Symposium (HCS), pages 1–12. IEEE Computer Society, 2019.
- [27] Gordon Euhyun Moon, Hyoukjun Kwon, Geonhwa Jeong, Prasanth Chatarasi, Sivasankaran Rajamanickam, and Tushar Krishna. Evaluating spatial accelerator architectures with tiled matrix-matrix multiplication. IEEE Transactions on Parallel and Distributed Systems, 33(4):1002–1014, 2021.
- [28] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–15, 2021.
- [29] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. Mesh-tensorflow: Deep learning for supercomputers. Advances in neural information processing systems, 31, 2018.
- [30] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles, pages 1–15, 2019.
- [31] Atli Kosson, Vitaliy Chiley, Abhinav Venigalla, Joel Hestness, and Urs Koster. Pipelined backpropagation at scale: training large models without batches. Proceedings of Machine Learning and Systems, 3:479–501, 2021.
- [32] Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivas Sridharan, and Tushar Krishna. Themis: A network bandwidth-aware collective scheduling policy for distributed training of dl models. In Proceedings of the 49th Annual International Symposium on Computer Architecture, pages 581–596, 2022.
- [33] Benjamin Klenk, Nan Jiang, Greg Thorson, and Larry Dennison. An in-network architecture for accelerating shared-memory multiprocessor collectives. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 996–1009. IEEE, 2020.
- [34] Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. Astra-sim: Enabling sw/hw co-design exploration for distributed dl training platforms. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 81–92. IEEE, 2020.
- [35] Diksha Moolchandani, Joyjit Kundu, Frederik Ruelens, Peter Vrancx, Timon Evenblij, and Manu Perumkunnil. Amped: An analytical model for performance in distributed training of transformers. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 306–315. IEEE, 2023.
- [36] Geoffrey X. Yu, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko. Habitat: A Runtime-Based computational performance predictor for deep neural network training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 503–521, 2021.
- [37] Michael James, Marvin Tom, Patrick Groeneveld, and Vladimir Kibardin. Ispd 2020 physical mapping of neural networks on a wafer-scale deep learning accelerator. In Proceedings of the 2020 International Symposium on Physical Design, pages 145–149, 2020.
- [38] William Won, Taekyung Heo, Saeed Rashidi, Srinivas Sridharan, Sudarshan Srinivasan, and Tushar Krishna. Astra-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 283–294. IEEE, 2023.
- [39] Jingwei Cai, Yuchen Wei, Zuotong Wu, Sen Peng, and Kaisheng Ma. Inter-layer scheduling space definition and exploration for tiled accelerators. In Proceedings of the 50th Annual International Symposium on Computer Architecture, pages 1–17, 2023.
- [40] Jasmina Vasiljevic, Ljubisa Bajic, Davor Capalija, Stanislav Sokorac, Dragoljub Ignjatovic, Lejla Bajic, Milos Trajkovic, Ivan Hamer, Ivan Matosevic, Aleksandar Cejkov, et al. Compute substrate for software 2.0. IEEE micro, 41(2):50–55, 2021.
- [41] Stewart Hall, Rob Schreiber, Sean Lie, Cerebras Systems, Inc. Cs weight streaming white paper. https://8968533.fs1.hubspotusercontent-na1.net/hubfs/8968533/VirtualBoothDocs/CSWeightStreamingWhitePaper.pdf, 2023.
- [42] Drago Ignjatović, Daniel W Bailey, and Ljubisa Bajić. The wormhole ai training processor. In 2022 IEEE International Solid-State Circuits Conference (ISSCC), volume 65, pages 356–358. IEEE, 2022.
- [43] Nvidia. Nvidia a100 tensor core gpu architecture. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf.pdf, 2017.
- [44] Emil Talpes, Debjit Das Sarma, Doug Williams, Sahil Arora, Thomas Kunjan, Benjamin Floering, Ankit Jalote, Christopher Hsiong, Chandrasekhar Poorna, Vaidehi Samant, John Sicilia, Anantha Kumar Nivarti, Raghuvir Ramachandran, Tim Fischer, Ben Herzberg, Bill McGee, Ganesh Venkataramanan, and Pete Banon. The microarchitecture of dojo, tesla’s exa-scale computer. IEEE Micro, 43(3):31–39, 2023.
- [45] S. Lie. Cerebras architecture deep dive: First look inside the hw/sw co-design for deep learning : Cerebras systems. In 2022 IEEE Hot Chips 34 Symposium (HCS), pages 1–34, Los Alamitos, CA, USA, aug 2022. IEEE Computer Society.
- [46] Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, and Tushar Krishna. Scale-sim: Systolic cnn accelerator simulator. arXiv preprint arXiv:1811.02883, 2018.
- [47] PyTorch. Torch.optim. https://pytorch.org/docs/stable/optim.html, 2023.
- [48] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
- [49] Klaus G. Müller and Tony Vignaux. Simpy-discrete event simulation for python. https://simpy.readthedocs.io/en/latest/, 2023.
- [50] Linley Gwennap. Tenstorrent scales ai performance: New multicore architecture leads in data-center power efficiency, 2020.
- [51] Yonghao Zhuang, Hexu Zhao, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, and Haotong Zhang. On optimizing the communication of model parallelism. ArXiv, abs/2211.05322, 2022.
- [52] Weijie Liu, Zhiquan Lai, Shengwei Li, Yabo Duan, Keshi Ge, and Dongsheng Li. Autopipe: A fast pipeline parallelism approach with balanced partitioning and micro-batch slicing. In 2022 IEEE International Conference on Cluster Computing (CLUSTER), pages 301–312. IEEE, 2022.