With its powerful information collection and multi-layer I/O activity correlation, Beacon enables detailed analysis of application and user behavior. The results of such analysis assist in performance optimization, resource provisioning, and future system design. Here, we showcase several application/user behavior studies, some of which have led to corresponding optimizations or design changes to the TaihuLight system.
4.4.1 Application I/O Mode Analysis.
First, Table 3 gives an overview of the I/O volume across all profiled jobs with non-trivial I/O, categorized by per-job core-hour consumption. Here, 1,000 K core-hours correspond to a 10-hour run using 100,000 cores on 25,000 compute nodes, and jobs at this level or above write more than 40 TB of data on average. Further examination reveals that in each core-hour category, the average read/write volumes are heavily influenced by a small group of heavy consumers. Overall, the amount of data read/written grows as jobs consume more compute-node resources. The less resource-intensive applications tend to perform more reads, while the larger consumers are more write-intensive.
Figure
16 shows the breakdown of I/O-mode adoption among all TaihuLight jobs performing non-trivial I/O, by total read/write volume. The first impression from these results is that the rather “extreme” cases, such as N:N and 1:1, are the dominant choices, especially for writes. Suspecting that this distribution might be skewed by a large number of small jobs doing limited I/O, we calculate the average per-job read/write volume for each I/O mode. The results (Table 4) show that this is not the case: applications that choose the 1:1 mode for writes actually have a much higher average write volume.
The 1:1 mode is the closest to sequential processing behavior and is conceptually simple. However, it obviously lacks scalability and fails to utilize the abundant hardware parallelism in the TaihuLight I/O system. The wide presence of this I/O mode may help explain the overall under-utilization of forwarding resources, discussed earlier in Section
4.2. Our findings echo similar (though not so extreme) observations on other supercomputers [47] (including Intrepid [30], Mira [58], and Edison [51]); effective user education on I/O performance and scalability can help both improve storage system utilization and reduce wasted compute resources.
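To make the mode concrete, below is a minimal C/MPI sketch of the 1:1 pattern, in which a single rank gathers every process's data and performs all I/O. The buffer size and file name are illustrative rather than drawn from any TaihuLight application; the point is simply that throughput is capped at what one process (and one forwarding path) can deliver, regardless of job size.

```c
/* 1:1 mode sketch: rank 0 collects every process's block and performs all I/O.
 * Simple and portable, but I/O throughput is limited to what a single
 * process can deliver, no matter how large N is. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_DOUBLES (1 << 20)   /* illustrative per-process block: 8 MB */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *block = malloc(BLOCK_DOUBLES * sizeof(double));
    /* ... fill block with this process's results ... */

    double *all = NULL;
    if (rank == 0)
        all = malloc((size_t)nprocs * BLOCK_DOUBLES * sizeof(double));

    /* Every block funnels through rank 0. */
    MPI_Gather(block, BLOCK_DOUBLES, MPI_DOUBLE,
               all,   BLOCK_DOUBLES, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        FILE *f = fopen("output.dat", "wb");   /* single output file */
        fwrite(all, sizeof(double), (size_t)nprocs * BLOCK_DOUBLES, f);
        fclose(f);
        free(all);
    }
    free(block);
    MPI_Finalize();
    return 0;
}
```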
The N:1 mode tells a different story. It is an intuitive parallel I/O solution that allows compute processes to directly read data into or write data from their local memory without gather-scatter operations, while retaining the convenience of a single input/output file. However, our detailed monitoring finds it to be a damaging I/O mode that users should steer away from, as explained below.
First, our monitoring results confirm the findings of existing research [
2,
46]: The N:1 mode offers low application I/O performance (by reading/writing to a shared file). Even with a large N, such applications receive no more than 250 MB/s of aggregate I/O throughput, despite the TaihuLight back end's combined peak bandwidth of 260 GB/s. For read operations, users here also rarely modify the default Lustre stripe width, confirming the behavior reported in a recent ORNL study [
38]. The problem is much worse with writes, as performance severely degrades owing to file system locking.
This study, however, finds that applications with the N:1 mode are extraordinarily disruptive, as they harm all kinds of neighbor applications that share forwarding nodes with them, particularly when N is large (e.g., over 32 compute nodes).
The reason is that each forwarding node operates an LWFS server thread pool (currently sized at 16), providing forwarding service to assigned compute nodes. Applications using the N:1 mode tend to flood this thread pool with requests in bursts. Unlike the N:N or N:M modes, N:1 suffers from the aforementioned poor back-end performance because it uses a single shared file. This, in turn, makes N:1 requests slow to process, further exacerbating their congestion in the queue and delaying requests from other applications, even when those victims are accessing disjoint back-end servers and OSTs.
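For reference, the hedged sketch below shows what the N:1 pattern typically looks like when expressed with collective MPI-IO: all ranks write disjoint regions of one shared file. The `striping_factor` hint (honored by ROMIO-based MPI libraries on Lustre) is the stripe-count setting that, as noted above, users rarely change from its default; the file name and block size are illustrative only.

```c
/* N:1 mode sketch: all ranks write disjoint regions of one shared file.
 * Raising the Lustre stripe count (striping_factor hint) spreads the file
 * over more OSTs, but extent-lock contention on the shared file still
 * limits aggregate write throughput. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK_BYTES (8 << 20)   /* illustrative per-process block: 8 MB */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *block = malloc(BLOCK_BYTES);
    /* ... fill block with this process's results ... */

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "64");   /* stripe over 64 OSTs */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared_output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank writes at its own offset in the single shared file. */
    MPI_Offset off = (MPI_Offset)rank * BLOCK_BYTES;
    MPI_File_write_at_all(fh, off, block, BLOCK_BYTES, MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(block);
    MPI_Finalize();
    return 0;
}
```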
Here, we give a concrete example of I/O-mode-induced performance interference, featuring the earthquake simulation AWP [20] (the 2017 Gordon Bell Prize winner), which originally used the N:1 mode. In this sample execution,
AWP co-runs with the weather forecast application
WRF [
69] using the 1:1 mode, each having 1024 processes on 256 compute nodes. Under the “solo” mode, we assign each application a dedicated forwarding node in a small testbed partition of TaihuLight. In the “co-run” mode, we let the applications share one forwarding node (as the default compute-to-forwarding mapping is 512-to-1).
Table
5 lists the two applications’ average request wait times, processing times, and forwarding node queue lengths during these runs. Note that with the “co-run”, the queue is shared by both applications. We find that the average wait time of
WRF increases by 11
\(\times\) when co-running, but
AWP is not affected. This result highlights the harm caused by the N:1 file-sharing mode and confirms the prior finding that I/O interference is access-pattern-dependent [
37,
43].
Solution. Our tests confirm that increasing the LWFS thread pool size does not help in this case, as the bottleneck lies on the OSTs. Moreover, avoiding the N:1 mode has been advised in prior work [
2,
90], as well as in numerous parallel I/O tutorials. Considering our new inter-application interference results, avoiding this mode is an obvious “win-win” strategy that simultaneously improves large applications’ I/O performance and reduces their disruption to concurrent workloads. However, based on our experience with real applications, this message needs to be better promoted.
In our case, the Beacon developers worked with the
AWP team to replace its original N:1 file read (for initialization/restart) with the N:M mode during the 2017 ACM Gordon Bell Prize final submission phase. Changing an application’s I/O mode from N:1 to N:M means selecting M out of its N processes to perform I/O; the value of M was chosen empirically. Figure 17 shows the results of varying M for a 1,024-process AWP run on 256 compute nodes connected to one forwarding node. The aggregate bandwidth grows near-linearly with M in the range of 1 to 32: as long as the I/O processes together do not saturate the forwarding node’s peak bandwidth, having more processes write to more separate files yields higher aggregate bandwidth. At M = 64, the aggregate bandwidth increases only slightly, limited by the single forwarding node. When M \(\gt\) 64, the aggregate bandwidth even declines slightly because of resource contention, and more files may lead to unstable performance. Thus, we suggest that when changing an application’s I/O mode from N:1 to N:M, selecting 1 out of every 16 or 32 processes to perform I/O is a cost-effective choice on TaihuLight.
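The following minimal sketch shows one way to realize this N:M choice: ranks are grouped by an aggregation factor (32 here, per the recommendation above), data are gathered onto one aggregator per group, and each aggregator writes its own file. The grouping scheme and file naming are illustrative and do not reflect AWP's actual implementation.

```c
/* N:M mode sketch: one aggregator per AGG_FACTOR ranks gathers its group's
 * data and writes a separate file, producing M = N / AGG_FACTOR files.
 * AGG_FACTOR = 16 or 32 is the cost-effective range suggested above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define AGG_FACTOR   32           /* ranks per aggregator (illustrative) */
#define BLOCK_BYTES  (8 << 20)    /* illustrative per-process block: 8 MB */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *block = malloc(BLOCK_BYTES);
    /* ... fill block with this process's results ... */

    /* Split ranks into groups of AGG_FACTOR; group rank 0 is the aggregator. */
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, rank / AGG_FACTOR, rank, &group);
    int grank, gsize;
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    char *gathered = NULL;
    if (grank == 0)
        gathered = malloc((size_t)gsize * BLOCK_BYTES);

    /* Collect the group's blocks onto the aggregator. */
    MPI_Gather(block, BLOCK_BYTES, MPI_BYTE,
               gathered, BLOCK_BYTES, MPI_BYTE, 0, group);

    if (grank == 0) {
        char name[64];
        snprintf(name, sizeof(name), "output.%d.dat", rank / AGG_FACTOR);
        FILE *f = fopen(name, "wb");          /* one file per aggregator */
        fwrite(gathered, 1, (size_t)gsize * BLOCK_BYTES, f);
        fclose(f);
        free(gathered);
    }

    MPI_Comm_free(&group);
    free(block);
    MPI_Finalize();
    return 0;
}
```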
This N:1-to-N:M change brings an over 400% improvement in I/O performance. Note that the Gordon Bell Prize submission does not report I/O time; we find that AWP’s 130,000-process production runs spend the bulk of their execution time reading around 100 TB of input or checkpoint data. A significant reduction in this time greatly facilitates AWP’s development/testing and saves non-trivial supercomputer resources.
4.4.2 Metadata Server Usage.
Unlike forwarding node utilization (discussed earlier), the Lustre MDS load is found by Beacon’s continuous monitoring to be rather evenly distributed across load levels (Figure
18(a)). In particular, 26.8% of the time, the MDS experiences a load level (in requests per second) above 75% of its peak processing throughput.
Beacon allows us to further break down these requests among the systems sharing the MDS, including the TaihuLight forwarding nodes, login nodes, and the ACC. To the surprise of TaihuLight administrators, over 80% of the metadata access workload actually comes from the ACC (Figure
18(b)).
Note that the login node and ACC have their own local file systems, ext4 and GPFS [
66], respectively, which users are encouraged to use for purposes such as application compilation and data post-processing/visualization. However, as these users are typically also TaihuLight users, we find that most of them prefer, for convenience, to work directly on the main Lustre scratch file system intended for TaihuLight jobs. While the I/O bandwidth/IOPS resources consumed by such tasks are negligible, interactive user activities (such as compiling or post-processing) turn out to be metadata-heavy.
Large waves of such unintended user activities correspond to the heaviest-load periods at the tail end of Figure 18(a) and have led to MDS crashes that directly affect applications running on TaihuLight. According to our survey, many other machines, including two of the top 10 supercomputers (Sequoia [
83] and Sierra [
33]), also have a single MDS, presumably relying on their users to follow similar usage guidelines.
Solution. There are several potential solutions to this problem. With the help of Beacon, we can identify users performing metadata-heavy activities and remind them to avoid using the PFS directly. Alternatively, we can support more scalable Lustre metadata processing with an MDS cluster. A third approach, which we are currently developing, is intelligent workflow support that automatically transfers data according to users’ needs.
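To illustrate the first option, a rule as simple as the hedged sketch below, applied to Beacon’s per-client metadata request statistics, could flag candidates for such a reminder. The input format, threshold, and client naming are assumptions for illustration and do not reflect Beacon’s actual interfaces.

```c
/* Hypothetical post-processing rule: flag clients whose sustained metadata
 * request rate on the shared Lustre MDS exceeds a threshold, so the
 * corresponding users can be reminded to use their local file systems.
 * The input format (one "client_id ops_per_second" pair per line) and the
 * threshold are assumptions, not Beacon's real interface. */
#include <stdio.h>

#define MDS_OPS_THRESHOLD 2000.0   /* assumed per-client ops/s threshold */

int main(void)
{
    char client[128];
    double ops;
    /* e.g., fed an aggregated per-client metadata rate report on stdin */
    while (scanf("%127s %lf", client, &ops) == 2) {
        if (ops > MDS_OPS_THRESHOLD)
            printf("remind %s: %.0f metadata ops/s on shared MDS\n",
                   client, ops);
    }
    return 0;
}
```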