With its powerful information collection and multi-layer I/O activity correlation, Beacon enables detailed analysis of application and user behavior. The results of such analysis assist in performance optimization, resource provisioning, and future system design. Here, we showcase several application/user behavior studies, some of which have led to corresponding optimizations or design changes to the TaihuLight system.
4.4.1 Application I/O Mode Analysis.
First, Table 3 gives an overview of the I/O volume across all profiled jobs with non-trivial I/O, categorized by per-job core-hour consumption. Here, 1,000 K core-hours correspond to a 10-hour run using 100,000 cores on 25,000 compute nodes, and jobs at this level or above write more than 40 TB of data on average. Further examination reveals that in each core-hour category, the average read/write volumes are heavily influenced by a small group of heavy consumers. Overall, the amount of data read/written grows as jobs consume more compute-node resources. The less resource-intensive applications tend to perform more reads, while the larger consumers are more write-intensive.
Figure
16 shows the breakdown of I/O-mode adoption among all TaihuLight jobs performing non-trivial I/O, by total read/write volume. The first impression from these results is that the rather “extreme” cases, such as N:N and 1:1, are the dominant choices, especially for writes. Suspecting that this distribution might be skewed by a large number of small jobs doing limited I/O, we calculate the average per-job read/write volume for each I/O mode. The results (Table 4) show that this is not the case: applications that choose the 1:1 mode for writes actually have a much higher average write volume.
The 1:1 mode is the closest to sequential processing behavior and is conceptually simple. However, it obviously lacks scalability and fails to utilize the abundant hardware parallelism in the TaihuLight I/O system. The wide presence of this I/O mode may help explain the overall under-utilization of forwarding resources, discussed earlier in Section
4.2. Our findings echo similar (though not so extreme) observations on other supercomputers [47] (including Intrepid [30], Mira [58], and Edison [51]); effective user education on I/O performance and scalability can help both improve storage system utilization and reduce wasted compute resources.
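To make the mode concrete, below is a minimal C/MPI sketch of the 1:1 pattern, in which a single rank gathers every process's data and performs all I/O. The buffer size and file name are illustrative rather than drawn from any TaihuLight application; the point is simply that throughput is capped at what one process (and one forwarding path) can deliver, regardless of job size.

```c
/* 1:1 mode sketch: rank 0 collects every process's block and performs all I/O.
 * Simple and portable, but I/O throughput is limited to what a single
 * process can deliver, no matter how large N is. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_DOUBLES (1 << 20)   /* illustrative per-process block: 8 MB */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *block = malloc(BLOCK_DOUBLES * sizeof(double));
    /* ... fill block with this process's results ... */

    double *all = NULL;
    if (rank == 0)
        all = malloc((size_t)nprocs * BLOCK_DOUBLES * sizeof(double));

    /* Every block funnels through rank 0. */
    MPI_Gather(block, BLOCK_DOUBLES, MPI_DOUBLE,
               all,   BLOCK_DOUBLES, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        FILE *f = fopen("output.dat", "wb");   /* single output file */
        fwrite(all, sizeof(double), (size_t)nprocs * BLOCK_DOUBLES, f);
        fclose(f);
        free(all);
    }
    free(block);
    MPI_Finalize();
    return 0;
}
```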
The N:1 mode tells a different story. It is an intuitive parallel I/O solution that allows compute processes to directly read data into or write data from their local memory without gather-scatter operations, while retaining the convenience of a single input/output file. However, our detailed monitoring finds it to be a damaging I/O mode that users should steer away from, as explained below.
First, our monitoring results confirm the findings of existing research [
2,
46]: The N:1 mode offers low application I/O performance (by reading/writing to a shared file). Even with a large N, such applications receive no more than 250 MB/s of aggregate I/O throughput, despite the TaihuLight back end's combined peak bandwidth of 260 GB/s. For read operations, users here also rarely modify the default Lustre stripe width, confirming the behavior reported in a recent ORNL study [
38]. The problem is much worse with writes, as performance severely degrades owing to file system locking.
This study, however, finds that applications with the N:1 mode are extraordinarily disruptive, as they harm all kinds of neighbor applications that share forwarding nodes with them, particularly when N is large (e.g., over 32 compute nodes).
The reason is that each forwarding node operates an LWFS server thread pool (currently sized at 16), providing forwarding service to assigned compute nodes. Applications using the N:1 mode tend to flood this thread pool with requests in bursts. Unlike the N:N or N:M modes, N:1 suffers from the aforementioned poor back-end performance because it uses a single shared file. This, in turn, makes N:1 requests slow to process, further exacerbating their congestion in the queue and delaying requests from other applications, even when those victims are accessing disjoint back-end servers and OSTs.
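For reference, the hedged sketch below shows what the N:1 pattern typically looks like when expressed with collective MPI-IO: all ranks write disjoint regions of one shared file. The `striping_factor` hint (honored by ROMIO-based MPI libraries on Lustre) is the stripe-count setting that, as noted above, users rarely change from its default; the file name and block size are illustrative only.

```c
/* N:1 mode sketch: all ranks write disjoint regions of one shared file.
 * Raising the Lustre stripe count (striping_factor hint) spreads the file
 * over more OSTs, but extent-lock contention on the shared file still
 * limits aggregate write throughput. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK_BYTES (8 << 20)   /* illustrative per-process block: 8 MB */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *block = malloc(BLOCK_BYTES);
    /* ... fill block with this process's results ... */

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "64");   /* stripe over 64 OSTs */

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared_output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);

    /* Each rank writes at its own offset in the single shared file. */
    MPI_Offset off = (MPI_Offset)rank * BLOCK_BYTES;
    MPI_File_write_at_all(fh, off, block, BLOCK_BYTES, MPI_BYTE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Info_free(&info);
    free(block);
    MPI_Finalize();
    return 0;
}
```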
Here, we give a concrete example of I/O-mode-induced performance interference, featuring the earthquake simulation AWP [20] (the 2017 Gordon Bell Prize winner), which originally used the N:1 mode. In this sample execution,
AWP co-runs with the weather forecast application
WRF [
69] using the 1:1 mode, each having 1024 processes on 256 compute nodes. Under the “solo” mode, we assign each application a dedicated forwarding node in a small testbed partition of TaihuLight. In the “co-run” mode, we let the applications share one forwarding node (as the default compute-to-forwarding mapping is 512-to-1).
Table
5 lists the two applications’ average request wait times, processing times, and forwarding node queue lengths during these runs. Note that with the “co-run”, the queue is shared by both applications. We find that the average wait time of
WRF increases by 11
\(\times\) when co-running, but
AWP is not affected. This result highlights the harm caused by the N:1 file-sharing mode and confirms the prior finding that I/O interference is access-pattern-dependent [
37,
43].
Solution. Our tests confirm that increasing the LWFS thread pool size does not help in this case, as the bottleneck lies on the OSTs. Moreover, avoiding the N:1 mode has been advised in prior work [
2,
90], as well as in numerous parallel I/O tutorials. Considering our new inter-application interference results, avoiding this mode is an obvious “win-win” strategy that simultaneously improves large applications’ I/O performance and reduces their disruption to concurrent workloads. However, based on our experience with real applications, this message needs to be better promoted.
In our case, the Beacon developers worked with the
AWP team to replace its original N:1 file read (for initialization/restart) with the N:M mode during the 2017 ACM Gordon Bell Prize final submission phase. Changing an application’s I/O mode from N:1 to N:M means selecting M out of its N processes to perform I/O; the value of M was chosen empirically. Figure 17 shows the results of varying M for a 1,024-process AWP run on 256 compute nodes connected to one forwarding node. The aggregate bandwidth grows near-linearly with M in the range of 1 to 32: as long as the I/O processes together do not saturate the forwarding node’s peak bandwidth, having more processes write to more separate files yields higher aggregate bandwidth. At M = 64, the aggregate bandwidth increases only slightly, limited by the single forwarding node. When M \(\gt\) 64, the aggregate bandwidth even declines slightly because of resource contention, and more files may lead to unstable performance. Thus, we suggest that when changing an application’s I/O mode from N:1 to N:M, selecting 1 out of every 16 or 32 processes to perform I/O is a cost-effective choice on TaihuLight.
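The following minimal sketch shows one way to realize this N:M choice: ranks are grouped by an aggregation factor (32 here, per the recommendation above), data are gathered onto one aggregator per group, and each aggregator writes its own file. The grouping scheme and file naming are illustrative and do not reflect AWP's actual implementation.

```c
/* N:M mode sketch: one aggregator per AGG_FACTOR ranks gathers its group's
 * data and writes a separate file, producing M = N / AGG_FACTOR files.
 * AGG_FACTOR = 16 or 32 is the cost-effective range suggested above. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define AGG_FACTOR   32           /* ranks per aggregator (illustrative) */
#define BLOCK_BYTES  (8 << 20)    /* illustrative per-process block: 8 MB */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *block = malloc(BLOCK_BYTES);
    /* ... fill block with this process's results ... */

    /* Split ranks into groups of AGG_FACTOR; group rank 0 is the aggregator. */
    MPI_Comm group;
    MPI_Comm_split(MPI_COMM_WORLD, rank / AGG_FACTOR, rank, &group);
    int grank, gsize;
    MPI_Comm_rank(group, &grank);
    MPI_Comm_size(group, &gsize);

    char *gathered = NULL;
    if (grank == 0)
        gathered = malloc((size_t)gsize * BLOCK_BYTES);

    /* Collect the group's blocks onto the aggregator. */
    MPI_Gather(block, BLOCK_BYTES, MPI_BYTE,
               gathered, BLOCK_BYTES, MPI_BYTE, 0, group);

    if (grank == 0) {
        char name[64];
        snprintf(name, sizeof(name), "output.%d.dat", rank / AGG_FACTOR);
        FILE *f = fopen(name, "wb");          /* one file per aggregator */
        fwrite(gathered, 1, (size_t)gsize * BLOCK_BYTES, f);
        fclose(f);
        free(gathered);
    }

    MPI_Comm_free(&group);
    free(block);
    MPI_Finalize();
    return 0;
}
```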
This N:1-to-N:M change brings an over 400% improvement in I/O performance. Note that the Gordon Bell Prize submission does not report I/O time; we find that AWP’s 130,000-process production runs spend the bulk of their execution time reading around 100 TB of input or checkpoint data. A significant reduction in this time greatly facilitates AWP’s development/testing and saves non-trivial supercomputer resources.
4.4.2 Metadata Server Usage.
Unlike forwarding node utilization (discussed earlier), the Lustre MDS load is found by Beacon’s continuous monitoring to be rather evenly distributed across load levels (Figure
18(a)). In particular, 26.8% of the time, the MDS experiences a load level (in requests per second) above 75% of its peak processing throughput.
Beacon allows us to further break down these requests among the systems sharing the MDS, including the TaihuLight forwarding nodes, login nodes, and the ACC. To the surprise of TaihuLight administrators, over 80% of the metadata access workload actually comes from the ACC (Figure
18(b)).
Note that the login node and ACC have their own local file systems, ext4 and GPFS [
66], respectively, which users are encouraged to use for purposes such as application compilation and data post-processing/visualization. However, as these users are typically also TaihuLight users, we find that most of them prefer, for convenience, to work directly on the main Lustre scratch file system intended for TaihuLight jobs. While the I/O bandwidth/IOPS resources consumed by such tasks are negligible, interactive user activities (such as compiling or post-processing) turn out to be metadata-heavy.
Large waves of such unintended user activities correspond to the heaviest-load periods at the tail end of Figure 18(a) and have led to MDS crashes that directly affect applications running on TaihuLight. According to our survey, many other machines, including two of the top 10 supercomputers (Sequoia [
83] and Sierra [
33]), also have a single MDS, presumably relying on their users to follow similar usage guidelines.
Solution. There are several potential solutions to this problem. With the help of Beacon, we can identify users performing metadata-heavy activities and remind them to avoid using the PFS directly. Alternatively, we can support more scalable Lustre metadata processing with an MDS cluster. A third approach, which we are currently developing, is intelligent workflow support that automatically transfers data according to users’ needs.
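To illustrate the first option, a rule as simple as the hedged sketch below, applied to Beacon’s per-client metadata request statistics, could flag candidates for such a reminder. The input format, threshold, and client naming are assumptions for illustration and do not reflect Beacon’s actual interfaces.

```c
/* Hypothetical post-processing rule: flag clients whose sustained metadata
 * request rate on the shared Lustre MDS exceeds a threshold, so the
 * corresponding users can be reminded to use their local file systems.
 * The input format (one "client_id ops_per_second" pair per line) and the
 * threshold are assumptions, not Beacon's real interface. */
#include <stdio.h>

#define MDS_OPS_THRESHOLD 2000.0   /* assumed per-client ops/s threshold */

int main(void)
{
    char client[128];
    double ops;
    /* e.g., fed an aggregated per-client metadata rate report on stdin */
    while (scanf("%127s %lf", client, &ops) == 2) {
        if (ops > MDS_OPS_THRESHOLD)
            printf("remind %s: %.0f metadata ops/s on shared MDS\n",
                   client, ops);
    }
    return 0;
}
```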