
End-to-end I/O Monitoring on Leading Supercomputers

Published: 11 January 2023

Abstract

This paper offers a solution to overcome the complexities of production system I/O performance monitoring. We present Beacon, an end-to-end I/O resource monitoring and diagnosis system for the 40,960-node Sunway TaihuLight supercomputer, currently the fourth-ranked supercomputer in the world. Beacon simultaneously collects and correlates I/O tracing/profiling data from all the compute nodes, forwarding nodes, storage nodes, and metadata servers. With mechanisms such as aggressive online and offline trace compression and distributed caching/storage, it delivers scalable, low-overhead, and sustainable I/O diagnosis under production use. Based on Beacon’s deployment on TaihuLight for more than three years, we demonstrate its effectiveness with real-world use cases for I/O performance issue identification and diagnosis. It has already helped center administrators identify obscure design or configuration flaws, system anomaly occurrences, I/O performance interference, and resource under- or over-provisioning problems. Several of the exposed problems have already been fixed, and others are currently being addressed. Encouraged by Beacon’s success in I/O monitoring, we extend it to monitor interconnection networks, another contention point on supercomputers. In addition, we demonstrate Beacon’s generality by extending it to other supercomputers. Both Beacon’s code and part of the collected monitoring data have been released.

1 Introduction

Modern supercomputers are networked systems with increasingly deep storage hierarchies, serving applications with growing scale and complexity. The long I/O path from storage media to application, combined with complex software stacks and hardware configurations, makes I/O optimizations increasingly challenging for application developers and supercomputer administrators. In addition, because I/O utilizes heavily shared system components (unlike computation or memory accesses), it usually suffers from substantial inter-workload interference, causing high performance variance [23, 32, 37, 45, 52, 63, 71].
Online tools that can capture/analyze I/O activities and guide optimization are urgently needed. They also need to provide I/O usage information and performance records to guide future systems’ design, configuration, and deployment. To this end, several profiling/tracing tools and frameworks have been developed, including application-side (e.g., Darshan [9], ScalableIOTrace [81], and IOPin [34]), back-end side (e.g., LustreDU [7], IOSI [40], and LIOProf [91]), and multi-layer tools (e.g., EZIOTracer [49], GUIDE [80], and Logaider [12]).
These proposed tools, however, have one or more of the following limitations. Application-oriented tools often require developers to instrument their source code or link extra libraries. They also do not offer intuitive ways to analyze inter-application I/O performance behaviors such as interference issues. Back-end-oriented tools can collect system-level performance data and monitor cross-application interactions but have difficulty identifying performance issues for specific applications and finding their root causes. Finally, problematic applications that issue inefficient I/O requests escape the radar of back-end-side analytical methods [40, 41], which rely on identifying high-bandwidth applications.
This paper reports the design, implementation, and deployment of a lightweight, end-to-end I/O resource monitoring and diagnosis system, Beacon, for TaihuLight, currently the fourth-ranked supercomputer in the world [29]. It works with TaihuLight’s 40,960 compute nodes (over ten-million cores in total), 288 forwarding nodes, 288 storage nodes, and two metadata nodes. Beacon integrates front-end tracing and back-end profiling into a seamless framework, enabling tasks such as automatic per-application I/O behavior profiling, I/O bottleneck/interference analysis, and system anomaly detection.
To the best of our knowledge, this is the first system-level, multi-layer monitoring and real-time diagnosis framework deployed on ultra-scale supercomputers. Beacon collects performance data simultaneously from different types of nodes (including the compute, I/O forwarding, storage, and metadata nodes) and analyzes them collaboratively, without requiring any involvement of application developers. Its carefully designed collection scheme and aggressive compression minimize the system cost: only 85 part-time servers are needed to monitor the entire 40,960-node system, with \(\lt \!1\%\) performance overhead in user applications.
We have deployed Beacon for production use since April 2017. It has already helped the TaihuLight system administration and I/O performance team identify several performance degradation problems. With its rich I/O performance data collection and real-time system monitoring, Beacon successfully exposes the mismatch between application I/O patterns and widely adopted underlying storage design/configurations. To help application developers and users, it enables detailed per-application I/O behavior study, with novel inter-application interference identification and analysis. Beacon also performs automatic anomaly detection. Finally, we have recently started to expand Beacon beyond I/O to network switch monitoring.
Based on our design and deployment experience, we argue that having such an end-to-end detailed I/O monitoring framework is highly rewarding. Beacon’s whole-system monitoring decouples it from language, library, or compiler constraints, enabling the collection and analysis of monitoring data for all applications and users. Much of its infrastructure reuses existing server/network/storage resources, and it has proved to have negligible overhead. In exchange, users and administrators harvest deep insights into the complex I/O system components’ operations and interactions, and reduce both the human resources and machine core-hours wasted on unnecessarily slow/jittery I/O or system anomalies.

2 TaihuLight Network Storage

Let us first introduce the TaihuLight supercomputer (and its Icefish I/O subsystem) used to perform our implementation and deployment. Though the rest of our discussion is based on this specific platform, many aspects of Beacon’s design and operation can be applied to other large-scale supercomputers or clusters.
TaihuLight, currently the fourth-ranked supercomputer in the world, is a many-core accelerated 125-petaflop system [22]. Figure 1 illustrates its architecture, highlighting the Icefish storage subsystem. The 40,960 260-core compute nodes are organized into 40 cabinets, each containing four supernodes. Through dual-rail FDR InfiniBand, all the 256 compute nodes in one supernode are fully connected and then connected to Icefish via a Fat-tree network. In addition, Icefish serves an Auxiliary Compute Cluster (ACC) with Intel Xeon processors, mainly used for data pre- and post-processing.
Fig. 1.
Fig. 1. TaihuLight and its Icefish storage system architecture overview. Beacon uses a separate monitoring and management Ethernet network shown at the bottom.
The Icefish back end employs the Lustre parallel file system [4], with an aggregate capacity of 10 PB on top of 288 storage nodes and 144 Sugon DS800 disk enclosures. An enclosure contains 60 1.2-TB SAS HDD drives, composing six Object Storage Targets (OSTs), each an 8+2 RAID6 array. The controller within each enclosure connects to two storage nodes, via two fiber channels for path redundancy. Therefore, every storage node manages three OSTs, while the two adjacent storage nodes sharing a controller form a failover pair.
Between the compute nodes and the Lustre back end is a layer of 288 I/O forwarding nodes. Each plays a dual role, both as a Lightweight File System (LWFS) based on the Gluster [13] server to the compute nodes and a client to the Lustre back end. This I/O forwarding practice is adopted by multiple other platforms that operate at such a scale [6, 44, 53, 82, 95].
A forwarding node provides a bandwidth of 2.5 GB/s, aggregating to over 720 GB/s for the entire forwarding system. Each back-end controller provides about 1.8 GB/s, amounting to a file system bandwidth of around 260 GB/s. Overall, Icefish delivers 240 GB/s and 220 GB/s aggregate bandwidths for reads and writes, respectively.
TaihuLight debuted on the Top500 list in June 2016. At the time of this study, Icefish was equally partitioned into two namespaces: Online1 (for everyday workloads) and Online2 (reserved for ultra-scale jobs that occupy the majority of the compute nodes), with disjoint sets of forwarding nodes. A batch job can use only one of the two namespaces. I/O requests from a compute node are served by a designated forwarding node under a static mapping strategy adopted for easy maintenance (48 fixed forwarding nodes for the ACC and 80 fixed forwarding nodes for the Sunway compute nodes).
Therefore, the two namespaces, along with statically partitioned back-end resources, are currently utilized separately by routine jobs and “VIP” jobs. One motivation for deploying an end-to-end monitoring system is to analyze the I/O behavior of the entire supercomputer’s workloads and design more flexible I/O resource allocation/scheduling mechanisms. For example, motivated by the findings of our monitoring system, a dynamic forwarding allocation system [31] for better forwarding resource utilization was developed, tested, and deployed.

3 Beacon Design and Implementation

3.1 Beacon Architecture Overview

Figure 2 shows the three components of Beacon: the monitoring component, the storage component, and a dedicated Beacon server. Beacon performs I/O monitoring at six components of TaihuLight: the LWFS client (on the compute nodes), the LWFS server and the Lustre client (both on the forwarding nodes), the Lustre server (on the storage nodes), the Lustre metadata server (on the metadata nodes), and the job scheduler (on the scheduler node). For the first five, Beacon deploys lightweight daemons that collect I/O-relevant events, status, and performance data locally, and then deliver the aggregated and compressed data to Beacon’s distributed databases, which are deployed on 84 part-time servers. Aggressive first-pass compression is conducted on all compute nodes for efficient per-application I/O trace collection/storage. For the job scheduler, Beacon interacts with the job queuing system to keep track of per-job information, and then sends the job information to the MySQL database (on the 85th part-time server). Details of Beacon’s monitoring component can be found in Section 3.2.
Fig. 2.
Fig. 2. Beacon’s main components: daemons at monitoring points, a distributed I/O record database, a job database, plus a dedicated Beacon server.
Beacon’s storage component is deployed on 85 of the 288 storage nodes. Beacon distributes its major back-end processing and storage workflow across these storage nodes and their node-local disks, achieving low overall overhead and satisfactory service stability. To this end, Beacon divides the 40,960 compute nodes into 80 groups and enlists 80 of the 288 storage nodes, each communicating with one group. Two more storage nodes are used to collect data from the forwarding nodes, plus another for the storage nodes and one last for the Metadata Server (MDS). Together, these 84 “part-time” servers (shown as “N1” to “N84” in Figure 2) are called log servers and host Beacon’s distributed I/O record database. Given that data are collected from more than 50,000 nodes in total, spreading the load over this many servers benefits Beacon’s stability and concurrent-access efficiency. In addition, one more storage node (N85 in Figure 2) hosts Beacon’s job database (implemented using MySQL [16]). By leveraging the hardware already available on the supercomputer, we can deploy Beacon quickly.
These log servers adopt a layered software architecture built upon mature open-source frameworks. They collect I/O-relevant events, status, and performance data through Logstash [78], a server-side log processing pipeline for simultaneously ingesting data from multiple sources. The data are then imported to Redis [65], a widely used in-memory data store, acting as a cache to quickly absorb monitoring output. Persistent data storage and subsequent analysis are done via Elasticsearch [36], a distributed lightweight search and analytics engine supporting a NoSQL database. It also supports efficient Beacon queries for real-time and offline analysis.
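As a concrete illustration of this Logstash/Redis/Elasticsearch pipeline, the Python sketch below shows how a log server might drain cached monitoring records from Redis and persist them in Elasticsearch. It is a minimal sketch under assumed names (the host, the Redis key beacon:lwfs, and the index name are hypothetical), not Beacon’s actual implementation.

```python
# Illustrative sketch: drain monitoring records cached in Redis and persist
# them into Elasticsearch for later queries. Host names, the Redis key, and
# the index name are hypothetical, not Beacon's actual configuration.
import json
import redis
from elasticsearch import Elasticsearch

r = redis.Redis(host="log-server-01", port=6379)
es = Elasticsearch("http://log-server-01:9200")   # elasticsearch-py 8.x client

def drain(batch_limit=1000):
    """Move up to batch_limit cached records from Redis into Elasticsearch."""
    for _ in range(batch_limit):
        raw = r.rpop("beacon:lwfs")        # records were LPUSHed by the ingester
        if raw is None:
            break                          # cache empty
        record = json.loads(raw)           # each record is a JSON log entry
        es.index(index="beacon-lwfs-server", document=record)

if __name__ == "__main__":
    drain()
```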
Finally, a dedicated Beacon server conducts data analytics and presents the results to Beacon’s users (either system administrators or application users). It periodically performs two kinds of offline data analysis: (1) second-pass, inter-node compression that further removes data redundancy by comparing and combining logs from compute nodes running the same job, and (2) extraction of per-job statistical summaries, cached as SQL views in MySQL, together with common performance visualization results, cached in Redis, to enable speedy user responses. Log and monitoring data, after the two-pass compression, are permanently stored using Elasticsearch on this dedicated Beacon server. Data in the distributed I/O record database are kept for six months. Considering the typical daily data collection size of 10–100 GB, the server’s 120-TB RAID5 capacity far exceeds the system’s lifetime storage needs.
Beacon’s web interface uses the Vue [93]+Django [19] framework, which can efficiently separate the front end (a user-friendly GUI for processing and visualizing the I/O-related job/system information queries) and the back end (the service for obtaining the analysis results of Beacon and feeding them back to the front end). For instance, application users can query a summary of their programs’ I/O behavior based on the job ID, along the entire I/O path, to help diagnose I/O performance problems. Moreover, system administrators can monitor real-time load levels on all forwarding nodes, storage nodes, and metadata servers, facilitating future job scheduling optimizations and center-level resource allocation policies. Figure 3 shows the corresponding screenshots. Section 4 provides more details, with concrete case studies.
Fig. 3.
Fig. 3. Sample display from Beacon’s web interface: (a) cross-layer read/write bandwidth of one user job, (b) bandwidth of three OSTs identified as undergoing an anomaly.
All communication among Beacon entities uses a low-cost, easy-to-maintain Ethernet connection (marked in green in Figure 1) that is separate from both the main computation and the storage interconnects.

3.2 Multi-layer I/O Monitoring

Figure 4 shows the format of all data collected by Beacon, including the LWFS client trace entry, LWFS server log entry, Lustre client log entry, Lustre server log entry, Lustre MDS log entry, and job scheduler log entry. Details are given in the following subsections.
Fig. 4.
Fig. 4. Beacon’s data format summary.

3.2.1 Compute Nodes.

On each of the 40,960 compute nodes, Beacon collects LWFS client trace logs by instrumenting the FUSE (File system in User Space) [17] client. Each log entry contains the node’s IP, I/O operation type, file descriptor, offset, request size, and timestamp.
On a typical day, such raw trace data alone amount to over 100 GB, making their collection/processing a non-trivial task for Beacon’s I/O record database, which takes resources away from the storage nodes. However, there is abundant redundancy in HPC workloads’ I/O operations. For example, as each compute node is usually dedicated to one job at a time, the job IDs are identical among many trace entries. Similarly, owing to the regular, tightly coupled nature of many parallel applications, adjacent I/O operations likely share common components, such as the target file, operation type, and request size. Recognizing this, Beacon performs aggressive online compression on each compute node to dramatically reduce the I/O trace size. This is done with a simple linear algorithm that compares adjacent log entries and merges runs of entries that have identical operation type, file descriptor, and request size and access contiguous file regions; each such run is replaced with a single entry plus a counter. Given its low computational overhead, this parallel first-pass compression is performed directly on the compute nodes.
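To make the merge rule concrete, the following is a minimal Python sketch of the first-pass compression; the trace-entry field names (op, fd, size, offset) are assumptions standing in for the fields listed in Figure 4, not Beacon’s actual code.

```python
# Minimal sketch of first-pass (intra-node) trace compression: adjacent
# entries with the same operation type, file descriptor, and request size
# that access contiguous file regions are merged into one entry plus a count.
def compress_first_pass(entries):
    """entries: list of dicts with keys op, fd, size, offset (assumed names)."""
    merged = []
    for e in entries:
        if merged:
            last = merged[-1]
            contiguous_end = last["offset"] + last["count"] * last["size"]
            if (e["op"] == last["op"] and e["fd"] == last["fd"]
                    and e["size"] == last["size"]
                    and e["offset"] == contiguous_end):
                last["count"] += 1          # fold into the previous entry
                continue
        merged.append(dict(e, count=1))     # start a new merged entry
    return merged
```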
Beacon conducts offline log processing and second-pass compression on the dedicated server. Here, it extracts the feature vector \(\lt\)time, operation, file descriptor, size, offset\(\gt\) from the original log records and performs inter-node compression by comparing the feature vector lists from all nodes and merging identical vectors, using an approach similar to block trace modeling [77] or ScalaTrace [54].
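A simplified sketch of the second-pass, inter-node merge is given below; for brevity it omits the time component of the feature vector and merges vectors that are otherwise identical across nodes, recording which nodes reported each vector. Field names and data layout are assumptions, not Beacon’s actual code.

```python
# Simplified sketch of second-pass (inter-node) compression: feature vectors
# <operation, fd, size, offset> that repeat across compute nodes running the
# same job are stored once, together with the set of nodes that reported them.
from collections import defaultdict

def compress_second_pass(per_node_vectors):
    """per_node_vectors: dict mapping node_id -> list of
    (op, fd, size, offset) feature tuples extracted from that node's log."""
    merged = defaultdict(set)               # feature tuple -> reporting nodes
    for node_id, vectors in per_node_vectors.items():
        for vec in vectors:
            merged[vec].add(node_id)
    # One record per distinct feature vector, plus the nodes it came from.
    return [{"vector": vec, "nodes": sorted(nodes)}
            for vec, nodes in merged.items()]
```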
Table 1 summarizes the effectiveness of Beacon’s monitoring data compression. It gives the compression ratios of both passes for eight applications, including six open-source applications (APT [84], WRF [69], DNDC [25], CAM [21], AWP [20], and Shentu [39]) and two closed-source computational fluid dynamics simulators (XCFD and GKUA). The results indicate that the compute-node-side first-pass compression reduces the raw trace size by a factor of 5.4 to 34.6 across these eight real-world, large-scale applications. The second pass achieves a less impressive reduction, partly because the data have already undergone one pass of compression: although the compute nodes perform similar I/O operations, differing parameter values such as the file offset make it harder to combine entries.
Table 1.
Applications    1st-pass    2nd-pass
APT             5.4         2.1
WRF             14.2        3.8
DNDC            10.1        3.4
XCFD            12.2        3.8
GKUA            34.6        3.6
CAM             9.2         4.4
AWP             15.1        3.2
Shentu          22.2        2.6
Table 1. Compression Ratio of Sample Applications

3.2.2 Forwarding Nodes.

On each forwarding node, Beacon profiles both the LWFS server and the Lustre client. It collects the latency and processing time of each LWFS server request by instrumenting all I/O operations at the POSIX layer, and the request queue length of each LWFS server by sampling the queue status once per 1,000 requests. Rather than saving per-request traces, the Beacon daemon periodically processes new traces and saves only I/O request statistics such as the latency and queue length distributions.
For the Lustre client, Beacon collects request statistics by sampling the status of all outstanding RPC requests once every second. Each sample contains the forwarding ID and RPC request size sent to the Lustre server.
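The following rough sketch illustrates this sample-and-summarize strategy on a forwarding node: per-request latencies are accumulated, the queue length is sampled once per 1,000 requests, and only summary statistics are emitted for each time window. The class and field names are ours, not the daemon’s actual code.

```python
# Rough sketch of a forwarding-node daemon's sample-and-summarize logic:
# keep per-request latencies and occasional queue-length samples in memory,
# and periodically emit only distribution statistics instead of raw traces.
class FwdNodeStats:
    def __init__(self, queue_sample_period=1000):
        self.queue_sample_period = queue_sample_period
        self.request_count = 0
        self.latencies_ms = []
        self.queue_lengths = []

    def on_request(self, latency_ms, current_queue_length):
        self.request_count += 1
        self.latencies_ms.append(latency_ms)
        if self.request_count % self.queue_sample_period == 0:
            self.queue_lengths.append(current_queue_length)  # 1-in-1000 sampling

    def flush(self):
        """Return a compact summary for this window, then reset the buffers."""
        lat = sorted(self.latencies_ms)
        summary = {
            "requests": self.request_count,
            "lat_mean_ms": sum(lat) / len(lat) if lat else 0.0,
            "lat_p99_ms": lat[int(0.99 * (len(lat) - 1))] if lat else 0.0,
            "queue_len_max": max(self.queue_lengths, default=0),
        }
        self.request_count = 0
        self.latencies_ms.clear()
        self.queue_lengths.clear()
        return summary
```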

3.2.3 Storage Nodes and MDS.

On the storage nodes, Beacon daemons periodically sample the Lustre OST status table, record data items such as the OST ID and OST total data size, and further send high-level statistics such as the count of RPC requests and average per-RPC data size in the past time window. On the Lustre MDS, Beacon also periodically collects and records statistics on active metadata operations (such as open and lookup) at 1-second intervals while storing a summary of the periodic statistics in its database.

3.3 Multi-layer I/O Profiling

All the aforementioned monitoring data are transmitted for long-term storage and processing at the database on the dedicated Beacon server as JSON objects, on top of which Beacon builds I/O monitoring/profiling services. These include automatic anomaly detection, which runs periodically, as well as query and visualization tools, which supercomputer users and administrators can use interactively. Below, we give more detailed descriptions of these functions.

3.3.1 Automatic Anomaly Detection.

Beacon performs two types of automatic anomaly detection. The first targets job I/O performance anomalies, which are common in complex HPC environments. Various factors can cause such anomalies, and I/O interference is one of the major ones; as supercomputer architectures grow more complicated, it becomes increasingly difficult to identify and locate I/O interference. The second type targets node anomalies. Outright failure, where a node is entirely out of service, is a common type of node anomaly that can be detected relatively straightforwardly in a large system; it is commonly handled by tools such as heartbeat detection [67, 74], and we do not discuss it in this paper. Instead, we focus on faulty system components, that is, alive yet slow components such as forwarding nodes and OSTs under performance degradation. Faulty system components may continue to serve requests, but at a much slower pace, dragging down the entire application’s performance and reducing overall system utilization. In a dynamic storage system serving multiple platforms and many concurrent applications, such stragglers are difficult to identify.
With Beacon’s continuous, end-to-end, multi-layer I/O monitoring, application developers and supercomputer administrators gain a new way to examine job performance and system health by connecting statistics on application-issued I/O requests with individual OSTs’ bandwidth measurements. Such a connection guides Beacon in deducing what is the norm and what is an exception. Leveraging this capability, we design and implement a lightweight, automatic anomaly detection tool. Figure 5 shows its workflow.
Fig. 5.
Fig. 5. Flow chart of automatic anomaly detection.
The left part of the figure shows the job I/O performance anomaly detection workflow. Beacon detects job I/O performance anomalies by checking newly measured I/O performance results against historical records, based on the assumption that most data-intensive applications have relatively consistent I/O behavior. First, it adopts the automatic I/O phase identification technique of the IOSI system [40], developed on the Oak Ridge National Laboratory Titan supercomputer, which uses the Discrete Wavelet Transform (DWT) to find distinct “I/O bursts” in continuous I/O bandwidth time-series data. Then, Beacon applies a two-stage approach to detect jobs’ abnormal I/O phases. In the first stage, Beacon classifies the I/O phases into several distinct categories by their I/O mode and total I/O volume, using the DBSCAN algorithm [18]. In the second stage, Beacon calculates the I/O phases’ performance vectors for each category, clusters the performance vectors with DBSCAN again, and then identifies each job’s abnormal I/O phases from the clustering results. Here, we propose a new measurement feature, the performance vector, which describes an I/O phase’s throughput waveform. Intuitively, an abnormal I/O phase’s throughput is substantially lower than that of a normal I/O phase for most of the phase’s duration. The throughput distribution therefore becomes an important feature for differentiating abnormal I/O phases.
The process of calculating the performance vector is shown in Algorithm 1. We determine the I/O phase’s time span in each range by dividing the throughput between the minimum and maximum into N intervals. Here, we take WRF [69] as an example to describe the process of calculating performance vectors. WRF is a weather forecast application with the highest core-hour occupancy rate on TaihuLight, using the 1:1 I/O mode. Figure 6(a) illustrates two WRF jobs running at a scale of 128 compute nodes: the job with normal performance is shown above, and the job with abnormal performance below. According to Beacon’s historical statistics, the maximum bandwidth of these I/O phases is around 60 MB/s and the minimum is 0.3 MB/s, so their bandwidth range is (0, 60] (\(TH_{min}\)=0 and \(TH_{max}\)=60). Each I/O phase’s throughput is divided into five (N=5) intervals, and the interval width R is thus 12. The value of five is selected empirically, based on WRF’s monitoring data. Figure 6(b) shows the resulting throughput distributions of the four I/O phases. In the smallest sub-interval ((0, 12]), the time ratio of the abnormal I/O phases is substantially larger than that of the regular I/O phases in the same interval. That is, the performance vectors of abnormal I/O phases differ considerably from those of other I/O phases. As described above, Beacon performs the second-stage clustering with performance vectors from the same category of I/O phases, and the outliers obtained after clustering are considered abnormal I/O phases. Testing with the real-world dataset shows that Beacon’s two-stage clustering approach improves accuracy by around 20% over IOSI’s simple one-stage clustering method, which detects outliers only by clustering the I/O phases’ consumed time and I/O volume.
Fig. 6.
Fig. 6. An example of WRF’s anomaly detection.
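Algorithm 1 is not reproduced here, but its essence can be sketched in a few lines of Python using the WRF numbers above (\(TH_{min}\)=0, \(TH_{max}\)=60, N=5, so R=12): the performance vector records the fraction of an I/O phase’s samples falling into each throughput sub-interval, and phases whose vectors are DBSCAN outliers within their category are flagged as abnormal. The eps and min_samples values below are placeholders, not Beacon’s tuned parameters.

```python
# Sketch of the performance-vector computation (cf. Algorithm 1): split the
# historical throughput range (TH_min, TH_max] into N equal sub-intervals and
# record the fraction of the I/O phase's samples falling into each one.
import numpy as np
from sklearn.cluster import DBSCAN

def performance_vector(throughput, th_min=0.0, th_max=60.0, n=5):
    """throughput: regularly sampled bandwidth values (MB/s) of one I/O phase."""
    edges = np.linspace(th_min, th_max, n + 1)       # e.g., 0, 12, 24, 36, 48, 60
    counts, _ = np.histogram(throughput, bins=edges)
    return counts / max(len(throughput), 1)          # time fraction per interval

def abnormal_phases(phases, eps=0.2, min_samples=3):
    """Second-stage clustering: phases whose performance vectors are DBSCAN
    outliers (label -1) within one category are flagged as abnormal."""
    vectors = np.array([performance_vector(p) for p in phases])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(vectors)
    return [i for i, lab in enumerate(labels) if lab == -1]
```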
When outliers are found, Beacon utilizes its rich monitoring data to examine neighbor jobs that share forwarding node(s) with the abnormal job. In particular, it judges the cause of the anomaly by whether such neighbors have interference-prone features, such as high MDOPS, high I/O bandwidth, high IOPS, or the N:1 I/O mode. The I/O mode indicates the parallel file sharing mode among processes, where common modes include “N:N” (each compute process accesses a separate file), “N:1” (all processes share one file), “N:M” (N processes perform I/O aggregation to access M files, M\(\lt\)N), and “1:1” (only one of all processes performs sequential I/O on a single file). Such findings are saved in the Beacon database and provided to users via Beacon’s web-based application I/O query tool. Applications, of course, need to accumulate at least several executions for such detection to take effect.
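The interference-prone check can be viewed as a simple rule set over the neighbor jobs’ summarized statistics, as in the sketch below. Only the 300 ops/s MDOPS threshold comes from Section 4.2.3; the bandwidth and IOPS thresholds are illustrative placeholders (in Beacon they are configurable).

```python
# Sketch of the rule-based check for interference-prone neighbor jobs.
# Only the 300 ops/s MDOPS threshold is taken from the text; the bandwidth
# and IOPS thresholds below are illustrative placeholders.
HIGH_MDOPS = 300        # metadata ops/s during I/O phases (Section 4.2.3)
HIGH_BW_MBPS = 2000     # placeholder: near a forwarding node's peak bandwidth
HIGH_IOPS = 10000       # placeholder

def interference_features(job_stats):
    """job_stats: dict with keys mdops, bandwidth_mbps, iops, io_mode."""
    features = []
    if job_stats["io_mode"] == "N:1":
        features.append("N:1 I/O mode")
    if job_stats["mdops"] > HIGH_MDOPS:
        features.append("high MDOPS")
    if job_stats["bandwidth_mbps"] > HIGH_BW_MBPS:
        features.append("high I/O bandwidth")
    if job_stats["iops"] > HIGH_IOPS:
        features.append("high IOPS")
    return features   # empty list -> this neighbor unlikely to explain the anomaly
```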
The right part of Figure 5 shows the workflow of Beacon’s node anomaly detection, which relies on the execution of large-scale jobs (those using 1,024 or more compute nodes in our current implementation). To spot outliers, it leverages the common homogeneity in I/O behavior across compute and server nodes. Beacon’s multi-level monitoring allows the correlation of I/O activities or loads back to actual client-side issued requests. Again, by using clustering algorithms like DBSCAN and configurable thresholds, Beacon performs outlier detection across forwarding nodes and OSTs involved in a single job, where the vast majority of entities report a highly similar performance, while only a few members produce contrasting readings. Figure 15 in Section 4.3 gives an example of per-OST bandwidth data within the same execution.

3.3.2 Per-job I/O Performance Analysis.

Upon a job’s completion, Beacon performs automatic analysis of its I/O monitoring data collected from all layers. It performs inter-layer correlation by first identifying, from the job database, the job that ran on the given compute node(s) at the log entry collection time. The involved forwarding nodes, and hence the relevant forwarding monitoring data, are then located via the compute-to-forwarding node mapping using a system-wide mapping table lookup; as mentioned above, this mapping is statically configured on TaihuLight. Finally, the relevant OSTs and the corresponding storage nodes’ monitoring data entries are found through a file system lookup using the Lustre lfs command. Note that the correlation is straightforward when an application uses the nodes at each layer exclusively. When several jobs share some of the forwarding and storage nodes, however, Beacon can only make a simple estimation based on each job’s compute-layer I/O throughput, which is not shared across jobs.
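The correlation procedure can be summarized by the following sketch; the job-database schema, the mapping-table format, and the parsing of lfs getstripe output are simplified assumptions (the exact output format varies across Lustre versions).

```python
# Sketch of per-job cross-layer correlation: job -> compute nodes ->
# forwarding nodes (static mapping) -> OSTs (via `lfs getstripe`).
# The database schema, mapping format, and output parsing are simplified.
import subprocess

def correlate_job(job, fwd_map):
    """job: dict with compute_nodes, start, end, output_files (assumed schema).
    fwd_map: static compute-node -> forwarding-node mapping table."""
    forwarding_nodes = {fwd_map[cn] for cn in job["compute_nodes"]}

    osts = set()
    for path in job["output_files"]:
        out = subprocess.run(["lfs", "getstripe", path],
                             capture_output=True, text=True).stdout
        # The striping report lists one OST index (obdidx) per stripe;
        # the parsing here is illustrative only.
        for line in out.splitlines():
            fields = line.split()
            if fields and fields[0].isdigit():
                osts.add(int(fields[0]))

    return {
        "time_range": (job["start"], job["end"]),
        "forwarding_nodes": sorted(forwarding_nodes),
        "osts": sorted(osts),
    }
```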
From the above data, Beacon derives and stores coarse-grained information for quick query, including the average and peak I/O bandwidth, average IOPS, runtime, number of processes (and compute nodes) performing I/O, I/O mode, total count of metadata operations, and average metadata operations per second during I/O phases.
To help users understand and debug their applications’ I/O performance, Beacon provides web-based I/O data visualization. This diagnosis system can be queried using a job ID and, after appropriate authentication, allows visualizing the I/O statistics of the job, both in real time and post-mortem. It reports the measured I/O metrics (such as bandwidth and IOPS) and inferred characteristics (such as the number of I/O processes and the I/O mode). Users are also presented with configurable visualization tools showing time-series measurements of I/O metrics, statistics such as request type/size distributions, and performance variances. The underlying I/O monitoring database allows further user-initiated navigation, such as per-compute-node traffic history and zoom controls to examine data at different granularities. For security/privacy, users are only allowed to view I/O data from compute, forwarding, and storage nodes involved in their jobs’ execution and only for its duration.

3.3.3 I/O Subsystem Monitoring for Administrators.

Beacon also provides administrators with the capability to monitor the I/O status for any time period, on any node.
Besides all the user-visible information and facilities mentioned above, administrators can further obtain and visualize: (1) the detailed I/O bandwidth and IOPS for each compute node, forwarding node, and storage node, (2) resource utilization status of forwarding nodes, storage nodes and the MDS, including detailed request queue length statistics, and (3) I/O request latency distribution on forwarding nodes. Additionally, Beacon grants administrators direct I/O record database access to facilitate in-depth analysis.
Combining such facilities, administrators can perform powerful and thorough I/O traffic and performance analysis, for example, by checking multi-level traffic, latency, and throughput monitoring information regarding a job execution.

3.4 Generality

Beacon is not an ad-hoc I/O monitoring system for TaihuLight. It can be adopted not only for collecting data beyond I/O but also on other platforms. Beacon’s building blocks, such as its log collection, compression, and data management components, are equally suitable for other data sources. Section 4.5.1 shows an example of collecting network data.
In addition, Beacon is also applicable to other advanced supercomputers with the I/O forwarding architecture. Beacon’s multi-layer data collection and storage, scheduler-assisted per-application data correlation and analysis, history-based anomaly identification, automatic I/O mode detection, and built-in interference analysis can all be performed on other supercomputers. Its data management components, such as Logstash, Redis, and ElasticSearch, are open-source software that can run on these machines as well. Our forwarding layer design validation and load analysis can also help recent platforms with a layer of burst buffer nodes, such as NERSC’s Cori [10]. Section 4.5.2 gives an example of extending Beacon to another supercomputer with the I/O forwarding architecture.
Finally, we find that while Beacon is designed and deployed on a cutting-edge supercomputer with multi-layer architectures, it can also be applied to traditional two-layer supercomputers. An example of extending Beacon to a traditional two-layer supercomputer is given in Section 4.5.3.

4 Beacon Use Cases

We now discuss several use cases of Beacon. Beacon has been deployed on TaihuLight for over three years, gathering massive amounts of I/O information and accumulating around 25 TB of trace data (after two passes of compression) from April 2017 to July 2020. As TaihuLight’s back-end storage changed in August 2020, we use data from before August 2020 for analysis. This history contains 1,460,662 jobs using at least 32 compute nodes and consuming 789,308,498 core-hours in total. Of these jobs, 238,585 (16.3%) featured non-trivial I/O, with per-job I/O volume over 200 MB.
The insights and issues revealed by Beacon’s monitoring and diagnosis have already helped TaihuLight administrators fix several design flaws, develop a dynamic and automatic forwarding node allocation tool, and improve system reliability and application efficiency. Owing to Beacon’s success on TaihuLight, we extend Beacon to other platforms. In this section, we focus on four types of use cases and the extended applications of Beacon for network monitoring and monitoring of different storage architectures:
(1)
System performance overview
(2)
Performance issue diagnosis
(3)
Automatic I/O anomaly diagnosis
(4)
Application and user behavior analysis

4.1 System Performance Overview

Beacon’s multi-layer monitoring, especially its I/O subsystem monitoring, gives us an overview of the whole system, which helps manage current storage systems and construct future ones. Liu’s work [41] took Titan as an example to show that individual pieces of hardware (such as storage nodes and disks) are often under-utilized in HPC storage systems, and we make similar observations on TaihuLight. Figure 7 shows back-end utilization statistics of the Lustre parallel file system on the TaihuLight supercomputer over eight months. For each Object Storage Target (OST), a disk array, we plot the percentage of time it reaches a certain average throughput, normalized to its peak throughput. OSTs are almost idle, using less than 1% of their I/O bandwidth, at least 60% of the time, and their utilization is below 5% about 70% of the time. We can therefore conclude that OSTs are under-utilized most of the time. We draw similar conclusions for compute and forwarding nodes using Beacon’s multi-layer monitoring data.
Fig. 7.
Fig. 7. Cumulative Distribution Function (CDF) of OST I/O throughput.
Beyond conclusions drawn from individual layers, Beacon can also reveal the relationship between different layers, which is unavailable to traditional trace tools. Figure 8 shows the daily access volume at three layers during the sample period. For read operations in particular, the total daily volume requested by the compute layer exceeds that of the forwarding layer most of the time, reflecting effective caching by the Lustre clients on the forwarding layer. Sometimes, however, the read volume requested by the forwarding layer is much larger than that of the compute layer, revealing cache thrashing; we discuss its details later in this section. For write operations, the total daily volume requested from the forwarding layer is always slightly larger than that of the compute layer. A major reason is write amplification, caused by writes being aligned to the 4-KB request size (or multiples of 4 KB).
Fig. 8.
Fig. 8. Access volume history for the TaihuLight compute layer, forwarding layer, and OST layer.
The OST layer, however, tells a different story. We find that both the read and write volumes at the compute and forwarding layers are much smaller than at the OST layer. Besides write amplification, there are other reasons for this. In addition to the compute and forwarding nodes of TaihuLight, other nodes such as login or ACC nodes can also access the shared Lustre back-end storage system, and Beacon does not currently capture them. From the figure, however, we can conclude that system administrators should also pay attention to the file system access load from login and ACC nodes. According to our survey, users often perform many file I/O operations on login nodes, such as copying data from local file systems to Lustre or from one directory to another, or perform data post-processing on ACC nodes. More details are given in Section 4.4.

4.2 Performance Issue Diagnosis

4.2.1 Forwarding Node Cache Thrashing.

Beacon’s end-to-end monitoring facilitates cross-layer correlation of I/O profiling data at different temporal or spatial granularities. By comparing the total request volume at each layer, Beacon has helped TaihuLight’s infrastructure management team identify a previously unknown performance issue, as detailed below.
A major driver for the adoption of I/O forwarding or the burst buffer layer is the opportunity to perform prefetching, caching, and buffering, so as to reduce the pressure on slower disk storage. Figure 9 shows the read volume on compute and forwarding node layers, during two sampled 70-hour periods in August 2017. Figure 9(a) shows a case with expected behavior, where the total volume requested by the compute nodes is significantly higher than that requested by the forwarding nodes, signaling good access locality and effective caching. Figure 9(b), however, tells the opposite story, to the surprise of system administrators: The forwarding layer incurs much higher read traffic from the back end than requested by user applications, reading much more data from the storage nodes than returning to compute nodes. Such a situation does not apply to writes, where Beacon always shows the matching aggregate bandwidth across the two levels.
Fig. 9.
Fig. 9. Sample segments of TaihuLight read volume history, each collected at two layers.
Further analysis of the applications executed and their assigned forwarding nodes during the problem period in Figure 9(b) reveals an unknown cache thrashing problem, caused by the N:N sequential data access behavior. By default, the Lustre client has a 40-MB read-ahead cache for each file. Under the N:N sequential read scenarios, such aggressive prefetching causes severe memory contention, with data repeatedly read from the back end (and evicted on forwarding nodes). For example, a 1024-process Shentu [39] execution has each I/O process read a 1-GB single file, incurring a 3.5\(\times\) I/O amplification at the Lustre back end of Icefish. This echoes the previous finding on the existence of I/O self-contention within a single application [45].
Solution. This problem can be addressed by adjusting the Lustre per-file prefetching cache size. For example, changing it from 40 MB per file to 2 MB is shown to remove the thrashing. Automatic, per-job forwarding node cache reconfiguration, which leverages real-time Beacon monitoring results, is currently under development for TaihuLight. Alternatively, reducing the number of accessed files through data aggregation is an effective way to relieve this problem. Using MPI collective I/O is a convenient method to refactor an application from the N:N I/O mode to the N:M mode, leading to fewer files being accessed at the same time. Given the close collaboration between application teams and machine administrators, making performance-critical program changes as suggested by monitoring data analysis is an accepted practice on leading supercomputers.

4.2.2 Bursty Forwarding Node Utilization.

Beacon’s continuous end-to-end I/O monitoring gives center management a global picture on system resource utilization. While such systems have often been built and configured using rough estimates based on past experience, Beacon collects detailed resource usage history to help improve the current system’s efficiency and assist future system upgrade and design.
Figure 10 gives one example, again on the forwarding load distribution, by showing two 1-day samples from July 2017. Each row portrays the by-hour peak load on one of the same 40 forwarding nodes randomly sampled from the 80 active ones. The darkness reflects the maximum bandwidth reached within that hour. The labels “high”, “mid”, “low”, and “idle” correspond to the maximum residing in the \(\gt\)90%, 50–90%, 10–50%, or 0–10% interval (relative to the benchmarked per-forwarding-node peak bandwidth), respectively.
Fig. 10.
Fig. 10. Sample TaihuLight 1-day load summary, showing the peak load level by hour, across 40 randomly sampled forwarding nodes.
Figure 10(a) shows the more typical load distribution, where the majority of forwarding nodes stay lightly used for the vast majority of the time (90.7% of cells show a maximum load of under 50% of peak bandwidth). Figure 10(b) gives a different picture, with a significant set of sampled forwarding nodes serving I/O-intensive large jobs for a good part of the day. Moreover, 35.7% of the cells actually see a maximum load of over 99% of the peak forwarding node bandwidth.
These results indicate that (1) overall, there is forwarding resource overprovisioning (confirming prior findings [27, 41, 47, 62]); (2) even with the more representative low-load scenarios, it is not rare for the forwarding node bandwidth to be saturated by application I/O; and (3) a load imbalance across forwarding nodes exists regardless of load level, making idle resources potentially helpful to I/O-intensive applications.
Solution. In view of the above, recently, TaihuLight has enlisted more of its “backup forwarding nodes” into regular service. Moreover, a dynamic, application-aware forwarding node allocation scheme has been designed and partially deployed (turned on for a subset of applications) [31]. Leveraging application-specific job history information, such an allocation scheme is intended to replace the default, static mapping between compute and forwarding nodes.

4.2.3 MDS Request Priority Setting.

Overall, we find that most TaihuLight jobs were rather metadata-light, but Beacon does observe a small fraction of parallel jobs (0.69%) with a high metadata request rate (more than 300 metadata operations/s on average during I/O phases). Beacon finds that these metadata-heavy (“high-MDOPS”) applications tend to cause significant I/O performance interference. Among jobs with Beacon-detected I/O performance anomalies, those sharing forwarding nodes with high-MDOPS jobs experience an average 13.6\(\times\) increase in read/write request latency during the affected time periods.
Such severe delays, together with the corresponding Beacon forwarding node queue status history, prompted us to examine the TaihuLight LWFS server policy. We find that metadata requests are given priority over file I/O, based on the single-MDS design and the need to provide fast responses to interactive user operations such as ls. Because neither disk bandwidth nor metadata server capacity is saturated here, such interference can easily remain undetected by existing approaches that focus only on I/O-intensive workloads [23, 41].
Solution. As a temporary solution, we add probabilistic processing across priority classes to the TaihuLight LWFS scheduling. Instead of always giving metadata requests high priority, an LWFS server thread now follows a \(P\!:\!(1\!-\!P)\) split (P configurable) between picking the next request from the separate queues hosting metadata and non-metadata requests. Figure 11 shows the “before” and “after” pictures, with LAMMPS [15] (a classical molecular dynamics simulator) running against the high-MDOPS DNDC [25] (a bio-geochemistry application for agro-ecosystem simulation). Throughput of their solo-runs, where each application runs by itself on an isolated testbed, is given as reference. With a simple equal probability split, LAMMPS co-run throughput doubles, while DNDC only perceives a 10% slowdown. For a long-term solution, we plan to leverage Beacon to automatically adapt the LWFS scheduling policies by considering operation types, the MDS load level, and application request scheduling fairness.
Fig. 11.
Fig. 11. Impact of metadata operations’ priority adjustment.
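A minimal sketch of this probabilistic split is shown below, with P as a configurable parameter (the equal-probability case corresponds to p=0.5); the queue structures and names are ours, not the LWFS implementation’s.

```python
# Minimal sketch of the probabilistic P:(1-P) scheduling split between the
# metadata queue and the data (read/write) queue on an LWFS server thread.
import random
from collections import deque

def pick_next_request(meta_queue: deque, data_queue: deque, p: float = 0.5):
    """With probability p serve a metadata request, otherwise a data request;
    fall back to the non-empty queue if the chosen one is empty."""
    prefer_meta = random.random() < p
    primary, secondary = ((meta_queue, data_queue) if prefer_meta
                          else (data_queue, meta_queue))
    if primary:
        return primary.popleft()
    if secondary:
        return secondary.popleft()
    return None   # both queues empty
```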

4.3 Automatic I/O Anomaly Diagnosis

In extreme-scale supercomputers, users typically accept jittery application performance, recognizing widespread resource sharing among jobs. System administrators, moreover, see different behaviors among system components with a homogeneous configuration, but cannot tell how much of that difference comes from these components’ functioning and how much comes from the diversity of tasks they perform.
Beacon’s multi-layer monitoring capability therefore presents a new window for supercomputer administrators to examine system health, by connecting statistics on application-issued I/O requests all the way to individual OSTs’ bandwidth measurements.

4.3.1 Overview of Anomaly Detection Results of Applications.

Figure 12 shows the results of anomaly detection with historical data collected from April 2017 to July 2020. Our results show that about 4.8% of all jobs that featured non-trivial I/O have experienced abnormal performance.
Fig. 12.
Fig. 12. Results of automatic anomaly detection. “Mix” means that abnormal jobs or their neighbor jobs have more than one kind of explicit I/O pattern, such as the N:1 I/O mode, high MDOPS, and high I/O bandwidth. “Multiple jobs” means that an abnormal job has many neighbor jobs, and their aggregate I/O bandwidth, IOPS, or MDOPS is high. “System Anomaly” means that neighbor jobs have no explicit I/O features, but their corresponding forwarding nodes or storage nodes are detected as performance degradation by Beacon.
Figure 12(a) shows the distribution of abnormal jobs’ categories. Low-bandwidth jobs make up the majority of abnormal jobs, and WRF accounts for most of them. Jobs with N:1 I/O and high bandwidth also play an important role; later in this paper we analyze why applications with the N:1 I/O mode are easily disturbed. Jobs with high MDOPS and high IOPS account for the smallest percentages, because these two types of jobs make up a small portion of all jobs on TaihuLight.
Figure 12(b) shows the factors that neighbor jobs impose on abnormal jobs, divided into three categories: (1) system anomaly, (2) I/O interference, and (3) unknown factors. I/O interference factors include the N:1 I/O mode, high MDOPS, high I/O bandwidth, high IOPS, mix, and multiple jobs. The figure shows that interference from neighboring applications accounts for more than 90% of the cases, implying that application interference is the predominant cause of jobs’ performance degradation. Among them, interference caused by jobs with the N:1 I/O mode occupies the largest share, which means that jobs with the N:1 I/O mode are not only susceptible to disturbance but also interfere with other applications; Section 4.4 provides more information. Mix and jobs with high MDOPS rank second and third, respectively. The LWFS server thread pool on each forwarding node is limited to 16 threads, and jobs suffer performance degradation when I/O operations on the same forwarding node exceed the thread pool’s service capacity.

4.3.2 Applications Affected by Interference.

Figure 13 illustrates an example of a 1024-process Shentu run co-running with other applications with different I/O patterns on a shared forwarding node. Shentu suffers various degrees of interference while co-running with other jobs. Among them, jobs with the N:1 I/O mode and high metadata rates have a significantly higher performance impact on Shentu than jobs with the other two I/O patterns. Because the forwarding nodes and compute nodes on the Sunway TaihuLight are statically mapped, I/O interference on forwarding nodes is a major cause of applications’ performance anomalies.
Fig. 13.
Fig. 13. Sample Shentu-256 I/O bandwidth timelines. The red line represents the performance of Shentu running on a dedicated forwarding node. The blue line represents the performance of Shentu when interfered with by other applications with different I/O patterns while sharing a forwarding node.
Solution. With Beacon’s real-time monitoring data, I/O interference on forwarding nodes can be detected early, which helps improve application performance. Motivated by these findings, a dynamic forwarding allocation system [31] that isolates I/O interference on the forwarding nodes was developed, tested, and deployed.

4.3.3 Application-driven Anomaly Detection.

Most I/O-intensive applications have distinct I/O phases (i.e., episodes in their execution where they perform I/O continuously), such as those to read input files during initialization or to write intermediate results or checkpoints. For a given application, such I/O phase behavior is often consistent. Taking advantage of such repeated I/O operations and its multi-layer I/O information collection, Beacon performs automatic I/O phase recognition, on top of which it conducts I/O-related anomaly detection. More specifically, larger applications (e.g., those using 1024 compute nodes or more) spread their I/O load to multiple forwarding nodes and back-end nodes, giving us opportunities to directly compare the behavior of servers processing requests known to Beacon as homogeneous or highly similar.
Figure 14 gives an example of a 6000-process LAMMPS run with checkpointing that is affected by an abnormal forwarding node. The 1500 compute nodes are assigned to three forwarding nodes, whose bandwidth and I/O time are reflected in the time-series data from Beacon. We can clearly see that the Fwd1 node is a straggler in this case, serving at a bandwidth much lower than its peak (while not serving any other applications). As a result, there is a 20\(\times\) increase in the application-visible checkpoint operation time, estimated using the other two forwarding nodes’ I/O phase durations.
Fig. 14.
Fig. 14. Forwarding bandwidth in a 6000-process LAMMPS run.

4.3.4 Anomaly Alert and Node Screening.

Such continuous, online application performance anomaly detection can identify forwarding nodes or back-end units with deviant performance metrics, which in turn triggers Beacon’s more detailed monitoring and analysis. If it finds such a system component to consistently under-perform relative to peers serving similar workloads, with configurable thresholds in monitoring window and degree of behavior deviation, it reports this as an automatically detected system anomaly. By generating and sending an alarm email to the system administration team, Beacon prompts system administrators to do a thorough examination, where its detailed performance history information and visualization tools are also helpful.
Such anomaly screening is particularly important for expensive, large-scale executions. For example, among all applications running on TaihuLight so far, the parallel graph engine Shentu [39] has the most intensive I/O load. It scales well to the entire supercomputer in both computation and I/O, with 160,000 processes and large input graphs distributed evenly to nearly 400 Lustre OSTs. During test runs preparing for its Gordon Bell bid in April 2018, Beacon’s monitoring discovered a few OSTs significantly lagging behind in the parallel read, slowing down the initialization as a result (Figure 15). By removing them temporarily from service and relocating their data to other OSTs, Shentu cuts its production run initialization time by 60%, saving expensive dedicated system allocation and power consumption. In this particular case, further manual examination attributes the problem to these OSTs’ RAID controllers, which are now fixed.
Fig. 15.
Fig. 15. Per-OST bandwidth during a Shentu execution.
Without Beacon’s back-end monitoring, however, applications like Shentu would accept the bandwidth they obtain without suspecting that the I/O performance is abnormal. Similarly, without Beacon’s routine front-end tracing, profiling, and per-application performance anomaly detection, back-end outliers would go unnoticed. As full-system benchmarking requires taking the supercomputer offline and cannot be attempted regularly, Beacon provides a much more affordable way to continuously monitor and diagnose system health by coupling application-side and server-side tracing/profiling information.
Beacon has been deployed on TaihuLight since April 2017, with features and tools incrementally developed and added to production use. Table 2 summarizes the automatically identified I/O system anomaly occurrences at the two service layers from April 2017 to July 2020. Such identification uses a threshold on the measured maximum bandwidth (under 30% of the known peak value) combined with a minimum duration of 60 minutes; these parameters can be configured to adjust the anomaly detection system’s sensitivity. Most performance anomaly occurrences are found to be transient, lasting under 4 hours.
Table 2.
                          Location of anomaly
Duration (hours)          Forwarding node (times)    OSS+OST (times)
(0, 1)                    193                        185
[1, 4)                    59                         73
[4, 12)                   33                         51
[12, 96)                  22                         25
≥96, manually verified    15                         22
Table 2. Duration of Beacon-identified System Anomalies
There are a total of 70 occurrences of performance anomalies lasting over 4 hours on the forwarding layer and 98 on the back-end layer, confirming the existence of fail-slow situations, which are also common in data centers [28]. Reasons for such relatively long yet “self-healed” anomalies include service migration and RAID reconstruction. With our rather conservative setting during the initial deployment period, Beacon sends the aforementioned alert email when a detected anomaly lasts beyond 96 hours (except for large-scale production runs, as in the Shentu example above, where the faulty units are reported immediately). On all these occasions, the Beacon-detected anomalies were confirmed by human examination.
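The identification rule described above reduces to a few comparisons per monitored node, as in the sketch below, which uses the 30%-of-peak and 60-minute thresholds together with the 96-hour alert rule; all three values are configurable in Beacon and are hard-coded here only for illustration.

```python
# Sketch of the anomaly classification/alert rule: a node is flagged when its
# measured maximum bandwidth stays under 30% of its known peak for at least
# 60 minutes; anomalies persisting beyond 96 hours trigger an alert email.
BANDWIDTH_FRACTION = 0.30     # fraction of known peak bandwidth
MIN_DURATION_H = 1.0          # 60 minutes
ALERT_DURATION_H = 96.0

def classify_anomaly(max_bw, peak_bw, duration_h):
    """Return (is_anomaly, should_alert) for one monitored node."""
    is_anomaly = (max_bw < BANDWIDTH_FRACTION * peak_bw
                  and duration_h >= MIN_DURATION_H)
    should_alert = is_anomaly and duration_h >= ALERT_DURATION_H
    return is_anomaly, should_alert
```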

4.4 Application and User Behavior Analysis

With its powerful information collection and multi-layer I/O activity correlation, Beacon provides a new capability to perform detailed application or user behavior analysis. Results of such analysis assist in performance optimization, resource provisioning, and future system design. Here, we showcase several application/user behavior studies, some of which have led to corresponding optimizations or design changes to the TaihuLight system.

4.4.1 Application I/O Mode Analysis.

First, Table 3 gives an overview of the I/O volume across all profiled jobs with a non-trivial I/O, categorized by per-job core-hour consumption. Here, 1,000 K core-hours correspond to a 10-hour run using 100,000 cores on 25,000 compute nodes, and jobs with such consumption or higher write more than 40 TB of data on average. Further examination reveals that in each core-hour category, average read/write volumes are influenced by a minority group of heavy consumers. Overall, the amount of data read/written grows as the jobs consume more compute node resources. The less resource-intensive applications tend to perform more reads, while the larger consumers are more write-intensive.
Table 3.
Type     (0, 1K]    (1K, 10K]    (10K, 100K]    (100K, 1000K]    (1000K, ∞)
Read     8.1 GB     101.0 GB     166.9 GB       1172.9 GB        2010.6 GB
Write    18.2 GB    83.9 GB      426.6 GB       615.9 GB         41458.8 GB
Table 3. Average Per-job I/O Volume by Core-hour Consumption
Figure 16 shows the breakdown of I/O-mode adoption among all TaihuLight jobs performing non-trivial I/O, by total read/write volume. The first impression from these results is that the rather “extreme” cases, N:N and 1:1, form the dominant choices, especially for writes. We suspected that this distribution might be skewed by a large number of small jobs doing limited I/O, and therefore calculated the average per-job read/write volume for each I/O mode. The results (Table 4) show that this is not the case; in fact, applications that choose the 1:1 mode for writes have a much higher average write volume.
Fig. 16.
Fig. 16. Distribution of file access modes, in access volume.
Table 4.
I/O mode    Avg. read volume    Avg. write volume    Job count
N:N         96.8 GB             120.1 GB             11073
N:M         36.2 GB             63.2 GB              324
N:1         19.6 GB             19.3 GB              2382
1:1         33.0 GB             142.3 GB             16251
Table 4. Average I/O Volume and Job Count by I/O Mode
The 1:1 mode is the closest to sequential processing behavior and is conceptually simple. However, it obviously lacks scalability and fails to utilize the abundant hardware parallelism in the TaihuLight I/O system. The wide presence of this I/O mode may help explain the overall under-utilization of forwarding resources discussed earlier in Section 4.2. Echoing similar (though less extreme) findings on other supercomputers [47] (including Intrepid [30], Mira [58], and Edison [51]), effective user education on I/O performance and scalability can both improve storage system utilization and reduce wasted compute resources.
The N:1 mode tells a different story. It is an intuitive parallel I/O solution that allows compute processes to directly read to or write from their local memory without gather-scatter operations, while retaining the convenience of having a single input/output file. However, our detailed monitoring finds it to be a damaging I/O mode that users should steer away from, as explained below.
First, our monitoring results confirm the findings of existing research [2, 46]: the N:1 mode offers low application I/O performance because all processes read from or write to a single shared file. Even with a large N, such applications receive no more than 250 MB/s of aggregate I/O throughput, despite the peak combined TaihuLight back-end bandwidth of 260 GB/s. For read operations, users here also rarely modify the default Lustre stripe width, confirming the behavior reported in a recent ORNL study [38]. The problem is much worse for writes, where performance degrades severely owing to file system locking.
This study, however, finds that applications with the N:1 mode are extraordinarily disruptive, as they harm all kinds of neighbor applications that share forwarding nodes with them, particularly when N is large (e.g., over 32 compute nodes).
The reason is that each forwarding node runs an LWFS server thread pool (currently sized at 16), providing forwarding service to its assigned compute nodes. Applications using the N:1 mode tend to flood this thread pool with bursts of requests. Unlike the N:N or N:M modes, N:1 also suffers from the aforementioned poor back-end performance caused by the single shared file. This in turn makes N:1 requests slow to process, further exacerbating their congestion in the queue and delaying requests from other applications, even when those victims access disjoint back-end servers and OSTs.
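To make the queueing effect concrete, the following minimal sketch (our illustration, not Beacon or LWFS code) simulates a FIFO request queue served by a fixed pool of 16 workers; the arrival patterns and service times are invented for illustration. A burst of slow shared-file requests from one application inflates the wait time of another application's otherwise fast requests, even though the two never touch the same back-end targets.

```python
# Simulate a shared forwarding-node thread pool (FIFO dispatch, 16 workers).
# App B issues sparse, fast requests; App A injects a burst of slow N:1-style
# requests. All timing constants are illustrative.
import heapq

def simulate(requests, pool_size=16):
    """requests: list of (arrival_time, service_time, app) tuples."""
    workers = [0.0] * pool_size              # time at which each worker frees up
    heapq.heapify(workers)
    waits = {}
    for arrival, service, app in sorted(requests):
        free_at = heapq.heappop(workers)
        start = max(arrival, free_at)        # queueing delay if all workers are busy
        heapq.heappush(workers, start + service)
        waits.setdefault(app, []).append(start - arrival)
    return {app: sum(w) / len(w) for app, w in waits.items()}

victim = [(t * 0.01, 0.0001, "B") for t in range(1000)]   # fast, sparse requests
burst = [(0.0, 0.05, "A") for _ in range(2000)]           # slow shared-file burst

print("B alone   :", simulate(victim))
print("B + burst :", simulate(victim + burst))            # B's wait time explodes
```

This mirrors the WRF/AWP co-run measurement presented next.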
Here, we give a concrete example of I/O mode-induced performance interference, featuring the earthquake simulation AWP [20] (2017 Gordon Bell Prize winner), which originally used the N:1 mode. In this sample execution, AWP co-runs with the weather forecast application WRF [69], which uses the 1:1 mode; each has 1024 processes on 256 compute nodes. In the “solo” mode, we assign each application a dedicated forwarding node in a small testbed partition of TaihuLight. In the “co-run” mode, we let the two applications share one forwarding node (as the default compute-to-forwarding mapping is 512-to-1).
Table 5 lists the two applications’ average request wait times, processing times, and forwarding node queue lengths during these runs. Note that with the “co-run”, the queue is shared by both applications. We find that the average wait time of WRF increases by 11\(\times\) when co-running, but AWP is not affected. This result reveals the profound malpractice of the N:1 file sharing mode and confirms the prior finding that I/O interference is access-pattern-dependent [37, 43].
Table 5.
Operation            Avg. wait time   Avg. proc. time   Avg. queue length
WRF write (solo)     2.73 ms          0.052 ms          0.22
WRF write (co-run)   30.06 ms         0.054 ms          208.51
AWP read (solo)      58.17 ms         3.44 ms           226.37
AWP read (co-run)    58.18 ms         3.44 ms           208.51
Table 5. Performance Interference During WRF and AWP Co-run Sharing a Forwarding Node
Solution. Our tests confirm that increasing the LWFS thread pool size does not help in this case, as the bottleneck lies on the OSTs. Moreover, avoiding the N:1 mode has been advised in prior work [2, 90], as well as in numerous parallel I/O tutorials. Considering our new inter-application study results, avoiding it is an obvious “win-win” strategy that simultaneously improves large applications’ I/O performance and reduces their disruption of concurrent workloads. However, based on our experience with real applications, this message needs to be better promoted.
In our case, the Beacon developers worked with the AWP team to replace its original N:1 file read (for initialization/restart) with the N:M mode during the 2017 ACM Gordon Bell Prize final submission phase. Changing an application’s I/O mode from N:1 to N:M means selecting M of its N processes to perform I/O. The value of M was chosen empirically, based on N:M experiments. Figure 17 shows the results of varying M for a 1024-process AWP run on 256 compute nodes connected to one forwarding node. The aggregate bandwidth grows near-linearly with M in the range of 1 to 32: as long as the processes performing I/O do not saturate the forwarding node’s peak bandwidth, having more processes write to more separate files yields higher aggregate bandwidth. When M increases to 64, the aggregate bandwidth increases only slightly, limited by the single forwarding node. When M > 64, the aggregate bandwidth even declines slightly because of resource contention, and using more files can also make performance less stable. Thus, we suggest that when converting an application from N:1 to N:M on TaihuLight, selecting 1 out of every 16 or 32 processes to perform I/O is a cost-effective choice.
Fig. 17.
Fig. 17. The N:M experiment.
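As an illustration of the regrouping itself, the sketch below (not AWP's actual code) maps each process to a group of 16 and elects the lowest rank of the group as the I/O process; the non-elected ranks would ship their data to that process (e.g., via a gather), which writes one file per group. The rank arithmetic and file naming are our own assumptions.

```python
# N:M layout helper: with one I/O process ("aggregator") per group_size ranks,
# N processes produce M = ceil(N / group_size) files. Names are illustrative.
def nm_layout(rank: int, group_size: int = 16):
    group = rank // group_size
    aggregator = group * group_size            # lowest rank in the group does the I/O
    file_name = f"checkpoint.part{group:05d}"  # one shared file per group
    return rank == aggregator, aggregator, file_name

# Example: 1024 processes with 1 aggregator per 16 ranks -> M = 64 files.
n = 1024
aggregators = {nm_layout(r)[1] for r in range(n)}
print(len(aggregators), "I/O processes / output files")   # -> 64
```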
This change produced an over-400% improvement in I/O performance. Note that the Gordon Bell Prize submission does not report I/O time; we find that AWP’s 130,000-process production runs spend the bulk of their execution time reading around 100 TB of input or checkpoint data. Significantly reducing this time greatly facilitates AWP’s development and testing and saves non-trivial supercomputer resources.

4.4.2 Metadata Server Usage.

Unlike the forwarding nodes’ utilization discussed earlier, the Lustre MDS is found by Beacon’s continuous load monitoring to have rather evenly distributed load levels (Figure 18(a)). In particular, 26.8% of the time, the MDS experiences a load level (in requests per second) above 75% of its peak processing throughput.
Fig. 18.
Fig. 18. TaihuLight Lustre MDS load statistics.
Beacon allows us to further split the requests between systems sharing the MDS, including the TaihuLight forwarding nodes, login nodes, and the ACC. To the surprise of TaihuLight administrators, over 80% of the metadata access workload actually comes from the ACC (Figure 18(b)).
Note that the login nodes and the ACC have their own local file systems (ext4 and GPFS [66], respectively), which users are encouraged to use for purposes such as application compilation and data post-processing/visualization. However, as these users are typically TaihuLight users as well, we find that most of them prefer to directly use the main Lustre scratch file system intended for TaihuLight jobs, for convenience. While the I/O bandwidth/IOPS resources consumed by such tasks are negligible, interactive user activities (such as compiling or post-processing) turn out to be metadata-heavy.
Large waves of such unintended user activity correspond to the heaviest-load periods at the tail end of Figure 18(a), and they have led to MDS crashes that directly affect applications running on TaihuLight. According to our survey, many other machines, including two of the top 10 supercomputers (Sequoia [83] and Sierra [33]), also have a single MDS, presumably assuming that their users follow similar usage guidelines.
Solution. There are several potential solutions to this problem. With the help of Beacon, we can identify users performing metadata-heavy activities and remind them to avoid using the PFS directly. Alternatively, we can support more scalable Lustre metadata processing with an MDS cluster. A third approach is to provide intelligent workflow support that automatically performs data transfers based on users’ needs; this is the approach we are currently developing.
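The first option reduces to a simple per-user accounting pass over Beacon's MDS records. The sketch below is our illustration only: the record format, the one-hour window, and the per-user budget are assumptions, not Beacon's actual schema or policy.

```python
# Flag users whose interactive (ACC-originated) metadata rate on the Lustre
# MDS exceeds a budget, so administrators can remind them to use the local
# file systems instead. Record layout and threshold are illustrative.
from collections import defaultdict

THRESHOLD_OPS_PER_S = 500            # illustrative per-user budget
WINDOW_S = 3600                      # one-hour accounting window

def flag_metadata_heavy(records):
    """records: iterable of (user, source, md_ops) tuples for one window."""
    per_user = defaultdict(int)
    for user, source, md_ops in records:
        if source == "ACC":          # only count interactive/analysis-cluster traffic
            per_user[user] += md_ops
    return [u for u, ops in per_user.items() if ops / WINDOW_S > THRESHOLD_OPS_PER_S]

sample = [("alice", "ACC", 3_000_000), ("bob", "ACC", 40_000),
          ("carol", "forwarding", 9_000_000)]
print(flag_metadata_heavy(sample))   # -> ['alice']
```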

4.4.3 Jobs’ Request Size Analysis.

Figure 19 plots each job’s average I/O bandwidth against its IOPS. The points fall along five lines, corresponding to jobs dominated by five request sizes: 1 KB, 16 KB, 64 KB, 128 KB, and 512 KB. Among these, 128 KB for reads and 512 KB for writes are the most common request sizes, which match the system configuration of Icefish: on Sunway compute nodes, applications’ small I/O requests are merged and larger I/O requests are split into multiple requests before being transferred to the forwarding nodes via the LWFS client. We therefore conclude that the average request size of most applications reaches the configured upper limit, implying that this limit could be raised appropriately to give applications better read and write performance. In addition, further statistical analysis reveals that 6.89% of jobs still issue I/O requests smaller than 1 KB; such small requests indicate inefficient I/O behavior, and jobs exhibiting it cannot make good use of the high-performance parallel file system.
Fig. 19.
Fig. 19. The average I/O throughput and IOPS of all jobs on Icefish since April 1, 2017. Each point represents one particular job.
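The line pattern in Figure 19 follows directly from the identity bandwidth ≈ IOPS × request size, so a job's dominant request size can be recovered from the two observed averages. A minimal sketch (our illustration) of this estimation:

```python
# Estimate a job's dominant request size from its average bandwidth and IOPS,
# snapping to the canonical sizes observed on Icefish.
CANONICAL_KB = [1, 16, 64, 128, 512]

def dominant_request_size_kb(bandwidth_mb_s: float, iops: float) -> int:
    avg_kb = bandwidth_mb_s * 1024.0 / iops
    return min(CANONICAL_KB, key=lambda c: abs(c - avg_kb))

# Example: a job averaging 100 MB/s at 800 IOPS -> 128 KB requests.
print(dominant_request_size_kb(100.0, 800.0))
```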
Solution. We take APT [84], an application for particle dynamics simulations, as an example. APT implements geometric algorithms for systematic large-scale particle dynamics simulations; it runs on TaihuLight with 1024 processes and writes its output in the HDF5 file format. A large number of small I/O requests is the main reason for its low performance. As a quick solution, we changed its output from the HDF5 format to a binary format and achieved a 20× I/O performance improvement.
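The essence of the fix is coalescing many small records into large sequential writes, so that the requests leaving the node approach the 512 KB write size favored by the I/O stack. The following generic sketch (not APT's actual code; the record layout and buffer size are our assumptions) illustrates the idea:

```python
# Stage small particle records in memory and emit them as large writes.
import struct

class BufferedBinaryWriter:
    def __init__(self, path, buffer_bytes=8 << 20):   # 8 MB staging buffer
        self.f = open(path, "wb")
        self.buffer_bytes = buffer_bytes
        self.chunks, self.size = [], 0

    def append_particle(self, pid, x, y, z):
        rec = struct.pack("<qddd", pid, x, y, z)       # one 32-byte record
        self.chunks.append(rec)
        self.size += len(rec)
        if self.size >= self.buffer_bytes:
            self.flush()

    def flush(self):
        if self.chunks:
            self.f.write(b"".join(self.chunks))        # one large sequential write
            self.chunks, self.size = [], 0

    def close(self):
        self.flush()
        self.f.close()

w = BufferedBinaryWriter("particles.bin")
for i in range(100_000):
    w.append_particle(i, 0.1 * i, 0.2 * i, 0.3 * i)
w.close()
```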

4.5 Extended Applications of Beacon

4.5.1 Extension to Network Monitoring.

Encouraged by Beacon’s success in I/O monitoring, we began in the summer of 2018 to design and test its extension to monitor and analyze network problems, motivated by the network performance debugging needs of ultra-large-scale applications. Figure 20 shows the architecture of this new module. Beacon samples performance counters, such as per-port sent and received volumes, on the 5984 Mellanox InfiniBand network switches. Again, the collected data are passed to low-overhead daemons on Beacon log servers, more specifically, 75 of its 85 part-time servers, each assigned 80 switches. Similar processing and compression are conducted, with the resulting data persisted in Beacon’s distributed database and then periodically relocated to its dedicated server for user queries and permanent storage.
Fig. 20.
Fig. 20. Overview of Beacon’s network monitoring module.
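At its core, each daemon repeatedly reads cumulative per-port counters and converts them into per-interval traffic volumes. The sketch below is a simplified illustration under the assumption of a read_counters(switch) callback that returns cumulative byte counters per port (e.g., obtained through an OFED-based query tool); the callback, emit() sink, and 60-second interval are not Beacon's exact design.

```python
# Poll switch port counters and emit per-interval deltas to the log pipeline.
import time

def poll_switches(switches, read_counters, emit, interval_s=60):
    last = {}                                        # (switch, port) -> previous sample
    while True:                                      # daemon loop
        ts = time.time()
        for sw in switches:                          # e.g., 80 switches per daemon
            for port, counters in read_counters(sw).items():
                prev = last.get((sw, port))
                if prev is not None:
                    delta = {k: counters[k] - prev[k] for k in counters}
                    emit({"ts": ts, "switch": sw, "port": port, **delta})
                last[(sw, port)] = counters
        time.sleep(interval_s)
```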
This Beacon network monitoring prototype was completed in time to help with the aforementioned Shentu [39] production runs for its final submission to Supercomputing ’18 as an ACM Gordon Bell Award finalist. Beacon was called upon to identify why the aggregate network bandwidth was significantly lower than the theoretical peak. Figure 21 illustrates this with a 3-supernode Shentu test run. The dark bars (FixedPart) form a histogram of the communication volumes measured on the 40 switches connecting these 256-node supernodes for inter-supernode communication, reporting the count of switches falling within five volume brackets. There is a clear bi-polar distribution, showing severe load imbalance and more inter-supernode communication than expected. This monitoring result led to the discovery that, owing to faulty compute nodes within each supernode, the fixed partitioning relay strategy adopted by Shentu left a subset of relay nodes receiving twice the “normal” load. Note that Shentu’s own application-level profiling found the communication volume across compute nodes to be well balanced; hence, the problem was not obvious to the application developers until Beacon provided such switch-level traffic data.
Fig. 21.
Fig. 21. Distribution of communication volume inter-supernode.
Solution. This finding prompted the Shentu designers to optimize their relay strategy, using a topology-aware stochastic assignment algorithm to uniformly partition source nodes among relay nodes [39]. The results are shown by the gray bars (FlexPart) in Figure 21. The peak per-switch communication volume is reduced by 27.0% (from 6.3 GB to 4.6 GB), with a significantly improved load balance, bringing a total communication performance enhancement of around 30%.
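The balancing idea can be sketched as follows (our simplified illustration, not Shentu's actual algorithm): instead of a fixed partition that ignores faulty nodes, spread source nodes evenly over the relay nodes that are currently healthy, so that no relay (and hence no uplink switch) carries twice the average load.

```python
# Evenly assign source compute nodes to healthy relay nodes (round-robin).
def assign_sources_to_relays(sources, relays, faulty):
    healthy = [r for r in relays if r not in faulty]
    return {src: healthy[i % len(healthy)] for i, src in enumerate(sources)}

sources = [f"cn{i}" for i in range(256)]
relays = [f"relay{i}" for i in range(8)]
assignment = assign_sources_to_relays(sources, relays, faulty={"relay3"})
loads = {r: sum(1 for v in assignment.values() if v == r) for r in relays}
print(loads)   # each healthy relay serves 36-37 sources; the faulty one serves none
```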

4.5.2 Extension to the Cutting-edge Supercomputer with I/O Forwarding Architecture.

The Sunway next-generation supercomputer inherits and develops the architecture of the Sunway TaihuLight and is built on a homegrown high-performance heterogeneous many-core processor, the SW26010P. It consists of more than 100,000 compute nodes, each equipped with a 390-core SW26010P CPU; compared with the 10 million cores of TaihuLight, the new machine has more than four times as many cores in total. Figure 22 shows an architecture overview. As in TaihuLight, the compute nodes are connected to the storage nodes through forwarding nodes, and the storage nodes run Lustre servers to provide users with a global file system. Unlike TaihuLight, the Sunway next-generation supercomputer provides an additional burst buffer file system on the forwarding nodes [89]; each forwarding node supplies back-end storage for the burst buffer file system via two high-performance NVMe SSDs.
Fig. 22.
Fig. 22. Sunway next-generation supercomputer architecture overview.
To extend Beacon to the Sunway next-generation supercomputer, we upgraded Beacon’s collection module in January 2021 to support data collection on the burst buffer file system; Beacon’s other components work on this supercomputer as expected, without modification. Figure 23 shows an example of Beacon’s use on the next-generation machine. From the figure, we find that the load on the NVMe SSDs is low most of the time. An important reason is that users tend to use the global file system more often than the burst buffer file system, an assertion we confirm by further statistical analysis.2 Although the burst buffer file system can provide high I/O performance, users have to modify their applications with a specific I/O API to use it, which is inconvenient and contributes to its low usage. In addition, we find that the load on the NVMe SSDs is imbalanced. One important reason is the way the NVMe SSDs are allocated: they are controlled through static configuration files, and each user can access only the NVMe SSDs listed in the configuration file given by an administrator. It is difficult for the administrator to balance each NVMe SSD’s load, however, because they lack real-time load information.
Fig. 23.
Fig. 23. Sample of the Sunway next-generation supercomputer 1-week load summary, showing the peak load level by the hour, across 60 randomly sampled NVMe SSDs.
Solution. With Beacon’s real-time monitoring, we can obtain the real-time load on each NVMe SSD, which is exactly the information needed for configuration-file adjustment. We are currently working with the administrators to develop a dynamic configuration system that makes full use of the NVMe SSDs.
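A minimal sketch of the dynamic allocation idea follows; the load metric, configuration format, and function names are illustrative assumptions rather than the production design: when a job requests burst buffer space, pick the least-loaded NVMe SSDs according to Beacon's real-time load feed instead of a static, hand-written mapping.

```python
# Choose the least-loaded NVMe SSDs for a new job from Beacon's load feed.
def allocate_ssds(job_id, ssd_load, count=2):
    """ssd_load: {ssd_id: recent utilization in [0, 1]} reported by Beacon."""
    chosen = sorted(ssd_load, key=ssd_load.get)[:count]
    return {"job": job_id, "ssds": chosen}        # would be written to the job's config file

load_feed = {"fwd03-nvme0": 0.82, "fwd03-nvme1": 0.10,
             "fwd07-nvme0": 0.05, "fwd07-nvme1": 0.64}
print(allocate_ssds("job-12345", load_feed))
# -> {'job': 'job-12345', 'ssds': ['fwd07-nvme0', 'fwd03-nvme1']}
```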

4.5.3 Extension to the Traditional Two-layer Supercomputer.

In addition to its deployment on a multi-layer cutting-edge supercomputer, some of Beacon’s components and methods can also be adopted on traditional two-layer supercomputers. We have deployed Beacon on the Sugon Pai supercomputer [72], a traditional two-layer machine, since March 2020. Sugon Pai is a homogeneous computing cluster containing 424 compute nodes and eight storage nodes, and it uses the ParaStor file system [73] to provide highly concurrent I/O. The architecture of Beacon’s monitoring and storage module is shown in Figure 24. Beacon performs I/O monitoring on the compute and storage nodes, which run the ParaStor client and server, respectively. Beacon divides the 424 compute nodes into four groups and enlists four “part-time” servers, one communicating with each group; data collected from the eight storage nodes are transferred to another “part-time” server. A MySQL database stores jobs’ run information on Sugon Pai. To reduce data transmission and storage overhead, Beacon also conducts online compression similar to that used on TaihuLight.
Fig. 24.
Fig. 24. Overview of Beacon’s monitoring module for the Sugon Pai supercomputer.
Table 6 shows the statistics of I/O modes adopted by jobs performing non-trivial I/O on the Sugon Pai supercomputer from March 2020 to April 2020. Some conclusions are similar: for example, the N:N and 1:1 I/O modes form the dominant choices for writes. There are also new findings on Sugon Pai: the N:1 I/O mode accounts for the largest share of read-operated jobs. Further analysis shows that the N:1 I/O mode offers relatively good performance on Sugon Pai. Figure 25 shows an example of a molecular simulation application using the N:1 I/O mode on Sugon Pai; as the figure shows, it obtains high performance when reading with the N:1 mode. A plausible reason is that Sugon Pai adopts ParaStor as its primary storage system, which supports the N:1 I/O mode better than LWFS and Lustre do on TaihuLight. This finding also shows that different platforms support I/O behaviors differently, implying that an application’s I/O behavior needs to be matched to the underlying platform in order to achieve better performance.
Fig. 25.
Fig. 25. I/O performance of an application with N:1 I/O mode on the Sugon Pai supercomputer.
Table 6.
Job I/O mode   #Read-operated jobs   #Write-operated jobs
1:1            289                   689
N:1            521                   103
N:N            152                   438
N:M            270                   2
Table 6. Jobs Classified by I/O Mode

5 Beacon Framework Evaluation

We now evaluate Beacon’s per-application profiling accuracy and its performance overhead.

5.1 Accuracy Verification

Beacon collects full traces from the compute node side, thus giving it access to complete application-level I/O operation information. However, because the LWFS client trace interface provides only coarse timestamp data (at per-second granularity), and owing to the clock drift across compute nodes, it is possible that the I/O patterns recovered from Beacon logs deviate from the application-level captured records.
To evaluate the degree of such errors, we compare the I/O throughput statistics reported by the MPI-IO Test [26] with those reported by Beacon. In the experiments, we use the MPI-IO Test to exercise different parallel I/O modes, including N:N and N:1 independent operations, plus MPI-IO library collective calls, and repeat each experiment 10 times at each execution scale.
Figure 26 shows the accuracy evaluation results. We plot Beacon’s average error, measured as the percentage deviation of the aggregate compute node-side I/O throughput recorded by Beacon from the application-level throughput reported by the MPI-IO library.
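Concretely, using our notation for the two measured quantities, the reported error is
\[
\mathrm{Error} = \frac{\left|B_{\mathrm{Beacon}} - B_{\mathrm{app}}\right|}{B_{\mathrm{app}}} \times 100\%,
\]
where \(B_{\mathrm{Beacon}}\) is the aggregate compute node-side throughput recorded by Beacon and \(B_{\mathrm{app}}\) is the application-level throughput reported by the MPI-IO Test.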
Fig. 26.
Fig. 26. Average error rate of Beacon reported bandwidth (error bars show 95% confidence intervals).
We find that Beacon accurately captures application performance, even for applications with non-trivial parallel I/O activities. More precisely, Beacon’s recorded throughput deviates from the MPI-IO Test-reported values by only 0.78–3.39% (1.84% on average) for the read tests and 0.81–3.31% (2.03% on average) for the write tests. The results are similar for high-IOPS applications and are omitted here owing to space limitations.
Beacon’s accuracy can be attributed to the fact that it records all compute node-side trace logs, facilitated by its efficient and lossless compression method (described in Section 3.2). We find that even though individual trace items may be off in timestamps, data-intensive applications on supercomputers seldom perform isolated, fast I/O operations (which are not of interest for profiling purposes). Instead, they exhibit I/O phases with a sustained high I/O intensity. By collecting multi-layer I/O trace entries for each application, Beacon is able to paint an accurate picture of an application’s I/O behavior and performance.

5.2 Monitoring and Query Overhead

We now evaluate Beacon’s monitoring overhead in a production environment. We compare the performance of important I/O-intensive real-world applications and the MPI-IO Test benchmark discussed earlier, with and without Beacon turned on (\(T_w\) and \(T_{w/o}\), respectively). We report the overall run time of each program and calculate the slowdown introduced by turning Beacon on. Table 7 shows the results, listing the average slowdown measured over at least five runs for each program (the variance of the slowdown across runs is low: under 2%). Note that for the largest applications, such testing is piggybacked on actual production runs of stable codes, with Beacon turned on during certain allocation time frames; applications like AWP often break their executions into segments running a certain number of simulation time steps at a time.
Table 7.
Application    #Process   \(T_{w/o}\) (s)   \(T_{w}\) (s)   %Slowdown
MPI-IO\(_N\)   64         26.6              26.8            0.79%
MPI-IO\(_N\)   128        31.5              31.6            0.25%
MPI-IO\(_N\)   256        41.6              41.9            0.72%
MPI-IO\(_N\)   512        57.9              58.4            0.86%
MPI-IO\(_N\)   1024       123.1             123.5           0.36%
WRF\(_1\)      1024       2813.3            2819.1          0.21%
DNDC           2048       1041.2            1045.5          0.53%
XCFD           4000       2642.1            2644.6          0.09%
GKUA           16384      297.5             299.9           0.82%
GKUA           32768      182.8             184.1           0.66%
AWP            130000     3233.5            3241.5          0.25%
Shentu         160000     5468.2            5476.3          0.15%
Table 7. Avg. Beacon Monitoring Overhead on Applications
These results show that the Beacon tool introduces very low overhead, under \(1\%\) across all test cases. Also, the overhead does not grow with the application execution scale; it actually appears smaller (below 0.25%) for the two largest jobs, which use 130 K processes or more. Such a cost is particularly negligible considering the significant I/O performance enhancement and run-time savings produced by optimizations or problem diagnosis from Beacon-supplied information.
Table 8 lists the CPU and memory usage of Beacon’s data collection daemon. In addition, the storage overhead from Beacon’s deployment on TaihuLight since April 2017 is around 10 TB. Such low operational overhead and scalable operation attest to Beacon’s lightweight design, with background trace-collection and compression generating negligible additional resource consumption. Also, having a separate monitoring network and storage avoids potential disturbance to the application execution.
Table 8.
Level             CPU usage   Memory usage (MB)
Compute node      0.0%        10
Forwarding node   0.1%        6
Storage node      0.1%        5
Table 8. System Overhead of Beacon
Finally, we assess Beacon’s query processing performance. We measure the query processing time of 2,000 Beacon queries in September 2018, including both application users accessing job performance analysis and system administrators checking forwarding/storage nodes performance. In particular, we examine the impact of Beacon’s in-memory cache system between the web interface and Elasticsearch, as shown in Figure 2. Figure 27 gives the CDF of queries in processing time and demonstrates that (1) the majority of Beacon user queries can be processed within 1 second, and 95.6% of them can be processed under 10 seconds (visualization queries take longer), and (2) Beacon’s in-memory caching significantly improves the user experience. Additional checking reveals that about 95% of these queries can be served from cached data.
Fig. 27.
Fig. 27. CDF of Beacon query processing time.
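The cache sits in front of Elasticsearch in a classic cache-aside arrangement, which the following minimal sketch illustrates (our own example; the key format, TTL, and query_elasticsearch() helper are assumptions, not Beacon's exact implementation):

```python
# Cache-aside query path: serve repeated queries from memory, fall back to
# Elasticsearch on a miss, then populate the cache.
import time

CACHE = {}                  # key -> (expire_time, result)
TTL_S = 300                 # recently computed summaries stay hot for 5 minutes

def cached_query(key, query_elasticsearch):
    hit = CACHE.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                          # served from memory (the common case)
    result = query_elasticsearch(key)          # slow path: query Elasticsearch
    CACHE[key] = (time.time() + TTL_S, result)
    return result

def fake_es_query(key):                        # stand-in for the real back end
    time.sleep(0.1)
    return {"key": key, "avg_bw_MBps": 812.4}

print(cached_query("job:123:io_summary", fake_es_query))   # miss: ~100 ms
print(cached_query("job:123:io_summary", fake_es_query))   # hit: returns instantly
```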

6 Related Work

Several I/O tracing and profiling tools have been proposed for HPC systems, which can be divided into two categories: application-oriented tools and back-end-oriented tools.
Application-oriented tools can provide detailed information about a particular execution on a function-by-function basis. Work in this area includes Darshan [9], IPM [79], and RIOT [86], all of which aim to build an accurate picture of application I/O behavior by capturing key characteristics of the mainstream I/O stack on compute nodes. Carns et al. evaluated the performance and runtime overheads of Darshan [8], and Patel et al. used Darshan to characterize and analyze accesses to I/O-intensive files [60]. Wu et al. proposed a scalable methodology for MPI and I/O event tracing [48, 87, 88]. Recorder [46] focuses on collecting additional HDF5 trace data. Tools like Darshan provide user-transparent monitoring via automatic environment configuration; still, instrumentation-based tools impose restrictions on programming languages or libraries/linkers. In contrast, Beacon is designed as a non-stop, full-system I/O monitoring system that captures I/O activities at the system level.
Back-end-oriented tools collect system-level I/O performance data across applications and provide summary statistics (e.g., LIOProf [91], LustreDU [7, 38, 56], and LMT [24]). Rajesh et al. [64] tried to provide applications and middleware with real-time system resource status, while Patel et al. [59] focused on showing system-level characteristics with LMT. Paul et al. [61] also analyzed such statistics in an application-agnostic manner, using data collected from Lustre server statistics.
However, identifying application performance issues and finding the cause of application performance degradation are difficult with these tools. While back-end analytical methods [40, 41] have made progress in identifying high-throughput applications using back-end logs only, they lack application-side information. Beacon, in contrast, holds complete cross-layer monitoring data to enable such tasks.
Along this line, there are tools for collecting multi-layer data. Static instrumentation has been used to trace parallel I/O calls from MPI down to the PVFS servers [35]. SIOX [85] and IOPin [34] characterize HPC I/O workloads across the I/O stack. These projects extend to other system layers the application-level I/O instrumentation approach that Darshan [9] uses. However, their overhead hinders deployment in large-scale production environments [70].
Regarding end-to-end frameworks, the TOKIO [3] architecture combines front-end tools (Darshan, Recorder) and back-end tools (LMT). The UMAMI monitoring interface [43] provides cross-layer I/O performance analysis and visualization. In addition, OVIS [5] uses the Cray specific tool LDMS [1] to provide scalable failure and anomaly detection. GUIDE [80] performs center-wide and multi-source log collection and motivated further analysis and optimizations. Beacon differs through its aggressive real-time performance and utilization monitoring, automatic anomaly detection, and continuous per-application I/O pattern profiling.
I/O interference has been identified as an important cause of performance variability in HPC systems [41, 57]. Zheng et al. [96] uncovered the interference in an in situ analytics system. Kuo et al. [37] focused on interference from different file access patterns with synchronized time-slice profiles. Yildiz et al. [92] studied the root causes of cross-application I/O interference across software and hardware configurations. To the best of our knowledge, Beacon is the first monitoring framework with built-in features for inter-application interference analysis. Our study confirms prior findings on large-scale HPC applications’ adoption of poor I/O design choices [47]. This further suggests that, aside from workload-dependent, I/O-aware scheduling [14, 41], interference should be countered with application I/O mode optimization and adaptive I/O resource allocation.
Finally, on network monitoring, there are dedicated tools [42, 50, 68] for monitoring switch performance, anomaly detection, and resource utilization optimization. There are also tools specializing in network monitoring/debugging for data centers [75, 76, 94]. However, these tools/systems typically do not target the InfiniBand interconnections commonly used on supercomputers. To this end, Beacon adopts the open-source OFED stack [11, 55] to retrieve relevant information from the IB network. More importantly, it leverages its scalable and efficient monitoring infrastructure, originally designed for I/O, for network problems.

7 Conclusion

We have presented Beacon, an end-to-end I/O resource monitoring and diagnosis system for the leading supercomputer TaihuLight. It facilitates comprehensive I/O behavior analysis along the long I/O path and has identified hidden performance issues, problematic user I/O behaviors, and system anomalies. Enhancements enabled by Beacon over the past 38 months have significantly improved ultra-large-scale applications’ I/O performance and the overall TaihuLight I/O resource utilization. More generally, our results and experience indicate that this type of detailed multi-layer I/O monitoring/profiling is affordable on state-of-the-art supercomputers, offering valuable insights while incurring low cost. In addition, we have explored the public release of Beacon-collected supercomputer I/O profiling data to the HPC and storage communities.
Our future work will focus on cross-layer application I/O portraits, automated I/O scheduling, resource allocation, and optimization via real-time interaction with Beacon.

Footnotes

2. More than 90% of jobs run on the global file system.

References

[1]
Anthony Agelastos, Benjamin Allan, Jim Brandt, Paul Cassella, Jeremy Enos, Joshi Fullop, Ann Gentile, Steve Monk, Nichamon Naksinehaboon, Jeff Ogden, Mahesh Rajan, Michael Showerman, Joel Stevenson, Narate Taerat, and Tom Tucker. 2014. The lightweight distributed metric service: A scalable infrastructure for continuous monitoring of large scale computing systems and applications. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, 154–165.
[2]
John Bent, Garth Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James Nunez, Milo Polte, and Meghan Wingate. 2009. PLFS: A checkpoint filesystem for parallel applications. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Portland, 1–12.
[3]
Lawrence Berkeley and ANL. 2017. TOKIO: Total knowledge of I/O. http://www.nersc.gov/research-and-development/tokio.
[4]
Peter J. Braam and Rumi Zahir. 2002. Lustre: A scalable, high performance file system. Cluster File Systems, Inc 8, 11 (2002), 3429–3441.
[5]
Jim Brandt, Ann Gentile, Jackson Mayo, Philippe Pebay, Diana Roe, David Thompson, and Matthew Wong. 2009. Resource monitoring and management with OVIS to enable HPC in cloud computing environments. In International Symposium on Parallel and Distributed Processing. IEEE, Rome, 1–8.
[6]
Tom Budnik, Brant Knudson, Mark Megerian, Sam Miller, Mike Mundy, and Will Stockdell. 2010. Blue gene/q resource management architecture. In Workshop on Many-Task Computing on Grids and Supercomputers. IEEE, New Orleans, 1–5.
[7]
Adam G. Carlyle, Ross G. Miller, Dustin B. Leverman, William A. Renaud, and Don E. Maxwell. 2012. Practical support solutions for a workflow-oriented Cray environment. In Cray User Group Conference. Cray, Stuttgart, 1–7.
[8]
P. Carns, K. Harms, R. Latham, and R. Ross. 2012. Performance Analysis of Darshan 2.2. 3 on the Cray XE6 Platform.Technical Report. Argonne National Lab.(ANL), Argonne, IL (United States).
[9]
Philip Carns, Robert Latham, Robert Ross, Kamil Iskra, Samuel Lang, and Katherine Riley. 2009. 24/7 characterization of Petascale I/O workloads. In International Conference on Cluster Computing and Workshops. IEEE, New Orleans, 1–10.
[11]
N. Dandapanthula, Hari Subramoni, Jérôme Vienne, K. Kandalla, Sayantan Sur, Dhabaleswar K. Panda, and Ron Brightwell. 2011. INAM-a scalable infiniband network analysis and monitoring tool. In European Conference on Parallel Processing. Springer, Bordeaux, 166–177.
[12]
Sheng Di, Rinku Gupta, Marc Snir, Eric Pershey, and Franck Cappello. 2017. LogAider: A tool for mining potential correlations of HPC log events. In International Symposium on Cluster, Cloud and Grid Computing. IEEE, Madrid, 442–451.
[13]
Giacinto Donvito, Giovanni Marzulli, and Domenico Diacono. 2014. Testing of several distributed file-systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiments analysis. In Journal of Physics: Conference Series. IOP Publishing, Yokohama, 042014.
[14]
Matthieu Dorier, Gabriel Antoniu, Robert Ross, Dries Kimpe, and Shadi Ibrahim. 2014. CALCioM: Mitigating I/O interference in HPC systems through cross-application coordination. In International Parallel and Distributed Processing Symposium. IEEE, Phoenix, 155–164.
[15]
Xiaohui Duan, Dexun Chen, Xiangxu Meng, Guangwen Yang, Ping Gao, Tingjian Zhang, Meng Zhang, Weiguo Liu, Wusheng Zhang, and Wei Xue. 2018. Redesigning LAMMPS for Petascale and hundred-billion-atom simulation on Sunway TaihuLight. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Dallas, 148–159.
[16]
Paul DuBois. 2013. MySQL. Addison-Wesley Professional, Boston.
[17]
Paul R. Eggert and Douglas Stott Parker Jr. 1993. File systems in user space. In USENIX Winter. 229–240.
[18]
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD. AAAI, Portland, 226–231.
[19]
Jeff Forcier, Paul Bissex, and Wesley J. Chun. 2008. Python Web Development with Django. Addison-Wesley Professional.
[20]
Haohuan Fu, Conghui He, Bingwei Chen, Zekun Yin, Zhenguo Zhang, Wenqiang Zhang, Tingjian Zhang, Wei Xue, Weiguo Liu, Wanwang Yin, et al. 2017. 9-pflops nonlinear earthquake simulation on Sunway TaihuLight: Enabling depiction of 18-Hz and 8-meter scenarios. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.
[21]
Haohuan Fu, Junfeng Liao, Nan Ding, Xiaohui Duan, Lin Gan, Yishuang Liang, Xinliang Wang, Jinzhe Yang, Yan Zheng, Weiguo Liu, et al. 2017. Redesigning CAM-SE for Peta-scale climate modeling performance and ultra-high resolution on Sunway TaihuLight. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.
[22]
Haohuan Fu, Junfeng Liao, Jinzhe Yang, Lanning Wang, Zhenya Song, Xiaomeng Huang, Chao Yang, Wei Xue, Fangfang Liu, Fangli Qiao, et al. 2016. The Sunway TaihuLight supercomputer: System and applications. Science China Information Sciences 59 (2016), 1–16.
[23]
Ana Gainaru, Guillaume Aupy, Anne Benoit, Franck Cappello, Yves Robert, and Marc Snir. 2015. Scheduling the I/O of HPC applications under congestion. In International Parallel and Distributed Processing Symposium. IEEE, Hyderabad, 1013–1022.
[24]
Jim Garlick. 2010. Lustre monitoring tool. https://github.com/LLNL/lmt.
[25]
Donna L. Giltrap, Changsheng Li, and Surinder Saggar. 2010. DNDC: A process-based model of greenhouse gas fluxes from agricultural soils. Agriculture, Ecosystems & Environment 136 (2010), 292–300.
[26]
Gary Grider, James Nunez, and John Bent. 2008. LANL MPI-IO test. http://freshmeat.sourceforge.net/projects/mpiiotest.
[27]
Raghul Gunasekaran, Sarp Oral, Jason Hill, Ross Miller, Feiyi Wang, and Dustin Leverman. 2015. Comparative I/O workload characterization of two leadership class storage clusters. In Proceedings of the Parallel Data Storage Workshop. IEEE, Austin, 31–36.
[28]
Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, et al. 2018. Fail-slow at scale: Evidence of hardware performance faults in large production systems. ACM Transactions on Storage (TOS) 14 (2018), 1–26.
[29]
Jack Dongarra and Hans Meuer. 2020. Top 500 list. https://www.top500.org/resources/top-systems/.
[31]
Xu Ji, Bin Yang, Tianyu Zhang, Xiaosong Ma, Xiupeng Zhu, Xiyang Wang, Nosayba El-Sayed, Jidong Zhai, Weiguo Liu, and Wei Xue. 2019. Automatic, application-aware I/O forwarding resource allocation. In Conference on File and Storage Technologies. USENIX, Boston, 265–279.
[32]
Ana Jokanovic, Jose Carlos Sancho, German Rodriguez, Alejandro Lucero, Cyriel Minkenberg, and Jesus Labarta. 2015. Quiet neighborhoods: Key to protect job performance predictability. In International Parallel and Distributed Processing Symposium. IEEE, Hyderabad, 449–459.
[33]
James A. Kahle, Jaime Moreno, and Dan Dreps. 2019. 2.1 Summit and Sierra: Designing AI/HPC supercomputers. In International Solid-State Circuits Conference. IEEE, San Francisco, 42–43.
[34]
Seong Jo Kim, Seung Woo Son, Wei-keng Liao, Mahmut Kandemir, Rajeev Thakur, and Alok Choudhary. 2012. IOPin: Runtime profiling of parallel I/O in HPC systems. In Companion: High Performance Computing, Networking Storage and Analysis. IEEE, Salt Lake City, 18–23.
[35]
Seong Jo Kim, Yuanrui Zhang, Seung Woo Son, Ramya Prabhakar, Mahmut Kandemir, Christina Patrick, Wei-keng Liao, and Alok Choudhary. 2010. Automated tracing of I/O stack. In European MPI Users’ Group Meeting. Springer, Stuttgart, 72–81.
[36]
Rafal Kuc and Marek Rogozinski. 2013. Elasticsearch Server. Packt Publishing Ltd, Birmingham.
[37]
Chih-Song Kuo, Aamer Shah, Akihiro Nomura, Satoshi Matsuoka, and Felix Wolf. 2014. How file access patterns influence interference among cluster applications. In International Conference on Cluster Computing. IEEE, Madrid, 185–193.
[38]
Seung-Hwan Lim, Hyogi Sim, Raghul Gunasekaran, and Sudharshan S. Vazhkudai. 2017. Scientific user behavior and data-sharing trends in a Petascale file system. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.
[39]
Heng Lin, Xiaowei Zhu, Bowen Yu, Xiongchao Tang, Wei Xue, Wenguang Chen, Lufei Zhang, Torsten Hoefler, Xiaosong Ma, Xin Liu, Weimin Zheng, and Jingfang Xu. 2018. ShenTu: Processing multi-trillion edge graphs on millions of cores in seconds. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Dallas, 706–716.
[40]
Yang Liu, Raghul Gunasekaran, Xiaosong Ma, and Sudharshan S. Vazhkudai. 2014. Automatic identification of application I/O signatures from noisy server-side traces. In Conference on File and Storage Technologies. USENIX, Oakland, 213–228.
[41]
Yang Liu, Raghul Gunasekaran, Xiaosong Ma, and Sudharshan S. Vazhkudai. 2016. Server-side log data analytics for I/O workload characterization and coordination on large shared storage systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Salt Lake City, 819–829.
[42]
Zaoxing Liu, Antonis Manousis, Gregory Vorsanger, Vyas Sekar, and Vladimir Braverman. 2016. One sketch to rule them all: Rethinking network flow monitoring with UnivMon. In Proceedings of the 2016 ACM SIGCOMM Conference. ACM, Los Angeles, 101–114.
[43]
Glenn K. Lockwood, Wucherl Yoo, Suren Byna, Nicholas J. Wright, Shane Snyder, Kevin Harms, Zachary Nault, and Philip Carns. 2017. UMAMI: A recipe for generating meaningful metrics through holistic I/O performance analysis. In Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems. ACM, Denver, 55–60.
[44]
Jay Lofstead, Ivo Jimenez, Carlos Maltzahn, Quincey Koziol, John Bent, and Eric Barton. 2016. DAOS and friends: A proposal for an Exascale storage system. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Salt Lake City, 585–596.
[45]
Jay Lofstead, Fang Zheng, Qing Liu, Scott Klasky, Ron Oldfield, Todd Kordenbrock, Karsten Schwan, and Matthew Wolf. 2010. Managing variability in the IO performance of Petascale storage systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, 1–12.
[46]
Huong Luu, Babak Behzad, Ruth Aydt, and Marianne Winslett. 2013. A multi-level approach for understanding I/O activity in HPC applications. In International Conference on Cluster Computing. IEEE, Indianapolis, 1–5.
[47]
Huong Luu, Marianne Winslett, William Gropp, Robert Ross, Philip Carns, Kevin Harms, Mr Prabhat, Suren Byna, and Yushu Yao. 2015. A multiplatform study of I/O behavior on Petascale supercomputers. In International Symposium on High-Performance Parallel and Distributed Computing. ACM, Portland, 33–44.
[48]
Frank Mueller, Xing Wu, Martin Schulz, Bronis R. De Supinski, and Todd Gamblin. 2010. ScalaTrace: Tracing, analysis and modeling of HPC codes at scale. In International Workshop on Applied Parallel Computing. Springer, Reykjavík, 410–418.
[49]
Mohammed Islam Naas, François Trahay, Alexis Colin, Pierre Olivier, Stéphane Rubini, Frank Singhoff, and Jalil Boukhobza. 2021. EZIOTracer: Unifying kernel and user space I/O tracing for data-intensive applications. In Proceedings of the Workshop on Challenges and Opportunities of Efficient and Performant Storage Systems. ACM, Edinburgh, 1–11.
[50]
Vikram Nathan, Srinivas Narayana, Anirudh Sivaraman, Prateesh Goyal, Venkat Arun, Mohammad Alizadeh, Vimalkumar Jeyakumar, and Changhoon Kim. 2017. Demonstration of the Marple system for network performance monitoring. In Proceedings of the SIGCOMM Posters and Demos. ACM, Los Angeles, 57–59.
[52]
Sarah Neuwirth, Feiyi Wang, Sarp Oral, Sudharshan Vazhkudai, James Rogers, and Ulrich Bruening. 2016. Using balanced data placement to address I/O contention in production environments. In International Symposium on Computer Architecture and High Performance Computing. IEEE, Los Angeles, 9–17.
[53]
L. H. Newman. 2014. Piz Daint Supercomputer Shows the Way Ahead on Efficiency.
[54]
Michael Noeth, Prasun Ratn, Frank Mueller, Martin Schulz, and Bronis R. de Supinski. 2009. ScalaTrace: Scalable compression and replay of communication traces for high-performance computing. J. Parallel and Distrib. Comput. 69 (2009), 696–710.
[55]
Alliance OpenFabrics. 2010. OpenFabrics enterprise distribution (OFED). http://www.openfabrics.org/.
[56]
Sarp Oral, James Simmons, Jason Hill, Dustin Leverman, Feiyi Wang, Matt Ezell, Ross Miller, Douglas Fuller, Raghul Gunasekaran, Youngjae Kim, Saurabh Gupta, Devesh Tiwari Sudharshan S. Vazhkudai, James H. Rogers, David Dillow, Galen M. Shipman, and Arthur S. Bland. 2014. Best practices and lessons learned from deploying and operating large-scale data-centric parallel file systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, 217–228.
[57]
Jiannan Ouyang, Brian Kocoloski, John R. Lange, and Kevin Pedretti. 2015. Achieving performance isolation with lightweight co-kernels. In International Symposium on High-Performance Parallel and Distributed Computing. ACM, Portland, 149–160.
[58]
Michael Papka, Susan Coghlan, Eric Isaacs, Mark Peters, and Paul Messina. 2013. Mira: Argonne’s 10-Petaflops Supercomputer. Technical Report. ANL (Argonne National Laboratory (ANL), Argonne, IL (United States)).
[59]
Tirthak Patel, Suren Byna, Glenn K. Lockwood, and Devesh Tiwari. 2019. Revisiting I/O behavior in large-scale storage systems: The expected and the unexpected. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–13.
[60]
Tirthak Patel, Suren Byna, Glenn K. Lockwood, Nicholas J. Wright, Philip Carns, Robert Ross, and Devesh Tiwari. 2020. Uncovering access, reuse, and sharing characteristics of I/O-intensive files on large-scale production HPC systems. In Conference on File and Storage Technologies. USENIX, Santa Clara, 91–101.
[61]
Arnab K. Paul, Olaf Faaland, Adam Moody, Elsa Gonsiorowski, Kathryn Mohror, and Ali R. Butt. 2020. Understanding HPC application I/O behavior using system level statistics. In International Conference on High Performance Computing, Data, and Analytics. IEEE, Pune, 202–211.
[62]
Arnab K. Paul, Arpit Goyal, Feiyi Wang, Sarp Oral, Ali R. Butt, Michael J. Brim, and Sangeetha B. Srinivasa. 2017. I/O load balancing for big data HPC applications. In International Conference on Big Data. IEEE, Boston, 233–242.
[63]
Zhenbo Qiao, Qing Liu, Norbert Podhorszki, Scott Klasky, and Jieyang Chen. 2020. Taming I/O variation on QoS-less HPC storage: What can applications do?. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Pune, 1–13.
[64]
Neeraj Rajesh, Hariharan Devarajan, Jaime Cernuda Garcia, Keith Bateman, Luke Logan, Jie Ye, Anthony Kougkas, and Xian-He Sun. 2021. Apollo: An ML-assisted real-time storage resource observer. In International Symposium on High-Performance Parallel and Distributed Computing. ACM, Stockholm, 147–159.
[66]
Frank Schmuck and Roger Haskin. 2002. GPFS: A shared-disk file system for large computing clusters. In Conference on File and Storage Technologies. USENIX, Monterey, 1–15.
[67]
Nicole Sergent, Xavier Défago, and André Schiper. 2001. Impact of a failure detection mechanism on the performance of consensus. In Pacific Rim International Symposium on Dependable Computing. IEEE, Seoul, 137–145.
[68]
Shan-Hsiang Shen and Aditya Akella. 2012. DECOR: A distributed coordinated resource monitoring system. In International Workshop on Quality of Service. IEEE, Coimbra, 1–9.
[69]
William C. Skamarock, Joseph B. Klemp, Jimy Dudhia, David O. Gill, Dale M. Barker, Wei Wang, and Jordan G. Powers. 2005. A Description of the Advanced Research WRF Version 2. Technical Report. National Center For Atmospheric Research Boulder Co Mesoscale and Microscale.
[70]
Shane Snyder, Philip Carns, Kevin Harms, Robert Ross, Glenn K. Lockwood, and Nicholas J. Wright. 2016. Modular HPC I/O characterization with Darshan. In Workshop on Extreme-scale Programming Tools. IEEE, Salt Lake City, 9–17.
[71]
Huaiming Song, Yanlong Yin, Xian-He Sun, Rajeev Thakur, and Samuel Lang. 2011. Server-side I/O coordination for parallel file systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Seattle, 1–11.
[72]
[73]
Sugon. 2015. ParaStor200 Distributed Parallel Storage System. http://hpc.sugon.com/en/HPC-Components/parastor.html.
[74]
Ann T. Tai, Kam S. Tso, and William H. Sanders. 2004. Cluster-based failure detection service for large-scale ad hoc wireless network applications. In International Conference on Dependable Systems and Networks. IEEE, Florence, 805–814.
[75]
Praveen Tammana, Rachit Agarwal, and Myungjin Lee. 2016. Simplifying datacenter network debugging with pathdump. In Symposium on Operating Systems Design and Implementation. USENIX, Savannah, 233–248.
[76]
Praveen Tammana, Rachit Agarwal, and Myungjin Lee. 2018. Distributed network monitoring and debugging with SwitchPointer. In Symposium on Networked Systems Design and Implementation. USENIX, Renton, 453–456.
[77]
Vasily Tarasov, Santhosh Kumar, Jack Ma, Dean Hildebrand, Anna Povzner, Geoff Kuenning, and Erez Zadok. 2012. Extracting flexible, replayable models from large block traces. In Conference on File and Storage Technologies. USENIX, San Jose, 22.
[78]
James Turnbull. 2013. The Logstash Book. James Turnbull.
[79]
Andrew Uselton, Mark Howison, Nicholas J. Wright, David Skinner, Noel Keen, John Shalf, Karen L. Karavanic, and Leonid Oliker. 2010. Parallel I/O performance: From events to ensembles. In International Symposium on Parallel and Distributed Processing. IEEE, Atlanta, 1–11.
[80]
Sudharshan S. Vazhkudai, Ross Miller, Devesh Tiwari, Christopher Zimmer, Feiyi Wang, Sarp Oral, Raghul Gunasekaran, and Deryl Steinert. 2017. GUIDE: A scalable information directory service to collect, federate, and analyze logs for operational insights into a leadership HPC facility. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.
[81]
Karthik Vijayakumar, Frank Mueller, Xiaosong Ma, and Philip C. Roth. 2009. Scalable I/O tracing and analysis. In Annual Workshop on Petascale Data Storage. IEEE, Portland, 26–31.
[82]
Venkatram Vishwanath, Mark Hereld, Kamil Iskra, Dries Kimpe, Vitali Morozov, Michael E. Papka, Robert Ross, and Kazutomo Yoshii. 2010. Accelerating I/O forwarding in IBM Blue Gene/p systems. In International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, 1–10.
[83]
P. Vranas. 2012. BlueGene/Q Sequoia and Mira. Technical Report. Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States).
[84]
Yulei Wang, Jian Liu, Hong Qin, Zhi Yu, and Yicun Yao. 2017. The accurate particle tracer code. Computer Physics Communications 220 (2017), 212–229.
[85]
Marc C. Wiedemann, Julian M. Kunkel, Michaela Zimmer, Thomas Ludwig, Michael Resch, Thomas Bönisch, Xuan Wang, Andriy Chut, Alvaro Aguilera, Wolfgang E. Nagel, et al. 2013. Towards I/O analysis of HPC systems and a generic architecture to collect access patterns. Computer Science-Research and Development 28 (2013), 241–251.
[86]
Steven A. Wright, Simon D. Hammond, Simon J. Pennycook, Robert F. Bird, J. A. Herdman, Ian Miller, Ash Vadgama, Abhir Bhalerao, and Stephen A. Jarvis. 2013. Parallel file system analysis through application I/O tracing. Comput. J. 56, 2 (2013), 141–155.
[87]
Xing Wu and Frank Mueller. 2013. Elastic and scalable tracing and accurate replay of non-deterministic events. In International Conference on Supercomputing. ACM, Eugene, 59–68.
[88]
Xing Wu, Karthik Vijayakumar, Frank Mueller, Xiaosong Ma, and Philip C. Roth. 2011. Probabilistic communication and I/O tracing with deterministic replay at scale. In International Conference on Parallel Processing. IEEE, Taipei, 196–205.
[89]
Jianyuan Xiao, Junshi Chen, Jiangshan Zheng, Hong An, Shenghong Huang, Chao Yang, Fang Li, Ziyu Zhang, Yeqi Huang, Wenting Han, Xin Liu, Dexun Chen, Zixi Liu, Ge Zhuang, Jiale Chen, Guoqiang Li, Xuan Sun, and Qiang Chen. 2021. Symplectic structure-preserving particle-in-cell whole-volume simulation of tokamak plasmas to 111.3 trillion particles and 25.7 billion grids. In International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, New York, 1–13.
[90]
Bing Xie, Jeffrey Chase, David Dillow, Oleg Drokin, Scott Klasky, Sarp Oral, and Norbert Podhorszki. 2012. Characterizing output bottlenecks in a supercomputer. In International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, Salt Lake City, 1–11.
[91]
Cong Xu, Suren Byna, Vishwanath Venkatesan, Robert Sisneros, Omkar Kulkarni, Mohamad Chaarawi, and Kalyana Chadalavada. 2016. LIOProf: Exposing lustre file system behavior for I/O middleware. In Cray User Group Meeting. Cray, London, 1–9.
[92]
Orcun Yildiz, Matthieu Dorier, Shadi Ibrahim, Rob Ross, and Gabriel Antoniu. 2016. On the root causes of cross-application I/O interference in HPC storage systems. In International Parallel and Distributed Processing Symposium. IEEE, Chicago, 750–759.
[93]
E. You. 2020. Vuejs framework. https://vuejs.org.
[94]
Minlan Yu, Albert G. Greenberg, David A. Maltz, Jennifer Rexford, Lihua Yuan, Srikanth Kandula, and Changhoon Kim. 2011. Profiling network performance for multi-tier data center applications. In Symposium on Networked Systems Design and Implementation. USENIX, Boston, 5–5.
[95]
Weikuan Yu, J. S Vetter, and H. S Oral. 2008. Performance characterization and optimization of parallel I/O on the cray XT. In International Symposium on Parallel and Distributed Processing. IEEE, Sydney, 1–11.
[96]
Fang Zheng, Hongfeng Yu, Can Hantas, Matthew Wolf, Greg Eisenhauer, Karsten Schwan, Hasan Abbasi, and Scott Klasky. 2013. GoldRush: Resource efficient in situ scientific data analytics using fine-grained interference aware execution. In International Conference on High Performance Computing, Networking, Storage and Analysis. IEEE, Denver, 1–12.
