Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems

Authors: 

Ruiming Lu, Shanghai Jiao Tong University; Erci Xu, Alibaba Inc. and Shanghai Jiao Tong University; Yiming Zhang, Xiamen University; Fengyi Zhu, Zhaosheng Zhu, Mengtian Wang, and Zongpeng Zhu, Alibaba Inc.; Guangtao Xue, Shanghai Jiao Tong University; Jiwu Shu, Xiamen University; Minglu Li, Shanghai Jiao Tong University and Zhejiang Normal University; Jiesheng Wu, Alibaba Inc.

Awarded Best Paper!

Abstract: 

The newly emerging "fail-slow" failures plague both software and hardware: the victim components are still functioning, yet with degraded performance. To address this problem, this paper presents Perseus, a practical fail-slow detection framework for storage devices. Perseus leverages a lightweight regression-based model to quickly pinpoint and analyze fail-slow failures at the granularity of drives. During 10 months of close monitoring of 248K drives, Perseus found 304 fail-slow cases. Isolating them can reduce the (node-level) 99.99th-percentile tail latency by 48%. We assemble a large-scale fail-slow dataset (including 41K normal drives and 315 verified fail-slow drives) from our production traces, based on which we provide root cause analysis of fail-slow drives, covering a variety of ill-implemented scheduling, hardware defects, and environmental factors. We have released the dataset to the public for fail-slow study.

Category: 
Deployed-Systems Paper


BibTeX
@inproceedings {285740,
author = {Ruiming Lu and Erci Xu and Yiming Zhang and Fengyi Zhu and Zhaosheng Zhu and Mengtian Wang and Zongpeng Zhu and Guangtao Xue and Jiwu Shu and Minglu Li and Jiesheng Wu},
title = {Perseus: A {Fail-Slow} Detection Framework for Cloud Storage Systems},
booktitle = {21st USENIX Conference on File and Storage Technologies (FAST 23)},
year = {2023},
isbn = {978-1-939133-32-8},
address = {Santa Clara, CA},
pages = {49--64},
url = {https://www.usenix.org/conference/fast23/presentation/lu},
publisher = {USENIX Association},
month = feb
}
