DOI: 10.1109/SC.2014.25

IndexFS: scaling file system metadata performance with stateless caching and bulk insertion

Published: 16 November 2014

    Abstract

    Modern storage systems are expected to grow beyond billions of objects, making metadata scalability critical to overall performance. Many existing distributed file systems focus only on providing highly parallel, fast access to file data and lack a scalable metadata service. In this paper, we introduce a middleware design called IndexFS that adds support to existing file systems such as PVFS, Lustre, and HDFS for scalable high-performance operations on metadata and small files. IndexFS uses a table-based architecture that incrementally partitions the namespace on a per-directory basis, preserving server and disk locality for small directories. An optimized log-structured layout is used to store metadata and small files efficiently. We also propose two client-based storm-free caching techniques: bulk namespace insertion for creation-intensive workloads such as N-N checkpointing, and stateless consistent metadata caching for hot spot mitigation. By combining these techniques, we have demonstrated that IndexFS scales to 128 metadata servers. Experiments show our out-of-core metadata throughput outperforming existing solutions such as PVFS, Lustre, and HDFS by 50% to two orders of magnitude.
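
To make the idea of incremental per-directory partitioning concrete, here is a minimal Python sketch. The abstract does not specify the splitting rule, so everything below is an illustrative assumption rather than IndexFS's actual algorithm: the hash-based split, the `SPLIT_THRESHOLD` value, the server-placement formula, and the names `Directory`, `name_hash`, `create`, and `servers` are all hypothetical. The sketch shows only the property the abstract states: a directory starts on a single server and spreads to other servers only as it grows, so small directories keep their locality.

```python
import hashlib

SPLIT_THRESHOLD = 4  # hypothetical value, kept tiny for illustration


def name_hash(name: str) -> int:
    """Stable 32-bit hash of a file name (Python's built-in hash() is salted per run)."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Directory:
    """A directory whose entries are spread over partitions only on demand."""

    def __init__(self, home_server: int, num_servers: int):
        self.home_server = home_server
        self.num_servers = num_servers
        # A partition (depth, index) owns every name whose hash has `index`
        # as its low `depth` bits. Partition (0, 0) owns the whole directory.
        self.partitions = {(0, 0): set()}

    def _owner(self, name: str):
        h = name_hash(name)
        for depth, index in self.partitions:
            if (h & ((1 << depth) - 1)) == index:
                return depth, index
        raise AssertionError("partitions must cover the whole hash space")

    def create(self, name: str) -> None:
        key = self._owner(name)
        self.partitions[key].add(name)
        depth, _ = key
        # Split only when the partition is too large and more servers remain.
        if len(self.partitions[key]) > SPLIT_THRESHOLD and (1 << depth) < self.num_servers:
            self._split(key)

    def _split(self, key) -> None:
        depth, index = key
        names = self.partitions.pop(key)
        mask = (1 << (depth + 1)) - 1
        stay = {n for n in names if (name_hash(n) & mask) == index}
        self.partitions[(depth + 1, index)] = stay
        self.partitions[(depth + 1, index | (1 << depth))] = names - stay

    def contains(self, name: str) -> bool:
        return name in self.partitions[self._owner(name)]

    def servers(self) -> set:
        """Servers holding at least one partition (assumed placement rule)."""
        return {(self.home_server + index) % self.num_servers
                for _, index in self.partitions}


small = Directory(home_server=2, num_servers=4)
for i in range(3):
    small.create(f"f{i}")
print(small.servers())  # {2}: never split, so it stays on its home server
```

In this sketch, partition index 0 always maps to the home server, so a directory that never exceeds the threshold is served from one place, while a large directory ends up on at most `num_servers` servers.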


    Published In

    SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2014
    1054 pages
    ISBN: 9781479955008
    General Chair: Trish Damkroger
    Program Chair: Jack Dongarra

    Publisher

    IEEE Press

    Author Tags

    1. bulk insertion
    2. distributed file systems
    3. file system metadata
    4. log-structured merge tree
    5. stateless caching

    Qualifiers

    • Research-article

    Conference

    SC '14
    Acceptance Rates

    SC '14 Paper Acceptance Rate 83 of 394 submissions, 21%;
    Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
